The quest for high-quality data
Ihab Ilyas highlights the data-quality problem and describes the HoloClean framework, a state-of-the-art prediction engine for structured data with direct applications in detecting and repairing data errors, as well as imputing missing labels and values.
Talk Title | The quest for high-quality data |
Speakers | Ihab Ilyas (University of Waterloo) |
Conference | O’Reilly Artificial Intelligence Conference |
Conf Tag | Put AI to Work |
Location | London, United Kingdom |
Date | October 15-17, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | Talk Video |
“AI starts with good data” is a statement that data scientists, analysts, and business owners widely agree with. Our ability to build complex AI models for prediction, classification, and other analytics tasks has grown significantly, and an abundance of fairly easy-to-use tools lets data scientists and analysts provision complex models within days. However, missing data and data-quality issues remain the main bottleneck holding back further adoption of AI technologies. Even with advances in building robust models, noisy and incomplete data are still the biggest hurdles to effective end-to-end solutions. Multiple studies suggest that cleaning data is a much more effective investment than making learning more robust.

Ihab Ilyas highlights this data-quality problem and describes the HoloClean framework, a state-of-the-art prediction engine for structured data with direct applications in detecting and repairing data errors, as well as imputing missing labels and values. The framework uses techniques such as data augmentation and self-supervised learning to build models that describe how data is generated and how errors and anomalies are introduced.
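To make the self-supervised idea concrete, here is a minimal sketch in Python. It is not HoloClean's API or its actual model. It is a toy stand-in that treats the observed cells of a table as training labels, learns simple co-occurrence counts between a target column and the other columns, and then flags cells whose observed value disagrees with the prediction. The function names (`fit_cooccurrence`, `predict`, `suggest_repairs`) and the sample `rows` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_cooccurrence(rows, target):
    """Count how often each target value co-occurs with each (attribute, value) pair.

    The observed target cells act as labels, so no hand-labeled data is needed.
    """
    counts = defaultdict(Counter)
    for row in rows:
        if row.get(target) is None:
            continue  # a missing cell provides no training signal
        for attr, val in row.items():
            if attr != target and val is not None:
                counts[(attr, val)][row[target]] += 1
    return counts


def predict(counts, row, target):
    """Predict the target value of a row by voting over its other attributes."""
    votes = Counter()
    for attr, val in row.items():
        if attr != target and val is not None:
            votes.update(counts.get((attr, val), {}))
    return votes.most_common(1)[0][0] if votes else None


def suggest_repairs(rows, target):
    """Return (row index, observed value, predicted value) where the two disagree."""
    counts = fit_cooccurrence(rows, target)
    suggestions = []
    for i, row in enumerate(rows):
        guess = predict(counts, row, target)
        if guess is not None and guess != row.get(target):
            suggestions.append((i, row.get(target), guess))
    return suggestions


# Hypothetical sample data: one typo and one missing value in the "city" column.
rows = [
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": None},
]

repairs = suggest_repairs(rows, "city")
print(repairs)  # [(3, 'Cicago', 'Chicago'), (6, None, 'New York')]
```

The same prediction step serves both error detection (row 3) and imputation (row 6), which mirrors the abstract's framing of repair and imputation as one prediction task. A real system would use a far richer model and would account for how errors are introduced, which this majority vote does not attempt.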