December 7, 2019

365 words 2 mins read

Machine-learned model quality monitoring in fast data and streaming applications

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications.

Talk Title: Machine-learned model quality monitoring in fast data and streaming applications
Speakers: Emre Velipasaoglu (Lightbend)
Conference: Strata Data Conference
Conf Tag: Making Data Work
Location: London, United Kingdom
Date: May 22-24, 2018
URL: Talk Page
Slides: Talk Slides
Video:

Most machine learning algorithms are designed to work with stationary data. These algorithms are usually the first ones teams building machine learning applications try, because they are readily available in popular open source libraries such as Python's scikit-learn and in distributed machine learning libraries like Spark MLlib. But real-life streaming data is rarely stationary: its statistical characteristics change over time, and with them the quality and relevance of the models that depend on it. Machine-learned models built on data observed within a fixed time period therefore usually suffer a loss of prediction quality, a phenomenon known as concept drift.

There are several ways to deal with concept drift. The most common is periodically retraining the models on new data, perhaps down-weighting or completely discarding the old data. The length of the retraining period is usually chosen based on the cost of retraining alone: changes in the input data and in prediction quality are not monitored, and the cost of inaccurate predictions is left out of the calculation. At the other end of the complexity spectrum are adaptive learning methods, but these algorithms still require careful parameter tuning to perform well. An attractive middle ground is monitoring machine-learned model quality directly, by testing the inputs and predictions for changes over time and using the detected change points to drive retraining decisions.

There has been significant development in this area over the last two decades. While most of these methods target classification models, some newer methods are appropriate for regression problems as well. Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications.
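To make the change-point idea concrete, below is a minimal sketch (not from the talk) of monitoring a stream of per-prediction errors with the Page-Hinkley test, one of the classic sequential change detectors used for this purpose. The parameter values and the synthetic error stream are illustrative assumptions; in a real application the error signal would come from comparing model predictions against delayed ground-truth labels.

```python
import random


class PageHinkley:
    """Flags an upward shift in the mean of a streaming error signal."""

    def __init__(self, delta=0.005, threshold=20.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # alarm level for the cumulative statistic
        self.n = 0                  # observations seen so far
        self.mean = 0.0             # running mean of the error signal
        self.cum = 0.0              # cumulative deviation statistic
        self.min_cum = 0.0          # minimum of the statistic so far

    def update(self, error):
        """Feed one error value (e.g., 0/1 misclassification); return True on drift."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.cum += error - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


# Usage sketch on a synthetic 0/1 error stream whose error rate
# jumps from 10% to 40% at step 1000 (simulated concept drift).
random.seed(0)
detector = PageHinkley()
for t in range(2000):
    p_error = 0.10 if t < 1000 else 0.40
    error = 1.0 if random.random() < p_error else 0.0
    if detector.update(error):
        print(f"Drift detected at step {t}; trigger retraining here.")
        break
```

The detector adds almost no overhead per prediction, which is what makes this style of monitoring attractive in fast data and streaming settings; after a drift alarm, the model would be retrained and the detector reset.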
