Continuous machine learning over streaming data: The story continues.
Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.
Talk Title | Continuous machine learning over streaming data: The story continues. |
Speakers | Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 11-13, 2018 |
URL | Talk Page |
Slides | Talk Slides |
This talk extends the speakers' session at Strata San Jose 2018, where they first presented the RRCF algorithm, which maintains an efficient sketch of a data stream and continuously adapts (learns) each time it sees a new data record. Here, Roger, Sudipto, and Kapil discuss new applications and results, including implementation details. After briefly introducing the RRCF algorithm, they show how it can impute missing values in a data stream. They then detail how it forecasts future values when the stream is a time series, and describe how it can detect emerging hotspots in a data stream and perform multiclass classification over streaming data. For each application, they present an actual customer use case along with the results of experiments that compare RRCF with best-in-class methods. They conclude with a deep dive into the efficient implementation of the RRCF algorithm that enables it to operate and continuously learn in real time over massive data streams.