November 28, 2019

Continuous machine learning over streaming data

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.

Talk Title Continuous machine learning over streaming data
Speakers Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Conference Strata Data Conference
Conf Tag Big Data Expo
Location San Jose, California
Date March 6-8, 2018
URL Talk Page
Slides Talk Slides

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. Roger, Nina, Sudipto, and Ryan begin by discussing unsupervised machine learning algorithms that they have extended to operate on streams of data, which requires the machine learning model to continuously “evolve” as data streams through the system. The first example is the Robust Random Cut Forrest (RRCF) for anomaly detection that continuously learns each time it sees a new data record and emits a high anomaly score when it detects an outlier. The algorithm learns what “normal” looks like and evolves this model as new data streams in. They also discuss using stream clustering to reveal the internal structure of a data stream, which is capable of performing fast incremental clustering of records and constantly adapts to changes in the underlying stream of data, and share a new method to identify anomalies in directed graphs streaming in at high rates. Practical applications include the ability to detect anomalies in flow logs, such as denial of service attacks, port scans, and inter-VPC attacks. They conclude with techniques that are common to all of these machine learning algorithms. Along they way, they also explore functions powered by machine learning that give customers insights into their data. Explainable machine learning has been a common customer request. 
Roger, Nina, Sudipto, and Ryan describe an enhanced anomaly detection function that returns an anomaly score for every data record and identifies exactly which fields in the record contributed to the score, each field's contribution factor (1–100), and how each value changed. They explain how a customer can flag false alarms or specify when they want to be alerted. The function then uses this feedback as training data and learns to eliminate false positives or to classify anomalies automatically. Roger, Nina, Sudipto, and Ryan conclude with a discussion of how these algorithms are implemented and provided to customers in Kinesis Analytics, actual customer applications and success stories, and a live demo.
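
As a rough illustration of what per-field attribution could look like, the Python sketch below splits a record's total deviation from a baseline into per-field shares that sum to 100 and reports the direction in which each value moved. The abstract does not say how the contribution factor is actually computed, so the formula here (each field's share of the summed absolute z-scores) is an assumption, as are the function name, the field names, and the baseline values.

```python
def attribute_anomaly(record, baseline_mean, baseline_std):
    """Return, per field, an assumed contribution share (0-100) and a direction.

    Assumption: a field's contribution is its share of the summed absolute
    z-scores. This is an illustrative formula, not the one used in the talk.
    """
    deviations = {}
    directions = {}
    for field, value in record.items():
        diff = value - baseline_mean[field]
        std = baseline_std[field]
        deviations[field] = abs(diff) / std if std > 0 else 0.0
        if diff > 0:
            directions[field] = "higher"
        elif diff < 0:
            directions[field] = "lower"
        else:
            directions[field] = "unchanged"

    total = sum(deviations.values())
    return {
        field: {
            "contribution": 100.0 * deviations[field] / total if total > 0 else 0.0,
            "direction": directions[field],
        }
        for field in record
    }


# Hypothetical flow-log-like record compared with a hypothetical baseline.
baseline_mean = {"bytes": 500.0, "packets": 10.0, "duration": 2.0}
baseline_std = {"bytes": 50.0, "packets": 2.0, "duration": 0.5}
record = {"bytes": 900.0, "packets": 11.0, "duration": 1.5}

attribution = attribute_anomaly(record, baseline_mean, baseline_std)
top_field = max(attribution, key=lambda f: attribution[f]["contribution"])

for field, info in attribution.items():
    print(f"{field:9s} contribution={info['contribution']:5.1f}  {info['direction']}")
```

Here `bytes` dominates the attribution, which is the kind of signal a user could use to decide whether an alert is a false alarm.
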
