Continuous machine learning over streaming data

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.


Talk Title	Continuous machine learning over streaming data
Speakers	Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Conference	Strata Data Conference
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 6-8, 2018
URL	Talk Page
Slides	Talk Slides

Roger, Nina, Sudipto, and Ryan begin by discussing unsupervised machine learning algorithms that they have extended to operate on streams of data, which requires the machine learning model to continuously “evolve” as data streams through the system. The first example is Robust Random Cut Forest (RRCF) for anomaly detection, which learns continuously from each new data record and emits a high anomaly score when it detects an outlier. The algorithm learns what “normal” looks like and evolves this model as new data streams in.

They also discuss stream clustering, which reveals the internal structure of a data stream by clustering records incrementally and adapting constantly to changes in the underlying data, and share a new method for identifying anomalies in directed graphs that stream in at high rates. Practical applications include detecting anomalies in flow logs, such as denial-of-service attacks, port scans, and inter-VPC attacks. They conclude with techniques that are common to all of these machine learning algorithms.

Along the way, they also explore functions powered by machine learning that give customers insights into their data. Explainable machine learning has been a common customer request.
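To make the idea of a model that scores each record and then "evolves" more concrete, here is a minimal Python sketch. It is not RRCF and not the speakers' implementation: it swaps in a much simpler stand-in, an exponentially weighted running mean and variance over a single numeric field. The class name `StreamingAnomalyScorer` and the parameters `alpha` and `warmup` are hypothetical choices made for illustration.

```python
import math


class StreamingAnomalyScorer:
    """Stand-in for a continuously learning anomaly detector (NOT RRCF).

    Keeps an exponentially weighted mean and variance of one numeric field,
    scores each record against the model learned so far, then updates the model.
    """

    def __init__(self, alpha=0.05, warmup=10):
        self.alpha = alpha      # assumed forgetting rate: higher adapts faster
        self.warmup = warmup    # assumed number of records seen before scoring
        self.count = 0
        self.mean = 0.0
        self.var = 0.0

    def score_and_update(self, x):
        # Score first, so a record is judged against "normal" as learned
        # from earlier records only.
        if self.count < self.warmup or self.var <= 0.0:
            score = 0.0
        else:
            score = abs(x - self.mean) / math.sqrt(self.var)

        # Then let the model evolve with the new record.
        if self.count == 0:
            self.mean = x
        else:
            delta = x - self.mean
            self.mean += self.alpha * delta
            self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.count += 1
        return score


scorer = StreamingAnomalyScorer()
stream = [10.0, 10.5, 9.5, 10.2, 9.8] * 6 + [50.0]  # steady values, then a spike
scores = [scorer.score_and_update(x) for x in stream]
print("spike scored highest:", scores[-1] > max(scores[:-1]))
```

Because old records are gradually forgotten, a sustained shift in the stream eventually becomes the new "normal" in this sketch, which is the behavior the abstract attributes to the streaming algorithms in general.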
Roger, Nina, Sudipto, and Ryan describe an enhanced anomaly detection function that returns an anomaly score for every data record and identifies exactly which fields in the record contributed to that score, each field's contribution factor (1–100), and how each value changed. They explain how a customer can flag false alarms or specify when they want to be alerted; the function then takes this feedback as training data and learns to eliminate false positives or to classify anomalies automatically. They conclude with a discussion of how these algorithms are implemented and provided to customers in Kinesis Analytics, actual customer applications and success stories, and a live demo.
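The attribution and feedback ideas described above can be sketched in a few lines of Python. This is an illustrative assumption, not the Kinesis Analytics function or its API: per-field deviations from a fixed baseline stand in for the real attribution method, and a remembered "dismissed" score per field stands in for learning from user feedback. The names `FieldStats`, `explain`, and `FeedbackFilter`, the field names, and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FieldStats:
    mean: float
    std: float  # assumed to be > 0


def explain(record, baseline):
    """Return (score, attributions) for one record.

    attributions maps field name -> (contribution percent, direction).
    """
    deviations = {
        name: (value - baseline[name].mean) / baseline[name].std
        for name, value in record.items()
    }
    score = sum(abs(z) for z in deviations.values())
    attributions = {}
    for name, z in deviations.items():
        percent = 100.0 * abs(z) / score if score > 0 else 0.0
        direction = "HIGH" if z > 0 else "LOW" if z < 0 else "FLAT"
        attributions[name] = (percent, direction)
    return score, attributions


class FeedbackFilter:
    """Hypothetical feedback loop: suppress alerts the user already dismissed."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # assumed minimum score worth alerting on
        self.dismissed = {}         # top field -> highest score marked as false alarm

    @staticmethod
    def _top_field(attributions):
        return max(attributions, key=lambda name: attributions[name][0])

    def mark_false_alarm(self, score, attributions):
        top = self._top_field(attributions)
        self.dismissed[top] = max(score, self.dismissed.get(top, 0.0))

    def should_alert(self, score, attributions):
        if score < self.threshold:
            return False
        top = self._top_field(attributions)
        return score > self.dismissed.get(top, 0.0)


baseline = {"bytes": FieldStats(1000.0, 100.0), "packets": FieldStats(10.0, 2.0)}
score, attributions = explain({"bytes": 1900.0, "packets": 11.0}, baseline)
alerts = FeedbackFilter()
print(alerts.should_alert(score, attributions))  # alert before any feedback
alerts.mark_false_alarm(score, attributions)     # user says this was a false alarm
print(alerts.should_alert(score, attributions))  # same pattern is now suppressed
```

A real system would learn from labeled feedback rather than store a single cutoff per field; the sketch only shows where attribution and feedback plug into the alerting decision.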

Kafka streaming applications with Akka Streams and Kafka Streams

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

Data-driven fuel management at Ryanair

Machine-learned model quality monitoring in fast data and streaming applications

Modern real-time streaming architectures

Bladder cancer diagnosis using deep learning