Scalable anomaly detection with Spark and SOS
Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.
Talk Title | Scalable anomaly detection with Spark and SOS |
Speakers | Jeroen Janssens (Data Science Workshops) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 24-26, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data, which he originally developed in MATLAB. SOS employs the concept of affinity to compute an outlier probability for each data point. Compared with other outlier detection algorithms, it offers superior performance while being more robust to data perturbations and parameter settings. SOS was ported to both Python and R to allow wider adoption by the data science community. It has also been implemented on several distributed, large-scale data processing technologies, including Spark MLlib and Apache Flink. Most recently, the MLlib implementation was ported to Spark ML pipelines, which have superseded the RDD-based MLlib API and provide a uniform set of high-level APIs built on top of DataFrames. Jeroen illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of ML pipelines, explains the process of porting it from MLlib, and applies SOS to a real-world use case. By the end, you’ll have a good understanding of the algorithm and how to integrate anomaly detection into your own (streaming) machine learning pipeline.
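To make the affinity idea concrete, here is a minimal single-machine sketch in plain Python. It is not the implementation presented in the talk. It assumes one fixed bandwidth `sigma` for every point, whereas SOS as usually described tunes a separate variance per point from a perplexity parameter. The function name `outlier_probabilities` and the toy data are hypothetical. Each point distributes affinity over the other points, the affinities are normalised into binding probabilities, and a point's outlier probability is the probability that no other point binds to it.

```python
import math


def outlier_probabilities(points, sigma=1.0):
    """Simplified SOS-style outlier probabilities.

    Assumption: a single fixed bandwidth `sigma` is used for all points
    instead of the per-point, perplexity-based variance.
    """
    n = len(points)

    # Affinity a[i][j]: how strongly point i is drawn to point j.
    # A point has no affinity with itself, so the diagonal stays 0.
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                affinity[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))

    # Binding probability b[i][j]: affinity normalised per row, so each
    # row describes how point i spreads its "votes" over the other points.
    binding = []
    for row in affinity:
        total = sum(row)
        binding.append([a / total if total > 0 else 0.0 for a in row])

    # Outlier probability of point j: the probability that none of the
    # other points binds to it, i.e. the product over i != j of (1 - b[i][j]).
    probabilities = []
    for j in range(n):
        p = 1.0
        for i in range(n):
            if i != j:
                p *= 1.0 - binding[i][j]
        probabilities.append(p)
    return probabilities


# Toy data (hypothetical): four points in a tight square and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
probs = outlier_probabilities(points)
for point, prob in zip(points, probs):
    print(point, round(prob, 3))
```

In this toy data set the distant point receives almost no binding probability from the clustered points, so its outlier probability is the highest of the five. The quadratic pairwise loop is also the part that a distributed implementation has to partition across workers.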