February 10, 2020

Scalable anomaly detection with Spark and SOS

Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.

Talk Title Scalable anomaly detection with Spark and SOS
Speakers Jeroen Janssens (Data Science Workshops)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 24-26, 2019
URL Talk Page
Slides Talk Slides
Video

Jeroen Janssens dives into SOS, an unsupervised algorithm for detecting anomalies in large, high-dimensional data, which he originally developed in MATLAB. SOS uses the concept of affinity to compute an outlier probability for each data point. It is described as offering superior performance while being more robust to data perturbations and parameter settings. SOS was ported to both Python and R to allow wider adoption by the data science community.

SOS has also been implemented on a variety of distributed, large-scale data processing technologies, including Spark MLlib and Apache Flink. Most recently, the MLlib implementation was ported to Spark ML pipelines, because ML pipelines have superseded MLlib and provide a uniform set of high-level APIs built on top of DataFrames. Jeroen illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of ML pipelines, explains the process of porting it from MLlib, and applies SOS to a real-world use case. By the end, you'll have a good understanding of the algorithm and how to integrate anomaly detection into your own (streaming) machine learning pipeline.
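To make the affinity idea more concrete, here is a minimal pure-Python sketch of how affinities can be turned into outlier probabilities. It is not the speaker's implementation. It is a simplification under stated assumptions: the full algorithm tunes a separate variance per point from a perplexity parameter, whereas this sketch uses a single fixed `bandwidth`. The function name `outlier_probabilities` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def outlier_probabilities(points, bandwidth=1.0):
    """Simplified SOS-style outlier probabilities.

    Assumption: one fixed bandwidth for all points, instead of the
    per-point variance derived from a perplexity setting.
    """
    n = len(points)
    binding = []  # binding[i][j]: probability that point i picks j as a neighbour
    for i in range(n):
        # Affinity decays with squared Euclidean distance;
        # a point has no affinity with itself.
        row = [
            0.0 if i == j
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * bandwidth ** 2))
            for j in range(n)
        ]
        total = sum(row)
        if total > 0:
            row = [a / total for a in row]  # normalise affinities to probabilities
        # If total is 0 (numerical underflow), the row stays all zeros,
        # meaning point i picks no neighbour at all.
        binding.append(row)

    # A point is an outlier to the extent that no other point picks it.
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
    for point, prob in zip(data, outlier_probabilities(data)):
        print(point, round(prob, 4))
```

In this toy data set, the isolated point `(5, 5)` should receive an outlier probability close to 1, because the four clustered points assign almost all of their binding probability to each other. A distributed implementation would need to partition this pairwise computation, which the sketch does not attempt.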
