February 10, 2020

Scalable anomaly detection with Spark and SOS

Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.


Talk Title	Scalable anomaly detection with Spark and SOS
Speakers	Jeroen Janssens (Data Science Workshops)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 24-26, 2019
URL	Talk Page
Slides	Talk Slides

Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data, which he originally developed in MATLAB. SOS uses the concept of affinity to compute an outlier probability for each data point. It offers superior performance while being more robust to data perturbations and parameter settings. SOS was ported to both Python and R to allow wider adoption by the data science community. It has also been implemented on several distributed, large-scale data processing technologies, including Spark MLlib and Apache Flink. Most recently, the MLlib implementation was ported to Spark ML pipelines, because Spark ML has superseded MLlib and provides a uniform set of high-level APIs built on top of DataFrames. Jeroen illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of ML pipelines, explains the process of porting it from MLlib, and applies SOS to a real-world use case. By the end, you'll have a good understanding of the algorithm and how to integrate anomaly detection into your own (streaming) machine learning pipeline.
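To make the affinity idea concrete, here is a minimal sketch in plain Python. It is not the implementation from the talk and not full SOS: it assumes a single fixed `variance` for every point, whereas SOS tunes a per-point variance from a perplexity parameter. The function name `outlier_probabilities`, the `variance` parameter, and the sample points are all made up for illustration.

```python
import math


def outlier_probabilities(points, variance=1.0):
    """Simplified affinity-based outlier probabilities (illustrative only).

    Assumption: one fixed variance for all points. Full SOS instead derives
    a per-point variance from a perplexity setting.
    """
    n = len(points)

    # Affinity of point i for point j decays with squared Euclidean distance.
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # a point has no affinity with itself
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                affinity[i][j] = math.exp(-d2 / (2.0 * variance))

    # Binding probabilities: normalise each row so that it sums to 1.
    # (Assumes no row underflows to all zeros, which holds for the data below.)
    binding = []
    for row in affinity:
        total = sum(row)
        binding.append([a / total for a in row])

    # Outlier probability of point j: the probability that no other point
    # binds to it, i.e. the product over i != j of (1 - binding[i][j]).
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]


# Hypothetical data: four points in a tight cluster and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
probs = outlier_probabilities(points)
for point, p in zip(points, probs):
    print(point, round(p, 3))
```

In this toy data set, the distant point receives almost no binding probability from the clustered points, so its outlier probability is the highest of the five. The nested loops are quadratic in the number of points, which hints at why a distributed implementation is attractive for large data.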

Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale

January 6, 2020

Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content.

One-click deployment for containerized ML and DL environments

December 29, 2019

Nanda Vijaydev explains how to spin up instant ML/DL environments using containers, all while ensuring enterprise-grade security and performance. Find out how to provide your data science teams with on-demand access to the tools and data they need, whether on-premises or in the cloud.

Analytics Zoo: Distributed TensorFlow in production on Apache Spark

December 27, 2019

Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications within the big data ecosystem.

Introducing Kubeflow (with special guests TensorFlow and Apache Spark)

February 4, 2020

Modeling is easy; productizing models, less so. Distributed training? Forget about it. Say hello to Kubeflow with Holden Karau, a system that makes it easy for data scientists to containerize their models to train and serve on Kubernetes.

Building machine learning inference pipelines at scale

January 31, 2020

Real-life ML workloads require more than training and predicting: data often needs to be preprocessed and postprocessed. Developers and data scientists have to train and deploy a sequence of algorithms that collaborate in delivering predictions from raw data. Julien Simon outlines how to build machine learning inference pipelines using open source libraries and how to scale them on AWS.

Deep learning with TensorFlow and Spark using GPUs and Docker containers

January 12, 2020

Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment.