Machine learning for nonstationary streaming data using Structured Streaming and StreamDM

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drift).


Talk Title	Machine learning for nonstationary streaming data using Structured Streaming and StreamDM
Speakers	Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018

Adapting StreamDM to the novel Structured Streaming engine simplifies both its use and development. Currently, the open source StreamDM library provides the largest collection of data stream mining algorithms for Spark, including both supervised and unsupervised learning algorithms that can be updated online. The main difference between the batch machine learning implementations in Spark (MLlib and Spark ML) and StreamDM is that the latter focuses on algorithms that can be trained and adapted incrementally. This can be a huge advantage in some domains, as it enables learning models to be updated automatically. StreamDM is currently under development by Huawei Noah’s Ark Lab and Télécom ParisTech.

There is a vast literature on addressing concept drift and learning from streaming data. Still, these methods can be complex to implement and integrate. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drift). Adapting StreamDM for Structured Streaming is a natural step that facilitates future integration with major technology improvements, such as continuous processing.

Heitor and Albert also introduce a simple yet powerful methodology to address concept drift using active strategies, such as combining ensemble models with drift detectors, and reactive strategies, such as forgetting mechanisms and periodic resets (windowed approaches).
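To make the idea of pairing an incrementally trained model with a drift detector more concrete, the following is a minimal sketch in plain Python. It does not use StreamDM or Spark, and it is not the speakers' methodology: the class names, the toy majority-class learner, the synthetic stream, and the parameter values are all assumptions made for illustration. The detector follows the general idea of error-rate monitoring in the style of DDM (Drift Detection Method), simplified here to a single drift threshold, and the model is reset whenever a drift is signalled. Evaluation is prequential (each instance is first used for testing, then for training).

```python
import math


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, default=0):
        if not self.counts:
            return default
        # Ties are broken toward the smallest label.
        return max(sorted(self.counts), key=self.counts.get)

    def train(self, label):
        self.counts[label] = self.counts.get(label, 0) + 1


class ErrorRateDriftDetector:
    """Simplified DDM-style detector: signals drift when the running error
    rate rises well above the lowest level observed so far."""

    def __init__(self, min_instances=30, drift_level=3.0):
        self.min_instances = min_instances  # assumed warm-up length
        self.drift_level = drift_level      # assumed threshold multiplier
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0             # running error rate
        self.p_min = math.inf    # error rate at the best point so far
        self.s_min = math.inf    # its standard deviation

    def update(self, error):
        self.n += 1
        self.p += (float(error) - self.p) / self.n
        s = math.sqrt(max(0.0, self.p * (1.0 - self.p)) / self.n)
        if self.n < self.min_instances:
            return False
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        return self.p + s > self.p_min + self.drift_level * self.s_min


def prequential(stream, use_detector):
    """Test-then-train loop; resets the model when the detector fires."""
    model = MajorityClassLearner()
    detector = ErrorRateDriftDetector()
    correct, drifts = 0, []
    for i, label in enumerate(stream):
        error = model.predict() != label
        correct += not error
        if use_detector and detector.update(error):
            drifts.append(i)
            model = MajorityClassLearner()  # forget the old concept
            detector.reset()
        model.train(label)
    return correct / len(stream), drifts


def make_stream(n=1000, drift_at=500, noise_every=10):
    """Synthetic label stream: the majority label flips at `drift_at`;
    every `noise_every`-th label is inverted to act as noise."""
    stream = []
    for i in range(n):
        label = 0 if i < drift_at else 1
        if i % noise_every == noise_every - 1:
            label = 1 - label
        stream.append(label)
    return stream


stream = make_stream()
static_acc, _ = prequential(stream, use_detector=False)
adaptive_acc, drifts = prequential(stream, use_detector=True)
print(f"without detector: {static_acc:.3f}")
print(f"with detector:    {adaptive_acc:.3f}, drifts at {drifts}")
```

On this synthetic stream, the model that is never reset keeps predicting the old majority label for almost the entire second half, while the detector-driven variant discards its outdated statistics shortly after the change. A windowed approach would instead bound how much history the learner keeps, trading detection logic for a fixed forgetting horizon.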

StreamDM: Advanced data science with Spark Streaming

Machine-learned model quality monitoring in fast data and streaming applications

Building deep reinforcement learning applications on BigDL and Spark

Continuous machine learning over streaming data

Distributed systems for stream processing: Apache Kafka and Spark Streaming