February 13, 2020

Hands-on machine learning with Kafka-based streaming pipelines

Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more.


Talk Title	Hands-on machine learning with Kafka-based streaming pipelines
Speakers	Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 24-26, 2019

One way to train and serve (score) models is to treat the trained model as code and run that code for scoring. This works fine if the model never changes for the lifetime of the scoring process, but it isn’t ideal for long-running data streams, where you’d like to retrain the model periodically (due to concept drift) and score with the new model. A better way is to treat the model as data and exchange this model data between the training and scoring systems, which allows models to be updated in the running context.

Boris Lublinsky and Dean Wampler explore different approaches to model training and serving that use this technique, where one or both functions are an integrated part of the data-processing pipeline implementation (i.e., an additional functional transformation of the data). The advantage of this approach is that model serving is implemented as part of the larger data-transformation pipeline. Such pipelines can be implemented using streaming engines (Spark Streaming, Flink, or Beam) or streaming libraries (Akka Streams or Kafka Streams). Boris and Dean will use Akka Streams, Flink, and Spark Structured Streaming in their demos.

Outline:

Speculative execution of model serving
Guaranteed execution time
Consensus-based model serving
Quality-based model serving
Model training
Performance optimizations
Real-world production concerns
Data governance metadata
Management and monitoring
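The "model as data" idea in the abstract can be illustrated with a minimal Python sketch. It is not the speakers' implementation: the message shapes, the JSON model encoding, the toy `LinearModel`, and the `serve` function are all hypothetical names chosen for illustration. In a real pipeline the two message kinds would typically arrive on separate streams (for example, separate Kafka topics) and be merged by the streaming engine or library.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical message shape: ("model", serialized_bytes) or ("data", features).
Message = Tuple[str, object]


@dataclass
class LinearModel:
    """Toy stand-in for a trained model: a version label plus linear weights."""
    version: str
    weights: List[float]
    bias: float

    def score(self, features: List[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


def serialize_model(model: LinearModel) -> bytes:
    """What a training system might publish (JSON is an assumption here)."""
    spec = {"version": model.version, "weights": model.weights, "bias": model.bias}
    return json.dumps(spec).encode("utf-8")


def deserialize_model(payload: bytes) -> LinearModel:
    """Rebuild a model from the serialized model data."""
    spec = json.loads(payload.decode("utf-8"))
    return LinearModel(spec["version"], spec["weights"], spec["bias"])


def serve(messages: Iterable[Message]) -> Iterator[Tuple[str, float]]:
    """Score data records with whichever model arrived most recently."""
    current: Optional[LinearModel] = None
    for kind, body in messages:
        if kind == "model":
            # Swap the model in the running context; no restart needed.
            current = deserialize_model(body)
        elif kind == "data" and current is not None:
            yield current.version, current.score(body)
        # In this sketch, data that arrives before any model is skipped.


# Usage: one merged stream carrying both data records and model updates.
stream = [
    ("data", [1.0, 2.0]),  # skipped: no model has arrived yet
    ("model", serialize_model(LinearModel("v1", [1.0, 1.0], 0.0))),
    ("data", [1.0, 2.0]),
    ("model", serialize_model(LinearModel("v2", [2.0, 0.5], 1.0))),
    ("data", [1.0, 2.0]),
]
results = list(serve(stream))
print(results)
```

The same input record is scored differently before and after the second model update, which is the behavior the "model as code" approach cannot provide without restarting the scoring process.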

Scalable anomaly detection with Spark and SOS

February 10, 2020

Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.

Analytics Zoo: Distributed TensorFlow in production on Apache Spark

December 27, 2019

Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications within big data ecosystems.

Online machine learning in streaming applications

February 11, 2020

Stavros Kontopoulos and Debasish Ghosh explore online machine learning algorithm choices for streaming applications, especially those with resource-constrained use cases like IoT and personalization. They dive into Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms from implementation to production deployment, describing the pros and cons of each of them.

Open source streaming analytics with the Kafka, Flink, Cassandra (KFC) stack

January 27, 2020

Streaming analytics is a popular subject in enterprise organizations because customers want real-time experiences, such as notifications and advice based on online behavior and other users' actions. Bas Geerdink details an open source reference solution for streaming analytics that covers many use cases that follow a "pipes and filters" pattern, built with Scala, Flink, Kafka, and Cassandra.

Turn devices into data scientists at the edge

December 28, 2019

Today's approach to processing streaming data is based on legacy big-data-centric architectures, the cloud, and the assumption that organizations have access to data scientists to make sense of it all, leaving organizations increasingly overwhelmed. Simon Crosby shares a new architecture for edge intelligence that turns this thinking on its head.

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

December 21, 2019

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.