February 13, 2020

306 words 2 mins read

Hands-on machine learning with Kafka-based streaming pipelines


Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more.

Talk Title Hands-on machine learning with Kafka-based streaming pipelines
Speakers Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 24-26, 2019
URL Talk Page
Slides Talk Slides

One approach to training and serving (scoring) models is to treat the trained model as code and then run that code for scoring. This works fine if the model never changes for the lifetime of the scoring process, but it isn't ideal for long-running data streams, where you'd like to retrain the model periodically (due to concept drift) and score with the new model. A better way is to treat the model as data and exchange this model data between the training and scoring systems, which allows models to be updated in the running context.

Boris Lublinsky and Dean Wampler explore different approaches to model training and serving that use this technique, where one or both functions are made an integrated part of the data-processing pipeline implementation (i.e., an additional functional transformation of the data). The advantage of this approach is that model serving is implemented as part of the larger data-transformation pipeline. Such pipelines can be implemented using streaming engines (Spark Streaming, Flink, or Beam) or streaming libraries (Akka Streams or Kafka Streams). Boris and Dean use Akka Streams, Flink, and Spark Structured Streaming in their demos.

Outline:
- Speculative execution of model serving
- Guaranteed execution time
- Consensus-based model serving
- Quality-based model serving
- Model training
- Performance optimizations
- Real-world production concerns
- Data governance metadata
- Management and monitoring
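The "model as data" idea above can be sketched as a stateful scoring stage that consumes a single merged stream of data records and model updates (e.g., from two Kafka topics), hot-swapping the model when a retrained version arrives. This is a minimal framework-agnostic illustration; the `Model`, `ModelUpdate`, and `ScoringStage` names are hypothetical, not from the talk's Akka Streams, Flink, or Spark code.

```python
# Minimal sketch of the "model as data" pattern: the scoring stage keeps the
# current model as mutable state and replaces it when a new model arrives on
# the stream, so scoring continues without restarting the process.

from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple, Union


@dataclass
class Model:
    """A trained model exchanged as data (e.g., serialized over a Kafka topic)."""
    version: int
    weight: float  # toy "model": score = weight * feature

    def score(self, feature: float) -> float:
        return self.weight * feature


@dataclass
class ModelUpdate:
    """Control event carrying a newly retrained model."""
    model: Model


@dataclass
class Record:
    """Data event to be scored."""
    feature: float


class ScoringStage:
    """Stateful stream stage: scores records and hot-swaps the model in place."""

    def __init__(self, initial: Model):
        self.model = initial

    def process(
        self, events: Iterable[Union[Record, ModelUpdate]]
    ) -> Iterator[Tuple[int, float]]:
        for event in events:
            if isinstance(event, ModelUpdate):
                # A retrained model arrived on the stream; update running state.
                self.model = event.model
            else:
                yield (self.model.version, self.model.score(event.feature))


# Usage: interleave data records with a model update, as a merged stream would.
stage = ScoringStage(Model(version=1, weight=2.0))
events = [Record(3.0), ModelUpdate(Model(version=2, weight=10.0)), Record(3.0)]
results = list(stage.process(events))
print(results)  # [(1, 6.0), (2, 30.0)]
```

In a real pipeline the two event types would typically arrive on separate topics and be merged upstream; the key design point from the abstract is that the model travels as data through the same pipeline that scores the records.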
