Best practices for productionizing Apache Spark MLlib models
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.
| Talk Title | Best practices for productionizing Apache Spark MLlib models |
| Speakers | Joseph Bradley (Databricks) |
| Conference | Strata Data Conference |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 6-8, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Apache Spark has become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes how to deploy those models in a production environment: how do you embed what you’ve learned into customer-facing data applications?

When companies begin to employ machine learning in actual production workflows, they encounter new sources of friction. Sharing models across teams can be challenging, especially when sharing means migrating to new deployment environments. Ensuring that identical models are deployed in different systems, especially while maintaining complex featurization logic, can cause subtle bugs and changes in behavior.

Joseph walks through the main deployment scenarios (batch scoring, Structured Streaming, and real-time low-latency serving) and concludes with a demo that illustrates key parts of these workflows. You’ll leave with a high-level view of the deployment modes as well as tips and resources for getting started with each.