Best practices for productionizing Apache Spark MLlib models
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.
| Talk Title | Best practices for productionizing Apache Spark MLlib models |
| Speakers | Joseph Bradley (Databricks) |
| Conference | Strata Data Conference |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 6-8, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Apache Spark has become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes how to deploy those models in a production environment: how do you embed what you’ve learned into customer-facing data applications?

When companies begin to employ machine learning in actual production workflows, they encounter new sources of friction. Sharing models across teams can be challenging, especially when sharing means migrating to new deployment environments. Ensuring that identical models are deployed in different systems, especially while maintaining complex featurization logic, can cause subtle bugs and changes in behavior.

Joseph walks through the main deployment scenarios (batch scoring, Structured Streaming, and real-time low-latency serving) and concludes with a demo that illustrates key parts of these workflows. You’ll leave with a high-level view of the deployment modes as well as tips and resources for getting started with each.