"CASE": A Mesos Scheduler for Distributed Machine Learning
Talk Title | "CASE": A Mesos Scheduler for Distributed Machine Learning |
Speakers | Steven Bairos-Novak (Software Engineer, Pinterest), Karthik Anantha Padmanabhan (Software Engineer) |
Conference | Open Source Summit North America |
Conf Tag | |
Location | Vancouver, BC, Canada |
Date | Aug 27-31, 2018 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Many machine learning frameworks support distributed training of models. Distributed training is becoming increasingly important as organizations train on larger datasets and place greater emphasis on reducing overall training time. Every ML framework comes with its own specification for how to do distributed machine learning: typically, each has its own notion of workers and a mechanism for those workers to communicate to update and share their learned parameters. The lifecycle of these workers needs to be managed differently for each ML framework and typically requires an external cluster manager to schedule the workers onto machines and manage their lifecycle. In this talk, Karthik will talk about "CASE", a Mesos batch scheduler that supports launching and managing the lifecycle of workers across multiple ML frameworks (TensorFlow, LightGBM, XGBoost, etc.).
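As context for the scheduling problem the talk addresses, the sketch below (not taken from the talk or from CASE itself) illustrates the kind of per-worker configuration a batch scheduler has to produce when launching distributed TensorFlow tasks: each worker needs a `TF_CONFIG` environment variable describing the full cluster and its own role within it. The helper name `make_tf_config` and the host addresses are illustrative assumptions.

```python
# Illustrative sketch: the per-task environment a scheduler would construct
# when launching distributed TensorFlow workers. Not CASE's actual code.
import json
import os


def make_tf_config(workers, ps, task_type, task_index):
    """Build the TF_CONFIG value that distributed TensorFlow expects.

    `workers` and `ps` are lists of "host:port" strings, `task_type` is
    "worker" or "ps", and `task_index` identifies this task within its group.
    """
    return json.dumps({
        "cluster": {"worker": workers, "ps": ps},
        "task": {"type": task_type, "index": task_index},
    })


# Example: the scheduler would set this in each launched task's environment
# so the framework knows who its peers are and which role it plays.
os.environ["TF_CONFIG"] = make_tf_config(
    workers=["10.0.0.1:2222", "10.0.0.2:2222"],  # hypothetical worker hosts
    ps=["10.0.0.3:2222"],                        # hypothetical parameter server
    task_type="worker",
    task_index=0,
)
```

Other frameworks such as LightGBM and XGBoost have their own, different conventions for wiring workers together, which is why a scheduler that spans multiple ML frameworks has to manage each framework's worker lifecycle on its own terms.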