Running large-scale machine learning experiments in the cloud
Machine learning involves a lot of experimentation. Data scientists spend days, weeks, or months performing algorithm searches, model architecture searches, hyperparameter searches, etc. Shashank Prasanna breaks down how you can easily run large-scale machine learning experiments using containers, Kubernetes, Amazon ECS, and SageMaker.
| Talk Title | Running large-scale machine learning experiments in the cloud |
| --- | --- |
| Speakers | Shashank Prasanna (Amazon Web Services) |
| Conference | O’Reilly Artificial Intelligence Conference |
| Conf Tag | Put AI to Work |
| Location | San Jose, California |
| Date | September 10-12, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Machine learning involves a lot of experimentation; there’s no question about it. Data scientists and researchers spend days, weeks, or months performing steps such as algorithm searches, model architecture searches, and hyperparameter searches, as well as exploring different validation schemes, model averaging, and other techniques. This can be time-consuming, but it’s necessary to arrive at the best-performing model for the problem at hand. Even though virtually infinite compute and storage capacity is now accessible to anyone in the cloud, most machine learning workflows still involve interactively running experiments on a single GPU instance because of the complexity of setting up, managing, and scheduling experiments at scale.

With container-based technologies such as Kubernetes, Amazon ECS, and Amazon SageMaker, data scientists and researchers can focus on designing and running experiments and let these services manage infrastructure setup, scheduling, and orchestration of the machine learning experiments. Shashank Prasanna breaks down how you can easily run large-scale machine learning experiments on CPU and GPU clusters with these services and how they compare. You’ll also learn how to manage trade-offs between time to solution and cost by scaling up or scaling back resources.
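The talk itself doesn’t include code, but the hyperparameter-search step it describes can be illustrated with a minimal sketch: expand a search space into independent trial configurations, each of which could then be submitted as its own container job to Kubernetes, Amazon ECS, or SageMaker. The parameter names and values below are hypothetical examples, not from the talk.

```python
from itertools import product

# Hypothetical search space for an experiment sweep (illustrative values only).
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [64, 128],
    "optimizer": ["sgd", "adam"],
}

def expand_grid(space):
    """Expand a dict of lists into one dict per hyperparameter combination."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

trials = expand_grid(search_space)
print(len(trials))  # 3 * 2 * 2 = 12 independent training jobs
for trial in trials[:2]:
    # In a real setup, each trial dict would become the environment/arguments
    # of one containerized training job scheduled by the orchestration service.
    print(trial)
```

Because every trial is independent, the scheduler can run as many in parallel as the cluster allows, which is exactly the scale-up/scale-back trade-off between time to solution and cost that the talk discusses.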