Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud
Talk Title | Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud |
Speakers | Yang Che (Senior Engineer, Alibaba), Kai Zhang (Staff Engineer, Alibaba) |
Conference | KubeCon + CloudNativeCon North America |
Location | Seattle, WA, USA |
Date | Dec 9-14, 2018 |
URL | Talk Page |
Slides | Talk Slides |
Running deep learning (DL) jobs requires an end-to-end workflow that speeds up iterative model training. The workflow must scale across massive data and computational resources, and it must be framework-agnostic to relieve the pain of managing diverse dependencies. On Alibaba Cloud, we use Kubernetes to build an elastic DL platform for continuous model training and optimization. The platform manages heterogeneous clusters that include CPUs, GPUs, and FPGAs, and jobs are automatically scheduled onto the best-fit resources. Kubeflow, a machine learning scaffold on Kubernetes, is used to set up the training pipeline. Project Arena was created to manage and instrument jobs with a friendly user experience. In this talk, we will discuss how the platform is designed and how it lets users focus on DL tasks instead of managing the underlying complexity. A demo shows how to run distributed neural network training in a minute.