January 8, 2020


Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud



Talk Title: Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud
Speakers: Yang Che (Senior Engineer, Alibaba), Kai Zhang (Staff Engineer, Alibaba)
Conference: KubeCon + CloudNativeCon North America
Location: Seattle, WA, USA
Date: Dec 9-14, 2018

Running deep learning (DL) jobs requires an end-to-end workflow that accelerates iterative model training. The workflow must scale across massive data and computational resources, and it must be framework-agnostic to relieve the pain of managing diverse dependencies. On Alibaba Cloud, we use Kubernetes to build an elastic DL platform for continuous model training and optimization. The platform manages heterogeneous clusters that include CPU, GPU, and FPGA resources, and jobs are automatically scheduled to the best-fit resources. Kubeflow, a great machine learning scaffold on Kubernetes, is used to set up the training pipeline. We created Project Arena to manage and instrument jobs with a friendly user experience. In this talk, we will discuss how the platform is designed and how it lets users focus on DL tasks instead of managing the underlying complexity. A demo shows how to run distributed neural network training in a minute.
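To make the idea of scheduling jobs to best-fit resources concrete, here is a minimal Python sketch. It is not the Kubernetes scheduler and not the platform described in the talk; the `Node` class, the `best_fit` and `schedule` functions, and the resource names are all hypothetical, chosen only to illustrate one common best-fit heuristic: among the nodes that can satisfy a request, pick the one that would have the least capacity left over.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical cluster node with free capacity per resource type."""
    name: str
    free: Dict[str, int]  # e.g. {"cpu": 8, "gpu": 2}


def best_fit(nodes: List[Node], request: Dict[str, int]) -> Optional[Node]:
    """Return the feasible node with the least leftover capacity, or None."""

    def fits(node: Node) -> bool:
        # A node is feasible if it has enough of every requested resource.
        return all(node.free.get(res, 0) >= amt for res, amt in request.items())

    def leftover(node: Node) -> int:
        # Toy score: total free units remaining after placement.
        # Summing different resource types is a simplification.
        return sum(node.free.values()) - sum(request.values())

    candidates = [n for n in nodes if fits(n)]
    return min(candidates, key=leftover, default=None)


def schedule(nodes: List[Node], request: Dict[str, int]) -> Optional[str]:
    """Place a job on the best-fit node and deduct the resources it uses."""
    node = best_fit(nodes, request)
    if node is None:
        return None  # No node can host the job right now.
    for res, amt in request.items():
        node.free[res] -= amt
    return node.name


nodes = [
    Node("cpu-a", {"cpu": 8}),
    Node("gpu-a", {"cpu": 8, "gpu": 2}),
]
cpu_job = schedule(nodes, {"cpu": 4})            # lands on the CPU-only node
gpu_job = schedule(nodes, {"cpu": 2, "gpu": 1})  # only the GPU node fits
too_big = schedule(nodes, {"gpu": 4})            # no node has 4 free GPUs
```

With this scoring, a CPU-only job prefers the CPU-only node, which keeps scarce accelerators free for jobs that need them. A production scheduler weighs many more factors, such as affinity, topology, and priorities.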
