January 8, 2020

Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud


Talk Title	Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud
Speakers	Yang Che (Senior Engineer, Alibaba), Kai Zhang (Staff Engineer, Alibaba)
Conference	KubeCon + CloudNativeCon North America
Location	Seattle, WA, USA
Date	Dec 9-14, 2018
URL	Talk Page
Slides	Talk Slides

Running deep learning (DL) jobs requires an end-to-end workflow that accelerates iterative model training. The workflow must scale across massive data and computational resources, and be framework-agnostic to relieve the pain of managing diverse dependencies. At Alibaba Cloud, we use Kubernetes to build an elastic DL platform for continuous model training and optimization. It manages heterogeneous clusters that include CPUs, GPUs, and FPGAs, and jobs are automatically scheduled to the best-fit resources. Kubeflow, a great machine learning scaffold on Kubernetes, is used to set up the training pipeline. Project Arena was created to manage and instrument jobs with a friendly user experience. In this talk, we will discuss how the platform is designed and how it lets users focus on DL tasks instead of managing the underlying complexity. A demo shows how to run distributed neural network training in a minute.

Scaling AI Inference Workloads with GPUs and Kubernetes

January 7, 2020

Deep Learning (DL) is a computationally intense form of machine learning that has revolutionized many fields including computer vision, automated speech recognition, natural language processing and artif …

Distributed deep learning with containers on heterogeneous GPU clusters

November 26, 2019

Deep learning model performance relies on the underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as the orchestration layer that manages containers to train and deploy deep learning models on GPU clusters.

Pangeo: Big data climate science in the cloud

January 6, 2020

Climate science is being flooded with petabytes of data, overwhelming traditional modes of data analysis. The Pangeo project is building a platform to take big data climate science into the cloud using SciPy and large-scale interactive computing tools. Join Ryan Abernathey and Yuvi Panda to find out what the Pangeo team is building and why and learn how to use it.

Distributed TensorFlow on Hops

December 30, 2019

Fabio Buso offers demonstrations of frameworks for building distributed TensorFlow applications on the Hops platform and walks you through the whole model lifecycle, from debugging and visualizing models on TensorBoard to parallel experimentation and distributed training (with the help of Spark) to model deployment and inferencing using TensorFlow Serving and Kubernetes.

Distributed training of deep learning models

December 10, 2019

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Practical considerations when shifting to using deep learning for your text data

December 3, 2019

Emmanuel Ameisen and Yan Kou share a guide for moving your company toward deep learning using a collection of NLP best practices gathered from conversations with 75+ teams from Google, Facebook, Amazon, Twitter, Salesforce, Airbnb, Capital One, Bloomberg, and others.