Large Scale Distributed Deep Learning on Kubernetes Clusters
| | |
| --- | --- |
| Talk Title | Large Scale Distributed Deep Learning on Kubernetes Clusters |
| Speakers | Yong Tang (Director of Engineering, MobileIron), Yuan Tang (Senior Software Engineer, Ant Financial) |
| Conference | KubeCon + CloudNativeCon |
| Location | Shanghai, China |
| Date | Jun 23-26, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
This talk focuses on deploying large-scale distributed deep learning with Kubernetes, and discusses the use of operators to manage and automate machine learning training processes. We share our experiences and compare two open source Kubernetes operators, tf-operator and mpi-operator. Both operators manage TensorFlow training jobs, but they use different distribution strategies, which lead to different performance results with respect to CPU, GPU, and network utilization. Deep learning tasks are both network- and GPU-intensive, so proper orchestration and optimization matter: an imbalance can easily leave compute capacity idle, which is far more expensive on GPU nodes than on CPU nodes. We will share our experiences in the hope of providing helpful insight for better economics in machine learning workloads.
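To make the contrast between the two operators concrete, here is a minimal sketch (not the speakers' actual setup) that submits a TFJob (parameter-server style) and an MPIJob (allreduce style) through the Kubernetes Python client. It assumes the Kubeflow tf-operator and mpi-operator CRDs are installed in the cluster; the job names, container images, and replica counts (`mnist-ps`, `example/mnist-train:latest`, etc.) are hypothetical placeholders, and the CRD group/version strings vary across operator releases.

```python
# Hypothetical sketch: submitting a TFJob and an MPIJob via the Kubernetes
# Python client. Assumes the Kubeflow tf-operator and mpi-operator CRDs are
# installed; images, names, and replica counts are illustrative only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

# tf-operator (parameter-server strategy): PS pods hold the model variables,
# workers push gradients to them, so PS network bandwidth can bottleneck.
tf_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-ps", "namespace": "default"},
    "spec": {
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 2,
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "example/mnist-train:latest",  # placeholder
                }]}},
            },
            "Worker": {
                "replicas": 4,
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "example/mnist-train:latest",  # placeholder
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }]}},
            },
        }
    },
}
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="tfjobs", body=tf_job)

# mpi-operator (allreduce strategy): a launcher pod runs mpirun, and workers
# exchange gradients peer-to-peer, spreading network load more evenly.
mpi_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "mnist-allreduce", "namespace": "default"},
    "spec": {
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "launcher",
                    "image": "example/mnist-horovod:latest",  # placeholder
                    "command": ["mpirun", "python", "train.py"],
                }]}},
            },
            "Worker": {
                "replicas": 4,
                "template": {"spec": {"containers": [{
                    "name": "worker",
                    "image": "example/mnist-horovod:latest",  # placeholder
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }]}},
            },
        }
    },
}
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="mpijobs", body=mpi_job)
```

The structural difference between the two specs mirrors the utilization trade-off discussed in the talk: the parameter-server layout concentrates traffic on a few CPU-only PS pods, while the MPI layout keeps every GPU worker both computing and communicating, which affects how much paid-for GPU capacity sits idle.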