Large Scale Distributed Deep Learning with Kubernetes Operators
| Talk Title | Large Scale Distributed Deep Learning with Kubernetes Operators |
| Speakers | Yong Tang (Director of Engineering, MobileIron), Yuan Tang (Senior Software Engineer, Ant Financial) |
| Conference | KubeCon + CloudNativeCon Europe |
| Location | Barcelona, Spain |
| Date | May 19-23, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
The focus of this talk is the use of Kubernetes operators to manage and automate the training process for machine learning tasks. Two open source Kubernetes operators, tf-operator and mpi-operator, will be discussed. Both operators manage TensorFlow training jobs, but they support different distribution strategies. The tf-operator fits the parameter server strategy, which coordinates training through centralized parameter servers. The mpi-operator, on the other hand, relies on MPI's allreduce primitive. While the parameter server strategy requires the right ratio of CPUs (for parameter servers) to GPUs (for workers) to achieve optimal network usage, the allreduce strategy makes network cost easier to optimize. We will share performance numbers in our talk comparing the two operators.
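To make the contrast concrete, here is a minimal sketch of a TFJob as managed by tf-operator, with a CPU-backed `PS` pool and a GPU-backed `Worker` pool. The image name, replica counts, and resource figures are hypothetical placeholders; the `tfReplicaSpecs` layout follows the TFJob API.

```yaml
# Sketch of a TFJob using the parameter server strategy (tf-operator).
# Image name, replica counts, and resources are hypothetical placeholders.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-ps-example
spec:
  tfReplicaSpecs:
    PS:                      # parameter servers: CPU-bound coordinators
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/mnist-train:latest   # hypothetical image
              resources:
                requests:
                  cpu: "4"
    Worker:                  # workers: GPU-bound gradient computation
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/mnist-train:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

An MPIJob managed by mpi-operator replaces the parameter servers with a launcher that runs `mpirun` against a homogeneous pool of GPU workers. Again, names and counts below are placeholders, and the exact field layout varies across mpi-operator API versions; this sketch assumes the v2beta1 API.

```yaml
# Sketch of an MPIJob using the allreduce strategy (mpi-operator).
# Field layout assumes the v2beta1 API; names and counts are placeholders.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mnist-allreduce-example
spec:
  slotsPerWorker: 1          # one MPI slot (GPU) per worker pod
  mpiReplicaSpecs:
    Launcher:                # runs mpirun; needs no GPU itself
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: example.com/horovod-mnist:latest  # hypothetical image
              command: ["mpirun", "python", "/train.py"]
    Worker:                  # workers exchange gradients directly via allreduce
      replicas: 4
      template:
        spec:
          containers:
            - name: worker
              image: example.com/horovod-mnist:latest  # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

The two sketches mirror the tuning trade-off above: the TFJob must balance two heterogeneous pools (CPU parameter servers versus GPU workers), while the MPIJob scales a single homogeneous worker pool.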