Improving Performance of Deep Learning Workloads With Volcano
| Talk Title | Improving Performance of Deep Learning Workloads With Volcano |
|------------|---------------------------------------------------------------|
| Speakers | Ti Zhou (Architect, Baidu) |
| Conference | KubeCon + CloudNativeCon North America |
| Conf Tag | |
| Location | San Diego, CA, USA |
| Date | Nov 15-21, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Baidu has internally improved the performance of large-scale deep learning workloads by using the Volcano project. The CRD-based computing resource model makes it possible to use resources more efficiently and to configure computing models more flexibly. Volcano provides a unified abstraction over underlying capabilities such as gang scheduling, fair share, priority queues, and job suspend/resume, which makes up for the missing functionality of the native Job-based training operators.

After adopting Volcano, Baidu's internal resource utilization increased by 15% and training task completion speed increased by 10%. This talk will introduce the overall functionality of Volcano, the transformation of the old operator to support Volcano, and a comparison of the performance of deep learning training tasks before and after using Volcano.
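To make the gang-scheduling and queue concepts concrete, here is a minimal sketch in Go of what a Volcano Job for a parameter-server/worker training task looks like. It assumes the `volcano.sh/apis` Go module (package `batch/v1alpha1`); the job name, queue name, PriorityClass, and container images are hypothetical, and the talk itself does not show this code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"

	batchv1alpha1 "volcano.sh/apis/pkg/apis/batch/v1alpha1"
)

// trainerPod returns a single-container pod template for one training role.
func trainerPod(name, image string) corev1.PodTemplateSpec {
	return corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyOnFailure,
			Containers:    []corev1.Container{{Name: name, Image: image}},
		},
	}
}

func main() {
	// A PS/worker training job handed to the Volcano scheduler.
	// MinAvailable: 5 enables gang scheduling: no pod starts until all
	// five pods (1 ps + 4 workers) can be placed at once, so a job can
	// never deadlock holding half of its resources.
	job := &batchv1alpha1.Job{
		TypeMeta:   metav1.TypeMeta{APIVersion: "batch.volcano.sh/v1alpha1", Kind: "Job"},
		ObjectMeta: metav1.ObjectMeta{Name: "dl-training", Namespace: "default"},
		Spec: batchv1alpha1.JobSpec{
			SchedulerName:     "volcano",
			Queue:             "training",      // hypothetical fair-share queue
			PriorityClassName: "high-priority", // hypothetical PriorityClass
			MinAvailable:      5,
			Tasks: []batchv1alpha1.TaskSpec{
				{Name: "ps", Replicas: 1, Template: trainerPod("ps", "training-image:latest")},
				{Name: "worker", Replicas: 4, Template: trainerPod("worker", "training-image:latest")},
			},
		},
	}

	// Emit the manifest for inspection; a real operator would instead
	// create this object through the generated Volcano clientset.
	out, err := yaml.Marshal(job)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The `Queue` field is what ties a job into fair-share and priority scheduling: capacity is divided across Volcano `Queue` resources, so jobs submitted to the same queue share its weight rather than competing cluster-wide.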