Networking Optimizations for Multi-Node Deep Learning on Kubernetes
| Talk Title | Networking Optimizations for Multi-Node Deep Learning on Kubernetes |
| --- | --- |
| Speakers | Erez Cohen (Vice President for CloudX & AI Program, Mellanox), Rajat Chopra (Principal Engineer, NVIDIA) |
| Conference | KubeCon + CloudNativeCon North America |
| Conf Tag | |
| Location | San Diego, CA, USA |
| Date | Nov 15-21, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Training a neural network may take days or weeks, even on a top-of-the-line GPU. To reduce training time, distributed computation is often employed to spread the work across multiple GPUs and multiple nodes; Horovod is a prime example of such a scalable architecture. At NVIDIA, in collaboration with the community, we have configured Kubernetes and multi-node infrastructure to deliver performance that scales as more GPUs and nodes are added. This talk presents the networking problems and solutions discovered during this journey. The non-exhaustive list includes: CNI support for multiple networks using SR-IOV; enabling RDMA over InfiniBand and Ethernet (RoCE) to provide low latency, high throughput, and direct GPU-to-NIC connectivity (GPUDirect); enforcing PCI affinity of GPUs with respect to network interfaces; using source-based routing within pods for L3 networks; and much more.
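As a rough illustration of the multi-network pattern mentioned above, a secondary SR-IOV interface is commonly attached to training pods through a Multus `NetworkAttachmentDefinition`. The sketch below is an assumption about how such a setup might look, not material from the talk; the resource name `mellanox.com/sriov_rdma`, network name, image, and subnet are all placeholders.

```yaml
# Hypothetical sketch: an SR-IOV secondary network for RDMA traffic,
# attached via Multus. All names and addresses are placeholder assumptions.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rdma-net
  annotations:
    # Ties this network to VFs advertised by the SR-IOV device plugin.
    k8s.v1.cni.cncf.io/resourceName: mellanox.com/sriov_rdma
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "ipam": {
      "type": "host-local",
      "subnet": "192.168.100.0/24"
    }
  }'
---
# A training pod requests one VF and references the secondary network,
# so RDMA traffic bypasses the default pod network.
apiVersion: v1
kind: Pod
metadata:
  name: horovod-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-net
spec:
  containers:
  - name: trainer
    image: horovod/horovod:latest
    resources:
      limits:
        mellanox.com/sriov_rdma: "1"
```

With this shape, the device plugin handles VF allocation and the scheduler can account for NIC resources alongside GPUs, which is also where PCI-affinity-aware placement of GPUs relative to NICs comes into play.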