November 24, 2019


Networking Optimizations for Multi-Node Deep Learning on Kubernetes



Talk Title Networking Optimizations for Multi-Node Deep Learning on Kubernetes
Speakers Erez Cohen (Vice President for CloudX & AI Program, Mellanox), Rajat Chopra (Principal Engineer, Nvidia)
Conference KubeCon + CloudNativeCon North America
Location San Diego, CA, USA
Date Nov 15-21, 2019
URL Talk Page
Slides Talk Slides

Training a neural network may take days or weeks, even on a top-of-the-line GPU. To reduce training time, distributed computation is often employed to spread the work across multiple GPUs and multiple nodes; Horovod is a prime example of such a scalable architecture. At NVIDIA, in collaboration with the community, we have configured Kubernetes and multi-node infrastructure to deliver performance that scales as more GPUs and nodes are added. This talk presents the networking problems and solutions discovered during this journey. The non-exhaustive list includes using a CNI for multiple networks with SR-IOV, enabling RDMA over InfiniBand and Ethernet (RoCE) to provide low latency, high throughput, and direct GPU-to-NIC connectivity (GPUDirect), enforcing PCI affinity of GPUs with respect to network interfaces, using source-based routing within pods for L3 networks, and much more.
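As a rough illustration of the first two items, a pod can attach a secondary SR-IOV network through a Multus-style annotation and request RDMA-capable NIC and GPU resources alongside its primary cluster network. This is a minimal sketch, not taken from the talk: the network attachment name (`sriov-rdma-net`) and the resource names are assumptions that depend on how the SR-IOV device plugin and GPU device plugin are configured in a given cluster.

```yaml
# Hypothetical pod spec: primary CNI network plus a secondary
# SR-IOV network for RDMA traffic between training workers.
apiVersion: v1
kind: Pod
metadata:
  name: horovod-worker-0
  annotations:
    # Assumed NetworkAttachmentDefinition created by the cluster admin
    k8s.v1.cni.cncf.io/networks: sriov-rdma-net
spec:
  containers:
    - name: worker
      image: horovod-training:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "1"          # GPU via NVIDIA device plugin
          # Resource name below is an example; it must match the
          # SR-IOV device plugin's configured resource pool.
          mellanox.com/sriov_rdma: "1"
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]            # commonly needed to pin RDMA memory
```

With PCI affinity enforced (for example, via the topology manager or scheduling constraints), the allocated GPU and virtual function ideally sit under the same PCIe root complex, which is what makes GPUDirect-style GPU-to-NIC transfers effective.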
