December 10, 2019

357 words 2 mins read

Distributed training of deep learning models

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Talk Title: Distributed training of deep learning models
Speakers: Mathew Salvaris (Microsoft), Miguel Gonzalez-Fierro (Microsoft), Ilia Karmanov (Microsoft)
Conference: Strata Data Conference
Conf Tag: Making Data Work
Location: London, United Kingdom
Date: May 22-24, 2018

In the last year, there have been a number of attempts to train deep CNNs on the ImageNet dataset in the shortest time possible (the most recent managed it in 15 minutes), but all of these attempts took place on custom clusters, which are out of the reach of most data scientists. One of the key advantages of the cloud is being able to scale out compute resources as required.

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud. Both utilize Docker containers, making it possible to run any deep learning framework on them. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

The first platform is a service called Batch AI, which uses the Azure Batch infrastructure to easily run deep learning jobs at scale across GPUs. The second is an open source toolkit that allows data scientists to spin up clusters in a turnkey fashion; it utilizes Kubernetes and Grafana for easy job scheduling and monitoring and has been used in daily production by internal Microsoft groups.

Mathew, Miguel, and Ilia use these training platforms to train a ResNet network on the ImageNet dataset with each of the following frameworks: CNTK, TensorFlow (Horovod), PyTorch, MXNet, and Chainer, and then compare and contrast the performance. The examples presented can also be used as templates for your own deep learning problems.
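To give a feel for what the framework side of such a setup involves, here is a minimal sketch of data-parallel training with Horovod and PyTorch, one of the framework combinations the talk benchmarks. It is not taken from the talk's materials: the ResNet-50 model matches the talk's workload, but the synthetic batch, batch size, and learning rate are illustrative placeholders so the snippet runs without downloading ImageNet.

```python
# Minimal Horovod + PyTorch data-parallel sketch (not from the talk).
# Launch one process per GPU, e.g.: horovodrun -np 4 python train.py
import torch
import torch.nn.functional as F
import horovod.torch as hvd
from torchvision import models

hvd.init()                               # set up the worker processes
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = models.resnet50().cuda()
# Linear learning-rate scaling with worker count is common practice;
# the base rate here is a placeholder.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via
# all-reduce, and start every worker from identical weights and state.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):
    # Synthetic stand-in for an ImageNet batch (224x224 RGB, 1000 classes).
    images = torch.randn(32, 3, 224, 224).cuda()
    labels = torch.randint(0, 1000, (32,)).cuda()

    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()                     # gradients are all-reduced here
    if hvd.rank() == 0:
        print(f"step {step} loss {loss.item():.3f}")
```

On either platform this script would run inside a Docker container on every node, which is what makes the same code portable across Batch AI and the Kubernetes-based toolkit.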
