November 26, 2019

447 words 3 mins read

Distributed deep learning with containers on heterogeneous GPU clusters

Deep learning model performance relies on the underlying data. Dong Meng offers an overview of a converged data platform that serves as the data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as the orchestration layer to manage containers that train and deploy deep learning models on GPU clusters.

Talk Title Distributed deep learning with containers on heterogeneous GPU clusters
Speakers Dong Meng (MapR)
Conference Strata Data Conference
Conf Tag Big Data Expo
Location San Jose, California
Date March 6-8, 2018
URL Talk Page
Slides Talk Slides
Video

After years of active research and development in deep learning, organizations have begun to explore ways to train and serve deep learning models on a cluster in a distributed fashion. Many build a dedicated GPU HPC cluster that works well in a research or development setting, but data then has to be moved consistently between clusters, and there is overhead in managing the data used to train deep learning models and in moving the models between research/development and production.

Dong Meng outlines the topics that need to be addressed to use distributed deep learning successfully, such as consistency, fault tolerance, communication, resource management, and programming libraries, and offers an overview of a converged data platform that serves as the data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as the orchestration layer to manage containers that train and deploy deep learning models on GPU clusters. Along the way, Dong demonstrates a simple distributed deep learning training program and explains how to leverage pub/sub capability to build global real-time deep learning applications on NVIDIA GPUs.

For consistency, most DL libraries introduce a parameter server and worker architecture to enable synchronization. A checkpoint/reload strategy provides fault tolerance. By designing the volume topology in the distributed filesystem, you can move GPU computing closer to the data; this addresses potential communication congestion by bringing your deep learning model, your data, and your applications together. For resource management, Kubernetes orchestrates the containers that train and deploy deep learning models with GPUs.

You'll learn how to use the converged data platform as the data infrastructure, providing a distributed filesystem, key-value storage, and streams to store data and build the data pipeline. With deep learning libraries like TensorFlow or Apache MXNet housed in persistent application client containers (PACCs), you can persist models to the distributed filesystem, give DL frameworks full access to the vast data on the distributed filesystem, and serve models to score the data coming in through streams. Furthermore, you can manage model versions and library dependencies through container images and customize the machine learning server for production.
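To make the parameter server/worker pattern and checkpoint-based fault tolerance concrete, here is a minimal sketch in TensorFlow 1.x-style APIs (contemporary with the talk). The cluster host names, checkpoint path, and the toy model are illustrative assumptions, not the speaker's actual code:

```python
# Sketch: parameter-server/worker synchronization plus checkpoint/reload
# fault tolerance. Host names and paths below are placeholders.
import numpy as np
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],                          # parameter servers hold shared variables
    "worker": ["worker0:2222", "worker1:2222"],  # workers compute gradients
})

job_name, task_index = "worker", 0               # set per process in practice
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                                # ps processes only serve variables
else:
    # Place variables on the ps tasks and ops on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 10])
        y = tf.placeholder(tf.float32, [None, 1])
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    # MonitoredTrainingSession periodically checkpoints to the distributed
    # filesystem; after a failure, a restarted worker reloads the latest
    # checkpoint automatically and resumes training.
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            checkpoint_dir="/mapr/cluster/models/demo") as sess:  # assumed path
        while not sess.should_stop():
            xs = np.random.rand(32, 10).astype(np.float32)  # stand-in batch
            ys = np.random.rand(32, 1).astype(np.float32)
            sess.run(train_op, feed_dict={x: xs, y: ys})
```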
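For the resource-management piece, a training container can request GPUs so Kubernetes schedules it onto a node with free devices and mounts the shared data volume. The sketch below uses the official Kubernetes Python client; the image name, namespace, volume path, and GPU count are assumptions for illustration:

```python
# Sketch: schedule a GPU training container with Kubernetes (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="dl-train-worker-0"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="example/tf-gpu-trainer:latest",        # assumed training image
            # Requesting GPUs via the NVIDIA device-plugin resource name lets
            # the scheduler place the pod on a node with available GPUs.
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}),
            volume_mounts=[client.V1VolumeMount(
                name="data", mount_path="/mapr")],        # shared data/model volume
        )],
        volumes=[client.V1Volume(
            name="data",
            host_path=client.V1HostPathVolumeSource(path="/mapr"))],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```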
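Finally, the serving side can reload the persisted model from the distributed filesystem and score records as they arrive on a stream. MapR Event Streams exposes a Kafka-compatible API, so the sketch below uses a generic Kafka-style consumer; the topic, broker address, model path, and tensor names are all assumptions:

```python
# Sketch: score streaming records with a model reloaded from the checkpoint dir.
import json
from kafka import KafkaConsumer          # pip install kafka-python
import tensorflow as tf

MODEL_DIR = "/mapr/cluster/models/demo"  # assumed checkpoint location

consumer = KafkaConsumer(
    "sensor-events",                      # assumed topic name
    bootstrap_servers=["broker:9092"],    # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

with tf.Session() as sess:
    # Reload the latest checkpoint that the training job persisted.
    latest = tf.train.latest_checkpoint(MODEL_DIR)
    saver = tf.train.import_meta_graph(latest + ".meta")
    saver.restore(sess, latest)
    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name("x:0")       # assumed input tensor name
    score = graph.get_tensor_by_name("score:0")  # assumed output tensor name

    for msg in consumer:
        features = msg.value["features"]      # assumed record layout
        print(sess.run(score, feed_dict={x: [features]}))
```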
