Deploying deep learning models on GPU-enabled Kubernetes clusters
Interested in deep learning models and how to deploy them on Kubernetes at production scale? Not sure whether you need GPUs or CPUs? Mathew Salvaris and Fidan Boylu Uz help you out by providing a step-by-step guide to creating a pretrained deep learning model, packaging it in a Docker container, and deploying it as a web service on a Kubernetes cluster.
| Talk Title | Deploying deep learning models on GPU-enabled Kubernetes clusters |
|---|---|
| Speakers | Mathew Salvaris (Microsoft), Fidan Boylu Uz (Microsoft) |
| Conference | O’Reilly Artificial Intelligence Conference |
| Conf Tag | Put AI to Work |
| Location | New York, New York |
| Date | April 16-18, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
One of the major challenges that data scientists often face is that once they have trained a model, they need to deploy it at production scale. It’s widely accepted that GPUs should be used for deep learning training, due to their significant speed advantage over CPUs. However, for tasks like inference, which are not as resource heavy as training, CPUs are usually sufficient and more attractive due to their lower cost. But when inference speed is a bottleneck, GPUs provide considerable gains from both financial and time perspectives. Coupled with containerized applications and container orchestrators like Kubernetes, it’s now possible to go from training to deployment with GPUs faster and more easily while satisfying latency and throughput goals for production-grade deployments.

Mathew Salvaris and Fidan Boylu Uz offer a step-by-step guide to creating a pretrained deep learning model, packaging it in a Docker container, and deploying it as a web service on a Kubernetes cluster. You’ll learn how to test and verify each step and discover the gotchas you may encounter. You’ll also explore a demo of how to make calls to the deployed service to score images on a predeployed Kubernetes cluster, as well as benchmarking results that provide a rough gauge of the performance of deep learning models on GPU and CPU clusters.

The tests use two frameworks, TensorFlow (1.8) and Keras (2.1.6) with a TensorFlow (1.6) backend, for five different models, selected to cover a wide range of networks, from small, parameter-efficient models such as MobileNet to large networks such as NasNetLarge. For each model, a Docker image with an API for scoring images was prepared and deployed on four different cluster configurations. Overall, the results show that throughput scales almost linearly with the number of GPUs and that GPUs always outperform CPUs at a similar price point. Mathew and Fidan also found that performance on GPU clusters was far more consistent than on CPU clusters, possibly because the contention for resources between the model and the web service that occurs in a CPU-only deployment is absent. These results suggest that for deep learning inference tasks using models with a high number of parameters, GPU-based deployments benefit from the lack of resource contention and provide significantly higher throughput than a CPU cluster of similar cost.

The session uses notebooks that you can return to later.
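As a rough illustration of the kind of image-scoring web service described above, the sketch below loads a pretrained Keras model and exposes it behind a single HTTP endpoint. The choice of ResNet50, the `/score` route, and the Flask framework are assumptions made for this example, not the speakers’ exact code, which benchmarks other networks such as MobileNet and NasNetLarge.

```python
# Minimal sketch of an image-scoring service, assuming a Keras pretrained
# model (ResNet50 used here purely for illustration) behind a Flask endpoint.
import io

import numpy as np
from flask import Flask, jsonify, request
from keras.applications.resnet50 import ResNet50, decode_predictions, preprocess_input
from PIL import Image

app = Flask(__name__)
model = ResNet50(weights="imagenet")  # load the pretrained model once at startup


@app.route("/score", methods=["POST"])
def score():
    # Expect raw image bytes in the request body
    img = Image.open(io.BytesIO(request.data)).convert("RGB").resize((224, 224))
    batch = preprocess_input(np.expand_dims(np.asarray(img, dtype=np.float32), axis=0))
    preds = model.predict(batch)
    top = decode_predictions(preds, top=3)[0]
    return jsonify([{"label": label, "score": float(prob)} for (_, label, prob) in top])


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Packaged in a Docker image (built on a GPU-enabled TensorFlow base where appropriate), a service like this can then be deployed to a Kubernetes cluster and exposed as a service, which is the pattern the session walks through step by step.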
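Making calls to the deployed service then amounts to posting image bytes to the exposed endpoint. The URL, image filename, and route below are placeholders standing in for whatever address the Kubernetes service or ingress actually exposes.

```python
# Hypothetical client call to the deployed scoring service; replace the URL
# and filename with your cluster's exposed endpoint and a local test image.
import requests

SCORING_URL = "http://<cluster-ip>/score"  # placeholder address

with open("test_image.jpg", "rb") as f:
    response = requests.post(
        SCORING_URL,
        data=f.read(),
        headers={"Content-Type": "application/octet-stream"},
    )

print(response.json())  # e.g. top-3 predicted labels with scores
```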