Production GPU Cluster with K8s for AI and DL Workloads
| Talk Title | Production GPU Cluster with K8s for AI and DL Workloads |
| Speakers | Madhukar Korupolu (Distinguished Engineer, NVIDIA) |
| Conference | KubeCon + CloudNativeCon Europe |
| Conf Tag | |
| Location | Barcelona, Spain |
| Date | May 19-23, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
We will present NVIDIA’s experience in building and operating a production GPU cluster with K8s for AI/DL and HPC workloads. Running GPU-accelerated workloads in K8s poses unique challenges, and we’ll describe how we addressed some of these in production at scale. We will describe the tools we have built for automated provisioning of GPU nodes (including CUDA driver upgrades), a custom scheduler specialized for batch jobs, and monitoring of GPU jobs in production with health checks and telemetry. We will also discuss gaps we have identified that stand in the way of more reliable and efficient utilization of GPU resources (e.g., GPU affinity, sharing, co-scheduling), and share an update on our current projects.
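For context on what "GPU-accelerated workloads in K8s" means in practice: GPUs are exposed to pods as the extended resource `nvidia.com/gpu`, advertised by the NVIDIA device plugin on GPU nodes. The sketch below builds such a pod manifest in Python; it illustrates the standard K8s mechanism, not NVIDIA's internal tooling, and the pod name and container image are purely illustrative.

```python
def gpu_pod_manifest(name, image, num_gpus=1):
    """Build a minimal K8s pod spec requesting `num_gpus` GPUs.

    GPUs are requested via the extended resource "nvidia.com/gpu";
    for extended resources, specifying a limit is sufficient (the
    request defaults to the limit).
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,  # illustrative image name
                "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
            }],
        },
    }

# Example: a single-GPU training pod (names are hypothetical).
manifest = gpu_pod_manifest("dl-train", "example.com/dl-training:latest")
```

Serialized to YAML, this manifest can be submitted with `kubectl apply -f`; the K8s scheduler will then only place the pod on a node where the device plugin has advertised free `nvidia.com/gpu` capacity.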