Production GPU Cluster with K8s for AI and DL Workloads
| Talk Title | Production GPU Cluster with K8s for AI and DL Workloads |
| Speakers | Madhukar Korupolu (Distinguished Engineer, NVIDIA) |
| Conference | KubeCon + CloudNativeCon Europe |
| Conf Tag | |
| Location | Barcelona, Spain |
| Date | May 19-23, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
We will present NVIDIA’s experience in building and operating a production GPU cluster with K8s for AI/DL and HPC workloads. Running GPU-accelerated workloads in K8s poses unique challenges, and we’ll describe how we addressed some of these in production at scale. We will describe the tools we have built for automated provisioning of GPU nodes (including CUDA driver upgrades), a custom scheduler specialized for batch jobs, and monitoring of GPU jobs in production with health checks and telemetry. We will also discuss gaps we have identified that stand in the way of more reliable and efficient utilization of GPU resources (e.g., GPU affinity, sharing, co-scheduling), and share an update on our current projects.
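For context on what "GPU-accelerated workloads in K8s" means in practice: GPUs are exposed to pods as the extended resource `nvidia.com/gpu`, advertised by the NVIDIA device plugin on GPU nodes. The sketch below builds such a pod manifest in Python; it illustrates the standard K8s mechanism, not NVIDIA's internal tooling, and the pod name and container image are purely illustrative.

```python
def gpu_pod_manifest(name, image, num_gpus=1):
    """Build a minimal K8s pod spec requesting `num_gpus` GPUs.

    GPUs are requested via the extended resource "nvidia.com/gpu";
    for extended resources, specifying a limit is sufficient (the
    request defaults to the limit).
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,  # illustrative image name
                "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
            }],
        },
    }

# Example: a single-GPU training pod (names are hypothetical).
manifest = gpu_pod_manifest("dl-train", "example.com/dl-training:latest")
```

Serialized to YAML, this manifest can be submitted with `kubectl apply -f`; the K8s scheduler will then only place the pod on a node where the device plugin has advertised free `nvidia.com/gpu` capacity.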