October 25, 2019

181 words 1 min read

Production GPU Cluster with K8s for AI and DL Workloads

Talk Title: Production GPU Cluster with K8s for AI and DL Workloads
Speakers: Madhukar Korupolu (Distinguished Engineer, NVIDIA)
Conference: KubeCon + CloudNativeCon Europe
Location: Barcelona, Spain
Date: May 19-23, 2019
URL: Talk Page
Slides: Talk Slides

We will present NVIDIA's experience in building and operating a production GPU cluster with K8s for AI/DL and HPC workloads. Running GPU-accelerated workloads in K8s has unique challenges, and we'll describe how we addressed some of these in production at scale. We will describe the tools we have built for automated provisioning of GPU nodes (including CUDA driver upgrades), a custom scheduler specialized for batch jobs, and the monitoring of GPU jobs in production with health checks and telemetry. We will also discuss gaps we have identified to enable more reliable and efficient utilization of GPU resources (e.g., GPU affinity, sharing, co-scheduling) and share an update on our current projects.
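For context (this sketch is not taken from the talk): a minimal example of how a GPU-accelerated workload is typically requested in K8s, using the official Kubernetes Python client and assuming the NVIDIA device plugin is installed so that the `nvidia.com/gpu` resource is allocatable. Names, image tags, and the pod spec here are illustrative.

```python
# Hypothetical sketch: create a pod that requests one NVIDIA GPU.
# Assumes the NVIDIA device plugin exposes "nvidia.com/gpu" on the cluster nodes.
from kubernetes import client, config


def create_gpu_pod(namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="cuda-smoke-test"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda",
                    image="nvidia/cuda:10.1-base",  # illustrative image tag
                    command=["nvidia-smi"],
                    # GPUs are requested via limits and cannot be oversubscribed.
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)


if __name__ == "__main__":
    create_gpu_pod()
```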
