Co-Location of CPU and GPU Workloads with High Resource Efficiency
| Field | Value |
|-------|-------|
| Talk Title | Co-Location of CPU and GPU Workloads with High Resource Efficiency |
| Speakers | Jian He (Staff Engineer, Alibaba), Penghao Cen (Senior Engineer, Ant Financial) |
| Conference | KubeCon + CloudNativeCon |
| Conf Tag | |
| Location | Shanghai, China |
| Date | Jun 23-26, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Users run various kinds of workloads in Kubernetes, including long-running services and AI batch jobs. GPU machines are typically dedicated to AI training, so their resources often sit idle. Have you ever considered co-locating different kinds of workloads on the same node to save machines, and therefore money? In this talk we share our experience and practices with co-location in Kubernetes clusters. In detail:

- Why and how we created a new QoS class derived from BestEffort
- Why and how we created a node-level cgroup for batch jobs
- How we use a CRD named PodGroup to achieve gang scheduling
- How we evaluate utilization

Over the past months we have built a co-location cluster with more than 100 GPU (NVIDIA Tesla P100) nodes and more than 500 CPU nodes. By co-deploying both long-running services and AI batch jobs, we achieved a utilization increase of 10%.
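The abstract does not spell out how the new QoS class is carved out of BestEffort; a minimal sketch of the idea is that pods with no requests or limits but a batch marker are treated as a distinct class instead of plain BestEffort. The label key `colocation/qos` and the class names below are illustrative assumptions, not the speakers' actual implementation.

```python
BATCH_LABEL = "colocation/qos"  # hypothetical pod label, not from the talk

def qos_class(pod):
    """Return a QoS class for a pod dict with optional 'requests',
    'limits', and 'labels' keys (a simplified pod model)."""
    if pod.get("requests") or pod.get("limits"):
        # Kubernetes would assign Guaranteed or Burstable here; collapsed
        # into one bucket for brevity.
        return "Guaranteed/Burstable"
    if pod.get("labels", {}).get(BATCH_LABEL) == "batch":
        return "Batch"  # the new class carved out of BestEffort
    return "BestEffort"

print(qos_class({"labels": {"colocation/qos": "batch"}}))  # Batch
print(qos_class({}))                                       # BestEffort
```

The point of a separate class is that batch pods can then be throttled or evicted as a group, rather than being indistinguishable from every other BestEffort pod.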
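A node-level cgroup for batch jobs typically works by granting batch workloads whatever CPU the latency-sensitive services are not using. The arithmetic below sketches how such a cgroup's v2 `cpu.max` value could be derived; the reserve size and the formula are illustrative assumptions, not the configuration described in the talk.

```python
def batch_cpu_max(node_millicores, online_usage_millicores,
                  reserve_millicores=1000, period_us=100_000):
    """Compute a cgroup-v2 'cpu.max' string ("<quota_us> <period_us>")
    for a node-level batch cgroup: batch jobs get the CPU that online
    services are not using, minus a safety reserve (values illustrative)."""
    spare = max(node_millicores - online_usage_millicores - reserve_millicores, 0)
    quota_us = spare * period_us // 1000  # millicores -> usec of CPU per period
    return f"{quota_us} {period_us}"

# 32-core node, online services currently using 20 cores:
print(batch_cpu_max(32000, 20000))  # 1100000 100000
```

An agent on the node would periodically re-run this calculation and rewrite the batch cgroup's `cpu.max`, so batch capacity shrinks automatically when service load rises.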
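Gang scheduling via a PodGroup CRD means the pods of an AI job are admitted all-or-nothing, so a distributed training job never deadlocks with half its workers running. The toy scheduler below illustrates that admission rule; the `PodGroup` fields and the `free_slots` capacity model are simplifying assumptions, not the speakers' controller.

```python
from dataclasses import dataclass, field

@dataclass
class PodGroup:
    """Minimal stand-in for the PodGroup CRD mentioned in the talk."""
    name: str
    min_member: int               # pods that must be schedulable together
    pending: list = field(default_factory=list)

def try_gang_schedule(group, free_slots):
    """All-or-nothing admission: bind every pod in the group only when
    there is room for the whole gang; otherwise bind none of them."""
    if len(group.pending) < group.min_member or free_slots < len(group.pending):
        return []                 # keep the entire gang pending
    scheduled = group.pending[:]
    group.pending = []
    return scheduled

pg = PodGroup("ai-job", min_member=4, pending=["w0", "w1", "w2", "w3"])
print(try_gang_schedule(pg, free_slots=3))  # [] -- gang stays pending
print(try_gang_schedule(pg, free_slots=5))  # all four workers bound at once
```

In a real cluster the same check runs inside a scheduler plugin, with `free_slots` replaced by per-node resource fit for each pod in the group.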