October 9, 2019

226 words 2 mins read

Co-Location of CPU and GPU Workloads with High Resource Efficiency

Talk Title Co-Location of CPU and GPU Workloads with High Resource Efficiency
Speakers Jian He (Staff Engineer, Alibaba), Penghao Cen (Senior Engineer, Ant Financial)
Conference KubeCon + CloudNativeCon
Location Shanghai, China
Date Jun 23-26, 2019
URL Talk Page
Slides Talk Slides

Users run a variety of workloads in Kubernetes, including long-running services and AI batch jobs. GPU machines are usually dedicated to AI training, so their resource utilization is often low. Have you ever considered co-locating different kinds of workloads on the same node so you can save machines, and therefore money? In this talk we share our experience and practices with the co-location mechanism in a Kubernetes cluster. In detail:

- Why and how we created a new QoS class derived from BestEffort
- Why and how we created a node-level cgroup for batch jobs
- How we use a CRD named PodGroup to achieve gang scheduling
- How we evaluate utilization

Over the past months, we built a co-location cluster with more than 100 GPU (NVIDIA Tesla P100) nodes and more than 500 CPU nodes. We co-deployed both long-running services and AI batch jobs and achieved a utilization increase of 10%.
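The gang-scheduling idea behind the PodGroup CRD can be sketched as an all-or-nothing admission rule: a batch job's pods start only when the whole group can run, which avoids partially scheduled jobs holding GPUs idle. The sketch below is a minimal illustration of that rule; the `PodGroup` struct and field names here are assumptions for illustration, not the actual CRD schema from the talk.

```go
package main

import "fmt"

// PodGroup mirrors the minimum-member idea of a gang-scheduling CRD.
// Field names are illustrative assumptions, not the real schema.
type PodGroup struct {
	Name      string
	MinMember int // minimum number of pods that must run together
}

// canScheduleGang applies the all-or-nothing rule: admit the group only
// if every one of its MinMember pods has been submitted AND the cluster
// has room for all of them at once; otherwise admit none.
func canScheduleGang(pg PodGroup, pendingPods, freeSlots int) bool {
	if pendingPods < pg.MinMember {
		// The gang is not fully submitted yet; hold everything back.
		return false
	}
	// Admit only if the whole gang fits, so no pod waits on a GPU
	// while its peers cannot be placed.
	return freeSlots >= pg.MinMember
}

func main() {
	pg := PodGroup{Name: "training-job", MinMember: 4}
	fmt.Println(canScheduleGang(pg, 4, 8)) // whole gang fits: admit
	fmt.Println(canScheduleGang(pg, 4, 2)) // partial fit: admit none
}
```

Compared with scheduling each pod independently, this rule trades a little queueing delay for the guarantee that co-located batch jobs never strand GPU capacity mid-startup.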
