Co-Location of CPU and GPU Workloads with High Resource Efficiency
| Field | Value |
|-------|-------|
| Talk Title | Co-Location of CPU and GPU Workloads with High Resource Efficiency |
| Speakers | Jian He (Staff Engineer, Alibaba), Penghao Cen (Senior Engineer, Ant Financial) |
| Conference | KubeCon + CloudNativeCon |
| Conf Tag | |
| Location | Shanghai, China |
| Date | Jun 23-26, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Users run various kinds of workloads in Kubernetes, including long-running services and AI batch jobs. GPU machines are typically dedicated to AI training, so their resources often sit idle. Have you ever considered co-locating different kinds of workloads on the same node to save machines, and therefore money? In this talk we share our experience and practices with co-location in Kubernetes clusters. In detail:

- Why and how we created a new QoS class derived from BestEffort
- Why and how we created a node-level cgroup for batch jobs
- How we use a CRD named PodGroup to achieve gang scheduling
- How we evaluate utilization

Over the past months we have built a co-location cluster with more than 100 GPU (NVIDIA Tesla P100) nodes and more than 500 CPU nodes. By co-deploying both long-running services and AI batch jobs, we achieved a utilization increase of 10%.
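The abstract does not spell out how the new QoS class is carved out of BestEffort; a minimal sketch of the idea is that pods with no requests or limits but a batch marker are treated as a distinct class instead of plain BestEffort. The label key `colocation/qos` and the class names below are illustrative assumptions, not the speakers' actual implementation.

```python
BATCH_LABEL = "colocation/qos"  # hypothetical pod label, not from the talk

def qos_class(pod):
    """Return a QoS class for a pod dict with optional 'requests',
    'limits', and 'labels' keys (a simplified pod model)."""
    if pod.get("requests") or pod.get("limits"):
        # Kubernetes would assign Guaranteed or Burstable here; collapsed
        # into one bucket for brevity.
        return "Guaranteed/Burstable"
    if pod.get("labels", {}).get(BATCH_LABEL) == "batch":
        return "Batch"  # the new class carved out of BestEffort
    return "BestEffort"

print(qos_class({"labels": {"colocation/qos": "batch"}}))  # Batch
print(qos_class({}))                                       # BestEffort
```

The point of a separate class is that batch pods can then be throttled or evicted as a group, rather than being indistinguishable from every other BestEffort pod.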
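A node-level cgroup for batch jobs typically works by granting batch workloads whatever CPU the latency-sensitive services are not using. The arithmetic below sketches how such a cgroup's v2 `cpu.max` value could be derived; the reserve size and the formula are illustrative assumptions, not the configuration described in the talk.

```python
def batch_cpu_max(node_millicores, online_usage_millicores,
                  reserve_millicores=1000, period_us=100_000):
    """Compute a cgroup-v2 'cpu.max' string ("<quota_us> <period_us>")
    for a node-level batch cgroup: batch jobs get the CPU that online
    services are not using, minus a safety reserve (values illustrative)."""
    spare = max(node_millicores - online_usage_millicores - reserve_millicores, 0)
    quota_us = spare * period_us // 1000  # millicores -> usec of CPU per period
    return f"{quota_us} {period_us}"

# 32-core node, online services currently using 20 cores:
print(batch_cpu_max(32000, 20000))  # 1100000 100000
```

An agent on the node would periodically re-run this calculation and rewrite the batch cgroup's `cpu.max`, so batch capacity shrinks automatically when service load rises.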
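Gang scheduling via a PodGroup CRD means the pods of an AI job are admitted all-or-nothing, so a distributed training job never deadlocks with half its workers running. The toy scheduler below illustrates that admission rule; the `PodGroup` fields and the `free_slots` capacity model are simplifying assumptions, not the speakers' controller.

```python
from dataclasses import dataclass, field

@dataclass
class PodGroup:
    """Minimal stand-in for the PodGroup CRD mentioned in the talk."""
    name: str
    min_member: int               # pods that must be schedulable together
    pending: list = field(default_factory=list)

def try_gang_schedule(group, free_slots):
    """All-or-nothing admission: bind every pod in the group only when
    there is room for the whole gang; otherwise bind none of them."""
    if len(group.pending) < group.min_member or free_slots < len(group.pending):
        return []                 # keep the entire gang pending
    scheduled = group.pending[:]
    group.pending = []
    return scheduled

pg = PodGroup("ai-job", min_member=4, pending=["w0", "w1", "w2", "w3"])
print(try_gang_schedule(pg, free_slots=3))  # [] -- gang stays pending
print(try_gang_schedule(pg, free_slots=5))  # all four workers bound at once
```

In a real cluster the same check runs inside a scheduler plugin, with `free_slots` replaced by per-node resource fit for each pod in the group.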