November 19, 2019


Supercharge Kubeflow Performance on GPU Clusters



Talk Title	Supercharge Kubeflow Performance on GPU Clusters
Speakers	Meenakshi Kaushik (Product Manager, Cisco), Neelima Mukiri (Principal Engineer, Cisco)
Conference	KubeCon + CloudNativeCon North America
Conf Tag
Location	San Diego, CA, USA
Date	Nov 15-21, 2019
URL	Talk Page
Slides	Talk Slides
Video

AI/ML applications on Kubernetes can be optimized for performance at many levels. This presentation provides an overview of the optimizations such as:

- Distributed training on multiple GPUs with optimal selection of interconnects between the GPUs and CPUs.
- Utilizing different types of GPUs/Servers for different workloads like training and inference.
- OS level optimizations to get optimal performance on the hardware.
- Usage of GPU Passthrough for optimal utilization and performance.

This presentation will also cover how the selection of machine learning framework, like Kubeflow, can impact performance and hardware utilization.

cluster ai/ml framework gpu machine learning performance kubernetes hardware optimization


Large Scale Distributed Deep Learning on Kubernetes Clusters

October 2, 2019

The focus of this talk is the deployment of large scale distributed deep learning with Kubernetes. The usage of operators to manage and automate training processes for machine learning is discussed. …

Running eBay's High-Performance Workloads with Kubernetes

October 25, 2019

In the past two years we've been rapidly expanding our k8s deployments by moving more and more production workloads into Kubernetes. We're now running multiple thousand-node k8s clusters fro …

A Method for the Cost Optimization of Kubernetes-based Deep Learning Training and Inference

September 26, 2019

To improve the throughput capacity of the training or inference applications without adding extra GPU cores, we share one GPU core between multiple deep learning workloads in a Kubernetes cluster by c …

Large Scale Distributed Deep Learning with Kubernetes Operators

October 29, 2019

The focus of this talk is the usage of Kubernetes operators to manage and automate the training process for machine learning tasks. Two open source Kubernetes operators, tf-operator and mpi-operator, will …

Economics and Best Practices of Running AI/ML Workloads on Kubernetes

October 19, 2019

In this session, we will discuss how Kubernetes-driven AI/ML building blocks are making AI/ML simple, fast and efficient for data scientists, data engineers, DevOps engineers and everyday users. We wi …

Delivering TV Everywhere with Cloud Native Solutions

October 16, 2019

Traditional TV players are facing huge challenges from the rapid growth of emerging video services such as Netflix, Amazon Prime and YouTube TV. TV service providers must modernize and accelerate the …