December 13, 2019

Measuring and Optimizing Kubeflow Clusters at Lyft


Talk Title	Measuring and Optimizing Kubeflow Clusters at Lyft
Speakers	Richard Liu (Senior Software Engineer, Google), Konstantin Gizdarski (Software Engineer, Lyft)
Conference	KubeCon + CloudNativeCon North America
Location	San Diego, CA, USA
Date	Nov 15-21, 2019
URL	Talk Page
Slides	Talk Slides

Machine learning workloads are often resource-intensive operations. As companies adopt more of these workloads, tracking resource consumption and optimizing spending becomes more challenging. At Lyft, we developed a system that scrapes metrics from Kubernetes clusters and persists them in data warehouses. We then built a pipeline that transforms snapshots into cluster utilization metrics along the dimensions of CPU, memory, and GPU. Finally, we join these metrics into our cost and usage dataset, so teams can budget resources accordingly and reduce spending. In this talk, we will give an overview of Infraspend, our infrastructure for tracking Kubernetes usage. Attendees will learn how the data we collected helped Lyft reduce spending for Kubeflow clusters. The audience will also gain insights into how Kubernetes clusters can be optimized without performance or stability compromises.
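The abstract does not show how snapshots become utilization metrics or how those metrics are joined to cost data. The following is only a minimal sketch of that general idea, not Infraspend's actual implementation. The `Snapshot` record, the function names, the per-team grouping, and the rule of splitting cost in proportion to the average requested share of capacity are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three dimensions named in the abstract.
RESOURCES = ("cpu", "memory", "gpu")


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one scrape of one cluster for one team."""
    cluster: str
    team: str
    requested: dict  # resource -> amount requested by the team's pods
    capacity: dict   # resource -> amount the cluster could allocate


def utilization_by_team(snapshots):
    """Average each team's requested share of capacity over all snapshots.

    Returns a dict keyed by (cluster, team, resource) with values in [0, 1]
    when requests never exceed capacity.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for snap in snapshots:
        for resource in RESOURCES:
            cap = snap.capacity.get(resource, 0)
            if cap <= 0:
                # Skip resources the cluster does not offer (e.g. no GPUs).
                continue
            key = (snap.cluster, snap.team, resource)
            sums[key] += snap.requested.get(resource, 0) / cap
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def attribute_cost(utilization, cluster_cost):
    """Join utilization with cost: (cluster, team) -> attributed dollars.

    cluster_cost maps (cluster, resource) to the dollars spent on that
    resource in that cluster over the same period as the snapshots.
    """
    attributed = defaultdict(float)
    for (cluster, team, resource), share in utilization.items():
        cost = cluster_cost.get((cluster, resource), 0.0)
        attributed[(cluster, team)] += share * cost
    return dict(attributed)
```

A caller would pass a list of `Snapshot` records for a billing period to `utilization_by_team`, then pass the result together with per-resource cluster costs to `attribute_cost`. Capacity that no team requested stays unattributed under this rule, which is one simple way to make idle spend visible.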

cluster metrics gpu dataset infrastructure data warehouse tracking machine learning performance pipeline kubernetes

Building and Managing a Centralized Kubeflow Platform at Spotify

December 8, 2019

Machine learning workflows within Spotify have been migrated to Kubernetes by adopting Kubeflow and Kubeflow Pipelines. It helps teams increase model development speed and reduce the time to productio …

Running High-performance User-space Packet Processing Apps in Kubernetes

November 24, 2019

With 5G on the horizon, networking is transforming around us. Network functions have already found their way from proprietary black boxes onto servers running Linux. The Linux networking stack simply …

Supercharge Kubeflow Performance on GPU Clusters

November 19, 2019

AI/ML applications on Kubernetes can be optimized for performance at many levels. This presentation provides an overview of the optimizations, such as distributed training on multiple GPUs with optima …

Running eBay's High-Performance Workloads with Kubernetes

October 25, 2019

In the past two years we've been rapidly expanding our k8s deployments by moving more and more production workloads into Kubernetes. We're now running multiple thousand-node k8s clusters fro …

Large Scale Distributed Deep Learning on Kubernetes Clusters

October 2, 2019

The focus of this talk is the deployment of large-scale distributed deep learning with Kubernetes. The use of operators to manage and automate training processes for machine learning is discussed. …

Keynote: Tencent: Kubernetes in the Billions

September 24, 2019

At Tencent, our business touches everything from gaming, social media, and payments to cloud computing. We'd like to share our story of how K8s is broadly used at Tencent, taking care of our infrastructu …