December 13, 2019


Measuring and Optimizing Kubeflow Clusters at Lyft


Talk Title: Measuring and Optimizing Kubeflow Clusters at Lyft
Speakers: Richard Liu (Senior Software Engineer, Google), Konstantin Gizdarski (Software Engineer, Lyft)
Conference: KubeCon + CloudNativeCon North America
Location: San Diego, CA, USA
Date: Nov 15-21, 2019
URL: Talk Page
Slides: Talk Slides

Machine learning workloads are often resource-intensive. As companies adopt more of them, tracking resource consumption and optimizing spending become more challenging.

At Lyft, we developed a system that scrapes metrics from Kubernetes clusters and persists them in data warehouses. We then built a pipeline that transforms these snapshots into cluster utilization metrics along the dimensions of CPU, memory, and GPU. Finally, we join these metrics into our cost and usage dataset so that teams can budget resources accordingly and reduce spending.

In this talk, we will give an overview of Infraspend, our infrastructure for tracking Kubernetes usage. Attendees will learn how the data we collected helped Lyft reduce spending on Kubeflow clusters. The audience will also gain insight into how Kubernetes clusters can be optimized without compromising performance or stability.
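
The pipeline described above (snapshots in, per-dimension utilization out, then a join against cost data) can be illustrated with a minimal Python sketch. This is not Infraspend's actual implementation, which the abstract does not describe. The `PodSnapshot` record, its fields, the per-namespace grouping, and the `utilization_by_namespace` and `attribute_cost` helpers are all hypothetical names chosen for illustration, and the cost model (a fixed cost per resource dimension, split by share of capacity) is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical resource dimensions, mirroring the CPU / memory / GPU split.
DIMENSIONS = ("cpu_cores", "memory_gib", "gpus")


@dataclass(frozen=True)
class PodSnapshot:
    """One scraped sample of a pod's resource figures (assumed schema)."""
    namespace: str
    cpu_cores: float
    memory_gib: float
    gpus: float


def utilization_by_namespace(snapshots, capacity):
    """Sum resources per namespace and divide by cluster capacity.

    `capacity` maps each name in DIMENSIONS to the cluster-wide total.
    Returns {namespace: {dimension: fraction_of_capacity}}.
    """
    totals = defaultdict(lambda: dict.fromkeys(DIMENSIONS, 0.0))
    for snap in snapshots:
        for dim in DIMENSIONS:
            totals[snap.namespace][dim] += getattr(snap, dim)
    return {
        ns: {
            dim: (used / capacity[dim]) if capacity[dim] else 0.0
            for dim, used in dims.items()
        }
        for ns, dims in totals.items()
    }


def attribute_cost(utilization, cost_by_dimension):
    """Join utilization with cost: charge each namespace its share per dimension."""
    return {
        ns: sum(share[dim] * cost_by_dimension[dim] for dim in DIMENSIONS)
        for ns, share in utilization.items()
    }


# Illustrative inputs only; none of these numbers come from the talk.
snapshots = [
    PodSnapshot("team-a", 2.0, 4.0, 1),
    PodSnapshot("team-a", 2.0, 4.0, 0),
    PodSnapshot("team-b", 8.0, 8.0, 2),
]
capacity = {"cpu_cores": 16.0, "memory_gib": 32.0, "gpus": 4.0}
cost_by_dimension = {"cpu_cores": 100.0, "memory_gib": 40.0, "gpus": 400.0}

utilization = utilization_by_namespace(snapshots, capacity)
costs = attribute_cost(utilization, cost_by_dimension)
```

Any capacity that no namespace accounts for is left unattributed in this sketch. A real cost dataset would need to decide how to charge for that idle headroom, which is where the optimization opportunities mentioned in the abstract would likely show up.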
