Measuring and Optimizing Kubeflow Clusters at Lyft
Machine learning workloads are often resource-intensive operations. As companies adopt more of these workloads, tracking resource consumption and optimizing spending becomes more challenging.At Lyft, …
Talk Title | Measuring and Optimizing Kubeflow Clusters at Lyft |
Speakers | Richard Liu (Senior Software Engineer, Google), Konstantin Gizdarski (Software Engineer, Lyft) |
Conference | KubeCon + CloudNativeCon North America |
Conf Tag | |
Location | San Diego, CA, USA |
Date | Nov 15-21, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Machine learning workloads are often resource-intensive operations. As companies adopt more of these workloads, tracking resource consumption and optimizing spending becomes more challenging.At Lyft, we developed a system which scrapes metrics from Kubernetes clusters and persists them in data warehouses. We then built a pipeline that transforms snapshots into cluster utilization metrics along the dimensions of CPU, memory, and GPU. Finally we join these metrics into our cost and usage dataset, so teams can budget resources accordingly and reduce spending.In this talk, we will give an overview of Infraspend - our infrastructure for tracking Kubernetes usage. Attendees will learn how the data we collected helped Lyft reduce spending for Kubeflow clusters. The audience will also gain insights into how Kubernetes clusters can be optimized without performance or stability compromises.