December 23, 2019

Lightning Talk: How the Observability Team at Spotify Radically Decreased On-Call Alerts

Talk Title	Lightning Talk: How the Observability Team at Spotify Radically Decreased On-Call Alerts
Speakers	Lauren Muhlhauser (Site Reliability Engineer, Spotify)
Conference	KubeCon + CloudNativeCon North America
Location	San Diego, CA, USA
Date	Nov 15-21, 2019
URL	Talk Page
Slides	Talk Slides

The Reliability team at Spotify took over the monitoring stack and decreased incident pages by 42% within 6 months. At first, they were devoting all their time to managing on-call alerts and tech debt. Now, on-call alerts are manageable and infrequent, and the team is on a path to using entirely open source products. The stack was developed years earlier, when few well-developed open source solutions were available. Lauren describes how migrations to new tools (Grafana and Prometheus) decreased their backlog and on-call pages. She will also cover the improvements the team made to their own open source products (Heroic and FFWD) and why they chose to continue using and maintaining them. Lastly, she will discuss a new tool that the team will be repurposing and open sourcing in the near future.

grafana reliability open source prometheus monitoring incident

From New Cluster to Insight. Deploying Monitoring and Logging to Kubernetes

November 4, 2019

The question that most people ask after spinning up their first Kubernetes cluster is "how do I do monitoring and logging". In this session we'll utilize open source tools like Prometheus, Helm, Graf …

Enable Serverless Metrics in Apache OpenWhisk on Kubernetes with Prometheus

October 6, 2019

Serverless functions are event-triggered, stateless and ephemeral, which makes metrics essential to a Serverless platform. Both system metrics and user metrics are helpful for operators and developers …

Building a Database as a Service on Kubernetes

December 15, 2019

Stateful, scalable storage on Kubernetes is an unsolved problem. Creating it as a service is even more difficult. The cloud-native ecosystem offers many tools such as the operator-sdk, Prometheus, Gra …

Introducing Metal³: Kubernetes Native Bare Metal Host Management

November 30, 2019

Metal³ (metal kubed) is a new open source bare metal host provisioning tool created to enable Kubernetes-native infrastructure management. Metal³ enables the management of bare metal hosts via custo …

What WePay Learned From Processing Billions of Dollars on GKE Using Linkerd

November 11, 2019

WePay processes billions of dollars worth of payments each year. As the number of services that process payment requests grows in WePay's infrastructure, so does the challenge of monitoring, debugging, …

M3 and Prometheus, Monitoring at Planet Scale for Everyone

November 6, 2019

For the past few years Prometheus has solved the monitoring needs of many and it is exceptional at what it does. Prometheus has exploded in popularity and now many wish to store more metrics, at longe …