Lightning Talk: How the Observability Team at Spotify Radically Decreased On-Call Alerts
Talk Title | Lightning Talk: How the Observability Team at Spotify Radically Decreased On-Call Alerts |
Speakers | Lauren Muhlhauser (Site Reliability Engineer, Spotify) |
Conference | KubeCon + CloudNativeCon North America |
Location | San Diego, CA, USA |
Date | Nov 15-21, 2019 |
URL | Talk Page |
Slides | Talk Slides |
The Reliability team at Spotify took over the monitoring stack and decreased incident pages by 42% within 6 months. At first, they were devoting all their time to managing on-call alerts and tech debt. Now, on-call alerts are manageable and infrequent, and the team is on a path to using entirely open source products.

This stack was developed years prior, when there were few well-developed open source solutions available. Lauren describes how migrations to new tools (Grafana and Prometheus) decreased their backlog and on-call pages. She also covers the improvements the team made to their own open source products (Heroic and FFWD) and why they chose to continue using and maintaining them. Lastly, she discusses a new tool that the team will be repurposing and open sourcing in the near future.