December 23, 2019

Lightning Talk: How the Observability Team at Spotify Radically Decreased On-Call Alerts

Talk Title: Lightning Talk: How the Observability Team at Spotify Radically Decreased On-Call Alerts
Speakers: Lauren Muhlhauser (Site Reliability Engineer, Spotify)
Conference: KubeCon + CloudNativeCon North America
Location: San Diego, CA, USA
Date: Nov 15-21, 2019

The Reliability team at Spotify took over the monitoring stack and decreased incident pages by 42% within 6 months. At first, they were devoting all their time to managing on-call alerts and tech debt. Now, on-call alerts are manageable and infrequent, and the team is on a path to using entirely open source products. The stack was developed years earlier, when few well-developed open source solutions were available. Lauren describes how migrations to new tools (Grafana and Prometheus) decreased the team's backlog and on-call pages. She also covers the improvements the team made to their own open source products (Heroic and FFWD) and why they chose to continue using and maintaining them. Lastly, she discusses a new tool that the team will be repurposing and open sourcing in the near future.
