December 3, 2019

How to Include Latency in SLO-based Alerting

Talk Title: How to Include Latency in SLO-based Alerting
Speakers: Björn Rabenstein (Engineer, Grafana Labs)
Conference: KubeCon + CloudNativeCon North America
Location: San Diego, CA, USA
Date: Nov 15-21, 2019

Chapter 5 of “The Site Reliability Workbook” is an excellent study of how to create meaningful alerts based on SLOs by measuring the rate at which the error budget is burned over different time windows. This rather complex approach is blissfully straightforward to implement in Prometheus, as demonstrated in the chapter itself. However, all of it is based on error rates, leaving latency concerns out of scope. Björn “Beorn” Rabenstein will explore various options for applying the same ideas to latency-based SLOs. The foundation is a precise and meaningful definition of the SLO. From there, Beorn will examine various techniques for translating the SLO into an error budget and for measuring its burn rate with Prometheus. Once that is done, creating error-budget-based alerts is relatively simple. There are, however, pitfalls and trade-offs along the way, which Beorn will help the audience cope with.
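
To make the burn-rate idea concrete, here is a minimal sketch of one possible way to treat latency like an error rate. It is not the method presented in the talk. It assumes a hypothetical SLO of the form "99% of requests are served within some latency threshold", counts every slower request as consuming error budget, and reuses the 14.4 burn-rate threshold that the Workbook chapter uses for its 1-hour/5-minute window pair on a 30-day SLO. The function names and the request counts are made up for illustration; in Prometheus the counts would typically come from histogram buckets rather than plain integers.

```python
def burn_rate(slow_requests: int, total_requests: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed in one window.

    A burn rate of 1.0 means the budget would be used up exactly at the
    end of the SLO period; higher values mean it runs out sooner.
    """
    if total_requests == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target              # allowed fraction of slow requests
    slow_fraction = slow_requests / total_requests
    return slow_fraction / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Multi-window check: fire only if both windows exceed the threshold.

    Each window is a (slow_requests, total_requests) tuple. The 14.4
    default is an assumption borrowed from the Workbook's example.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical counts: 20% of requests were too slow in both windows.
    print(should_page((2000, 10000), (200, 1000), slo_target=0.99))
    # The long window is still bad, but the short window has recovered.
    print(should_page((2000, 10000), (5, 1000), slo_target=0.99))
```

The short window is there so that the alert stops firing soon after latency recovers, even while the long window still contains the slow requests.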
