December 3, 2019


How to Include Latency in SLO-based Alerting



Talk Title	How to Include Latency in SLO-based Alerting
Speakers	Björn Rabenstein (Engineer, Grafana Labs)
Conference	KubeCon + CloudNativeCon North America
Location	San Diego, CA, USA
Date	Nov 15-21, 2019

Chapter 5 of “The Site Reliability Workbook” is an excellent study of how to create meaningful alerts based on SLOs by measuring the rate at which the error budget is burned over different time windows. This rather complex approach is blissfully straightforward to implement in Prometheus, as demonstrated in the chapter itself. However, all of it is based on error rates, which leaves latency out of scope. Björn “Beorn” Rabenstein will explore various options for applying the same ideas to latency-based SLOs. The foundation is a precise and meaningful definition of the SLO. From there, Beorn will explore various techniques to translate the SLO into an error budget and to measure its burn rate with Prometheus. Once that is done, creating error-budget-based alerts is relatively simple. There are, however, pitfalls and trade-offs along the way, which Beorn will help you navigate.
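
To make the idea of a latency error budget concrete, here is a minimal Python sketch of the arithmetic. It is not taken from the talk. It assumes an SLO of the form "a target fraction of requests completes within a latency threshold", and it assumes the request counts come from a histogram with a bucket boundary exactly at that threshold (for example, a Prometheus histogram bucket whose `le` label equals the threshold). The names `LatencySLO`, `slow_ratio`, `burn_rate`, and `should_page` are hypothetical. The default factor of 14.4 is the burn rate the Workbook chapter pairs with a 1-hour long window and a 5-minute short window for a 30-day SLO.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO: `target` fraction of requests finish within `threshold_s`."""
    target: float       # e.g. 0.99 means 99% of requests
    threshold_s: float  # e.g. 0.3 means 300 ms; must match a histogram bucket boundary

    @property
    def error_budget(self) -> float:
        # Fraction of requests that are allowed to be slower than the threshold.
        return 1.0 - self.target


def slow_ratio(total: int, within_threshold: int) -> float:
    """Fraction of requests in a window that exceeded the latency threshold."""
    if total == 0:
        return 0.0  # assumption: no traffic burns no budget
    return (total - within_threshold) / total


def burn_rate(slo: LatencySLO, total: int, within_threshold: int) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return slow_ratio(total, within_threshold) / slo.error_budget


def should_page(
    slo: LatencySLO,
    long_window: Tuple[int, int],   # (total, within_threshold) over the long window
    short_window: Tuple[int, int],  # same counts over the short window
    factor: float = 14.4,
) -> bool:
    """Multi-window check: both windows must exceed the burn-rate factor."""
    return (
        burn_rate(slo, *long_window) > factor
        and burn_rate(slo, *short_window) > factor
    )


if __name__ == "__main__":
    slo = LatencySLO(target=0.99, threshold_s=0.3)
    # 200 of 1000 requests were slow in the long window, 30 of 100 in the short one.
    print(should_page(slo, long_window=(1000, 800), short_window=(100, 70)))
```

Requiring the short window to agree with the long one keeps the alert from firing long after a latency problem has already ended. One trade-off of this formulation is that it only says whether a request was inside or outside the threshold, so the threshold has to coincide with a bucket boundary that is actually recorded.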



From New Cluster to Insight. Deploying Monitoring and Logging to Kubernetes


November 4, 2019

The question that most people ask after spinning up their first Kubernetes cluster is "how do I do monitoring and logging". In this session we'll utilize open source tools like Prometheus, Helm, Graf …

Managing Large-Scale Kubernetes Clusters Effectively and Reliably


September 28, 2019

As the business grows, we need to deploy Kubernetes into several data centers all around the world. There are more than ten thousand nodes in a single data center. The critical challenge we are f …

Intro: Prometheus


December 1, 2019

Prometheus is an open-source monitoring system and time series database. It features a multi-dimensional data model with a powerful query language and integrates many aspects of systems and service mo …

Introducing Metal³: Kubernetes Native Bare Metal Host Management


November 30, 2019

Metal³ (metal kubed) is a new open source bare metal host provisioning tool created to enable Kubernetes-native infrastructure management. Metal³ enables the management of bare metal hosts via custo …

Flyte: Cloud Native Machine Learning & Data Processing Platform


November 29, 2019

Flyte is the backbone for large-scale Machine Learning and Data Processing (ETL) pipelines at Lyft. It is used across business critical applications ranging from ETA, Pricing, Mapping, Autonomous, etc …

Thanos Deep Dive: Inside a Distributed Monitoring System


November 29, 2019

Thanos is an open-source CNCF Sandbox project that builds upon Prometheus components to create a global-scale highly available monitoring system. It seamlessly extends Prometheus in a few simple steps …