November 29, 2019

SLO burn

Jamie Wilkinson offers a brief overview of SLOs, shares a practical guide to implementing sustainable SLO-based alerting for systems of any size, and outlines the tooling required to supplement the system in the absence of cause-based alerting.


Talk Title	SLO burn
Speakers	Jamie Wilkinson (Google)
Conference	Velocity
Conf Tag	Build resilient systems at scale
Location	New York, New York
Date	September 20-22, 2016

As systems grow, they get more components and more ways to fail. The alerts of the last system's design can slowly "boil the frog," and all of a sudden the SRE team finds they have no time left to address scaling problems because they're constantly firefighting. Alert fatigue sets in, and the team burns out. Naturally, maintenance work will always increase as the system itself grows. To make alerting sustainable, page only on symptoms rather than causes, and even then only after declaring the acceptable threshold for each symptom, also known as the SLO (and its complement, the error budget). Even at Google scale, many teams have yet to change their monitoring to realize SLO-based alerts. But systems don't need to be the size of a planet to benefit from these patterns. Jamie Wilkinson offers a brief overview of SLOs and shares a practical guide to implementing sustainable SLO-based alerting for systems of any size. Whether you're on call for 10 machines or 10 data centers, you'll find something of value, as Jamie (a well-rested champion of work-life balance) demonstrates how to select service objectives and construct robust, low-maintenance alerting rules, using Prometheus for a live demonstration. You'll also explore the tooling required to keep such a system observable in the absence of noisy cause-based alerts, now that they're not telling you exactly which components are failing.
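The relationship between an SLO, its error budget, and a symptom-based page can be sketched in a few lines of Python. This is only an illustration and is not taken from the talk, which demonstrates its rules in Prometheus. The function names, the request counts, and the paging threshold of 1.0 are all assumptions chosen for the example.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 means the budget is being spent exactly as fast as
    the SLO allows; higher values mean it will run out early.
    """
    if total == 0:
        return 0.0  # no traffic, so no budget is being spent
    return (errors / total) / error_budget(slo_target)


def should_page(errors: int, total: int, slo_target: float,
                threshold: float = 1.0) -> bool:
    """Page only on the symptom: the budget burning faster than allowed.

    The threshold is a hypothetical default, not a recommended value.
    """
    return burn_rate(errors, total, slo_target) > threshold


if __name__ == "__main__":
    # Hypothetical window: 5 failed requests out of 1000 against a 99.9% SLO.
    print(burn_rate(5, 1000, 0.999))
    print(should_page(5, 1000, 0.999))
```

Nothing in the sketch refers to which component failed. The alert depends only on the user-visible error ratio and the declared objective, which is why separate tooling is needed to locate the cause.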

google guide prometheus data center monitoring sre

A practical guide to monitoring and alerting with time series at scale

November 28, 2019

Monitoring only sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools have emerged that assist with scaling out monitoring configurations sublinearly with the size of the system. Jamie Wilkinson explores time series-based alerting and offers practical examples that can be employed in your environment today.

Tracing polyglot systems: An OpenTracing tutorial

November 28, 2019

Priyanka Sharma and Yuri Shkuro demonstrate how distributed tracing works and how to employ it in the development and operations of your applications in the programming language of your choice: Java, Go, Python, Node.js, C#, or C++.

Zero to Kubernetes in five minutes (sponsored by Mesosphere)

November 28, 2019

Getting Kubernetes up and running is only half the battle. Now you need to get the supporting infrastructure in place. Dan Mennell shares a templated approach to deploying what is needed to get started with source control, CI/CD, and monitoring with Prometheus, along with other things.

Running microservice environments is no free lunch

November 23, 2019

Migrating toward microservices tends to result in a 20x larger environment than monolithic counterparts. While the bright side of microservices and their enabling container platforms is high availability and scalability, what about the dark side, the side that nobody talks about in their presentations? Alois Mayr and Alexander Ramos uncover the truth so you don't have to learn it the hard way.

How we built an election report-casting app for the 2015 Nigeria general elections (with little experience building mobile apps, using agile scrum methods for the first time)

October 30, 2019

Bulama Yusuf explains how he and his team introduced agile methodologies to build a mobile app with a cloud-based backend at an organization that previously used the waterfall method of software development (and had never built a mobile app before). Bulama outlines the challenges the team faced and the lessons they learned along the way.

Small-scale engineering

November 29, 2019

Effie Mouzeli explains why small-scale engineering is just as challenging as large-scale engineering and offers ideas on how to survive technical debt, poor communication, and other everyday challenges.