January 15, 2020


Want to solve overmonitoring and alert fatigue? Create the right incentives


Keeping your signal-to-noise ratio high is a nontrivial problem. Modern tools make it easy to overmonitor (which leads to noise). The result? Missed alarms and unhappy customers. Filtering the noise is not the answer. Kishore Jalleda explains how Yahoo reduced the alert volume from ~200K a month to a few hundred by creating the right incentives and culture.

Talk Title: Want to solve overmonitoring and alert fatigue? Create the right incentives
Speakers: Kishore Jalleda (Yahoo)
Conference: O’Reilly Velocity Conference
Conf Tag: Build Resilient Distributed Systems
Location: London, United Kingdom
Date: October 18-20, 2017

Telemetry monitors and their constant beeping are a common sight in hospitals. But when nurses work among constantly beeping monitors, they may start to ignore the alarms. Being paged while on call is a similar experience, although failing to act on a hospital alarm can have far more serious consequences than missing a page that says your website or service is down.

When Kishore started at Yahoo, more than 200,000 alerts were triggered per month, most of which went to dashboards monitored by large (50+ person) SRE teams in the US, Bangalore, and the Philippines. Service quality suffered. Outages occurred almost every day. SRE response times were on the order of hours. SRE credibility was almost lost. In response, dev teams started to staff their own ops teams or have devs do ops work (in addition to feature development).

Kishore outlines his solution, the Clean Room initiative, and shares the people, process, and software and tools changes that made it a success. Think of the Clean Room as a virtual room that any dev team can enter to use the provided services (mainly monitoring). The catch is that you are only allowed to enter if you are below certain alert counts in a given time period (usually a week), and once you are in, you must remain “clean,” meaning you must stick to your alert budget for the week. If at any point you go over budget, you are kicked out of the Clean Room: no more SRE support, although you are free to staff up your own SRE team or have your devs perform that function.

The corollary is that if you are a dev team living within your means (figuratively speaking), you get high-quality SRE service with ultralow response times. The idea behind the initiative was to create the right incentives so that people do the right thing.
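The entry and eviction rule described above can be sketched in a few lines of Python. This is only an illustration of the weekly budget check, not Yahoo's actual tooling: the budget value, the `Team` class, and the `review_week` function are all hypothetical, and the abstract does not say what the real thresholds were or how re-entry was handled. The sketch assumes a team may re-enter as soon as it has a week within budget.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: the abstract gives no actual budget numbers.
WEEKLY_ALERT_BUDGET = 50


@dataclass
class Team:
    name: str
    in_clean_room: bool = False
    # (alert_count, event) for each reviewed week
    history: list = field(default_factory=list)


def review_week(team: Team, alerts_this_week: int,
                budget: int = WEEKLY_ALERT_BUDGET) -> str:
    """Apply the weekly rule and return what happened to the team.

    Assumption: "within budget" means alerts <= budget, and a team outside
    the Clean Room is admitted after any single week within budget.
    """
    within_budget = alerts_this_week <= budget
    if team.in_clean_room and not within_budget:
        event = "evicted"      # over budget: SRE support is withdrawn
    elif not team.in_clean_room and within_budget:
        event = "admitted"     # under budget: team may enter
    else:
        event = "unchanged"
    team.in_clean_room = within_budget
    team.history.append((alerts_this_week, event))
    return event


# Example: a noisy team cleans up, enters, then slips over budget.
team = Team("search")
events = [review_week(team, count) for count in (400, 45, 30, 75)]
print(events)
```

The point of the design is that membership depends only on the team's own alert volume, so the incentive to cut noisy alerts sits with the team that creates them.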
