Learning from failure: Why a total site outage can be a good thing

Alex Elman explains how Indeed used a site-wide outage as an opportunity to build resilience, improve reliability, and make lasting improvements to the engineering culture.


Talk Title	Learning from failure: Why a total site outage can be a good thing
Speakers	Alex Elman (Indeed)
Conference	O’Reilly Velocity Conference
Conf Tag	Building and maintaining complex distributed systems
Location	San Jose, California
Date	June 11-13, 2019

Although an outage is a terrifying prospect, you should embrace it as an opportunity. Failure can expand and improve your understanding of your systems. Three years ago, Indeed suffered one of the worst outages in its history. No single fault or failure caused this outage. Rather, it was a complex interaction of bugs, design decisions, capacity loss, and poor situational awareness during incident response. Indeed learned valuable lessons from this event. It identified ways to make its systems more resilient and improved how its engineering culture approaches the incident lifecycle.

Alex Elman uses the narrative of this incident to demonstrate how a site-wide outage can lead to increased resilience and reduced operational complexity. Learning from failure is a feedback loop rather than a one-off process, and he uses Indeed’s outage as a practical example of what one iteration of this loop can look like. He shares with other SREs the success that has arisen from this failure: Indeed hasn’t had a global site outage in the three years since this event.

Alex begins with a discussion of failure to set the stage for the incident background, then discusses incident response and situational awareness. He explains how to conduct incident postmortems, learn from failure, and design for reliability, including resilience patterns such as circuit breaking and graceful degradation. Finally, he covers resilience testing, running chaos tests, and closing the feedback loop, leaving some time for a question and answer session.

The Ops in the Serverless

Lightning Talk: How the Observability Team at Spotify Radically Decreased On-Call Alerts

Keynote: Finding the Joy in Chaos Engineering

Keynote: Metrics, Logs & Traces; What Does the Future Hold for Observability?

Plan to Fail: A Good Captain Doesn’t Sail Without Life Rafts

Security precognition: A look at chaos engineering in security incident response