Measuring chaos: Chaos engineering and team health
Once restricted to companies like Netflix, chaos engineering is becoming a common practice in organizations of all sizes. Paul Osman outlines the techniques Under Armour uses to measure service health with chaos engineering. He details the company's operational maturity model and how Under Armour uses it to blamelessly identify both the teams that need additional help and the action items that would improve resiliency and happiness.
Talk Title | Measuring chaos: Chaos engineering and team health |
Speakers | Paul Osman (Under Armour Connected Fitness) |
Conference | O’Reilly Velocity Conference |
Conf Tag | Build systems that drive business |
Location | Berlin, Germany |
Date | November 5-7, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Chaos engineering is exploding in popularity. Once restricted to companies like Netflix, it’s becoming a common practice in organizations of all sizes. A number of great talks have delivered techniques for introducing your organization to chaos engineering, but without effective methods for measuring impact, organizations can fall victim to resiliency theater. The results are predictable: adoption struggles, teams feel burned out, and chaos engineering feels like a chore.

Paul Osman details how Under Armour measures the impact of chaos engineering. He walks you through a service maturity model the company created and how it uses game days to evaluate services against that model. You’ll see how Under Armour uses this data to create an overall view of team health, blamelessly identifying teams that are overburdened and need additional help from its infrastructure and SRE teams to get back on track. He also walks you through how the company uses health report cards to create visibility and shared accountability for team health and psychological safety.

Under Armour’s reliability engineering team is taking the lead on moving the company's culture from firefighting (reactive) to building inspection (proactive). Paul explains how Under Armour tracks incidents through five stages (ending with continuous chaos experiments) and uses incident data to prioritize proactive reliability work.