Debugging complex systems
Terran Melconian explores an organized process for observing a misbehaving complex system, reasoning about possible causes, and isolating the fault. Although this process is rarely taught, every successful senior engineer with operational experience whom Terran has talked to uses some variant of it.
Talk Title | Debugging complex systems |
Speakers | Terran Melconian (Air Network Simulation and Analysis) |
Conference | O’Reilly Velocity Conference |
Conf Tag | Build resilient systems at scale |
Location | New York, New York |
Date | October 2-4, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Skills for diagnosing failures in complex, interacting systems are critically important but rarely taught. Even those with experience and expertise can struggle to articulate how they do what they do in order to pass the knowledge on. Drawing on his own experience carrying the pager and on time spent observing and teaching others, Terran Melconian has distilled an explicit, teachable process for efficiently isolating faults. Starting with the observed symptom (for example, a page), Terran demonstrates how to draw a diagram of possible causes that bifurcates the search space, and how to collect new observations to decide which path to follow at each fork. He also shares a very common anti-pattern (observe a symptom, hypothesize a cause for the fault, then write and deploy code to address that cause), explains why this approach often fails to produce effective results, and outlines what to do instead.
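The fork-by-fork search described above can be pictured as a walk down a binary tree, where each fork is a question answered by a fresh observation. The sketch below is only an illustration of that idea and is not taken from the talk: the names `Fork`, `Cause`, and `isolate`, the example questions, and the hard-coded `observations` are all hypothetical stand-ins for real checks such as reading dashboards or logs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Cause:
    """A leaf: the region of the search space that remains (hypothetical)."""
    description: str


@dataclass
class Fork:
    """A question whose answer splits the remaining causes in two (hypothetical)."""
    question: str
    observe: Callable[[], bool]  # collects a new observation
    if_true: "Node"
    if_false: "Node"


Node = Union[Fork, Cause]


def isolate(node: Node) -> Tuple[str, List[Tuple[str, bool]]]:
    """Follow one branch per observation until only one candidate cause is left."""
    path: List[Tuple[str, bool]] = []
    while isinstance(node, Fork):
        answer = node.observe()
        path.append((node.question, answer))
        node = node.if_true if answer else node.if_false
    return node.description, path


# Invented observations; in practice each would come from a real measurement.
observations = {
    "errors_on_all_servers": False,
    "bad_server_recently_deployed": True,
}

tree = Fork(
    question="Are errors seen on every server?",
    observe=lambda: observations["errors_on_all_servers"],
    if_true=Cause("shared dependency"),
    if_false=Fork(
        question="Was the affected server recently deployed to?",
        observe=lambda: observations["bad_server_recently_deployed"],
        if_true=Cause("bad release on one server"),
        if_false=Cause("host-level problem on one server"),
    ),
)

cause, path = isolate(tree)
print(cause)
for question, answer in path:
    print(f"{question} -> {answer}")
```

Recording the path alongside the result keeps the reasoning visible: each step is justified by an observation rather than by a guess, which is the contrast with the anti-pattern of deploying a fix for a hypothesized cause.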