January 8, 2020

Debugging complex systems

Terran Melconian explores an organized process for observing a misbehaving complex system, reasoning about possible causes, and isolating the fault. While it is not generally taught, all the successful senior engineers with operational experience Terran has talked to use a variant of this process.

Talk Title: Debugging complex systems
Speakers: Terran Melconian (Air Network Simulation and Analysis)
Conference: O’Reilly Velocity Conference
Conf Tag: Build resilient systems at scale
Location: New York, New York
Date: October 2-4, 2017

Skills for diagnosing failures in complex, interacting systems are critically important but rarely taught. Even people with experience and expertise can struggle to articulate how they do what they do well enough to pass the knowledge on. Drawing on his own experience carrying the pager and on time spent observing and teaching others, Terran Melconian has distilled an explicit, teachable process for efficiently isolating faults. Starting with the observed symptom (for example, a page), Terran demonstrates how to draw a diagram of possible causes that bifurcates the search space and how to collect new observations to decide which path to follow at each fork. He also describes a very common anti-pattern (observe a symptom, hypothesize a cause for the fault, then write and deploy code to address that cause), explains why it often fails to produce effective results, and outlines what to do instead.
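
One way to picture the fork-by-fork search described above is as a binary tree: each fork asks a question that splits the remaining possible causes in two, and a fresh observation picks the branch to follow. The following is only a minimal Python sketch of that idea, not the process as presented in the talk. The `Fork` class, the `isolate` function, and the example symptom, questions, and observations are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Fork:
    """A node in a hypothetical cause diagram.

    A fork with an `observe` callable is a question that splits the search
    space; a fork without one is a leaf naming a candidate cause.
    """
    label: str
    observe: Optional[Callable[[], bool]] = None  # True means follow `if_yes`
    if_yes: Optional["Fork"] = None
    if_no: Optional["Fork"] = None


def isolate(root: Fork) -> List[str]:
    """Walk from the symptom to a leaf, taking one observation per fork."""
    path: List[str] = []
    node = root
    while node.observe is not None:
        answer = node.observe()
        path.append(f"{node.label} -> {'yes' if answer else 'no'}")
        next_node = node.if_yes if answer else node.if_no
        if next_node is None:
            raise ValueError(f"fork {node.label!r} is missing a branch")
        node = next_node
    path.append(f"isolated: {node.label}")
    return path


# Made-up observations standing in for what an engineer would look up
# (dashboards, logs, deploy history) after being paged.
observations = {
    "errors_on_all_hosts": False,
    "affected_host_recently_deployed": True,
}

diagram = Fork(
    "Is every host affected?",
    observe=lambda: observations["errors_on_all_hosts"],
    if_yes=Fork("Shared dependency (for example, the database)"),
    if_no=Fork(
        "Was the affected host deployed recently?",
        observe=lambda: observations["affected_host_recently_deployed"],
        if_yes=Fork("Bad build on that host"),
        if_no=Fork("Host-local fault (hardware, disk, network)"),
    ),
)

path = isolate(diagram)
for step in path:
    print(step)
```

Each observation here rules out one whole branch before any fix is written, which is the contrast with the anti-pattern of guessing a cause and deploying code straight away.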
