December 26, 2019

Move Fast, Unbreak Things!

Talk Title: Move Fast, Unbreak Things!
Speakers: Petr Lapukhov
Conference: NANOG66
Conf Tag:
Location: San Diego, California
Date: Feb 8, 2016 - Feb 10, 2016
URL: Talk Page
Slides: Talk Slides
Video: Talk Video

Every network fails, and large networks fail more often. Often the issue is clearly visible, but every now and then a failure goes undetected by traditional monitoring systems (that is, link-down alarms or packet drop/error counters). This talk summarizes Facebook’s experience of building a “black-box” fault detection and isolation system for data-center and backbone networks. The heart of the system is a high-rate active probing component that detects failures regardless of the underlying cause. A prominent aspect of the system is its aim at real-time detection, allowing for practical reaction times of 10 to 20 seconds. We argue that this is likely one key feature that made the system practical and useful to operations. In retrospect, we review the system’s evolution, which went through multiple iterations, and compare the different kinds of problems that arise in the data-center, backbone, and edge segments of the network. Finally, we discuss the challenges specific to fault isolation and present our current approach, as well as our vision for future evolution.
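
To make the idea of active probing concrete, here is a minimal sketch of how probe results for a single path might be turned into a fault signal. It is not the system described in the talk: the class name `LossDetector`, the 10-second window, the 5% loss threshold, and the minimum probe count are all assumptions chosen for illustration, and probe outcomes are fed in by hand rather than measured on a real network.

```python
from collections import deque


class LossDetector:
    """Hypothetical sliding-window loss detector for one probed path.

    All parameter defaults are illustrative assumptions, not values
    taken from the talk.
    """

    def __init__(self, window_s=10.0, loss_threshold=0.05, min_probes=20):
        self.window_s = window_s              # how far back to look, in seconds
        self.loss_threshold = loss_threshold  # loss fraction that counts as a fault
        self.min_probes = min_probes          # avoid judging on too few samples
        self.samples = deque()                # (timestamp, probe_succeeded)

    def record(self, ts, ok):
        """Store one probe outcome and drop samples that fell out of the window."""
        self.samples.append((ts, ok))
        while self.samples and self.samples[0][0] <= ts - self.window_s:
            self.samples.popleft()

    def loss_rate(self):
        """Fraction of probes in the current window that were lost."""
        if not self.samples:
            return 0.0
        lost = sum(1 for _, ok in self.samples if not ok)
        return lost / len(self.samples)

    def is_faulty(self):
        """Flag the path once enough probes show loss above the threshold."""
        return (len(self.samples) >= self.min_probes
                and self.loss_rate() > self.loss_threshold)


if __name__ == "__main__":
    detector = LossDetector()
    # 10 probes per second for 10 seconds, all answered.
    for i in range(100):
        detector.record(i * 0.1, True)
    print("healthy path flagged:", detector.is_faulty())
    # Next 10 seconds: every fifth probe is lost, with no link-down alarm involved.
    for i in range(100, 200):
        detector.record(i * 0.1, i % 5 != 0)
    print("lossy path flagged:", detector.is_faulty())
```

The detector only looks at whether probes come back, so it is "black-box" in the sense used above: it reacts to loss whatever the underlying cause. The window length bounds the detection delay, which is why a window on the order of 10 seconds pairs naturally with a high probe rate.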
