Move Fast, Unbreak Things!
Talk Title | Move Fast, Unbreak Things! |
Speakers | Petr Lapukhov |
Conference | NANOG66 |
Location | San Diego, California |
Date | Feb 8 2016 - Feb 10 2016 |
URL | Talk Page |
Slides | Talk Slides |
Video | Talk Video |
Every network fails, and large networks fail more often. Often the fault is clearly visible, but every now and then a failure goes undetected by traditional monitoring systems (that is, link-down alarms or packet drop and error counters). This talk summarizes Facebook’s experience of building a “black-box” fault detection and isolation system for data-center and backbone networks. The heart of the system is a high-rate active probing component that detects failures regardless of their underlying cause. A prominent aspect of the system is its focus on real-time detection, which allows practical reaction times of 10 to 20 seconds. We argue that this is likely one key feature that made the system practical and useful to operations. In retrospect, we review the system’s evolution through multiple iterations and compare the different kinds of problems that arise in the data-center, backbone, and edge segments of the network. Finally, we discuss the challenges specific to fault isolation and present our current approach, as well as our vision for its future evolution.
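To make the idea of black-box detection from active probes concrete, here is a minimal sketch of how probe results might be turned into alarms within a window of roughly 10 seconds. It is not the system described in the talk: the class name `LossDetector`, the 10-second window, the 5% loss threshold, and the minimum probe count are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class LossDetector:
    """Hypothetical sliding-window loss detector fed by active probe results."""

    def __init__(self, window_s=10.0, loss_threshold=0.05, min_probes=20):
        # All three parameters are illustrative assumptions, not values from the talk.
        self.window_s = window_s
        self.loss_threshold = loss_threshold
        self.min_probes = min_probes
        self._results = defaultdict(deque)  # target -> deque of (timestamp, ok)

    def record(self, target, timestamp, ok):
        """Store one probe outcome and evict results older than the window."""
        results = self._results[target]
        results.append((timestamp, ok))
        while results and results[0][0] <= timestamp - self.window_s:
            results.popleft()

    def loss_ratio(self, target):
        """Return the fraction of lost probes in the window, or None if empty."""
        results = self._results[target]
        if not results:
            return None
        lost = sum(1 for _, ok in results if not ok)
        return lost / len(results)

    def failing_targets(self):
        """Return targets with enough samples whose loss meets the threshold."""
        failing = []
        for target, results in self._results.items():
            if len(results) < self.min_probes:
                continue  # too few probes to judge
            if self.loss_ratio(target) >= self.loss_threshold:
                failing.append(target)
        return sorted(failing)


if __name__ == "__main__":
    detector = LossDetector()
    # Simulate 10 probes per second for 10 seconds against two made-up targets.
    for i in range(100):
        t = i * 0.1
        detector.record("rack-a", t, True)         # healthy path
        detector.record("rack-b", t, i % 4 != 0)   # drops every fourth probe
    print(detector.failing_targets())
```

The detector only looks at whether probes come back, so it flags a target without knowing why packets are lost, which is the sense in which such detection is "black-box". Isolating the faulty device or link is a separate problem that this sketch does not attempt.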