Network Telemetry at Yahoo!
Talk Title | Network Telemetry at Yahoo! |
Speakers | Matt Hudgins (Yahoo!), Varun Varma (Yahoo!) |
Conference | NANOG70 |
Conf Tag | |
Location | Bellevue, WA |
Date | Jun 5 2017 - Jun 7 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | Talk Video |
Providing 1 billion monthly active users with responsive, rich applications requires a large-scale network. Locked within processes running on network devices are valuable control and data plane metrics such as prefix usage, peer interface utilization, and routing session flaps. By making this data available to any number of subscribers, we enable Yahoo! engineers to create cost-saving data visualizations and anomaly detection software. This paper explains the challenges encountered and the architecture decisions made in building our real-time network telemetry stack, which currently polls millions of metrics from dozens of sites on five continents.

A key goal of our system is to minimize the effort required to poll a new device type or write a new consumer application. To accomplish this, we abstracted scale away from engineers looking to poll devices and consumption away from engineers looking to build consumer applications.

Our Python polling layer is built to be future-proof, modular, and horizontally scalable. We chose Python for its readability and community support. Python's open source community provides a ready-made plugin system called Yapsy. Polling plugins in our system are Yapsy plugins that specify how to get and clean data from a device before placing the results onto a Kafka bus. The platform then scales horizontally (unlike MRTG or Cacti) by scheduling the plugin through Celery, a Python distributed task queue. This yields many benefits, including the freedom to use the best polling method for a given device and the luxury of not having to worry about scaling a plugin. For instance, where vendors support a robust API, we use it; for API-deficient vendors, we poll by SNMP instead.

We also developed configuration-driven SNMP polling that allows us to define SNMP table relations in configuration rather than code. This approach eases the mental burden of cross-SNMP table correlations and allows us to poll new metrics without touching source code.
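
The plugin model described in the abstract (get data, clean it, place the results on a bus) can be illustrated with a minimal standard-library sketch. This is not the Yahoo! implementation and does not use the Yapsy, Celery, or Kafka APIs: the `PollingPlugin` interface, the `run_plugin` function, the canned device data, and the in-memory list standing in for a Kafka topic are all hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

Metric = Dict[str, Any]


class PollingPlugin(ABC):
    """Hypothetical plugin interface: one way to get data, one way to clean it."""

    name = "base"

    @abstractmethod
    def poll(self, device: str) -> List[Dict[str, Any]]:
        """Fetch raw rows from a device (vendor API, SNMP, ...)."""

    @abstractmethod
    def clean(self, device: str, raw: List[Dict[str, Any]]) -> List[Metric]:
        """Normalize raw rows into metrics ready for publishing."""


class InterfaceUtilizationPlugin(PollingPlugin):
    name = "interface_utilization"

    def poll(self, device: str) -> List[Dict[str, Any]]:
        # Assumption: canned data stands in for a real vendor API or SNMP call.
        return [
            {"ifName": "et-0/0/1 ", "in_bps": "1200"},
            {"ifName": "et-0/0/2", "in_bps": None},  # missing sample
        ]

    def clean(self, device: str, raw: List[Dict[str, Any]]) -> List[Metric]:
        metrics = []
        for row in raw:
            if row["in_bps"] is None:
                continue  # drop rows with no usable sample
            metrics.append({
                "device": device,
                "plugin": self.name,
                "interface": row["ifName"].strip(),
                "in_bps": int(row["in_bps"]),
            })
        return metrics


def run_plugin(plugin: PollingPlugin, device: str,
               publish: Callable[[Metric], None]) -> int:
    """The unit of work a task-queue worker could run for one device."""
    metrics = plugin.clean(device, plugin.poll(device))
    for metric in metrics:
        publish(metric)
    return len(metrics)


bus: List[Metric] = []  # in-memory stand-in for a message bus topic
published = run_plugin(InterfaceUtilizationPlugin(), "edge1.example", bus.append)
```

Because the plugin only knows how to poll and clean, the scheduling and publishing details stay outside it. In a design like the one the abstract describes, `run_plugin` would be the piece handed to a distributed task queue, which is what lets plugin authors ignore scaling.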
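
To make the idea of configuration-driven SNMP polling concrete, here is a small sketch of how a table relation might be declared as data and resolved by generic code. The configuration format, the `poll_from_config` function, and the `fake_walk` stand-in for an SNMP walk are assumptions made for this example, not the format used in the talk. The two OIDs are the standard IF-MIB columns `ifDescr` and `ifHCInOctets`, which share the same row index (`ifIndex`).

```python
from typing import Any, Callable, Dict, List

# Hypothetical configuration: field names mapped to SNMP column OIDs.
# Rows from every listed column are correlated by their shared row index.
METRIC_CONFIG = {
    "name": "interface_in_octets",
    "columns": {
        "description": "1.3.6.1.2.1.2.2.1.2",    # IF-MIB ifDescr
        "in_octets": "1.3.6.1.2.1.31.1.1.1.6",   # IF-MIB ifHCInOctets
    },
}


def fake_walk(oid: str) -> Dict[str, Any]:
    """Stand-in for an SNMP walk: maps row index -> value for one column."""
    canned = {
        "1.3.6.1.2.1.2.2.1.2": {"1": "et-0/0/1", "2": "et-0/0/2"},
        "1.3.6.1.2.1.31.1.1.1.6": {"1": 1000, "2": 2500, "3": 7},
    }
    return canned[oid]


def poll_from_config(config: Dict[str, Any],
                     walk: Callable[[str], Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Walk each configured column and join the results on the row index."""
    columns = {field: walk(oid) for field, oid in config["columns"].items()}
    # Inner join: keep only indexes that appear in every column.
    shared = set.intersection(*(set(col) for col in columns.values()))
    rows = []
    for index in sorted(shared, key=int):
        row = {"metric": config["name"], "index": index}
        for field, col in columns.items():
            row[field] = col[index]
        rows.append(row)
    return rows


rows = poll_from_config(METRIC_CONFIG, fake_walk)
```

Under this scheme, polling a new metric means adding another entry to the configuration; the join logic in `poll_from_config` stays unchanged.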