The problem with preaggregated metrics
Preaggregated metrics and time series form the backbone of many monitoring setups. They have many redeeming qualities but simply aren't sufficient for capturing or responding to the many ways things can go wrong in modern or complex systems. Christine Yen outlines the problems inherent in the use and implementation of preaggregated metrics.
Talk Title | The problem with preaggregated metrics |
Speakers | Christine Yen (Honeycomb) |
Conference | O’Reilly Velocity Conference |
Conf Tag | Build Resilient Distributed Systems |
Location | San Jose, California |
Date | June 20-22, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Preaggregated metrics and time series form the backbone of many monitoring setups. They have many redeeming qualities but simply aren’t sufficient for capturing or responding to the many ways things can go wrong in modern or complex systems. Preaggregating a small set of metrics is a perfectly reasonable technique for top-level KPIs but not for the day-to-day operations and debugging work done by your engineers on the front lines: it forces them to predict which metrics will be interesting at some point in the future and hobbles their ability to react quickly to unexpected factors.

Christine Yen outlines the problems inherent in the use and implementation of preaggregated metrics and covers the implementation details involved in building an RRD (round-robin database, the basis of many preaggregated metrics systems), highlighting another axis along which data is constrained. Contiguous time series stored on disk are speedy to read and easy to conceptualize, but they are at risk of a combinatorial explosion of inputs blowing up the underlying storage.

Along the way, Christine stresses the importance of context. Relying on individual metrics and segments is like trying to extrapolate a 3D model of a room from hundreds of one-dimensional data points. When exploring a dataset, it’s crucial to be able to easily understand and visualize the interplay between the various attributes and measurements of a system event, but these one-dimensional metrics rob your engineers of this ability.
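To make the storage argument concrete, here is a minimal sketch of a round-robin series and of how the number of series grows with the attributes you segment by. It is an illustration under stated assumptions, not the implementation covered in the talk and not the on-disk format of any real RRD tool: the class `RoundRobinSeries`, the function `series_count`, and the tag names and cardinalities are all hypothetical.

```python
from math import prod


class RoundRobinSeries:
    """Fixed-size ring of per-interval sums (hypothetical sketch, not a real RRD format)."""

    def __init__(self, slots, step_seconds):
        self.slots = slots
        self.step = step_seconds
        self.values = [None] * slots  # one aggregate per time bucket
        self.stamps = [None] * slots  # which bucket each slot currently holds

    def record(self, timestamp, value):
        bucket = int(timestamp // self.step)
        i = bucket % self.slots
        if self.stamps[i] != bucket:
            # The slot still holds an older bucket: overwrite it in place.
            self.stamps[i] = bucket
            self.values[i] = 0.0
        # Only the running sum survives; the raw event and its context are gone.
        self.values[i] += value

    def read(self, timestamp):
        bucket = int(timestamp // self.step)
        i = bucket % self.slots
        return self.values[i] if self.stamps[i] == bucket else None


def series_count(tag_cardinalities):
    """One series is needed per combination of tag values, so the counts multiply."""
    return prod(tag_cardinalities.values())


rrd = RoundRobinSeries(slots=3, step_seconds=60)
rrd.record(0, 1)
rrd.record(30, 2)    # same 60-second bucket as the first event
rrd.record(180, 5)   # wraps around and overwrites the oldest bucket

# Assumed cardinalities, chosen only for illustration.
tags = {"endpoint": 50, "status": 5, "host": 200}
total_series = series_count(tags)  # 50 * 5 * 200 = 50000
```

Each series is cheap and fast to read because its size is fixed up front, but every extra attribute multiplies the number of series to allocate. The ring also keeps only the aggregate per interval, so a question that was not anticipated when the series were defined cannot be answered from the stored data.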