November 28, 2019

A practical guide to monitoring and alerting with time series at scale

Monitoring only sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools have emerged that assist with scaling out monitoring configurations sublinearly with the size of the system. Jamie Wilkinson explores time series-based alerting and offers practical examples that can be employed in your environment today.

Talk Title A practical guide to monitoring and alerting with time series at scale
Speakers Jamie Wilkinson (Google)
Conference Velocity
Conf Tag Build resilient systems at scale
Location Santa Clara, California
Date June 21-23, 2016
URL Talk Page
Slides Talk Slides
Video

Monitoring is the bedrock of site reliability, yet it is the bane of most sysadmins’ lives. Why? Monitoring sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools like Riemann and Prometheus have emerged to address this problem by letting monitoring configuration grow sublinearly with the size of the system. In a talk complementing the Google SRE book chapter “Practical Alerting from Time Series Data,” Jamie Wilkinson explores the theory of alert design and time series-based alerting methods and offers practical examples in Prometheus that you can deploy in your environment today to reduce alert spam and help operators maintain healthy production hygiene.
