Increasing visibility of distributed systems in production

Understanding the state of a running application is the key to efficiently troubleshooting production issues and ultimately anticipating outages. Pierre Vincent demonstrates how to make monitoring an integral part of development, using health checks, metrics, tracing, and other patterns to get a clearer picture of applications in production.


Talk Title	Increasing visibility of distributed systems in production
Speakers	Pierre Vincent (Poppulo)
Conference	O’Reilly Velocity Conference
Conf Tag	Build Resilient Distributed Systems
Location	London, United Kingdom
Date	October 18-20, 2017
URL	Talk Page
Slides	Talk Slides

Understanding the running state of an application is the key to efficiently troubleshooting production issues and ultimately anticipating outages. When systems grow larger and become distributed, the visibility of application health needs to become a first-class concern: as the likelihood of something going wrong increases, the focus shifts from increasing mean time between failures to reducing mean time to recovery. The best way to achieve this consistently is to make monitoring an integral part of product development rather than an afterthought.

Pierre Vincent demonstrates how to build in monitoring, using health checks, metrics, tracing, and other patterns to get a clearer picture of applications in production. Monitoring can start simple, with basic telemetry such as health checks, which increase visibility into the system's status. Exposing more advanced metrics gives more detail on how the system is working at the system level (e.g., resource usage), the application level (e.g., response times), and the business level (e.g., completed sales). These health checks and metrics can then be used to trigger alerts when observed values fall outside expected thresholds.

Pierre offers an overview of monitoring patterns and tools that will help you build a fuller picture of a running application.
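The progression described in the abstract (health checks first, then metrics at several levels, then alerts on thresholds) can be illustrated with a small sketch. This is not code from the talk. It is a minimal in-process example using only the Python standard library, and every name in it (`Telemetry`, `Threshold`, `evaluate_alerts`, the metric names, and the threshold values) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Telemetry:
    """Hypothetical in-process registry for health checks and metrics."""
    checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)

    def health(self) -> dict:
        """Run every registered check; a check that raises counts as failed."""
        results = {}
        for name, check in self.checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False
        status = "UP" if all(results.values()) else "DOWN"
        return {"status": status, "checks": results}

    def increment(self, name: str, amount: float = 1.0) -> None:
        """Add to a counter-style metric, creating it at zero if missing."""
        self.metrics[name] = self.metrics.get(name, 0.0) + amount

    def set_gauge(self, name: str, value: float) -> None:
        """Record the latest value of a gauge-style metric."""
        self.metrics[name] = value


@dataclass
class Threshold:
    """Upper bound for a metric; the values used below are assumptions."""
    metric: str
    maximum: float


def evaluate_alerts(telemetry: Telemetry, thresholds: List[Threshold]) -> List[str]:
    """Return one message per metric whose value exceeds its threshold."""
    alerts = []
    for t in thresholds:
        value = telemetry.metrics.get(t.metric)
        if value is not None and value > t.maximum:
            alerts.append(f"{t.metric}={value} exceeds {t.maximum}")
    return alerts


if __name__ == "__main__":
    telemetry = Telemetry()
    # Health check: a stand-in for pinging a real dependency.
    telemetry.checks["database"] = lambda: True
    # System-, application-, and business-level metrics.
    telemetry.set_gauge("system.memory_used_ratio", 0.93)
    telemetry.set_gauge("app.response_time_ms", 120.0)
    telemetry.increment("business.completed_sales")

    print(telemetry.health())
    print(evaluate_alerts(telemetry, [
        Threshold("system.memory_used_ratio", 0.90),
        Threshold("app.response_time_ms", 500.0),
    ]))
```

In a real deployment the health result would typically be served over an HTTP endpoint and the metrics exported to a dedicated monitoring system, which would also own the alerting rules; the sketch keeps everything in one process only to show how the three pieces relate.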