Performance debugging: Finding bottlenecks in distributed systems

Performance debugging is a crucial part of ensuring code is production ready, particularly as a company and its products grow. However, bottlenecks that hold these services back can be hard to identify. Christian Grabowski shares his experience debugging bottlenecks in distributed systems, at both a macro (metrics, distributed tracing) and a micro (user space and kernel space profiling) level.


Talk Title	Performance debugging: Finding bottlenecks in distributed systems
Speakers	Christian Grabowski (NS1)
Conference	O’Reilly Velocity Conference
Conf Tag	Building and maintaining complex distributed systems
Location	San Jose, California
Date	June 12-14, 2018

Whether a company is seeing rapid growth or has an existing large customer base, the performance of its software is crucial and can be impacted by a range of variables. These variables include how a company delivers applications to customers, how host machines run the software, and everything in between. Performance debugging is a crucial part of ensuring code is production ready, particularly as a company and its products grow. Removing the bottlenecks that prevent existing software from performing optimally can allow a business's system to scale and handle more usage. However, most of the battle in the debugging process is identifying the bottlenecks rather than fixing them. Skills such as tracing, monitoring, and profiling are invaluable in identifying these bottlenecks.

Christian Grabowski shares his experience debugging bottlenecks in distributed systems, at both a macro (metrics, distributed tracing) and a micro (user space and kernel space profiling) level, focusing particularly on tuning REST API services to handle databases that had doubled in size within a day and taming a resource-hungry, high-throughput metrics ingestion service.

In the macro view, the goal is to identify the bottleneck(s) of a distributed system. Which service is preventing higher throughput? Which service is adding latency? Which service is using all of the resources? Thankfully, there are many tools available to answer these questions, such as operational metrics and distributed tracing.

The micro view, on the other hand, examines where bottlenecks exist within the service itself. These can lie in blocks of code, in the balance of resources, or in the configuration of the service or machine. Newer technology is emerging to help identify these issues, such as dynamic tracing with tools like eBPF.

Join Christian to learn how to find and resolve these bottlenecks, making software scale and perform substantially better.
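
As an illustration of the macro-view question "which service is adding latency?", the following is a minimal sketch, not the speaker's method. It assumes trace spans have already been exported from some tracing system into a simple record with an ID, a parent ID, a service name, and start and end timestamps. The `Span` type, the service names, and the timings are all hypothetical. The sketch attributes to each service its "self time" (span duration minus time spent in direct child spans) and reports the service with the largest share.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not tied to any tracing library)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float


def self_time_by_service(spans):
    """Sum each service's self time: span duration minus time in direct children."""
    child_time = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] += s.end_ms - s.start_ms

    totals = defaultdict(float)
    for s in spans:
        duration = s.end_ms - s.start_ms
        # Assumption: child spans run sequentially. If they overlap,
        # the subtraction can go negative, so clamp at zero.
        totals[s.service] += max(0.0, duration - child_time[s.span_id])
    return dict(totals)


def bottleneck(spans):
    """Return the service with the largest total self time."""
    totals = self_time_by_service(spans)
    return max(totals, key=totals.get)


# Hypothetical trace of one request: gateway -> rest-api -> (database, cache)
TRACE = [
    Span("a", None, "gateway", 0, 120),
    Span("b", "a", "rest-api", 5, 115),
    Span("c", "b", "database", 10, 100),
    Span("d", "b", "cache", 100, 105),
]

if __name__ == "__main__":
    ranked = sorted(self_time_by_service(TRACE).items(), key=lambda kv: -kv[1])
    for service, ms in ranked:
        print(f"{service}: {ms:.0f} ms self time")
    print("likely bottleneck:", bottleneck(TRACE))
```

In this made-up trace the database accounts for most of the request's self time, so it is the service to examine next in the micro view, for example with a profiler. Real traces usually involve concurrent child spans and many requests, so a production analysis would aggregate over many traces rather than rely on one.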

Sharded and Federated Prometheus Servers to Monitor Distributed Databases

Low-Overhead Tracing Using eBPF for Observability into Kubernetes Apps and Services

Got a Need for Speed? Accelerate Your Prometheus Dashboard Using Trickster

Web performance API deep dive

You call it data lake; we call it Data Historian.

Code Property Graph: A modern, queryable data storage for source code