Performance debugging: Finding bottlenecks in distributed systems
Performance debugging is a crucial part of ensuring code is production ready, particularly as a company and its products grow. However, bottlenecks that hold these services back can be hard to identify. Christian Grabowski shares his experience debugging bottlenecks in distributed systems, at both a macro (metrics, distributed tracing) and a micro (user space and kernel space profiling) level.
Talk Title | Performance debugging: Finding bottlenecks in distributed systems |
Speakers | Christian Grabowski (NS1) |
Conference | O’Reilly Velocity Conference |
Conf Tag | Building and maintaining complex distributed systems |
Location | San Jose, California |
Date | June 12-14, 2018 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Whether a company is seeing rapid growth or already has a large customer base, the performance of its software is crucial and can be affected by a range of variables. These include how the company delivers applications to customers, how host machines run the software, and everything in between. Performance debugging is a crucial part of ensuring code is production ready, particularly as a company and its products grow. Removing the bottlenecks that keep existing software from performing optimally lets a business’s systems scale and handle more usage. However, most of the battle in debugging is identifying the bottlenecks rather than fixing them. Skills such as tracing, monitoring, and profiling are invaluable for that.

Christian Grabowski shares his experience debugging bottlenecks in distributed systems, at both a macro (metrics, distributed tracing) and a micro (user space and kernel space profiling) level. He focuses particularly on tuning REST API services to handle databases that had doubled in size within a day and on taming a resource-hungry, high-throughput metrics ingestion service.

In the macro view, the goal is to identify the bottleneck(s) of a distributed system. Which service is preventing higher throughput? Which service is adding latency? Which service is using all of the resources? Thankfully, many tools are available to answer these questions, such as operational metrics and distributed tracing.

The micro view, on the other hand, examines where bottlenecks exist within a single service. The cause may lie in specific blocks of code, in the balance of resources, or in the configuration of the service or machine. Newer technologies, such as dynamic tracing with eBPF, are emerging to help identify these issues.

Join Christian to learn how to find and resolve these bottlenecks, making software scale and perform substantially better.
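As a rough illustration of the macro-view question "which service is adding latency?", the sketch below attributes time to services from the spans of a single distributed trace. It is not taken from the talk: the span tuples, the service names, and the helper functions are all hypothetical, and it assumes that each span's children run sequentially without overlapping, which real traces may not satisfy.

```python
from collections import defaultdict

# Hypothetical spans from one trace (illustrative data, not from the talk):
# (span_id, parent_id, service, start_ms, end_ms)
SPANS = [
    ("a", None, "api-gateway", 0, 120),
    ("b", "a", "auth", 5, 20),
    ("c", "a", "records-api", 25, 115),
    ("d", "c", "database", 30, 105),
]


def self_time_by_service(spans):
    """Return total self time per service, in milliseconds.

    Self time is a span's duration minus the durations of its direct
    children. Assumption: children do not overlap in time.
    """
    # First pass: total time spent in the direct children of each span.
    child_time = defaultdict(int)
    for _span_id, parent_id, _service, start, end in spans:
        if parent_id is not None:
            child_time[parent_id] += end - start

    # Second pass: subtract child time and aggregate per service.
    totals = defaultdict(int)
    for span_id, _parent_id, service, start, end in spans:
        totals[service] += (end - start) - child_time[span_id]
    return dict(totals)


def slowest_service(spans):
    """Return the service with the largest total self time."""
    totals = self_time_by_service(spans)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    for service, ms in sorted(self_time_by_service(SPANS).items()):
        print(f"{service}: {ms} ms")
    print("largest self time:", slowest_service(SPANS))
```

In this made-up trace the gateway span is the longest end to end, but most of that time is spent waiting on downstream calls. Looking at self time rather than total duration is one simple way to point at the service where the latency actually accrues.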