Online performance analysis of distributed dataflow systems
Vasia Kalavri offers an overview of Strymon, a system for predictive data center analytics, and its online critical path analysis module. Strymon analyzes live traces from distributed dataflow systems like Apache Spark, Apache Flink, and TensorFlow to predict bottlenecks and provide insights on streaming application performance.
Talk Title | Online performance analysis of distributed dataflow systems |
Speakers | Vasiliki Kalavri (ETH Zurich) |
Conference | O’Reilly Velocity Conference |
Conf Tag | Build Resilient Distributed Systems |
Location | London, United Kingdom |
Date | October 18-20, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Understanding the performance of distributed dataflow systems like Apache Spark, Apache Flink, and TensorFlow is hard. Parallel computation is interleaved with data and control communication, and execution dependencies typically span multiple system components. In such environments, bottleneck detection is cumbersome and still relies heavily on human effort. Despite decades of systems research, state-of-the-art performance analysis techniques are commonly based on offline trace processing, so they are suitable only for batch computations and postmortem reports.

Vasia Kalavri offers an overview of Strymon, a system for predictive data center analytics, and its online critical path analysis module. Strymon analyzes live traces from distributed dataflow systems to predict bottlenecks and provide insights on streaming application performance. It leverages the logging and monitoring pipelines of modern production data centers to ingest cross-layer events in a streaming fashion and predicts the possible effects of such events in what-if scenarios.