January 17, 2020

224 words 2 mins read

Online performance analysis of distributed dataflow systems

Online performance analysis of distributed dataflow systems

Vasia Kalavri offers an overview of Strymon, a system for predictive data center analytics, and its online critical path analysis module. Strymon analyzes live traces from distributed dataflow systems like Apache Spark, Apache Flink, and TensorFlow to predict bottlenecks and provide insights on streaming application performance.

Talk Title Online performance analysis of distributed dataflow systems
Speakers Vasiliki Kalavri (ETH Zurich)
Conference O’Reilly Velocity Conference
Conf Tag Build Resilient Distributed Systems
Location London, United Kingdom
Date October 18-20, 2017
URL Talk Page
Slides Talk Slides
Video

Understanding the performance of distributed dataflow systems like Apache Spark, Apache Flink, and Tensorflow is hard. Parallel computation is interleaved with data and control communication, and execution dependencies typically span multiple system components. In such environments, bottleneck detection is cumbersome and currently relies heavily on humans. After decades of systems research, state-of-the-art performance analysis techniques are commonly based on offline trace processing and thus are only suitable for batch computations and postmortem reports. Vasia Kalavri offers an overview of Strymon, a system for predictive data center analytics, and its online critical path analysis module. Strymon analyzes live traces from distributed dataflow systems to predict bottlenecks and provide insights on streaming application performance—leveraging logging and monitoring pipelines of modern production data centers to ingest cross-layer events in a streaming fashion and predict possible effects of such events in what-if scenarios.

comments powered by Disqus