February 1, 2020


Accelerating I/O in Big Data: A Data Driven Approach and Case Studies

Talk Title: Accelerating I/O in Big Data: A Data Driven Approach and Case Studies
Speakers: Yingqi (Lucy) Lu (Software Development Engineer, Intel Corporation)
Conference: Open Source Summit North America
Location: Vancouver, BC, Canada
Date: Aug 27-31, 2018
URL: Talk Page
Slides: Talk Slides

I/O infrastructure is key to the Big Data ecosystem. New networking and storage hardware technologies are continuously being developed, while the software I/O stack remains relatively slow. For applications to take full advantage of modern devices, a deep understanding of the I/O subsystems is required, along with optimizations to Java libraries and Big Data frameworks. This presentation uses a data-driven approach to identify software I/O bottlenecks inside four Big Data frameworks: Apache Cassandra, HBase, Spark, and HDFS. To fix the bottlenecks, it introduces new Java library APIs that Intel contributed to OpenJDK. The corresponding software changes to the target frameworks are also discussed as examples of how to use the new Java APIs. Each case study ends with a performance analysis that demonstrates the throughput and latency improvements from the software optimizations.