Accelerating I/O in Big Data A Data Driven Approach and Case Studies
Talk Title | Accelerating I/O in Big Data A Data Driven Approach and Case Studies |
Speakers | Yingqi (Lucy) Lu (Software Development Engineer, Intel Corporation) |
Conference | Open Source Summit North America |
Location | Vancouver, BC, Canada |
Date | Aug 27-31, 2018 |
URL | Talk Page |
Slides | Talk Slides |
I/O infrastructure is key to the Big Data ecosystem. New networking and storage hardware technologies are continuously being developed, while the software I/O stack remains relatively slow. To ensure that applications can take full advantage of modern devices, a deep understanding of I/O subsystems is required, along with optimizations to Java libraries and Big Data frameworks. This presentation uses a data-driven approach to identify software I/O bottlenecks inside four Big Data frameworks: Apache Cassandra, HBase, Spark, and HDFS. To fix the bottlenecks, it introduces new Java library APIs that Intel contributes to OpenJDK. Corresponding software changes to the target Big Data frameworks are also discussed as examples of how to use the new Java APIs. Each case study closes with a performance analysis that demonstrates the throughput and latency improvements from the software optimizations.