December 21, 2019

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.

Talk Title Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric
Speakers Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)
Conference Strata Data Conference
Conf Tag Big Data Expo
Location San Francisco, California
Date March 26-28, 2019

As a unified data processing engine, Spark is expected to deliver high throughput and ultralow latency across workloads such as ad hoc queries, real-time streaming, and machine learning. Under certain workloads, however, notably large joins and aggregations, its performance is limited by the overhead of persisting shuffle data to local drives and transferring it over TCP/IP networking. Previous studies showed that RDMA networking and fast storage such as NVMe SSDs can help, and in principle they should yield orders-of-magnitude improvements, but the measured gains fell short of that because of the long I/O stack in the shuffle stage.

The new Intel Optane DC persistent memory module (DCPMM) technology, which offers persistence at memory-like speed, makes it possible to shorten the I/O stack and turn Spark into a 100% in-memory computing platform.

Spark-PMoF is an open source, community-driven project contributed by storage and big data engineers. It leverages PMoF (persistent memory over fabric) technology and enables the codesign of the storage and network stacks to speed up Spark shuffle. With P2P-connected persistent shuffle storage running at memory-like speed, it can fully bypass the context switch and greatly improve big data analytics performance without compromising Spark consistency. Initial benchmark results on microworkloads show substantial improvements with Spark-PMoF.
