Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric
Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.
Talk Title | Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric |
Speakers | Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel) |
Conference | Strata Data Conference |
Conf Tag | Big Data Expo |
Location | San Francisco, California |
Date | March 26-28, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
As a unified data processing engine, Spark is expected to deliver high throughput and ultralow latency across diverse workloads such as ad hoc queries, real-time streaming, and machine learning. Under certain workloads (large joins and aggregations), however, its performance is limited by the overhead of persisting shuffle data to local drives and transferring it over TCP/IP networking. Previous studies showed that RDMA networking and fast storage such as NVMe SSDs can help, but the gains fell well short of the orders-of-magnitude improvement the hardware promises because of the long I/O stack in the shuffle stage. The new DCPMM technology, which offers persistence at memory-like speed, makes it possible to shorten that I/O stack and turn Spark into a fully in-memory computing platform.

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance. An open source, community-driven project contributed by storage and big data engineers, Spark-PMoF leverages PMoF (persistent memory over fabric) and codesigns the storage and network stacks to accelerate Spark shuffle. By using peer-to-peer-connected persistent shuffle storage with memory-like speed, it bypasses context switches and greatly improves big data analytics performance without weakening any of Spark's consistency guarantees. Initial benchmark results on microworkloads show that Spark-PMoF achieves significant improvements.
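Because Spark exposes the shuffle layer as a pluggable component, an approach like Spark-PMoF can be wired into an existing job through configuration rather than code changes. The Scala sketch below illustrates the idea for a shuffle-heavy aggregation; `spark.shuffle.manager` is a standard Spark configuration key, but the `PmofShuffleManager` class name shown here is an assumed placeholder, not the project's confirmed API.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of plugging an alternative shuffle manager into a Spark job.
// "spark.shuffle.manager" is a standard Spark configuration key; the
// Spark-PMoF class name below is a hypothetical placeholder.
object PmofShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pmof-shuffle-sketch")
      // Replace the default sort-based shuffle with a plugin implementation so
      // shuffle writes/reads could go to persistent memory over RDMA instead of
      // local disk plus TCP/IP.
      .config("spark.shuffle.manager",
              "org.apache.spark.shuffle.pmof.PmofShuffleManager") // hypothetical class name
      .getOrCreate()

    // A wide aggregation like this forces a shuffle, which is exactly the
    // stage the talk targets.
    val df = spark.range(0L, 100000000L)
      .selectExpr("id % 100000 AS key", "id AS value")
    val agg = df.groupBy("key").sum("value")

    agg.show(10)
    spark.stop()
  }
}
```

With the shuffle manager swapped in this way, the rest of the application code is unchanged, which is what allows the plugin to preserve Spark's existing shuffle semantics while only replacing the storage and transport path underneath.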