Hadoop and object stores: Can we do it better?

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.


Talk Title	Hadoop and object stores: Can we do it better?
Speakers	Trent Gray-Donald (IBM), Gil Vernik (IBM)
Conference	Strata Data Conference
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	May 23-25, 2017
URL	Talk Page
Slides	Talk Slides
Video

Apache Hadoop installations traditionally collocate compute and storage using. However, there is a new trend toward disaggregation, where the compute and storage clusters are separate and can be scaled independently. The popularity of object stores has also increased recently. Object stores offer a highly scalable and low cost alternative for storing the huge amounts of data being generated and collected by individuals, businesses, and organizations. Combining both trends, many have started to explore the impact of using object storage as the primary data storage for their Hadoop and Spark compute clusters. Hadoop can easily operate without HDFS and contains modules that provide a convenient way to access object stores like Amazon S3, OpenStack Swift, Azure Blob Storage, and IBM Cloud Object Storage. Many big data projects, like Apache Spark or Alluxio, rely on these same Hadoop modules to ease their integration with object stores. However, these modules are designed to work with filesystems rather than object stores. Thus, they contain flows and operations (e.g., the creation of temporary objects and rename) that are not native to object stores and introduce unnecessary complexity and overhead. Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used. In tests, running DFSIO on Hadoop with Stocator achieved a 37% speed up for write flows and a 96% speed up for the read flows compared to Hadoop with its native object store connectors, while running Terasort on Spark with Stocator achieved a 500% speedup compared to the native connectors. Wordcount on Spark, where the output is much smaller than the input, achieved a 100% speedup.

Hadoop and object stores: Can we do it better?

The state of Spark in the cloud

Modern Big Data Pipelines over Kubernetes [I]

Using R for scalable data analytics: From single machines to Hadoop Spark clusters

BoFs: Data-Aware Scheduling in Kubernetes [I]

Compressed linear algebra in Apache SystemML

Paint the landscape and secure your data center with Apache Spot