Hadoop and object stores: Can we do it better?
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.
Talk Title | Hadoop and object stores: Can we do it better? |
Speakers | Trent Gray-Donald (IBM), Gil Vernik (IBM) |
Conference | Strata Data Conference |
Conf Tag | Making Data Work |
Location | London, United Kingdom |
Date | May 23-25, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Apache Hadoop installations traditionally collocate compute and storage using. However, there is a new trend toward disaggregation, where the compute and storage clusters are separate and can be scaled independently. The popularity of object stores has also increased recently. Object stores offer a highly scalable and low cost alternative for storing the huge amounts of data being generated and collected by individuals, businesses, and organizations. Combining both trends, many have started to explore the impact of using object storage as the primary data storage for their Hadoop and Spark compute clusters. Hadoop can easily operate without HDFS and contains modules that provide a convenient way to access object stores like Amazon S3, OpenStack Swift, Azure Blob Storage, and IBM Cloud Object Storage. Many big data projects, like Apache Spark or Alluxio, rely on these same Hadoop modules to ease their integration with object stores. However, these modules are designed to work with filesystems rather than object stores. Thus, they contain flows and operations (e.g., the creation of temporary objects and rename) that are not native to object stores and introduce unnecessary complexity and overhead. Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used. In tests, running DFSIO on Hadoop with Stocator achieved a 37% speed up for write flows and a 96% speed up for the read flows compared to Hadoop with its native object store connectors, while running Terasort on Spark with Stocator achieved a 500% speedup compared to the native connectors. Wordcount on Spark, where the output is much smaller than the input, achieved a 100% speedup.