The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.
Talk Title | The columnar roadmap: Apache Parquet and Apache Arrow |
Speakers | Julien Le Dem (WeWork) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 26-28, 2017 |
URL | Talk Page |
Slides | Talk Slides |
The Hadoop ecosystem has standardized on columnar formats, with Apache Parquet for on-disk storage and Apache Arrow for in-memory storage. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters down to the storage layer, reducing the time spent in I/O reading from disk as well as the CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable, as data can now cross system boundaries without incurring costly translation, and cross-system programming with Spark, Python, or SQL becomes as fast as native internal processing. Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will enable more optimizations in query engines in the future; how the new vectorized reader from Parquet to Arrow enables much faster reads by removing layers of abstraction; and how standard Arrow-based APIs, such as universal function libraries that can be written in any language and a standard data access API with projection and predicate pushdowns, will greatly simplify data access optimizations across the board and pave the way to breaking down the silos of big data.
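The projection and predicate pushdown described above is already visible in the Arrow libraries today. The following is a minimal sketch, assuming pyarrow is installed and a local "events.parquet" file with hypothetical "user_id" and "amount" columns; the column list and the row-group filter are handed to the Parquet reader rather than applied after the fact, so unneeded columns and row groups whose min/max statistics fail the filter are never decoded.

```python
import pyarrow.parquet as pq

# Read the Parquet file directly into an Arrow table.
# columns=... pushes the projection down to the storage layer;
# filters=... lets the reader skip row groups using their statistics.
table = pq.read_table(
    "events.parquet",                 # hypothetical file path
    columns=["user_id", "amount"],    # projection pushdown
    filters=[("amount", ">=", 100)],  # predicate pushdown
)

print(table.num_rows)
print(table.schema)
```

Because the result is already an Arrow table, it can be handed to Spark, pandas, or any other Arrow-aware system without a costly conversion step, which is the cross-system benefit the talk emphasizes.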