December 29, 2019


The columnar roadmap: Apache Parquet and Apache Arrow

Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.

Talk Title The columnar roadmap: Apache Parquet and Apache Arrow
Speakers Julien Le Dem (WeWork)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 26-28, 2017
URL Talk Page
Slides Talk Slides
Video

The Hadoop ecosystem has standardized on columnar formats: Apache Parquet for on-disk storage and Apache Arrow for in-memory representation. Vertical integration from storage to execution greatly improves data access latency by pushing projections and filters down to the storage layer, reducing both the I/O time spent reading from disk and the CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable, because data can cross system boundaries without incurring costly translation, so cross-system programming in Spark, Python, or SQL becomes as fast as native internal processing. Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will enable further query-engine optimizations in the future; how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstraction layers; and how standard Arrow-based APIs, such as universal function libraries that can be written in any language and a standard data access API with projection and predicate pushdown, will greatly simplify data access optimizations across the board and pave the way to breaking down the silos of big data.
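
As a concrete illustration of the projection and predicate pushdown described above, here is a minimal sketch using pyarrow to read a Parquet file directly into an Arrow table. The file name events.parquet and its columns are made up for illustration, and this shows one common way to exercise the pushdown path, not the specific APIs covered in the talk.

```python
# Minimal sketch: projection and predicate pushdown when reading
# Parquet into Arrow with pyarrow. File name and columns are
# hypothetical, chosen only for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny example file so the snippet is self-contained.
table = pa.table({
    "user_id": pa.array([1, 2, 3, 4], type=pa.int64()),
    "country": pa.array(["US", "FR", "US", "DE"]),
    "amount":  pa.array([10.0, 20.5, 7.25, 99.0]),
})
pq.write_table(table, "events.parquet")

# Projection (columns=...) decodes only the requested columns;
# predicate pushdown (filters=...) uses Parquet row-group statistics
# to skip data before it is decompressed and decoded. The result is
# materialized directly as an Arrow table, with no row-by-row
# translation between formats.
result = pq.read_table(
    "events.parquet",
    columns=["user_id", "country"],      # projection pushdown
    filters=[("country", "=", "US")],    # predicate pushdown
)
print(result)
```

Because the result is already an Arrow table, it can be handed to any Arrow-aware engine or library without another conversion step, which is the cross-system benefit the abstract refers to.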