The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.
Talk Title | The columnar roadmap: Apache Parquet and Apache Arrow |
Speakers | Julien Le Dem (WeWork) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 26-28, 2017 |
URL | Talk Page |
Slides | Talk Slides |
The Hadoop ecosystem has standardized on columnar formats, with Apache Parquet for on-disk storage and Apache Arrow for in-memory storage. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters down to the storage layer, reducing the time spent in I/O reading from disk as well as the CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable, as data can now cross system boundaries without incurring costly translation, and cross-system programming with Spark, Python, or SQL becomes as fast as native internal processing. Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will enable more optimizations in query engines in the future; how the new vectorized reader from Parquet to Arrow enables much faster reads by removing layers of abstraction; and how standard Arrow-based APIs, such as universal function libraries that can be written in any language and a standard data access API with projection and predicate pushdowns, will greatly simplify data access optimizations across the board and pave the way to breaking down the silos of big data.
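The projection and predicate pushdown described above is already visible in the Arrow libraries today. The following is a minimal sketch, assuming pyarrow is installed and a local "events.parquet" file with hypothetical "user_id" and "amount" columns; the column list and the row-group filter are handed to the Parquet reader rather than applied after the fact, so unneeded columns and row groups whose min/max statistics fail the filter are never decoded.

```python
import pyarrow.parquet as pq

# Read the Parquet file directly into an Arrow table.
# columns=... pushes the projection down to the storage layer;
# filters=... lets the reader skip row groups using their statistics.
table = pq.read_table(
    "events.parquet",                 # hypothetical file path
    columns=["user_id", "amount"],    # projection pushdown
    filters=[("amount", ">=", 100)],  # predicate pushdown
)

print(table.num_rows)
print(table.schema)
```

Because the result is already an Arrow table, it can be handed to Spark, pandas, or any other Arrow-aware system without a costly conversion step, which is the cross-system benefit the talk emphasizes.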