The future of column-oriented data processing with Arrow and Parquet

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.


Talk Title	The future of column-oriented data processing with Arrow and Parquet
Speakers
Conference	Strata + Hadoop World
Conf Tag	Make Data Work
Location	New York, New York
Date	September 27-29, 2016
URL	Talk Page
Slides	Talk Slides
Video

In pursuit of speed and efficiency, big data processing is continuing its logical evolution toward columnar execution. A number of key big data technologies, including Kudu, Ibis, and Drill, have or will soon have in-memory columnar capabilities. The solid foundation laid by Apache Arrow and Apache Parquet for a shared columnar representation across the ecosystem promises a great future. Modern CPUs achieve higher throughput using SIMD instructions and vectorization on Arrow’s columnar in-memory representation. Similarly, Parquet provides storage and I/O optimized columnar data access using statistics and appropriate encodings. For interoperability, row-based encodings (CSV, Thrift, Avro) combined with general-purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. The Arrow and Parquet projects define standard columnar representations allowing interoperability without the usual cost of serialization. Jacques Nadeau, vice president of Apache Arrow, and Julien Le Dem, vice president of Apache Parquet, discuss the future of columnar data processing and the hardware trends it takes advantage of. Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enable them to be used together seamlessly and efficiently without overhead: when collocated on the same processing node, read-only shared memory and IPC avoid communication overhead; when remote, scatter-gather I/O sends the memory representation directly to the socket, avoiding serialization costs, and soon RDMA will allow exposing data remotely. As in-memory processing becomes more popular, the traditional tiering of RAM as working space and HDD as persistent storage is outdated. More tiers are now available like SSDs and nonvolatile memory, which provide much higher data density and achieve a latency close to RAM at a fraction of the cost. Execution engines can take advantage of more granular tiering and avoid the traditional spilling to disk, which impacts performance by an order of magnitude when the working dataset does not fit in main memory.

The future of column-oriented data processing with Arrow and Parquet

Faster conclusions using in-memory columnar SQL and machine learning

Scala and the JVM as a big data platform: Lessons from Apache Spark

Stream analytics in the enterprise: A look at Intels internal IoT implementation

Tuning Spark machine-learning workloads

AI is not a matter of strength but of intelligence

Benefits of scaling machine learning