December 13, 2019

File format benchmark: Avro, JSON, ORC, and Parquet

Picking the best data format depends on what kind of data you have and how you plan to use it. Owen O'Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.

Talk Title: File format benchmark: Avro, JSON, ORC, and Parquet
Speakers: Owen O'Malley
Conference: Strata + Hadoop World
Conf Tag: Make Data Work
Location: New York, New York
Date: September 27-29, 2016
URL: Talk Page
Slides: Talk Slides

The landscape for storing big data is complex, with several competing formats and multiple implementations of each. Picking the best data format depends on what kind of data you have and how you plan to use it, and different formats perform very differently depending on the use case. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. Owen O’Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.

All of the benchmark code will be open source so that the experiments can be replicated. Because it is important to benchmark on real data rather than synthetic data, the benchmarks use the GitHub logs data freely available from the GitHub Archive.
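
To make the "it depends on how you plan to use it" point concrete, here is a minimal sketch of one access pattern: pulling a single field out of many records. It is not the talk's benchmark code. JSON and Avro store data row by row, while ORC and Parquet store it column by column. The sketch imitates both layouts with the standard `json` module only, so no real ORC or Parquet file is written. The records are invented values loosely shaped like GitHub Archive events, and every function name is hypothetical. It only counts how many bytes must be decoded to answer the query, which is an assumption-laden proxy rather than a measurement of any real format.

```python
import json


def make_events(n):
    """Build made-up records loosely shaped like GitHub Archive events."""
    return [
        {
            "id": i,
            "type": "PushEvent" if i % 3 else "WatchEvent",
            "actor": f"user{i % 50}",
            "repo": f"org/repo{i % 20}",
        }
        for i in range(n)
    ]


def encode_rows(events):
    """Row-oriented stand-in: one JSON document per line."""
    return "\n".join(json.dumps(e) for e in events).encode("utf-8")


def encode_columns(events):
    """Column-oriented stand-in: each field stored as its own JSON array."""
    fields = list(events[0].keys())
    return {f: json.dumps([e[f] for e in events]).encode("utf-8") for f in fields}


def read_field_from_rows(blob, field):
    """Every full record has to be decoded to extract one field."""
    lines = blob.decode("utf-8").splitlines()
    return [json.loads(line)[field] for line in lines], len(blob)


def read_field_from_columns(columns, field):
    """Only the chunk holding the requested field is decoded."""
    chunk = columns[field]
    return json.loads(chunk.decode("utf-8")), len(chunk)


events = make_events(1000)
row_values, row_bytes = read_field_from_rows(encode_rows(events), "type")
col_values, col_bytes = read_field_from_columns(encode_columns(events), "type")

print(f"row layout decoded {row_bytes} bytes, column layout decoded {col_bytes} bytes")
```

Both layouts return the same values, but the column layout touches only the bytes for the requested field. The trade-off reverses when a query needs whole records, which is one reason a single format does not win every use case. Because the data here is synthetic, the byte counts say nothing about how real formats behave on real data.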
