File format benchmark: Avro, JSON, ORC, and Parquet
Picking the best data format depends on what kind of data you have and how you plan to use it. Owen O'Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.
Talk Title | File format benchmark: Avro, JSON, ORC, and Parquet |
Speakers | Owen O'Malley |
Conference | Strata + Hadoop World |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 27-29, 2016 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
The landscape for storing big data is complex, with several competing formats and multiple implementations of each. Picking the best data format depends on what kind of data you have and how you plan to use it, and performance varies widely from one use case to another. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. Owen O’Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.

Use cases include:

All of the benchmark code will be open source so that the experiments can be replicated. Because it is important to benchmark on real data rather than synthetic data, the benchmarks use the GitHub logs data freely available from the GitHub Archive.
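To make the point about use cases concrete, here is a minimal sketch of a timing harness. It is not the talk's benchmark code. It uses only the Python standard library and compares two hypothetical layouts of the same records: row-oriented JSON lines, and a column-oriented layout with one JSON array file per field. It then times reading a single field from each. The function names, the record fields, and the file names are all assumptions made for illustration, and the records are synthetic placeholders. A real comparison should use real data, as the abstract stresses, and real Avro, ORC, and Parquet libraries.

```python
import json
import os
import tempfile
import time


def make_records(n):
    """Build n placeholder event records (synthetic, NOT GitHub Archive data)."""
    return [
        {
            "id": i,
            "type": "PushEvent" if i % 2 == 0 else "IssuesEvent",
            "actor": f"user{i % 100}",
        }
        for i in range(n)
    ]


def write_row_layout(records, path):
    """Row-oriented layout: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def write_column_layout(records, directory):
    """Column-oriented layout (hypothetical): one JSON array file per field."""
    os.makedirs(directory, exist_ok=True)
    for name in records[0]:
        col_path = os.path.join(directory, name + ".json")
        with open(col_path, "w", encoding="utf-8") as f:
            json.dump([rec[name] for rec in records], f)


def read_column_from_rows(path, name):
    """Reading one field from the row layout must parse every whole record."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)[name] for line in f]


def read_column_from_columns(directory, name):
    """Reading one field from the column layout touches only that field's file."""
    with open(os.path.join(directory, name + ".json"), encoding="utf-8") as f:
        return json.load(f)


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_benchmark(n=20000):
    """Write both layouts to a temp directory and time a single-field read."""
    records = make_records(n)
    with tempfile.TemporaryDirectory() as tmp:
        row_path = os.path.join(tmp, "events.jsonl")
        col_dir = os.path.join(tmp, "events_columns")
        write_row_layout(records, row_path)
        write_column_layout(records, col_dir)
        from_rows, row_secs = timed(read_column_from_rows, row_path, "type")
        from_cols, col_secs = timed(read_column_from_columns, col_dir, "type")
    # Both layouts must yield the same values, or the timings are not comparable.
    assert from_rows == from_cols
    return {
        "rows_seconds": row_secs,
        "columns_seconds": col_secs,
        "values": len(from_rows),
    }


if __name__ == "__main__":
    print(run_benchmark())
```

A single timed call like this is noisy. A more careful harness would repeat each read several times and also measure file size and write time, since which layout is preferable depends on which of those costs matters for the workload.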