December 13, 2019

File format benchmark: Avro, JSON, ORC, and Parquet

Picking the best data format depends on what kind of data you have and how you plan to use it. Owen O'Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.


Talk Title	File format benchmark: Avro, JSON, ORC, and Parquet
Speakers	Owen O'Malley
Conference	Strata + Hadoop World
Conf Tag	Make Data Work
Location	New York, New York
Date	September 27-29, 2016
URL	Talk Page
Slides	Talk Slides
Video

The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Picking the best data format depends on what kind of data you have and how you plan to use it. Depending on your use case, different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. Owen O’Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications. Use cases include: All of the benchmark code will be open source so that the experiments can be replicated. Furthermore, it is important to benchmark on real data rather than synthetic data. You’ll use the GitHub logs data available freely from the GitHub Archive.

Fast cars, big data: How streaming data can help Formula 1

December 13, 2019

Modern cars produce data. Lots of data. And Formula 1 cars produce more than their fair share. Ted Dunning presents a demo of how data streaming can be applied to the analytics problems posed by modern motorsports. Although he won't be bringing Formula 1 cars to the talk, Ted demonstrates a physics-based simulator to analyze realistic data from simulated cars.

How a Spark-based feature store can accelerate big data adoption in financial services

December 12, 2019

Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, a library of reusable features that allows data scientists to solve business problems across the enterprise. Kaushik and Phil outline three challenges they faced (semantic data integration within a data lake, high-performance feature engineering, and metadata governance) and explain how they overcame them.

Evaluating models for a needle in a haystack: Applications in predictive maintenance

December 13, 2019

In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data.

Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies

December 10, 2019

David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Tracing polyglot systems: An OpenTracing tutorial

November 28, 2019

Priyanka Sharma and Yuri Shkuro demonstrate how distributed tracing works and how to employ it in the development and operations of your applications in the programming language of your choice: Java, Go, Python, Node.js, C#, or C++.

ChatOps in 2016

November 27, 2019

By now, you've probably heard of ChatOps (especially if you're in operations). GitHub has been using ChatOps for more than five years and continues to scale these practices. Ben Lavender explains the guidelines that GitHub has created to work with ChatOps and the lessons learned in the process.