Building real-time BI systems with HDFS and Kudu
Ruhollah Farchtchi explores best practices for building systems that support ad hoc queries over real-time data and offers an overview of Kudu, a new storage layer for Hadoop designed specifically for use cases that require fast analytics on rapidly changing data, with a simultaneous mix of sequential and random reads and writes.
| Talk Title | Building real-time BI systems with HDFS and Kudu |
|------------|--------------------------------------------------|
| Speakers | Ruhollah Farchtchi (Zoomdata) |
| Conference | Strata + Hadoop World |
| Conf Tag | Making Data Work |
| Location | London, United Kingdom |
| Date | June 1-3, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
One of the key challenges in working with real-time and streaming data is that the format used to capture data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization format that is great for initially bringing data into HDFS, and its native integration with Flume and other ingestion tools makes it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over a large number of similar rows.

Ruhollah Farchtchi explores best practices for dealing with these challenges and with the append-only nature of HDFS and discusses how to make sure data is distributed appropriately, which is challenging to do with static data and even tougher with real-time, dynamic data. Ruhollah also explains how to deal with updates to existing data, whether due to restatements or the need to compact the data.

Ruhollah then offers an overview of Kudu, a new storage layer for Hadoop that is specifically designed for fast analytics on rapidly changing data, demonstrates how Kudu simplifies the architecture of such systems, and reviews a number of lessons learned from working with Kudu, including how to use dictionary attributes to optimize storage of denormalized dimensional data; how to achieve a high degree of query parallelism via data distribution and by sizing the number of tablets to the available cores; and how to balance insert rates against read-heavy workloads.
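As a rough illustration of those last two table-design lessons, the sketch below uses the Apache Kudu Java client to create a table with a dictionary-encoded dimension column and a hash-partitioned primary key. The table name, column names, master address, tablet count, and replica count are illustrative assumptions, not details taken from the talk.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class CreateEventsTable {
    public static void main(String[] args) throws KuduException {
        // Hypothetical master address; point this at your own Kudu master.
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            // Primary key columns must come first in the schema.
            List<ColumnSchema> columns = Arrays.asList(
                new ColumnSchema.ColumnSchemaBuilder("event_id", Type.INT64).key(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("event_time", Type.UNIXTIME_MICROS).key(true).build(),
                // Dictionary encoding keeps repeated, denormalized dimension values compact on disk.
                new ColumnSchema.ColumnSchemaBuilder("region", Type.STRING)
                    .encoding(ColumnSchema.Encoding.DICT_ENCODING).build(),
                new ColumnSchema.ColumnSchemaBuilder("metric", Type.DOUBLE).build());
            Schema schema = new Schema(columns);

            // Hash-partition on the key; pick the tablet count to roughly match
            // the cores available for scans (16 here is only an example).
            CreateTableOptions options = new CreateTableOptions()
                .addHashPartitions(Arrays.asList("event_id"), 16)
                .setNumReplicas(3);

            client.createTable("events", schema, options);
        } finally {
            client.close();
        }
    }
}
```

Sizing the hash-partition count to roughly match the cores available for scanning spreads work across tablet servers, while dictionary encoding keeps repeated dimension values compact, which is the trade-off the lessons above gesture at.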