January 22, 2020


Hudi: Unifying storage and serving for batch and near-real-time analytics

Uber needs to provide faster, fresher data to its data consumers and products, which run hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Talk Title Hudi: Unifying storage and serving for batch and near-real-time analytics
Speakers Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 11-13, 2018
URL Talk Page
Slides Talk Slides

Hudi (formerly Hoodie) is an open source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage. Hudi enables near-real-time ingestion and provides three views of the data: a read-optimized view for batch analytics, a real-time view for driving dashboards, and an incremental view for powering data pipelines. Hudi also manages files on the underlying storage to maximize operational health and reliability. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar outline the design and architecture of merge-on-read storage and explain how it lowers data latency across the board while achieving orders-of-magnitude efficiency gains over traditional batch ingestion. They make the case for near-real-time dashboarding on top of Hudi datasets, which can be cheaper than pure streaming architectures, and detail how Uber uses Hudi for ingestion, incremental ETL, and GDPR compliance.
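
To make the three views concrete, here is a minimal in-memory sketch of the merge-on-read idea in Python. It is a toy model under stated assumptions, not Hudi's API or on-disk format: the class `ToyMergeOnReadTable`, its method names, and the trip records are all hypothetical. Writes land in an append-only log, a separate compaction step folds the log into a base snapshot, and each view reads a different combination of the two.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMergeOnReadTable:
    """Toy model of merge-on-read storage (hypothetical, not Hudi's API)."""

    # Compacted snapshot: key -> (commit number, value).
    base: dict = field(default_factory=dict)
    # Append-only delta log of (commit number, key, value) entries.
    log: list = field(default_factory=list)
    commit: int = 0

    def upsert(self, records: dict) -> int:
        """Append a batch of inserts/updates to the log under a new commit."""
        self.commit += 1
        for key, value in records.items():
            self.log.append((self.commit, key, value))
        return self.commit

    def compact(self) -> None:
        """Fold all pending log entries into the base snapshot."""
        for commit, key, value in self.log:
            self.base[key] = (commit, value)
        self.log.clear()

    def read_optimized(self) -> dict:
        """Read only compacted base data: cheap to scan, possibly stale."""
        return {key: value for key, (_, value) in self.base.items()}

    def real_time(self) -> dict:
        """Merge base data with pending log entries at read time."""
        merged = self.read_optimized()
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def incremental(self, since_commit: int) -> dict:
        """Return only the records changed after the given commit."""
        changed = {
            key: value
            for key, (commit, value) in self.base.items()
            if commit > since_commit
        }
        for commit, key, value in self.log:
            if commit > since_commit:
                changed[key] = value
        return changed


table = ToyMergeOnReadTable()
first = table.upsert({"trip-1": "requested", "trip-2": "requested"})
table.compact()
table.upsert({"trip-1": "completed"})

stale = table.read_optimized()     # base only: trip-1 is still "requested"
fresh = table.real_time()          # base + log: trip-1 is "completed"
delta = table.incremental(first)   # only what changed after the first commit
```

In this sketch the write path only appends, which is what keeps ingestion latency low; the cost of merging is deferred either to read time (the real-time view) or to compaction (which refreshes the read-optimized view). A downstream pipeline would call `incremental` with the last commit it processed instead of rescanning the whole table.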
