January 22, 2020

Hudi: Unifying storage and serving for batch and near-real-time analytics

Uber needs to provide faster, fresher data to its data consumers and products, which run hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.


Talk Title	Hudi: Unifying storage and serving for batch and near-real-time analytics
Speakers	Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018

Hudi (formerly Hoodie) is an open source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage. Hudi enables near-real-time ingestion and provides three views of the data: a read-optimized view for batch analytics, a real-time view for driving dashboards, and an incremental view for powering data pipelines. Hudi also manages files on the underlying storage to maximize operational health and reliability. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar outline the design and architecture of merge-on-read storage and explain how it lowers data latency across the board while being orders of magnitude more efficient than traditional batch ingestion. They make the case for near-real-time dashboarding on top of Hudi datasets, which can be cheaper than pure streaming architectures, and detail how Uber leverages Hudi for use cases around ingestion, incremental ETL, and GDPR compliance.
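
The relationship between merge-on-read storage and the three views can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class `ToyMergeOnReadTable` and its methods are hypothetical names invented for illustration and are not Hudi's API, and real Hudi works with files on distributed storage rather than Python dictionaries. The idea shown is that updates land in a cheap append-only log, a compaction step periodically folds them into the base data, and each view trades freshness against read cost differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMergeOnReadTable:
    """Toy in-memory model of merge-on-read storage (hypothetical, not Hudi's API)."""

    base: dict = field(default_factory=dict)  # compacted records: key -> (commit, value)
    log: list = field(default_factory=list)   # pending updates: (commit, key, value)
    commit: int = 0                           # last commit number handed out

    def upsert(self, records):
        """Append a batch of updates to the log under a new commit number."""
        self.commit += 1
        for key, value in records.items():
            self.log.append((self.commit, key, value))
        return self.commit

    def compact(self):
        """Fold logged updates into the base and clear the log."""
        for commit, key, value in self.log:
            self.base[key] = (commit, value)
        self.log.clear()

    def read_optimized(self):
        """Read only compacted data: cheap to scan, but possibly stale."""
        return {key: value for key, (_, value) in self.base.items()}

    def real_time(self):
        """Merge base and log at read time: fresh, at a higher cost per query."""
        merged = self.read_optimized()
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def incremental(self, since_commit):
        """Return the latest value of each key changed after since_commit."""
        changed = {
            key: value
            for key, (commit, value) in self.base.items()
            if commit > since_commit
        }
        for commit, key, value in self.log:
            if commit > since_commit:
                changed[key] = value
        return changed


table = ToyMergeOnReadTable()
first = table.upsert({"trip-1": "requested", "trip-2": "requested"})
table.compact()
table.upsert({"trip-1": "completed"})

print(table.read_optimized())    # compacted data only
print(table.real_time())         # compacted data merged with pending updates
print(table.incremental(first))  # only what changed after the first commit
```

In this sketch, the read-optimized view still reports `trip-1` as requested until the next compaction, the real-time view already reports it as completed, and the incremental view returns only `trip-1`, which is the kind of change feed a downstream pipeline could consume instead of rescanning the whole dataset.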

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

January 20, 2020

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores, and you'll take a deep dive into the architecture to see how it all works.

IoT edge processing with Apache NiFi, Apache MiniFi, and multiple deep learning libraries

January 21, 2020

Timothy Spann leads a hands-on deep dive into using Apache MiniFi with Apache MXNet and other deep learning libraries on edge devices.

Why and how to leverage the power and simplicity of SQL on Apache Flink

January 16, 2020

Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results.

How BT delivers better broadband and TV using Spark and Kafka

December 9, 2019

In the past year, British Telecom has added a streaming network analytics use case to its multitenant data platform. Phillip Radley demonstrates how the solution works and explains how it delivers better broadband and TV services, using Kafka and Spark on YARN and HDFS encryption.

StreamDM: Advanced data science with Spark Streaming

December 5, 2019

Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei's Noah's Ark Lab and Télécom ParisTech.

You call it data lake; we call it Data Historian.

December 4, 2019

There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security.