November 21, 2019

Radically modular data ingestion APIs in Apache Beam

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.

Talk Title: Radically modular data ingestion APIs in Apache Beam
Speakers: Eugene Kirpichov (Google)
Conference: Strata Data Conference
Conf Tag: Big Data Expo
Location: San Jose, California
Date: March 6-8, 2018

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Beam also offers a rich set of I/O connectors to popular storage systems. Beam adopts the philosophy that interacting with a storage system is just another parallel data processing task, so the I/O connectors are packaged as simple Beam transforms. However, the batch/streaming dichotomy still existed for authors of new I/O connectors: it is common knowledge that efficiently ingesting batch and streaming data is fundamentally different and, at a low level, requires fundamentally different (and rather heavyweight) APIs.

Over the years Google has identified a number of issues with these APIs, and its attempts to improve them have yielded an unexpected result: a generalization of the most fundamental data processing primitive, the Map operation (DoFn in Beam). The generalization, called Splittable DoFn, not only allows data ingestion APIs to be developed in a way that is agnostic to batch versus streaming, but is also lightweight and blends transparently with the rest of the Beam programming model, enabling and popularizing new, highly modular design patterns for data ingestion APIs.

Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using Splittable DoFn.
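To make the idea behind Splittable DoFn a little more concrete, here is a toy model in plain Python. It is not the Beam API and not code from the talk: `OffsetRange`, `OffsetRangeTracker`, `read_records`, and `run_demo` are hypothetical names chosen for illustration. The sketch assumes the general shape of the concept, in which a Map-like function processes an element together with a restriction (here, a range of offsets), must claim each position before emitting output for it, and can have the unclaimed remainder of its work split off and handed to another worker.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical restriction: the half-open range [start, stop) of work."""
    start: int
    stop: int


class OffsetRangeTracker:
    """Hypothetical tracker deciding which positions may still be processed."""

    def __init__(self, restriction: OffsetRange) -> None:
        self.start = restriction.start
        self.stop = restriction.stop
        self.last_claimed: Optional[int] = None

    def try_claim(self, position: int) -> bool:
        """Claim a position; False means it is outside the current range."""
        if position >= self.stop:
            return False
        self.last_claimed = position
        return True

    def try_split(self) -> Optional[OffsetRange]:
        """Give away the second half of the unclaimed work, if there is enough."""
        next_unclaimed = self.start if self.last_claimed is None else self.last_claimed + 1
        remaining = self.stop - next_unclaimed
        if remaining < 2:
            return None  # too little unclaimed work to be worth splitting
        midpoint = next_unclaimed + remaining // 2
        residual = OffsetRange(midpoint, self.stop)
        self.stop = midpoint  # this tracker keeps only [start, midpoint)
        return residual


def read_records(element: List[str], tracker: OffsetRangeTracker) -> Iterator[str]:
    """Toy splittable read: emit element[i] for every position it can claim."""
    position = tracker.start
    while tracker.try_claim(position):
        yield element[position]
        position += 1


def run_demo() -> Tuple[List[str], List[str], Optional[OffsetRange], List[str]]:
    records = ["a", "b", "c", "d", "e", "f"]
    tracker = OffsetRangeTracker(OffsetRange(0, len(records)))
    reader = read_records(records, tracker)

    first = [next(reader), next(reader)]  # claims offsets 0 and 1
    residual = tracker.try_split()        # splits off part of the unclaimed work
    primary_rest = list(reader)           # finishes whatever the tracker kept

    # A second "worker" processes the residual with the same function.
    residual_out = (
        list(read_records(records, OffsetRangeTracker(residual))) if residual else []
    )
    return first, primary_rest, residual, residual_out


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is that the same function serves both halves of the work: nothing in `read_records` knows whether its range is a bounded file segment or a slice of an ongoing stream, which is roughly the batch-agnostic property the abstract describes. A real runner would decide when to split; here the split is triggered by hand.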
