November 16, 2019


Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines

Talk Title: Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines
Speakers: Kenneth Knowles (Google)
Conference: Strata + Hadoop World
Conf Tag: Making Data Work
Location: London, United Kingdom
Date: June 1-3, 2016

In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to achieve an appropriate balance between accuracy, latency, and cost. There are many conditions under which one may choose to "fire," i.e., trigger output.

To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. To maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions.

Drawing on important real-world use cases, Kenneth Knowles delves into the details of the language- and runner-independent semantics developed for triggers in Apache Beam, demonstrating how the semantics support the use cases as well as all of the above variability in streaming systems. Kenneth then describes some of the particular implementations of those semantics in Google Cloud Dataflow.
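To make the completeness/latency trade-off concrete, here is a minimal toy model (not Apache Beam's actual implementation, and independent of any Beam API) of a window whose trigger fires early every N elements for low-latency partial results, and fires a final "on-time" pane once the watermark passes the end of the window. The class and field names are illustrative assumptions.

```python
class TriggeredWindow:
    """Toy model of trigger semantics for a single fixed window.

    Fires a partial pane every `early_count` elements (trading
    completeness for latency) and a final pane when the watermark
    passes the window's event-time end. Illustrative sketch only.
    """

    def __init__(self, window_end, early_count):
        self.window_end = window_end      # event-time end of the window
        self.early_count = early_count    # fire a partial pane every N elements
        self.buffer = []                  # accumulating pane state
        self.panes = []                   # emitted (label, values) panes

    def add(self, value):
        self.buffer.append(value)
        if len(self.buffer) % self.early_count == 0:
            # Early firing: emit a partial, speculative result.
            self.panes.append(("EARLY", list(self.buffer)))

    def advance_watermark(self, watermark):
        if watermark >= self.window_end:
            # On-time firing: the watermark claims the window's input
            # is complete, so emit the full accumulated pane.
            self.panes.append(("ON_TIME", list(self.buffer)))


w = TriggeredWindow(window_end=60, early_count=2)
for v in [3, 1, 4, 1, 5]:
    w.add(v)
w.advance_watermark(61)
for label, values in w.panes:
    print(label, sum(values))  # two EARLY partial sums, then the ON_TIME total
```

A stricter trigger (a larger `early_count`, or no early firings at all) lowers cost and raises latency; a looser one does the reverse. This is the balance the talk's semantics let the user control declaratively.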
