Triggers in Apache Beam (incubating)
Triggers specify when a stage of computation should emit output. With a small language of primitive conditions, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources. Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow.
Talk Title | Triggers in Apache Beam (incubating) |
Speakers | Kenneth Knowles |
Conference | Strata + Hadoop World |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 27-29, 2016 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and multiple ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to strike an appropriate balance among accuracy, latency, and cost. Conditions under which one may choose to "fire" (that is, trigger output) include: after the system believes all data for the current window has been processed, after at least 1,000 elements have arrived for processing, when the first of trigger A and trigger B fires, or according to trigger A until trigger B fires (see the sketch following this description).

To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. Since we also aim to maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions, including:

- reordering and/or delay of data
- small bundles of data, where an operation may buffer data until a trigger fires
- large bundles of data, where an operation processes them all before firing the result to the next stage
- arbitrarily good (or bad) approximations of event time
- retrying a computation (for example, when processing time and event time may both have progressed, more data may have arrived, and we'd like to process it all together in large bundles for performance)

Drawing on important real-world use cases, Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow.
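To make the conditions above concrete, here is a minimal sketch using the Beam Java SDK's trigger combinators. It is illustrative only: the input collection `events`, the one-minute window size, the element count, and the delays are assumptions chosen for the example, not anything prescribed by the talk.

```java
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class TriggerSketch {
  // "events" is assumed to be an unbounded PCollection produced earlier in the pipeline.
  static PCollection<String> window(PCollection<String> events) {
    return events.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(
                // Fire once the system believes all data for the window has arrived
                // (the watermark passes the end of the window)...
                AfterWatermark.pastEndOfWindow()
                    // ...and also emit speculative early results whenever the first
                    // of these conditions is met: at least 1,000 buffered elements,
                    // or 30 seconds of processing time since the pane's first element.
                    .withEarlyFirings(
                        AfterFirst.of(
                            AfterPane.elementCountAtLeast(1000),
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardSeconds(30)))))
            // Accept late data for up to ten minutes past the watermark, and emit
            // only each pane's new elements rather than accumulating across panes.
            .withAllowedLateness(Duration.standardMinutes(10))
            .discardingFiredPanes());
  }
}
```

The "according to trigger A until trigger B fires" composition mentioned above is expressed in the same SDK with `Repeatedly.forever(triggerA).orFinally(triggerB)`.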