Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines
![Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines](/2016/images/all/oreilly_hud65600b59d1e38fc68bc705db1cca132_23814_900x500_fit_q75_box.jpg)
Drawing on important real-world use cases, Kenneth Knowles delves into the details of the language- and runner-independent semantics developed for triggers in Apache Beam, demonstrating how the semantics support the use cases as well as all of the above variability in streaming systems. Kenneth then describes some of the particular implementations of those semantics in Google Cloud Dataflow.
| Talk Title | Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines |
| --- | --- |
| Speakers | Kenneth Knowles (Google) |
| Conference | Strata + Hadoop World |
| Conf Tag | Making Data Work |
| Location | London, United Kingdom |
| Date | June 1-3, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to achieve an appropriate balance between completeness, latency, and cost. Conditions under which one may choose to "fire" (that is, trigger output) include the watermark's estimate that a window's input is complete, the passage of a certain amount of processing time, or the arrival of a certain number of elements.

To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. To maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions, including the choice of runner and whether a pipeline executes in batch or streaming mode.
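To make the idea concrete, here is a minimal, stdlib-only Python sketch of how a composite trigger might decide when a window's pane may be emitted. This is an illustrative toy, not Beam's implementation; the class names echo Beam's trigger primitives (e.g., Beam's Java SDK composes `AfterWatermark.pastEndOfWindow().withEarlyFirings(...)`), and the discarding-pane behavior is one simplifying assumption among several.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AfterCount:
    """Toy trigger: fire once the current pane holds at least `n` elements."""
    n: int
    def should_fire(self, pane_size: int, watermark: int, window_end: int) -> bool:
        return pane_size >= self.n

@dataclass
class AfterWatermark:
    """Toy trigger: fire when the watermark says the window's input is complete,
    optionally allowing early (speculative) firings via a sub-trigger."""
    early: Optional[AfterCount] = None
    def should_fire(self, pane_size: int, watermark: int, window_end: int) -> bool:
        if watermark >= window_end:
            return True  # on-time firing: input is believed complete
        return self.early is not None and self.early.should_fire(
            pane_size, watermark, window_end)

def run(trigger, events, window_end):
    """Feed (element_timestamp, watermark) pairs for one window; return the
    pane size at each firing. Panes are discarded after firing (discarding mode)."""
    pane, firings = 0, []
    for _ts, watermark in events:
        pane += 1
        if trigger.should_fire(pane, watermark, window_end):
            firings.append(pane)
            pane = 0
    return firings

# Early firings trade extra cost for lower latency: the same input produces
# two speculative/on-time panes here, versus a single pane with no early trigger.
events = [(1, 0), (2, 0), (3, 5), (4, 10)]
print(run(AfterWatermark(early=AfterCount(2)), events, window_end=10))  # [2, 2]
print(run(AfterWatermark(), events, window_end=10))                     # [4]
```

With the early-firing sub-trigger, downstream consumers see a partial result before the watermark passes the end of the window; without it, they wait for the single, more complete on-time pane — the completeness/latency/cost balance the talk describes.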