Triggers in Apache Beam (incubating)

Triggers specify when a stage of computation should emit output. With a small language of primitive conditions, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources. Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow.


Talk Title	Triggers in Apache Beam (incubating)
Speakers
Conference	Strata + Hadoop World
Conf Tag	Make Data Work
Location	New York, New York
Date	September 27-29, 2016
URL	Talk Page
Slides	Talk Slides
Video

In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and multiple ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to achieve an appropriate balance between accuracy, latency, and cost. (Some conditions under which one may choose to “fire”—aka trigger output—include after the system believes all data for the current window is processed, after at least 1,000 elements have arrived for processing, when the first of trigger A and trigger B fires, or according to trigger A until trigger B fires.) To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. Since we also aim to maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions, including reordering and/or delay of data, small bundles of data where an operation may buffer data until a trigger fires, large bundles of data where an operation processes it all before firing the result to the next stage, arbitrarily good (or bad) approximations of event time, and retrying a computation (for example, when processing time and event time may both have progressed, and more data may have arrived, and we’d like to process it all together in large bundles for performance). Drawing on important real-world use cases, Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow.

Triggers in Apache Beam (incubating)

Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines

Watermarks: Time and progress in Apache Beam (incubating) and beyond

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX (Half Day)

Watermarks: Time and progress in streaming dataflow and beyond

TensorFlow: Large-scale analytics and distributed machine learning with TensorFlow, BigQuery, and Dataflow (Apache Beam)

Bulk image processing using Kubernetes