November 16, 2019


Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines

Talk Title: Triggers in Apache Beam (incubating): User-controlled balance of completeness, latency, and cost in streaming big data pipelines
Speakers: Kenneth Knowles (Google)
Conference: Strata + Hadoop World
Conf Tag: Making Data Work
Location: London, United Kingdom
Date: June 1-3, 2016

In a streaming data processing system, where data is generally unbounded, triggers specify when each stage of computation should emit output. With a small language of primitive conditions and ways of combining them, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources, enabling a practitioner to achieve an appropriate balance between accuracy, latency, and cost. There are many conditions under which one may choose to "fire," i.e., trigger output.

To support the variety of streaming systems in existence today and yet to come, as well as the variability built into each one, a foundational semantics for triggers must be based on fundamental aspects of stream processing. To maintain the unified batch/streaming programming model, trigger semantics must remain consistent across a number of dimensions.

Drawing on important real-world use cases, Kenneth Knowles delves into the details of the language- and runner-independent semantics developed for triggers in Apache Beam, demonstrating how the semantics support the use cases as well as all of the above variability in streaming systems. Kenneth then describes some of the particular implementations of those semantics in Google Cloud Dataflow.
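To make the completeness/latency trade-off concrete, here is a minimal toy model (not Apache Beam's actual implementation, and independent of any Beam API) of a window whose trigger fires early every N elements for low-latency partial results, and fires a final "on-time" pane once the watermark passes the end of the window. The class and field names are illustrative assumptions.

```python
class TriggeredWindow:
    """Toy model of trigger semantics for a single fixed window.

    Fires a partial pane every `early_count` elements (trading
    completeness for latency) and a final pane when the watermark
    passes the window's event-time end. Illustrative sketch only.
    """

    def __init__(self, window_end, early_count):
        self.window_end = window_end      # event-time end of the window
        self.early_count = early_count    # fire a partial pane every N elements
        self.buffer = []                  # accumulating pane state
        self.panes = []                   # emitted (label, values) panes

    def add(self, value):
        self.buffer.append(value)
        if len(self.buffer) % self.early_count == 0:
            # Early firing: emit a partial, speculative result.
            self.panes.append(("EARLY", list(self.buffer)))

    def advance_watermark(self, watermark):
        if watermark >= self.window_end:
            # On-time firing: the watermark claims the window's input
            # is complete, so emit the full accumulated pane.
            self.panes.append(("ON_TIME", list(self.buffer)))


w = TriggeredWindow(window_end=60, early_count=2)
for v in [3, 1, 4, 1, 5]:
    w.add(v)
w.advance_watermark(61)
for label, values in w.panes:
    print(label, sum(values))  # two EARLY partial sums, then the ON_TIME total
```

A stricter trigger (a larger `early_count`, or no early firings at all) lowers cost and raises latency; a looser one does the reverse. This is the balance the talk's semantics let the user control declaratively.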
