October 31, 2019

282 words 2 mins read

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)

Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.

Talk Title Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)
Speakers Kenneth Knowles (Google)
Conference Strata + Hadoop World
Conf Tag Big Data Expo
Location San Jose, California
Date March 14-16, 2017
URL Talk Page
Slides Talk Slides
Video

The rise of unbounded, out-of-order, global-scale data requires increasingly sophisticated programming models to make stream processing feasible. When computing over an unbounded stream of data, each use case entails its own balance between three factors: completeness (confidence that you have all the data), latency (waiting to learn from the data), and cost (adding compute power to lower latency). Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner. Beam gives you this power by identifying and separating four concerns common to all streaming computations: Regardless of backend, these questions must be answered. With Beam, you can answer these questions independently with loosely coupled APIs corresponding to each question: what—reading, transformation, aggregation, and writing; where—event time windowing; when—watermarks and triggers; and how—accumulation modes. With these, you can build a readable and portable pipeline focused on your problem rather than the quirks of your backend, which you can then execute on your runner of choice, including Apache Flink, Apache Spark, Apache Gearpump (also incubating), Apache Apex, or Google Cloud Dataflow.

comments powered by Disqus