Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)
Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.
| Talk Title | Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating) |
|---|---|
| Speakers | Kenneth Knowles (Google) |
| Conference | Strata + Hadoop World |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 14-16, 2017 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
The rise of unbounded, out-of-order, global-scale data requires increasingly sophisticated programming models to make stream processing feasible. When computing over an unbounded stream of data, each use case entails its own balance among three factors: completeness (confidence that you have all the data), latency (how long you wait to learn from the data), and cost (the compute power added to lower latency). Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner.

Beam gives you this power by identifying and separating four concerns common to all streaming computations:

- What results are being calculated?
- Where in event time are they calculated?
- When in processing time are results materialized?
- How do refinements of results relate to each other?

Regardless of backend, these questions must be answered. With Beam, you can answer them independently, with loosely coupled APIs corresponding to each question: what (reading, transformation, aggregation, and writing); where (event-time windowing); when (watermarks and triggers); and how (accumulation modes). With these, you can build a readable, portable pipeline focused on your problem rather than the quirks of your backend, and then execute it on your runner of choice, including Apache Flink, Apache Spark, Apache Gearpump (also incubating), Apache Apex, or Google Cloud Dataflow.
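To make the four questions concrete, here is a minimal, self-contained sketch of a streaming per-key count. It deliberately does not use the Beam SDK; the class and function names are illustrative assumptions, not Beam APIs. "What" is the count, "where" is fixed event-time windows, "when" is firing a pane once the watermark passes a window's end, and "how" is accumulating mode (each pane reports the running total for its window).

```python
# Illustrative sketch only -- not the Apache Beam SDK. All names here
# (FixedWindowCounter, window_start, advance_watermark) are invented
# for the example.
from collections import defaultdict

WINDOW_SIZE = 60  # seconds of event time per fixed window


def window_start(event_time):
    """'Where': map an event timestamp to the start of its fixed window."""
    return (event_time // WINDOW_SIZE) * WINDOW_SIZE


class FixedWindowCounter:
    def __init__(self):
        self.counts = defaultdict(int)  # (key, window_start) -> count
        self.unfired = set()            # windows with data not yet emitted

    def process(self, key, event_time):
        """'What': aggregate (count) each element per key, per window."""
        w = window_start(event_time)
        self.counts[(key, w)] += 1
        self.unfired.add((key, w))

    def advance_watermark(self, watermark):
        """'When': emit a pane for each window the watermark has passed.
        'How': accumulating mode -- the pane carries the running total."""
        fired = [
            (key, w, self.counts[(key, w)])
            for key, w in sorted(self.unfired)
            if watermark >= w + WINDOW_SIZE
        ]
        for key, w, _ in fired:
            self.unfired.discard((key, w))
        return fired


counter = FixedWindowCounter()
counter.process("user", 10)  # falls in window [0, 60)
counter.process("user", 45)  # same window
counter.process("user", 70)  # falls in window [60, 120)
print(counter.advance_watermark(65))   # window [0, 60) is complete
print(counter.advance_watermark(125))  # window [60, 120) is complete
```

In Beam proper, these same decisions are expressed declaratively (a windowing strategy, a trigger, and an accumulation mode attached to a transform), and the runner, not your code, tracks watermarks and fires panes.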