October 31, 2019

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)

Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.


Talk Title	Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)
Speakers	Kenneth Knowles (Google)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 14-16, 2017

The rise of unbounded, out-of-order, global-scale data requires increasingly sophisticated programming models to make stream processing feasible. When computing over an unbounded stream of data, each use case entails its own balance between three factors: completeness (confidence that you have all the data), latency (how long you wait to learn from the data), and cost (adding compute power to lower latency). Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner.

Beam gives you this power by identifying and separating four concerns common to all streaming computations, posed as four questions: what, where, when, and how. Regardless of backend, these questions must be answered. With Beam, you can answer them independently with loosely coupled APIs corresponding to each question: what (reading, transformation, aggregation, and writing); where (event-time windowing); when (watermarks and triggers); and how (accumulation modes).

With these, you can build a readable and portable pipeline focused on your problem rather than the quirks of your backend, which you can then execute on your runner of choice, including Apache Flink, Apache Spark, Apache Gearpump (also incubating), Apache Apex, or Google Cloud Dataflow.
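To make the what/where/when/how separation concrete, here is a minimal toy sketch in plain Python. It does not use the Beam SDK, and every name in it (`window_start`, `run`, the sample `events`) is hypothetical and chosen only for illustration. It assumes fixed 60-second event-time windows and a simplistic "emit after every element" rule standing in for a real watermark-based trigger.

```python
from collections import defaultdict


def window_start(event_time, size):
    """Where: map an event timestamp to the start of its fixed window."""
    return event_time - (event_time % size)


def run(events, window_size, accumulating):
    """Process (key, value, event_time) tuples in arrival order.

    What:  sum the values per key.
    Where: fixed event-time windows of `window_size` seconds.
    When:  emit a pane after every element (a stand-in for a real trigger).
    How:   accumulating panes carry the running total for the window;
           discarding panes carry only the newly arrived value.
    """
    totals = defaultdict(int)
    panes = []
    for key, value, event_time in events:
        slot = (key, window_start(event_time, window_size))
        totals[slot] += value
        emitted = totals[slot] if accumulating else value
        panes.append((slot[0], slot[1], emitted))
    return panes


# Arrival order differs from event-time order: the third element
# belongs to the first window but arrives after the second window opens.
events = [("a", 3, 12), ("a", 4, 61), ("a", 5, 47)]

accumulating_panes = run(events, window_size=60, accumulating=True)
discarding_panes = run(events, window_size=60, accumulating=False)
```

In this sketch, switching `accumulating` changes only how the late element refines the first window's result; the aggregation and the windowing stay untouched, which is the kind of independence the four questions are meant to provide.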

Why stream? The advantages of working with streaming data

October 31, 2019

Life doesn't happen in batches. Working with data from continuous events as data streams better fits the way life happens, but doing so presents some challenges. Ellen Friedman examines the advantages and issues involved in working with streaming data, takes a look at emerging technologies for streaming, and describes best practices for this style of work.

Zillow: Transforming real estate through big data and machine learning

October 30, 2019

Zillow pioneered public access to unprecedented information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate.

Using R for scalable data analytics: From single machines to Hadoop Spark clusters

October 31, 2019

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Virtualizing Hadoop and Spark: Architecture, performance, and best practices (sponsored by VMware)

October 31, 2019

Justin Murray outlines the benefits of virtualizing Hadoop and Spark, covering the main architectural approaches at a technical level and demonstrating how the core Hadoop architecture maps into virtual machines and how those relate to physical servers. You'll gain a set of design approaches and best practices to make your application infrastructure fit well with the virtualization layer.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop

October 31, 2019

Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.

Uber's data science workbench

October 31, 2019

Peng Du and Randy Wei offer an overview of Uber's data science workbench, a central platform where data scientists perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.