November 29, 2019

Unified stateful big data processing in Apache Beam (incubating)

Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.


Talk Title	Unified stateful big data processing in Apache Beam (incubating)
Speakers	Aljoscha Krettek (Ververica)
Conference	Strata Data Conference
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	May 23-25, 2017

Apache Beam lets you process unbounded, out-of-order, global-scale data with portable high-level pipelines, but not all use cases are pipelines of simple “map” and “combine” operations. Aljoscha Krettek introduces Beam’s new State API, which brings scalability and consistency to fine-grained stateful processing. The API interoperates with Beam’s other features, such as consistent event-time windowing and windowed side inputs, and remains portable to any Beam runner, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Aljoscha covers the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner. Examples of new use cases unlocked by Beam’s new mutable state and timers include:

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)

October 31, 2019

Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.

The state of Spark in the cloud

November 29, 2019

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance of major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as a baseline.

Paint the landscape and secure your data center with Apache Spot

November 4, 2019

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture that applies unsupervised machine learning at cloud scale to anomaly detection.

Sparklyr: An R interface for Apache Spark

November 2, 2019

Sparklyr makes it easy and practical to analyze big data with R: you can filter and aggregate Spark DataFrames to bring data into R for analysis and visualization, and use R to orchestrate distributed machine learning in Spark using Spark ML and H2O Sparkling Water. Edgar Ruiz walks you through these features and demonstrates how to use sparklyr to create R functions that access the full Spark API.

Fear of and uncertainty about open source

November 16, 2019

How does a small engineering team decide which technologies to use? Or whether to be open source or not? To be self-hosted or in the cloud? Wes Chow discusses the choices Chartbeat has made, how they've succeeded and failed, and the framework by which the company makes decisions. He also argues for transparency and empathy from free and proprietary technologists to ease the pain.

Evolving and Supporting Stateful, Multi-Tenant Decisioning Applications in Production [A]

November 29, 2019

With our adoption of Kubernetes at Capital One, we have simultaneously reduced our application delivery time-to-market while providing a common platform for streaming pipelines. We leverage Kubernetes …