November 25, 2019

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

When Kafka stream processing pipelines fail, users are often left worried about losing data when they restart their application. Jordan Hambleton and Guru Medasani explain how offset management lets users restore the state of the stream throughout its lifecycle, handle unexpected failures, and improve the accuracy of results.


Talk Title	How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark
Speakers	Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)
Conference	Strata Data Conference
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 6-8, 2018

Streaming data continuously from Kafka allows users to gain insights faster, but when these pipelines fail, users are often left worried about losing data when they restart their application. Jordan Hambleton and Guru Medasani explain how offset management lets users restore the state of the stream throughout its lifecycle, handle unexpected failures, and improve the accuracy of results. They demonstrate how Apache Spark integrates with Apache Kafka to stream data in a distributed and scalable fashion, cover considerations and approaches for building fault-tolerant streams, and detail several offset management strategies for recovering a stream and preventing data loss.
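The idea behind offset management can be illustrated with a minimal sketch. It is not taken from the talk and it does not use the Spark or Kafka APIs: `OffsetStore`, `run_batch`, and `simulate` are hypothetical names, the "topic" is an in-memory list, and the store stands in for whatever durable location a real pipeline would use. The sketch assumes one common strategy, committing the offset only after a batch has been fully processed, so that a restart resumes from the last committed position.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a real pipeline would read from Kafka and keep
# offsets in a durable store rather than in process memory.
TopicPartition = Tuple[str, int]


class OffsetStore:
    """Records the next offset to read for each topic partition."""

    def __init__(self) -> None:
        self._next: Dict[TopicPartition, int] = {}

    def load(self, tp: TopicPartition) -> int:
        # A partition that was never committed starts from offset 0.
        return self._next.get(tp, 0)

    def commit(self, tp: TopicPartition, next_offset: int) -> None:
        self._next[tp] = next_offset


def run_batch(log: List[str], tp: TopicPartition, store: OffsetStore,
              process: Callable[[str], None], batch_size: int) -> int:
    """Process one batch, committing the offset only after it succeeds."""
    start = store.load(tp)
    batch = log[start:start + batch_size]
    for record in batch:
        process(record)  # if this raises, nothing below is committed
    store.commit(tp, start + len(batch))
    return len(batch)


def simulate() -> Tuple[List[str], int]:
    """Run a small stream that fails once and is then restarted."""
    log = ["a", "b", "c", "d", "e"]
    tp = ("events", 0)
    store = OffsetStore()
    seen: List[str] = []
    failed_once = {"done": False}

    def process(record: str) -> None:
        # Simulate a single failure the first time record "d" is handled.
        if record == "d" and not failed_once["done"]:
            failed_once["done"] = True
            raise RuntimeError("simulated failure")
        seen.append(record)

    while store.load(tp) < len(log):
        try:
            run_batch(log, tp, store, process, batch_size=2)
        except RuntimeError:
            pass  # "restart": the next iteration reloads the committed offset
    return seen, store.load(tp)
```

In this sketch no record is lost after the failure, but record `"c"` is processed twice because its batch was replayed. That is the usual trade-off of committing after processing: delivery is at least once, so downstream writes need to be idempotent, or the offsets need to be stored atomically with the results, to avoid duplicates.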

kafka management streaming apache spark scalable pipeline

Stream processing with Kafka

November 20, 2019

Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.

Apache Kafka + Apache Mesos = Highly scalable streaming microservices

November 18, 2019

Kai Wähner shares a highly scalable, mission-critical infrastructure using Apache Kafka and Apache Mesos: Kafka brokers are used as the distributed messaging backbone; Kafka's Streams API embeds stream processing into any external application without the need for a dedicated streaming cluster; and Mesos is used as a scalable infrastructure to leverage the benefits of a cloud-native platform.

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

November 22, 2019

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

November 20, 2019

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale

November 24, 2019

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user's probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib's alternating least squares (ALS) approach.

Kafka streaming applications with Akka Streams and Kafka Streams

November 24, 2019

Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams in microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.