Building Robust Streaming Data Pipelines with Apache Spark
| Talk Title | Building Robust Streaming Data Pipelines with Apache Spark |
| --- | --- |
| Speakers | Zak Hassan (Senior Software Engineer - AI/ML CoE, CTO Office, Red Hat Inc.) |
| Conference | Open Source Summit North America |
| Conf Tag | |
| Location | Los Angeles, CA, United States |
| Date | Sep 10-14, 2017 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
There are challenges in architecting a solution that lets developers stream data into Kafka and manage dirty data, which is always an issue in ETL pipelines. I'd like to share lessons learned and demonstrate how Apache Kafka, Apache Spark, and Apache Camel can be put together to give developers a continuous data pipeline for their Spark applications. Without data, it is very difficult to take advantage of Spark's full capabilities. Companies often have their data stored in many different systems, and Apache Camel allows developers to extract, transform, and load that data into many targets, Apache Kafka being one example. Kafka is great for aggregating data in a centralized location, and Spark already comes with a built-in connector for Kafka. I'll also be explaining lessons learned from running these technologies inside Docker.
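As a rough sketch of the kind of pipeline the abstract describes, the two snippets below use hypothetical topic names, broker addresses, and file paths, and assume the camel-core, camel-kafka, and spark-sql-kafka dependencies are on the classpath; they are not taken from the talk itself. First, a minimal Camel route that picks up files from a local directory and publishes them to a Kafka topic:

```scala
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext

object FileToKafkaRoute {
  def main(args: Array[String]): Unit = {
    val context = new DefaultCamelContext()

    // Route: read files from a directory and publish each one to a Kafka topic.
    context.addRoutes(new RouteBuilder {
      override def configure(): Unit = {
        from("file:data/inbox?noop=true")                // hypothetical source directory
          .to("kafka:raw-events?brokers=localhost:9092") // hypothetical topic and broker
      }
    })

    context.start()
    Thread.sleep(60000) // keep the route running briefly for this sketch
    context.stop()
  }
}
```

And a Spark Structured Streaming job that consumes the same topic through Spark's built-in Kafka source and drops obviously dirty (empty) records before further processing:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaToSparkStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-streaming-example")
      .getOrCreate()

    import spark.implicits._

    // Read the stream with Spark's built-in Kafka connector.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "raw-events")                    // hypothetical topic
      .load()

    // Kafka values arrive as bytes; cast to string and filter out blank records.
    val cleaned = raw
      .selectExpr("CAST(value AS STRING) AS line")
      .filter($"line".isNotNull && length(trim($"line")) > 0)

    val query = cleaned.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

Filtering malformed records at the edge of the stream, as in the sketch above, is one way to keep dirty data from propagating into downstream transformations.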