October 25, 2019

224 words 2 mins read

Embeddable data transformation for real-time streams

Embeddable data transformation for real-time streams

Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for data transformation that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.

Talk Title Embeddable data transformation for real-time streams
Speakers Joey Echeverria (Rocana)
Conference Strata + Hadoop World
Conf Tag Big Data Expo
Location San Jose, California
Date March 29-31, 2016
URL Talk Page
Slides Talk Slides

Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, the process of performing common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic. Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This allows users to write command data-transformation rules once and reuse them in multiple contexts. A common pattern is to consume a single, raw stream and transform it using the same rules before storing in different repositories such as Apache Solr for search and Apache Hadoop HDFS for deep storage. Topics include:

comments powered by Disqus