Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned
Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.
Talk Title | Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned |
Speakers | Mauricio Aristizabal (Impact) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 11-13, 2018 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
The new platform satisfies several requirements. It uses the same SQL, JDBC, and star schema already in use by thousands of reports and apps. Every event is reflected in every store within 30 seconds (with a path to single-digit latency). It contains multiple stores for performant access across many different use cases. It's scalable, available, and secure, all automatically, simply by using the chosen stack. Engineers and data scientists can interface with it in multiple languages and frameworks. And it's code based, so it's easier to test, debug, diff, maintain, profile, and reuse than graphical drag-and-drop tools.

Change data capture (CDC) agents load every change in the company's OLTP MySQL DBs into Kafka. A data lake in HBase stores every one of those OLTP changes (even each change to the same record and column). The platform performs streaming dimension, fact, and aggregation processing with Spark and Spark SQL and includes a "fast" star schema data warehouse in Kudu, where streaming Kudu writers update facts and aggregates in real time. It also includes authorization and a data dictionary with Sentry and Navigator. Topics include:
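The abstract itself contains no code, but the CDC flow it describes can be sketched in plain Python. This is a minimal, hypothetical illustration of the two write paths: an append-only lake that keeps every version of a record (as HBase cell versioning does) and an upserted fact table that BI queries hit (as a streaming Kudu writer would maintain). All names, the event shape, and the in-memory stores are assumptions for illustration; the real pipeline consumes events from Kafka and writes to HBase and Kudu.

```python
# Hypothetical sketch of the dual write path described above:
# (1) every CDC change is appended to the lake, preserving history;
# (2) the latest values are upserted into a "fast" fact table.
from collections import defaultdict

data_lake = defaultdict(list)   # stands in for HBase: every version kept
fact_table = {}                 # stands in for a Kudu table: latest row wins

def handle_cdc_event(event):
    """Process one change-data-capture event from the OLTP database."""
    key = (event["table"], event["pk"])
    # Lake path: append the full change, even repeat edits to one column.
    data_lake[key].append({"ts": event["ts"], "after": event["after"]})
    # Warehouse path: merge changed columns so queries see the current row.
    fact_table[key] = {**fact_table.get(key, {}), **event["after"]}

# Two changes to the same record: the lake keeps both versions,
# while the fact table reflects only the current state.
handle_cdc_event({"table": "orders", "pk": 1, "ts": 100,
                  "after": {"status": "pending", "amount": 50}})
handle_cdc_event({"table": "orders", "pk": 1, "ts": 200,
                  "after": {"status": "shipped"}})

print(len(data_lake[("orders", 1)]))        # 2 versions retained
print(fact_table[("orders", 1)]["status"])  # shipped
```

Keeping both paths loosely coupled through a shared event stream (Kafka, in the talk's architecture) is what lets each store be optimized independently: full history in the lake, fast current-state queries in the warehouse.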