High-performance clickstream analytics with Apache Phoenix and HBase

Traditional data-warehousing techniques are sometimes limited by the scalability of the tools used to implement them. Arun Thangamani explains how the architectural approaches taken by tools like Apache Phoenix and HBase allow new, highly scalable live-analytics solutions using the same traditional techniques, and showcases a successful implementation at CDK.


Talk Title	High-performance clickstream analytics with Apache Phoenix and HBase
Speakers	Arun Thangamani (CDK)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016

CDK Global (formerly ADP Dealer Services) provides digital-marketing services, front-office products, and support to 3,000+ auto dealers worldwide. Its challenge is to provide real-time clickstream analytics to dealers operating with traditional technologies. Arun Thangamani demonstrates how CDK leverages Apache Phoenix and HBase to improve performance, showcasing a successful implementation that runs on just five nodes and enables clickstream analysis for marketing and sales.

At the heart of CDK’s live-analytics solution is the core Phoenix-HBase table, comprising 1.5 billion rows and 15 columns. A typical query filters for 0.5–1.5 million rows and aggregates them to feed a live dealer-analytics service. Multiple ETL workflows feed the table throughout the day. Inserts into the table are often as large as 25 million rows, which still load in less than five minutes. In addition, one of the primary requirements is to keep exactly N days’ worth of data, which is achieved by using the timestamp property of HBase cells, avoiding explicit external deletes completely.

Arun dives into how Phoenix-HBase architecturally enables CDK’s use case, as well as the technical workflow for initial loading, daily loading, and aggregation. He explores the challenges CDK faced implementing Phoenix-HBase, shares tips and techniques for performance tuning, and explains how a Hadoop-based Phoenix-HBase workflow improved response time by 10x–20x. By the end of the presentation, you’ll understand how bucketed index-based storage, query, and aggregation with Phoenix-HBase can be used for clickstream analytics as well as other use cases in general.
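
The abstract does not give CDK's schema, so the following is only a rough sketch of how a bucketed table with N-day retention might be declared in Phoenix. The table name, column names, bucket count, and retention period are all hypothetical. It assumes Phoenix's SALT_BUCKETS table option for bucketed storage and the HBase TTL property (in seconds) passed through the CREATE TABLE statement, so that cells older than N days expire without external deletes. The talk may well handle cell timestamps differently.

```python
SECONDS_PER_DAY = 86_400


def build_clickstream_ddl(table: str, salt_buckets: int, retention_days: int) -> str:
    """Build a Phoenix CREATE TABLE statement for a hypothetical clickstream table.

    SALT_BUCKETS spreads rows across a fixed number of buckets; TTL asks HBase
    to expire cells whose timestamp is older than the given number of seconds.
    """
    if not 1 <= salt_buckets <= 256:
        raise ValueError("Phoenix accepts SALT_BUCKETS between 1 and 256")
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    ttl_seconds = retention_days * SECONDS_PER_DAY
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        # Hypothetical columns; the real table reportedly has 15.
        "    dealer_id  VARCHAR   NOT NULL,\n"
        "    event_time TIMESTAMP NOT NULL,\n"
        "    session_id VARCHAR   NOT NULL,\n"
        "    page_url   VARCHAR,\n"
        "    CONSTRAINT pk PRIMARY KEY (dealer_id, event_time, session_id)\n"
        f") SALT_BUCKETS={salt_buckets}, TTL={ttl_seconds}"
    )


def build_aggregate_query(table: str) -> str:
    """Build a filter-then-aggregate query; '?' marks JDBC-style bind parameters."""
    return (
        f"SELECT page_url, COUNT(*) AS views FROM {table} "
        "WHERE dealer_id = ? AND event_time >= ? "
        "GROUP BY page_url"
    )


if __name__ == "__main__":
    print(build_clickstream_ddl("CLICKSTREAM", salt_buckets=10, retention_days=30))
    print(build_aggregate_query("CLICKSTREAM"))
```

The generated statements would be executed through whatever Phoenix client is in use. Leading the primary key with the filter column lets the WHERE clause narrow the scan to a row-key range within each bucket before aggregating.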