High-performance clickstream analytics with Apache Phoenix and HBase
Traditional data-warehousing techniques are sometimes limited by the scalability of the tools used to implement them. Arun Thangamani explains how the architectural approaches taken by tools such as Apache Phoenix and HBase allow new, highly scalable live-analytics solutions built on the same traditional techniques, and showcases a successful implementation at CDK.
Talk Title | High-performance clickstream analytics with Apache Phoenix and HBase |
Speakers | Arun Thangamani (CDK) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 29-31, 2016 |
URL | Talk Page |
Slides | Talk Slides |
CDK Global (formerly ADP Dealer Services) provides digital-marketing services, front-office products, and support to more than 3,000 auto dealers worldwide. Its challenge is to provide real-time clickstream analytics to dealers operating with traditional technologies. Arun Thangamani demonstrates how CDK leverages Apache Phoenix and HBase to improve performance and showcases a successful implementation that runs on just five nodes, enabling clickstream analysis for marketing and sales.

At the heart of CDK’s live-analytics solution is the core Phoenix-HBase table, comprising 1.5 billion rows and 15 columns. An average use case filters for 0.5–1.5 million rows and aggregates them to feed a live dealer-analytics service. Multiple ETL workflows supply the input for the table throughout the day. Inserts into the table are often as large as 25 million rows and still load in less than five minutes. In addition, one of the primary requirements is to keep exactly N days’ worth of data, which is achieved by using the timestamp property of the cells in HBase, avoiding explicit external deletes completely.

Arun dives into how Phoenix-HBase architecturally enables CDK’s use case and walks through the technical workflow for initial loading, daily loading, and aggregation. He also explores the challenges CDK faced implementing Phoenix-HBase, shares tips and techniques for performance tuning, and explains how the Hadoop Phoenix-HBase-based workflow improved response time by 10x–20x. By the end of the presentation, you’ll understand how bucketed index-based storage, query, and aggregation with Phoenix-HBase can be used for clickstream analytics as well as for other use cases in general.
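The abstract does not show CDK's schema or queries, so the following is only a minimal Python sketch of the three ideas it names: bucketed storage, timestamp-based retention of N days, and filter-then-aggregate reads. It is a plain in-memory model, not Phoenix or HBase code. All names (`salt_bucket`, `is_live`, `aggregate_clicks`, the `dealer_id`, `page`, `clicks`, and `ts_ms` fields) are hypothetical, and `zlib.crc32` is a stand-in hash rather than the one Phoenix uses for salted tables.

```python
import zlib
from collections import defaultdict

MS_PER_DAY = 24 * 60 * 60 * 1000


def salt_bucket(row_key: str, num_buckets: int) -> int:
    """Assign a row key to one of num_buckets buckets.

    crc32 is an illustrative stand-in; it is NOT Phoenix's salting hash.
    Spreading keys over buckets avoids hot-spotting on sequential keys.
    """
    return zlib.crc32(row_key.encode("utf-8")) % num_buckets


def is_live(cell_ts_ms: int, now_ms: int, retention_days: int) -> bool:
    """Model timestamp-based expiry: a cell older than the retention
    window is treated as gone, so no explicit delete is ever issued."""
    return now_ms - cell_ts_ms < retention_days * MS_PER_DAY


def aggregate_clicks(rows, dealer_id, now_ms, retention_days):
    """Filter rows for one dealer inside the retention window,
    then sum clicks per page (the filter-then-aggregate read path)."""
    totals = defaultdict(int)
    for row in rows:
        if row["dealer_id"] != dealer_id:
            continue
        if not is_live(row["ts_ms"], now_ms, retention_days):
            continue
        totals[row["page"]] += row["clicks"]
    return dict(totals)


# Hypothetical sample data: timestamps are expressed as day offsets.
NOW_MS = 100 * MS_PER_DAY
ROWS = [
    {"dealer_id": "d1", "page": "home", "clicks": 3, "ts_ms": 99 * MS_PER_DAY},
    {"dealer_id": "d1", "page": "home", "clicks": 2, "ts_ms": 95 * MS_PER_DAY},
    {"dealer_id": "d1", "page": "inventory", "clicks": 5, "ts_ms": 10 * MS_PER_DAY},
    {"dealer_id": "d2", "page": "home", "clicks": 7, "ts_ms": 99 * MS_PER_DAY},
]

if __name__ == "__main__":
    # With a 30-day window, the 90-day-old "inventory" row is excluded.
    print(aggregate_clicks(ROWS, "d1", NOW_MS, retention_days=30))
    print(salt_bucket("d1|home|99", num_buckets=5))
```

In a real deployment the expiry would be enforced by the storage layer during reads and compactions rather than by application code. The sketch applies the check at read time only to show why rows outside the window never need an external delete.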