Real-time analytics using Kudu at petabyte scale
Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.
| Talk Title | Real-time analytics using Kudu at petabyte scale |
|------------|--------------------------------------------------|
| Speakers | Sridhar Alla (BlueWhale), Shekhar Agrawal (Comcast) |
| Conference | Strata + Hadoop World |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 14-16, 2017 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Kudu is redefining the big data ecosystem and opening doors to capabilities not previously available. Sridhar Alla and Shekhar Agrawal explain how Comcast has deployed the largest Kudu cluster to date and is rapidly developing advanced applications that deliver real-time analytics at petabyte scale while avoiding expensive denormalization processes, covering how real-time analytics on Kudu scale far beyond what other NoSQL databases allow. Sridhar and Shekhar share the practical implementation details and discuss extensive benchmarks at table sizes of one trillion events. Spark processes both historical data and the real-time events streaming through Kafka, while the middle tier queries Kudu tables to drive subsecond real-time dashboards, retaining the power of Hadoop for batch analytics and integrations with other platforms. This is key to the success of the platform: previously, Comcast had to rely on a variety of multitiered architectures to provide fast storage that could still be updated like a NoSQL engine, without the lag caused by several thousand updates per second.
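
To make the architecture concrete, here is a minimal sketch (not from the talk) of the kind of pipeline described: a Spark Structured Streaming job reads events from Kafka and upserts them into a Kudu table via the kudu-spark connector, so the middle tier can query that table for subsecond dashboards. The Kafka broker, topic, checkpoint path, Kudu master address, table name, and event schema are all hypothetical placeholders.

```scala
// Sketch only: stream events from a placeholder Kafka topic into a placeholder Kudu table.
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object KafkaToKuduSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-kudu-sketch")
      .getOrCreate()

    // Assumed event schema; real events would carry many more fields.
    val eventSchema = new StructType()
      .add("event_id", StringType)
      .add("device_id", StringType)
      .add("event_ts", LongType)
      .add("event_type", StringType)

    // Read the raw event stream from Kafka (broker and topic names are placeholders).
    val events: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).as("e"))
      .select("e.*")

    // KuduContext performs the writes; the master address is a placeholder.
    val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

    // Upsert each micro-batch so rows can be updated in place, avoiding the
    // denormalization and rewrite cycle required by append-only storage.
    val query = events.writeStream
      .option("checkpointLocation", "/tmp/kudu-events-checkpoint")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        kuduContext.upsertRows(batch, "impala::analytics.events") // placeholder table name
      }
      .start()

    query.awaitTermination()
  }
}
```

With events landing in Kudu this way, the dashboard tier can query the same table through Impala or Spark SQL while batch jobs on the Hadoop side continue to work against the identical data, which is the pattern the description above attributes to the platform.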