November 3, 2019


Real-time analytics using Kudu at petabyte scale


Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Talk Title: Real-time analytics using Kudu at petabyte scale
Speakers: Sridhar Alla (BlueWhale), Shekhar Agrawal (Comcast)
Conference: Strata + Hadoop World
Conf Tag: Big Data Expo
Location: San Jose, California
Date: March 14-16, 2017
URL: Talk Page
Slides: Talk Slides
Video:

Kudu is redefining the big data ecosystem and opening doors to capabilities not previously available. Sridhar Alla and Shekhar Agrawal explain how Comcast has deployed the largest Kudu cluster to date and is rapidly building advanced applications that provide real-time analytics at petabyte scale while avoiding expensive denormalization processes, covering how real-time analytics on Kudu scales far beyond what other NoSQL databases support. Sridhar and Shekhar share the practical implementation details and discuss extensive benchmarks at table sizes of one trillion events. Spark processes both the historical data and the real-time events streaming through Kafka, while the middle tier queries Kudu tables to generate subsecond real-time dashboards, all while retaining the power of Hadoop for batch analytics and integrations with other platforms. This is key to the platform's success: previously, Comcast had to rely on a variety of multitiered architectures to get fast storage that could still be updated like a NoSQL engine, yet without the lag caused by several thousand updates per second.
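
To make the data flow concrete, the sketch below shows one way such a pipeline could be wired up with the kudu-spark connector: Spark Structured Streaming reads events from Kafka and upserts them into a Kudu table that the dashboard tier can query directly. This is an illustration only, not code from the talk; the broker address, topic name, Kudu master, table name, checkpoint path, and event schema are all hypothetical placeholders.

```scala
// Minimal sketch of a Kafka -> Spark -> Kudu streaming pipeline.
// All connection strings, names, and the schema below are assumptions.
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object KuduStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kudu-realtime-sketch")
      .getOrCreate()

    // Assumed event schema: one row per incoming event.
    val eventSchema = new StructType()
      .add("event_id", StringType)
      .add("account_id", StringType)
      .add("event_ts", TimestampType)
      .add("metric", DoubleType)

    // Stream raw events from Kafka (broker and topic are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).as("e"))
      .select("e.*")

    // Upsert each micro-batch into Kudu so dashboards see fresh rows
    // without a separate denormalization step.
    val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)
    val upsertBatch: (DataFrame, Long) => Unit = (batch, _) =>
      kuduContext.upsertRows(batch, "impala::analytics.events")

    val query = events.writeStream
      .option("checkpointLocation", "/tmp/kudu-sketch-checkpoint")
      .foreachBatch(upsertBatch)
      .start()

    query.awaitTermination()
  }
}
```

Because Kudu supports fast upserts and scans on the same table, the dashboard tier can query it directly (for example, through Impala) while the same data remains available to batch jobs on the Hadoop side.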
