Beyond Hadoop at Yahoo: Interactive analytics with Druid
Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.
|Talk Title|Beyond Hadoop at Yahoo: Interactive analytics with Druid|
|Conference|Strata + Hadoop World|
|Conf Tag|Make Data Work|
|Location|New York, New York|
|Date|September 27-29, 2016|
Yahoo initially built Hadoop to relieve an acute pain point: efficiently storing and processing large volumes of data. Since Yahoo open sourced Hadoop, it has been widely adopted across the technology world. However, time has taught us that when a system becomes extremely popular for solving one class of problems, its limitations in solving other problems become more apparent.

Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications. Millions of users around the globe interact with Yahoo through their web browsers and mobile devices, and these interactions generate billions of events every day. As Yahoo’s data volumes have grown, the company has faced increasing demand to make the data more accessible, both to internal users and to its customers. Not all of Yahoo’s end users are backend analysts, and many have no prior experience with traditional analytic tools, so Yahoo wanted to build simple, interactive data applications that anyone could use to derive insights from data. To support these use cases, Yahoo elected to invest in the Druid open source project.

Today, Yahoo runs multiple Druid clusters to support analytics for a variety of use cases, including application performance, user activity, and ads metrics. Each use case demands that Yahoo’s data applications update in real time and handle interactive, ad hoc querying at very high scale. Himanshu explores Yahoo’s use cases with Druid, shares lessons learned from scaling Druid deployments, monitoring clusters, and ingesting data, and offers strategies for accelerating queries by leveraging approximate sketch-based algorithms.
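To give a feel for what "approximate sketch-based algorithms" can mean in practice, here is a minimal Python sketch of one simple member of that family, a K-minimum-values (KMV) estimator for distinct counts. The abstract does not name a specific algorithm, so this is an illustrative stand-in rather than the method used at Yahoo or inside Druid; the class name `KMVSketch`, its methods, and the `user-…` identifiers are all hypothetical. The idea is to keep only the `k` smallest hash values seen so far and infer the number of distinct items from how small the k-th value is.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-minimum-values sketch for approximate distinct counts.

    Hypothetical example code; not taken from Druid or any Yahoo library.
    """

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def add(self, item):
        self._add_hash(self._hash(item))

    def estimate(self):
        # With fewer than k distinct hashes the count is exact.
        if len(self._heap) < self.k:
            return float(len(self._heap))
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest

    def merge(self, other):
        # The union of two sketches is itself a valid sketch of the union.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(100_000):
        sketch.add(f"user-{i % 50_000}")  # 50,000 distinct ids, each seen twice
    print(f"estimated distinct users: {sketch.estimate():.0f}")
```

The memory used is bounded by `k` regardless of how many events are processed, and the relative error shrinks roughly as one over the square root of `k`. Because two sketches can be merged without revisiting the raw events, partial results computed per time slice or per node can be combined at query time, which is the property that makes this style of algorithm attractive for interactive queries.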