Big data computations: Comparing Apache HAWQ, Druid, and GPU databases
The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, a GPU-accelerated database. Results show that Kinetica on a single g2.8xlarge instance outperformed clusters of HAWQ and Druid nodes.
| Talk Title | Big data computations: Comparing Apache HAWQ, Druid, and GPU databases |
|---|---|
| Speakers | Vijay Agneeswaran (Walmart Labs) |
| Conference | Strata Data Conference |
| Conf Tag | Making Data Work |
| Location | London, United Kingdom |
| Date | May 23-25, 2017 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. This class is characterized by nonscalar aggregates all the way to the root of the merge tree, equivalent to a set union operation in SQL at every level of the tree. Typical big data technologies mostly support only scalar aggregates, so the set union operation must be implemented outside the data store, resulting in a nonstandard implementation and consequent inefficiencies.

Vijay Srinivas Agneeswaran explores a prototype built on top of Druid, one of the claimants to the throne of analytical data processing, to illustrate the problem. Because Druid supports only scalar aggregates, the set union operation had to be implemented at the application level. Data transfer into and out of Druid, together with the complexity of thread processing at the Java layer, led to inefficiencies, resulting in a computation time of more than 200 seconds.

With its ability to partition data along multiple dimensions, its support for full SQL queries (and, consequently, for set union operations), and its efficient distributed query optimization, Apache HAWQ looked like the ideal candidate for this use case. However, HAWQ's dependence on Hadoop's HDFS as the underlying filesystem, combined with the inherent complexity of the computation, led to poorer-than-expected results: HAWQ took about 100 seconds to process the same query against an SLA of under 10 seconds, and the multidimensional partitioning turned out to be inefficient. Vijay explains how this problem was solved with multiple independent HAWQ clusters and an intelligent client that stores metadata to route each query to the appropriate cluster, which brought the execution time down to 30 seconds.

Vijay then explores an implementation of the same query on a GPU database, Kinetica, to benchmark its performance on an Amazon EC2 g2.8xlarge instance. The response time for the same query was around 12 seconds, and with a bit more optimization, the SLA will be met.
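To make the distinction between scalar and set-union aggregation concrete, here is a minimal Python sketch of a merge tree. The node names and user IDs are hypothetical and not taken from the talk; the point is only that every internal node must carry the union of its children's sets (a nonscalar result), whereas a typical store would return only a scalar such as a count.

```python
# Minimal sketch of a distributed merge tree with set-union aggregation.
# Each leaf holds the user IDs seen by one (hypothetical) data source;
# every internal node must carry the *union* of its children's sets,
# not just a scalar aggregate such as a count.

class MergeNode:
    def __init__(self, name, user_ids=None, children=None):
        self.name = name
        self.user_ids = set(user_ids or [])
        self.children = children or []

    def aggregate(self):
        """Return the set union of this node and all of its descendants."""
        result = set(self.user_ids)
        for child in self.children:
            result |= child.aggregate()   # set union at every level of the tree
        return result


if __name__ == "__main__":
    leaves = [
        MergeNode("source_a", {"u1", "u2"}),
        MergeNode("source_b", {"u2", "u3"}),
        MergeNode("source_c", {"u4"}),
    ]
    root = MergeNode("root", children=leaves)

    users = root.aggregate()
    print(sorted(users))   # ['u1', 'u2', 'u3', 'u4']  -- the nonscalar result needed here
    print(len(users))      # 4 -- the scalar aggregate most stores return natively
```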
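The multi-cluster HAWQ arrangement can likewise be pictured with a small sketch of the intelligent client: it keeps metadata mapping partitions of the data to independent clusters and routes each query to the cluster that owns the relevant partition. The cluster endpoints, partition keys, and routing rule below are assumptions for illustration only; the talk does not specify them.

```python
# Hypothetical sketch of an "intelligent client" that stores metadata
# mapping partition keys to independent HAWQ clusters and routes each
# query to the cluster owning that partition. Endpoints are made up.

class ClusterRouter:
    def __init__(self, metadata):
        # metadata: dict mapping a partition key to a cluster endpoint,
        # e.g. {"emea": "hawq-cluster-1:5432", "apac": "hawq-cluster-2:5432"}
        self.metadata = metadata

    def route(self, partition_key):
        """Return the endpoint of the HAWQ cluster that owns this partition."""
        try:
            return self.metadata[partition_key]
        except KeyError:
            raise ValueError(f"no cluster registered for {partition_key!r}")


if __name__ == "__main__":
    router = ClusterRouter({
        "emea": "hawq-cluster-1:5432",
        "apac": "hawq-cluster-2:5432",
    })
    # The application would open a connection to the returned endpoint and
    # submit the set-union query there; each cluster stays independent.
    print(router.route("apac"))   # hawq-cluster-2:5432
```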