Autonomous ETL with materialized views
Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.
Talk Title | Autonomous ETL with materialized views |
Speakers | Adesh Rao (Qubole), Abhishek Somani (Qubole) |
Conference | Strata Data Conference |
Conf Tag | Making Data Work |
Location | London, United Kingdom |
Date | May 22-24, 2018 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
SQL-on-Hadoop engines like Hive, Presto, Impala, Drill, and Spark SQL have made major strides in improving the performance of ad hoc and reporting queries. A big part of that improvement comes from storing data sorted, bucketed, or partitioned on key columns. However, experience shows that these techniques are not used appropriately because of their high operational overhead. As a result, users have to live with either slow queries or unmanageable operational issues such as a very large number of partitions.

Qubole uses materialized views in Apache Hive to provide autonomous ETL, enabling data engineering teams to restructure data into the right format and structure for their workloads. Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.

Adesh and Abhishek first make the case for materialized views as the foundation for autonomous ETL to restructure data. They then discuss the challenges that materialized views pose and how the framework addresses them, particularly the creation and use of materialized views, automatic detection of changes to source tables and the consequent invalidation of related materialized views, and automatic full and partial refreshes of materialized views on invalidation. Although Qubole uses these techniques with Apache Hive and Presto, they have been implemented in an engine-agnostic fashion so that engines such as Spark SQL can use them as well.
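
The invalidate-and-refresh cycle described in the abstract can be illustrated with a small, engine-independent sketch. This is not Qubole's implementation, which the abstract does not detail. The class names (`MaterializedView`, `ViewRegistry`), the partition strings, and the rule that an unlocalized change forces a full refresh are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class MaterializedView:
    """Hypothetical bookkeeping record for one materialized view."""
    name: str
    source_table: str
    # Source partitions that changed since the last refresh.
    stale_partitions: Set[str] = field(default_factory=set)
    # True when a change could not be tied to specific partitions.
    fully_stale: bool = False

    @property
    def is_valid(self) -> bool:
        # Only a valid view may be used to answer queries (strict correctness).
        return not self.fully_stale and not self.stale_partitions


class ViewRegistry:
    """Hypothetical registry mapping source tables to dependent views."""

    def __init__(self) -> None:
        self._views_by_source: Dict[str, List[MaterializedView]] = {}

    def register(self, view: MaterializedView) -> None:
        self._views_by_source.setdefault(view.source_table, []).append(view)

    def on_source_change(self, table: str, partition: Optional[str] = None) -> None:
        """Invalidate every view built on `table`.

        Assumption: partition=None means the change cannot be localized,
        so the view needs a full refresh.
        """
        for view in self._views_by_source.get(table, []):
            if partition is None:
                view.fully_stale = True
            else:
                view.stale_partitions.add(partition)

    def refresh(self, view: MaterializedView) -> Tuple[str, List[str]]:
        """Pick a refresh plan, then mark the view valid again.

        Returns ("full", []), ("partial", [partitions]), or ("none", []).
        A real system would recompute data here; this sketch only plans.
        """
        if view.fully_stale:
            plan = ("full", [])
        elif view.stale_partitions:
            plan = ("partial", sorted(view.stale_partitions))
        else:
            plan = ("none", [])
        view.fully_stale = False
        view.stale_partitions.clear()
        return plan


# Usage: a write to one partition invalidates the view and triggers a partial refresh.
registry = ViewRegistry()
daily_sales_mv = MaterializedView(name="daily_sales_mv", source_table="sales")
registry.register(daily_sales_mv)
registry.on_source_change("sales", partition="dt=2018-05-22")
refresh_plan = registry.refresh(daily_sales_mv)
```

Tracking staleness per partition is what makes a partial refresh possible in this sketch: only the changed partitions are recomputed, and the view is withheld from query rewriting until the refresh completes.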