December 11, 2019

Autonomous ETL with materialized views

Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.

Talk Title: Autonomous ETL with materialized views
Speakers: Adesh Rao (Qubole), Abhishek Somani (Qubole)
Conference: Strata Data Conference
Conf Tag: Making Data Work
Location: London, United Kingdom
Date: May 22-24, 2018
URL: Talk Page
Slides: Talk Slides

SQL-on-Hadoop engines like Hive, Presto, Impala, Drill, and Spark SQL have made major strides in improving the performance of ad hoc and reporting queries. A big part of that improvement comes from storing the data sorted, bucketed, or partitioned on key columns. However, experience shows that these techniques are not used appropriately because of their high operational overhead. As a result, users have to put up with slow query times or with hard-to-manage operational issues such as a very large number of partitions.

Qubole uses materialized views in Apache Hive to provide autonomous ETL, enabling data engineering teams to restructure their data into the right format and structure for their workloads. Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.

Adesh and Abhishek first make the case for materialized views as the foundation for autonomous ETL to restructure data. They then cover the challenges that materialized views pose and how the framework addresses them, particularly the creation and use of materialized views, the automatic detection of changes to source tables and the consequent invalidation of related materialized views, and automatic full and partial refreshes of materialized views on invalidation.

Although Qubole uses these techniques with Apache Hive and Presto, they have been implemented in an engine-agnostic fashion so that engines such as Spark SQL can use them as well.
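To make the invalidation and refresh steps described above more concrete, here is a minimal Python sketch of how such bookkeeping could work. It is not Qubole's implementation, which the abstract does not describe in detail. The names ViewCatalog, MaterializedView, record_write, stale_partitions, and refresh are all hypothetical, and the sketch assumes that every write to a source-table partition bumps a per-partition version counter.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MaterializedView:
    """Hypothetical record of a view and what it was last built from."""
    name: str
    source_table: str
    # partition -> source version that the view's copy was built from
    built_from: dict[str, int] = field(default_factory=dict)


class ViewCatalog:
    """Toy catalog (an assumption, not a real engine API)."""

    def __init__(self) -> None:
        # table -> partition -> version, bumped on every write
        self.table_versions: dict[str, dict[str, int]] = {}
        self.views: dict[str, MaterializedView] = {}

    def record_write(self, table: str, partition: str) -> None:
        """Change detection: note that a source partition was written."""
        parts = self.table_versions.setdefault(table, {})
        parts[partition] = parts.get(partition, 0) + 1

    def register_view(self, name: str, source_table: str) -> None:
        self.views[name] = MaterializedView(name, source_table)

    def stale_partitions(self, view_name: str) -> list[str]:
        """Partitions whose source version differs from the built version."""
        view = self.views[view_name]
        current = self.table_versions.get(view.source_table, {})
        return sorted(
            p for p, v in current.items() if view.built_from.get(p) != v
        )

    def is_valid(self, view_name: str) -> bool:
        """A view is safe to use only if no partition is stale."""
        return not self.stale_partitions(view_name)

    def refresh(self, view_name: str) -> list[str]:
        """Rebuild only the stale partitions and return their names."""
        view = self.views[view_name]
        current = self.table_versions.get(view.source_table, {})
        stale = self.stale_partitions(view_name)
        for partition in stale:
            # A real engine would recompute the partition's data here.
            view.built_from[partition] = current[partition]
        return stale
```

In this sketch a view that has never been built has every partition marked stale, so its first refresh is effectively a full refresh, while later refreshes touch only the partitions that changed. Dropped partitions and views over several source tables are deliberately left out to keep the example short.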
