Parquet performance tuning: The missing guide
Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet's features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he's learned, creating the missing guide you need.
| Talk Title | Parquet performance tuning: The missing guide |
|------------|-----------------------------------------------|
| Speakers | Ryan Blue |
| Conference | Strata + Hadoop World |
| Conf Tag | Make Data Work |
| Location | New York, New York |
| Date | September 27-29, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Increasing demand for more and higher-granularity data continues to push the boundaries of what big data technologies can process. Netflix's Big Data Platform team manages a highly organized and curated data warehouse in Amazon S3 with over 40 petabytes of data. At this scale, the team is reaching the limits of partitioning, with thousands of tables and millions of partitions per table. To work around the diminishing returns of additional partition layers, the team increasingly relies on the Parquet file format and recently made additions to Presto that yielded a more than 100x performance improvement for some real-world queries over Parquet data. Similar functionality is now being added to other processing engines, including Spark, Hive, and Pig. Because data written with Parquet's defaults is not optimized for these newer features, the team is also tuning how it writes Parquet to maximize the benefit.
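To give a flavor of the write-side tuning the description alludes to, here is a minimal PySpark sketch. It is not taken from the talk: the S3 paths, the `device_id` column, and the byte sizes are hypothetical stand-ins. The idea it illustrates is that sorting data before writing keeps each Parquet row group's min/max statistics narrow, so engines like Presto can skip row groups that cannot match a filter.

```python
# A minimal sketch (not from the talk) of write-side Parquet tuning with
# PySpark. Paths, the sort column, and the byte sizes are hypothetical;
# the right values depend on your data and query patterns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write-tuning").getOrCreate()

events = spark.read.parquet("s3://warehouse/events/raw/")

(events
    # Sorting within each task's partition clusters values, so every Parquet
    # row group covers a narrow min/max range and readers can skip whole
    # row groups using the footer statistics.
    .sortWithinPartitions("device_id")
    .write
    # Parquet writer settings; Spark forwards data source options for file
    # formats into the Hadoop configuration used by the Parquet writer.
    # Row-group size in bytes: smaller groups allow finer-grained skipping,
    # larger groups give better sequential scan throughput.
    .option("parquet.block.size", 128 * 1024 * 1024)
    # Page size in bytes: the unit of reading within a row group.
    .option("parquet.page.size", 1024 * 1024)
    .mode("overwrite")
    .parquet("s3://warehouse/events/tuned/"))
```

Dictionary encoding (`parquet.enable.dictionary`, on by default) is also relevant here: dictionary pages give readers another cheap way to rule out row groups whose values cannot satisfy a predicate, which is one of the techniques behind the kind of row-group pruning the description credits for its Presto speedups.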