December 11, 2019

291 words 2 mins read

Parquet performance tuning: The missing guide


Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet's features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he's learned, creating the missing guide you need.

Talk Title: Parquet performance tuning: The missing guide
Speakers: Ryan Blue
Conference: Strata + Hadoop World
Conf Tag: Make Data Work
Location: New York, New York
Date: September 27-29, 2016
URL: Talk Page
Slides: Talk Slides

Increasing demand for more and higher-granularity data continues to push the boundaries of what big data technologies can process. Netflix’s Big Data Platform team manages a highly organized and curated data warehouse in Amazon S3 with over 40 petabytes of data. At this scale, the team is reaching the limits of partitioning, with thousands of tables and millions of partitions per table.

To work around the diminishing returns of additional partition layers, the team increasingly relies on the Parquet file format and recently made additions to Presto that produced a more than 100x performance improvement for some real-world queries over Parquet data; similar functionality is now being added to other processing engines, including Spark, Hive, and Pig. Because data written in Parquet is not optimized by default for these newer features, the team is tuning how it writes Parquet to maximize the benefit, as sketched in the example below.
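The talk’s materials aren’t reproduced here, but the tuning it describes centers on how data is laid out at write time. The following is a minimal Spark (Scala) sketch of that idea, not code from the talk: it sorts rows within each output file so Parquet’s per-row-group min/max statistics stay tight (letting engines like Presto skip whole row groups), and lowers the row group size via the `parquet.block.size` Hadoop setting. The dataset paths and the `event_date` and `user_id` columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriteTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-write-tuning")
      .getOrCreate()

    // Row group size is a Hadoop-level Parquet setting; smaller row groups
    // give query engines more opportunities to skip data using statistics.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 64 * 1024 * 1024) // 64 MB row groups

    // Hypothetical source table.
    val events = spark.read.parquet("s3://warehouse/events_raw")

    events
      .repartition(events.col("event_date")) // cluster rows by the partition key
      .sortWithinPartitions("user_id")       // tight per-row-group min/max on user_id
      .write
      .partitionBy("event_date")             // directory-level partitioning
      .mode("overwrite")
      .parquet("s3://warehouse/events")

    spark.stop()
  }
}
```

With a layout like this, a filter such as `WHERE user_id = 123` can rule out most row groups in each file from the statistics alone, instead of relying on ever-deeper directory partitioning.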
