November 25, 2019


How to use Impala's query plan and profile to fix performance issues


Talk Title	How to use Impala's query plan and profile to fix performance issues
Speakers	Juan Yu (Cloudera)
Conference	Strata Data Conference
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 6-8, 2018

Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. However, Impala is a complex engine, and using it fully requires a thorough technical understanding. When Impala is improperly configured or used, it may consume too many resources and perform very poorly. For many users, understanding Impala query performance is like a trip on the mystery bus. Impala provides a query plan and a query profile to help users choose an optimal plan and understand how a query is executed and how many resources it uses. But digging through query profiles isn’t fun for everyone. Juan Yu demystifies the cost model that the Impala planner uses, shows how Impala optimizes queries, and explains how to identify performance bottlenecks through the query plan and profile and how to drive Impala to its full potential.

Tags: sql, big data, apache, performance


Speed up mission-critical analytics in the cloud (sponsored by Kyligence)

November 20, 2019

As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost-efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.

What's new in Hadoop 3.0

November 19, 2019

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Magellan: Scalable and fast geospatial analytics

November 23, 2019

How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan, a geospatial optimization engine that seamlessly integrates with Spark, and explains how it provides scalability and performance without sacrificing simplicity.

Deploying SQL Stream Processing in Kubernetes with Ease

November 22, 2019

Real-time processing allows you to act faster, and SQL allows you to construct flows more quickly and reuse existing skills. Apache Kafka is a key component, but how do you peek into the data, the topologies …

Metrics-driven tuning of Apache Spark at scale

November 22, 2019

Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

November 22, 2019

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.