January 6, 2020

Scaling Impala: Common mistakes and best practices

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala can handle hundreds of nodes and tens of thousands of queries per hour. Manish Maheshwari explains how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and anti-patterns for end users and BI applications.


Talk Title	Scaling Impala: Common mistakes and best practices
Speakers	Manish Maheshwari (Cloudera)
Conference	Strata Data Conference
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	April 30-May 2, 2019

Apache Impala is a complex engine and requires a thorough technical understanding to utilize it fully. Without proper configuration or usage, Impala’s performance becomes unpredictable, and end-user experience suffers. However, for many users and administrators, the right configuration of Impala is still a mystery. Drawing on work with some of the largest clusters in the world, Manish Maheshwari shares ingestion best practices to keep an Impala deployment scalable and details admission control configuration to provide a consistent experience to end users. Manish also takes a high-level look at Impala’s query profile, which is used as a first step in any performance troubleshooting, and discusses common mistakes users and BI tools make when interacting with Impala. Manish concludes by detailing an ideal setup to show all of this in practice.

VPP Accelerated High Performance & Scalable L3DSR L4 Load Balancer on Top Clos

December 28, 2019

Delivering packets quickly is getting difficult as service endpoints rapidly increase in high-traffic, large-scale data centers. Especially, time to on-demand deployment of load balancing function neede …

Analytics Zoo: Distributed TensorFlow in production on Apache Spark

December 27, 2019

Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications within the big data ecosystem.

Data processing at the speed of 100 Gbps using Apache Crail

December 26, 2019

Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.

When SQL users run wild: Resource management features and techniques to tame Apache Impala

December 19, 2019

As the popularity and utilization of Apache Impala deployments increase, clusters often become victims of their own success when demand for resources exceeds supply. Tim Armstrong dives into the latest resource management features in Impala for maintaining high cluster availability and optimal performance, and provides examples of how to configure them in your Impala deployment.

Liberating Kubernetes From Kube-proxy and Iptables

December 4, 2019

iptables and Netfilter are the two foundational technologies of kube-proxy for implementing a Service abstraction. They carry legacy accumulated over 20 years of development grounded in a more traditi …

Keynote: Tencent: Kubernetes in the Billions

September 24, 2019

At Tencent, our business touches everything from gaming, social media, and payments to cloud computing. We'd like to share our story of how K8s is broadly used at Tencent, taking care of our infrastructu …