January 2, 2020

Best practices with Kudu: An end-to-end user case from the automobile industry

Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.

Talk Title: Best practices with Kudu: An end-to-end user case from the automobile industry
Speakers: Wei Chen (Intel), Zhaojuan Bian (Intel)
Conference: Strata + Hadoop World
Conf Tag: Make Data Work
Location: Singapore
Date: December 6-8, 2016

The end-to-end system in this use case handles streaming data ingestion plus real-time and batch analytics, using Kafka and Spark for the messaging, streaming, and batch jobs. For the storage layer, the customer wanted to evaluate HDFS, HBase, and Kudu for its usage scenarios. Wei and Zhaojuan discuss the challenges they encountered in tuning Kudu performance, largely because Kudu is a new storage engine with little published guidance to refer to.

The performance of a Kudu-based cluster varies significantly with workload setup, hardware selection, and software parameters (OS VM parameters, hashed tablet count, maintenance thread count, etc.). For example, table schema design is critical to the performance of time series ingestion workloads. Small range partitions help achieve a high ingestion rate because they reduce the number of bloom filter lookups. However, they also increase the number of tablets that analytic jobs must scan.

Different scenarios also require different hardware resources. For ingestion-intensive scenarios, SSDs must be used as WAL disks. Faster CPUs with higher core counts are also needed as the active tablet count increases. Once these performance issues are fixed, however, Kudu offers a balanced solution.
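The partitioning trade-off described above can be made concrete with a little arithmetic. The sketch below is not Kudu code and does not come from the talk. It assumes a hypothetical table that is hash-partitioned on a non-time key and range-partitioned on a timestamp, and that an analytic scan filters only on time, so it must visit every hash bucket of each range partition it overlaps. The function names and the partition widths are illustrative assumptions.

```python
DAY = 86_400  # seconds per day


def range_partitions_touched(scan_start, scan_end, range_width):
    """Number of range partitions overlapping the half-open interval
    [scan_start, scan_end), for partitions of fixed width aligned at 0."""
    if scan_end <= scan_start:
        return 0
    first = scan_start // range_width
    last = (scan_end - 1) // range_width
    return last - first + 1


def tablets_scanned(scan_start, scan_end, range_width, hash_buckets):
    """Tablets visited by a time-only scan (assumption: no predicate on
    the hash key, so no hash bucket can be pruned)."""
    return range_partitions_touched(scan_start, scan_end, range_width) * hash_buckets


def total_tablets(retention, range_width, hash_buckets):
    """Total tablets needed to cover `retention` seconds of data."""
    range_partitions = -(-retention // range_width)  # ceiling division
    return range_partitions * hash_buckets


# A 30-day analytic scan over a table with 4 hash buckets (hypothetical numbers).
daily = tablets_scanned(0, 30 * DAY, range_width=DAY, hash_buckets=4)
monthly = tablets_scanned(0, 30 * DAY, range_width=30 * DAY, hash_buckets=4)
print(daily, monthly)  # 120 4
```

With these illustrative numbers, daily range partitions keep each tablet small, which is the property the talk associates with fewer bloom filter lookups during ingestion, but the same 30-day scan touches 120 tablets instead of 4. The right width depends on the actual mix of ingestion and analytic workloads.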
