Best practices with Kudu: An end-to-end user case from the automobile industry


Talk Title	Best practices with Kudu: An end-to-end user case from the automobile industry
Speakers	Wei Chen (Intel), Zhaojuan Bian (Intel)
Conference	Strata + Hadoop World
Conf Tag	Make Data Work
Location	Singapore
Date	December 6-8, 2016

Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based end-to-end (E2E) system. They also discuss key indicators for tuning Kudu and OS parameters and how to select the best hardware components for different scenarios.

The end-to-end system for streaming data ingestion and for real-time and batch analytics uses Kafka and Spark for the messaging, streaming, and batch jobs. For the storage layer, the customer wanted to evaluate HDFS, HBase, and Kudu for its usage scenarios.

Wei and Zhaojuan discuss the challenges they encountered in tuning Kudu performance, largely because Kudu is a new storage engine with little published guidance to refer to. The performance of a Kudu-based cluster varies significantly with workload setup, hardware selection, and software parameters (OS VM parameters, hashed tablet count, maintenance thread count, etc.). For example, table schema design is critical to the performance of time series ingestion workloads. Small range partitions help achieve a high ingestion rate because they reduce the number of bloom filter lookups. However, they also increase the number of tablets that analytic jobs must scan.

Different scenarios also require different hardware resources. For ingestion-intensive scenarios, SSDs must be used as WAL disks. Faster CPUs with higher core counts are also needed as the active tablet count increases. Once these performance issues are addressed, Kudu offers a balanced solution.
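
The trade-off between small range partitions and scanned tablet count can be illustrated with a rough back-of-the-envelope sketch. This is not taken from the talk. It assumes a table that is range-partitioned on a timestamp column with fixed-width ranges, where each range is further split into a fixed number of hash buckets, and a scan that filters only on the timestamp. The function name `tablets_scanned`, its parameters, and the epoch are hypothetical and chosen for illustration only.

```python
from datetime import datetime, timedelta


def tablets_scanned(scan_start, scan_end, range_width, hash_buckets,
                    epoch=datetime(2016, 1, 1)):
    """Estimate how many tablets a time-range scan touches.

    Assumptions (illustrative, not from the talk):
    - range partitions have a fixed width and are aligned to `epoch`
    - every range partition is split into `hash_buckets` tablets
    - the scan interval is [scan_start, scan_end) and has no predicate
      on the hashed column, so no hash bucket can be skipped
    """
    if scan_end <= scan_start:
        return 0
    # Index of the range partition holding the first and last scanned instant.
    first = (scan_start - epoch) // range_width
    last = (scan_end - epoch - timedelta(microseconds=1)) // range_width
    return (last - first + 1) * hash_buckets


# One week of data, 4 hash buckets per range partition.
start, end = datetime(2016, 12, 1), datetime(2016, 12, 8)
hourly = tablets_scanned(start, end, timedelta(hours=1), 4)
daily = tablets_scanned(start, end, timedelta(days=1), 4)
print(hourly, daily)
```

Under these assumptions, shrinking the range width from one day to one hour multiplies the number of tablets touched by the same weekly scan by 24, which is the cost side of the higher write rate that small ranges provide.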

Analytics at ING: Technology solutions to create a real-time, data-driven bank

Choice Hotels's journey to better understand its customers through self-service analytics

Stream analytics in the enterprise: A look at Intel's internal IoT implementation

Fast data made easy with Apache Kafka and Apache Kudu (incubating)

An architecture for merging fast data and enterprise applications: The SMACK stack