November 9, 2019

Big data for big data: Machine-learning models of Hadoop cluster behavior

Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.

Talk Title: Big data for big data: Machine-learning models of Hadoop cluster behavior
Speakers: Sean Suchter (Pepperdata), Shekhar Gupta (Pepperdata)
Conference: Strata + Hadoop World
Conf Tag: Big Data Expo
Location: San Jose, California
Date: March 14-16, 2017

The performance of batch processing systems such as YARN is generally determined by throughput: the amount of workload (tasks) completed in a given time window. For a given cluster size, throughput can be increased by running as much workload as possible on each host, using all of the free resources available on that host. Because each node runs a complex, changing combination of tasks and containers, the performance characteristics of the cluster shift constantly. As a result, there is always a danger of overutilizing host memory, which can lead to extreme swapping, or thrashing. The impact of thrashing can be severe enough to reduce throughput rather than increase it.

Sean Suchter and Shekhar Gupta explain how they used fine-grained performance data from many Hadoop clusters to build a model that predicts excessive swapping events. To build this system, they combined hand-labeling of bad events with large-scale data processing on Hadoop, HBase, and Spark, and used IPython for experimentation. By training on five-second data from many production clusters running very different workloads, they produced a generalized model that detects the onset of thrashing within seconds of the first symptom. This detection has proven fast enough to enable effective mitigation of thrashing, allowing the hosts to keep providing high throughput.

Sean and Shekhar discuss the methods they used and share novel findings about big data cluster performance.
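To make the idea of detecting thrashing onset from five-second samples more concrete, here is a minimal Python sketch. It is not the speakers' model, which is a trained machine-learning model whose features are not given here. It is a plain threshold rule over a sliding window, and the `Sample` fields, the `detect_thrashing_onset` function, the window size, and the threshold value are all hypothetical names and numbers chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass

# The abstract mentions five-second data; everything else here is assumed.
SAMPLE_INTERVAL_S = 5


@dataclass
class Sample:
    """One per-host measurement covering a single sampling interval (hypothetical schema)."""
    swap_in_pages: int   # pages swapped in during the interval
    swap_out_pages: int  # pages swapped out during the interval


def detect_thrashing_onset(samples, window=3, rate_threshold=1000.0):
    """Return the index of the first sample at which the mean swap rate
    (pages per second) over the last `window` samples exceeds
    `rate_threshold`, or None if that never happens.

    This is a simple stand-in for a learned model, not a replacement for one.
    """
    recent = deque(maxlen=window)  # keeps only the most recent `window` rates
    for i, sample in enumerate(samples):
        rate = (sample.swap_in_pages + sample.swap_out_pages) / SAMPLE_INTERVAL_S
        recent.append(rate)
        # Only decide once the window is full, to avoid firing on one noisy sample.
        if len(recent) == window and sum(recent) / window > rate_threshold:
            return i
    return None


if __name__ == "__main__":
    quiet = [Sample(0, 0) for _ in range(4)]
    busy = [Sample(10_000, 10_000) for _ in range(3)]
    onset = detect_thrashing_onset(quiet + busy)
    if onset is None:
        print("no thrashing detected")
    else:
        print(f"thrashing suspected at sample {onset}, "
              f"about {onset * SAMPLE_INTERVAL_S} s into the trace")
```

A fixed threshold like this would need per-cluster tuning, which is presumably part of why the talk describes a generalized model trained on labeled events from many clusters instead.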
