Big data for big data: Machine-learning models of Hadoop cluster behavior
Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.
Talk Title | Big data for big data: Machine-learning models of Hadoop cluster behavior |
Speakers | Sean Suchter (Pepperdata), Shekhar Gupta (Pepperdata) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 14-16, 2017 |
URL | Talk Page |
Slides | Talk Slides |
The performance of batch processing systems such as YARN is generally determined by throughput: the amount of workload (tasks) completed in a given time window. For a given cluster size, throughput can be increased by running as much workload as possible on each host, using all of the free resources available on that host. Because each node runs a complex mix of tasks and containers, the performance characteristics of the cluster change dynamically. As a result, there is always a danger of overutilizing host memory, which can lead to extreme swapping, or thrashing. The impact of thrashing can be severe: it can reduce throughput rather than increase it.

Sean Suchter and Shekhar Gupta explain how they used very fine-grained performance data from many Hadoop clusters to build a model that predicts excessive swapping events. To build this system, they combined hand-labeling of bad events with large-scale data processing using Hadoop, HBase, and Spark, with IPython for experimentation. By using five-second data from many production clusters running very different workloads, Sean and Shekhar trained a generalized model that detects the onset of thrashing within seconds of the first symptom. This detection has proven fast enough to enable effective mitigation of thrashing, allowing the hosts to keep providing high throughput. Sean and Shekhar discuss the methods they used and share novel findings about big data cluster performance.
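To make the idea of detecting thrashing onset from five-second samples concrete, here is a minimal sketch. It is not the trained model described in the talk: it swaps in a simple fixed-threshold heuristic over a sliding window. The `Sample` fields, the `detect_thrashing_onset` function, the window size, and the threshold are all hypothetical names and values chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Optional

SAMPLE_INTERVAL_S = 5  # five-second granularity, as mentioned in the abstract


@dataclass
class Sample:
    """One per-host measurement (hypothetical schema, not from the talk)."""
    t: int               # seconds since the start of the trace
    swap_in_pps: float   # pages swapped in per second
    swap_out_pps: float  # pages swapped out per second


def detect_thrashing_onset(
    samples: Iterable[Sample],
    window: int = 3,                # assumed: 3 samples = 15 seconds
    threshold_pps: float = 1000.0,  # assumed threshold, purely illustrative
) -> Optional[int]:
    """Return the timestamp at which sustained two-way swapping is first seen.

    A sample counts as "hot" when both swap-in and swap-out rates exceed the
    threshold. Onset is flagged once `window` consecutive samples are hot.
    Returns None if no such run occurs.
    """
    recent = deque(maxlen=window)  # rolling record of hot/not-hot flags
    for s in samples:
        hot = min(s.swap_in_pps, s.swap_out_pps) >= threshold_pps
        recent.append(hot)
        if len(recent) == window and all(recent):
            return s.t
    return None


if __name__ == "__main__":
    trace = [
        Sample(0, 0.0, 0.0),
        Sample(5, 10.0, 5.0),
        Sample(10, 2000.0, 1500.0),
        Sample(15, 2500.0, 1800.0),
        Sample(20, 3000.0, 2200.0),
    ]
    print(detect_thrashing_onset(trace))
```

A learned model, such as the one the speakers describe training on hand-labeled events, would replace the fixed threshold with a decision that generalizes across clusters and workloads. The sliding-window structure is only meant to show why fine-grained samples matter: with a five-second interval, a three-sample window can react within about 15 seconds.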