TuneIn: How to get your jobs tuned while you are sleeping
Have you ever tuned a Spark or MapReduce job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize resource usage. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
| Talk Title | TuneIn: How to get your jobs tuned while you are sleeping |
|------------|-----------------------------------------------------------|
| Speakers | Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn) |
| Conference | Strata Data Conference |
| Conf Tag | Make Data Work |
| Location | New York, New York |
| Date | September 11-13, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Have you ever tuned a Spark, Hive, or Pig job? If the answer is yes, you already know that it is a never-ending cycle: execute the job, observe it while it runs, make sense of hundreds of metrics, and then rerun it with better parameters. Now imagine doing this for tens of thousands of jobs. Manual performance optimization at this scale is tedious and costly, requires significant domain expertise, and results in a lot of wasted resources.

LinkedIn addressed this problem by developing Dr. Elephant, an open source self-serve performance monitoring and tuning tool for Hadoop and Spark. While it has proven very successful at LinkedIn as well as at other companies, it relies on a developer's initiative to check and apply the recommendations manually, and it expects some expertise from developers to arrive at the optimal configuration from those recommendations.

Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning framework developed on top of Dr. Elephant. You'll learn how LinkedIn uses an iterative optimization approach to find optimal parameter values, which optimization algorithms the team tried and why the particle swarm optimization algorithm gave the best results, and how they avoided any extra executions by tuning jobs during their regularly scheduled runs. Manoj, Pralabh, and Arpan also share techniques that ensure faster convergence and zero failed executions while tuning, explain how LinkedIn achieved a more than 50% reduction in resource usage by tuning a small set of parameters, and outline lessons learned and a future roadmap for the tool.
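To make the iterative approach concrete, here is a minimal sketch of particle swarm optimization searching a small parameter space. This is an illustration only, not TuneIn's implementation: the two Spark parameters, their bounds, and the synthetic `resource_cost` function are all assumptions standing in for the resource-usage signal that, per the talk, TuneIn would collect from a job's regularly scheduled executions.

```python
import random

# Hypothetical search space: two illustrative Spark parameters with
# assumed bounds (not TuneIn's actual parameter set).
BOUNDS = {
    "spark.executor.memory_gb": (1.0, 16.0),
    "spark.executor.instances": (2.0, 50.0),
}

def resource_cost(params):
    """Stand-in for measured resource usage of one run. A real tuner
    would observe this from the job's scheduled executions; here it is
    a synthetic bowl with its minimum at (4 GB, 10 executors)."""
    mem = params["spark.executor.memory_gb"]
    execs = params["spark.executor.instances"]
    return (mem - 4.0) ** 2 + 0.05 * (execs - 10.0) ** 2

def pso(cost_fn, bounds, n_particles=10, iterations=30,
        w=0.7, c1=1.5, c2=1.5, seed=42):
    """Minimize cost_fn over dict-of-(lo, hi) bounds with standard PSO."""
    rng = random.Random(seed)
    keys = list(bounds)

    def clamp(x, lo, hi):
        return max(lo, min(hi, x))

    # Random initial positions, zero velocities.
    pos = [{k: rng.uniform(*bounds[k]) for k in keys}
           for _ in range(n_particles)]
    vel = [{k: 0.0 for k in keys} for _ in range(n_particles)]
    pbest = [dict(p) for p in pos]                 # per-particle best
    pbest_cost = [cost_fn(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = dict(pbest[g]), pbest_cost[g]  # swarm best

    for _ in range(iterations):
        for i in range(n_particles):
            for k in keys:
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                pos[i][k] = clamp(pos[i][k] + vel[i][k], *bounds[k])
            c = cost_fn(pos[i])   # one "execution" per particle per iteration
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = dict(pos[i]), c
                if c < gbest_cost:
                    gbest, gbest_cost = dict(pos[i]), c
    return gbest, gbest_cost

best, best_cost = pso(resource_cost, BOUNDS)
```

Note that each cost evaluation corresponds to one job run, which is why piggybacking on already-scheduled executions, as the speakers describe, matters: the swarm proposes one candidate configuration per run instead of requiring dedicated benchmark executions.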