TuneIn: How to get your jobs tuned while you are sleeping
Have you ever tuned a Spark or MapReduce job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize resource usage. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
| Talk Title | TuneIn: How to get your jobs tuned while you are sleeping |
|------------|-----------------------------------------------------------|
| Speakers | Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn) |
| Conference | Strata Data Conference |
| Conf Tag | Make Data Work |
| Location | New York, New York |
| Date | September 11-13, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Have you ever tuned a Spark, Hive, or Pig job? If the answer is yes, you already know that it is a never-ending cycle: execute the job, observe it while it runs, make sense of hundreds of metrics, and then rerun it with better parameters. Now imagine doing this for tens of thousands of jobs. Manual performance optimization at this scale is tedious and costly, requires significant domain expertise, and results in a lot of wasted resources.

LinkedIn addressed this problem by developing Dr. Elephant, an open source self-serve performance monitoring and tuning tool for Hadoop and Spark. While it has proven very successful at LinkedIn as well as at other companies, it relies on a developer's initiative to check and apply the recommendations manually, and it expects some expertise from developers to arrive at the optimal configuration from those recommendations.

Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning framework developed on top of Dr. Elephant. You'll learn how LinkedIn uses an iterative optimization approach to find optimal parameter values, which optimization algorithms the team tried and why the particle swarm optimization algorithm gave the best results, and how they avoided any extra executions by tuning jobs during their regularly scheduled runs. Manoj, Pralabh, and Arpan also share techniques that ensure faster convergence and zero failed executions while tuning, explain how LinkedIn achieved a more than 50% reduction in resource usage by tuning a small set of parameters, and outline lessons learned and a future roadmap for the tool.
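To make the iterative approach concrete, here is a minimal sketch of particle swarm optimization searching a small parameter space. This is an illustration only, not TuneIn's implementation: the two Spark parameters, their bounds, and the synthetic `resource_cost` function are all assumptions standing in for the resource-usage signal that, per the talk, TuneIn would collect from a job's regularly scheduled executions.

```python
import random

# Hypothetical search space: two illustrative Spark parameters with
# assumed bounds (not TuneIn's actual parameter set).
BOUNDS = {
    "spark.executor.memory_gb": (1.0, 16.0),
    "spark.executor.instances": (2.0, 50.0),
}

def resource_cost(params):
    """Stand-in for measured resource usage of one run. A real tuner
    would observe this from the job's scheduled executions; here it is
    a synthetic bowl with its minimum at (4 GB, 10 executors)."""
    mem = params["spark.executor.memory_gb"]
    execs = params["spark.executor.instances"]
    return (mem - 4.0) ** 2 + 0.05 * (execs - 10.0) ** 2

def pso(cost_fn, bounds, n_particles=10, iterations=30,
        w=0.7, c1=1.5, c2=1.5, seed=42):
    """Minimize cost_fn over dict-of-(lo, hi) bounds with standard PSO."""
    rng = random.Random(seed)
    keys = list(bounds)

    def clamp(x, lo, hi):
        return max(lo, min(hi, x))

    # Random initial positions, zero velocities.
    pos = [{k: rng.uniform(*bounds[k]) for k in keys}
           for _ in range(n_particles)]
    vel = [{k: 0.0 for k in keys} for _ in range(n_particles)]
    pbest = [dict(p) for p in pos]                 # per-particle best
    pbest_cost = [cost_fn(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = dict(pbest[g]), pbest_cost[g]  # swarm best

    for _ in range(iterations):
        for i in range(n_particles):
            for k in keys:
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                pos[i][k] = clamp(pos[i][k] + vel[i][k], *bounds[k])
            c = cost_fn(pos[i])   # one "execution" per particle per iteration
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = dict(pos[i]), c
                if c < gbest_cost:
                    gbest, gbest_cost = dict(pos[i]), c
    return gbest, gbest_cost

best, best_cost = pso(resource_cost, BOUNDS)
```

Note that each cost evaluation corresponds to one job run, which is why piggybacking on already-scheduled executions, as the speakers describe, matters: the swarm proposes one candidate configuration per run instead of requiring dedicated benchmark executions.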