December 8, 2019

255 words 2 mins read

Tuning Spark machine-learning workloads

Spark's efficiency and speed can help reduce the TCO of existing clusters, because Spark's performance advantages allow it to complete processing in drastically shorter batch windows at higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of tuning an alternating least squares-based matrix factorization workload, improving runtimes by a factor of 2.22.

Talk Title Tuning Spark machine-learning workloads
Speakers Raj Krishnamurthy
Conference Strata + Hadoop World
Conf Tag Make Data Work
Location New York, New York
Date September 27-29, 2016
URL Talk Page
Slides Talk Slides
Video

Spark’s efficiency and speed can help big data administrators reduce the total cost of ownership (TCO) of their existing clusters, because Spark’s performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload; using this methodology, Raj has been able to improve runtimes by a factor of 2.22. Because Spark exposes a large number of tunables, a bottom-up approach that searches for the optimal runtime by varying the number of Spark workers and worker cores creates a combinatorial explosion of tuning runs for a given workload, since the possible configurations multiply. The methodology discussed instead uses a hybrid top-down/bottom-up approach that searches the configuration space carefully and reduces the number of tuning runs required. It has also been applied successfully, with substantial performance improvements, to complex Spark workflows consisting of Spark SQL and ML Pipelines, and across a variety of other cluster architectures.
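To make the kind of workload and tunables concrete, here is a minimal sketch of an ALS-based matrix factorization job whose runtime depends on cluster-level knobs such as the number of executors (workers) and cores per executor. The dataset path, column names, and configuration values are hypothetical and purely illustrative; they are not the settings or methodology from the talk.

```scala
// Sketch of an ALS matrix factorization workload with the cluster tunables
// (executor count, cores per executor) that a tuning exercise would sweep.
// Path, schema, and config values are placeholders, not the talk's settings.
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("als-tuning-sketch")
      // Tunables typically varied run-to-run; a naive grid over workers x cores
      // multiplies quickly (e.g. 8 worker counts x 6 core counts = 48 runs).
      .config("spark.executor.instances", "8") // number of Spark workers/executors
      .config("spark.executor.cores", "4")     // cores per executor
      .config("spark.executor.memory", "8g")
      .config("spark.sql.shuffle.partitions", "64")
      .getOrCreate()

    // Hypothetical ratings data: (userId, itemId, rating)
    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/ratings.csv")

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
      .setRank(10)    // latent-factor dimension
      .setMaxIter(10)
      .setRegParam(0.1)

    // Time the training step; this is the figure a tuning sweep compares
    // across executor/core configurations.
    val start = System.nanoTime()
    val model = als.fit(ratings)
    val elapsedSec = (System.nanoTime() - start) / 1e9
    println(f"ALS training completed in $elapsedSec%.1f s")

    spark.stop()
  }
}
```

Submitting this job repeatedly while varying only spark.executor.instances and spark.executor.cores illustrates why an exhaustive bottom-up search grows multiplicatively, and why a methodology that prunes the configuration space is attractive.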
