Tuning Spark machine-learning workloads
Spark's efficiency and speed can help reduce the TCO of existing clusters, because its performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload, showing how tuning improved runtimes by a factor of 2.22.
| Talk Title | Tuning Spark machine-learning workloads |
| --- | --- |
| Speakers | Raj Krishnamurthy |
| Conference | Strata + Hadoop World |
| Conf Tag | Make Data Work |
| Location | New York, New York |
| Date | September 27-29, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Spark’s efficiency and speed can help big data administrators reduce the total cost of ownership (TCO) of their existing clusters, because Spark’s performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload; using this methodology, Raj has been able to improve runtimes by a factor of 2.22.

Because Spark exposes a large number of tunables, a bottom-up approach that searches for the optimal runtime by varying the number of Spark workers and worker cores creates an explosion of tuning runs for a given workload, since the possible configurations multiply. The discussed methodology uses a hybrid top-down/bottom-up approach that searches the configuration space carefully and reduces the combinatorial explosion of possible tuning runs. It has also been successfully applied to complex Spark workflows that combine Spark SQL and ML Pipelines, achieving substantial performance improvements, and across a variety of other cluster architectures.
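For context, a minimal sketch of what an ALS-based matrix factorization workload looks like with Spark's ML Pipelines API is shown below. The column names, input path, rank, and iteration count are illustrative assumptions, not the configuration benchmarked in the talk; the `fit` step is what a tuning run would time.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the benchmarked workload: ALS matrix
// factorization over a (userId, itemId, rating) DataFrame.
object AlsWorkload {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("als-matrix-factorization")
      .getOrCreate()

    // Placeholder input path; ratings are expected as (userId, itemId, rating) rows.
    val ratings = spark.read.parquet("/data/ratings.parquet")

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
      .setRank(50)      // number of latent factors (illustrative)
      .setMaxIter(10)   // ALS sweeps over user and item factors
      .setRegParam(0.1)

    val model = als.fit(ratings)  // the step a tuning run would measure
    model.itemFactors.count()     // force materialization of the factors
    spark.stop()
  }
}
```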
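The sketch below illustrates why a purely bottom-up sweep explodes: each tunable multiplies the number of runs. The candidate values and the three configuration keys chosen here are assumptions for illustration, not the settings explored in the talk.

```scala
// Enumerate a naive cross-product of three common resource tunables.
object TuningGrid {
  def main(args: Array[String]): Unit = {
    val workerCounts   = Seq(2, 4, 8, 16)
    val coresPerWorker = Seq(2, 4, 8)
    val executorMemory = Seq("8g", "16g", "32g")

    // Full cross-product: 4 * 3 * 3 = 36 runs for just three tunables.
    val grid = for {
      w <- workerCounts
      c <- coresPerWorker
      m <- executorMemory
    } yield Map(
      "spark.executor.instances" -> w.toString,
      "spark.executor.cores"     -> c.toString,
      "spark.executor.memory"    -> m
    )

    println(s"Exhaustive sweep would require ${grid.size} tuning runs")
    // A hybrid search instead fixes the coarse-grained settings first
    // (e.g. worker count suggested by data volume) and only then varies
    // the finer-grained knobs, pruning most of this grid.
  }
}
```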