December 8, 2019

255 words 2 mins read

Tuning Spark machine-learning workloads

Spark's efficiency and speed can help reduce the TCO of existing clusters, because Spark's performance advantages allow it to complete processing in drastically shorter batch windows at higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of tuning an alternating least squares-based matrix factorization workload, improving runtimes by a factor of 2.22.

Talk Title Tuning Spark machine-learning workloads
Speakers Raj Krishnamurthy
Conference Strata + Hadoop World
Conf Tag Make Data Work
Location New York, New York
Date September 27-29, 2016
URL Talk Page
Slides Talk Slides
Video

Spark’s efficiency and speed can help big data administrators reduce the total cost of ownership (TCO) of their existing clusters, because Spark’s performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload; using this methodology, Raj has been able to improve runtimes by a factor of 2.22. Because Spark exposes a large number of tunables, a bottom-up approach that searches for the optimal runtime by varying the number of Spark workers and worker cores creates a combinatorial explosion of tuning runs for a given workload, since the possible configurations multiply. The methodology discussed instead uses a hybrid top-down/bottom-up approach that searches the configuration space carefully and reduces the number of tuning runs required. It has also been applied successfully, with substantial performance improvements, to complex Spark workflows consisting of Spark SQL and ML Pipelines, and across a variety of other cluster architectures.
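To make the kind of workload and tunables concrete, here is a minimal sketch of an ALS-based matrix factorization job whose runtime depends on cluster-level knobs such as the number of executors (workers) and cores per executor. The dataset path, column names, and configuration values are hypothetical and purely illustrative; they are not the settings or methodology from the talk.

```scala
// Sketch of an ALS matrix factorization workload with the cluster tunables
// (executor count, cores per executor) that a tuning exercise would sweep.
// Path, schema, and config values are placeholders, not the talk's settings.
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("als-tuning-sketch")
      // Tunables typically varied run-to-run; a naive grid over workers x cores
      // multiplies quickly (e.g. 8 worker counts x 6 core counts = 48 runs).
      .config("spark.executor.instances", "8") // number of Spark workers/executors
      .config("spark.executor.cores", "4")     // cores per executor
      .config("spark.executor.memory", "8g")
      .config("spark.sql.shuffle.partitions", "64")
      .getOrCreate()

    // Hypothetical ratings data: (userId, itemId, rating)
    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/ratings.csv")

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
      .setRank(10)    // latent-factor dimension
      .setMaxIter(10)
      .setRegParam(0.1)

    // Time the training step; this is the figure a tuning sweep compares
    // across executor/core configurations.
    val start = System.nanoTime()
    val model = als.fit(ratings)
    val elapsedSec = (System.nanoTime() - start) / 1e9
    println(f"ALS training completed in $elapsedSec%.1f s")

    spark.stop()
  }
}
```

Submitting this job repeatedly while varying only spark.executor.instances and spark.executor.cores illustrates why an exhaustive bottom-up search grows multiplicatively, and why a methodology that prunes the configuration space is attractive.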
