January 2, 2020

244 words 2 mins read

Apache Spark ML and MLlib tuning and optimization: A case study on boosting the performance of ALS by 60x


Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in JD.com's production environment.

Talk Title: Apache Spark ML and MLlib tuning and optimization: A case study on boosting the performance of ALS by 60x
Speakers: Peng Meng (Intel)
Conference: Strata + Hadoop World
Conf Tag: Make Data Work
Location: Singapore
Date: December 6-8, 2016
URL: Talk Page
Slides: Talk Slides
Video:

Apache Spark ML and MLlib are hugely popular in the big data ecosystem and have evolved from standard ML libraries into powerful components that support complex workflows and production requirements. Intel has been deeply involved in Spark from a very early stage, working with the community on feature development, bug fixing, and performance optimization. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib alternating least squares (ALS) by 60x in JD.com's production environment. The methods include rewriting recommendForAll, optimizing CartesianRDD computation, choosing between f2jBLAS and native BLAS, and tuning cluster settings and ALS parameters. This solution not only greatly reduced computation time in JD.com's and VipShop's production environments but was also merged into Apache Spark.
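As a rough illustration of where these knobs sit, the sketch below trains MLlib's RDD-based ALS and calls recommendProductsForUsers, which goes through the recommendForAll path the talk describes. This is a minimal sketch, not the optimized code from the talk: the input path, object name, and parameter values are placeholders, and whether f2jBLAS or a native BLAS (e.g., OpenBLAS or MKL via netlib-java) is used depends on what is installed on the cluster.

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}

object AlsRecommendForAllSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ALSRecommendForAll")
    val sc = new SparkContext(conf)

    // Placeholder input path; each line is "userId,productId,rating".
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // ALS parameters (rank, iterations, lambda, blocks) are part of the tuning
    // space the talk discusses; the values here are placeholders.
    val rank = 10
    val numIterations = 10
    val lambda = 0.01
    val numBlocks = -1 // -1 lets MLlib pick the number of blocks automatically
    val model = ALS.train(ratings, rank, numIterations, lambda, numBlocks)

    // recommendProductsForUsers uses the recommendForAll code path whose
    // rewrite the talk covers; it returns the top-N products for every user.
    val topN = model.recommendProductsForUsers(10)
    topN.take(5).foreach { case (user, recs) =>
      println(s"$user -> ${recs.mkString(", ")}")
    }

    sc.stop()
  }
}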
