November 30, 2019


Spark and R with sparklyr


R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.

Talk Title: Spark and R with sparklyr
Speakers: Douglas Ashton (Mango Solutions), Aimee Gott (Mango Solutions), Mark Sellors (Mango Solutions)
Conference: Strata Data Conference
Conf Tag: Making Data Work
Location: London, United Kingdom
Date: May 23-25, 2017

One of the frustrations in data science comes when a problem grows from being manageable on a laptop or a single server to being too big to fit in memory or too slow to process. Scaling up often means switching to a completely different environment and even a different language.

Apache Spark is the leader for distributed in-memory data analysis. It comes with advanced machine-learning modules and has interfaces for Scala, Python, and R. The SparkR project brings many of Spark’s capabilities to R but is still missing many of the machine-learning tools available with Python or Scala. This year RStudio released the sparklyr package to provide tighter integration between the RStudio IDE and Spark. sparklyr provides a backend to the commonly used dplyr package, so R users who are familiar with dplyr can keep using that interface, and it provides much more in terms of machine learning and feature transformations.

Douglas Ashton, Aimee Gott, and Mark Sellors offer an overview of Apache Spark and the types of problems it can solve before walking you through hands-on examples covering the basics of working with distributed data, data manipulation, and machine learning. You’ll leave with everything you need to seamlessly scale your R data analysis to a distributed environment without learning an entirely new language.
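To give a feel for the workflow the abstract describes, here is a minimal sketch of using sparklyr as a dplyr backend and then fitting a model through MLlib. It is not taken from the talk: it assumes the sparklyr and dplyr packages are installed, that a local Spark installation is available (for example via `sparklyr::spark_install()`), and it uses the built-in `mtcars` data set as a stand-in for real distributed data.

```r
# Illustrative sketch only; assumes sparklyr, dplyr and a local Spark install.
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance. A real cluster would use a different
# master URL; "local" is an assumption made for this example.
sc <- spark_connect(master = "local")

# Copy a small built-in data frame into Spark to stand in for distributed data.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Ordinary dplyr verbs are translated to Spark SQL and executed in Spark.
mpg_by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  collect()  # bring the small summary back into the R session

# Fit a linear regression with Spark MLlib through sparklyr.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```

The dplyr pipeline stays lazy until `collect()` is called, so only the aggregated result is pulled into R while the grouping and summarising run inside Spark.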
