sparklyr, implyr, and more: dplyr interfaces to large-scale data
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
Talk Title | sparklyr, implyr, and more: dplyr interfaces to large-scale data |
Speakers | Ian Cook (Cloudera) |
Conference | Strata Data Conference |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 6-8, 2018 |
URL | Talk Page |
Slides | Talk Slides |
dplyr, one of the most popular packages for R, provides a consistent grammar for data manipulation that can abstract over diverse data sources. dplyr can work with in-memory data frames and can also efficiently query large-scale data with processing engines including Apache Spark and Apache Impala (incubating). But dplyr works differently with each of these data sources, and the differences can be sneaky. Ian Cook demonstrates several dplyr-compatible interfaces, including sparklyr (from RStudio) and the new package implyr (from Cloudera), and offers tips for writing dplyr code that works across these different interfaces. He helps solve mysteries including: