November 20, 2019

sparklyr, implyr, and more: dplyr interfaces to large-scale data

The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
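To make the portability point concrete, here is a minimal sketch of one dplyr pipeline applied to both an in-memory data frame and a remote table. It is not taken from the talk: the helper name `summarise_mpg` is hypothetical, and an in-memory SQLite database (through the DBI, RSQLite, and dbplyr packages) stands in for a Spark or Impala backend so the example runs without a cluster. With sparklyr or implyr, the connection object would come from that package instead, and the generated SQL may differ.

```r
library(dplyr)

# Hypothetical helper: one pipeline meant to run unchanged on a local
# data frame or on a remote (database-backed) table.
summarise_mpg <- function(data) {
  data %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
    arrange(cyl)
}

# In-memory data frame: the pipeline is evaluated immediately in R.
local_result <- summarise_mpg(mtcars)

# Remote table: SQLite is an assumption used as a stand-in backend here.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
remote_mtcars <- copy_to(con, mtcars, "mtcars")

# The same pipeline now only builds a lazy query; nothing is computed yet.
remote_query <- summarise_mpg(remote_mtcars)
show_query(remote_query)   # print the SQL that dplyr generated

# collect() runs the query and pulls the result into R as a local tibble.
remote_result <- collect(remote_query)

DBI::dbDisconnect(con)
```

The pipeline itself does not change, but the remote version stays a lazy query until `collect()` is called, which is one of the ways the same dplyr code can behave differently across data sources.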


Talk Title	sparklyr, implyr, and more: dplyr interfaces to large-scale data
Speakers	Ian Cook (Cloudera)
Conference	Strata Data Conference
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 6-8, 2018
URL	Talk Page
Slides	Talk Slides
Video

dplyr, one of the most popular packages for R, provides a consistent grammar for data manipulation that can abstract over diverse data sources. dplyr can work with in-memory data frames and can also efficiently query large-scale data with processing engines including Apache Spark and Apache Impala (incubating). But dplyr works differently with these different data sources—and the differences can be sneaky. Ian Cook demonstrates several dplyr-compatible interfaces, including sparklyr (from RStudio) and the new package implyr (from Cloudera), and offers tips for writing dplyr code that works across these different interfaces. He helps solve mysteries including:

Speed up mission-critical analytics in the cloud (sponsored by Kyligence)

November 20, 2019

As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

November 20, 2019

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Container Storage Interface: Present and Future

November 19, 2019

The goal of Container Storage Interface (CSI) is to provide a standard API allowing a storage provider to write just one plugin that will work for all major container orchestration systems: Kubernetes …

Securing Serverless Functions via Kubernetes Objects

November 19, 2019

Serverless is fast becoming a new application architecture paradigm. As glue code that links cloud services together it is tempting to forget about the security of functions being deployed. In this ta …

Vectorized query processing using Apache Arrow

November 19, 2019

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.

Apache Kafka + Apache Mesos = Highly scalable streaming microservices

November 18, 2019

Kai Wähner shares a highly scalable, mission-critical infrastructure using Apache Kafka and Apache Mesos: Kafka brokers are used as the distributed messaging backbone; Kafka's Streams API embeds stream processing into any external application without the need for a dedicated streaming cluster; and Mesos is used as a scalable infrastructure to leverage the benefits of a cloud-native platform.