Using R for scalable data analytics: From single machines to Hadoop Spark clusters
Join us to learn how to do scalable, end-to-end data science in R, both on single machines and on Spark clusters. You'll be assigned an individual Spark cluster, preloaded with all content and software, and use it to gain hands-on experience building, operationalizing, and consuming machine-learning models with distributed functions in R.
Talk Title | Using R for scalable data analytics: From single machines to Hadoop Spark clusters |
Speakers | Vanja Paunic (Microsoft), Robert Horton (Microsoft), Hang Zhang (Microsoft), Srini Kumar (LevaData, Inc.), Mengyue Zhao (Microsoft), John-Mark Agosta (Microsoft), Mario Inchiosa (Microsoft), Debraj GuhaThakurta (Microsoft) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 14-16, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
R is one of the most widely used languages in the data science, statistics, and machine-learning (ML) community. Although open source R has a rich set of packages and functions for statistics and ML, when it comes to scalable data science, many CRAN R users are hindered by two gaps: the limitations of available functions for handling big data efficiently, and a lack of knowledge about the appropriate computing environments for scaling R scripts from a single node to elastic, distributed cloud services, including Spark 2.0 integrations. Vanja Paunic, Robert Horton, Hang Zhang, Srini Kumar, Mengyue Zhao, John-Mark Agosta, Mario Inchiosa, and Debraj GuhaThakurta walk you through creating end-to-end data science solutions in R on Spark clusters and consuming them in production. The tutorial materials and the scripts used to create the Spark clusters will be published to a public GitHub repository, so you'll be able to create Spark clusters identical to those used in the tutorial by running the scripts even after the session ends.
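As a rough sketch of the kind of distributed workflow the tutorial covers, the snippet below fits a model on Spark using RevoScaleR's `rx` functions. This is not the tutorial's actual code: the HDFS path, dataset, and column names are illustrative assumptions, and RevoScaleR's `RxSpark` compute context is just one of several ways to run R at scale on a Spark cluster.

```r
# Sketch: fitting a model with distributed RevoScaleR functions on Spark.
# Assumes an edge node of a Spark cluster with RevoScaleR installed;
# the data path and variable names below are hypothetical.
library(RevoScaleR)

# Switch the compute context from the local machine to the Spark cluster;
# subsequent rx* calls then execute as distributed Spark jobs.
cc <- RxSpark(consoleOutput = TRUE)
rxSetComputeContext(cc)

# Reference a dataset stored on HDFS (path is illustrative).
hdfsFS  <- RxHdfsFileSystem()
flights <- RxXdfData("/share/flights.xdf", fileSystem = hdfsFS)

# Fit a logistic regression across the cluster with a distributed function.
model <- rxLogit(ArrDelay15 ~ DayOfWeek + CRSDepTime, data = flights)
summary(model)

# Return to local computation when finished.
rxSetComputeContext("local")
```

The key idea is that the same `rxLogit` call runs locally or distributed depending only on the active compute context, which is what lets a script scale from a single machine to a cluster without rewriting the modeling code.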