Uber's data science workbench

Peng Du and Randy Wei offer an overview of Uber's data science workbench, a central platform where data scientists perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services and provides features such as task scheduling, model publishing, and job monitoring.


Talk Title	Uber's data science workbench
Speakers	Peng Du (Uber Inc.), Randy Wei (Uber Inc.)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 14-16, 2017

Peng Du and Randy Wei offer an overview of Uber’s data science workbench, a central platform where data scientists perform interactive data analysis through notebook environments such as Jupyter and RStudio, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services and provides features such as task scheduling, model publishing, and job monitoring.

Uber’s data science workbench gives clients a scalable compute environment in two parts: a dedicated Docker container is spawned for each notebook instance that is requested, and a cluster managed by YARN/Mesos runs compute engines such as Spark, Hive, and Presto. The workbench also supports social features: clients can share, comment on, and collaborate on notebook scripts with appropriate access control. All files, including scripts and results, are kept in a version control system so that people can track progress and compare results.

To improve the productivity of data scientists, the workbench is also integrated with multiple services at Uber. A mature script can be scheduled as a periodic task in Uber’s job scheduling service, and people can publish their results through dashboard services like Shiny and their models through Uber’s machine-learning platform. Finally, for complicated tasks that involve long-running jobs in Spark, Hive, or Presto, the workbench registers the jobs with Uber’s monitoring service so that people can check their progress and debug information.
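The abstract does not describe the workbench's actual interfaces, so the following is only a minimal sketch, under stated assumptions, of two ideas it mentions: each notebook request gets its own dedicated container, and every saved script is kept as a retrievable version. The names Workbench, NotebookSession, request_notebook, save_script, and get_version are hypothetical. Containers are simulated with unique ids rather than real Docker calls.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List

# Hypothetical names throughout; this is an illustration, not Uber's API.


@dataclass
class NotebookSession:
    user: str
    kernel: str        # "jupyter" or "rstudio"
    container_id: str  # one dedicated container per session


@dataclass
class Workbench:
    sessions: Dict[str, NotebookSession] = field(default_factory=dict)
    versions: Dict[str, List[str]] = field(default_factory=dict)
    _ids: count = field(default_factory=count, repr=False)

    def request_notebook(self, user: str, kernel: str) -> NotebookSession:
        """Simulate spawning a dedicated container for one notebook request."""
        if kernel not in ("jupyter", "rstudio"):
            raise ValueError(f"unsupported notebook type: {kernel}")
        container_id = f"container-{next(self._ids)}"
        session = NotebookSession(user, kernel, container_id)
        self.sessions[container_id] = session
        return session

    def save_script(self, path: str, content: str) -> int:
        """Store a new version of a script; return its 1-based version number."""
        history = self.versions.setdefault(path, [])
        history.append(content)
        return len(history)

    def get_version(self, path: str, version: int) -> str:
        """Fetch an earlier version so results can be compared over time."""
        return self.versions[path][version - 1]
```

In this sketch, two requests never share a container id, and saving the same path twice yields versions 1 and 2 that can both be read back. A real deployment would replace the simulated ids with calls to a container runtime and the in-memory history with a version control system.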

Using R for scalable data analytics: From single machines to Hadoop Spark clusters

Zillow: Transforming real estate through big data and machine learning

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)

Virtualizing Hadoop and Spark: Architecture, performance, and best practices (sponsored by VMware)