Uber's data science workbench
Peng Du and Randy Wei offer an overview of Uber's data science workbench, a central platform where data scientists perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services and provides features such as task scheduling, model publishing, and job monitoring.
Talk Title | Uber's data science workbench |
Speakers | Peng Du (Uber Inc.), Randy Wei (Uber Inc.) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 14-16, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Peng Du and Randy Wei offer an overview of Uber’s data science workbench, a central platform where data scientists perform interactive data analysis through notebooks such as Jupyter and RStudio, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services and provides features such as task scheduling, model publishing, and job monitoring.

Uber’s data science workbench gives users a scalable compute environment in two parts: each request for a notebook instance spawns a dedicated Docker container, and a cluster managed by YARN/Mesos runs compute engines such as Spark, Hive, and Presto.

The workbench also supports social features: users can share, comment on, and collaborate on notebook scripts, subject to appropriate access control. All files, including scripts and results, are kept under version control so that people can track progress and compare results.

To improve the productivity of data scientists, the workbench is integrated with multiple Uber services. A mature script can be scheduled as a periodic task in Uber’s job scheduling service, results can be published through dashboard services such as Shiny, and models can be published through Uber’s machine-learning platform. Finally, for complicated tasks that involve long-running jobs in Spark, Hive, or Presto, the workbench registers the jobs with Uber’s monitoring service so that people can check their progress and inspect debug information.
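As a rough illustration of the sharing and version-control behaviour described in the abstract, the following is a minimal in-memory sketch in Python. It is not Uber's implementation: the class `NotebookScript`, its methods, and the two access levels (`"read"` and `"write"`) are hypothetical names chosen for this example, and a real system would back the history with an actual version control system rather than a list.

```python
from dataclasses import dataclass, field


@dataclass
class NotebookScript:
    """Hypothetical model of a shared notebook script (illustration only)."""

    owner: str
    # Every saved revision is kept, so earlier results can be compared later.
    versions: list = field(default_factory=list)
    # Maps a user name to an assumed access level: "read" or "write".
    acl: dict = field(default_factory=dict)

    def share(self, user: str, level: str) -> None:
        """Grant another user read or write access to this script."""
        if level not in ("read", "write"):
            raise ValueError(f"unknown access level: {level!r}")
        self.acl[user] = level

    def can_read(self, user: str) -> bool:
        return user == self.owner or user in self.acl

    def can_write(self, user: str) -> bool:
        return user == self.owner or self.acl.get(user) == "write"

    def save(self, user: str, content: str) -> int:
        """Append a new revision and return its 1-based version number."""
        if not self.can_write(user):
            raise PermissionError(f"{user} may not modify this script")
        self.versions.append(content)
        return len(self.versions)

    def read(self, user: str, version: int) -> str:
        """Return the content of a given 1-based version."""
        if not self.can_read(user):
            raise PermissionError(f"{user} may not read this script")
        if not 1 <= version <= len(self.versions):
            raise IndexError(f"no such version: {version}")
        return self.versions[version - 1]
```

In this sketch the owner always has full access, a collaborator with `"read"` access can view any saved version, and only the owner or a `"write"` collaborator can add a new one. Keeping every revision is what makes it possible to compare results across versions.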