How to build consistent, scalable workspaces for data science teams

Data science insights are undoubtedly transforming organizations, but the challenges of setting up and maintaining a data science stack that generates those insights are rarely discussed. Elaine Lee describes how Avant's Data Engineering team built a system with open source projects, centered around Docker, to support data science R&D, continuous integration, and scaling in production.


Talk Title	How to build consistent, scalable workspaces for data science teams
Speakers	Elaine Lee (Avant)
Conference	O’Reilly Open Source Convention
Location	Austin, Texas
Date	May 16-19, 2016
URL	Talk Page
Slides	Talk Slides

Plenty of discussions extol the value generated by data science, but the technical challenges of doing data science are rarely discussed. In particular, a lot of effort goes into setting up the work environment. This typically involves configuring database connections, installing packages for modeling, and setting up mechanisms for storing and interacting with the results. Ideally, the system is scalable enough to accommodate the quantity of data, meet the computational demands of sophisticated machine-learning algorithms, and support a team of data scientists accessing those same assets, all of which can drive up infrastructure bills if not strategically provisioned. Above all, the system needs to be consistent: results must be reproducible when rolling back to previous versions for stability or when conducting audits.

Elaine Lee describes the container-based system created by Avant’s Data Engineering team that generates on-demand environments for its data scientists and for distributed tasks. Using open source projects such as Docker and Jenkins, along with others for scheduling and scaling, Avant’s push-button solution lets each data scientist launch their own workspace, which is guaranteed to be stable, is preconfigured with permissions to external services, and contains all the modeling tools they need. They no longer need to wrangle with dependencies and memory allocations because the autoscaling workspace is built from a Docker image that has passed the team’s automated checks and tests. Thus, data scientists can focus exclusively on creatively generating insight from data. This heavily automated solution also reduces the time data engineers spend maintaining and provisioning resources, freeing them to explore new technologies and build tooling for data scientists.
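The abstract does not show how a workspace is actually launched, so the following is only a minimal sketch of the general idea: a workspace may start only from an image that has already passed automated checks, with resource limits and service settings supplied for the user. The `WorkspaceSpec` class, the `docker_run_args` function, the image tag, and the environment variable are all hypothetical and are not taken from Avant's system. The sketch uses fixed limits rather than autoscaling, and it only builds the `docker run` command line without executing it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class WorkspaceSpec:
    """Hypothetical description of one data scientist's workspace."""
    user: str
    image: str                 # image tag expected to have passed CI checks
    memory: str = "4g"         # passed to `docker run --memory`
    cpus: float = 2.0          # passed to `docker run --cpus`
    env: Dict[str, str] = field(default_factory=dict)  # e.g. service endpoints


def docker_run_args(spec: WorkspaceSpec, approved_images: Set[str]) -> List[str]:
    """Build (but do not execute) a `docker run` command for a workspace.

    Refuses any image that is not in the approved set, which stands in
    for "has passed the team's automated checks and tests".
    """
    if spec.image not in approved_images:
        raise ValueError(f"image {spec.image!r} has not passed automated checks")

    args = [
        "docker", "run", "--detach",
        "--name", f"workspace-{spec.user}",
        "--memory", spec.memory,
        "--cpus", str(spec.cpus),
    ]
    # Sort for a deterministic command line, which helps reproducibility.
    for key, value in sorted(spec.env.items()):
        args += ["--env", f"{key}={value}"]
    args.append(spec.image)
    return args


if __name__ == "__main__":
    approved = {"ds-workspace:1.4.2"}  # hypothetical tag that passed CI
    spec = WorkspaceSpec(
        user="alice",
        image="ds-workspace:1.4.2",
        env={"DB_HOST": "db.internal"},  # hypothetical setting
    )
    print(" ".join(docker_run_args(spec, approved)))
```

Pinning the workspace to an exact, previously tested image tag is what makes a rollback or audit reproducible in this sketch: the same tag yields the same environment.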

Building a scalable data science platform with R

Incremental revolution: What Docker learned from the open source fire hose

Is your open source project ready for the container era?

Migrate your traditional VM-based clusters to containers

Multihost, multinetwork persistent containers

PaaSTA: Running applications at Yelp