How to build consistent, scalable workspaces for data science teams
Data science insights are undoubtedly transforming organizations, but the challenges of setting up and maintaining the data science stack that generates those insights are rarely discussed. Elaine Lee describes how Avant's Data Engineering team built a system with open source projects, centered around Docker, to support data science R&D, continuous integration, and scaling in production.
| | |
|---|---|
| Talk Title | How to build consistent, scalable workspaces for data science teams |
| Speakers | Elaine Lee (Avant) |
| Conference | O’Reilly Open Source Convention |
| Conf Tag | |
| Location | Austin, Texas |
| Date | May 16-19, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
There have been plenty of discussions extolling the value generated by data science, but the technical challenges of actually doing data science are rarely discussed. In particular, a great deal of effort goes into setting up the work environment: configuring database connections, installing packages for modeling, and setting up mechanisms for storing and interacting with the results. The system ideally needs to scale to accommodate the quantity of data, meet the computational demands of sophisticated machine-learning algorithms, and support a team of data scientists accessing shared assets concurrently, all of which can drive up infrastructure costs if not provisioned strategically. Above all, the system needs to be consistent: results must be reproducible when rolling back to previous versions for stability or when conducting audits.

Elaine Lee describes the container-based system created by Avant's Data Engineering team that generates on-demand environments for its data scientists and for distributed tasks. Using open source projects such as Docker, Jenkins, and others for scheduling and scaling, Avant's push-button solution lets each data scientist launch their own workspace that is guaranteed to be stable, preconfigured with access to external services, and stocked with all the modeling tools they need. They no longer need to wrangle with dependencies and memory allocations, because each autoscaling workspace is built from a Docker image that has passed the team's automated checks and tests. Data scientists can therefore focus exclusively on creatively generating insight from data. This heavily automated solution also reduces the time data engineers spend maintaining and provisioning resources, freeing them to explore new technologies and create tooling for data scientists.
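As a rough illustration of the image-based approach described above (a minimal sketch, not Avant's actual configuration — all file names, package choices, and the notebook entry point here are assumptions), a preconfigured workspace image might look like:

```dockerfile
# Hypothetical workspace image; names and versions are illustrative only.
FROM python:3.5

# Pin modeling dependencies so every workspace is built identically
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Bake in connection configuration for external services
COPY config/ /opt/workspace/config/

# Run as a non-root user so workspaces have least privilege
RUN useradd --create-home analyst
USER analyst
WORKDIR /home/analyst

# Launch a notebook server as the workspace entry point
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser"]
```

In a setup like this, a CI server such as Jenkins would build the image, run the team's automated checks against it, and push only tagged images that pass, so every workspace a data scientist launches is a known-good, reproducible snapshot.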