January 6, 2020


Reproducible data dependencies for Jupyter: Distributing massive, versioned image datasets from the Allen Institute for Cell Science


Reproducible data is essential for notebooks that work across time, across contributors, and across machines. Jackson Brown and Aneesh Karve demonstrate how to use an open source data registry to create reproducible data dependencies for Jupyter and share a case study in open science over terabyte-size image datasets.

Talk Title Reproducible data dependencies for Jupyter: Distributing massive, versioned image datasets from the Allen Institute for Cell Science
Speakers Jackson Brown (Allen Institute for Cell Science), Aneesh Karve (Quilt)
Conference JupyterCon in New York 2018
Conf Tag The Official Jupyter Conference
Location New York, New York
Date August 22-24, 2018

The Allen Institute for Cell Science generates terabytes of microscopy images every week. To improve access to these datasets for data scientists and external collaborators, the institute sought a platform that would enable plain-text search, subsetting of large datasets, version control to support reproducible experiments, and easy access from data science tools like Jupyter, Python, and pandas. The team found that software optimized for storing and versioning source code (e.g., GitHub) performs poorly on large files and imposes hard file-size limits that rule out large data repositories altogether. In response, the team is creating an open repository of image data that is enriched with metadata and encapsulated in “data packages”: versioned, immutable sets of data dependencies.

Package management is a well-established concept in software development, but to date it has largely been applied to source code. Jackson and Aneesh propose extending package management to the distinctive file-size and format challenges of data by building on Quilt, an open source data registry. Combined with custom filtering software, Quilt enables efficient search and query of metadata, so data scientists can filter terabyte-size packages down to megabyte-size subsets that fit on a single machine. The package management infrastructure optimizes not only storage and network transfer but also serialization and virtualization, so data scientists can interact with data packages in formats native to Jupyter and Python. A workflow along these lines is sketched below.

Jackson and Aneesh also explore the role of data packages in versioning models and in detecting model drift with “data unit tests” that check data profiles; a second sketch below illustrates the idea.
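The following is a minimal sketch of that workflow using quilt3, the current open source Quilt client (the talk predates quilt3, so the original demo likely used an earlier version of the client). The registry bucket, package name, metadata keys, and file names below are placeholders, not the Allen Institute's actual layout.

```python
# Sketch: pull a versioned subset of an image dataset into a notebook with quilt3.
# Registry, package name, and keys are hypothetical placeholders.
import quilt3

# Browse the package manifest without downloading any data.
pkg = quilt3.Package.browse(
    "aics/microscopy_images",    # hypothetical package name
    registry="s3://allencell",   # hypothetical registry bucket
)

# The top hash identifies an immutable revision of the package, so pinning it
# makes the notebook's data dependency reproducible across machines and time.
print(pkg.top_hash)

# Filter the terabyte-scale package down to a small subset using entry-level
# metadata, then fetch only that subset to the local machine.
subset = pkg.filter(lambda key, entry: entry.meta.get("cell_line") == "AICS-13")
subset.fetch("./data/")

# Entries in Jupyter-friendly formats deserialize directly into Python objects,
# e.g. a CSV manifest loads as a pandas DataFrame.
df = pkg["metadata.csv"]()
```

Because the package is addressed by its top hash, a collaborator who later installs the same revision resolves exactly the same bytes, which is what makes the notebook's data dependency reproducible rather than merely convenient.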
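The talk does not prescribe an implementation for these data unit tests, so the following is only an illustrative sketch in plain pandas: a check with hypothetical column names and thresholds that asserts a new package revision still matches the expected data profile before a model is retrained or scored.

```python
# Illustrative "data unit test": verify a new data revision matches the
# expected profile. Column names, thresholds, and the file path are hypothetical.
import pandas as pd

def check_profile(df: pd.DataFrame) -> None:
    # Schema check: required columns are present.
    required = {"cell_id", "structure", "intensity_mean"}
    missing = required - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # Completeness check: at most 1% missing intensity values.
    assert df["intensity_mean"].isna().mean() <= 0.01, "too many missing values"

    # Drift check: summary statistics stay within expected bounds; a shift
    # outside these bounds flags data (and hence model) drift.
    mean = df["intensity_mean"].mean()
    assert 100 <= mean <= 5000, f"intensity_mean drifted: {mean:.1f}"

if __name__ == "__main__":
    check_profile(pd.read_csv("data/metadata.csv"))  # placeholder path
    print("data profile checks passed")
```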
