January 6, 2020


Reproducible data dependencies for Jupyter: Distributing massive, versioned image datasets from the Allen Institute for Cell Science


Reproducible data is essential for notebooks that work across time, across contributors, and across machines. Jackson Brown and Aneesh Karve demonstrate how to use an open source data registry to create reproducible data dependencies for Jupyter and share a case study in open science over terabyte-size image datasets.

Talk Title Reproducible data dependencies for Jupyter: Distributing massive, versioned image datasets from the Allen Institute for Cell Science
Speakers Jackson Brown (Allen Institute for Cell Science), Aneesh Karve (Quilt)
Conference JupyterCon in New York 2018
Conf Tag The Official Jupyter Conference
Location New York, New York
Date August 22-24, 2018

The Allen Institute for Cell Science generates terabytes of microscopy images every week. To improve access to these datasets for data scientists and external collaborators, the institute sought a platform that would enable plain-text search, subsetting of large datasets, version control to support reproducible experiments, and easy access from data science tools like Jupyter, Python, and pandas. The team found that software optimized for storing and versioning source code (e.g., GitHub) performs poorly on large files and imposes hard file-size limits that rule out large data repositories altogether. In response, the team is creating an open repository of image data that is enriched with metadata and encapsulated in “data packages”: versioned, immutable sets of data dependencies.

Package management is a well-established concept in software development, but to date it has largely been applied to source code. Jackson and Aneesh propose extending package management to the distinctive file-size and format challenges of data by building on Quilt, an open source data registry. Combined with custom filtering software, Quilt enables efficient search and query of metadata, so data scientists can filter terabyte-size packages down to megabyte-size subsets that fit on a single machine. The package management infrastructure optimizes not only storage and network transfer but also serialization and virtualization, so data scientists can interact with data packages in formats native to Jupyter and Python. A workflow along these lines is sketched below.

Jackson and Aneesh also explore the role of data packages in versioning models and in detecting model drift with “data unit tests” that check data profiles; a second sketch below illustrates the idea.
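The following is a minimal sketch of that workflow using quilt3, the current open source Quilt client (the talk predates quilt3, so the original demo likely used an earlier version of the client). The registry bucket, package name, metadata keys, and file names below are placeholders, not the Allen Institute's actual layout.

```python
# Sketch: pull a versioned subset of an image dataset into a notebook with quilt3.
# Registry, package name, and keys are hypothetical placeholders.
import quilt3

# Browse the package manifest without downloading any data.
pkg = quilt3.Package.browse(
    "aics/microscopy_images",    # hypothetical package name
    registry="s3://allencell",   # hypothetical registry bucket
)

# The top hash identifies an immutable revision of the package, so pinning it
# makes the notebook's data dependency reproducible across machines and time.
print(pkg.top_hash)

# Filter the terabyte-scale package down to a small subset using entry-level
# metadata, then fetch only that subset to the local machine.
subset = pkg.filter(lambda key, entry: entry.meta.get("cell_line") == "AICS-13")
subset.fetch("./data/")

# Entries in Jupyter-friendly formats deserialize directly into Python objects,
# e.g. a CSV manifest loads as a pandas DataFrame.
df = pkg["metadata.csv"]()
```

Because the package is addressed by its top hash, a collaborator who later installs the same revision resolves exactly the same bytes, which is what makes the notebook's data dependency reproducible rather than merely convenient.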
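The talk does not prescribe an implementation for these data unit tests, so the following is only an illustrative sketch in plain pandas: a check with hypothetical column names and thresholds that asserts a new package revision still matches the expected data profile before a model is retrained or scored.

```python
# Illustrative "data unit test": verify a new data revision matches the
# expected profile. Column names, thresholds, and the file path are hypothetical.
import pandas as pd

def check_profile(df: pd.DataFrame) -> None:
    # Schema check: required columns are present.
    required = {"cell_id", "structure", "intensity_mean"}
    missing = required - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # Completeness check: at most 1% missing intensity values.
    assert df["intensity_mean"].isna().mean() <= 0.01, "too many missing values"

    # Drift check: summary statistics stay within expected bounds; a shift
    # outside these bounds flags data (and hence model) drift.
    mean = df["intensity_mean"].mean()
    assert 100 <= mean <= 5000, f"intensity_mean drifted: {mean:.1f}"

if __name__ == "__main__":
    check_profile(pd.read_csv("data/metadata.csv"))  # placeholder path
    print("data profile checks passed")
```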
