January 5, 2020

Supporting reproducibility in Jupyter through dataflow notebooks

Dataflow notebooks build on the Jupyter Notebook environment by adding constructs to make dependencies between cells explicit and clear. David Koop offers an overview of the Dataflow kernel, shows how it can be used to robustly link cells as a notebook is developed, and demonstrates how that notebook can be reused and extended without impacting its reproducibility.

Talk Title: Supporting reproducibility in Jupyter through dataflow notebooks
Speakers: David Koop (University of Massachusetts Dartmouth)
Conference: JupyterCon in New York 2018
Conf Tag: The Official Jupyter Conference
Location: New York, New York
Date: August 22-24, 2018
URL: Talk Page
Slides: Talk Slides

Jupyter notebooks are an important tool for exploratory data analysis, but their ability to combine code, outputs, and text has also made them useful for documenting and reproducing results. Notebooks can further play an important role in enabling reuse and extension of those results; users can not only examine and reexecute past results but also tweak inputs and test new ideas. However, the ability to modify and execute any cell, at any place in the notebook, can conflict with the desire to present steps in an easy-to-follow, top-down order.

David Koop offers an overview of the Dataflow kernel, which combines the freedom to tinker with the ability to define precise dependencies between cells. David shows how it can be used to robustly link cells as a notebook is developed and demonstrates how that notebook can be reused and extended without impacting its reproducibility. In addition, he explains the importance of intermediate outputs, shows how to structure cells to better define the boundaries between them, and explores other extensions enabled by the dataflow structure as well as relationships to the JupyterLab environment.

A dataflow notebook treats cells as building blocks that can be linked together through references to each other. More technically, a dataflow encodes dependencies between computations by defining how the outputs of one computation become the inputs of another. For example, cell A might calculate a value, cell B might read an image, and cell C might use cell A’s output to blur cell B’s output, creating a new image. If we change cell B to read a different image, cell C’s output should be updated as well. With robust dependencies, such updates can be executed automatically. This structure also allows users to examine how a change to one cell might impact results in other cells, and an interactive visualization of the dependency graph can help users navigate complex notebooks.

In a dataflow notebook, outputs (as defined by the last line of a cell) should reflect what was computed in the cell. Jupyter provides extensive capabilities for displaying and customizing outputs, but a cell with multiple outputs often wraps them in a tuple or other collection, which limits these rich displays. To address this limitation, David shares extensions to the Jupyter notebook that render each output individually. For example, a tuple (df, im) would generate two outputs: the DataFrame df as an HTML table and the image im as an image. Combined with an identification scheme that references the individual outputs, users can more clearly examine specific outputs and link them for use in other cells.
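To make the three-cell example concrete, here is a minimal sketch in plain Python, with each block standing in for one notebook cell. The synthesized image, variable names, and blur radius are illustrative only; the Dataflow kernel's actual cell-reference syntax is not shown here.

```python
# Conceptual sketch of the three-cell dependency described above.
# In a dataflow notebook each block would live in its own cell, and
# cell C would reference the *outputs* of cells A and B rather than
# shared global variables.
from PIL import Image, ImageFilter

# Cell A: compute a value (here, a blur radius)
radius = 4  # A's output

# Cell B: read (or, to keep this self-contained, synthesize) an image
image = Image.new("RGB", (128, 128), color=(200, 120, 40))  # B's output

# Cell C: use A's output to blur B's output, producing a new image
blurred = image.filter(ImageFilter.GaussianBlur(radius))  # C's output
```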
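The automatic updates described above amount to propagating a change through the dependency graph. The toy sketch below (not the Dataflow kernel's implementation) shows one way such propagation could work, assuming cells A and B feed cell C as in the example.

```python
from graphlib import TopologicalSorter

# downstream[X] = cells that consume X's output (A and B both feed C)
downstream = {"A": {"C"}, "B": {"C"}, "C": set()}

def cells_to_rerun(changed):
    """Return the changed cell plus everything downstream, in execution order."""
    # Collect the dirty set by walking downstream edges.
    dirty, stack = set(), [changed]
    while stack:
        cell = stack.pop()
        if cell not in dirty:
            dirty.add(cell)
            stack.extend(downstream[cell])
    # TopologicalSorter expects predecessor edges, so invert within the dirty set.
    preds = {c: {p for p in dirty if c in downstream[p]} for c in dirty}
    return list(TopologicalSorter(preds).static_order())

print(cells_to_rerun("B"))  # ['B', 'C']: editing B triggers C automatically
```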
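The tuple example can be approximated in plain IPython by displaying each element of a collection with its own rich formatter; the extension David describes goes further by identifying and linking individual outputs. The names df and im below are illustrative.

```python
# Display each element of a tuple with its own rich formatter instead of
# letting the whole tuple render as a single plain repr.
import numpy as np
import pandas as pd
from PIL import Image
from IPython.display import display

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})           # renders as an HTML table
im = Image.fromarray(np.zeros((64, 64, 3), dtype=np.uint8))   # renders as an image

outputs = (df, im)
for out in outputs:
    display(out)  # each element gets its own rich display
```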
