Pangeo: Big data climate science in the cloud

Climate science is being flooded with petabytes of data, overwhelming traditional modes of data analysis. The Pangeo project is building a platform to take big data climate science into the cloud using SciPy and large-scale interactive computing tools. Join Ryan Abernathey and Yuvi Panda to find out what the Pangeo team is building and why, and learn how to use it.


Talk Title	Pangeo: Big data climate science in the cloud
Speakers	Ryan Abernathey (Columbia University), Yuvi Panda (Data Science Education Program, UC Berkeley)
Conference	JupyterCon in New York 2018
Conf Tag	The Official Jupyter Conference
Location	New York, New York
Date	August 22-24, 2018

Earth’s climate is changing at a rate unprecedented in human history. This change brings profound challenges for human society, including rising seas, more severe droughts and floods, and more intense hurricanes. To understand and respond to these challenges, the climate science community is deploying an ever-growing array of satellites, autonomous sensor systems, and computer simulations, resulting in petabytes of new data generated every year. This volume of data is quickly overwhelming our community’s capacity for storage, analysis, and visualization. Paradoxically, rather than accelerating climate science, big data is slowly grinding it to a halt. Our inability to deal with this explosive growth in climate datasets has become a major technical obstacle, holding back scientific progress just when we need it most.

Climate scientists employ a wide range of data science techniques, from simple descriptive statistics to sophisticated spatiotemporal analysis to neural network-based learning. Interactivity, the ability to quickly iterate on and refine a particular analysis pipeline, is highly valued. As in most scientific fields, data analysis in climate science has traditionally followed a download model: datasets stored on FTP servers are downloaded and analyzed on a user’s personal computer. This works fine for MB-scale datasets, but it becomes cumbersome for GB-scale datasets, expensive and difficult for TB-scale datasets, and impossible for PB-scale datasets. Part of the difficulty is that existing big data tools (e.g., Spark and Hadoop) were designed around tabular data and are not well suited to the multidimensional numerical arrays found in climate science.

A central goal of the Pangeo project is to meet this challenge by developing data and software infrastructure that enables interactive-speed analysis of the largest climate datasets, integrating existing open source scientific Python technologies within a cloud environment. These include xarray, a Python package for working with labeled, multidimensional array data, as commonly found in climate science; Dask, a parallel computing library for Python that helps xarray represent huge datasets and distribute computations across clusters; JupyterHub and JupyterLab, computing environments that enable users to interact with cloud-based resources; and Kubernetes, a versatile, cloud-agnostic scheduler for running interactive and batch workloads.

Ryan Abernathey and Yuvi Panda offer an overview of these tools and describe how they work together. They then conduct a live demo using a Pangeo environment running on Google Cloud Platform to analyze global patterns of sea-level rise based on satellite observations of the ocean. Ryan and Yuvi conclude by outlining remaining challenges regarding how climate data is stored and accessed on the cloud.

Acknowledgements: The Pangeo project recently received support from the National Science Foundation and Google to develop this platform in both traditional high-performance computing environments and on Google Cloud Platform. This award supports scientists and developers from Lamont-Doherty Earth Observatory of Columbia University, the National Center for Atmospheric Research, and Anaconda Inc. The project has also benefited from volunteer contributions from institutions such as UC Berkeley, UK Met Office, US Geological Survey, and the HDF Group.
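
As a rough illustration of how xarray and Dask fit together as described in the abstract, the sketch below builds a small synthetic sea-level-anomaly array, splits it into Dask chunks, and computes a latitude-weighted global mean lazily. This is not the demo from the talk: the data are random numbers, and the variable name `sla`, the grid sizes, and the chunk size are all assumptions chosen for illustration. It assumes `numpy`, `xarray`, and `dask` are installed.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a gridded satellite product (random values,
# hypothetical variable name "sla" for sea-level anomaly).
time = np.arange(24)
lat = np.linspace(-60.0, 60.0, 25)
lon = np.linspace(0.0, 355.0, 72)
rng = np.random.default_rng(0)
sla = xr.DataArray(
    rng.normal(size=(time.size, lat.size, lon.size)),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="sla",
)

# Split the array into Dask chunks along time. From here on, operations
# only build a task graph; no numbers are computed yet.
sla_chunked = sla.chunk({"time": 6})

# Time mean at every grid point, then an area-style weighting by
# cos(latitude) to reduce over the horizontal dimensions.
time_mean = sla_chunked.mean("time")
weights = np.cos(np.deg2rad(sla.lat))
global_mean = time_mean.weighted(weights).mean(("lat", "lon"))

# .compute() executes the graph. On a Pangeo deployment the same call
# could run on a Dask cluster instead of the local machine.
result = float(global_mean.compute())
print(result)
```

The analysis code is the same whether the array fits in memory or is spread across many workers; only the chunking and the Dask scheduler behind `.compute()` change.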

Reproducible data dependencies for Jupyter: Distributing massive, versioned image datasets from the Allen Institute for Cell Science

Distributed training of deep learning models

Reproducible quantum chemistry in Jupyter

Distributed TensorFlow on Hops

Nezha: A Kubernetes Native Big Data Accelerator For Machine Learning

Airflow on Kubernetes: Dynamic Workflows Simplified