Zarr Data Overview¶
Requirements¶
First and foremost, a Zarr package is required to interact with the data stores. Listed below are languages with actively developed Zarr packages; note that not all of these packages yet support reading remote data stores:
Python: zarr-developers/zarr-python
TypeScript: gzuidhof/zarr.js
C++: constantinpape/z5
Julia: meggart/Zarr.jl
Java: saalfeldlab/n5-zarr
Scala: lasersonlab/ndarray.scala
Additionally, a filesystem package for Google Cloud and/or S3 storage is required for some languages to access the files containing the data stores (e.g., gcsfs and s3fs for Python).
Though optional, a CSV-loading package allows for searching and filtering of the Zarr data stores, which are enumerated in CSV files located at the root of each cloud storage bucket. Python users are encouraged to use xarray, intake, and intake-esm, which facilitate exploration of and interaction with the data through Earth System Model (ESM) collection specifications, also provided at the root of each bucket.
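As a minimal sketch of this workflow in Python, the snippet below opens the ESM collection with intake-esm and searches it by CMIP6 facets. The catalog URL shown is the one commonly used for the Pangeo CMIP6 collection on Google Cloud; verify it against the bucket root before relying on it.

```python
import intake

# Open the ESM collection specification at the root of the Google Cloud bucket.
# URL assumed from the Pangeo CMIP6 collection; verify against the bucket root.
col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# Search by CMIP6 facets, e.g., 3-hourly precipitation from pre-industrial control runs.
subset = col.search(variable_id="pr", table_id="3hr", experiment_id="piControl")
print(subset.df[["source_id", "member_id", "zstore"]].head())
```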
Data Locations¶
CMIP6 data in the cloud can be found in both Google Cloud and AWS S3 storage buckets:
gs://cmip6 (part of Google Cloud Public Datasets)
s3://cmip6-pds (part of the AWS Open Data Sponsorship Program)
Warning
The AWS S3 storage copy mechanism is currently broken, so the data there might be out of sync. Progress on reimplementing a sync between buckets is tracked here.
The Zarr-formatted data is currently ingested using Pangeo-Forge recipes as part of the NSF LEAP Project (more info).
The base organization of Zarr stores is reflected in the master CSV files located at the root of each bucket, which enumerate all available Zarr stores and their facets (components of the instance_id) to allow for sorting and filtering.
Warning
Parts of the information below are superseded by the new `Pangeo-ESGF CMIP6 Zarr Data 2.0` (currently in beta testing). Please refer to the repository for up-to-date information, particularly on how to access new data and how to request that new data be ingested. This page will be updated once the beta testing phase is complete.
Zarr storage format¶
Each data store in the CMIP6 collection consists of all of the data, including the grids and metadata, stored in Zarr format. Zarr stores a dataset as a collection of files: plain-text files holding the metadata, plus data files holding the data divided into compressed chunks that can be read individually or in parallel, allowing very large datasets to scale efficiently in the cloud.
The original datasets, stored in the WCRP/CMIP6 ESGF repositories, consist of netCDF files.
Each of these datasets typically corresponds to a single variable saved at specified time intervals for the length of a single model run.
For convenience, these datasets were often divided into multiple netCDF files.
Our Zarr data stores correspond to the result of concatenating these netCDF files and then storing them as a single Zarr object.
All of the metadata in the original netCDF files has been preserved, including the licensing information and the CMIP6 persistent identifiers (tracking_ids), which are unique for each of the original netCDF files. A Zarr data store's tracking_id consists of a concatenated list of the netCDF tracking_ids from which it was created. An individual tracking_id can be looked up at Handle.net (e.g., enter “hdl:21.14100/33cbdc29-fbc9-44ab-9e09-5dc7824441cf”, which then redirects here).
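As a sketch, a store can be opened anonymously with gcsfs and xarray to inspect its chunking and these attributes. The store path below is hypothetical; substitute one taken from the master CSV.

```python
import gcsfs
import xarray as xr

# Anonymous access to the public Google Cloud bucket.
fs = gcsfs.GCSFileSystem(token="anon")

# Hypothetical store path; substitute a real one from the master CSV.
store = fs.get_mapper(
    "cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/piControl/r1i1p1f1/3hr/pr/gr1/v20180701/"
)

ds = xr.open_zarr(store, consolidated=True)
print(ds.pr.encoding.get("chunks"))  # compressed chunk layout on disk
print(ds.attrs["tracking_id"])       # concatenated netCDF tracking_ids
```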
Directory structure¶
UPDATE: Feb. 1, 2021: Please note that our Zarr store names, and therefore the URLs for our Zarr stores, have recently changed. For example, the prefix is now gs://cmip6/CMIP6/ on GC and s3://cmip6-pds/CMIP6/ on AWS S3. In addition, conforming to the ESGF syntax, we have appended the version_id (e.g., /v20200101) to the names.
To organize the data there is a list of keywords, each with a controlled vocabulary that has been developed over the many CMIP iterations. The keywords categorize the model data in the many ways we might want to search it.
For example, to find all available 3-hourly precipitation data from the pre-industrial control runs, we only need to specify the variable, frequency, and experiment name. In this case, the keywords ['variable_id', 'table_id', 'experiment_id'] will have the values ['pr', '3hr', 'piControl'] (see the pandas sketch in the CSV file structure section below).
The data are now structured in this cloud repository using 9 of these keywords in this order:
cmip6[-pds]/CMIP6/
└──<activity_id>/
└──<institution_id>/
└──<source_id>/
└──<experiment_id>/
└──<member_id>/
└──<table_id>/
└──<variable_id>/
└──<grid_label>/
└──<version_id>/
Each object specified in this way refers to a single Zarr data store.
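A minimal sketch of assembling a store URL from these keywords follows; the facet values here are illustrative, not a guaranteed existing store.

```python
# Illustrative facet values; the key order matches the directory structure above.
facets = {
    "activity_id": "CMIP",
    "institution_id": "NOAA-GFDL",
    "source_id": "GFDL-ESM4",
    "experiment_id": "piControl",
    "member_id": "r1i1p1f1",
    "table_id": "3hr",
    "variable_id": "pr",
    "grid_label": "gr1",
    "version_id": "v20180701",
}

# Dicts preserve insertion order in Python 3.7+, so joining the values
# reproduces the bucket prefix layout shown above.
url = "gs://cmip6/CMIP6/" + "/".join(facets.values()) + "/"
print(url)
```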
CSV file structure¶
We maintain CSV files listing the most recent versions of the Zarr data stores, providing the keyword values in columns as well as the dataset URLs and some additional information. These files allow for rapid searching by keyword using your favorite spreadsheet software; in Python, we generally use the pandas package.
There are two different master CSV files located at the root of the buckets: one contains only datasets with no serious issues listed in the official ESGF Errata Service, and the other contains all available Zarr data stores, including those with serious issues (indicated by a -noQC label in the file name).
For backward compatibility on GCS, we also maintain redundant copies called “cmip6-zarr-consolidated-stores.csv” and “cmip6-zarr-consolidated-stores-noQC.csv”.
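As a sketch, the quality-controlled master CSV can be read directly with pandas and filtered using the keyword example from the directory structure section. The URL below uses the legacy file name mentioned above; verify it against the bucket root.

```python
import pandas as pd

# Legacy master CSV name at the root of the Google Cloud bucket (see above).
df = pd.read_csv(
    "https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv"
)

# All available 3-hourly precipitation data from the pre-industrial control runs.
pr_3hr = df.query(
    "variable_id == 'pr' and table_id == '3hr' and experiment_id == 'piControl'"
)
print(pr_3hr[["source_id", "member_id", "zstore"]])
```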
The first 8 column names correspond to the standard CMIP keywords; the next three additional columns are:
zstore: URL of the corresponding Zarr data store
dcpp_init_year: optional metadata for convenience when accessing DCPP-type experiments
version: approximate date of the model output file as listed on ESGF, in YYYYMMDD format
Finally, the -noQC variants exclusively include three additional columns:
status: status of the dataset’s issue, if any, using a controlled vocabulary:
    new: issue has been recently raised, with no other updates to its status
    onhold: issue is in the process of being examined or resolved
    resolved: issue has been resolved AND the corrected files have been published on ESGF with a new dataset version
    wontfix: issue cannot/won’t be fixed by the data provider; may result in a persistent low-severity issue with no consequences for analysis
severity: severity of the dataset’s issue, if any, using a controlled vocabulary:
    low: issue concerns file management (e.g., addition, removal, period extension, etc.)
    medium: issue concerns metadata (netCDF attributes) without undermining the values of the involved variable
    high: issue concerns single point variable or axis values
    critical: issue concerns the variable or axis values, undermining the analysis; use of this data is strongly discouraged
issue_url: link to view the issue on the ESGF Errata Service
There are currently over 400,000 entries, which is too many for Google Sheets, but the files can be viewed in most standard spreadsheet applications, where the entries can be sorted, selected, and discovered quickly and efficiently. We find that importing them as a Python pandas dataframe is very useful.
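Along the same lines, here is a sketch that loads the -noQC variant and screens out datasets with critical issues, again assuming the legacy file name on Google Cloud mentioned above.

```python
import pandas as pd

# Legacy -noQC master CSV name (see the note on redundant copies above).
noqc = pd.read_csv(
    "https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores-noQC.csv"
)

# Keep rows whose issues, if any, fall below 'critical' severity
# (rows with no issue have no severity value and are kept as well).
usable = noqc[noqc["severity"] != "critical"]
print(usable[["variable_id", "status", "severity", "issue_url"]].head())
```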
NetCDF Data Overview¶
Data locations¶
CMIP6 netCDF data in the cloud can be found in an AWS S3 storage bucket:
s3://esgf-world (part of the AWS Open Data Sponsorship Program)
The data is in NetCDF format, with a predetermined and well-defined directory structure to ensure that it is properly organized and classified. This directory structure is reflected in the CSV files located here, which enumerate all available netCDF datasets using their containing directory names as columns to allow for sorting and filtering. The names of the columns adhere to the CMIP6 controlled vocabulary whenever available. One can use the AWS S3 explorer to quickly explore these data holdings.
These datasets are also linked from the Registry of Open Data on AWS.
Directory structure¶
The directory structure (or the prefixes) adheres to the CMIP6 Data Reference Syntax and the CMIP6 Controlled Vocabulary, facilitating automated tools that build data catalogs and other utilities to aid in data analysis.
Here is an example: s3://esgf-world/CMIP6/AerChemMIP/NOAA-GFDL/GFDL-ESM4/hist-piNTCF/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-ESM4_hist-piNTCF_r1i1p1f1_gr1_185001-194912.nc
(appears as the column path in the CSV file located here)
where:
esgf-world is the name of the S3 bucket with CMIP6 NetCDF holdings (subset)
CMIP6 is the project_id
AerChemMIP is the name of the MIP (Model Intercomparison Project)
NOAA-GFDL is the institution_id
GFDL-ESM4 is the source_id (i.e., the model)
hist-piNTCF is the experiment_id
r1i1p1f1 is the member_id (i.e., the ensemble member; r, i, p, f stand for realization, initialization, physics, forcing respectively)
Amon is the table_id (i.e., the MIP table; Amon stands for atmos monthly)
tas is the variable_id
gr1 is the grid_label (in this example, the “r” in “gr1” stands for regridded)
v20180701 is the version_id
tas_Amon_GFDL-ESM4_hist-piNTCF_r1i1p1f1_gr1_185001-194912.nc is the file_name
More CMIP6 netCDF data is being added incrementally to the S3 storage bucket through a cloud-based experimental Earth System Grid Federation (ESGF) node.
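As a sketch, the example file from the directory structure section above can be opened directly from S3 with s3fs and xarray, assuming the file is netCDF-4/HDF5-based so the h5netcdf engine can read it from a file-like object.

```python
import s3fs
import xarray as xr

# Anonymous access to the public esgf-world bucket.
fs = s3fs.S3FileSystem(anon=True)

# Example path taken from the directory structure section above.
path = (
    "esgf-world/CMIP6/AerChemMIP/NOAA-GFDL/GFDL-ESM4/hist-piNTCF/"
    "r1i1p1f1/Amon/tas/gr1/v20180701/"
    "tas_Amon_GFDL-ESM4_hist-piNTCF_r1i1p1f1_gr1_185001-194912.nc"
)

# h5netcdf can read netCDF-4/HDF5 files from a file-like object;
# work with the dataset inside the block while the file is open.
with fs.open(path, "rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds["tas"])
```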
CSV File Structure¶
The CSV file, also known as the intake-esm catalog, lists the netCDF objects in the esgf-world bucket, providing the keyword values in columns as well as the dataset URLs and some additional information. The column names use the CMIP6 controlled vocabulary as indicated in the section above. This file allows for rapid searching by keyword using your favorite spreadsheet software; in Python, we generally use the pandas package. If you’d like to use it in your data analysis directly, you can also leverage xarray and dask. An example can be found here.
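A minimal sketch of filtering this catalog with pandas, assuming a local copy of the catalog CSV has been downloaded as esgf-world.csv (a hypothetical file name; use the actual catalog location linked above):

```python
import pandas as pd

# Hypothetical local copy of the intake-esm catalog CSV described above.
catalog = pd.read_csv("esgf-world.csv")

# Filter by CMIP6 facets, mirroring the directory structure.
tas = catalog[(catalog["variable_id"] == "tas") & (catalog["table_id"] == "Amon")]
print(tas.head())
```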