Accessing data in the cloud

Once the master CSV file is understood, accessing data is a matter of searching for relevant datasets using the CMIP6 controlled vocabulary and opening them by pointing your Zarr package of choice at their corresponding zstore URLs. While this can be done in any language whose Zarr package supports reading remote data stores, the following examples are in Python, showcasing the use of xarray, intake, and intake-esm to open and explore Earth System Model (ESM) collections of CMIP6 data.

Opening a single Zarr data store

A standalone Zarr data store can be opened using xarray’s open_zarr() function. The function takes a Python-native MutableMapping as input, which can be acquired from a Zarr store URL using either gcsfs or s3fs, depending on the cloud provider:

import gcsfs
import xarray as xr

# Connect to Google Cloud Storage
fs = gcsfs.GCSFileSystem(token='anon', access='read_only')

# create a MutableMapping from a store URL
mapper = fs.get_mapper("gs://cmip6/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/")

# make sure to specify that metadata is consolidated
ds = xr.open_zarr(mapper, consolidated=True)

or, for the AWS datasets:

import s3fs
import xarray as xr

# Connect to AWS S3 storage
fs = s3fs.S3FileSystem(anon=True)

# create a MutableMapping from a store URL
mapper = fs.get_mapper("s3://cmip6-pds/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/")

# make sure to specify that metadata is consolidated
ds = xr.open_zarr(mapper, consolidated=True)

Notice the option consolidated=True, which relies on a consolidated metadata file to open and describe the Zarr data store with minimal data egress.
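
Since only the consolidated metadata has been read at this point, the dataset can be inspected cheaply before any chunk of data is downloaded. A minimal sketch, assuming the hfls variable from the store opened above:

# only metadata has been read so far; print a summary of dims, coords and variables
print(ds)

# dimensions and chunk layout of the variable in this store
print(ds.hfls.dims)
print(ds.hfls.chunks)

# computations are lazy; data is only transferred when .compute() or .values is used
hfls_time_mean = ds.hfls.mean(dim='time')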

Manually searching the catalog

By downloading the master CSV file enumerating all available data stores, we can load it into a pandas DataFrame and search for relevant data using the CMIP6 controlled vocabulary:

import pandas as pd

# for Google Cloud:
df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
# for AWS S3:
# df = pd.read_csv("https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.csv")

df_subset = df.query("activity_id=='CMIP' & table_id=='Amon' & variable_id=='tas'")
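
Before opening anything, it can help to see how many stores matched the query and which models they come from. A quick sketch using ordinary pandas operations on the catalog columns:

# number of matching data stores
print(len(df_subset))
# which models (source_id) provide them
print(df_subset.source_id.unique())
# how many stores per experiment
print(df_subset.groupby('experiment_id').size())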

From here, we can open any of the selected data stores using xarray, providing the value of the zstore column as input:

# get the path to a specific zarr store (here, the last row of the subset)
zstore = df_subset.zstore.values[-1]
# create a mapper with the gcsfs/s3fs filesystem object created earlier
mapper = fs.get_mapper(zstore)

# open using xarray
ds = xr.open_zarr(mapper, consolidated=True)
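
Because each store carries the standard CMIP6 global attributes, you can check that the opened dataset corresponds to the catalog row you selected. A small sketch, assuming those attributes are present:

# compare the dataset's global attributes against the catalog row
print(ds.attrs.get('source_id'), ds.attrs.get('experiment_id'))
print(df_subset.iloc[-1][['source_id', 'experiment_id']].to_dict())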

When working with multiple data stores at the same time, it may be necessary to combine several of them into a single dataset for analysis. In these cases, it is easier to access them through an ESM collection with intake-esm. An ESM collection contains metadata describing how data stores can be combined to yield highly aggregated datasets, which intake-esm uses to automatically merge/concatenate them when they are loaded into an xarray container. This spares the user from combining data manually, while still offering the ability to search and explore all of the available data stores.

Loading an ESM collection

To load an ESM collection with intake-esm, the user must provide a valid ESM collection specification as input to intake’s open_esm_datastore() function:

import intake

# for Google Cloud:
col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
# for AWS S3:
#col = intake.open_esm_datastore("https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json")

col

This gives a summary of the ESM collection, including the total number of Zarr data stores (referred to as assets), along with the total number of datasets these Zarr data stores correspond to. The collection can also be viewed as a DataFrame:

col.df.head()

Searching for datasets

After exploring the controlled vocabulary, it’s straightforward to get the data assets you want using intake-esm’s search() function. In the example below, we will search for the following:

  • variable: tas, the near-surface air temperature

  • experiments: ["historical", "ssp245", "ssp585"]:

    • historical: all-forcing simulation of the recent past

    • ssp245: update of RCP4.5 based on SSP2

    • ssp585: emission-driven RCP8.5 based on SSP5

  • table ID: Amon, monthly atmospheric data

  • grid label: gr, data regridded to the data provider’s preferred target grid

  • member ID: r1i1p1f1, restricting results to a single ensemble member per model

# form query dictionary
query = dict(experiment_id=['historical', 'ssp245', 'ssp585'],
             table_id='Amon',
             variable_id=['tas'],
             member_id='r1i1p1f1',
             grid_label='gr')
# subset catalog and get some metrics grouped by 'source_id'
col_subset = col.search(require_all_on=['source_id'], **query)
col_subset.df.groupby('source_id')[['experiment_id', 'variable_id', 'table_id']].nunique()

Loading datasets

Once you’ve identified data assets of interest, you can load them into xarray dataset containers using intake-esm’s to_dataset_dict() function. Invoking this function yields a Python dictionary of high-level aggregated xarray datasets. The logic for merging/concatenating the query results into datasets is provided in the input JSON file, under aggregation_control:

"aggregation_control": {
  "variable_column_name": "variable_id",
  "groupby_attrs": [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label"
  ],
  "aggregations": [{
      "type": "union",
      "attribute_name": "variable_id"
    },
    {
      "type": "join_new",
      "attribute_name": "member_id",
      "options": {
        "coords": "minimal",
        "compat": "override"
      }
    },
    {
      "type": "join_new",
      "attribute_name": "dcpp_init_year",
      "options": {
        "coords": "minimal",
        "compat": "override"
      }
    }
  ]
}

Though these aggregation specifications are sufficient to merge individual data assets into xarray datasets, sometimes additional arguments must be provided depending on the format of the data assets. For example, Zarr-based assets can be loaded with the option consolidated=True, which relies on a consolidated metadata file to describe the assets with minimal data egress:

dsets = col_subset.to_dataset_dict(zarr_kwargs={'consolidated': True},
                                   storage_options={'token': 'anon'})
# list all merged datasets
[key for key in dsets.keys()]

When the datasets have finished loading, we can extract any of them as we would a value in a Python dictionary; each key is built by joining the groupby_attrs values (activity_id.institution_id.source_id.experiment_id.table_id.grid_label) with a period:

ds = dsets['ScenarioMIP.THU.CIESM.ssp585.Amon.gr']
ds
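
From here the aggregated dataset behaves like any other xarray object. As an illustration, the sketch below computes an area-weighted global-mean temperature time series; it assumes tas sits on a regular lat/lon grid and that the join_new aggregation has added a member_id dimension:

import numpy as np

# area weights proportional to the cosine of latitude (regular lat/lon grid assumed)
weights = np.cos(np.deg2rad(ds.lat))

# area-weighted global mean of near-surface air temperature
tas_global = ds.tas.weighted(weights).mean(dim=['lat', 'lon'])

# plot the first ensemble member, squeezing out any singleton dimensions
# (e.g. dcpp_init_year); this triggers the actual dask computation
tas_global.isel(member_id=0).squeeze().plot()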

Preprocessing the CMIP6 datasets

Once you are comfortable with the basic intake-esm features, you may notice that many datasets cannot be easily combined and manipulated without some time-consuming debugging. Julius Busecke’s very useful package, cmip6_preprocessing, does some of this cleanup for you, especially for the very tricky ‘Omon’ datasets. See, for example, this tutorial.

from cmip6_preprocessing.preprocessing import combined_preprocessing

and then pass it to to_dataset_dict() via the preprocess argument:

dsets = col_subset.to_dataset_dict(
  zarr_kwargs={'consolidated': True, 'decode_times': False},
  aggregate=True,
  preprocess=combined_preprocessing,
  storage_options={'token': 'anon'}
)
# AWS S3 needs slightly different storage options (anonymous access)
dsets = col_subset.to_dataset_dict(
  zarr_kwargs={'consolidated': True, 'decode_times': False},
  aggregate=True,
  preprocess=combined_preprocessing,
  storage_options={'anon': True}
)