Accessing data in the cloud
===========================

Once the master CSV file is understood, accessing data is a matter of searching for relevant datasets using the controlled vocabulary and opening them by pointing your Zarr package of choice to the corresponding ``zstore`` URLs. While this can be done in any language whose Zarr package supports reading remote data stores, the following examples are in Python and showcase the use of `xarray `_, `intake `_, and `intake-esm `_ to open and explore Earth System Model (ESM) collections of CMIP6 data.

Opening a single Zarr data store
--------------------------------

A standalone Zarr data store can be opened with xarray's ``open_zarr()`` function. The function takes a Python ``MutableMapping`` as input, which can be obtained from a Zarr store URL using either `gcsfs `_ or `s3fs `_, depending on the cloud provider:

.. code-block:: python

    import gcsfs
    import xarray as xr

    # connect to Google Cloud Storage
    fs = gcsfs.GCSFileSystem(token='anon', access='read_only')

    # create a MutableMapping from a store URL
    mapper = fs.get_mapper("gs://cmip6/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/")

    # make sure to specify that metadata is consolidated
    ds = xr.open_zarr(mapper, consolidated=True)

or, for the AWS datasets:

.. code-block:: python

    import s3fs
    import xarray as xr

    # connect to AWS S3 storage
    fs = s3fs.S3FileSystem(anon=True)

    # create a MutableMapping from a store URL
    mapper = fs.get_mapper("s3://cmip6-pds/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/")

    # make sure to specify that metadata is consolidated
    ds = xr.open_zarr(mapper, consolidated=True)

Note the option ``consolidated=True``, which relies on a consolidated metadata file to open and describe the Zarr data store with minimal data egress.

Manually searching the catalog
------------------------------

By downloading the master CSV file enumerating all available data stores, we can interact with the spreadsheet as a `pandas DataFrame `_ and search for relevant data using the `CMIP6 controlled vocabulary `_:

.. code-block:: python

    import pandas as pd

    # for Google Cloud:
    df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
    # for AWS S3:
    # df = pd.read_csv("https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.csv")

    df_subset = df.query("activity_id=='CMIP' & table_id=='Amon' & variable_id=='tas'")

From here, we can open any of the selected data stores using xarray, providing the value of the ``zstore`` column as input:

.. code-block:: python

    # get the path to a specific zarr store
    zstore = df_subset.zstore.values[-1]
    mapper = fs.get_mapper(zstore)

    # open using xarray
    ds = xr.open_zarr(mapper, consolidated=True)

When working with multiple data stores at the same time, it is often necessary to combine several of them into a single dataset for analysis. In these cases, it is easier to access them through an ESM collection with intake-esm. An ESM collection contains metadata describing how data stores can be combined into highly aggregated datasets; intake-esm uses this metadata to merge/concatenate the stores automatically when they are loaded into an xarray container. This spares the user from combining data manually, while still offering the ability to search and explore all of the available data stores.
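For a sense of what that aggregation involves, the sketch below stacks two assets by hand along a new ``member_id`` dimension. It reuses ``fs``, ``xr``, and ``df_subset`` from the snippets above and assumes, purely for illustration, that the first two rows of ``df_subset`` differ only in their ensemble member (same model, experiment, table, variable, and grid); intake-esm performs this kind of bookkeeping for you:

.. code-block:: python

    # minimal sketch of the manual combination that intake-esm automates;
    # assumes the two stores are compatible and differ only in member_id
    paths = df_subset.zstore.values[:2]
    members = [xr.open_zarr(fs.get_mapper(p), consolidated=True) for p in paths]

    # stack the members along a new "member_id" dimension
    combined = xr.concat(members, dim="member_id")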
Loading an ESM collection
-------------------------

To load an ESM collection with intake-esm, the user must provide a valid ESM collection specification as input to intake's ``open_esm_datastore()`` function:

.. code-block:: python

    import intake

    # for Google Cloud:
    col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
    # for AWS S3:
    # col = intake.open_esm_datastore("https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json")

    col

This gives a summary of the ESM collection, including the total number of Zarr data stores (referred to as assets) and the total number of datasets these assets correspond to. The collection can also be viewed as a DataFrame:

.. code-block:: python

    col.df.head()

Searching for datasets
----------------------

After exploring the controlled vocabulary, it is straightforward to get the data assets you want using intake-esm's ``search()`` function. In the example below, we search for the following:

- variables: ``tas``, which stands for near-surface air temperature
- experiments: ``["historical", "ssp245", "ssp585"]``:

  - ``historical``: all forcing of the recent past
  - ``ssp245``: update of `RCP4.5 `_ based on SSP2
  - ``ssp585``: emission-driven `RCP8.5 `_ based on SSP5

- table ID: ``Amon``, which stands for monthly atmospheric data
- grid label: ``gr``, which stands for regridded data reported on the data provider's preferred target grid

.. code-block:: python

    # form query dictionary
    query = dict(experiment_id=['historical', 'ssp245', 'ssp585'],
                 table_id='Amon',
                 variable_id=['tas'],
                 member_id='r1i1p1f1',
                 grid_label='gr')

    # subset the catalog and get some metrics grouped by 'source_id'
    col_subset = col.search(require_all_on=['source_id'], **query)
    col_subset.df.groupby('source_id')[['experiment_id', 'variable_id', 'table_id']].nunique()

Loading datasets
----------------

Once you've identified data assets of interest, you can load them into xarray dataset containers using intake-esm's ``to_dataset_dict()`` function. Invoking this function yields a Python dictionary of high-level aggregated xarray datasets. The logic for merging/concatenating the query results into datasets is provided in the input JSON file, under ``aggregation_control``:

.. code-block:: json

    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": [
            "activity_id",
            "institution_id",
            "source_id",
            "experiment_id",
            "table_id",
            "grid_label"
        ],
        "aggregations": [
            {
                "type": "union",
                "attribute_name": "variable_id"
            },
            {
                "type": "join_new",
                "attribute_name": "member_id",
                "options": {
                    "coords": "minimal",
                    "compat": "override"
                }
            },
            {
                "type": "join_new",
                "attribute_name": "dcpp_init_year",
                "options": {
                    "coords": "minimal",
                    "compat": "override"
                }
            }
        ]
    }

Though these aggregation specifications are sufficient to merge individual data assets into xarray datasets, additional arguments must sometimes be provided depending on the format of the data assets. For example, Zarr-based assets can be loaded with the option ``consolidated=True``, which relies on a consolidated metadata file to describe the assets with minimal data egress:

.. code-block:: python

    dsets = col_subset.to_dataset_dict(zarr_kwargs={'consolidated': True},
                                       storage_options={'token': 'anon'})

    # list all merged datasets
    [key for key in dsets.keys()]

When the datasets have finished loading, we can extract any of them like we would a value in a Python dictionary:
.. code-block:: python

    ds = dsets['ScenarioMIP.THU.CIESM.ssp585.Amon.gr']
    ds

Preprocessing the CMIP6 datasets
--------------------------------

Once you are comfortable with the basic ``intake-esm`` features, you may notice that many datasets cannot be easily combined and manipulated without some time-consuming debugging. Julius Busecke's very useful package, `cmip6_preprocessing `_, does some of this cleanup for you, especially for the very tricky 'Omon' datasets. See, for example, this `tutorial `_.

.. code-block:: python

    from cmip6_preprocessing.preprocessing import combined_preprocessing

and then you can use this when calling ``to_dataset_dict()``:

.. code-block:: python

    dsets = col_subset.to_dataset_dict(
        zarr_kwargs={'consolidated': True, 'decode_times': False},
        aggregate=True,
        preprocess=combined_preprocessing,
        storage_options={'token': 'anon'}
    )

    # AWS needs a slightly different syntax for the storage options
    dsets = col_subset.to_dataset_dict(
        zarr_kwargs={'consolidated': True, 'decode_times': False},
        aggregate=True,
        preprocess=combined_preprocessing,
        storage_options={'anon': True}
    )
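Once the dictionary of datasets is in hand, any entry can be analyzed like an ordinary xarray dataset. The snippet below is a minimal, illustrative example rather than part of the catalog workflow: it computes an area-weighted (cosine of latitude) global-mean ``tas`` time series. The key shown is only an example (list ``dsets.keys()`` to see what actually loaded), and the grid dimensions are assumed to be named ``lat`` and ``lon``, as in the aggregated datasets loaded in the previous section:

.. code-block:: python

    import numpy as np

    # pick one of the aggregated datasets (use dsets.keys() to see what loaded)
    ds = dsets['ScenarioMIP.THU.CIESM.ssp585.Amon.gr']

    # area weights for a regular lat-lon grid: proportional to cos(latitude)
    weights = np.cos(np.deg2rad(ds.lat))

    # global-mean near-surface air temperature time series
    tas_global = ds.tas.weighted(weights).mean(dim=['lat', 'lon'])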