Save larger files from bucket to bucket using FORCES MinIO

It is important to save your results in a place that will last longer than a few days or weeks!
- For example, when you have saved data locally on your JupyterLab instance and want to make a backup on https://forces2021.uiogeo-apps.sigma2.no/
import os
import pathlib
import s3fs
import xarray as xr

Connect to bucket (anonymous login for public data only)

fs = s3fs.S3FileSystem(anon=True,
      client_kwargs={
         'endpoint_url': 'https://climate.uiogeo-apps.sigma2.no/'
      })

Get data into xarray

s3path = 's3://ESGF/CMIP6/GeoMIP/MPI-M/*/G6sulfur/*/day/tasmin/gn/*/*.nc'
remote_files = fs.glob(s3path)
remote_files
['ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r1i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn_20150101-20341231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r1i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn_20350101-20541231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r1i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn_20550101-20741231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r1i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn_20750101-20941231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r1i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn_20950101-20991231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r2i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r2i1p1f1_gn_20150101-20341231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r2i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r2i1p1f1_gn_20350101-20541231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r2i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r2i1p1f1_gn_20550101-20741231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r2i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r2i1p1f1_gn_20750101-20941231.nc',
 'ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur/r2i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r2i1p1f1_gn_20950101-21001231.nc']
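The listing above contains files from two ensemble members (r1i1p1f1 and r2i1p1f1). The following is a minimal sketch, using only the standard library, of how the file names could be grouped by member before opening them. It assumes the file name layout seen in the listing (underscore-separated fields with the variant label in fifth position); the helper name `variant_label` and the shortened sample list are illustrative and not part of the original notebook.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# A shortened sample of the paths returned by fs.glob() above.
prefix = "ESGF/CMIP6/GeoMIP/MPI-M/MPI-ESM1-2-LR/G6sulfur"
sample_files = [
    f"{prefix}/r1i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn_20150101-20341231.nc",
    f"{prefix}/r1i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn_20350101-20541231.nc",
    f"{prefix}/r2i1p1f1/day/tasmin/gn/v20190710/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r2i1p1f1_gn_20150101-20341231.nc",
]


def variant_label(path):
    """Return the variant label (hypothetical helper; assumes the layout above)."""
    # Fields: variable, table, model, experiment, member, grid, time range
    fields = PurePosixPath(path).stem.split("_")
    return fields[4]


# Group the paths by ensemble member.
by_member = defaultdict(list)
for path in sample_files:
    by_member[variant_label(path)].append(path)

for member, paths in sorted(by_member.items()):
    print(member, len(paths))
```

Each group could then be passed to `fs.open` and `xr.open_mfdataset` separately if the members need to be kept apart.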
# Iterate through remote_files to create a fileset
fileset = [fs.open(file) for file in remote_files]

# Open all files lazily as one dataset, combined along their coordinates
dset = xr.open_mfdataset(fileset, combine='by_coords', use_cftime=True)
dset
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 96, lon: 192, time: 31411)
Coordinates:
  * time       (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
  * lat        (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 2.0
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object dask.array<chunksize=(7305, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(7305, 96, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(7305, 192, 2), meta=np.ndarray>
    tasmin     (time, lat, lon) float32 dask.array<chunksize=(7305, 96, 192), meta=np.ndarray>
Attributes: (12/48)
    CDO:                    Climate Data Operators version 1.9.9rc8 (https://...
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            GeoMIP
    branch_method:          standard
    branch_time_in_child:   [60265.]
    branch_time_in_parent:  [60265.]
    ...                     ...
    title:                  MPI-ESM1-2-LR output prepared for CMIP6
    variable_id:            tasmin
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by MPI-M is licensed un...
    cmor_version:           3.6.0
    tracking_id:            hdl:21.14100/dd1cb01a-dbe1-4096-835a-c9604879eea8

Check the size (MB) of our dataset

dset.nbytes / 1e6
2461.368272

Our dataset is a bit more than 2.4 GB.
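To see where this number comes from, the size can be reproduced by hand from the shapes and dtypes in the dataset summary. This is a sketch only: the shapes are copied from the summary, and it assumes that the `object` (cftime) arrays count 8 bytes per element, that is, one pointer each.

```python
import math

# Dimension lengths copied from the dataset summary.
TIME, LAT, LON, BNDS = 31411, 96, 192, 2

# name -> (shape, bytes per element)
# Assumption: "object" arrays (time, time_bnds) store one 8-byte pointer per element.
variables = {
    "tasmin": ((TIME, LAT, LON), 4),      # float32
    "lat_bnds": ((TIME, LAT, BNDS), 8),   # float64
    "lon_bnds": ((TIME, LON, BNDS), 8),   # float64
    "time_bnds": ((TIME, BNDS), 8),       # object
    "time": ((TIME,), 8),                 # object
    "lat": ((LAT,), 8),                   # float64
    "lon": ((LON,), 8),                   # float64
    "height": ((), 8),                    # float64 scalar
}

# Bytes per variable: number of elements times bytes per element.
sizes = {name: math.prod(shape) * itemsize
         for name, (shape, itemsize) in variables.items()}
total = sum(sizes.values())

print(f"total: {total / 1e6} MB")
print(f"tasmin share: {sizes['tasmin'] / total:.1%}")
```

Under these assumptions the sum equals the value reported by `dset.nbytes`, and almost all of it comes from `tasmin` itself.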

Save file from memory to bucket

%%time
dset.load()
CPU times: user 182 µs, sys: 27 µs, total: 209 µs
Wall time: 218 µs
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 96, lon: 192, time: 31411)
Coordinates:
  * time       (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
  * lat        (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 2.0
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object 2015-01-01 00:00:00 ... 2101-01-01 00:00:00
    lat_bnds   (time, lat, bnds) float64 -89.5 -87.65 -87.65 ... 87.65 89.5
    lon_bnds   (time, lon, bnds) float64 -0.9375 0.9375 0.9375 ... 357.2 359.1
    tasmin     (time, lat, lon) float32 242.5 242.5 242.4 ... 253.4 253.4 253.4
Attributes: (12/48)
    CDO:                    Climate Data Operators version 1.9.9rc8 (https://...
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            GeoMIP
    branch_method:          standard
    branch_time_in_child:   [60265.]
    branch_time_in_parent:  [60265.]
    ...                     ...
    title:                  MPI-ESM1-2-LR output prepared for CMIP6
    variable_id:            tasmin
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by MPI-M is licensed un...
    cmor_version:           3.6.0
    tracking_id:            hdl:21.14100/dd1cb01a-dbe1-4096-835a-c9604879eea8

Save your results to remote private object storage

  • your credentials are in $HOME/.aws/credentials

  • ask your instructor for the secret access key and replace XXXXXXXXXXXX with it

[default]
aws_access_key_id=forces2021-work
aws_secret_access_key=XXXXXXXXXXXX
aws_endpoint_url=https://forces2021.uiogeo-apps.sigma2.no/
target = s3fs.S3FileSystem(anon=False,
      client_kwargs={
         'endpoint_url': 'https://forces2021.uiogeo-apps.sigma2.no/'
      })
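If the connection fails, a first thing to check is that the credentials file has the expected layout. The sketch below parses a file of that form with the standard library `configparser`. It writes a temporary copy with the placeholder key rather than reading your real `$HOME/.aws/credentials`, and the helper name `read_profile` is an illustrative assumption.

```python
import configparser
import tempfile
from pathlib import Path


def read_profile(path, profile="default"):
    """Return the key/value pairs of one profile in an AWS-style credentials file."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return dict(parser[profile])


# Write a stand-in credentials file (placeholder secret, as in the text above).
with tempfile.TemporaryDirectory() as tmp:
    cred_file = Path(tmp) / "credentials"
    cred_file.write_text(
        "[default]\n"
        "aws_access_key_id=forces2021-work\n"
        "aws_secret_access_key=XXXXXXXXXXXX\n"
    )
    profile = read_profile(cred_file)

# Print only the key names, never the secret itself.
print(sorted(profile))
```

To check your own file, the same function could be pointed at `Path.home() / ".aws" / "credentials"`.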

Save as netCDF

  • netCDF is not a cloud-optimized format so it may be slow

s3_path =  "s3://work/annefou/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn.nc"
print(s3_path)
s3://work/annefou/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn.nc
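In this notebook the same object is referred to both with the `s3://` prefix (when writing) and without it (when reading back). As a small sketch, the helper below splits either form into bucket and key, which makes it clear that `work` is the bucket and the rest is the object key. The function name `split_s3_path` is hypothetical and not part of s3fs.

```python
from urllib.parse import urlparse


def split_s3_path(path):
    """Split 's3://bucket/key' or 'bucket/key' into (bucket, key)."""
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        # s3://bucket/key: the bucket is the network location part
        return parsed.netloc, parsed.path.lstrip("/")
    # bucket/key: the bucket is everything before the first slash
    bucket, _, key = path.partition("/")
    return bucket, key


name = "annefou/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn.nc"
with_prefix = split_s3_path("s3://work/" + name)
without_prefix = split_s3_path("work/" + name)
print(with_prefix)
print(without_prefix)
```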
# Serialize the dataset to netCDF in memory, then write it to the bucket
with target.open(s3_path, 'wb') as f:
    f.write(dset.to_netcdf(None))
/opt/conda/lib/python3.8/site-packages/xarray/conventions.py:441: UserWarning: Variable 'time' has datetime type and a bounds variable but time.encoding does not have units specified. The units encodings for 'time' and 'time_bnds' will be determined independently and may not be equal, counter to CF-conventions. If this is a concern, specify a units encoding for 'time' before writing to a file.
  warnings.warn(

Then you can use the remote file

remote_file = ['work/annefou/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn.nc']
fileset = [target.open(file) for file in remote_file]
%%time
ds_check = xr.open_mfdataset(fileset, combine='by_coords', use_cftime=True)
ds_check
CPU times: user 20.8 s, sys: 4.15 s, total: 24.9 s
Wall time: 51.1 s
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 96, lon: 192, time: 31411)
Coordinates:
  * time       (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
  * lat        (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(31411, 192, 2), meta=np.ndarray>
    tasmin     (time, lat, lon) float32 dask.array<chunksize=(31411, 96, 192), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(31411, 96, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(31411, 2), meta=np.ndarray>
Attributes: (12/48)
    CDO:                    Climate Data Operators version 1.9.9rc8 (https://...
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            GeoMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    ...                     ...
    title:                  MPI-ESM1-2-LR output prepared for CMIP6
    variable_id:            tasmin
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by MPI-M is licensed un...
    cmor_version:           3.6.0
    tracking_id:            hdl:21.14100/dd1cb01a-dbe1-4096-835a-c9604879eea8
%%time
ds_seas = ds_check.groupby('time.season').mean('time', keep_attrs=True, skipna = True)
CPU times: user 227 ms, sys: 2.54 ms, total: 229 ms
Wall time: 246 ms

Save as Zarr

  • it usually takes longer to save but it is much faster to read

dset.load()
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 96, lon: 192, time: 31411)
Coordinates:
  * time       (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
  * lat        (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 2.0
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object 2015-01-01 00:00:00 ... 2101-01-01 00:00:00
    lat_bnds   (time, lat, bnds) float64 -89.5 -87.65 -87.65 ... 87.65 89.5
    lon_bnds   (time, lon, bnds) float64 -0.9375 0.9375 0.9375 ... 357.2 359.1
    tasmin     (time, lat, lon) float32 242.5 242.5 242.4 ... 253.4 253.4 253.4
Attributes: (12/48)
    CDO:                    Climate Data Operators version 1.9.9rc8 (https://...
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            GeoMIP
    branch_method:          standard
    branch_time_in_child:   [60265.]
    branch_time_in_parent:  [60265.]
    ...                     ...
    title:                  MPI-ESM1-2-LR output prepared for CMIP6
    variable_id:            tasmin
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by MPI-M is licensed un...
    cmor_version:           3.6.0
    tracking_id:            hdl:21.14100/dd1cb01a-dbe1-4096-835a-c9604879eea8
s3_path =  "s3://work/annefou/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn.zarr"
print(s3_path)
s3://work/annefou/tasmin_day_MPI-ESM1-2-LR_G6sulfur_r1i1p1f1_gn.zarr
store = s3fs.S3Map(root=s3_path, s3=target, check=False)
%%time
dset.to_zarr(store=store, mode="w", consolidated=True, compute=True)
CPU times: user 35.2 s, sys: 6.62 s, total: 41.8 s
Wall time: 1min
<xarray.backends.zarr.ZarrStore at 0x7f651e2a3040>

Then you can use the remote file

  • loading Zarr is usually faster, especially with large datasets

%%time
ds_check = xr.open_zarr(store=store, consolidated=True)
ds_check
CPU times: user 90.7 ms, sys: 5.84 ms, total: 96.5 ms
Wall time: 697 ms
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 96, lon: 192, time: 31411)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
  * time       (time) datetime64[ns] 2015-01-01T12:00:00 ... 2100-12-31T12:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(3927, 24, 1), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(3927, 48, 1), meta=np.ndarray>
    tasmin     (time, lat, lon) float32 dask.array<chunksize=(1964, 12, 24), meta=np.ndarray>
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(15706, 2), meta=np.ndarray>
Attributes: (12/48)
    CDO:                    Climate Data Operators version 1.9.9rc8 (https://...
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            GeoMIP
    branch_method:          standard
    branch_time_in_child:   [60265.0]
    branch_time_in_parent:  [60265.0]
    ...                     ...
    table_id:               day
    table_info:             Creation Date:(09 May 2019) MD5:5f007c16960eee824...
    title:                  MPI-ESM1-2-LR output prepared for CMIP6
    tracking_id:            hdl:21.14100/dd1cb01a-dbe1-4096-835a-c9604879eea8
    variable_id:            tasmin
    variant_label:          r1i1p1f1
%%time
ds_seas = ds_check.groupby('time.season').mean('time', keep_attrs=True, skipna = True)
CPU times: user 186 ms, sys: 2.35 ms, total: 188 ms
Wall time: 201 ms
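The chunk sizes in the Zarr dataset summary help explain the difference in opening time: the netCDF copy was opened as a single chunk spanning the whole `tasmin` array, while the Zarr store splits it into many small pieces that can be fetched independently. The following sketch works out the chunk layout from the shape and chunk sizes printed above. The sizes are uncompressed, so the objects actually stored in the bucket may be smaller.

```python
import math

# Values copied from the Zarr dataset summary for tasmin (float32).
shape = (31411, 96, 192)    # (time, lat, lon)
chunks = (1964, 12, 24)     # chunk size along each dimension
itemsize = 4                # bytes per float32 value

# Number of chunks along each dimension (the last chunk may be partial).
chunks_per_dim = [math.ceil(n / c) for n, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)

# Uncompressed size of one full chunk.
chunk_bytes = math.prod(chunks) * itemsize

print(chunks_per_dim, n_chunks)
print(f"{chunk_bytes / 1e6:.2f} MB per chunk (uncompressed)")
```

With these numbers, reading a small region only requires the chunks that overlap it rather than the whole 2.3 GB variable.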