Introduction to the Pangeo ecosystem
Handling multi-dimensional arrays with xarray
Context¶
We will be using the Pangeo open-source software stack for computing and visualizing the Vegetation Condition Index (VCI) Kogan, 1995, a well-established indicator to estimate droughts from remote sensing data.
VCI compares the current normalized difference vegetation index (NDVI) Wikipedia, 2022 to the range of values observed in previous years.
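For reference, VCI is commonly defined as VCI = 100 * (NDVI - NDVI_min) / (NDVI_max - NDVI_min), where NDVI_min and NDVI_max are the minimum and maximum NDVI observed for the same period over the historical record.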
Data¶
In this episode, we will use Sentinel-3 NDVI Analysis Ready Data (ARD) provided by the Copernicus Global Land Service.
This dataset can be discovered through the OpenEO API from the CGLS distributor, VITO. Access is free of charge but an EGI registration is needed.
The same dataset can also be downloaded from Zenodo: FOSS4G Training Datasets: NDVI
Further info about drought indices can be found in the Integrated Drought Management Programme (see here).
Setup¶
This episode uses the following main Python packages:
- xarray Hoyer & Hamman, 2017 with the netCDF4 and h5netcdf engines
- pooch Uieda et al., 2020
- numpy Harris et al., 2020
Please install these packages if they are not already available in your Python environment (see Setup page).
Packages¶
In this episode, Python packages are imported when we start to use them. However, for best software practices, we recommend that you install and import all the necessary libraries at the top of your Jupyter notebook.
import xarray as xr
Fetch Data¶
- For now we will fetch a netCDF file containing Sentinel-3 NDVI Analysis Ready Data (ARD).
- The file is available in a Zenodo repository. We will download it using pooch, a very handy Python-based library to download and cache your data files locally (see further info here).
- In the Data access and discovery episode, we will learn about different ways to access data, including access to remote data.
import pooch
cgls_file = pooch.retrieve(
url="https://zenodo.org/record/6969999/files/C_GLS_NDVI_20220101_20220701_Lombardia_S3_2.nc",
known_hash="md5:bbb25f1865056c886c6f9b37147d8f2f",
    path=".",
)
Open and read metadata through Xarray¶
cgls_ds = xr.open_dataset(cgls_file)
As the dataset is in the netCDF format, Xarray automatically selects the correct engine (this happens in the background because engine='netcdf4' has been automatically set). Other common options are 'h5netcdf' or 'zarr'. GeoTIFF data can also be read, but this requires rioxarray, which will be covered briefly later. If you have a dataset in an unrecognised format, you can always create your own reader as a subclass of the backend entry point and pass it through the engine parameter.
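For instance, the engine could also be passed explicitly; a minimal sketch (assuming the h5netcdf package listed in the Setup is installed):
xr.open_dataset(cgls_file, engine='h5netcdf')  # equivalent to the automatic engine selection above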
cgls_ds
What is xarray?¶
Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multi-dimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
How is xarray structured?¶
Xarray has two core data structures, which build upon and extend the core strengths of the NumPy and pandas libraries. Both data structures are fundamentally N-dimensional:
DataArray is the implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez's datarray project, which prototyped a similar data structure.
Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray as the pandas.DataFrame.
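As a minimal illustration (a toy sketch, unrelated to the dataset used in this episode), both structures can be built directly from a NumPy array with named dimensions and coordinates:
import numpy as np
import xarray as xr

# A small labeled 2-D array: 2 time steps x 3 longitudes (toy values)
da = xr.DataArray(
    np.random.rand(2, 3),
    dims=('time', 'lon'),
    coords={'time': ['2022-01', '2022-02'], 'lon': [8.5, 9.0, 9.5]},
    name='toy_ndvi',
)
ds = xr.Dataset({'toy_ndvi': da})  # dict-like container of DataArray objects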
Accessing Coordinates and Data Variables¶
DataArrays within a Dataset can be accessed through:
- the dot notation, like Dataset.NameofVariable
- or square brackets, like Dataset['NameofVariable'] (NameofVariable needs to be a string, so use single or double quotes)
cgls_ds.t
cgls_ds.t is a one-dimensional xarray.DataArray with dates of type datetime64[ns].
cgls_ds.NDVI
cgls_ds.NDVI is a three-dimensional xarray.DataArray with NDVI values of type uint8.
cgls_ds['NDVI']
The same can be done for attributes: DataArray.attrs returns a dictionary.
cgls_ds['NDVI'].attrs
Xarray and Memory usage¶
When a Dataset or DataArray is opened, xarray loads into memory only the coordinates and the metadata needed to describe it. The underlying data, i.e. the values written in the datastore, are loaded into memory as a NumPy array only once they are accessed directly; once loaded, they are kept in memory to avoid re-reading. It is therefore good practice to check the size of the data before accessing it. A classical mistake is to try to load an array bigger than the available memory, with the obvious result of killing the notebook kernel or Python process. If the dataset does not fit into the available memory, the only option is to load it through chunking; we will introduce this concept later, in the 'chunking_introduction' tutorial.
As the size of the data is not too big here, we can continue without any problem. But let's first have a look at the actual size and how much memory it takes once loaded.
import numpy as np
print(f'{np.round(cgls_ds.NDVI.nbytes / 1024**2, 2)} MB') # all the data are automatically loaded into memory as a NumPy array once accessed
cgls_ds.NDVI.data
Renaming Coordinates and Data Variables¶
As other datasets usually name their dimensions according to the more common triad lat, lon, time, a renaming is needed.
cgls_ds = cgls_ds.rename(x='lon', y='lat', t='time')
cgls_ds
Selection methods¶
Underneath, DataArrays are NumPy arrays, which implement the standard Python x[obj] syntax (x: array, obj: int or slice), so their data can be accessed with the same NumPy indexing approach.
cgls_ds.NDVI[0,100,100]
or slicing
cgls_ds.NDVI[0:5,100:110,100:110]
As it is not easy to remember the order of dimensions, Xarray really helps by making it possible to select the position using names:
- .isel -> selection based on positional index
- .sel -> selection based on coordinate values
We first check the number of elements in each coordinate of the NDVI Data Variable using the built-in method sizes. The same result can be achieved by querying each coordinate with the Python built-in function len, as shown after the next cell.
cgls_ds.sizes
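The same counts can be obtained coordinate by coordinate with len, for instance:
len(cgls_ds.time), len(cgls_ds.lat), len(cgls_ds.lon)  # lengths of the three coordinates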
cgls_ds.NDVI.isel(time=0, lat=100, lon=100) # same as cgls_ds.NDVI[0,100,100]
The more common way to select a point is through the labeled coordinates, using the .sel method.
cgls_ds.NDVI.sel(time='2022-01-01')
Time is easy to use because there is a one-to-one correspondence with the values in the index; float values are harder to use, and a small discrepancy can make a big difference in the result.
Coordinates are always affected by precision issues; the best option to quickly select a point is to set the sampling method (method=...), which will search for the point closest to the specified one.
Options for the method are:
- pad / ffill: propagate last valid index value forward
- backfill / bfill: propagate next valid index value backward
- nearest: use nearest valid index value
Another important parameter that can be set is tolerance, which specifies the maximum distance between the requested and the target value (so that abs(index[indexer] - target) <= tolerance), as described in the documentation and sketched further below.
cgls_ds.sel(lat=46.3, lon=8.8, method='nearest')
cgls_ds.isel(lon=100).lon.values.item()
cgls_ds.isel(lat=100).lat.values.item()
cgls_ds.sel(lat=46.336112857142965, lon=8.799356142858498)
That is why we use a method! It makes your life easier to deal with inexact matches.
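The tolerance parameter mentioned above can be combined with method; a minimal sketch (the 0.1 degree value is an arbitrary choice, and a point farther away than that would raise an error):
cgls_ds.sel(lat=46.3, lon=8.8, method='nearest', tolerance=0.1)  # nearest match, but only within 0.1 degrees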
As the exercise focuses on an Area Of Interest, it can be selected with a bounding box defined using slices.
NDVI_AOI = cgls_ds.NDVI.sel(lat=slice(46.5,44.5), lon=slice(8.5,11.5))
NDVI_AOI
Plotting¶
Data can easily be plotted through the matplotlib.pyplot back-end (see the matplotlib documentation).
NDVI_AOI.isel(time=0).plot(cmap="RdYlGn")
In the next episode, we will learn more about advanced visualization tools and how to make interactive plots using holoviews, a tool part of the HoloViz ecosystem.
Basic maths¶
The NDVI values look a little odd compared to the standard NDVI range [-1, +1]: they are raw digital numbers, consistent with the maximum values reported in the Product User Manual (PUM).
NDVI characteristics from the Product User Manual (PUM)¶
| layer name | description | physical min | physical max | digital max | scaling | offset | No Data |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ndvi | normalized difference vegetation index | -0.08 | 0.92 | 250 | 1/250 | -0.08 | 254, 255 |
| ndvi_unc | uncertainty on ndvi | 0 | 1 | 1000 | 1/1000 | 0 | 65535 |
| nobs | number of observations | 0 | 32 | 32 | 1 | 0 | 255 |
| qflag | bitwise quality flag | - | - | 254 | 1 | 0 | 255 |
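For example, the digital maximum of the ndvi layer converts back to 250 * (1/250) - 0.08 = 0.92, i.e. the physical maximum; this is why the raw digital numbers fall outside the usual [-1, +1] NDVI range.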
Simple arithmetic operations can be performed without worrying about dimensions and coordinates, using the same notation we use with numpy. Underneath, xarray will automatically vectorize the operations over all the data dimensions.
NDVI_AOI * (1/250) - 0.08
NumPy and SciPy universal functions (ufuncs) can also be applied directly to the data.
np.subtract(np.multiply(NDVI_AOI, 0.004), 0.08)
NDVI_AOI = NDVI_AOI * (1/250) - 0.08
NDVI_AOI
Statistics¶
All the standard statistical operations, such as min, max, and mean, can be used. When no argument is passed to the function, the operation is done over all the dimensions of the variable (same as with numpy).
NDVI_AOI.min()
You can also apply a statistical operation over a given dimension. For instance, let's retrieve, at each grid point, the maximum value over time.
NDVI_AOI.max(dim='time')
Aggregation¶
We have data every 10 days. To obtain monthly values, we can group values per month and compute the mean.
NDVI_monthly = NDVI_AOI.groupby(NDVI_AOI.time.dt.month).mean()
NDVI_monthly
As we have data from January to July, the time dimension is now month and takes values from 1 to 7.
NDVI_monthly.month
Mask¶
Not all values are valid, and masking all those outside the valid range [-0.08, 0.92] is necessary. Masking can be achieved with the method Dataset|DataArray.where(cond, other) or with xr.where(cond, x, y).
The difference is that xr.where(cond, x, y) lets you specify the value for both the true and the false case, while Dataset|DataArray.where(cond, other) only lets you define the value used where the condition is False (np.nan by default).
NDVI_masked = NDVI_AOI.where((NDVI_AOI >= -0.08) & (NDVI_AOI <= 0.92))
NDVI_masked.isel(time=0).plot()
To better visualize the mask, an ad-hoc variable can be created with the help of xr.where, which lets us specify the value 1 for masked and 0 for unmasked data.
mask = xr.where((NDVI_AOI <= -0.08) | (NDVI_AOI >= 0.92), 1, 0)
mask.isel(time=0).plot()
Plot a single point (defined by its latitude and longitude) over the time dimension.
NDVI_masked.sel(lat=45.88, lon=8.63, method='nearest').plot()
Save xarray Dataset¶
It is very often convenient to save intermediate or final results into a local file. We will learn more about the different file formats Xarray can handle, but for now let’s save it as a netCDF file. Check the file size after saving the result into netCDF.
NDVI_masked.to_netcdf('C_GLS_NDVI_20220101_20220701_Lombardia_S3_2_masked.nc')
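One way to check the size of the file just written, using only the Python standard library (a minimal sketch):
import os

# size of the saved netCDF file in megabytes
size_mb = os.path.getsize('C_GLS_NDVI_20220101_20220701_Lombardia_S3_2_masked.nc') / 1024**2
print(f'{size_mb:.2f} MB')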
Advanced Saving methods¶
Encoding and Compression¶
From the NDVI dataset we already know that the values are encoded and can be thought of as pure Digital Numbers (DN). To convert them to physical values (PhyVal), the formula PhyVal = DN * scale_factor + add_offset has to be applied. To achieve the same result when writing, and transform our PhyVal back to DN, 4 different parameters have to be defined:
- dtype: the datatype specification, either as a numpy type (np.int16, np.float32) or as a string that can be converted to one. Here we use np.uint8 as values will range only up to 255.
- _FillValue: a value that substitutes the NaNs. Some casts do not allow the conversion of NaNs, as there is no representation for that value in the target type (e.g. from float to int), so an alternative value within the acceptable range needs to be specified.
- scale_factor & add_offset: values are converted through the scaling and offset parameters according to the formula decoded = scale_factor * encoded + add_offset.
A compression method can be defined as well; if the format is netCDF4, with the engine set to 'netcdf4' or 'h5netcdf', there are different compression options. The easiest solution is to stick with the default one for NetCDF4 files.
Note that the encoding parameters need to be passed through a nested dictionary, and parameters have to be defined for each single variable.
NDVI_masked.to_netcdf('C_GLS_NDVI_20220101_20220701_Lombardia_S3_2_mcs.nc',
engine='netcdf4',
encoding={'NDVI':{"dtype": np.uint8,
'_FillValue': 255,
'scale_factor':0.004,
'add_offset':-0.08,
'zlib': True, 'complevel':4}
}
)
Through the datatype conversion and the compression, the file size has been reduced by almost a factor of 10; as a drawback, reading speed is decreased.
References¶
Packages citation¶
- Kogan, F. N. (1995). Application of vegetation index and brightness temperature for drought detection. Advances in Space Research, 15(11), 91–100. https://doi.org/10.1016/0273-1177(95)00079-T
- Wikipedia. (2022, accessed August 7, 2022). Normalized difference vegetation index. https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index
- Hoyer, S., & Hamman, J. (2017). xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1). https://doi.org/10.5334/jors.148
- Uieda, L., Soler, S. R., Rampin, R., van Kemenade, H., Turk, M., Shapero, D., Banihirwe, A., & Leeman, J. (2020). Pooch: A friend to fetch your data files. Journal of Open Source Software, 5(45), 1943. https://doi.org/10.21105/joss.01943
- Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2