Python Ecosystem for EO#

Core Python Libraries Overview#

Python has become the lingua franca of EO processing through powerful libraries. Before diving into the details, let’s revisit data cubes and how we can represent them in Python. EO data cubes in Python are typically represented as labeled multidimensional arrays that integrate data values, coordinates, and metadata. Consider, for example, a 6x7 raster with 4 spectral bands collected over 3 time points. This structured representation allows efficient and intuitive access to complex EO datasets.
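To make this concrete, here is a minimal sketch of such a cube in xarray. The values are random and the Sentinel-2-like band names and 10 m pixel spacing are made up for illustration:

```python
import numpy as np
import xarray as xr

# A hypothetical 4-D data cube matching Fig. 1: 3 time steps,
# 4 spectral bands, and a 6x7 (y, x) raster.
cube = xr.DataArray(
    np.random.rand(3, 4, 6, 7),
    dims=("time", "band", "y", "x"),
    coords={
        "time": np.array(
            ["2025-01-01", "2025-01-11", "2025-01-21"], dtype="datetime64[ns]"
        ),
        "band": ["B02", "B03", "B04", "B08"],
        "y": np.arange(6) * 10,  # fictitious 10 m pixel spacing
        "x": np.arange(7) * 10,
    },
    name="reflectance",
)
print(cube.sizes)  # Frozen({'time': 3, 'band': 4, 'y': 6, 'x': 7})
```

Labels make downstream code self-documenting: `cube.sel(band="B04")` reads far better than `array[:, 2]`.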

https://openeo.org/assets/img/dc_timeseries.c2c7a902.png

Fig. 1 An exemplary raster data cube with 4 dimensions: x, y, bands and time.#

The demo below uses the Pangeo ecosystem to access and process data.

Note

Pangeo is a community-driven approach to advancing open-source Earth Observation tools across disciplines.

_images/PangeoEurope.png

The list of Python packages below is not exhaustive. Its purpose is mainly to draw a visual picture of the different components typically needed for Earth Observation workflows.

Data Models#

  • xarray → N-dimensional arrays (gridded data).

  • geopandas → Tabular vector data (points, lines, polygons).

  • shapely → Geometries & spatial operations.

  • pyproj → Coordinate systems & reprojection.

  • xarray-eopf → Extends xarray with EO Processing Framework (EOPF) support.

Storage Solutions (I/O & Formats)#

  • rioxarray → GeoTIFFs + CRS handling.

  • rasterio → Raster I/O.

  • fiona → Vector I/O (GeoJSON, Shapefile).

  • Zarr / NetCDF (via xarray) → Cloud-native, chunked array storage.

Catalogs & Metadata (optional)#

Catalogs are not strictly required — you can load files directly — but they become invaluable when dealing with large, multi-sensor, or cloud-hosted EO datasets.

  • pystac → STAC objects (Items, Collections, Catalogs).

  • pystac-client → Search STAC APIs (successor to the now-archived sat-search).

  • stackstac → Turn STAC items into xarray stacks.

  • odc-stac → Load STAC into xarray (optimized for EO).

Scalable Compute#

  • dask → Parallel/distributed computing.

  • pangeo stack → Dask + xarray ecosystem for scalable EO.

User Interfaces & Visualization#

  • hvplot → High-level, interactive plotting (works with xarray, geopandas, dask).

  • holoviews / panel → Dashboards and advanced UIs.

  • folium / ipyleaflet → Interactive maps in notebooks.

  • cartopy → Map projections & static plots.

Resource Management & Deployment (optional)#

  • Kubernetes / Dask Gateway → Manage compute clusters.

  • Pangeo Hub / JupyterHub → Multi-user access to scalable EO environments.

Data cubes and Lazy data loading with Xarray#

When accessing data through an API or cloud service, data is often lazily loaded. This means that initially only the metadata is retrieved, allowing us to understand the data’s dimensions, spatial and temporal extents, available spectral bands, and additional context without downloading the entire dataset.

Xarray supports this approach effectively, providing a powerful interface to open, explore, and manipulate large EO data cubes efficiently.

Let’s open an example dataset to explore these capabilities.

Note

xarray-eopf is a Python package that extends xarray with a custom backend called “eopf-zarr”. This backend enables seamless access to ESA EOPF data products stored in the Zarr format, presenting them as analysis-ready data structures.

This notebook demonstrates how to use the xarray-eopf backend to explore and analyze EOPF Zarr datasets. It highlights the key features currently supported by the backend.

import xarray as xr
xr.set_options(display_expand_attrs=False)
<xarray.core.options.set_options at 0x1083ddb90>
path = (
    "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202505-s02msil2a/18/products"
    "/cpm_v256/S2B_MSIL2A_20250518T112119_N0511_R037_T29RLL_20250518T140519.zarr"
)
ds = xr.open_datatree(path, engine="eopf-zarr", op_mode="native", chunks={})
ds
<xarray.DatasetView> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes: (2)

What is xarray?#

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multi-dimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.

How is xarray structured?#

Xarray has core data structures that build upon and extend the strengths of the NumPy and Pandas libraries. All of them are fundamentally N-dimensional:

  • DataTree is a tree-like hierarchical collection of xarray objects.

  • DataArray is the implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.

  • Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray as the pandas.DataFrame.
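A minimal sketch of the latter two structures (DataTree, which nests several Datasets hierarchically as in the EOPF product opened above, is omitted for brevity); the variable names here are invented for illustration:

```python
import numpy as np
import xarray as xr

# A labeled 2-D array: a DataArray.
temp = xr.DataArray(np.zeros((2, 3)), dims=("y", "x"), name="temperature")

# A dict-like container of aligned DataArrays: a Dataset.
ds = xr.Dataset(
    {
        "temperature": temp,
        "mask": xr.ones_like(temp).rename("mask"),
    }
)
print(list(ds.data_vars))  # ['temperature', 'mask']
```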

Accessing Coordinates and Data Variables#

DataArray objects within a Dataset can be accessed through:

  • dot notation, like Dataset.NameofVariable

  • or square brackets, like Dataset["NameofVariable"] (NameofVariable must be a string, so use single or double quotes)

ds["quality/l2a_quicklook/r60m/tci"]
<xarray.DataArray 'tci' (band: 3, y: 1830, x: 1830)> Size: 10MB
dask.array<open_dataset-tci, shape=(3, 1830, 1830), dtype=uint8, chunksize=(1, 305, 305), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 24B 1 2 3
  * x        (x) int64 15kB 300030 300090 300150 300210 ... 409650 409710 409770
  * y        (y) int64 15kB 3099990 3099930 3099870 ... 2990370 2990310 2990250
Attributes: (6)
ds.quality.l2a_quicklook.r60m.tci
<xarray.DataArray 'tci' (band: 3, y: 1830, x: 1830)> Size: 10MB
dask.array<open_dataset-tci, shape=(3, 1830, 1830), dtype=uint8, chunksize=(1, 305, 305), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 24B 1 2 3
  * x        (x) int64 15kB 300030 300090 300150 300210 ... 409650 409710 409770
  * y        (y) int64 15kB 3099990 3099930 3099870 ... 2990370 2990310 2990250
Attributes: (6)

Accessing metadata#

Metadata describes the data itself — offering context, provenance, and characteristics — in a way that can be read by both humans and machines.

ds["quality/l2a_quicklook/r60m/tci"].attrs
{'_eopf_attrs': {'coordinates': ['band', 'x', 'y'],
  'dimensions': ['band', 'y', 'x']},
 'proj:bbox': [300000.0, 2990220.0, 409800.0, 3100020.0],
 'proj:epsg': 32629,
 'proj:shape': [1830, 1830],
 'proj:transform': [60.0, 0.0, 300000.0, 0.0, -60.0, 3100020.0, 0.0, 0.0, 1.0],
 'proj:wkt2': 'PROJCS["WGS 84 / UTM zone 29N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-9],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32629"]]'}
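The proj:transform attribute above encodes the affine mapping from pixel indices to projected coordinates. A small sketch reproduces the x/y coordinate values seen in the earlier output (pixel centres, hence the half-pixel offset):

```python
# proj:transform is a row-major 3x3 affine matrix flattened to
# [a, b, c, d, e, f, 0, 0, 1], mapping (col, row) -> (x, y).
transform = [60.0, 0.0, 300000.0, 0.0, -60.0, 3100020.0, 0.0, 0.0, 1.0]
a, b, c, d, e, f = transform[:6]

def pixel_center(row, col):
    """Projected coordinates of a pixel centre (hence the +0.5 offsets)."""
    x = a * (col + 0.5) + b * (row + 0.5) + c
    y = d * (col + 0.5) + e * (row + 0.5) + f
    return x, y

print(pixel_center(0, 0))  # (300030.0, 3099990.0) — the first x/y coords above
```

Note that e is negative: row indices increase downward while northing decreases, which is why the y coordinate array above runs from 3099990 down to 2990250.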

Tip

Metadata everywhere! In the command above, we are showing the metadata (attributes) of one variable called “quality/l2a_quicklook/r60m/tci”. As an exercise, try the command:

ds.attrs
  • What do you observe?

Quick Visualisation#

As an example, we plot the quicklook layer to display the RGB image.

import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
ds.quality.l2a_quicklook.r60m.tci.plot.imshow(rgb="band")
<matplotlib.image.AxesImage at 0x167d99cd0>
_images/11bfb642be626202f1b740fc5f13d04001c61ed8130e08bf45c9a03f8ffd884b.png

Selection methods#

Underneath, DataArrays are NumPy-like array objects that implement the standard Python x[obj] syntax (x: array, obj: int or slice), so their data can be accessed with the same NumPy indexing approach.

ds["quality/l2a_quicklook/r60m/tci"][0,0,0]
<xarray.DataArray 'tci' ()> Size: 1B
dask.array<getitem, shape=(), dtype=uint8, chunksize=(), chunktype=numpy.ndarray>
Coordinates:
    band     int64 8B 1
    x        int64 8B 300030
    y        int64 8B 3099990
Attributes: (6)

or slicing

ds["quality/l2a_quicklook/r60m/tci"][0,10:110,35:150]
<xarray.DataArray 'tci' (y: 100, x: 115)> Size: 12kB
dask.array<getitem, shape=(100, 115), dtype=uint8, chunksize=(100, 115), chunktype=numpy.ndarray>
Coordinates:
    band     int64 8B 1
  * x        (x) int64 920B 302130 302190 302250 302310 ... 308850 308910 308970
  * y        (y) int64 800B 3099390 3099330 3099270 ... 3093570 3093510 3093450
Attributes: (6)

Since it is not easy to remember the order of dimensions, Xarray really helps by making it possible to select along dimensions by name:

  • .isel -> selection based on positional index

  • .sel -> selection based on coordinate values

We first check the number of elements in each coordinate of the data variable using the built-in sizes property. The same result can be achieved by querying each coordinate with the Python built-in function len.

print(ds["quality/l2a_quicklook/r60m/tci"].sizes)
Frozen({'band': 3, 'y': 1830, 'x': 1830})
ds["quality/l2a_quicklook/r60m/tci"].isel(band=0, y=100, x=10)
<xarray.DataArray 'tci' ()> Size: 1B
dask.array<getitem, shape=(), dtype=uint8, chunksize=(), chunktype=numpy.ndarray>
Coordinates:
    band     int64 8B 1
    x        int64 8B 300630
    y        int64 8B 3093990
Attributes: (6)

Or a slice

ds["quality/l2a_quicklook/r60m/tci"].isel(band=0, y=slice(10,100), x=slice(10,100))
<xarray.DataArray 'tci' (y: 90, x: 90)> Size: 8kB
dask.array<getitem, shape=(90, 90), dtype=uint8, chunksize=(90, 90), chunktype=numpy.ndarray>
Coordinates:
    band     int64 8B 1
  * x        (x) int64 720B 300630 300690 300750 300810 ... 305850 305910 305970
  * y        (y) int64 720B 3099390 3099330 3099270 ... 3094170 3094110 3094050
Attributes: (6)

The most common way to select an area (here 54x70 points) is through the labeled coordinate using the .sel method.

ds["quality/l2a_quicklook/r60m/tci"].sel(band=1, y=slice(3097890,3094710), x=slice(301830,305970))
<xarray.DataArray 'tci' (y: 54, x: 70)> Size: 4kB
dask.array<getitem, shape=(54, 70), dtype=uint8, chunksize=(54, 70), chunktype=numpy.ndarray>
Coordinates:
    band     int64 8B 1
  * x        (x) int64 560B 301830 301890 301950 302010 ... 305850 305910 305970
  * y        (y) int64 432B 3097890 3097830 3097770 ... 3094830 3094770 3094710
Attributes: (6)

Visualize on a map with a projection#

import rioxarray  # noqa: F401 — registers the .rio accessor on xarray objects

# Select the DataArray (full extent; uncomment the slices to crop)
da = ds["quality/l2a_quicklook/r60m/tci"].sel(
    band=1,
#    y=slice(3097890, 3094710),
#    x=slice(301830, 305970)
)

# Assign the CRS (UTM zone 29N, from the proj:epsg attribute above)
da = da.rio.write_crs("EPSG:32629")
# Reproject to EPSG:4326 (PlateCarree)
da_ll = da.rio.reproject("EPSG:4326")

Plot using matplotlib and cartopy#

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

fig = plt.figure(figsize=[12, 10])
ax = plt.subplot(1, 1, 1, projection=ccrs.Mercator())
ax.coastlines(resolution='10m')

da_ll.plot(
    ax=ax,
    transform=ccrs.PlateCarree(),
)

plt.title("Sentinel-2 TCI (UTM 29N → Lat/Lon)", fontsize=16)
plt.show()
_images/661f9f66fca0decf9a02949fa726f8df388ee4c9adf68885b6bea2013760169e.png

Warning

Limitation of Xarray 👉 You need to know where the file is!

But what if I don’t know the file?#

Earth Observation (EO) data is massive — petabytes across satellites, sensors, and time ranges.

Manually finding files? Difficult. Painful.

We need a way to search and organize EO datasets by space, time, and metadata. In short, we need a catalog.

In EO, the de-facto standard for dataset catalogs is STAC.

STAC stands for SpatioTemporal Asset Catalog. It’s an open standard, developed by the community, for describing geospatial data — especially large EO datasets like Sentinel, Landsat, or commercial imagery.

Key points:

  • Catalog of assets: A STAC is basically a JSON-based catalog that describes geospatial data (satellite images, derived products, etc.).

  • Spatiotemporal: Each dataset (called an Item) has metadata about where (bounding box, geometry) and when (timestamp) it was acquired.

  • Assets: The actual data files (e.g., GeoTIFFs, Cloud-optimized GeoTIFFs, NetCDF, Zarr) are linked as “assets” in the STAC Item.

  • Collections: Items are grouped into collections (e.g., Sentinel-2 L2A).

  • Catalogs: Collections can be organized hierarchically, enabling scalable discovery across petabytes of EO data.
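To make the structure concrete, here is a minimal, entirely hypothetical STAC Item as a Python dict — the id, href, and property values are invented for illustration:

```python
import json

# A hypothetical STAC Item: where (geometry/bbox) + when (datetime) + assets.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20250518",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [11.1, 46.1], [11.5, 46.1], [11.5, 46.5],
            [11.1, 46.5], [11.1, 46.1],
        ]],
    },
    "bbox": [11.1, 46.1, 11.5, 46.5],  # West, South, East, North
    "properties": {
        "datetime": "2025-05-18T11:21:19Z",
        "eo:cloud_cover": 12.3,
    },
    "assets": {
        # The actual data file is only *linked*, not embedded.
        "product": {
            "href": "https://example.com/scene.zarr",
            "type": "application/vnd+zarr",
        },
    },
    "links": [],
}
print(json.dumps(item["properties"], indent=2))
```

Because every Item is plain JSON with a consistent shape, clients can filter millions of them by bbox, datetime, or properties without touching the data files themselves.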

_images/STAC_Diagram_Overview.png

Let’s access Sentinel-2 satellite imagery using a STAC (SpatioTemporal Asset Catalog) interface. Leveraging the pystac-client, xarray, and matplotlib libraries, this notebook demonstrates how to:

  • Connect to a public EOPF STAC catalog hosted at https://stac.core.eopf.eodc.eu.

  • Perform structured searches across Sentinel-2 L2A collections, filtering by spatial extent, date range, cloud cover, and orbit characteristics.

  • Retrieve and load data assets directly into xarray for interactive analysis.

  • Visualize Sentinel-2 RGB composites along with pixel-level cloud coverage masks.

import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import numpy as np
import pystac_client
import xarray as xr
from pystac_client import CollectionSearch
from matplotlib.gridspec import GridSpec
# Initialize the collection search
search = CollectionSearch(
    url="https://stac.core.eopf.eodc.eu/collections",  # STAC /collections endpoint
)

# Retrieve all matching collections (as dictionaries)
for collection_dict in search.collections_as_dicts():
    print(collection_dict["id"])
sentinel-2-l2a
sentinel-3-olci-l2-lfr
sentinel-1-l2-ocn
sentinel-3-slstr-l2-lst
sentinel-1-l1-grd
sentinel-2-l1c
sentinel-1-l1-slc
sentinel-3-slstr-l1-rbt
sentinel-3-olci-l1-efr
sentinel-3-olci-l1-err
sentinel-3-olci-l2-lrr

Quicklook Visualization for Sentinel-2#

We can use the RGB quicklook layer contained in the Sentinel-2 EOPF Zarr product for a visualization of the content:

item = items[0]  # the first item returned by an earlier catalog.search() on the sentinel-2-l2a collection

ds = xr.open_dataset(
    item.assets["product"].href,
    **item.assets["product"].extra_fields["xarray:open_datatree_kwargs"],
)  # The engine="eopf-zarr" is already embedded in the STAC metadata
ds
<xarray.Dataset> Size: 9GB
Dimensions:                                            (
                                                        conditions_geometry_angle: 2,
                                                        conditions_geometry_band: 13,
                                                        conditions_geometry_y: 23,
                                                        conditions_geometry_x: 23,
                                                        conditions_geometry_detector: 4,
                                                        ...
                                                        quality_mask_r20m_y: 5490,
                                                        quality_mask_r20m_x: 5490,
                                                        quality_mask_r60m_y: 1830,
                                                        quality_mask_r60m_x: 1830,
                                                        quality_probability_y: 5490,
                                                        quality_probability_x: 5490)
Coordinates: (12/63)
  * conditions_geometry_angle                          (conditions_geometry_angle) <U7 56B ...
  * conditions_geometry_band                           (conditions_geometry_band) <U3 156B ...
  * conditions_geometry_detector                       (conditions_geometry_detector) int64 32B ...
  * conditions_geometry_x                              (conditions_geometry_x) int64 184B ...
  * conditions_geometry_y                              (conditions_geometry_y) int64 184B ...
  * conditions_mask_detector_footprint_r10m_x          (conditions_mask_detector_footprint_r10m_x) int64 88kB ...
    ...                                                 ...
  * quality_mask_r20m_y                                (quality_mask_r20m_y) int64 44kB ...
  * quality_mask_r60m_x                                (quality_mask_r60m_x) int64 15kB ...
  * quality_mask_r60m_y                                (quality_mask_r60m_y) int64 15kB ...
    quality_probability_band                           int64 8B ...
  * quality_probability_x                              (quality_probability_x) int64 44kB ...
  * quality_probability_y                              (quality_probability_y) int64 44kB ...
Data variables: (12/86)
    conditions_geometry_mean_sun_angles                (conditions_geometry_angle) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    conditions_geometry_mean_viewing_incidence_angles  (conditions_geometry_band, conditions_geometry_angle) float64 208B dask.array<chunksize=(13, 2), meta=np.ndarray>
    conditions_geometry_sun_angles                     (conditions_geometry_angle, conditions_geometry_y, conditions_geometry_x) float64 8kB dask.array<chunksize=(2, 23, 23), meta=np.ndarray>
    conditions_geometry_viewing_incidence_angles       (conditions_geometry_band, conditions_geometry_detector, conditions_geometry_angle, conditions_geometry_y, conditions_geometry_x) float64 440kB dask.array<chunksize=(7, 4, 2, 23, 23), meta=np.ndarray>
    conditions_mask_detector_footprint_r10m_b02        (conditions_mask_detector_footprint_r10m_y, conditions_mask_detector_footprint_r10m_x) uint8 121MB dask.array<chunksize=(1830, 1830), meta=np.ndarray>
    conditions_mask_detector_footprint_r10m_b03        (conditions_mask_detector_footprint_r10m_y, conditions_mask_detector_footprint_r10m_x) uint8 121MB dask.array<chunksize=(1830, 1830), meta=np.ndarray>
    ...                                                 ...
    quality_mask_r20m_b8a                              (quality_mask_r20m_y, quality_mask_r20m_x) uint8 30MB dask.array<chunksize=(915, 915), meta=np.ndarray>
    quality_mask_r60m_b01                              (quality_mask_r60m_y, quality_mask_r60m_x) uint8 3MB dask.array<chunksize=(305, 305), meta=np.ndarray>
    quality_mask_r60m_b09                              (quality_mask_r60m_y, quality_mask_r60m_x) uint8 3MB dask.array<chunksize=(305, 305), meta=np.ndarray>
    quality_mask_r60m_b10                              (quality_mask_r60m_y, quality_mask_r60m_x) uint8 3MB dask.array<chunksize=(305, 305), meta=np.ndarray>
    quality_probability_cld                            (quality_probability_y, quality_probability_x) uint8 30MB dask.array<chunksize=(915, 915), meta=np.ndarray>
    quality_probability_snw                            (quality_probability_y, quality_probability_x) uint8 30MB dask.array<chunksize=(915, 915), meta=np.ndarray>
Attributes: (2)
ds.quality_l2a_quicklook_r60m_tci.plot.imshow(rgb="quality_l2a_quicklook_r60m_band")
<matplotlib.image.AxesImage at 0x330582bd0>
_images/000fdc8a57f6d9b7c5baabf5dc65c67dec4093d37d64de1f43207ee465468408.png

There isn’t just one STAC catalog#

There are many STAC catalogs, each hosting different collections of datasets. A collection groups related assets, such as all Sentinel-2 images for a region, or all Landsat scenes for a certain period. By browsing catalogs and their collections, users can quickly find the data they need without hunting through petabytes of raw files.

The STAC catalog at https://earth-search.aws.element84.com/v1 is provided by Element 84 and offers a centralized search interface for various geospatial datasets hosted on AWS. This catalog is part of the Earth Search project, which is designed to facilitate access to open geospatial data through a standardized SpatioTemporal Asset Catalog (STAC) API.

import stackstac
#                West, South, East, North
spatial_extent = [11.1, 46.1, 11.5, 46.5]
temporal_extent = ["2015-01-01","2022-01-01"]

Running this cell may take up to 2 minutes

URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)

# Search Sentinel-2 L2A items
items = catalog.search(
    bbox=spatial_extent,
    datetime=temporal_extent,
    collections=["sentinel-2-l2a"]
).item_collection()

Calling the stackstac.stack() function on the items, the data will be lazily loaded and an xarray.DataArray object returned.

Running the next cell will show the selected data content with the dimension names and their extent.

# Stack into a datacube
datacube = stackstac.stack(
    items,
    bounds_latlon=spatial_extent,
    epsg=32632,        # UTM zone for tile 32N
    resolution=10,     # 10m pixels
    chunksize=4096     # optional, improves performance
)

datacube
<xarray.DataArray 'stackstac-6820bbacf43f782d94d9a1563f88a8ac' (time: 966,
                                                                band: 32,
                                                                y: 4535, x: 3210)> Size: 4TB
dask.array<fetch_raster_window, shape=(966, 32, 4535, 3210), dtype=float64, chunksize=(1, 1, 4096, 3210), chunktype=numpy.ndarray>
Coordinates: (12/52)
  * time                                     (time) datetime64[ns] 8kB 2016-1...
    id                                       (time) <U24 93kB 'S2A_32TPS_2016...
  * band                                     (band) <U12 2kB 'aot' ... 'wvp-jp2'
  * x                                        (x) float64 26kB 6.611e+05 ... 6...
  * y                                        (y) float64 36kB 5.153e+06 ... 5...
    s2:granule_id                            (time) <U62 240kB 'S2A_OPER_MSI_...
    ...                                       ...
    raster:bands                             (band) object 256B [{'nodata': 0...
    gsd                                      (band) object 256B None 10 ... None
    common_name                              (band) object 256B None ... None
    center_wavelength                        (band) object 256B None ... None
    full_width_half_max                      (band) object 256B None ... None
    epsg                                     int64 8B 32632
Attributes: (4)

From the output of the previous cell you can notice something really interesting: the size of the selected data is more than 3 TB!

Yet the cell finished in seconds — far too quickly to have actually downloaded that amount of data.

This is what lazy loading enables: retrieving all the information about the data quickly, without having to access and download all the underlying files.
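As a sanity check, the reported size follows directly from the datacube’s shape and dtype:

```python
# Shape from the output above: (time, band, y, x) = (966, 32, 4535, 3210),
# with dtype float64, i.e. 8 bytes per value.
n_values = 966 * 32 * 4535 * 3210
n_bytes = n_values * 8
print(f"{n_bytes / 1e12:.1f} TB")  # 3.6 TB — yet the cell returned in seconds
```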

Quiz hint: look carefully at the dimensions of the loaded datacube!

Data Formats and Performance#

Chunking Explained#

When working with large Earth Observation datasets, it’s usually not possible to load all data into a computer’s memory at once. Instead, data is divided into smaller, manageable chunks. Each chunk is the smallest unit that can be processed independently. This chunking strategy enables efficient, piecewise data access and parallel processing, improving performance and scalability.

The figure below visually explains the concept of data chunking: on the left, a three-dimensional data cube (x, y, and time) is shown without chunks, while on the right, the same data cube is displayed with chunks highlighted.

Figure: on the left, a 3D data cube (x, y, time) without any chunking strategy applied. On the right, the same data cube with box chunking.

There are several ways to chunk data depending on the analysis needs:

  • Spatial chunking divides data by geographic regions (e.g., longitude and latitude), ideal for spatial queries.

  • Time-based chunking divides datasets along temporal dimensions, useful for time series analysis.

  • Box chunking partitions data into fixed-size, 3D blocks combining space and time.

Choosing an appropriate chunking strategy is crucial, as it impacts read performance, memory usage, and cost-effectiveness in cloud environments.
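The core idea — process one chunk at a time, never the whole array — can be sketched in plain NumPy; dask automates exactly this pattern and adds parallelism on top. The array and chunk sizes below are arbitrary:

```python
import numpy as np

# A small 2-D "raster" standing in for a dataset too large for memory.
data = np.arange(24 * 24, dtype=float).reshape(24, 24)
chunk = 8  # process in 8x8 spatial chunks

# Compute a global sum block-by-block: only one chunk's worth of data
# is "active" at any moment.
total = 0.0
for i in range(0, data.shape[0], chunk):
    for j in range(0, data.shape[1], chunk):
        block = data[i:i + chunk, j:j + chunk]  # one chunk at a time
        total += block.sum()

assert total == data.sum()  # identical result, chunk-sized memory footprint
print(total / data.size)    # the global mean, computed piecewise
```

With dask-backed xarray (as in the cells above), the same loop is built automatically as a task graph, and each chunk can be fetched and reduced by a different worker.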

Q&A: What Did You Discover?#

3 minutes

Key Concepts Participants Should Understand:

  1. Memory Efficiency: Chunking enables processing datasets larger than RAM

  2. Lazy Evaluation: Build complex operations without immediate computation

  3. Cloud-Native: Stream only needed data chunks from remote storage

  4. Scalability: Same patterns work from laptop to supercomputer

Questions for Understanding:

  • “How does chunking solve the ‘big data’ problem?”

  • “Why is lazy evaluation important for satellite data analysis?”