YAXArrays and mapCube in Julia#

In this lesson, we discuss cover the basics of YAXArray data structures and mapCube in Julia.

Authors & Contributors#

Notebook#

  • Felix Cremer, Max Planck Institute for Biogeochemistry (Germany), @felixcremer

Contributors#

  • Anne Fouilloux, Simula (Norway), @annefou

This tutorial is based on the Jupyter notebook General introduction to YAXArrays.jl and mapCube from Geospatial data cubes in Julia workshop.

Overview

Questions
  • What is YXArrays.jl?
  • What is mapCube and what can I used it for?
  • How can I read and manipulate a Zarr dataset in Julia?
  • How can apply a function to my entire or a portion of my dataset?
  • How can I use Python or R in Julia?
Objectives
  • Understand YXArrays.jl and mapCube in Julia
  • Read and write Zarr datasets
  • Apply a simple or rather complex function to perform complex data analysis
using Pkg
Pkg.activate(".")
Pkg.instantiate()

Accessing remote data stored on the cloud with YXArrays#

  • Be aware that opening a dataset with YXArrays does not load the entire dataset in memory;

  • Only metadata set is read and available;

  • As for Xarray in Python, YXArrays in Julia will load data only when needed (e.g. accessed).

using YAXArrays
using Zarr
era5url = "https://s3.bgc-jena.mpg.de:9000/deepextremes/v3/ERA5Cube.zarr"
ds = open_dataset(era5url)

select a subset of the global remote dataset#

sub = ds[Ti=DateTime(1998,1,1)..DateTime(2022,12,31), longitude=0..14.76,latitude=30.1..60]

Create local Zarr for tutorial#

savedataset(sub,path="./era5.zarr", overwrite=true)

Let’s dig into YAXArrays.jl and mapCube#

So far we have only used mapslices in this tutorial. However, this can only cover very simple cases for a single input cube and computations on one or dimensions which either collapse or return the same dimension.

using DimensionalData, YAXArrays, Zarr, NetCDF
using WGLMakie
path = "./era5.zarr"
c = Cube(open_dataset(zopen(path,consolidated=true,fill_as_missing=false)))
c[latitude=Near(11)]

The mapCube function#

is a generalization of mapslices, where you can annotate the exact signature of the function to be applied. For example the computation of the median over time can be written using mapCube. Here one hase to specify the dimension(s) that the user-defined function is going to operate on. For the computation of the median over time the only input dimension is time and there are no output dimensions as only a single value is returned. The user defined function passed to mapCube always has the signature f(outputs..., inputs...) and potentially followd by additional arguments and keyword args.

Apply function along single Axis#

using Statistics 
indims = InDims("latitude")
outdims = OutDims()
function apply_median(xout, xin)
    x = filter(!ismissing, xin)

    x = filter(!isnan,x)
    #@show x

    #filter!(!ismissing, x)
    xout[] = isempty(x) ? missing : median(x)
end
medians = mapCube(apply_median, c[Variable=Where(contains("t2m"))];indims, outdims)
fig, ax, heat = heatmap(DimArray(medians[Variable=At("t2m")]))

Apply function on all elements#

medians_kelvin = map(x-> x + 273.15, medians)

This function is applied lazily and only computed when the data is worked with. This could be a mapCube operation, saving the data to disk or plotting the data.

heatmap(DimArray(medians_kelvin[Variable=At("t2m")]))

Arguments for inner function and output dimensions#

Let’s make a slightly more complex computation to demonstrate a case where multiple outputs are generated. For examples, imagine we want to normalize every time series (to zero mean and unit variance), but at the same time return the means and variances in a single dataset for later re-use:

Apply function with multiple output cubes#

function norm(ts_out, mean_out, std_out, ts_in)
    x = filter(!ismissing, ts_in)

    tsshort = filter(!isnan,x)
    if isempty(tsshort)
        ts_out .= missing
        mean_out[] = missing
        std_out[] = missing
    else
        mean_out[] = mean(tsshort)
        std_out[] = std(tsshort)
        ts_out .= (ts_in .- mean_out[])./std_out[]
    end
end
using NetCDF
indims = InDims("Time")
od_ts = OutDims("Time",path = "./normalized_ts.zarr",
                backend=:zarr,overwrite=true)
od_m = OutDims(path = "./means.nc",backend=:netcdf, overwrite=true)
od_s = OutDims(path = "./stds.nc",backend=:netcdf, overwrite=true)
outdims = (od_ts, od_m, od_s)
tsnorm, means, stds = mapCube(norm,c[Variable=Where(contains("t2m"))],indims=indims, outdims=outdims);
heatmap(DimArray(stds[Variable=At("t2m")]))

Apply function on moving window#

function meanfilter(xout, xin)
    if ismissing(xin[2,2])
        xout .= missing
    else
    xout .= mean(skipmissing(xin))
    end
end
heatmap(DimArray(means[Variable=At("t2m")]))
indims = InDims(MovingWindow("Longitude", 1,1),MovingWindow("Latitude", 1, 1))
filteredmeans = mapCube(meanfilter, means, indims=indims, outdims=OutDims())
heatmap(DimArray(filteredmeans[Variable=At("t2m")]))

Define new output dimensions#

using Dates
pet = c[Variable=At("pet"),
        lon=Near(11.3464),lat=Near(46.4946)]
fig,ax, pl = lines(lookup(pet, Ti).data,pet.data)
fig

So far the function applied here were very simple statistics. Just to stress again, that we are running arbitrary Julia code here, so for example if we want to use some package for time series decomposition like SignalDecomposition.jl:

using SignalDecomposition: TimeAnomaly, decompose as dc 
dates = lookup(pet, Ti)
stlres = dc(dates,pet.data[:],TimeAnomaly())
fig,ax, p = plot(dates, stlres[1])
ax2 = Axis(fig[2,1])
plot!(ax2,stlres[2])
fig

In order to apply this over a full array we define the usual Trio: indims, outdims and the function to be applied. Here we create a new dimension for the output. There are 2 types of axes in YAXArrays, CategoricalAxis for unordered and RangeAxis for ordered dimensions. Here we create a categorical axis for our outputs. This means that inside the function the input array xin is a vector with of length n_timesteps and the output is a matrix of size n_timesteps x 3

import Logging
Logging.disable_logging(Logging.Info)
indims = InDims("time")
outdims = OutDims(dims(c,Ti),Dim{:Scale}(["Seasonal", "anomalies", "trend"]), 
                    path = "decomposed.zarr",backend=:zarr, overwrite=true)
function decompose_TS(xout, xin, dates)
    any(isnan,xin) && return xout .= missing
    seas, anomaly = dc(dates,xin,TimeAnomaly())

    xout[:,1] = seas
    xout[:,2] = anomaly
end
using Logging
Logging.disable_logging(Warn)
@time dec = mapCube(decompose_TS, 
    c[Variable=At("t2m")],
    lookup(c, Ti),
        #lon=Near(11.3464),lat=Near(46.4946)],
    indims = indims,
    outdims = outdims)
dec[Scale=At("trend")][1,1,1]
fig, axseas, heatyax = lines( lookup(dec, Ti).data,
    dec[lon=Near(11.3464),lat=Near(46.4946)].data[:,1])
fig

Compute variance and plot a map of seasonal variance#

scalevar = mapslices(var,dec,dims="Time")
scalerange = mapslices(x->maximum(x) - minimum(x), dec, dims="Time")
heatmap(scalerange[scale=At("Seasonal")])
heatmap(scalevar[scale=At("anomalies")])

Use Python or R in inner function#

using PythonCall
scipyndimage = pyimport("scipy.ndimage")
using Missings
function gaussian_smooth(xout, xin)
    missinds = .!ismissing.(xin)
    cleanin = disallowmissing(xin[missinds])
    smooth = scipyndimage.gaussian_filter(cleanin, sigma=4)
    @show typeof(smooth)
    xout[missinds] .= pyconvert(Vector, smooth)
end
pet_bozen_2010 = c[lon=Near(11.3464),lat=Near(46.4946),
    time = DateTime(2010)..DateTime(2011),
    Variable=At("pet")]
using Missings
petmem = disallowmissing(readcubedata(pet_bozen_2010).data)
nonmissinds = .!ismissing.(petmem)
view(petmem, nonmissinds )
disallowmissing(collect(petmem))
smooth = scipyndimage.gaussian_filter(Py(petmem), sigma=4)
pyconvert(Vector, smooth)
smoothcube = mapCube(gaussian_smooth, pet_bozen_2010, indims=InDims("time"), outdims=OutDims("time"))
fig, ax, l = lines(pet_bozen_2010, label="Original")
lines!(ax,smoothcube, label="Smooth")
axislegend(ax)
fig