YAXArrays and mapCube in Julia#
In this lesson, we cover the basics of YAXArray data structures and mapCube in Julia.
Accessing remote data stored on the cloud with YAXArrays#
Be aware that opening a dataset with YAXArrays does not load the entire dataset into memory;
only the metadata is read and made available;
as with Xarray in Python, YAXArrays in Julia loads data only when it is needed (e.g. when accessed).
using YAXArrays
using Zarr
era5url = "https://s3.bgc-jena.mpg.de:9000/deepextremes/v3/ERA5Cube.zarr"
ds = open_dataset(era5url)
Select a subset of the global remote dataset#
using Dates

sub = ds[Ti=DateTime(1998,1,1)..DateTime(2022,12,31), longitude=0..14.76, latitude=30.1..60]
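Both the opening and the subsetting above are lazy: no values have been transferred from the remote store yet. As a minimal sketch (the variable name t2m and readcubedata are also used further below in this lesson), data is only downloaded once we explicitly read it:
c_sub = Cube(sub)                        # combine the subset's variables into one lazy cube
t2m_slice = c_sub[Variable=At("t2m"), Ti=Near(DateTime(2000, 1, 1))]  # still lazy: just a view
t2m_mem = readcubedata(t2m_slice)        # only now are the values actually transferred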
Create local Zarr for tutorial#
savedataset(sub,path="./era5.zarr", overwrite=true)
Let’s dig into YAXArrays.jl and mapCube#
So far we have only used mapslices
in this tutorial. However, this only covers very simple cases with a single input cube and computations over one or more dimensions which either collapse or return the same dimension.
using DimensionalData, YAXArrays, Zarr, NetCDF
using WGLMakie
path = "./era5.zarr"
c = Cube(open_dataset(zopen(path,consolidated=true,fill_as_missing=false)))
c[latitude=Near(11)]
The mapCube function#
mapCube is a generalization of mapslices, where you can annotate the exact signature of the function to be applied. For example, the computation of the median over time can be written using mapCube.
Here one has to specify the dimension(s) that the user-defined function is going to operate on. For the computation of the median over time, the only input dimension is time and there are no output dimensions, as only a single value is returned. The user-defined function passed to mapCube always has the signature f(outputs..., inputs...), potentially followed by additional arguments and keyword arguments.
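As a minimal sketch of this call pattern (the function my_range and the offset argument are purely illustrative and not part of the lesson's data):
# inner function: output arrays come first, then input arrays, then any additional arguments
function my_range(xout, xin, offset)
    x = collect(skipmissing(xin))
    xout[] = isempty(x) ? missing : (maximum(x) - minimum(x) + offset)
end

# the extra positional argument (here an offset of 0.0) is passed to mapCube after the input cube
ranges = mapCube(my_range, c[Variable=At("t2m")], 0.0; indims=InDims("Time"), outdims=OutDims())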
Apply function along single axis#
using Statistics

indims = InDims("latitude")
outdims = OutDims()

function apply_median(xout, xin)
    # drop missing values and NaNs before computing the median
    x = filter(!ismissing, xin)
    x = filter(!isnan, x)
    xout[] = isempty(x) ? missing : median(x)
end

medians = mapCube(apply_median, c[Variable=Where(contains("t2m"))]; indims, outdims)
fig, ax, heat = heatmap(DimArray(medians[Variable=At("t2m")]))
Apply function on all elements#
medians_kelvin = map(x-> x + 273.15, medians)
The function is applied lazily and only computed when the data is actually used, for example in a mapCube operation, when saving the data to disk, or when plotting it.
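For example, reading the result into memory with readcubedata (used again further below) triggers the computation:
# materializing the lazy result forces the element-wise computation
medians_kelvin_mem = readcubedata(medians_kelvin)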
heatmap(DimArray(medians_kelvin[Variable=At("t2m")]))
Arguments for inner function and output dimensions#
Let’s make a slightly more complex computation to demonstrate a case where multiple outputs are generated. For example, imagine we want to normalize every time series (to zero mean and unit variance), but at the same time return the means and standard deviations as additional outputs for later re-use:
Apply function with multiple output cubes#
function norm(ts_out, mean_out, std_out, ts_in)
    # statistics are computed on the valid (non-missing, non-NaN) values only
    x = filter(!ismissing, ts_in)
    tsshort = filter(!isnan, x)
    if isempty(tsshort)
        ts_out .= missing
        mean_out[] = missing
        std_out[] = missing
    else
        mean_out[] = mean(tsshort)
        std_out[] = std(tsshort)
        ts_out .= (ts_in .- mean_out[]) ./ std_out[]
    end
end
using NetCDF

indims = InDims("Time")
od_ts = OutDims("Time", path="./normalized_ts.zarr", backend=:zarr, overwrite=true)
od_m = OutDims(path="./means.nc", backend=:netcdf, overwrite=true)
od_s = OutDims(path="./stds.nc", backend=:netcdf, overwrite=true)
outdims = (od_ts, od_m, od_s)

tsnorm, means, stds = mapCube(norm, c[Variable=Where(contains("t2m"))], indims=indims, outdims=outdims);
heatmap(DimArray(stds[Variable=At("t2m")]))
Apply function on moving window#
function meanfilter(xout, xin)
    # xin is a 3×3 moving window; the center element is xin[2, 2]
    if ismissing(xin[2, 2])
        xout .= missing
    else
        xout .= mean(skipmissing(xin))
    end
end
heatmap(DimArray(means[Variable=At("t2m")]))
# a 3×3 moving window: one grid cell before and after along longitude and latitude
indims = InDims(MovingWindow("Longitude", 1, 1), MovingWindow("Latitude", 1, 1))
filteredmeans = mapCube(meanfilter, means, indims=indims, outdims=OutDims())
heatmap(DimArray(filteredmeans[Variable=At("t2m")]))
Define new output dimensions#
using Dates
pet = c[Variable=At("pet"), lon=Near(11.3464), lat=Near(46.4946)]
fig, ax, pl = lines(lookup(pet, Ti).data, pet.data)
fig
So far the functions applied here were very simple statistics. Just to stress again that we are running arbitrary Julia code here: we can, for example, use a package for time series decomposition like SignalDecomposition.jl:
using SignalDecomposition: TimeAnomaly, decompose as dc
dates = lookup(pet, Ti)
stlres = dc(dates,pet.data[:],TimeAnomaly())
fig,ax, p = plot(dates, stlres[1])
ax2 = Axis(fig[2,1])
plot!(ax2,stlres[2])
fig
In order to apply this over a full array we define the usual trio: indims, outdims and the function to be applied. Here we create a new dimension for the output. There are two types of axes in YAXArrays, CategoricalAxis for unordered and RangeAxis for ordered dimensions. Here we create a categorical axis for our outputs. This means that inside the function the input array xin is a vector of length n_timesteps and the output is a matrix of size n_timesteps x 3.
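For illustration, such a categorical output dimension is just a dimension object holding the category labels; together with the existing time axis it describes the n_timesteps x 3 output layout (the same constructs appear in the OutDims call below):
# a categorical output dimension with three unordered labels
scale_dim = Dim{:Scale}(["Seasonal", "anomalies", "trend"])
# combined with the time axis of the input cube this yields the n_timesteps x 3 output layout
out_dims_sketch = (dims(c, Ti), scale_dim)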
import Logging
Logging.disable_logging(Logging.Info)
indims = InDims("time")
outdims = OutDims(dims(c, Ti), Dim{:Scale}(["Seasonal", "anomalies", "trend"]),
    path="decomposed.zarr", backend=:zarr, overwrite=true)
function decompose_TS(xout, xin, dates)
    any(isnan, xin) && return xout .= missing
    seas, anomaly = dc(dates, xin, TimeAnomaly())
    xout[:, 1] = seas
    xout[:, 2] = anomaly
    # TimeAnomaly returns only the seasonal and anomaly components,
    # so the "trend" slice of the output stays unfilled here
end
using Logging
Logging.disable_logging(Warn)
@time dec = mapCube(decompose_TS,
    c[Variable=At("t2m")],
    lookup(c, Ti),
    indims = indims,
    outdims = outdims)
dec[Scale=At("trend")][1,1,1]
fig, axseas, heatyax = lines(lookup(dec, Ti).data,
    dec[lon=Near(11.3464), lat=Near(46.4946)].data[:, 1])
fig
Compute variance and plot a map of seasonal variance#
scalevar = mapslices(var, dec, dims="Time")
scalerange = mapslices(x -> maximum(x) - minimum(x), dec, dims="Time")
heatmap(scalerange[Scale=At("Seasonal")])
heatmap(scalevar[Scale=At("anomalies")])
Use Python or R in inner function#
using PythonCall
scipyndimage = pyimport("scipy.ndimage")
using Missings

function gaussian_smooth(xout, xin)
    # indices of the non-missing values
    missinds = .!ismissing.(xin)
    cleanin = disallowmissing(xin[missinds])
    # apply scipy's Gaussian filter to the clean values and copy the result back
    smooth = scipyndimage.gaussian_filter(cleanin, sigma=4)
    xout[missinds] .= pyconvert(Vector, smooth)
end
pet_bozen_2010 = c[lon=Near(11.3464), lat=Near(46.4946),
    time=DateTime(2010)..DateTime(2011),
    Variable=At("pet")]
# read the time series into memory and call the scipy function directly as a quick test
petmem = disallowmissing(readcubedata(pet_bozen_2010).data)
smooth = scipyndimage.gaussian_filter(Py(petmem), sigma=4)
pyconvert(Vector, smooth)
smoothcube = mapCube(gaussian_smooth, pet_bozen_2010, indims=InDims("time"), outdims=OutDims("time"))
fig, ax, l = lines(pet_bozen_2010, label="Original")
lines!(ax,smoothcube, label="Smooth")
axislegend(ax)
fig
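The heading of this section also mentions R; the same pattern works through RCall.jl. The following is only a hedged sketch (RCall and a working R installation are assumed, and z-scoring the series with R's scale() is an arbitrary illustrative choice, not part of the original lesson):
using RCall

function r_zscore(xout, xin)
    # indices of the non-missing values, mirroring the Python example above
    inds = .!ismissing.(xin)
    x = disallowmissing(xin[inds])
    # interpolate the Julia vector into an R expression and copy the numeric result back
    xout[inds] .= rcopy(Vector{Float64}, R"as.numeric(scale($x))")
end

zcube = mapCube(r_zscore, pet_bozen_2010, indims=InDims("time"), outdims=OutDims("time"))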