Our vision for the Pangeo project is an ecosystem of mutually compatible geoscience Python packages that follow open-source best practices. These practices are well established across the scientific Python community.
The open-source guide provides some great advice on building and maintaining open-source projects.
To address the needs of geoscience researchers, we have developed some additional recommendations.
The Pangeo project strongly encourages the use of Xarray data structures wherever possible. Xarray Dataset and DataArray objects contain multidimensional numeric array data together with the metadata describing the data's coordinates, labels, units, and other relevant attributes. Xarray makes it easy to keep this important metadata together with the raw data; applications can then take advantage of the metadata to perform calculations or create visualizations in a coordinate-aware fashion. The use of Xarray eliminates many common bugs, reduces the need to write boilerplate code, makes code easier to understand, and generally makes users and developers happier and more productive in their day-to-day scientific computing.
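As a minimal sketch of what this looks like in practice (the variable name, grid, and attribute values below are illustrative, not taken from any particular dataset), a DataArray bundles the raw array with its coordinates and attributes, so selections and reductions can be expressed by label rather than by integer index:

```python
import numpy as np
import xarray as xr

# Illustrative sea-surface-temperature field on a small lat/lon grid
sst = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0, 30.0], "lon": [100.0, 110.0, 120.0, 130.0]},
    attrs={"units": "degC", "long_name": "sea surface temperature"},
)

# Coordinate-aware selection: by label, not by position
profile = sst.sel(lat=20.0)

# Reductions are named by dimension, not by axis number
zonal_mean = sst.mean(dim="lon")
```

Because the coordinates and attributes travel with the array, downstream code never has to guess which axis is latitude or what the units are.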
Xarray's data model is explicitly based on the CF Conventions, a well-established community standard that encompasses many common scenarios encountered in Earth system science. However, Xarray is flexible and does not require compliance with the CF Conventions. We encourage Pangeo packages to follow the CF Conventions wherever it makes sense to do so.
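Concretely (the attribute values here are illustrative), following the CF Conventions mostly amounts to attaching standardized attributes such as `standard_name` and `units` to variables and their coordinates:

```python
import numpy as np
import xarray as xr

# An air-temperature field with CF-style metadata on the variable
# and on both coordinates
temp = xr.DataArray(
    np.zeros((2, 3)),
    dims=("lat", "lon"),
    coords={
        "lat": ("lat", [0.0, 10.0],
                {"standard_name": "latitude", "units": "degrees_north"}),
        "lon": ("lon", [0.0, 10.0, 20.0],
                {"standard_name": "longitude", "units": "degrees_east"}),
    },
    attrs={"standard_name": "air_temperature", "units": "K"},
)
```

Tools that understand the CF Conventions can then interpret such an object (for plotting, regridding, unit handling, and so on) without any package-specific configuration.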
Most geoscientists have encountered the CF data model via the ubiquitous netCDF file format. While Xarray can easily read and write netCDF files, it doesn't have to. This is a key difference between software built on Xarray and the many other tools designed to process netCDF data (e.g. NCO, CDO): Xarray data can be passed directly between Python libraries (or over a network) without ever touching a disk drive. This "in-memory" capability is a key ingredient in the big-data scalability of Pangeo packages, since reading and writing files is very frequently the bottleneck in data processing pipelines.
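A rough sketch of that in-memory flow (the function and variable names below are made up for illustration): a Dataset built by one piece of code is handed straight to another, and serialization to netCDF happens only at the edges of the pipeline, if at all:

```python
import numpy as np
import xarray as xr

def remove_time_mean(ds: xr.Dataset) -> xr.Dataset:
    # Operates entirely in memory; no files are read or written
    return ds - ds.mean(dim="time")

# A Dataset might arrive from another library, a web service, or a file
ds = xr.Dataset(
    {"t2m": (("time", "lat"), np.arange(6.0).reshape(3, 2))},
    coords={"time": [0, 1, 2], "lat": [0.0, 45.0]},
)

anom = remove_time_mean(ds)

# Only when a file is actually needed would we serialize, e.g.:
# anom.to_netcdf("anomalies.nc")
```

The intermediate result `anom` never touches disk, which is exactly the property that lets chained Pangeo tools avoid the file I/O bottleneck.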
Another important aspect of scalability is the use of Dask for parallel and out-of-core computations. The raw data underlying Xarray objects can be either standard in-memory numpy arrays or Dask arrays. Dask arrays behave nearly identically to numpy arrays (they support the same API), but instead of storing raw data directly, they store a symbolic computational graph of operations (e.g. reading data from disk or network, performing transformations or mathematical calculations, etc.) that must be executed in order to obtain the data. No operations are executed until numerical values are actually required, for example to make a figure. (This is called lazy execution.) Dask figures out how to execute these computational graphs efficiently on different computer architectures using sophisticated techniques. By chaining operations on Dask arrays together, researchers can symbolically represent large and complex data analysis pipelines and then deploy them effectively on large computer clusters.
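A small sketch of lazy execution with Dask arrays (the array shape and chunk sizes are chosen arbitrarily): building the expression only records a task graph, and no arithmetic runs until a concrete value is requested with `.compute()`:

```python
import dask.array as da

# Building the graph is cheap; no chunk of ones is materialized yet
x = da.ones((4000, 4000), chunks=(1000, 1000))

# Chained operations extend the symbolic graph rather than computing
y = ((x + 1) ** 2).mean()

# Only now does Dask execute the graph, chunk by chunk and
# potentially in parallel, to produce an actual number
result = y.compute()  # → 4.0
```

The same pattern applies when the Dask array backs an Xarray object (e.g. one opened with `chunks=` in `xr.open_dataset`): the whole analysis is expressed symbolically first, then executed on whatever hardware is available, from a laptop to a cluster.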