Package Guidelines


Guidelines for Pangeo packages to follow

Our vision for the Pangeo project is an ecosystem of mutually compatible geoscience Python packages that follow open-source best practices. These practices are well established across the scientific Python community.

General Best Practices for Open Source

  1. Use an open-source license. See Jake VanderPlas’ article on licensing scientific code or these more general guidelines
  2. Use version control for source code (for example, on GitHub)
  3. Provide thorough test coverage and continuous integration of testing
  4. Maintain comprehensive documentation
  5. Establish a code of conduct for contributors

The open-source guide provides some great advice on building and maintaining open-source projects.

Best Practices for Pangeo Projects

To address the needs of geoscience researchers, we have developed some additional recommendations.

  1. Solve a general problem: packages should solve a general problem that is encountered by a relatively broad group of researchers.
  2. Clearly defined scope: packages should have a clear and relatively narrow scope, solving the specific problem(s) identified in the point above (rather than attempting to cover every possible aspect of geoscience research computing).
  3. No duplication: developers should try to leverage existing packages as much as possible to avoid duplication of effort. (In early-stage development and experimentation, however, some duplication will be inevitable as developers try implementing different solutions to the same general problems.)
  4. Consume and Produce Xarray Objects: Xarray data structures facilitate interoperability between packages. (For more about Xarray, see below.)
  5. Operate Lazily: wherever possible, packages should avoid explicitly triggering computation on Dask objects, as in the sketch after this list. (For more about Dask, see below.)
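
As a minimal sketch of points 4 and 5, the hypothetical function below consumes Xarray DataArrays and returns a new DataArray, using only element-wise xarray arithmetic so that Dask-backed inputs remain lazy. The function name, arguments, and constants are illustrative only and do not come from any particular Pangeo package:

```python
import xarray as xr


def potential_temperature(temperature: xr.DataArray,
                          pressure: xr.DataArray,
                          p0: float = 1000.0,
                          kappa: float = 0.2854) -> xr.DataArray:
    """Hypothetical example of a Pangeo-style package function.

    Consumes xarray.DataArray inputs and produces an xarray.DataArray,
    preserving dimension names and coordinates. Because it only uses
    element-wise arithmetic, it works identically on numpy-backed and
    Dask-backed data, and it never calls .compute(), .load(), or .values,
    so lazy inputs stay lazy.
    """
    theta = temperature * (p0 / pressure) ** kappa
    theta.name = "potential_temperature"
    theta.attrs["units"] = temperature.attrs.get("units", "K")
    return theta
```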

Why Xarray and Dask?

The Pangeo project strongly encourages the use of Xarray data structures wherever possible. Xarray Datasets and DataArrays contain multidimensional numeric array data together with the metadata describing the data’s coordinates, labels, units, and other relevant attributes. Xarray makes it easy to keep this important metadata together with the raw data; applications can then take advantage of the metadata to perform calculations or create visualizations in a coordinate-aware fashion. The use of Xarray eliminates many common bugs, reduces the need to write boilerplate code, makes code easier to understand, and generally makes users and developers happier and more productive in their day-to-day scientific computing.
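
For example, a DataArray built from synthetic data (the variable and coordinate names below are illustrative) carries its coordinates and attributes with it, so selections and reductions can be written in terms of labels rather than positional indices:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly sea surface temperature field with labeled dimensions,
# coordinates, and metadata attributes attached to the raw array.
sst = xr.DataArray(
    np.random.rand(12, 180, 360),
    dims=["time", "lat", "lon"],
    coords={
        "time": pd.date_range("2020-01-01", periods=12, freq="MS"),
        "lat": np.arange(-89.5, 90.0, 1.0),
        "lon": np.arange(0.5, 360.0, 1.0),
    },
    attrs={"units": "degC", "long_name": "sea surface temperature"},
)

# Coordinate-aware operations: select by label and reduce over a named dimension.
tropics = sst.sel(lat=slice(-23.5, 23.5))   # label-based selection, not integer indexing
annual_mean = sst.mean(dim="time")          # reduction by dimension name
```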

Xarray’s data model is explicitly based on the CF Conventions, a well-established community standard which encompasses many common scenarios encountered in Earth System science. However, Xarray is flexible and does not require compliance with CF conventions. We encourage Pangeo packages to follow CF conventions wherever it makes sense to do so.
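
In practice, following CF conventions often amounts to attaching standard metadata attributes such as standard_name and units to data variables and coordinates. A small, synthetic illustration (the variable name tas and its values are placeholders; the attribute names and values follow the CF conventions):

```python
import numpy as np
import xarray as xr

# A synthetic Dataset whose metadata follows CF conventions:
# CF standard names and units on the data variable and coordinates.
ds = xr.Dataset(
    {
        "tas": (
            ("lat", "lon"),
            np.zeros((2, 2)),
            {"standard_name": "air_temperature", "units": "K"},
        ),
    },
    coords={
        "lat": ("lat", [-45.0, 45.0],
                {"standard_name": "latitude", "units": "degrees_north"}),
        "lon": ("lon", [90.0, 270.0],
                {"standard_name": "longitude", "units": "degrees_east"}),
    },
)
```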

Most geoscientists have encountered the CF data model via the ubiquitous netCDF file format. While Xarray can easily read and write netCDF files, it doesn’t have to. This is a key difference between software built on Xarray and the many other tools designed to process netCDF data (e.g. NCO, CDO): Xarray data can be passed directly between Python libraries (or over a network) without ever touching a disk drive. This “in-memory” capability is a key ingredient of the big-data scalability of Pangeo packages, because reading and writing files is very frequently the bottleneck in data processing pipelines.
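
A minimal sketch of this in-memory workflow (the file names are placeholders, and the dataset is assumed to have a time coordinate): intermediate results stay in memory, or remain lazy, as Xarray objects and can be handed directly to other Python libraries, with disk touched only at the very end, if at all:

```python
import xarray as xr

# Read a netCDF file; the path is a placeholder for any local or remote source.
ds = xr.open_dataset("example.nc")

# Every intermediate result below is an xarray object living in memory.
# It can be passed directly to other Python libraries (plotting, regridding,
# statistics, ...) without ever being written to an intermediate file.
monthly = ds.resample(time="MS").mean()
anomalies = monthly - monthly.mean(dim="time")

# Only the final product needs to touch the disk, if at all.
anomalies.to_netcdf("anomalies.nc")
```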

Another important aspect of scalability is the use of Dask for parallel and out-of-core computations. The raw data underlying Xarray objects can be either standard in-memory numpy arrays or Dask arrays. Dask arrays behave nearly identically to numpy arrays (they support the same API), but instead of storing raw data directly, they store a symbolic computational graph of operations (e.g. reading data from disk or network, performing transformations or mathematical calculations, etc.) that must be executed in order to obtain the data. No operations are executed until actual numerical values are required, such as for making a figure. (This is called lazy execution.) Dask figures out how to execute these computational graphs efficiently on different computer architectures using sophisticated techniques. By chaining operations on Dask arrays together, researchers can symbolically represent large and complex data analysis pipelines and then deploy them effectively on large computer clusters.
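
A minimal sketch of lazy execution on Dask-backed Xarray data (the file path, chunk sizes, and the variable names tas, lat, lon, and time are assumptions about the dataset, not part of any particular API):

```python
import numpy as np
import xarray as xr

# Passing `chunks=` wraps each variable in a Dask array; only metadata is read
# at this point, not the data itself (the path is a placeholder).
ds = xr.open_dataset("big_model_output.nc", chunks={"time": 120})

# Each step below only adds nodes to the Dask task graph; nothing is computed yet.
weights = np.cos(np.deg2rad(ds["lat"]))
weighted_mean = (ds["tas"] * weights).mean(dim=["lat", "lon"])
climatology = weighted_mean.groupby("time.month").mean("time")
anomaly = weighted_mean.groupby("time.month") - climatology

# Computation is triggered only when concrete values are required,
# e.g. via .compute(), .load(), .plot(), or writing to disk.
result = anomaly.compute()
```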