oocgcm: out-of-core analysis of general circulation models in Python.

This project provides tools for processing and analysing output of general circulation models and gridded satellite data in the field of Earth system science.

Our aim is to simplify the analysis of very large datasets of model output (~1-100 TB), such as those produced by basin-to-global-scale, sub-mesoscale-permitting ocean models and by ensemble simulations of eddying ocean models, by leveraging the potential of the xarray and dask Python packages.

The main ambition of this project is to provide simple tools for performing out-of-core computations with model output and gridded data, that is, for processing data that is too large to fit into memory at one time.

The project so far mostly targets the NEMO ocean model and gridded ocean satellite data (AVISO, SST, ocean color...), but our aim is to build a framework that can be used for a variety of models based on the Arakawa C-grid. The framework can in principle also be used for atmospheric general circulation models.

We are trying to develop a framework flexible enough not to impose a specific workflow too strictly on the end user.

oocgcm is a pure Python package and we try to keep the list of dependencies as small as possible in order to simplify the deployment on a number of platforms.

oocgcm is not intended to provide advanced visualization functionalities for gridded geographical data as several powerful tools already exist in the python ecosystem (see in particular cartopy and basemap).

We rather focus on building a framework that simplifies the design and production of advanced statistical and dynamical analyses of large datasets of model output and gridded data.

Note

oocgcm is at the pre-alpha stage. The API is therefore unstable and likely to change without notice.

Documentation

The need for out-of-core analysis tools

A practical need

oocgcm is a pure Python package built on top of xarray and dask.

Most of the tools that are implemented in oocgcm are already available in several python libraries dedicated to the analysis of gridded data, and based on numpy and one of the several netCDF interfaces for python.

Why have we chosen to implement a diagnostic package based on xarray and dask?

Numpy-based model diagnostic libraries are facing a challenge with the ongoing evolution of geoscientific models and earth observing networks. With the highest-end models being run on several tens of thousands of cores, even a two-dimensional slice of model output cannot be loaded in memory at one time. Model diagnostic tools and gridded data analysis tools should therefore be parallelized and run out-of-core.

One option is to run FORTRAN/MPI codes steered from shell scripts, but you then lose the flexibility and the multiple benefits of a python-based workflow. Another option is to use one of the several libraries available for parallel computing in python. This usually implies a complete refactoring of your analysis package, at the risk of eventually using different analysis tools depending on the size of your dataset...

A wider perspective

We believe that the above question is not a “technical” problem but a real challenge for our fields of research. We are here facing a “technical” translation of one of the “big data challenges” in earth system science.

Transforming large gridded datasets into scientific results indeed requires innovative descriptive approaches that merge statistical descriptions and physically-motivated analyses. This usually involves performing rather “complex” analysis tasks on the dataset.

In practice, this is usually a two-stage process. A first stage involves transforming an idea into a prototype code. Using an interactive scripting language usually accelerates this phase because you can glue different bits of code and see the results on the fly.

The second stage involves transforming the prototype code into production code. This is usually a difficult task, especially for students, because it requires a more in-depth understanding of the hardware infrastructure and of the software design.

So, depending on the language and on the libraries you use, what actually changes is not your ability to perform your analysis but rather the time it takes for you to prepare your production code and eventually reach the scientific result you are after.

More generally, there is an objective risk for our fields of research if we don’t embrace the question of the development of analysis tools that accelerate the idea-prototype-production-results cycle. If the time needed to transform an idea into an efficient production code is too long, we will keep performing only simple or routine analyses on our datasets, eventually missing the potential breakthrough of big-data in earth system sciences.

Why use xarray and dask?

oocgcm is a pure Python package built on top of xarray, which itself integrates dask.array to support streaming computation on large datasets that don’t fit into memory. Why have we chosen to use xarray and dask?

xarray

xarray implements N-dimensional variants of the core pandas data structures. In addition, xarray adopts the Common Data Model for self-describing data. In practice, xarray.Dataset is an in-memory representation of a netCDF file or of a collection of netCDF files.

Building upon xarray has several advantages :
  • metadata available in the netCDF files are associated with xarray objects in the form of a Python dictionary x.attrs. This simplifies the exploration of the dataset, yields more robust code, and simplifies the export of the results to netCDF files.
  • because dimensions are associated with the variables in xarray objects, xarray allows flexible split-apply-combine operations with groupby, e.g. x.groupby('time.dayofyear').mean() (illustrated in the short sketch after this list).
  • xarray objects do not load data in memory by default. Loading the data only happens at execution time, if needed. This means that users have access to their whole dataset without having to worry about loading the data, which simplifies the prototyping of a new analysis.
  • xarray is natively integrated with pandas, meaning that xarray objects can straightforwardly be exported to pandas DataFrames. This gives easy access to a range of time-series analysis tools.
  • xarray objects can be exported to iris or cdms so that users can merge several different analysis tools in their workflow.
  • Little work is needed to apply a numpy function to xarray objects. Several numpy ufuncs are already applicable to the xarray.DataArray data structure.
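
A minimal sketch illustrating these points (the file pattern, variable name and dimension names are hypothetical; any netCDF dataset with a time coordinate would do):

import xarray as xr

# lazily open a (hypothetical) collection of netCDF files: no data is loaded yet
ds = xr.open_mfdataset('model_output_*.nc')

# metadata from the netCDF files are exposed as a Python dictionary
print(ds.attrs)

# flexible split-apply-combine on the time coordinate; still lazy
daily_clim = ds['sst'].groupby('time.dayofyear').mean()

# extract a time series at one grid point and export it to pandas
series = ds['sst'].isel(x=0, y=0).to_series()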

dask

dask implements an abstract graph representation of the dynamic task scheduling needed for performing out-of-core computation. dask also implements an efficient scheduling procedure for optimizing the execution time of directed acyclic graphs (DAGs) of tasks on a given machine.

From a user standpoint the key concept of dask.array is the notion of chunk. A chunk is the user-defined shape of the subdataset on which the unitary tasks will be applied.
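
For instance, a minimal dask.array sketch (array size and chunk shape are arbitrary and purely illustrative):

import dask.array as da

# a large array split into 1000 x 1000 chunks; each chunk defines a unitary task
x = da.random.random((20000, 20000), chunks=(1000, 1000))

# building the expression only constructs the task graph
m = x.mean(axis=0)

# the computation is performed chunk by chunk, out of core, upon request
result = m.compute()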

dask makes it easy to leverage the resources of shared-memory architectures (multi-core laptops or workstations) as well as the resources of distributed-memory architectures (CPU clusters).

At present, xarray integrates dask functionalities for shared-memory architectures. In the future, xarray will also make it possible to leverage dask on distributed-memory architectures.

Building upon dask has several advantages :
  • parallelization comes at no cost. The only modification needed in your code is the definition of the chunks on which the computation should be performed (see the sketch after this list).
  • dask back-end methods are generic, powerful and well tested for a number of different applications.
  • dask comes with powerful and easy-to-use profiling tools for optimizing the execution time on a given machine.
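
In the xarray/dask context, this typically amounts to passing a chunks argument when opening the dataset. A minimal sketch (file name, variable name and chunk sizes are hypothetical):

import xarray as xr

# the only change with respect to an in-memory workflow: define the chunks
ds = xr.open_dataset('large_model_output.nc',
                     chunks={'time_counter': 1, 'y': 500, 'x': 500})

# subsequent operations are evaluated lazily, chunk by chunk
mean_field = ds['votemper'].mean(dim='time_counter')
mean_field.to_netcdf('time_mean_temperature.nc')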

Most importantly, xarray and dask are supported by active and friendly teams of developers, whom we gratefully acknowledge.

Installation

Dependencies

oocgcm is a pure Python package, but some of its dependencies are not. The easiest way to get them installed is to use conda. You can install all oocgcm dependencies with the conda command line tool:

conda install xarray dask netCDF4 bottleneck

Installing

The project is still in its pre-alpha phase; the only way to install oocgcm is therefore to clone the repository from GitHub:

git clone https://github.com/lesommer/oocgcm.git
cd oocgcm
python setup.py install

Test

Test oocgcm with py.test:

cd oocgcm
py.test oocgcm

Generic versus data-specific tools

oocgcm is designed so that most methods can be used with several different types of gridded data.

There are two classes of methods in oocgcm: methods that are specific to a particular model or source of data, and methods that are data-agnostic (generic).

For generic tools, we try to use naming conventions for internal variables based on CF conventions or COMODO conventions.

Data-specific methods are generally created as particular instances of generic methods. This is in particular the case for grid descriptor objects.

The distinction between data-agnostic and data-specific methods is reflected in the general structure of the code.

All the methods related to a particular source of data are put together in a specific folder (e.g. oocgcm.oceanmodels.nemo) so that it is easier for third parties to maintain data-specific tools associated with their particular source of data / model.

Adapted to different usage

We would like oocgcm to be generic enough not to impose a particular workflow too strictly on the end user.

oocgcm can be used interactively in ipython or jupyter notebooks.

oocgcm can also be used for building individual python scripts that read and write collections of netcdf files (a la nco).
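
As an illustration, such a script might look like the following sketch, which uses tools described in the next sections (the file pattern, variable and coordinate names are hypothetical, and we assume here that latlon_2d_grid accepts the dataset's latitude and longitude arrays directly):

import xarray as xr
from oocgcm.griddeddata import grids

# lazily read the input collection of netCDF files
ds = xr.open_mfdataset('sst_*.nc', chunks={'time': 10})

# build a grid descriptor from the latitude / longitude arrays of the dataset
grd = grids.latlon_2d_grid(latitudes=ds['lat'], longitudes=ds['lon'])

# compute the norm of the horizontal gradient and write it to a new netCDF file
gsst = grd.norm_of_vectorfield(grd.horizontal_gradient(ds['sst']))
gsst.to_netcdf('sst_gradient_norm.nc')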

Grid descriptor objects

One of the key concepts of oocgcm is the notion of grid, implemented in oocgcm.core.grids. The current version implements a generic two-dimensional lat/lon grid in oocgcm.core.grids.generic_2d_grid. Future versions will have a similar object for three-dimensional data.

Grid objects can be created in various ways depending on the source of gridded data. At present, two-dimensional grid objects can be created:

  • from arrays of latitude and longitude,
  • from arrays of projection coordinate x and y,
  • from NEMO ocean model netCDF output files.

Grid descriptors provide access to all the information that may be needed for defining operations on the grid. They implement methods for vector calculus, differential calculus and spatial integration. Grid descriptors can also be sliced for defining descriptors of smaller portions of the physical domain.

In practice, a grid object can be created from arrays of latitude and longitude (1d or 2d arrays).

from oocgcm.griddeddata import grids
lats = ...
lons = ...
grd = grids.latlon_2d_grid(latitudes=lats,longitudes=lons)
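
For illustration, a self-contained sketch with synthetic 1d coordinate arrays (the resolution and extent are arbitrary):

import numpy as np
from oocgcm.griddeddata import grids

# synthetic 1d latitude / longitude arrays in degrees (quarter-degree resolution)
lats = np.arange(-75., 75., 0.25)
lons = np.arange(-180., 180., 0.25)
grd = grids.latlon_2d_grid(latitudes=lats, longitudes=lons)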

Grid descriptors can also be constructed from netCDF files describing the model grid (only available for the NEMO ocean model so far).

from oocgcm.oceanmodels.nemo import grids
grd = grids.nemo_2d_grid(nemo_coordinate_file=...,nemo_byte_mask_file=...,chunks=...)

Note

Construction from netCDF files should be preferred, as the metric factors that describe the grid (e.g. grd["cell_x_size_at_t_location"]) are more accurate in this case.

Grid descriptors can also be constructed from x, y coordinates (in m). This can be useful for analysing idealized model experiments (e.g. on the f-plane or beta-plane).

import numpy as np
from oocgcm.griddeddata import grids
x = np.arange(start=0, stop=1.e7, step=1.e6, dtype=float)
y = np.arange(start=0, stop=1.2e7, step=1.e6, dtype=float)
grd = grids.plane_2d_grid(ycoord=y, xcoord=x)

Note

It should be noted that creating a grid descriptor does not load any data nor create any additional numpy arrays. A grid descriptor only defines xarray.DataArray instances. Grid objects just contain the information on how to perform a particular calculation.

Once created, a grid descriptor grd can be restricted to a subdomain as follows

zoom = grd[500:800,2000:2500]

Operations on vector fields

Grid descriptor objects make it possible to perform operations on vector fields (as defined in oocgcm.core). For instance,

from oocgcm.core.grids import VectorField2d
vectorfield1 = VectorField2d(...,...)
vectorfield2 = VectorField2d(...,...)
sprod = grd.scalar_product(vectorfield1,vectorfield2)
cprod = grd.vertical_component_of_the_cross_product(vectorfield1,vectorfield2)

Note

Methods associated with grid objects only define xarray.DataArray instances. The actual computation only occurs when xarray.DataArray values are explicitly requested.

Differential operators

Grid descriptor objects implement differential operators that can be applied to gridded data :

from oocgcm.core.grids import VectorField2d

sst = ...
u = ...
v = ...
current = VectorField2d(u, v)
gsst = grd.horizontal_gradient(sst)
curl = grd.vertical_component_of_curl(current)

Note that by default, oocgcm uses low order discretization methods. This can be easily overridden by the user provided the reference API is followed.

Spatial integration

Grid descriptors also implement methods for performing spatial integration :

ssh = ...
sst = ...
index = grd.spatial_average_xy(ssh,where=sst<2.) # average ssh where sst <2.
index.plot() # plots the time series

Broadcasting additional dimensions

Methods associated with two-dimensional grid descriptors define operations based on the ('y', 'x') dimensions. Thanks to the flexibility of xarray, these methods can therefore be applied to dataarrays with extra dimensions (for instance, time, or the member index in an ensemble simulation).

For instance, if ssh is an xarray dataarray of sea surface height built from a collection of netCDF files corresponding to different times,

import xarray as xr
ssh = xr.open_mfdataset(path_to_my_files)['sossheig']

the norm of the sea surface height gradient is defined as follows

gssh = grd.norm_of_vectorfield(grd.horizontal_gradient(ssh))

but gssh does not hold any data yet. The calculation is only performed when a particular slice of data is requested, for instance when the first two-dimensional field is written to a netCDF file.

gssh[0].to_netcdf(path_to_my_output_file)

This abstract representation of operations, which xarray makes possible, is key to efficiently implementing out-of-core procedures. It also makes it straightforward to deal with grids with time-varying metrics, which is essential for working with arbitrary Lagrangian-Eulerian vertical coordinates.

Tools for simplifying I/O

Currently oocgcm provides tools that simplify the creation of xarray.DataArray from a physical database (structured collection of netcdf files).
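
For instance, the I/O helpers listed in the API reference can be used along the following lines (the file name and chunk sizes are hypothetical):

from oocgcm.core import io

# open a dataset lazily, with user-defined chunks
ds = io.return_xarray_dataset('model_output.nc',
                              chunks={'time_counter': 1, 'y': 500, 'x': 500})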

oocgcm will soon provide tools that facilitate the creation of physical databases resulting from analyses performed with oocgcm, following particular conventions (e.g. DRAKKAR conventions).

API reference

This page provides an auto-generated summary of oocgcm’s API.

General purpose utilities and data-structures

Parameters
parameters.physicalparameters.coriolis_parameter(...) Return Coriolis parameter for a given latitude.
parameters.physicalparameters.beta_parameter(...) Return planetary beta parameter.
Data-structures
core.grids.VectorField2d(vx, vy[, ...]) Minimal data structure for manipulating 2d vector fields on a grid.
core.grids.VectorField3d(vx, vy, vz[, ...]) Minimal data structure for manipulating 3d vector fields on a grid.
core.grids.Tensor2d(axx, axy, ayx, ayy[, ...]) Minimal data structure for manipulating 2d tensors on a grid.
I/O tools
core.io.return_xarray_dataset(filename[, chunks]) Return an xarray dataset corresponding to filename.
core.io.return_xarray_dataarray(filename, ...) Return a xarray dataarray corresponding to varname in filename.
oceanmodels.nemo.io.return_xarray_dataset(...) Wrapper for core.io.return_xarray_dataset.
oceanmodels.nemo.io.return_xarray_dataarray(...) Wrapper for core.io.return_xarray_dataarray.
Miscellaneous
core.utils.is_numpy(array) Return True if array is a numpy array
core.utils.is_xarray(array) Return True if array is a xarray.DataArray
core.utils.is_daskarray(array) Return True if array is a dask array
core.utils.has_chunks(array) Return True if array is a xarray or a daskarray with chunks.
core.utils.map_apply(func, scalararray) Return a xarray dataarray with value func(scalararray.data)

Two-dimensional grid descriptor objects

Two-dimensional grid descriptors hold all the information required for defining operations on xarray dataarrays that require information on the underlying grid of the model.

Generic two-dimensional model-agnostic grid descriptor
core.grids.generic_2d_grid([arrays, parameters]) Model agnostic grid object, two dimensional version.
Creating grid descriptors from arrays
griddeddata.grids.variables_holder_for_2d_grid_from_latlon_arrays([...]) This class creates the dictionary of variables to be used for creating an oocgcm.core.grids.generic_2d_grid from arrays of latitude and longitude.
griddeddata.grids.latlon_2d_grid([...]) Return a generic 2d grid built from arrays of lat,lon.
griddeddata.grids.variables_holder_for_2d_grid_from_plane_coordinate_arrays([...]) This class creates the dictionary of variables to be used for creating an oocgcm.core.grids.generic_2d_grid from arrays of plane coordinates.
griddeddata.grids.plane_2d_grid([xcoord, ...]) Return a generic 2d grid built from arrays of coordinates.
Creating grid descriptors from model output
oceanmodels.nemo.grids.variables_holder_for_2d_grid_from_nemo_ogcm([...]) This class creates the dictionary of variables used for creating an oocgcm.core.grids.generic_2d_grid from NEMO output files.
oceanmodels.nemo.grids.nemo_2d_grid([...]) Return a generic 2d grid from nemo coordinate and mask files.
Attributes
core.grids.generic_2d_grid.dims Dimensions of the xarray dataarrays describing the grid.
core.grids.generic_2d_grid.ndims Number of dimensions of the dataarrays describing the grid.
core.grids.generic_2d_grid.shape Shape of the xarray dataarrays describing the grid.
core.grids.generic_2d_grid.chunks Chunks of the xarray dataarrays describing the grid.
Manipulating grids

generic_2d_grid implements the mapping interface with keys given by variable names and values given by DataArray objects.

core.grids.generic_2d_grid.__getitem__(item) The behavior of this function depends on the type of item.
core.grids.generic_2d_grid.__contains__(key) The ‘in’ operator will return true or false depending on whether ‘key’ is an array in the dataset or not.
core.grids.generic_2d_grid.__iter__()
core.grids.generic_2d_grid.chunk([chunks]) Rechunk all the variables defining the grid.
core.grids.generic_2d_grid.get_projection_coordinates([...]) Return (x,y) the coordinate arrays (in m) at grid location.
Changing the grid location of arrays
core.grids.generic_2d_grid.change_grid_location_t_to_u(...) Return a xarray corresponding to scalararray averaged at a new grid location.
core.grids.generic_2d_grid.change_grid_location_u_to_t(...) Return a xarray corresponding to scalararray averaged at a new grid location.
core.grids.generic_2d_grid.change_grid_location_t_to_v(...) Return a xarray corresponding to scalararray averaged at a new grid location.
core.grids.generic_2d_grid.change_grid_location_v_to_t(...) Return a xarray corresponding to scalararray averaged at a new grid location.
core.grids.generic_2d_grid.change_grid_location_f_to_u(...) Return a xarray corresponding to scalararray averaged at a new grid location.
core.grids.generic_2d_grid.change_grid_location_f_to_v(...) Return a xarray corresponding to scalararray averaged at a new grid location.
core.grids.generic_2d_grid.change_grid_location_v_to_u(...) Return a xarray corresponding to scalararray averaged at a new grid location.
core.grids.generic_2d_grid.change_grid_location_u_to_v(...) Return a xarray corresponding to scalararray averaged at a new grid location.
Vector calculus
core.grids.generic_2d_grid.norm_of_vectorfield(...) Return the norm of a vector field, at t-point.
core.grids.generic_2d_grid.scalar_product(...) Return the scalar product of two vector fields, at t-point.
core.grids.generic_2d_grid.scalar_outer_product(...) Return the outer product of a scalar (t location)
core.grids.generic_2d_grid.vertical_component_of_the_cross_product(...) Return the cross product of two vector fields.
Differential operators
core.grids.generic_2d_grid.horizontal_gradient(...) Return the horizontal gradient of the input datastructure.
core.grids.generic_2d_grid.horizontal_gradient_vector(...) Return the horizontal gradient of a scalar field defined at t-points.
core.grids.generic_2d_grid.horizontal_gradient_tensor(...) Return the horizontal gradient tensor of a two-dimensional vector field at u,v locations.
core.grids.generic_2d_grid.horizontal_laplacian(...) Return the horizontal laplacian of a scalar field at t-points.
core.grids.generic_2d_grid.vertical_component_of_curl(...) Return the vertical component of the curl of a vector field.
core.grids.generic_2d_grid.horizontal_divergence(...) Return the horizontal divergence of a vector field at u,v-points.
Spatial integration
core.grids.generic_2d_grid.integrate_dxdy(array) Return the horizontal integral of array in regions where where is True.
core.grids.generic_2d_grid.spatial_average_xy(array) Return the horizontal average of array in regions where where is True.
Operators specific to oceanic applications
core.grids.generic_2d_grid.geostrophic_current_from_sea_surface_height(...) Return the geostrophic current on u,v-grids.
core.grids.generic_2d_grid.q_vector_due_to_kinematic_deformation(...) Return the component of the generalized Q-vector associated with kinematic deformation of a two-dimensional velocity field.
core.grids.generic_2d_grid.frontogenesis_function(...) Return the frontogenesis function.

Tools for filtering timeseries and spatial fields

oocgcm.filtering.timefilters

Testing tools

oocgcm.test.signals

Contributing to oocgcm

If you would like to contribute to the development of oocgcm (reporting bugs, suggesting improvements, ...), here is some information.

Suggesting improvements to oocgcm

As for any piece of software hosted on GitHub, contributing to oocgcm requires knowing the basics of git version control. Then just:

  • create a GitHub account
  • fork the oocgcm repository
  • push the proposed changes to your forked repository
  • submit a pull request

Contributions are also welcome on the oocgcm issue tracker.

If you don’t feel comfortable with the above tools, just contact me by email.

Overall layout of the library

Here is some information about the overall layout of the oocgcm library:

oocgcm/
    setup.py
    oocgcm/
    docs/
    examples/
    ci/
setup.py
installation script.
oocgcm
contains the actual library
docs
contains the documentation in rst format. Used for building the docs on readthedocs with sphinx and numpydoc
examples
provides examples of applications in notebooks and python scripts
ci
contains the yaml files for continuous integration. Used for testing the build and testing oocgcm with several combinations of libraries.

Structure of oocgcm package

The actual package itself contains the following submodules:

oocgcm/
    oocgcm/
        core/
        parameters/
        griddeddata/
        oceanmodels/
        oceanfuncs/
        airseafuncs/
        spectra/
        stats/
        filtering/
        test/
core
contains the data-agnostic (generic) versions of methods whose concrete implementations are data-specific.
parameters
defines physical and mathematical parameters that may be used in several submodules.
griddeddata
contains data-specific methods adapted to two-dimensional gridded data, including satellite data.
oceanmodels
contains data-specific methods adapted to a range of C-grid ocean models. Currently contains only tools for the NEMO ocean model.
oceanfuncs
contains functions that are only relevant for analyzing ocean data but are not tied to a particular source of data (in principle, a numpy version of each function and wrappers for xarray dataarrays).
airseafuncs
will contain methods for analysing air-sea exchanges.
regrid
will contain methods for regridding gridded data.
spectra
will contain methods for computing wavenumber spectra and frequency spectra out of xarray.DataArray
stats
will contain methods for applying descriptive statistics to gridded data.
filtering
will contain methods for filtering gridded data in space or in time.
test
contains the test suite for oocgcm. Unit tests are run after each commit.

Structure of oocgcm.core

oocgcm/core contains:

oocgcm/core/
    io.py
    grids.py
    utils.py
io.py
contains functions that are used for creating xarray datasets and xarray dataarrays and functions used for writing output files.
grids.py
contains tools that define grid descriptor objects.
utils.py
contains useful functions for several submodules. This includes functions for testing and asserting types.

Files with similar names and contents can be repeated for each specific source of data when needed.

Frequently Asked Questions

Why is the code architecture so complex?

The architecture and layout of oocgcm reflect the constraints that are inherent in building a tool that can easily be adapted to different sources of data.

We have decided to separate the generic tools from the data-specific tools as much as possible. Data-specific tools are generally created from generic methods in order to avoid unnecessary duplication, but some methods are inherently data-specific.

For instance, the ‘time’ dimension of NEMO ocean model output is called ‘time_counter’, but oocgcm assumes that the ‘time’ dimension is ‘t’. It is therefore necessary to change the name of this dimension when reading NEMO model output. This is done in oocgcm.oceanmodels.nemo.io methods.
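
Conceptually, the renaming amounts to the following sketch (the actual implementation lives in oocgcm.oceanmodels.nemo.io; the file name here is hypothetical):

import xarray as xr

# open a NEMO output file and rename its time dimension to the name oocgcm expects
ds = xr.open_dataset('nemo_output.nc')
ds = ds.rename({'time_counter': 't'})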

We have chosen to gather all the methods that are specific to a given source of data within a single folder in order to facilitate maintenance by third parties.

In practice, switching from one model to the other should be as simple as changing

from oocgcm.oceanmodels.nemo import io, grids

into

from oocgcm.oceanmodels.mitgcm import io, grids

What’s New ?

v0.1 (not yet released)

This initial release includes

  • A generic two-dimensional C-grid grid descriptor that provides methods for vector calculus, differential operators and spatial integration.

Get in touch

  • Report bugs, suggest feature ideas or view the source code on GitHub.