Why use xarray and dask?
oocgcm is a pure Python package built on top of xarray, which itself integrates dask.array to support streaming computation on large datasets that don’t fit into memory. Why have we chosen to use xarray and dask?
xarray implements N-dimensional variants of the core pandas data structures. In addition, xarray adopts the Common Data Model for self-describing data. In practice, an xarray.Dataset is an in-memory representation of a netCDF file or of a collection of netCDF files.
Building upon xarray has several advantages:
- metadata available in the netCDF files are attached to xarray objects in the form of a Python dictionary, x.attrs. This simplifies the exploration of the dataset, yields more robust code, and eases the export of results to netCDF files.
- because dimensions are associated with each variable, xarray allows flexible split-apply-combine operations with groupby, e.g. x.groupby('time.dayofyear').mean()
- xarray objects do not load data into memory by default. Data are only loaded at execution time, when actually needed. This means that users have access to their whole dataset without having to worry about loading the data, which simplifies the prototyping of a new analysis.
- xarray is natively integrated with pandas, meaning that xarray objects can straightforwardly be exported to pandas DataFrames. This gives easy access to a range of time-series analysis tools.
- xarray objects can be exported to iris or cdms, so that several different analysis tools can be combined in a single workflow.
- Little work is needed to apply a numpy function to xarray objects. Several numpy ufuncs are already applicable directly to the xarray.DataArray data structure.
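The points above can be illustrated with a short, self-contained sketch. A small in-memory DataArray stands in for data that a real workflow would obtain with xr.open_dataset on a netCDF file; the variable names and attribute values are illustrative only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A small in-memory DataArray standing in for a netCDF variable
# (a real workflow would start from xr.open_dataset('file.nc')).
times = pd.date_range("2000-01-01", periods=6, freq="D")
sst = xr.DataArray(
    np.arange(6.0),
    coords={"time": times},
    dims="time",
    name="sst",
    attrs={"units": "degC", "long_name": "sea surface temperature"},
)

# Metadata travel with the object as a plain Python dictionary.
print(sst.attrs["units"])

# Split-apply-combine keyed on a datetime component of the coordinate.
daily_mean = sst.groupby("time.dayofyear").mean()

# numpy ufuncs apply directly and return another DataArray,
# preserving dimensions and coordinates.
sqrt_sst = np.sqrt(sst)

# Round-trip to pandas for its time-series analysis tools.
series = sst.to_series()
```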
dask implements an abstract graph representation of the dynamic task scheduling needed for performing out-of-core computation. dask also implements an efficient scheduling procedure that optimizes the execution time of directed acyclic graphs (DAGs) of tasks on a given machine.
From a user standpoint, the key concept of dask.array is the notion of chunk: the user-defined shape of the sub-arrays on which the unitary tasks are applied.
dask makes it easy to leverage the resources of shared-memory architectures (multi-core laptops or workstations) as well as distributed-memory architectures (clusters of CPUs).
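A minimal sketch of the chunk concept with dask.array (the array shape and chunk size here are arbitrary): each chunk becomes one unit of work in the task graph, and the scheduler runs those tasks in parallel across the available cores.

```python
import dask.array as da

# A 10000 x 10000 array split into 1000 x 1000 chunks: the chunk shape
# is chosen by the user and defines the granularity of the task graph.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# Operations only build a lazy task graph; nothing is computed yet,
# and the full array is never materialized in memory at once.
total = (x + 1).sum()

# .compute() hands the graph to the scheduler, which executes
# one task per chunk, in parallel where possible.
result = total.compute()
print(result)  # 2.0e8: sum of 10000 * 10000 entries equal to 2.0
```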
Building upon dask has several advantages:
- parallelization comes at almost no cost. The only modification of your code that is needed is defining the chunks on which the computation should be performed.
- dask back-end methods are generic, powerful and well tested for a number of different applications.
- dask comes with powerful and easy-to-use profiling tools for optimizing the execution time on a given machine.
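As one example of those profiling tools, dask.diagnostics provides a Profiler context manager that records one entry per executed task; the computation below is just a placeholder workload.

```python
import dask.array as da
from dask.diagnostics import Profiler

# An arbitrary chunked workload to profile.
x = da.random.random((4000, 4000), chunks=(1000, 1000))

with Profiler() as prof:
    result = (x + x.T).mean().compute()

# Each record describes one task: its key, start and end times,
# and the worker it ran on.
n_tasks = len(prof.results)
print(n_tasks)

# prof.visualize() would render an interactive timeline of the task
# stream (requires bokeh), useful for spotting scheduling bottlenecks.
```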