# Common styles

## Open with Xarray, write to Zarr

This recipe category uses Xarray to open input files and Zarr as the target dataset format. Inputs can be in any file format Xarray can read, including NetCDF, OPeNDAP, GRIB, Zarr, and, via rasterio, GeoTIFF and other geospatial raster formats. The target Zarr dataset will conform to the Xarray Zarr encoding conventions.

> **Tip:** The following example recipes are representative of this style:

Below we give a basic overview of how this style of recipe is used.

First, you must define a file pattern. Once you have a `FilePattern` object, the recipe pipeline will contain, at a minimum, the following transforms applied to the file pattern collection:

- `OpenURLWithFSSpec`: retrieves each pattern file using the specified URLs.
- `OpenWithXarray`: loads each pattern file into an `xarray.Dataset`.
- `StoreToZarr`: generates a Zarr store by combining the datasets.
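Putting these together, a minimal recipe might look like the following sketch. The transform names and the `FilePattern` API come from `pangeo_forge_recipes`, but the date list, URL scheme, and output location are hypothetical placeholders for illustration only.

```python
import apache_beam as beam

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

# Hypothetical source data: one NetCDF file per day, combined along "time".
dates = ["2020-01-01", "2020-01-02", "2020-01-03"]

def make_url(time):
    # Illustrative URL scheme only; substitute your dataset's real layout.
    return f"https://data.example.com/daily/{time}.nc"

pattern = FilePattern(make_url, ConcatDim("time", dates))

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()                          # fetch each input file
    | OpenWithXarray(file_type=pattern.file_type)  # open as xarray.Dataset
    | StoreToZarr(
        store_name="example.zarr",              # name of the generated store
        target_root="output",                   # hypothetical output location
        combine_dims=pattern.combine_dim_keys,  # from the file pattern
        target_chunks={"time": 1},              # optional output chunking
    )
)

# The composite transform is executed by feeding it to a Beam pipeline:
with beam.Pipeline() as p:
    p | recipe
```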

> **Tip:** If using the `pangeo_forge_recipes.transforms.ConsolidateDimensionCoordinates` transform, make sure to also chain the `pangeo_forge_recipes.transforms.ConsolidateMetadata` transform onto your recipe.
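Building on the sketch above (same imports and `pattern`), the two consolidation transforms chain directly after `StoreToZarr`, with `ConsolidateMetadata` last:

```python
from pangeo_forge_recipes.transforms import (
    ConsolidateDimensionCoordinates,
    ConsolidateMetadata,
)

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="example.zarr",
        target_root="output",
        combine_dims=pattern.combine_dim_keys,
    )
    | ConsolidateDimensionCoordinates()  # rechunk dimension coordinates into single chunks
    | ConsolidateMetadata()              # then consolidate the Zarr metadata
)
```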

> **Note:** `pangeo_forge_recipes.transforms.StoreToZarr` supports appending to existing Zarr stores via the optional `append_dim` keyword argument. This option functions nearly identically to the `append_dim` kwarg in `xarray.Dataset.to_zarr`; the two differences are that Pangeo Forge automatically introspects the inputs in your `FilePattern` to determine how the existing Zarr store's dimensions need to be resized, and that writes are parallelized via Apache Beam. Beyond checking that the named `append_dim` already exists in the dataset you are appending to, this option does not verify the logical consistency (e.g. contiguousness) of the appended data. It is therefore up to you, the user, to ensure that the inputs provided in the file pattern for the appending recipe are limited to those you actually want to append.
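For example, an appending recipe might look like the following sketch. As above, the URL scheme and output location are hypothetical; the key details are that the pattern contains only the new inputs and that `StoreToZarr` receives `append_dim`.

```python
import apache_beam as beam

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# The pattern must cover ONLY the inputs to append, e.g. days that fall
# after the end of the existing store's "time" dimension.
new_dates = ["2020-01-04", "2020-01-05"]

def make_url(time):
    return f"https://data.example.com/daily/{time}.nc"  # hypothetical scheme

pattern = FilePattern(make_url, ConcatDim("time", new_dates))

append_recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="example.zarr",              # the EXISTING store to extend
        target_root="output",
        combine_dims=pattern.combine_dim_keys,
        append_dim="time",                      # must already exist in the store
    )
)
```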

## Open with Kerchunk, write to virtual Zarr

Whereas the standard Zarr recipe creates a copy of the original dataset in the Zarr format, this Kerchunk-based reference recipe style does not copy the data. Instead, it creates a Kerchunk mapping, which allows archival formats (including NetCDF, GRIB2, etc.) to be read as if they were Zarr datasets. More details about how Kerchunk works can be found in the kerchunk docs and this blog post.
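As a rough sketch of what such a recipe can look like: the `OpenWithKerchunk` and `WriteCombinedReference` transforms below exist in `pangeo_forge_recipes.transforms`, but the parameters shown, along with the URL scheme and output location, are assumptions to verify against the current API.

```python
import apache_beam as beam

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import OpenWithKerchunk, WriteCombinedReference

dates = ["2020-01-01", "2020-01-02"]

def make_url(time):
    return f"https://data.example.com/daily/{time}.nc"  # hypothetical scheme

pattern = FilePattern(make_url, ConcatDim("time", dates))

reference_recipe = (
    beam.Create(pattern.items())
    # Index each archival file's internal byte layout rather than copying data.
    | OpenWithKerchunk(file_type=pattern.file_type)
    # Combine the per-file references and write a single reference file that
    # can then be opened as a (virtual) Zarr store.
    | WriteCombinedReference(
        store_name="example-reference",
        target_root="output",          # assumed output location
        concat_dims=["time"],          # assumed parameter values
        identical_dims=["lat", "lon"],
    )
)
```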

> **Note:** Examples of this recipe style currently exist in development form and will be cited here as soon as they are integration tested, which is pending pangeo-forge/pangeo-forge-recipes#608.

### Is this style right for my dataset?

For archival data stored on high-throughput storage devices, and for which preprocessing is not required, reference recipes are an ideal, storage-efficient option. When deciding whether to create a reference recipe, it is important to consider questions such as:

#### Where are the archival (i.e. source) files for this dataset currently stored?

If the original data are not already in the cloud (or on some other high-bandwidth storage system, such as an on-premises data center), the performance benefits of a reference recipe may be limited, because the network speed of access to the original data will constrain I/O throughput.

#### Does this dataset require preprocessing?

With reference recipes, modification of the underlying data is not possible. For example, the chunking scheme of a dataset cannot be modified with Kerchunk, so you are limited to the chunking of the archival data. If you need to optimize your dataset's chunking for space or time, the standard Zarr recipe is the only option. While you cannot modify chunking in a reference recipe, changes to the metadata (attributes, encoding, etc.) can be applied.