Common styles#
Open with Xarray, write to Zarr#
This recipe category uses Xarray to open input files and Zarr as the target dataset format. Inputs can be in any file format Xarray can read, including NetCDF, OPeNDAP, GRIB, Zarr, and, via rasterio, GeoTIFF and other geospatial raster formats. The target Zarr dataset will conform to the Xarray Zarr encoding conventions.
Below we give a very basic overview of how this recipe is used.
First you must define a file pattern.
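As a rough illustration, a file pattern pairs a function that builds each source URL with the keys that function is called on. The sketch below is plain Python and only shows that pairing; the host, URL template, and dates are hypothetical and do not refer to a real dataset.

```python
from datetime import date, timedelta

# Hypothetical archive layout: one NetCDF file per day.
URL_TEMPLATE = "https://data.example.org/sst/{day:%Y/%m}/sst_{day:%Y%m%d}.nc"


def make_url(time: date) -> str:
    """Return the source URL for one day (the pattern's format function)."""
    return URL_TEMPLATE.format(day=time)


# Keys along the dimension the files will be concatenated on.
start = date(2020, 1, 1)
dates = [start + timedelta(days=n) for n in range(3)]

# The full set of inputs the pattern would describe.
urls = [make_url(d) for d in dates]
```

In pangeo_forge_recipes, a function like make_url together with a ConcatDim("time", dates) from pangeo_forge_recipes.patterns is what a FilePattern is constructed from. The function's argument name (time here) has to match the name of the combine dimension.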
Once you have a FilePattern object, the recipe pipeline will contain at a minimum the following transforms applied to the file pattern collection:
pangeo_forge_recipes.transforms.OpenURLWithFSSpec: retrieves each pattern file using the specified URLs.
pangeo_forge_recipes.transforms.OpenWithXarray: loads each pattern file into an xarray.Dataset.
pangeo_forge_recipes.transforms.StoreToZarr: generates a Zarr store by combining the datasets.
pangeo_forge_recipes.transforms.ConsolidateDimensionCoordinates: consolidates the dimension coordinates to improve dataset read performance.
pangeo_forge_recipes.transforms.ConsolidateMetadata: calls Zarr’s convenience function to consolidate metadata.
Tip
If you use the pangeo_forge_recipes.transforms.ConsolidateDimensionCoordinates transform, make sure to chain the pangeo_forge_recipes.transforms.ConsolidateMetadata transform after it in your recipe.
Note
pangeo_forge_recipes.transforms.StoreToZarr supports appending to existing Zarr stores via the optional append_dim keyword argument. This option functions nearly identically to the append_dim kwarg in xarray.Dataset.to_zarr; the two differences are that Pangeo Forge automatically introspects the inputs in your FilePattern to determine how the existing Zarr store dimensions need to be resized, and that writes are parallelized via Apache Beam. Apart from ensuring that the named append_dim already exists in the dataset to which you are appending, this option does not ensure logical consistency (for example, contiguity) of the appended data. When selecting this option, it is therefore up to you, the user, to ensure that the inputs provided in the file pattern for the appending recipe are limited to those you want to append.
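Because consistency of the appended data is left to the user, it can help to select the inputs for an appending recipe explicitly. The sketch below is one possible check, written in plain Python and not part of pangeo_forge_recipes: the helper name, the daily step, and the dates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def select_inputs_to_append(existing_last, candidates, step=timedelta(days=1)):
    """Return the candidate dates that extend the store without gaps.

    Dates at or before ``existing_last`` are dropped. A ValueError is raised
    if the remaining dates skip a step or contain duplicates.
    """
    new = sorted(d for d in candidates if d > existing_last)
    expected = existing_last + step
    for d in new:
        if d != expected:
            raise ValueError(f"non-contiguous input {d}: expected {expected}")
        expected += step
    return new


# Hypothetical state: the existing store ends on 2020-01-03,
# while the archive now offers files for 2020-01-01 through 2020-01-06.
existing_last = date(2020, 1, 3)
candidates = [date(2020, 1, day) for day in range(1, 7)]
to_append = select_inputs_to_append(existing_last, candidates)
```

The resulting dates would then be the only keys used in the file pattern of the appending recipe, with the same dimension name passed as append_dim.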
Open with Kerchunk, write to virtual Zarr#
The standard Zarr recipe creates a copy of the original dataset in the Zarr format. This Kerchunk-based reference recipe style does not copy the data; instead, it creates a Kerchunk mapping, which allows archival formats (such as NetCDF and GRIB2) to be read as if they were Zarr datasets. More details about how Kerchunk works can be found in the Kerchunk docs and this blog post.
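To make the idea of a mapping concrete, the sketch below hand-writes a tiny reference set in the version 1 reference layout and looks up where one chunk lives. The bucket, file names, offsets, and the variable temp are invented for illustration; real reference sets are generated by Kerchunk rather than written by hand.

```python
import json

# Zarr keys map either to inline content (metadata) or to
# [url, offset, length] byte ranges inside the archival files.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps(
            {
                "shape": [2, 100],
                "chunks": [1, 100],
                "dtype": "<f4",
                "compressor": None,
                "filters": None,
                "fill_value": None,
                "order": "C",
                "zarr_format": 2,
            }
        ),
        # Each chunk is 1 x 100 float32 values, i.e. 400 uncompressed bytes.
        "temp/0.0": ["s3://example-bucket/archive/file_000.nc", 8192, 400],
        "temp/1.0": ["s3://example-bucket/archive/file_001.nc", 8192, 400],
    },
}


def locate(reference_set, key):
    """Describe where the bytes for a Zarr key come from."""
    entry = reference_set["refs"][key]
    if isinstance(entry, str):
        # Small objects such as metadata are stored inline.
        return {"inline": entry}
    if len(entry) == 1:
        # A one-element entry refers to a whole file.
        return {"url": entry[0]}
    url, offset, length = entry
    return {"url": url, "start": offset, "stop": offset + length}


chunk = locate(references, "temp/1.0")
```

Here chunk points at a byte range in file_001.nc, so a reader can fetch only those bytes from the original file instead of a copied Zarr chunk.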
Note
Examples of this recipe style currently exist in development form, and will be cited here as soon as they are integration tested, which is pending pangeo-forge/pangeo-forge-recipes#608.
Is this style right for my dataset?#
For archival data stored on high-throughput storage devices, and for which preprocessing is not required, reference recipes are an ideal and storage-efficient option. When choosing whether to create a reference recipe, it is important to consider questions such as:
Where are the archival (i.e. source) files for this dataset currently stored?#
If the original data are not already in the cloud (or some other high-bandwidth storage device, such as an on-prem data center), the performance benefits of using a reference recipe may be limited, because network speeds to access the original data will constrain I/O throughput.
Does this dataset require preprocessing?#
With reference recipes, modification of the underlying data is not possible. For example, the chunking schema of a dataset cannot be modified with Kerchunk, so you are limited to the chunking schema of the archival data. If you need to optimize your dataset's chunking schema for space or time, the standard Zarr recipe is the only option. While you cannot modify chunking in a reference recipe, changes to the metadata (attributes, encoding, etc.) can be applied.
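The distinction can be seen in a reference set itself: attributes are stored as inline JSON that can be rewritten, while chunk entries are byte ranges fixed by the layout of the source files. The following is a minimal sketch on a hand-written reference set; the update_attrs helper, the variable temp, and the URL are hypothetical and are not a Kerchunk API.

```python
import copy
import json

# Hypothetical reference set with one variable and one chunk.
refs = {
    "version": 1,
    "refs": {
        "temp/.zattrs": json.dumps({"units": "K"}),
        "temp/.zarray": json.dumps(
            {"shape": [1, 100], "chunks": [1, 100], "dtype": "<f4"}
        ),
        "temp/0.0": ["s3://example-bucket/archive/file_000.nc", 8192, 400],
    },
}


def update_attrs(reference_set, variable, **new_attrs):
    """Return a copy of the reference set with extra attributes on a variable."""
    updated = copy.deepcopy(reference_set)
    key = f"{variable}/.zattrs"
    attrs = json.loads(updated["refs"].get(key, "{}"))
    attrs.update(new_attrs)
    updated["refs"][key] = json.dumps(attrs)
    return updated


patched = update_attrs(refs, "temp", long_name="air temperature")
```

The chunk entry temp/0.0 is untouched by this edit. Changing the chunks value in temp/.zarray would not work the same way, because the stored byte ranges would no longer line up with the declared chunk shape.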