Recipes

A recipe defines how to transform data in one format / location into another format / location. The primary way people contribute to Pangeo Forge is by writing / maintaining recipes.

Warning

The Recipe API is still in flux and may change. Make sure the version of the documentation you are reading matches your installed version of pangeo_forge_recipes.

The Recipe Object

A Recipe is a Python object which encapsulates a workflow for transforming data. A Recipe knows how to take a file pattern, which describes a collection of source files (“inputs”), and turn it into a single analysis-ready, cloud-optimized dataset. Creating a recipe does not actually cause any data to be read or written; the recipe is just the description of the transformation. To actually do the work, the recipe must be executed. Recipe authors (i.e. data users or data managers) can either execute their recipes on their own computers and infrastructure, in private, or contribute their recipe to the public Pangeo Forge Recipe Box, where it can be executed in the cloud via Bakeries.

Recipe Classes

To write a recipe, you must start from one of the existing recipe classes. Recipe classes are based on a specific data model for the input files and target dataset format. Right now, there is only one recipe class implemented:

XarrayZarr Recipe

The pangeo_forge_recipes.recipes.XarrayZarrRecipe recipe class uses Xarray to read the input files and Zarr as the target dataset format. The inputs can be in any file format Xarray can read, including NetCDF, OPeNDAP, GRIB, Zarr, and, via rasterio, GeoTIFF and other geospatial raster formats. The target Zarr dataset can be written to any storage location supported by filesystem-spec; see Storage for more details. The target Zarr dataset will conform to the Xarray Zarr encoding conventions.

The best way to really understand how recipes work is to go through the relevant tutorials for this recipe class; they are listed, in order of increasing complexity, in the Recipe Tutorials.

Below we give a very basic overview of how this recipe is used.

First you must define a file pattern. Once you have a file pattern object, initializing an XarrayZarrRecipe can be as simple as this.

recipe = XarrayZarrRecipe(file_pattern)

There are many other options we could pass, all covered in the API documentation below.
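For example, a lightly expanded sketch (the chunk sizes and open options here are illustrative assumptions, not required values):

import xarray as xr
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# file_pattern is the FilePattern object defined above
recipe = XarrayZarrRecipe(
    file_pattern,
    inputs_per_chunk=2,                       # combine two source files per target chunk
    target_chunks={"time": 10},               # desired chunking of the target Zarr store
    xarray_open_kwargs={"decode_cf": True},   # extra options passed to xr.open_dataset
)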

All recipes need storage for the target dataset. If you have already defined a pangeo_forge_recipes.storage.FSSpecTarget object, then you can either assign it when you initialize the recipe or later, e.g.

recipe.target = FSSpecTarget(fs=fs, root_path=target_path)
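For example, a minimal sketch (the bucket path is hypothetical) that builds the filesystem object with fsspec and assigns the target:

import fsspec
from pangeo_forge_recipes.storage import FSSpecTarget

fs = fsspec.filesystem("s3")  # any filesystem-spec implementation works here
target_path = "my-bucket/my-dataset.zarr"
recipe.target = FSSpecTarget(fs=fs, root_path=target_path)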

This recipe may also require a cache, a place to store temporary files. We can create one as follows.

recipe.input_cache = CacheFSSpecTarget(fs=fs, root_path=cache_path)
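Continuing the sketch above (the cache path is again hypothetical):

from pangeo_forge_recipes.storage import CacheFSSpecTarget

cache_path = "my-bucket/cache"
recipe.input_cache = CacheFSSpecTarget(fs=fs, root_path=cache_path)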

Once your recipe is defined and has its targets assigned, you’re ready to move on to Recipe Execution.

The API documentation below explains all of the possible options for XarrayZarrRecipe. Many of these options are explored further in the Recipe Tutorials.

API Documentation

class pangeo_forge_recipes.recipes.XarrayZarrRecipe(file_pattern, inputs_per_chunk=1, target_chunks=<factory>, target=None, input_cache=None, metadata_cache=None, cache_inputs=None, copy_input_to_local_file=False, consolidate_zarr=True, consolidate_dimension_coordinates=True, xarray_open_kwargs=<factory>, xarray_concat_kwargs=<factory>, delete_input_encoding=True, process_input=None, process_chunk=None, lock_timeout=None, subset_inputs=<factory>, open_input_with_fsspec_reference=False)

This class represents a dataset composed of many individual NetCDF files. It uses Xarray to read and combine the inputs and writes its output to Zarr. The organization of the source files is described by the file_pattern. Currently this recipe supports at most one MergeDim and one ConcatDim in the File Pattern. A short usage sketch follows the parameter list below.

Parameters
  • file_pattern (FilePattern) – An object which describes the organization of the input files.

  • inputs_per_chunk (int) – The number of inputs to use in each chunk along the concat dim. Must be an integer >= 1.

  • target_chunks (Dict[str, int]) – Desired chunk structure for the target dataset. This is a dictionary mapping dimension names to chunk size. When using a patterns.FilePattern with a patterns.ConcatDim that specifies nitems_per_file, then you don’t need to include the concat dim in target_chunks.

  • target (Optional[AbstractTarget]) – A location in which to put the dataset. Can also be assigned at run time.

  • input_cache (Optional[CacheFSSpecTarget]) – A location in which to cache temporary data.

  • metadata_cache (Optional[MetadataTarget]) – A location in which to cache metadata for inputs and chunks. Required if nitems_per_file=None on concat dim in file pattern.

  • cache_inputs (Optional[bool]) – If True, inputs are copied to input_cache before opening. If False, try to open inputs directly from their source location.

  • copy_input_to_local_file (bool) – Whether to copy the inputs to a temporary local file. In this case, a path (rather than file object) is passed to xr.open_dataset. This is required for engines that can’t open file-like objects (e.g. pynio).

  • consolidate_zarr (bool) – Whether to consolidate the resulting Zarr dataset.

  • consolidate_dimension_coordinates (bool) – Whether to rewrite coordinate variables as a single chunk. We recommend consolidating coordinate variables to avoid many small read requests to get the coordinates in xarray.

  • xarray_open_kwargs (dict) – Extra options for opening the inputs with Xarray.

  • xarray_concat_kwargs (dict) – Extra options to pass to Xarray when concatenating the inputs to form a chunk.

  • delete_input_encoding (bool) – Whether to remove Xarray encoding from variables in the input dataset.

  • process_input (Optional[Callable[[Dataset, str], Dataset]]) – Function to call on each opened input, with signature (ds: xr.Dataset, filename: str) -> ds: xr.Dataset.

  • process_chunk (Optional[Callable[[Dataset], Dataset]]) – Function to call on each concatenated chunk, with signature (ds: xr.Dataset) -> ds: xr.Dataset.

  • lock_timeout (Optional[int]) – The default timeout for acquiring a chunk lock.

  • subset_inputs (Dict[str, int]) – If set, break each input file up into multiple chunks along the specified dimensions according to the mapping. For example, {'time': 5} would split each input file into 5 chunks along the time dimension. Multiple dimensions are allowed.

  • open_input_with_fsspec_reference (bool) – If True, use fsspec-reference-maker to generate a reference filesystem for each input, to be used when opening the file with Xarray as a virtual Zarr dataset.
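As a sketch of how a few of these options combine (the preprocessing function and chunking values are hypothetical, not part of the library):

from pangeo_forge_recipes.recipes import XarrayZarrRecipe

def drop_bounds(ds, filename):
    # hypothetical per-input preprocessing matching the process_input signature
    return ds.drop_vars(["time_bnds"], errors="ignore")

recipe = XarrayZarrRecipe(
    file_pattern,
    process_input=drop_bounds,      # called on each opened input dataset
    subset_inputs={"time": 5},      # split each input into 5 chunks along time
    target_chunks={"time": 10},     # chunking of the target Zarr store
)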

HDF Reference Recipe

Like the XarrayZarrRecipe, this recipe allows us to access data from a collection of NetCDF / HDF files more efficiently. However, this recipe does not actually copy the original source data. Instead, it generates metadata files which reference and index the original data, allowing it to be accessed more quickly and easily. For more background, see this blog post.
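Creating one looks much like the XarrayZarrRecipe; a minimal sketch, assuming file_pattern is a FilePattern as described above:

from pangeo_forge_recipes.recipes import HDFReferenceRecipe

recipe = HDFReferenceRecipe(file_pattern)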

There is currently one tutorial for this recipe (see the Recipe Tutorials).

API Documentation

class pangeo_forge_recipes.recipes.HDFReferenceRecipe(file_pattern, output_json_fname='reference.json', output_intake_yaml_fname='reference.yaml', target=None, metadata_cache=None, netcdf_storage_options=<factory>, inline_threshold=500, output_storage_options=<factory>, template_count=20, xarray_open_kwargs=<factory>, xarray_concat_args=<factory>)

Generates reference files for each input NetCDF file, then combines them into one ensemble output.

Currently supports concat or merge along a single dimension.

See fsspec-reference-maker and fsspec’s ReferenceFileSystem. To use this class, you must have fsspec-reference-maker, ujson, xarray, fsspec, zarr, and h5py in your recipe’s requirements.

This class will also produce an Intake catalog stub in YAML format. You can use intake (and intake-xarray) to load the dataset; this is the recommended way to distribute access.

Parameters
  • file_pattern (FilePattern) – FilePattern describing the original data files. Paths should include a protocol specifier, e.g. https://

  • output_json_fname (str) – The name of the json file in which to store the reference filesystem.

  • output_intake_yaml_fname (str) – The name of the generated intake catalog file.

  • target (Optional[FSSpecTarget]) – Final storage location in which to put the reference dataset files (json and yaml).

  • metadata_cache (Optional[MetadataTarget]) – A location in which to cache metadata for files.

  • netcdf_storage_options (dict) – A dict of kwargs for creating the fsspec instance used to read the original data files.

  • inline_threshold (int) – Blocks with fewer bytes than this will be inlined into the output reference file.

  • output_storage_options (dict) – A dict of kwargs for creating the fsspec instance used when writing the final output.

  • template_count (Optional[int]) – The number of occurrences of a URL before it gets made a template. Set to None to disable templating.

  • xarray_open_kwargs (dict) – Kwargs passed to xarray.open_dataset. Only used if file_pattern has more than one file.

  • xarray_concat_args (dict) – Kwargs passed to xarray.concat. Only used if file_pattern has more than one file.
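As a sketch tying a few of these parameters together (the storage locations and options are hypothetical):

import fsspec
from pangeo_forge_recipes.recipes import HDFReferenceRecipe
from pangeo_forge_recipes.storage import FSSpecTarget

recipe = HDFReferenceRecipe(
    file_pattern,
    output_json_fname="reference.json",
    output_intake_yaml_fname="reference.yaml",
    netcdf_storage_options={"anon": True},  # fsspec options for reading the source files
)
# the reference JSON and Intake YAML stub are written to the target
recipe.target = FSSpecTarget(fs=fsspec.filesystem("gcs"), root_path="my-bucket/references")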