API Reference#

File Patterns#

class pangeo_forge_recipes.patterns.FilePattern(format_function, *combine_dims, fsspec_open_kwargs=None, query_string_secrets=None, file_type='netcdf4')#

Represents an n-dimensional matrix of individual files to be combined through a combination of merge and concat operations. Each operation generates a new dimension to the matrix.

Parameters
  • format_function (Callable) – A function that takes one argument for each combine dimension and returns a string representing the filename / url path. Each argument name should correspond to a name in the combine_dims list.

  • combine_dims (Union[MergeDim, ConcatDim]) – A sequence of either concat or merge dimensions. The outer product of the keys is used to generate the full list of file paths.

  • fsspec_open_kwargs (Optional[Dict[str, Any]]) – Extra options for opening the inputs with fsspec. May include block_size, username, password, etc.

  • query_string_secrets (Optional[Dict[str, str]]) – If provided, these key/value pairs are appended to the query string of each file_pattern url at runtime.

  • file_type (str) – The file format of the source files for this pattern. Must be one of the options defined by pangeo_forge_recipes.patterns.FileType. Note: FileType.opendap cannot be used with caching.
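
For example, a minimal sketch of a one-dimensional pattern (the URL scheme and key values are hypothetical):

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

# Hypothetical URL scheme; the argument name ("time") must match the
# name of the corresponding combine dimension.
def make_url(time):
    return f"https://example.com/data/air-temperature-{time:03d}.nc"

time_dim = ConcatDim("time", keys=list(range(10)), nitems_per_file=24)
pattern = FilePattern(make_url, time_dim)

# Iterate over (index, url) pairs generated by the pattern.
for index, url in pattern.items():
    print(index, url)
```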

__getitem__(indexer)#

Get a filename path for a particular key.

Return type

str

__iter__()#

Iterate over all keys in the pattern.

Return type

Iterator[Index]

property concat_dims: List[str]#

List of dims that are concat operations.

Return type

List[str]

property concat_sequence_lens: Dict[str, Optional[int]]#

Dictionary mapping concat dims to sequence lengths. Only available if nitems_per_input is set on the dimension.

Return type

Dict[str, Optional[int]]

property dims: Dict[str, int]#

Dictionary representing the dimensions of the FilePattern. Keys are dimension names, values are the number of items along each dimension.

Return type

Dict[str, int]

items()#

Iterate over key, filename pairs.

property merge_dims: List[str]#

List of dims that are merge operations.

Return type

List[str]

property nitems_per_input: Dict[str, Optional[int]]#

Dictionary mapping concat dims to number of items per file.

Return type

Dict[str, Optional[int]]

sha256()#

Compute a sha256 hash for the instance.

property shape: Tuple[int, ...]#

Shape of the filename matrix.

Return type

Tuple[int, ...]
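
A short sketch of inspecting these properties, using the same hypothetical one-dimensional pattern as above:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

def make_url(time):
    return f"https://example.com/{time}.nc"  # hypothetical URL scheme

pattern = FilePattern(make_url, ConcatDim("time", keys=[0, 1, 2], nitems_per_file=24))

print(pattern.dims)                  # {'time': 3}
print(pattern.shape)                 # (3,)
print(pattern.concat_dims)           # ['time']
print(pattern.nitems_per_input)      # {'time': 24}
print(pattern.concat_sequence_lens)  # {'time': 72}, i.e. 3 files x 24 items
```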

Combine Dimensions#

class pangeo_forge_recipes.patterns.ConcatDim(name, keys, nitems_per_file=None)#

Represents a concatenation operation across a dimension of a FilePattern.

Parameters
  • name (str) – The name of the dimension we are concatenating over. For files with labeled dimensions, this should match the dimension name within the file. The most common value is "time".

  • keys (Sequence[Any]) – The keys used to represent each individual item along this dimension. This will be used by a FilePattern object to evaluate the file name.

  • nitems_per_file (Optional[int]) – If each file contains the exact same, known number of items along the concat dimension, this can be set to provide a fast path for recipes.

class pangeo_forge_recipes.patterns.MergeDim(name, keys)#

Represents a merge operation across a dimension of a FilePattern.

Parameters
  • name (str) – The name of the dimension we are merging over. The actual value is not used by most recipes. The most common value is "variable".

  • keys (Sequence[Any]) – The keys used to represent each individual item along this dimension. This will be used by a FilePattern object to evaluate the file name.
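
For example, a minimal sketch combining one MergeDim with one ConcatDim (the URL scheme and variable names are hypothetical):

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim

def make_url(variable, time):
    return f"https://example.com/{variable}/{time}.nc"

pattern = FilePattern(
    make_url,
    MergeDim("variable", keys=["temperature", "salinity"]),
    ConcatDim("time", keys=[0, 1, 2]),
)
# The outer product of the keys yields 2 x 3 = 6 file paths.
```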

Indexing#

class pangeo_forge_recipes.patterns.Index#
class pangeo_forge_recipes.patterns.DimIndex(name, index, sequence_len, operation)#

Object used to index a single dimension of a FilePattern or of a recipe's chunks.

Parameters
  • name (str) – The name of the dimension.

  • index (int) – The position of the item within the sequence.

  • sequence_len (int) – The total length of the sequence.

  • operation (CombineOp) – The type of combine operation this dimension represents.

class pangeo_forge_recipes.patterns.CombineOp(value)#

Used to uniquely identify different combine operations across Pangeo Forge Recipes.
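
A minimal sketch of constructing a DimIndex by hand (assuming the CombineOp enum exposes a CONCAT member, as used for concat dimensions throughout the package):

```python
from pangeo_forge_recipes.patterns import CombineOp, DimIndex

# Position 0 of a 10-item concat dimension named "time".
idx = DimIndex(name="time", index=0, sequence_len=10, operation=CombineOp.CONCAT)
```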

Recipes#

class pangeo_forge_recipes.recipes.BaseRecipe#
class pangeo_forge_recipes.recipes.HDFReferenceRecipe(file_pattern, storage_config=<factory>, output_json_fname='reference.json', output_intake_yaml_fname='reference.yaml', netcdf_storage_options=<factory>, target_options=<factory>, inline_threshold=500, output_storage_options=<factory>, concat_dims=<factory>, identical_dims=<factory>, coo_map=<factory>, coo_dtypes=<factory>, preprocess=None, postprocess=None)#

Generates reference files for each input netCDF, then combines them into one ensemble output.

Currently supports concat or merge along a single dimension.

See kerchunk and fsspec’s ReferenceFileSystem. To use this class, you must have kerchunk, ujson, xarray, fsspec, zarr, and h5py in your recipe’s requirements.

This class will also produce an Intake catalog stub in YAML format. You can use intake (and intake-xarray) to load the dataset; this is the recommended way to distribute access.

Parameters
  • file_pattern (FilePattern) – FilePattern describing the original data files. Paths should include protocol specifier, e.g. https://

  • output_json_fname (str) – The name of the json file in which to store the reference filesystem.

  • output_intake_yaml_fname (str) – The name of the generated intake catalog file.

  • storage_config (StorageConfig) – Defines locations for writing the reference dataset files (json and yaml) and for caching metadata for files. Both locations default to tempfile.TemporaryDirectory; this default config can be used for testing and debugging the recipe. In an actual execution context, the default config is re-assigned to point to the destination(s) of choice, which can be any combination of fsspec-compatible storage backends.

  • netcdf_storage_options (dict) – Dictionary of kwargs for creating the fsspec instance used to read the original data files.

  • target_options (Optional[dict]) – Dictionary of kwargs for creating the fsspec instance used to read the Kerchunk reference files.

  • inline_threshold (int) – Blocks with fewer bytes than this will be inlined into the output reference file.

  • output_storage_options (dict) – Dictionary of kwargs for creating the fsspec instance used when writing the final output.

  • identical_dims (Optional[list]) – Coordinate-like variables that are assumed to be the same in every input, and so do not need any concatenation.

  • coo_map (Optional[dict]) – Set of “selectors” defining how to fetch the dimension coordinates of any given chunk for each of the concat dims. By default, this is the variable in the dataset with the same name as the given concat dim, except for “var”, where the default is the name of each input variable. See the documentation of kerchunk’s MultiZarrToZarr for more details on the possibilities.

  • coo_dtypes – Optional coercion of coordinate values before writing. Note that, if using cftime to read coordinate values, the output will also be encoded with cftime (i.e., ints and special attributes) unless you specify an “M8[*]” output type.

  • preprocess (Optional[Callable]) – A function applied to each HDF file’s references before combining.

  • postprocess (Optional[Callable]) – A function applied to the globally combined references before writing.
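
A minimal sketch of constructing the recipe; the pattern, URL scheme, and coordinate names are hypothetical:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import HDFReferenceRecipe

def make_url(time):
    return f"https://example.com/data/{time}.nc"  # hypothetical

pattern = FilePattern(make_url, ConcatDim("time", keys=list(range(12))))

recipe = HDFReferenceRecipe(
    pattern,
    output_json_fname="reference.json",
    identical_dims=["lat", "lon"],  # hypothetical coordinate variable names
)
```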

class pangeo_forge_recipes.recipes.XarrayZarrRecipe(file_pattern, storage_config=<factory>, inputs_per_chunk=1, target_chunks=<factory>, cache_inputs=None, copy_input_to_local_file=False, consolidate_zarr=True, consolidate_dimension_coordinates=True, xarray_open_kwargs=<factory>, xarray_concat_kwargs=<factory>, delete_input_encoding=True, process_input=None, process_chunk=None, lock_timeout=None, subset_inputs=<factory>, open_input_with_kerchunk=False)#

This configuration represents a dataset composed of many individual NetCDF files. This class uses Xarray to read and write data and writes its output to Zarr. The organization of the source files is described by the file_pattern. Currently this recipe supports at most one MergeDim and one ConcatDim in the File Pattern.

Parameters
  • file_pattern (FilePattern) – An object which describes the organization of the input files.

  • inputs_per_chunk (int) – The number of inputs to use in each chunk along the concat dim. Must be an integer >= 1.

  • target_chunks (Dict[str, int]) – Desired chunk structure for the target dataset. This is a dictionary mapping dimension names to chunk size. When using a patterns.FilePattern with a patterns.ConcatDim that specifies nitems_per_file, you don’t need to include the concat dim in target_chunks.

  • storage_config (StorageConfig) – Defines locations for writing the output dataset, caching temporary data, and for caching metadata for inputs and chunks. All three locations default to tempfile.TemporaryDirectory; this default config can be used for testing and debugging the recipe. In an actual execution context, the default config is re-assigned to point to the destination(s) of choice, which can be any combination of fsspec-compatible storage backends.

  • cache_inputs (Optional[bool]) – If True, inputs are copied to input_cache before opening. If False, try to open inputs directly from their source location.

  • copy_input_to_local_file (bool) – Whether to copy the inputs to a temporary local file. In this case, a path (rather than file object) is passed to xr.open_dataset. This is required for engines that can’t open file-like objects (e.g. pynio).

  • consolidate_zarr (bool) – Whether to consolidate the resulting Zarr dataset.

  • consolidate_dimension_coordinates (bool) – Whether to rewrite coordinate variables as a single chunk. We recommend consolidating coordinate variables to avoid many small read requests to get the coordinates in xarray.

  • xarray_open_kwargs (dict) – Extra options for opening the inputs with Xarray.

  • xarray_concat_kwargs (dict) – Extra options to pass to Xarray when concatenating the inputs to form a chunk.

  • delete_input_encoding (bool) – Whether to remove Xarray encoding from variables in the input dataset.

  • process_input (Optional[Callable[[xr.Dataset, str], xr.Dataset]]) – Function to call on each opened input, with signature (ds: xr.Dataset, filename: str) -> ds: xr.Dataset.

  • process_chunk (Optional[Callable[[xr.Dataset], xr.Dataset]]) – Function to call on each concatenated chunk, with signature (ds: xr.Dataset) -> ds: xr.Dataset.

  • lock_timeout (Optional[int]) – The default timeout for acquiring a chunk lock.

  • subset_inputs (SubsetSpec) – If set, break each input file up into multiple chunks along dimension according to the specified mapping. For example, {'time': 5} would split each input file into 5 chunks along the time dimension. Multiple dimensions are allowed.

  • open_input_with_kerchunk (bool) – If True, use kerchunk to generate a reference filesystem for each input, to be used when opening the file with Xarray as a virtual Zarr dataset.
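
A minimal sketch of constructing the recipe; the pattern, URL scheme, and chunking choices are hypothetical:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

def make_url(time):
    return f"https://example.com/data/{time}.nc"  # hypothetical

pattern = FilePattern(make_url, ConcatDim("time", keys=list(range(12)), nitems_per_file=24))

recipe = XarrayZarrRecipe(
    pattern,
    inputs_per_chunk=2,          # two source files per chunk along the concat dim
    target_chunks={"lat": 90},   # hypothetical chunking for a non-concat dim
)
```

Because nitems_per_file is set on the ConcatDim, the concat dim does not need to appear in target_chunks.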

cache_metadata: bool = False#

Whether metadata caching is needed.

concat_dim: str#

The concatenation dimension name.

concat_dim_chunks: Optional[int] = None#

The desired chunking along the sequence dimension.

init_chunks: List[ChunkKey]#

List of chunks needed to initialize the recipe.

nitems_per_input: Optional[int] = None#

How many items per input along concat_dim.

Storage#

class pangeo_forge_recipes.storage.AbstractTarget#
abstract exists(path)#

Check that the file exists.

Return type

bool

open(path, **kwargs)#

Open file with a context manager.

abstract rm(path)#

Remove file.

Return type

None

abstract size(path)#

Get file size.

Return type

int

class pangeo_forge_recipes.storage.CacheFSSpecTarget(fs, root_path='')#

Alias for FlatFSSpecTarget.

class pangeo_forge_recipes.storage.FSSpecTarget(fs, root_path='')#

Representation of a storage target for Pangeo Forge.

Parameters
  • fs (AbstractFileSystem) – The filesystem object we are writing to.

  • root_path (str) – The path under which the target data will be stored.
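
A minimal sketch backed by the local filesystem (the root path is hypothetical); any fsspec-compatible backend can be substituted:

```python
import fsspec
from pangeo_forge_recipes.storage import FSSpecTarget

fs = fsspec.filesystem("file")  # local filesystem
target = FSSpecTarget(fs=fs, root_path="/tmp/my-dataset")  # hypothetical path

store = target.get_mapper()  # a mutable mapping suitable for storing Zarr data
```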

exists(path)#

Check that the file is in the cache.

Return type

bool

get_mapper()#

Get a mutable mapping object suitable for storing Zarr data.

Return type

FSMap

open(path, **kwargs)#

Open file with a context manager.

Return type

Iterator[Any]

rm(path, recursive=False)#

Remove file from the cache.

Return type

None

size(path)#

Get file size.

Return type

int

class pangeo_forge_recipes.storage.FlatFSSpecTarget(fs, root_path='')#

A target that sanitizes all the path names so that everything is stored in a single directory.

Designed to be used as a cache for inputs.

class pangeo_forge_recipes.storage.MetadataTarget(fs, root_path='')#

Target for storing metadata dictionaries as JSON.

class pangeo_forge_recipes.storage.StorageConfig(target, cache=None, metadata=None)#

A storage configuration container for recipe classes.

Parameters
  • target (FSSpecTarget) – The destination to which to write the output data.

  • cache (Optional[CacheFSSpecTarget]) – A location for caching source files.

  • metadata (Optional[MetadataTarget]) – A location for recipes to cache metadata about source files. Required if nitems_per_file=None on the concat dim in the file pattern.
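
A minimal sketch of assembling a config backed by the local filesystem (all paths are hypothetical); a recipe’s storage_config attribute can then be reassigned to point at it:

```python
import fsspec
from pangeo_forge_recipes.storage import (
    CacheFSSpecTarget,
    FSSpecTarget,
    MetadataTarget,
    StorageConfig,
)

fs = fsspec.filesystem("file")
config = StorageConfig(
    target=FSSpecTarget(fs, root_path="/tmp/target"),
    cache=CacheFSSpecTarget(fs, root_path="/tmp/cache"),
    metadata=MetadataTarget(fs, root_path="/tmp/metadata"),
)
# recipe.storage_config = config  # assuming an existing recipe instance
```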

pangeo_forge_recipes.storage.file_opener(fname, cache=None, copy_to_local=False, bypass_open=False, secrets=None, **open_kwargs)#

Context manager for opening files.

Parameters
  • fname (str) – The filename / url to open. Fsspec will inspect the protocol (e.g. http, ftp) and determine the appropriate filesystem type to use.

  • cache (Optional[CacheFSSpecTarget]) – A target where the file may have been cached. If None, the file will be opened directly.

  • copy_to_local (bool) – If True, always copy the file to a local temporary file before opening. In this case, the function yields a path name rather than an open file.

  • bypass_open (bool) – If True, skip trying to open the file at all and just return the filename back directly. (A fancy way of doing nothing!)

  • secrets (Optional[dict]) – Dictionary of secrets to encode into the query string.

Return type

Iterator[Union[OpenFile, str]]
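
A minimal sketch of opening a file directly, with no cache (the URL is hypothetical, and the yielded object is assumed to be file-like under the default copy_to_local=False):

```python
from pangeo_forge_recipes.storage import file_opener

with file_opener("https://example.com/data.nc") as f:
    header = f.read(4)  # e.g. peek at the file's magic bytes
```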

pangeo_forge_recipes.storage.temporary_storage_config()#

A factory function for setting a default storage config on pangeo_forge_recipes.recipes.base.StorageMixin.

Executors#

class pangeo_forge_recipes.executors.BeamPipelineExecutor(*args, **kwds)#
static execute(plan, *args, **kwargs)#

Execute a plan. All args and kwargs are passed to apache_beam.Pipeline.

class pangeo_forge_recipes.executors.DaskPipelineExecutor(*args, **kwds)#
class pangeo_forge_recipes.executors.FunctionPipelineExecutor(*args, **kwds)#

A generator which returns a single callable Python function with no arguments. Calling this function will run the whole recipe.
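
A minimal sketch, assuming recipe is a fully configured recipe instance and that the recipe class exposes a to_function() convenience wrapper for this executor:

```python
# `recipe` is assumed to be a configured recipe, e.g. an XarrayZarrRecipe.
run_recipe = recipe.to_function()
run_recipe()  # runs every stage of the recipe in the current process
```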

class pangeo_forge_recipes.executors.GeneratorPipelineExecutor(*args, **kwds)#

An executor which returns a Generator. The Generator yields (function, args, kwargs) tuples, which can be called step by step to iterate through the recipe.