API Reference
File Patterns
- class pangeo_forge_recipes.patterns.FilePattern(format_function, *combine_dims, fsspec_open_kwargs=None, query_string_secrets=None, file_type='netcdf4')
  Represents an n-dimensional matrix of individual files to be combined through a combination of merge and concat operations. Each operation generates a new dimension to the matrix.
  - Parameters
    - format_function (Callable) – A function that takes one argument for each combine_op and returns a string representing the filename / url paths. Each argument name should correspond to a name in the combine_dims list.
    - combine_dims (Union[MergeDim, ConcatDim]) – A sequence of either concat or merge dimensions. The outer product of the keys is used to generate the full list of file paths.
    - fsspec_open_kwargs (Optional[Dict[str, Any]]) – Extra options for opening the inputs with fsspec. May include block_size, username, password, etc.
    - query_string_secrets (Optional[Dict[str, str]]) – If provided, these key/value pairs are appended to the query string of each file_pattern url at runtime.
    - file_type (str) – The file format of the source files for this pattern. Must be one of the options defined by pangeo_forge_recipes.patterns.FileType. Note: FileType.opendap cannot be used with caching.
  - __getitem__(indexer)
    Get a filename path for a particular key.
    - Return type
      str
  - property concat_dims: List[str]
    List of dims that are concat operations.
    - Return type
      List[str]
  - property concat_sequence_lens: Dict[str, Optional[int]]
    Dictionary mapping concat dims to sequence lengths. Only available if nitems_per_input is set on the dimension.
    - Return type
      Dict[str, Optional[int]]
  - property dims: Dict[str, int]
    Dictionary representing the dimensions of the FilePattern. Keys are dimension names, values are the number of items along each dimension.
    - Return type
      Dict[str, int]
  - items()
    Iterate over key, filename pairs.
  - property merge_dims: List[str]
    List of dims that are merge operations.
    - Return type
      List[str]
  - property nitems_per_input: Dict[str, Optional[int]]
    Dictionary mapping concat dims to number of items per file.
    - Return type
      Dict[str, Optional[int]]
  - property sha256
    Compute a sha256 hash for the instance.
  - property shape: Tuple[int, ...]
    Shape of the filename matrix.
    - Return type
      Tuple[int, ...]
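For illustration, a minimal sketch of building and inspecting a FilePattern; the URL scheme and dimension keys are hypothetical, and ConcatDim and MergeDim are documented in the next section:

    from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim

    # Hypothetical source layout: one file per month per variable.
    def make_url(time, variable):
        return f"https://data.example.com/{variable}/{time}.nc"

    pattern = FilePattern(
        make_url,
        ConcatDim("time", keys=["2020-01", "2020-02", "2020-03"]),
        MergeDim("variable", keys=["temperature", "salinity"]),
    )

    print(pattern.dims)   # {'time': 3, 'variable': 2}
    print(pattern.shape)  # (3, 2)
    for index, url in pattern.items():
        print(index, url)  # each url is produced by make_url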
Combine Dimensions
- class pangeo_forge_recipes.patterns.ConcatDim(name, keys, nitems_per_file=None)
  Represents a concatenation operation across a dimension of a FilePattern.
  - Parameters
    - name (str) – The name of the dimension we are concatenating over. For files with labeled dimensions, this should match the dimension name within the file. The most common value is "time".
    - keys (Sequence[Any]) – The keys used to represent each individual item along this dimension. This will be used by a FilePattern object to evaluate the file name.
    - nitems_per_file (Optional[int]) – If each file contains the exact same known number of items along the concat dimension, this can be set to provide a fast path for recipes.
- class pangeo_forge_recipes.patterns.MergeDim(name, keys)
  Represents a merge operation across a dimension of a FilePattern.
  - Parameters
    - name (str) – The name of the dimension we are merging over. The actual value is not used by most recipes. The most common value is "variable".
    - keys (Sequence[Any]) – The keys used to represent each individual item along this dimension. This will be used by a FilePattern object to evaluate the file name.
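When every file holds the same known number of items along the concat dimension, setting nitems_per_file lets the pattern report sequence lengths without opening any files. A small sketch, assuming hypothetical daily files that each contain 24 hourly steps:

    from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

    time_dim = ConcatDim("time", keys=["2020-01-01", "2020-01-02"], nitems_per_file=24)
    pattern = FilePattern(lambda time: f"https://data.example.com/{time}.nc", time_dim)

    print(pattern.nitems_per_input)      # {'time': 24}
    print(pattern.concat_sequence_lens)  # {'time': 48}, i.e. 2 files x 24 items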
Indexing
- class pangeo_forge_recipes.patterns.Index
- class pangeo_forge_recipes.patterns.DimIndex(name, index, sequence_len, operation)
  Object used to index a single dimension of a FilePattern or Recipe Chunks.
  - Parameters
    - name (str) – The name of the dimension.
    - index (int) – The position of the item within the sequence.
    - sequence_len (int) – The total length of the sequence.
    - operation (CombineOp) – The type of combine operation this dimension represents.
- class pangeo_forge_recipes.patterns.CombineOp(value)
  Used to uniquely identify different combine operations across Pangeo Forge Recipes.
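Recipes construct these index objects internally; for illustration, a sketch of building one by hand (the values are hypothetical):

    from pangeo_forge_recipes.patterns import CombineOp, DimIndex

    # Hypothetical: the third of twelve items along a "time" concat dimension.
    idx = DimIndex(name="time", index=2, sequence_len=12, operation=CombineOp.CONCAT)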
Recipes
- class pangeo_forge_recipes.recipes.BaseRecipe(sha256=None)
- class pangeo_forge_recipes.recipes.HDFReferenceRecipe(file_pattern, storage_config=<factory>, sha256=None, output_json_fname='reference.json', output_intake_yaml_fname='reference.yaml', netcdf_storage_options=<factory>, target_options=<factory>, inline_threshold=500, output_storage_options=<factory>, concat_dims=<factory>, identical_dims=<factory>, coo_map=<factory>, coo_dtypes=<factory>, preprocess=None, postprocess=None)
  Generates reference files for each input netCDF file, then combines them into one ensemble output. Currently supports concat or merge along a single dimension.
  See kerchunk and fsspec's ReferenceFileSystem. To use this class, you must have kerchunk, ujson, xarray, fsspec, zarr, and h5py in your recipe's requirements.
  This class will also produce an Intake catalog stub in YAML format. You can use intake (and intake-xarray) to load the dataset; this is the recommended way to distribute access.
  - Parameters
    - file_pattern (FilePattern) – FilePattern describing the original data files. Paths should include a protocol specifier, e.g. https://
    - output_json_fname (str) – The name of the json file in which to store the reference filesystem.
    - output_intake_yaml_fname (str) – The name of the generated intake catalog file.
    - storage_config (StorageConfig) – Defines locations for writing the reference dataset files (json and yaml) and for caching metadata for files. Both locations default to tempfile.TemporaryDirectory; this default config can be used for testing and debugging the recipe. In an actual execution context, the default config is re-assigned to point to the destination(s) of choice, which can be any combination of fsspec-compatible storage backends.
    - netcdf_storage_options (dict) – Dict of kwargs for creating the fsspec instance used to read the original data files.
    - target_options (Optional[dict]) – Dict of kwargs for creating the fsspec instance used to read Kerchunk reference files.
    - inline_threshold (int) – Blocks with fewer bytes than this will be inlined into the output reference file.
    - output_storage_options (dict) – Dict of kwargs for creating the fsspec instance used when writing the final output.
    - identical_dims (Optional[list]) – Coordinate-like variables that are assumed to be the same in every input, and so do not need any concatenation.
    - coo_map (Optional[dict]) – Set of "selectors" defining how to fetch the dimension coordinates of any given chunk for each of the concat dims. By default, this is the variable in the dataset with the same name as the given concat dim, except for "var", where the default is the name of each input variable. See the docs of MultiZarrToZarr for more details on the possibilities.
    - coo_dtypes – Optional coercion of coordinate values before write. Note that, if using cftime to read coordinate values, the output will also be encoded with cftime (i.e., ints and special attributes) unless you specify an "M8[*]" as the output type.
    - preprocess (Optional[Callable]) – A function applied to each HDF file's references before combining.
    - postprocess (Optional[Callable]) – A function applied to the global combined references before write.
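A minimal sketch of declaring a reference recipe; pattern is assumed to be a FilePattern over remote netCDF files, and the storage options shown are hypothetical:

    from pangeo_forge_recipes.recipes import HDFReferenceRecipe

    recipe = HDFReferenceRecipe(
        pattern,
        output_json_fname="reference.json",
        netcdf_storage_options={"anon": True},  # e.g. for a public S3 bucket
    )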
- class pangeo_forge_recipes.recipes.XarrayZarrRecipe(file_pattern, storage_config=<factory>, sha256=None, inputs_per_chunk=1, target_chunks=<factory>, cache_inputs=None, copy_input_to_local_file=False, consolidate_zarr=True, consolidate_dimension_coordinates=True, xarray_open_kwargs=<factory>, xarray_concat_kwargs=<factory>, delete_input_encoding=True, process_input=None, process_chunk=None, lock_timeout=None, subset_inputs=<factory>, open_input_with_kerchunk=False)
  This configuration represents a dataset composed of many individual NetCDF files. This class uses Xarray to read and write data and writes its output to Zarr. The organization of the source files is described by the file_pattern. Currently this recipe supports at most one MergeDim and one ConcatDim in the File Pattern.
  - Parameters
    - file_pattern (FilePattern) – An object which describes the organization of the input files.
    - inputs_per_chunk (int) – The number of inputs to use in each chunk along the concat dim. Must be an integer >= 1.
    - target_chunks (Dict[str, int]) – Desired chunk structure for the target dataset. This is a dictionary mapping dimension names to chunk size. When using a patterns.FilePattern with a patterns.ConcatDim that specifies nitems_per_file, you don't need to include the concat dim in target_chunks.
    - storage_config (StorageConfig) – Defines locations for writing the output dataset, caching temporary data, and for caching metadata for inputs and chunks. All three locations default to tempfile.TemporaryDirectory; this default config can be used for testing and debugging the recipe. In an actual execution context, the default config is re-assigned to point to the destination(s) of choice, which can be any combination of fsspec-compatible storage backends.
    - cache_inputs (Optional[bool]) – If True, inputs are copied to input_cache before opening. If False, try to open inputs directly from their source location.
    - copy_input_to_local_file (bool) – Whether to copy the inputs to a temporary local file. In this case, a path (rather than a file object) is passed to xr.open_dataset. This is required for engines that can't open file-like objects (e.g. pynio).
    - consolidate_zarr (bool) – Whether to consolidate the resulting Zarr dataset.
    - consolidate_dimension_coordinates (bool) – Whether to rewrite coordinate variables as a single chunk. We recommend consolidating coordinate variables to avoid many small read requests to get the coordinates in xarray.
    - xarray_open_kwargs (dict) – Extra options for opening the inputs with Xarray.
    - xarray_concat_kwargs (dict) – Extra options to pass to Xarray when concatenating the inputs to form a chunk.
    - delete_input_encoding (bool) – Whether to remove Xarray encoding from variables in the input dataset.
    - process_input (Optional[Callable[[xr.Dataset, str], xr.Dataset]]) – Function to call on each opened input, with signature (ds: xr.Dataset, filename: str) -> ds: xr.Dataset.
    - process_chunk (Optional[Callable[[xr.Dataset], xr.Dataset]]) – Function to call on each concatenated chunk, with signature (ds: xr.Dataset) -> ds: xr.Dataset.
    - lock_timeout (Optional[int]) – The default timeout for acquiring a chunk lock.
    - subset_inputs (SubsetSpec) – If set, break each input file up into multiple chunks along the specified dimensions. For example, {'time': 5} would split each input file into 5 chunks along the time dimension. Multiple dimensions are allowed.
    - open_input_with_kerchunk (bool) – If True, use kerchunk to generate a reference filesystem for each input, to be used when opening the file with Xarray as a virtual Zarr dataset.
  - cache_metadata: bool = False
    Whether metadata caching is needed.
  - concat_dim: str
    The concatenation dimension name.
  - concat_dim_chunks: Optional[int] = None
    The desired chunking along the sequence dimension.
  - init_chunks: List[ChunkKey]
    List of chunks needed to initialize the recipe.
  - nitems_per_input: Optional[int] = None
    How many items per input along concat_dim.
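A minimal sketch of declaring an XarrayZarrRecipe; pattern is assumed to be a FilePattern like the one built above, and the chunking choices are hypothetical:

    from pangeo_forge_recipes.recipes import XarrayZarrRecipe

    recipe = XarrayZarrRecipe(
        pattern,
        inputs_per_chunk=2,          # combine two input files per chunk
        target_chunks={"time": 20},  # desired chunk size along the concat dim
    )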
Storage
- class pangeo_forge_recipes.storage.AbstractTarget
  - abstract exists(path)
    Check that the file exists.
    - Return type
      bool
  - open(path, **kwargs)
    Open file with a context manager.
  - abstract rm(path)
    Remove file.
    - Return type
      None
  - abstract size(path)
    Get file size.
    - Return type
      int
- class pangeo_forge_recipes.storage.CacheFSSpecTarget(fs, root_path='')
  Alias for FlatFSSpecTarget.
- class pangeo_forge_recipes.storage.FSSpecTarget(fs, root_path='')
  Representation of a storage target for Pangeo Forge.
  - Parameters
    - fs (AbstractFileSystem) – The filesystem object we are writing to.
    - root_path (str) – The path under which the target data will be stored.
  - exists(path)
    Check that the file is in the cache.
    - Return type
      bool
  - get_mapper()
    Get a mutable mapping object suitable for storing Zarr data.
    - Return type
      FSMap
  - open(path, **kwargs)
    Open file with a context manager.
    - Return type
      Iterator[Any]
  - rm(path, recursive=False)
    Remove file from the cache.
    - Return type
      None
  - size(path)
    Get file size.
    - Return type
      int
- class pangeo_forge_recipes.storage.FlatFSSpecTarget(fs, root_path='')
  A target that sanitizes all the path names so that everything is stored in a single directory.
  Designed to be used as a cache for inputs.
- class pangeo_forge_recipes.storage.MetadataTarget(fs, root_path='')
  Target for storing metadata dictionaries as json.
- class pangeo_forge_recipes.storage.StorageConfig(target, cache=None, metadata=None)
  A storage configuration container for recipe classes.
  - Parameters
    - target (FSSpecTarget) – The destination to which to write the output data.
    - cache (Optional[CacheFSSpecTarget]) – A location for caching source files.
    - metadata (Optional[MetadataTarget]) – A location for recipes to cache metadata about source files. Required if nitems_per_file=None on the concat dim in the file pattern.
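A minimal sketch of assembling a StorageConfig on the local filesystem (the paths are hypothetical); any fsspec-compatible filesystem can be substituted:

    import fsspec
    from pangeo_forge_recipes.storage import (
        CacheFSSpecTarget,
        FSSpecTarget,
        MetadataTarget,
        StorageConfig,
    )

    fs = fsspec.filesystem("file")
    storage_config = StorageConfig(
        target=FSSpecTarget(fs=fs, root_path="/tmp/forge/target"),
        cache=CacheFSSpecTarget(fs=fs, root_path="/tmp/forge/cache"),
        metadata=MetadataTarget(fs=fs, root_path="/tmp/forge/metadata"),
    )
    recipe.storage_config = storage_config  # re-assign on a recipe defined earlier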
- pangeo_forge_recipes.storage.file_opener(fname, cache=None, copy_to_local=False, bypass_open=False, secrets=None, **open_kwargs)
  Context manager for opening files.
  - Parameters
    - fname (str) – The filename / url to open. Fsspec will inspect the protocol (e.g. http, ftp) and determine the appropriate filesystem type to use.
    - cache (Optional[CacheFSSpecTarget]) – A target where the file may have been cached. If none, the file will be opened directly.
    - copy_to_local (bool) – If True, always copy the file to a local temporary file before opening. In this case, the function yields a path name rather than an open file.
    - bypass_open (bool) – If True, skip trying to open the file at all and just return the filename back directly. (A fancy way of doing nothing!)
    - secrets (Optional[dict]) – Dictionary of secrets to encode into the query string.
  - Return type
    Iterator[Union[OpenFile, str]]
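A minimal sketch of using file_opener directly (the URL is hypothetical); with no cache configured, the file is opened straight from its source:

    import xarray as xr
    from pangeo_forge_recipes.storage import file_opener

    with file_opener("https://data.example.com/2020-01-01.nc") as fp:
        ds = xr.open_dataset(fp)  # fp is an open file-like object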
- pangeo_forge_recipes.storage.temporary_storage_config()
  A factory function for setting a default storage config on pangeo_forge_recipes.recipes.base.StorageMixin.
Executors
- class pangeo_forge_recipes.executors.BeamPipelineExecutor(*args, **kwds)
  - static execute(plan, *args, **kwargs)
    Execute a plan. All args and kwargs are passed to an apache_beam.Pipeline.
- class pangeo_forge_recipes.executors.DaskPipelineExecutor(*args, **kwds)
- class pangeo_forge_recipes.executors.FunctionPipelineExecutor(*args, **kwds)
  A generator which returns a single callable Python function with no arguments. Calling this function will run the whole recipe.
- class pangeo_forge_recipes.executors.GeneratorPipelineExecutor(*args, **kwds)
  An executor which returns a Generator. The Generator yields (function, args, kwargs) tuples, which can be called step by step to iterate through the recipe.
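A sketch of running a recipe end to end. This assumes the recipe classes expose convenience converters (to_function, to_dask, to_beam) that compile the recipe into a plan for the corresponding executor:

    # `recipe` is assumed to be the XarrayZarrRecipe defined earlier.
    run_recipe = recipe.to_function()  # FunctionPipelineExecutor under the hood
    run_recipe()                       # executes the whole recipe serially

    # For distributed execution (assumed converters):
    # recipe.to_dask().compute()   # DaskPipelineExecutor
    # recipe.to_beam()             # a PTransform for an apache_beam.Pipeline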