Storage
Storage#
Recipes need a place to store data. This information is provided to the recipe by its .storage_config attribute, which is an object of type pangeo_forge_recipes.storage.StorageConfig.
The StorageConfig object looks like this:
- class pangeo_forge_recipes.storage.StorageConfig(target, cache=None, metadata=None)
A storage configuration container for recipe classes.
- Parameters
target (FSSpecTarget) – The destination to which to write the output data.
cache (Optional[CacheFSSpecTarget]) – A location for caching source files.
metadata (Optional[MetadataTarget]) – A location for recipes to cache metadata about source files. Required if nitems_per_file=None on concat dim in file pattern.
As shown above, the storage configuration includes three distinct parts: target, cache, and metadata.
Default storage#
When you create a new recipe, a default StorageConfig is automatically created, pointing at a local tempfile.TemporaryDirectory.
This allows you to write data to temporary local storage during the recipe development and debugging process.
This means that any recipe can immediately be executed with minimal configuration.
However, in a realistic “production” scenario, you will want to customize your storage locations.
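The temporary nature of the default location can be seen with the standard library alone. The following is a minimal sketch of how a tempfile.TemporaryDirectory behaves, not the Pangeo Forge implementation; the dataset name is borrowed from the S3 example on this page and the variable names are purely illustrative.

```python
import os
import tempfile

# Stand-in for the default storage location: a directory that exists only
# until the TemporaryDirectory object is cleaned up.
tmp = tempfile.TemporaryDirectory()
output_path = os.path.join(tmp.name, "my-dataset-v1.zarr")
os.makedirs(output_path)

# While developing and debugging, the written data is available locally.
exists_during_development = os.path.isdir(output_path)

# Explicit cleanup removes the directory and everything inside it.
tmp.cleanup()
exists_after_cleanup = os.path.exists(output_path)

print(exists_during_development, exists_after_cleanup)
```

Anything written under the temporary directory is gone once it is cleaned up, which is why a persistent target is needed for data you intend to keep.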
Customizing storage: the target#
To write a recipe’s full dataset to a persistent storage location, re-assign .storage_config to be a pangeo_forge_recipes.storage.StorageConfig pointing to the location(s) of your choice. The minimal requirement for instantiating StorageConfig is a location in which to store the final dataset produced by the recipe. This is called the target. Pangeo Forge has a special class for this: pangeo_forge_recipes.storage.FSSpecTarget.
Creating a target requires two arguments:
- The fs argument is an fsspec filesystem. Fsspec supports many different types of storage via its built-in and third-party implementations.
- The root_path argument specifies the path where the data should be stored.
For example, creating a storage target for AWS S3 might look like this:
import s3fs
from pangeo_forge_recipes.storage import FSSpecTarget
fs = s3fs.S3FileSystem(key="MY_AWS_KEY", secret="MY_AWS_SECRET")
target_path = "pangeo-forge-bucket/my-dataset-v1.zarr"
target = FSSpecTarget(fs=fs, root_path=target_path)
This target can then be assigned to a recipe as follows:
from pangeo_forge_recipes.storage import StorageConfig
recipe.storage_config = StorageConfig(target)
Once assigned, the target can be accessed from the recipe with:
recipe.target
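The S3 example above passes the key and secret as string literals. One possible alternative, sketched here under assumptions, is to read them from the environment and pass the result to the filesystem constructor. The helper name s3_credentials_from_env is hypothetical and not part of Pangeo Forge or s3fs; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the conventional AWS variable names and are assumed to be set in your environment.

```python
import os


def s3_credentials_from_env(env=None):
    """Hypothetical helper: build the key/secret keyword arguments used in
    the S3 example from environment variables instead of literals."""
    env = os.environ if env is None else env
    names = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"key": env[names[0]], "secret": env[names[1]]}


# Intended use, assuming s3fs is installed and the variables are set:
#   fs = s3fs.S3FileSystem(**s3_credentials_from_env())
```

This keeps secrets out of the recipe source; whether it suits your deployment depends on how credentials are managed there.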
Customizing storage continued: caching#
It is often useful to cache input files rather than read them directly from the data provider. Input files can be cached at a location defined by a pangeo_forge_recipes.storage.CacheFSSpecTarget object. Some recipes require separate caching of metadata, which is provided by a third class, pangeo_forge_recipes.storage.MetadataTarget.
A StorageConfig which declares all three storage locations is assigned as follows:
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget, MetadataTarget, StorageConfig
# define your fsspec filesystems for the target, cache, and metadata locations here
target = FSSpecTarget(fs=<fsspec-filesystem-for-target>, root_path="<path-for-target>")
cache = CacheFSSpecTarget(fs=<fsspec-filesystem-for-cache>, root_path="<path-for-cache>")
metadata = MetadataTarget(fs=<fsspec-filesystem-for-metadata>, root_path="<path-for-metadata>")
recipe.storage_config = StorageConfig(target, cache, metadata)