Introduction Tutorial

Welcome to Pangeo Forge. This tutorial will guide you through producing your first analysis-ready, cloud-optimized (ARCO) dataset.

To create your ARCO dataset, you’ll need to create a new recipe:

  1. Fork the GitHub repository

  2. Develop the recipe (see below)

  3. Submit the new recipe as a pull request

At this point the pangeo-forge maintainers (and a host of bots) will verify that your recipe is shipshape and ready for inclusion in Pangeo Forge.

Architecture: High Level View

See Maintaining a Recipe (below) for more on what happens next.

Initial Directory Structure Setup / Forking

Contributing a recipe begins with forking the pangeo-forge staged-recipes GitHub repository to your account. From the pangeo-forge staged-recipes repository, click the Fork button in the top right. This creates a copy of the repository where you can commit changes.

The current directory structure for this repo is:

├── pyproject.toml
├── recipes
│   └── <your_recipe_name>
│       ├── meta.yaml
│       └── recipe.py
└── setup.cfg

Once you have cloned your fork and created a new branch, create a new directory in recipes/ named after your project, containing empty recipe.py and meta.yaml files. For example, you could create a directory named noaa_oisst and populate it with recipe.py and meta.yaml files.
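The scaffolding step above can be sketched in a few lines of Python (the directory and file names follow the noaa_oisst example; creating them by hand or with shell commands works just as well):

```python
import pathlib

# create recipes/<your_recipe_name>/ with empty recipe.py and meta.yaml files
recipe_dir = pathlib.Path("recipes") / "noaa_oisst"
recipe_dir.mkdir(parents=True, exist_ok=True)
(recipe_dir / "recipe.py").touch()
(recipe_dir / "meta.yaml").touch()
```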

Developing a Recipe


Each recipe requires two primary files: meta.yaml and recipe.py. Each is described below.


The meta.yaml file includes the metadata that describes your recipe. A full example is available in the staged-recipes repository. Some of the key sections are highlighted below.

Specify the recipe environment:

pangeo_forge_version: "0.5.0"  # pypi version of pangeo-forge
pangeo_notebook_version: "2021.07.17"  # docker image that the flow will run in

Define the recipe name and where to find the recipe object:

recipes:
  - id: noaa-oisst-avhrr-only  # name of the recipe
    object: "recipe:recipe"  # import path for the recipe object, in `{module}:{object}` form
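The `{module}:{object}` string tells the executor where to import the recipe object from: the part before the colon is a Python module (here, recipe.py), and the part after is an attribute of that module. A hypothetical sketch of how such a spec resolves (not Pangeo Forge's actual loader):

```python
import importlib

def resolve(spec):
    # "module:object" -> the named attribute of the imported module,
    # so "recipe:recipe" means "the `recipe` object defined in recipe.py"
    module_name, obj_name = spec.split(":")
    module = importlib.import_module(module_name)
    return getattr(module, obj_name)

# demonstrate with a module that is always importable
pi = resolve("math:pi")
```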

Specify the bakery and compute resources. Details on the available bakeries can be found in the bakeries.yaml file.

bakery:
  id: ""  # must come from a valid list of bakeries
  target: pangeo-forge-aws-bakery-flowcachebucketdasktest4-10neo67y7a924
  resources:
    memory: 4096
    cpu: 1024

The final section includes metadata about the dataset and the maintainers of the recipe. This dataset metadata can be used to create a STAC catalog to aid in dataset discoverability.

title: "NOAA Optimum Interpolated SST"  # dataset title
description: "Analysis-ready Zarr datasets derived from NOAA OISST NetCDF"  # short dataset description
provenance:
  providers:
    - name: "NOAA NCEI"  # dataset distributor/source
      description: "National Oceanographic & Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
  license: "CC-BY-4.0"
maintainers:
  - name: "Ryan Abernathey"
    orcid: "0000-0001-5999-4917"
    github: rabernat
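A quick way to confirm that your assembled meta.yaml parses as valid YAML is to load it with PyYAML (assuming PyYAML is installed; the fragment below is abbreviated from the example above):

```python
import yaml

meta = yaml.safe_load("""
title: "NOAA Optimum Interpolated SST"
recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
maintainers:
  - name: "Ryan Abernathey"
    github: rabernat
""")
# the parsed result is plain dicts and lists
print(meta["recipes"][0]["id"])
```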

The recipe.py file is where the processing steps are defined. For detailed descriptions of the recipe ingredients, check out the Recipe User Guide.

Multiple recipe examples dealing with more complex data cleaning and processing can be found on the Tutorials page.


import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

Define inputs to the recipe:

start_date = "1981-09-01"
end_date = "2021-01-05"

def format_function(time):
    base = pd.Timestamp(start_date)
    day = base + pd.Timedelta(days=time)
    # template for the source file URLs; fill in the real pattern,
    # using "{day:...}" date formatting for the per-file portion
    input_url_pattern = "<input_url_pattern>/{day:%Y%m%d}.nc"
    return input_url_pattern.format(day=day)

dates = pd.date_range(start_date, end_date, freq="D")
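A quick sanity check of the date arithmetic above: index 0 along the time dimension should map to the start date, index 1 to the following day, and so on.

```python
import pandas as pd

start_date = "1981-09-01"
base = pd.Timestamp(start_date)
# the integer time index is an offset in days from start_date
day0 = (base + pd.Timedelta(days=0)).strftime("%Y%m%d")
day1 = (base + pd.Timedelta(days=1)).strftime("%Y%m%d")
```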

The FilePattern is a crucial part of the recipe: it defines where the input files live. By exploring the input data source, you can usually determine the basic file naming pattern and then encode it in the FilePattern part of the recipe.

More details and examples can be found in the File Patterns explainer.

pattern = FilePattern(format_function, ConcatDim("time", range(len(dates)), 1))
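Conceptually, this FilePattern pairs every index along the "time" concat dimension with the URL produced by format_function. A toy stand-in illustrating that mapping (not the real FilePattern API; the URL is made up):

```python
def toy_format_function(time):
    # stand-in for format_function; one fake URL per time index
    return f"https://example.com/data/sst_{time:04d}.nc"

# a FilePattern over ConcatDim("time", range(n), 1) behaves like a
# mapping from each time index to the corresponding input file URL
n_files = 3
mapping = {t: toy_format_function(t) for t in range(n_files)}
```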

Construct the recipe:

recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=20, cache_inputs=True)
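With daily input files and inputs_per_chunk=20, each chunk of the target Zarr store along "time" bundles 20 consecutive days. A rough check of the resulting chunk count (plain arithmetic, no Pangeo Forge imports needed):

```python
import math

import pandas as pd

dates = pd.date_range("1981-09-01", "2021-01-05", freq="D")
inputs_per_chunk = 20
# one input file per day; the store gets ceil(n_inputs / inputs_per_chunk)
# chunks along the time dimension
n_chunks = math.ceil(len(dates) / inputs_per_chunk)
```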


Local Testing

Starting from the recipe constructed above, we can create a pruned copy containing only the first two entries for testing.

recipe = recipe.copy_pruned()

Using fsspec and pangeo_forge_recipes, we can create a LocalFileSystem to cache recipe data. If you wish, you can use any fsspec file system instead of a LocalFileSystem (e.g. s3fs, gcsfs).

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget, MetadataTarget

fs_local = LocalFileSystem()

recipe.input_cache = CacheFSSpecTarget(fs_local, "<filepath_for_input_cache>")
recipe.metadata_cache = MetadataTarget(fs_local, "<filepath_for_metadata>")
recipe.target = FSSpecTarget(fs_local, "<filepath_for_zarr_store>")
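If you want to see what these targets do under the hood, writing through an fsspec filesystem is just ordinary file I/O in the LocalFileSystem case (a throwaway sketch; the file name is made up):

```python
import os
import tempfile

from fsspec.implementations.local import LocalFileSystem

fs = LocalFileSystem()
cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "cached-input.nc")
# a cache target stores fetched inputs as files like this
with fs.open(path, "wb") as f:
    f.write(b"fake bytes")
```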

Optionally, we can set up logging to see what is happening under the hood.

def setup_logging():
    import logging
    import sys

    formatter = logging.Formatter("%(name)s - %(levelname)s - %(message)s")
    logger = logging.getLogger("pangeo_forge_recipes")
    logger.setLevel(logging.INFO)
    sh = logging.StreamHandler(stream=sys.stdout)
    sh.setFormatter(formatter)
    logger.addHandler(sh)

setup_logging()

Next we can test run our pruned recipe using Prefect.

flow = recipe.to_prefect()
flow.run()

Finally, we can open the target Zarr store and verify a slice of the dataset:

import xarray as xr

ds_target = xr.open_zarr("<filepath_for_zarr_store>", consolidated=True)

Submitting the Recipe

Once local recipe testing passes, you can submit the recipe for execution. To do this, create a pull request in the staged-recipes repository.

Automated Tests

Once a pull request for the recipe has been submitted, one of the pangeo-forge maintainers can trigger a CI test of the recipe.

?How does a submitter know when their data is sent to a bakery/processed?

Data Access/Catalog

How is the data accessed once finished?

Maintaining a Recipe

What do we want to include here: