Running your recipe on Pangeo Forge Cloud#

Welcome to the Pangeo Forge introduction tutorial! This is the third part in a sequence, the flow of which is described here.

Outline Part 3#

We are at an exciting point: transitioning to Pangeo Forge Cloud. In this part of the tutorial we take the recipe, which we have so far only run in a limited compute environment on a small subset of the data, and set it up to run at scale in the cloud. In order to do that we will need to:

  1. Fork the staged-recipes repo

  2. Add the recipe files: a .py file and a meta.yaml file

  3. Make a PR to the staged-recipes repo

A note for Sandbox users#

If you have been using the Pangeo Forge Sandbox for the first two parts, that’s great. To complete this part of the tutorial, you will need to complete step 1 locally and download the files you create in step 2 in order to make the PR in step 3.

Fork the staged-recipes repo#

pangeo-forge/staged-recipes is a repository that exists as a staging ground for recipes. It is where recipes get reviewed before they are run. Once a recipe has been run, its code is moved into its own dedicated repository, called a Feedstock.

You can fork a repo through the web browser or the GitHub CLI. Check out the GitHub docs for the steps to do this.
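For example, with the GitHub CLI installed and authenticated, forking (and optionally cloning) can be done in one step. This is a sketch of the standard gh workflow, not a required method:

# Fork pangeo-forge/staged-recipes to your account and clone it locally
gh repo fork pangeo-forge/staged-recipes --clone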

Add the recipe files#

Within staged-recipes, recipe files go in a new folder for your dataset inside the recipes subdirectory. The name of the new folder will become the name of the feedstock repository, the repository where the recipe code will live after the data have been processed.

In the example below we call the folder oisst, so the feedstock will be called oisst-feedstock. The final file structure we are creating is this:

staged-recipes/recipes/
                └──oisst/
                   ├──recipe.py
                   └──meta.yaml

The name of the folder, oisst here, will vary based on the name of your dataset.
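From the root of your staged-recipes clone, this structure can be created with a couple of shell commands (the folder name oisst is, again, specific to this example):

# Create the dataset folder and empty recipe files
mkdir recipes/oisst
touch recipes/oisst/recipe.py recipes/oisst/meta.yaml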

Copy the recipe code into a single .py file#

Within the oisst folder, create a file called recipe.py and copy into it the recipe creation code from the first two parts of this tutorial. We don’t have to copy any of the code we used for local testing; the cloud automation will take care of testing and scaling the processing on the cloud infrastructure. We will refer to this recipe.py file as the recipe module. For OISST it should look like:

import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# One source file per day over the full span of the dataset
dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')

URL_FORMAT = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/{time:%Y%m}/oisst-avhrr-v02r01.{time:%Y%m%d}.nc"
)

def make_url(time):
    """Return the source URL for a single timestep."""
    return URL_FORMAT.format(time=time)

# Concatenate inputs along the "time" dimension; each source file holds one timestep
time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(make_url, time_concat_dim)

# Write two source files into each Zarr chunk
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=2)

Another step complete!

Create a meta.yaml file#

The meta.yaml is written in YAML, a common language for configuration files. It contains two important things:

  1. metadata about the recipe

  2. the Bakery, which designates the cloud infrastructure where the recipe will be run and the output data stored.

Here we will walk through each field of the meta.yaml. A template of meta.yaml is also available here.

title and description#

These fields describe the dataset. They are not highly restricted.

1title: "NOAA Optimum Interpolated SST"
2description: "1/4 degree daily gap filled sea surface temperature (SST)"

pangeo_forge_version#

This is the version of the pangeo_forge_recipes library that you used to create the recipe. It’s important to track in case someone wants to run your recipe in the future. Conda users can find this information with conda list.

pangeo_forge_version: "0.8.2"
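If you are unsure of the installed version, you can also query it from Python directly; a minimal check using only the standard library:

import importlib.metadata

# Prints the installed version of pangeo-forge-recipes, e.g. "0.8.2"
print(importlib.metadata.version("pangeo-forge-recipes"))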

recipes section#

The recipes section declares the recipes contained in the recipe module (recipe.py). This feels a bit repetitive in the case of OISST, but it becomes relevant when multiple recipe objects are defined in the same recipe module, for example with different chunk schemes.

recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"

The id noaa-oisst-avhrr-only is the name that we are giving our recipe. It is a string that we, as the maintainers, chose. The entry recipe:recipe describes where the recipe Python object is located: our recipe object lives in a module called recipe (the recipe.py file), inside a variable called recipe. Unless there is a specific reason to deviate, recipe:recipe is a good convention here.
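To illustrate the multi-recipe case, suppose the recipe module also defined a second recipe object, say recipe_low_res, with a coarser chunk scheme. (The second id and variable name here are hypothetical, for illustration only.) The recipes section would then list both:

recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
  - id: noaa-oisst-avhrr-only-low-res
    object: "recipe:recipe_low_res"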

provenance section#

Provenance explains the origin of the dataset. The core information about provenance is the provider field, which is outlined as part of the STAC Metadata Specification. See the STAC Provider docs for more details.

provenance:
  providers:
    - name: "NOAA NCEI"
      description: "National Oceanic and Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
      url: https://www.ncdc.noaa.gov/oisst
  license: "CC-BY-4.0"

One field to highlight is the license field, described in the STAC docs here. It is important to locate the licensing information of the dataset and provide it in the meta.yaml.

maintainers section#

This is information about you, the recipe creator! Multiple maintainers can be listed. The required fields are name and github (your GitHub username); orcid and email may also be included.

maintainers:
  - name: "Dorothy Vaughan"
    orcid: "9999-9999-9999-9999"
    github: dvaughan0987

bakery section#

Bakeries are where the work gets done on Pangeo Forge Cloud. A single bakery is a set of cloud infrastructure hosted by a particular institution or group.

Selecting a bakery is how you choose where the recipe will be run and hosted. The Pangeo Forge website hosts a full list of available bakeries.

bakery:
  id: "pangeo-ldeo-nsf-earthcube"

And that is the meta.yaml! Between the meta.yaml and recipe.py we have now put together all the files we need for cloud processing.
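For reference, here are all of the fragments above assembled into the complete meta.yaml for OISST:

title: "NOAA Optimum Interpolated SST"
description: "1/4 degree daily gap filled sea surface temperature (SST)"
pangeo_forge_version: "0.8.2"
recipes:
  - id: noaa-oisst-avhrr-only
    object: "recipe:recipe"
provenance:
  providers:
    - name: "NOAA NCEI"
      description: "National Oceanic and Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
      url: https://www.ncdc.noaa.gov/oisst
  license: "CC-BY-4.0"
maintainers:
  - name: "Dorothy Vaughan"
    orcid: "9999-9999-9999-9999"
    github: dvaughan0987
bakery:
  id: "pangeo-ldeo-nsf-earthcube"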

Make a PR to the staged-recipes repo#

At this point you should have created two files, recipe.py and meta.yaml, in the new folder for your dataset within staged-recipes/recipes.

It’s time to submit the changes as a Pull Request. Creating the Pull Request on GitHub is what officially submits your recipe for review. If you have opened an issue for your dataset, you can reference it in the Pull Request. Otherwise, provide some notes about the dataset and hit submit!
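If you prefer the command line, the same submission can be sketched with standard git plus the GitHub CLI (branch name, commit message, and PR text below are illustrative):

# From the root of your staged-recipes fork
git checkout -b add-oisst-recipe
git add recipes/oisst/recipe.py recipes/oisst/meta.yaml
git commit -m "Add OISST recipe"
git push --set-upstream origin add-oisst-recipe

# Open the pull request against pangeo-forge/staged-recipes
gh pr create --repo pangeo-forge/staged-recipes \
  --title "Add OISST recipe" \
  --body "New recipe for NOAA OISST; notes about the dataset here."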

After the PR#

With the PR in, all the steps to stage the recipe are complete! At this point @pangeo-forge-bot will perform a series of automated checks on your PR, a full listing of which is provided in the PR Checks Reference.

All information you need to contribute your recipe to Pangeo Forge Cloud will be provided in the PR discussion thread by either @pangeo-forge-bot or a human maintainer of Pangeo Forge.

Merging the PR will transform your submitted files into a new Pangeo Forge Feedstock repository and initiate full builds for all recipes contained in your PR. A complete description of what to expect during and post PR merge is provided in Recipe Contribution.

End of the Introduction Tutorial#

Congratulations, you’ve completed the introduction tutorial!

From here, we hope you are excited to try writing your own recipe. As you write, you may find additional documentation helpful, such as the Recipes User Guide or the more advanced Recipe Tutorials. For recipes questions not covered there, you are invited to open Issues on the pangeo-forge/pangeo-forge-recipes GitHub repository.

Happy ARCO (analysis-ready, cloud-optimized) building! We look forward to your Recipe Contribution.