Xarray-to-Zarr Sequential Recipe: NOAA OISST#

This tutorial describes how to create a recipe from scratch. The source data is a sequence of NetCDF files accessed via HTTP. The target is a Zarr store.

Step 1: Get to know your source data#

If you are developing a new recipe, you are probably starting from an existing dataset. The first step is to just get to know the dataset. For this tutorial, our example will be the NOAA Optimum Interpolation Sea Surface Temperature (OISST) v2.1. The authoritative website describing the data is https://www.ncei.noaa.gov/products/optimum-interpolation-sst. Scroll down on that page to the Data Access section.

We will use the AVHRR-Only version of the data and follow the corresponding link to the Gridded netCDF Data. Browsing through the directories, we can see that there is one file per day. The very first day of the dataset is stored at the following URL:

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc

From this example, we can work out the pattern of the file naming conventions. But first, let’s just download one of the files and open it up.

! wget https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc 
import xarray as xr

ds = xr.open_dataset("oisst-avhrr-v02r01.19810901.nc")
ds

We can see there are four data variables, all with dimensions (time, zlev, lat, lon). There is a dimension coordinate for each dimension, and no non-dimension coordinates. Each file in the sequence presumably has the same zlev, lat, and lon, but we expect time to differ in each one.

Let’s also check the total size of the dataset in the file.

print(f"File size is {ds.nbytes/1e6} MB")

The file size is important because it will help us define the chunk size Pangeo Forge will use to build up the target dataset.

Step 2: Define File Pattern#

The first step in developing a recipe is to define a File Pattern. The file pattern describes how the source files (a.k.a. “inputs”) are organized.

In this case, we have a very simple sequence of files that we want to concatenate along a single dimension (time), so we can use the helper function pangeo_forge_recipes.patterns.pattern_from_file_sequence(). This allows us to simply pass a list of URLs, which we define explicitly.

from pangeo_forge_recipes.patterns import pattern_from_file_sequence

pattern_from_file_sequence?

To populate the file_list, we need to understand the file naming conventions. Let's look again at the first URL:

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc

From this we deduce the following format string.

input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)

To convert this to an actual list of files, we use Pandas. At the time of writing, the latest available data is from 2022-07-15.

import pandas as pd

dates = pd.date_range("1981-09-01", "2022-07-15", freq="D")
input_urls = [
    input_url_pattern.format(
        yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
    )
    for day in dates
]
print(f"Found {len(input_urls)} files!")
input_urls[-1]
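Before moving on, it is worth verifying that the generated URLs round-trip correctly. Here is a self-contained sanity check (not part of the original notebook) that rebuilds the list with only the standard library and confirms the first and last entries match the known file names:

```python
from datetime import date, timedelta

# Self-contained sanity check: rebuild the URL list with the standard library
# and confirm the endpoints match the files we found by browsing.
input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)

def url_for(day):
    return input_url_pattern.format(
        yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
    )

start, end = date(1981, 9, 1), date(2022, 7, 15)
urls = [url_for(start + timedelta(days=i)) for i in range((end - start).days + 1)]

print(urls[0].endswith("198109/oisst-avhrr-v02r01.19810901.nc"))   # True
print(urls[-1].endswith("202207/oisst-avhrr-v02r01.20220715.nc"))  # True
```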

Now we can define our pattern. We will include one more piece of information: we know from examining the file above that there is only one timestep per file. So we can set nitems_per_file=1.

pattern = pattern_from_file_sequence(input_urls, "time", nitems_per_file=1)
pattern

To check our pattern, we can try to get the data back out. The pattern is designed to be iterated over, so to get the first key, we do:

for key in pattern:
    break
key

We can now use “getitem” syntax on the FilePattern object to retrieve the file name based on this key.

pattern[key]

Alternatively, we could create the same pattern using the more verbose FilePattern class directly. With this method, we have to define a function which returns the file path for a given key. We might do it like this.

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

def format_function(time):
    return input_url_pattern.format(
        yyyymm=time.strftime("%Y%m"), yyyymmdd=time.strftime("%Y%m%d")
    )

concat_dim = ConcatDim(name="time", keys=dates, nitems_per_file=1)
pattern = FilePattern(format_function, concat_dim)
pattern

We can check that it gives us the same thing:

pattern[key]

Step 3: Pick a Recipe class#

Now that we have the file pattern defined, we have to plug it into a Recipe. Since we are reading NetCDF files, we will use the pangeo_forge_recipes.recipes.XarrayZarrRecipe class. Let's examine its documentation string in our notebook.

from pangeo_forge_recipes.recipes import XarrayZarrRecipe
XarrayZarrRecipe?

There are lots of optional parameters, but only file_pattern is required. We can initialize our recipe by passing the file pattern to the recipe class.

from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(pattern)
recipe

Now let’s think about the Zarr chunks that this recipe will produce. Each target chunk corresponds to one input. So each variable chunk will only be a few MB. That is too small. Let’s increase inputs_per_chunk to 10. This means that we will need to be able to hold 10 files like the one we examined above in memory at once. That’s 16MB * 10 = 160MB. Not a problem!
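To make that arithmetic concrete, here is the per-variable chunk size, assuming the 0.25° OISST grid of 720 × 1440 points stored as 4-byte floats (consistent with the ~16 MB file we inspected above):

```python
# Per-variable chunk size for the target Zarr store.
# Assumed grid: 720 lat x 1440 lon, float32 (4 bytes) -- the 0.25-degree OISST grid.
nlat, nlon, itemsize = 720, 1440, 4
mb_per_var_per_timestep = nlat * nlon * itemsize / 1e6
print(round(mb_per_var_per_timestep, 1))   # ~4.1 MB: too small for an efficient Zarr chunk

inputs_per_chunk = 10
print(round(mb_per_var_per_timestep * inputs_per_chunk, 1))  # ~41.5 MB per variable chunk
```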

recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=10)
recipe

Step 4: Play with the recipe#

Now we will just explore our recipe a bit to check whether things make sense.

We will also turn on Pangeo Forge’s logging.

from pangeo_forge_recipes.recipes import setup_logging
setup_logging()

We can see how many inputs the recipe has like this:

all_inputs = list(recipe.iter_inputs())
len(all_inputs)

And how many chunks:

all_chunks = list(recipe.iter_chunks())
len(all_chunks)

We can now try to load the first chunk. This will raise an exception because we have not initialized any targets.

(Note that the open_chunk and open_input functions must be called as context managers.)

%xmode minimal

from pangeo_forge_recipes.recipes.xarray_zarr import open_chunk

try:
    with open_chunk(all_chunks[0], config=recipe) as ds:
        display(ds)
except FileNotFoundError as e:
    print(str(e))

Step 5: Create storage targets#
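The recipe needs somewhere to write: a target for the Zarr store, a cache for downloaded inputs, and a metadata store. Below is a sketch of assigning temporary local storage; the class names come from the pangeo_forge_recipes.storage module in the 0.x release series, so verify them against your installed version:

```python
import tempfile

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import (
    CacheFSSpecTarget,
    FSSpecTarget,
    MetadataTarget,
    StorageConfig,
)

fs_local = LocalFileSystem()

# Each store gets its own temporary local directory.
# Keep references to the TemporaryDirectory objects so they are not cleaned up early.
target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir.name)

cache_dir = tempfile.TemporaryDirectory()
cache = CacheFSSpecTarget(fs_local, cache_dir.name)

meta_dir = tempfile.TemporaryDirectory()
metadata = MetadataTarget(fs_local, meta_dir.name)

recipe.storage_config = StorageConfig(target, cache, metadata)
```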

To experiment with our object a bit more, let’s attempt to load a chunk.

try:
    with open_chunk(all_chunks[0], config=recipe) as ds:
        display(ds)
except FileNotFoundError as e:
    print(e)

It still didn't work! That's because we have not cached the inputs yet. We can determine which inputs are needed for each chunk via the inputs_for_chunk function.

from pangeo_forge_recipes.recipes.xarray_zarr import cache_input, inputs_for_chunk

ninputs = recipe.file_pattern.dims["time"]

for input_file in inputs_for_chunk(all_chunks[0], recipe.inputs_per_chunk, ninputs):
    cache_input(input_file, config=recipe)

Step 6: Examine some chunks#

Now we can finally open the first chunk!

with open_chunk(all_chunks[0], config=recipe) as ds:
    display(ds)
    # need to load if we want to access the data outside of the context
    ds.load()
print(f'Total chunk size: {ds.nbytes / 1e6} MB')

👀 Inspect the Xarray HTML repr above carefully by clicking on the buttons to expand the different sections.

  • ✅ Is the shape of the variable what we expect?

  • ✅ Is time going in the right order?

  • ✅ Do the variable attributes make sense?

Now let's visualize some data to make sure things look good:

ds.sst[0].plot()
ds.ice[-1].plot()

The data look good! Now let’s try a random chunk from the middle.

chunk_number = 500
chunk_key = list(recipe.iter_chunks())[chunk_number]
for input_file in inputs_for_chunk(chunk_key, recipe.inputs_per_chunk, ninputs):
    cache_input(input_file, config=recipe)
with open_chunk(chunk_key, config=recipe) as ds_chunk:
    ds_chunk.load()
ds_chunk

Step 7: Try writing data#

Now that we can see our chunks opening correctly, we are ready to try writing data to our target.

We can write a Zarr store containing only the first two timesteps of our dataset as follows:

pruned_recipe = recipe.copy_pruned()
pruned_recipe.to_function()()

Now we can examine the output of our pruned execution test:

ds = xr.open_zarr(recipe.target_mapper, consolidated=True)
ds

Postscript: Execute the full recipe#

We are now confident that our recipe works as we expect. At this point we could either execute the full recipe ourselves, using one of the supported execution frameworks, or contribute the recipe to Pangeo Forge Cloud so it can be executed and maintained automatically.

Hopefully now you have a better understanding of how Pangeo Forge recipes work.