Xarray-to-Zarr Sequential Recipe: NOAA OISST

This tutorial describes how to create a recipe from scratch. The source data is a sequence of NetCDF files accessed via HTTP. The target is a Zarr store.

Step 1: Get to know your source data

If you are developing a new recipe, you are probably starting from an existing dataset. The first step is to just get to know the dataset. For this tutorial, our example will be the NOAA Optimum Interpolation Sea Surface Temperature (OISST) v2.1. The authoritative website describing the data is https://www.ncdc.noaa.gov/oisst/optimum-interpolation-sea-surface-temperature-oisst-v21. This website contains links to the actual data files on the data access page. We will use the AVHRR-Only version of the data and follow the corresponding link to the Gridded netCDF Data. Browsing through the directories, we can see that there is one file per day. The very first day of the dataset is stored at the following URL:

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc

From this example, we can work out the pattern of the file naming conventions. But first, let’s just download one of the files and open it up.

! wget https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc 
--2022-05-10 17:38:06--  https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
Resolving www.ncei.noaa.gov (www.ncei.noaa.gov)... 2610:20:8040:2::167, 2610:20:8040:2::178, 2610:20:8040:2::168, ...
Connecting to www.ncei.noaa.gov (www.ncei.noaa.gov)|2610:20:8040:2::167|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1714749 (1.6M) [application/x-netcdf]
Saving to: ‘oisst-avhrr-v02r01.19810901.nc.1’

oisst-avhrr-v02r01. 100%[===================>]   1.63M  2.78MB/s    in 0.6s    

2022-05-10 17:38:08 (2.78 MB/s) - ‘oisst-avhrr-v02r01.19810901.nc.1’ saved [1714749/1714749]
import xarray as xr

ds = xr.open_dataset("oisst-avhrr-v02r01.19810901.nc")
ds
<xarray.Dataset>
Dimensions:  (time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 1981-09-01T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 ...
    err      (time, zlev, lat, lon) float32 ...
    ice      (time, zlev, lat, lon) float32 ...
    sst      (time, zlev, lat, lon) float32 ...
Attributes: (12/37)
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.19810901.nc
    naming_authority:           gov.noaa.ncei
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    cdm_data_type:              Grid
    ...                         ...
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
    Conventions:                CF-1.6, ACDD-1.3
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...

We can see there are four data variables, all with dimensions (time, zlev, lat, lon). There is a dimension coordinate for each dimension, and there are no non-dimension coordinates. Each file in the sequence presumably has the same zlev, lat, and lon, but we expect time to be different in each one.

Let’s also check the total in-memory (uncompressed) size of the dataset in the file.

print(f"File size is {ds.nbytes/1e6} MB")
File size is 16.597452 MB

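This size is important because it will help us define the chunk size Pangeo Forge will use to build up the target dataset. Note that ds.nbytes reports the uncompressed size of the data in memory; the compressed file on disk is much smaller (1,714,749 bytes, according to the wget output above).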
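As a cross-check on the size printed above, the uncompressed byte count can be reproduced from the dimensions and dtypes shown in the dataset repr. This is a minimal sketch in plain Python; the helper name uncompressed_nbytes is hypothetical, and the byte widths assume 4 bytes per float32 and 8 bytes per datetime64[ns].

```python
from math import prod

# Dimension sizes and dtypes copied from the dataset repr above.
dims = {"time": 1, "zlev": 1, "lat": 720, "lon": 1440}
itemsize = {"float32": 4, "datetime64[ns]": 8}  # bytes per element

data_vars = {
    name: (("time", "zlev", "lat", "lon"), "float32")
    for name in ("anom", "err", "ice", "sst")
}
coords = {
    "lat": (("lat",), "float32"),
    "lon": (("lon",), "float32"),
    "time": (("time",), "datetime64[ns]"),
    "zlev": (("zlev",), "float32"),
}


def uncompressed_nbytes(variables):
    """Sum (number of elements * bytes per element) over all variables."""
    return sum(
        prod(dims[d] for d in var_dims) * itemsize[dtype]
        for var_dims, dtype in variables.values()
    )


total = uncompressed_nbytes(data_vars) + uncompressed_nbytes(coords)
per_variable = uncompressed_nbytes({"sst": data_vars["sst"]})
print(f"{total / 1e6} MB total, {per_variable / 1e6} MB per data variable")
```

The total comes to 16,597,452 bytes, which agrees with the value reported by ds.nbytes, and each data variable accounts for roughly 4.1 MB of it.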

Step 2: Define File Pattern

The first step in developing a recipe is to define a File Pattern. The file pattern describes how the source files (a.k.a. “inputs”) are organized.

In this case, we have a very simple sequence of files that we want to concatenate along a single dimension (time), so we can use the helper function pangeo_forge_recipes.patterns.pattern_from_file_sequence(). This allows us to simply pass a list of URLs, which we define explicitly.

from pangeo_forge_recipes.patterns import pattern_from_file_sequence

pattern_from_file_sequence?
Signature:
pattern_from_file_sequence(
    file_list,
    concat_dim,
    nitems_per_file=None,
    **kwargs,
)
Docstring: Convenience function for creating a FilePattern from a list of files.
File:      ~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/patterns.py
Type:      function

To populate the file_list, we need to understand the file naming conventions. Let’s look again at the first URL:

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc

From this we deduce the following format string.

input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)

To convert this to an actual list of files, we use Pandas. At the time of writing, the latest available data is from 2021-01-05.

import pandas as pd

dates = pd.date_range("1981-09-01", "2021-01-05", freq="D")
input_urls = [
    input_url_pattern.format(
        yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
    )
    for day in dates
]
print(f"Found {len(input_urls)} files!")
input_urls[-1]
Found 14372 files!
'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202101/oisst-avhrr-v02r01.20210105.nc'
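The file count can be double-checked without Pandas. The sketch below rebuilds the same list with the standard library's datetime module and runs two sanity checks on it; it assumes, as the tutorial does, that there is exactly one file per day with no gaps, which is not verified against the server here.

```python
from datetime import date, timedelta

input_url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)

start, end = date(1981, 9, 1), date(2021, 1, 5)
n_days = (end - start).days + 1  # inclusive of both endpoints

urls = [
    input_url_pattern.format(
        yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
    )
    for day in (start + timedelta(days=i) for i in range(n_days))
]

# Every URL should be unique, and the month directory should always match
# the first six digits of the date embedded in the file name.
assert len(set(urls)) == len(urls)
assert all(u.split("/")[-2] == u.split(".")[-2][:6] for u in urls)

print(len(urls), urls[0].rsplit("/", 1)[-1], urls[-1].rsplit("/", 1)[-1])
```

Checks like these are cheap to run and catch the most common mistakes in a format string, such as a month directory that does not roll over with the date.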

Now we can define our pattern. We will include one more piece of information: we know from examining the file above that there is only one timestep per file. So we can set nitems_per_file=1.

pattern = pattern_from_file_sequence(input_urls, "time", nitems_per_file=1)
pattern
<FilePattern {'time': 14372}>

To check our pattern, we can try to get the data back out. The pattern is designed to be iterated over, so to get the first key, we do:

for key in pattern:
    break
key
frozenset({DimIndex(name='time', index=0, sequence_len=14372, operation=<CombineOp.CONCAT: 2>)})

We can now use “getitem” syntax on the FilePattern object to retrieve the file name based on this key.

pattern[key]
'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'

As an alternative way to create the same pattern, we could use the more verbose syntax and construct a FilePattern object directly. With this method, we have to define a function that returns the file path for a given key. We might do it like this:

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

def format_function(time):
    return input_url_pattern.format(
        yyyymm=time.strftime("%Y%m"), yyyymmdd=time.strftime("%Y%m%d")
    )

concat_dim = ConcatDim(name="time", keys=dates, nitems_per_file=1)
pattern = FilePattern(format_function, concat_dim)
pattern
<FilePattern {'time': 14372}>

We can check that it gives us the same thing:

pattern[key]
'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'

Step 3: Pick a Recipe class

Now that we have the file pattern defined, we have to plug it into a Recipe. Since we are reading NetCDF files, we will use the pangeo_forge_recipes.recipes.XarrayZarrRecipe class. Let’s examine its documentation string in our notebook.

from pangeo_forge_recipes.recipes import XarrayZarrRecipe
XarrayZarrRecipe?
Init signature:
XarrayZarrRecipe(
    file_pattern: 'FilePattern',
    storage_config: 'StorageConfig' = <factory>,
    inputs_per_chunk: 'int' = 1,
    target_chunks: 'Dict[str, int]' = <factory>,
    cache_inputs: 'Optional[bool]' = None,
    copy_input_to_local_file: 'bool' = False,
    consolidate_zarr: 'bool' = True,
    consolidate_dimension_coordinates: 'bool' = True,
    xarray_open_kwargs: 'dict' = <factory>,
    xarray_concat_kwargs: 'dict' = <factory>,
    delete_input_encoding: 'bool' = True,
    process_input: 'Optional[Callable[[xr.Dataset, str], xr.Dataset]]' = None,
    process_chunk: 'Optional[Callable[[xr.Dataset], xr.Dataset]]' = None,
    lock_timeout: 'Optional[int]' = None,
    subset_inputs: 'SubsetSpec' = <factory>,
    open_input_with_kerchunk: 'bool' = False,
) -> None
Docstring:     
This configuration represents a dataset composed of many individual NetCDF files.
This class uses Xarray to read and write data and writes its output to Zarr.
The organization of the source files is described by the ``file_pattern``.
Currently this recipe supports at most one ``MergeDim`` and one ``ConcatDim``
in the File Pattern.

:param file_pattern: An object which describes the organization of the input files.
:param inputs_per_chunk: The number of inputs to use in each chunk along the concat dim.
   Must be an integer >= 1.
:param target_chunks: Desired chunk structure for the targret dataset. This is a dictionary
   mapping dimension names to chunk size. When using a :class:`patterns.FilePattern` with
   a :class:`patterns.ConcatDim` that specifies ``n_items_per_file``, then you don't need
   to include the concat dim in ``target_chunks``.
:param storage_config: Defines locations for writing the output dataset, caching temporary data,
  and for caching metadata for inputs and chunks. All three locations default to
  ``tempdir.TemporaryDirectory``; this default config can be used for testing and debugging the
  recipe. In an actual execution context, the default config is re-assigned to point to the
  destination(s) of choice, which can be any combination of ``fsspec``-compatible storage
  backends.
:param cache_inputs: If ``True``, inputs are copied to ``input_cache`` before
  opening. If ``False``, try to open inputs directly from their source location.
:param copy_input_to_local_file: Whether to copy the inputs to a temporary
  local file. In this case, a path (rather than file object) is passed to
  ``xr.open_dataset``. This is required for engines that can't open
  file-like objects (e.g. pynio).
:param consolidate_zarr: Whether to consolidate the resulting Zarr dataset.
:param consolidate_dimension_coordinates: Whether to rewrite coordinate variables as a
    single chunk. We recommend consolidating coordinate variables to avoid
    many small read requests to get the coordinates in xarray.
:param xarray_open_kwargs: Extra options for opening the inputs with Xarray.
:param xarray_concat_kwargs: Extra options to pass to Xarray when concatenating
  the inputs to form a chunk.
:param delete_input_encoding: Whether to remove Xarray encoding from variables
  in the input dataset
:param process_input: Function to call on each opened input, with signature
  `(ds: xr.Dataset, filename: str) -> ds: xr.Dataset`.
:param process_chunk: Function to call on each concatenated chunk, with signature
  `(ds: xr.Dataset) -> ds: xr.Dataset`.
:param lock_timeout: The default timeout for acquiring a chunk lock.
:param subset_inputs: If set, break each input file up into multiple chunks
  along dimension according to the specified mapping. For example,
  ``{'time': 5}`` would split each input file into 5 chunks along the
  time dimension. Multiple dimensions are allowed.
:param open_input_with_kerchunk: If True, use kerchunk
  to generate a reference filesystem for each input, to be used when opening
  the file with Xarray as a virtual Zarr dataset.
File:           ~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py
Type:           ABCMeta
Subclasses:     

There are lots of optional parameters, but only file_pattern is required. We can initialize our recipe by passing the file pattern to the recipe class.

from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(pattern)
recipe
XarrayZarrRecipe(file_pattern=<FilePattern {'time': 14372}>, storage_config=StorageConfig(target=FSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fb0085f9c40>, root_path='/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/9TeFs2ek'), cache=CacheFSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fb0085f9c40>, root_path='/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/twM2ZPEI'), metadata=MetadataTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fb0085f9c40>, root_path='/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/MP85exFF')), inputs_per_chunk=1, target_chunks={}, cache_inputs=True, copy_input_to_local_file=False, consolidate_zarr=True, consolidate_dimension_coordinates=True, xarray_open_kwargs={}, xarray_concat_kwargs={}, delete_input_encoding=True, process_input=None, process_chunk=None, lock_timeout=None, subset_inputs={}, open_input_with_kerchunk=False)

Now let’s think about the Zarr chunks that this recipe will produce. With the default inputs_per_chunk=1, each target chunk corresponds to one input, so each variable chunk will be only about 4 MB (uncompressed). That is too small. Let’s increase inputs_per_chunk to 10. This means that we will need to be able to hold 10 files like the one we examined above in memory at once. That’s about 16.6 MB * 10 ≈ 166 MB. Not a problem!

recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=10)
recipe
XarrayZarrRecipe(file_pattern=<FilePattern {'time': 14372}>, storage_config=StorageConfig(target=FSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fb0085f9c40>, root_path='/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/ooL74qvo'), cache=CacheFSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fb0085f9c40>, root_path='/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/66PIP8WX'), metadata=MetadataTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7fb0085f9c40>, root_path='/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/IxuO6yYM')), inputs_per_chunk=10, target_chunks={}, cache_inputs=True, copy_input_to_local_file=False, consolidate_zarr=True, consolidate_dimension_coordinates=True, xarray_open_kwargs={}, xarray_concat_kwargs={}, delete_input_encoding=True, process_input=None, process_chunk=None, lock_timeout=None, subset_inputs={}, open_input_with_kerchunk=False)
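To compare a few candidate values of inputs_per_chunk, the arithmetic above can be wrapped in a small function. This is only a back-of-the-envelope sketch: chunk_estimate_mb is a hypothetical helper, it works with uncompressed sizes, and it assumes the four data variables share the file's bytes equally (the coordinates are small enough to ignore).

```python
BYTES_PER_INPUT = 16_597_452  # ds.nbytes for one daily file, from above
N_DATA_VARS = 4               # anom, err, ice, sst


def chunk_estimate_mb(inputs_per_chunk):
    """Return (MB needed to hold one chunk's inputs, MB per variable chunk)."""
    total_mb = inputs_per_chunk * BYTES_PER_INPUT / 1e6
    return total_mb, total_mb / N_DATA_VARS


for n in (1, 10, 30):
    total_mb, per_var_mb = chunk_estimate_mb(n)
    print(
        f"inputs_per_chunk={n:>2}: ~{total_mb:6.1f} MB in memory, "
        f"~{per_var_mb:5.1f} MB per variable chunk"
    )
```

With inputs_per_chunk=10 this gives roughly 166 MB in memory and about 41 MB per variable chunk before compression.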

Step 4: Play with the recipe

Now we will just explore our recipe a bit to check whether things make sense.

We will also turn on Pangeo Forge’s logging.

from pangeo_forge_recipes.recipes import setup_logging
setup_logging()

We can see how many inputs the recipe has like this:

all_inputs = list(recipe.iter_inputs())
len(all_inputs)
14372

And how many chunks:

all_chunks = list(recipe.iter_chunks())
len(all_chunks)
1438
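The chunk count follows directly from the number of inputs and inputs_per_chunk. The short calculation below reproduces it; the observation that the final chunk is a partial one is an inference from this arithmetic, not something the recipe output states.

```python
import math

n_inputs = 14372       # number of daily files in the pattern
inputs_per_chunk = 10  # value passed to the recipe above

# Every group of 10 inputs forms one chunk; any leftover inputs form one more.
n_chunks = math.ceil(n_inputs / inputs_per_chunk)
full_chunks, remainder = divmod(n_inputs, inputs_per_chunk)

print(f"{n_chunks} chunks: {full_chunks} full, last one holds {remainder} inputs")
```

Because 14372 is not a multiple of 10, the arithmetic gives 1437 full chunks plus one final chunk covering the remaining 2 inputs.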

We can now try to load the first chunk. This will raise an exception because the input files have not yet been copied to the recipe's cache.

(Note that open_chunk and open_input must be called as context managers.)

%xmode minimal

from pangeo_forge_recipes.recipes.xarray_zarr import open_chunk

try:
    with open_chunk(all_chunks[0], config=recipe) as ds:
        display(ds)
except FileNotFoundError as e:
    print(str(e))
Exception reporting mode: Minimal
[05/10/22 17:38:10] INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1438,                                              
                             operation=<CombineOp.CONCAT: 2>)})                              
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810901.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810901.nc' from               
                             cache                                                           
[Errno 2] No such file or directory: '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/66PIP8WX/fe866b608e5c7eafba93f06954124ba1-https_www.ncei.noaa.gov_data_sea-surface-temperature-optimum-interpolation_v2.1_access_avhrr_198109_oisst-avhrr-v02r01.19810901.nc'

Step 5: Create storage targets

To experiment with our recipe a bit more, let’s attempt to load a chunk again.

try:
    with open_chunk(all_chunks[0], config=recipe) as ds:
        display(ds)
except FileNotFoundError as e:
    print(e)
                    INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1438,                                              
                             operation=<CombineOp.CONCAT: 2>)})                              
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810901.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810901.nc' from               
                             cache                                                           
[Errno 2] No such file or directory: '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmp649pmwcu/66PIP8WX/fe866b608e5c7eafba93f06954124ba1-https_www.ncei.noaa.gov_data_sea-surface-temperature-optimum-interpolation_v2.1_access_avhrr_198109_oisst-avhrr-v02r01.19810901.nc'

It still didn’t work! That’s because we have not cached the inputs yet. The inputs_for_chunk function tells us which inputs are needed for a given chunk, and cache_input copies each of them to the cache.

from pangeo_forge_recipes.recipes.xarray_zarr import cache_input, inputs_for_chunk

ninputs = recipe.file_pattern.dims["time"]

for input_file in inputs_for_chunk(all_chunks[0], recipe.inputs_per_chunk, ninputs):
    cache_input(input_file, config=recipe)
                    INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=0, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             901.nc' to cache                                                
[05/10/22 17:38:13] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=1, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810902.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             902.nc' to cache                                                
[05/10/22 17:38:14] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=2, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810903.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             903.nc' to cache                                                
[05/10/22 17:38:16] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=3, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810904.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             904.nc' to cache                                                
[05/10/22 17:38:18] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=4, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810905.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             905.nc' to cache                                                
[05/10/22 17:38:20] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810906.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             906.nc' to cache                                                
[05/10/22 17:38:22] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=6, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810907.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             907.nc' to cache                                                
[05/10/22 17:38:24] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=7, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810908.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             908.nc' to cache                                                
[05/10/22 17:38:26] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=8, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810909.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             909.nc' to cache                                                
[05/10/22 17:38:28] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=9, sequence_len=14372,                                    
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810910.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810               
                             910.nc' to cache                                                
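The log shows inputs 0 through 9 being cached for chunk 0. The relationship between a chunk index and the input indices it covers can be sketched as follows. This is an illustration of the idea under the assumption of simple sequential grouping; input_indices_for_chunk is a hypothetical helper, not the library's implementation of inputs_for_chunk.

```python
def input_indices_for_chunk(chunk_index, inputs_per_chunk, n_inputs):
    """Return the range of input indices that make up one chunk.

    Inputs are grouped sequentially, so chunk k covers inputs
    k * inputs_per_chunk up to (but not including) the next multiple,
    clipped to the total number of inputs.
    """
    start = chunk_index * inputs_per_chunk
    if not 0 <= start < n_inputs:
        raise IndexError(f"chunk index {chunk_index} is out of range")
    stop = min(start + inputs_per_chunk, n_inputs)
    return range(start, stop)


first = list(input_indices_for_chunk(0, 10, 14372))
last = list(input_indices_for_chunk(1437, 10, 14372))
print(first)
print(last)
```

Under this grouping the first chunk covers inputs 0 to 9, matching the log above, and the last chunk covers only the final two inputs.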

Step 6: Examine some chunks

Now we can finally open the first chunk!

with open_chunk(all_chunks[0], config=recipe) as ds:
    display(ds)
    # need to load if we want to access the data outside of the context
    ds.load()
[05/10/22 17:38:30] INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1438,                                              
                             operation=<CombineOp.CONCAT: 2>)})                              
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810901.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810901.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810902.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810902.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=2,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810903.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810903.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=3,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810904.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810904.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=4,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810905.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810905.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810906.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810906.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=6,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810907.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810907.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=7,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810908.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810908.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=8,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810909.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810909.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=9,                           
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/198                   
                             109/oisst-avhrr-v02r01.19810910.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810910.nc' from               
                             cache                                                           
                    INFO     Combining inputs for chunk                    xarray_zarr.py:408
                             'Index({DimIndex(name='time', index=0,                          
                             sequence_len=1438,                                              
                             operation=<CombineOp.CONCAT: 2>)})'                             
<xarray.Dataset>
Dimensions:  (time: 10, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 1981-09-01T12:00:00 ... 1981-09-10T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    err      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    ice      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    sst      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
Attributes: (12/37)
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.19810901.nc
    naming_authority:           gov.noaa.ncei
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    cdm_data_type:              Grid
    ...                         ...
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
    Conventions:                CF-1.6, ACDD-1.3
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
print(f'Total chunk size: {ds.nbytes / 1e6} MB')
Total chunk size: 165.896724 MB
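
The reported total can be cross-checked by hand from the repr above. The sketch below assumes that the four data variables and the lat, lon, and zlev coordinates take 4 bytes per element (float32) and that time takes 8 bytes per element (datetime64[ns]), as the repr indicates. The variable names are illustrative only.

```python
# Dimension sizes read from the Dataset repr above.
dims = {"time": 10, "zlev": 1, "lat": 720, "lon": 1440}

FLOAT32_BYTES = 4     # anom, err, ice, sst, lat, lon, zlev
DATETIME64_BYTES = 8  # time

# Each of the four data variables spans all four dimensions.
elements_per_variable = dims["time"] * dims["zlev"] * dims["lat"] * dims["lon"]
data_bytes = 4 * elements_per_variable * FLOAT32_BYTES

# Coordinates are one-dimensional, so they add only a few kilobytes.
coord_bytes = (
    dims["lat"] * FLOAT32_BYTES
    + dims["lon"] * FLOAT32_BYTES
    + dims["zlev"] * FLOAT32_BYTES
    + dims["time"] * DATETIME64_BYTES
)

total_bytes = data_bytes + coord_bytes
print(f"Data variables: {data_bytes} bytes, coordinates: {coord_bytes} bytes")
print(f"Total chunk size: {total_bytes / 1e6} MB")
```

The data variables account for 165,888,000 bytes and the coordinates for 8,724 bytes, which together give the 165,896,724 bytes reported by `ds.nbytes`.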

👀 Inspect the Xarray HTML repr above carefully by clicking on the buttons to expand the different sections.

  • ✅ Is the shape of the variable what we expect?

  • ✅ Is time going in the right order?

  • ✅ Do the variable attributes make sense?
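
Some of these checks can also be scripted. As a minimal sketch using only the standard library, the hypothetical helper `is_daily_and_increasing` below tests whether a list of timestamps advances by exactly one day per step, which is what we expect within this chunk. The sample timestamps are rebuilt from the time range shown in the repr rather than read from `ds`.

```python
from datetime import datetime, timedelta


def is_daily_and_increasing(timestamps):
    """Return True if each timestamp is exactly one day after the previous one."""
    one_day = timedelta(days=1)
    pairs = zip(timestamps, timestamps[1:])
    return all(later - earlier == one_day for earlier, later in pairs)


# Ten daily timestamps starting at 1981-09-01T12:00:00, mirroring the repr above.
start = datetime(1981, 9, 1, 12)
times = [start + timedelta(days=n) for n in range(10)]

print(is_daily_and_increasing(times))                  # timestamps in order
print(is_daily_and_increasing(list(reversed(times))))  # timestamps reversed
```

With the real dataset, xarray exposes the time coordinate as a pandas index, so `ds.indexes['time'].is_monotonic_increasing` is one way to run a similar ordering check directly on `ds`.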

Now let’s visualize some data and make sure things look good.

ds.sst[0].plot()
<matplotlib.collections.QuadMesh at 0x7fafc0428760>
[Figure: map of sst for the first time step in the chunk]
ds.ice[-1].plot()
<matplotlib.collections.QuadMesh at 0x7fb03ddeb040>
[Figure: map of ice for the last time step in the chunk]

The data look good! Now let’s try an arbitrary chunk from the middle of the sequence.

chunk_number = 500
chunk_key = list(recipe.iter_chunks())[chunk_number]
for input_file in inputs_for_chunk(chunk_key, recipe.inputs_per_chunk, ninputs):
    cache_input(input_file, config=recipe)
[05/10/22 17:38:32] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5000, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950511.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             511.nc' to cache                                                
[05/10/22 17:38:34] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5001, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950512.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             512.nc' to cache                                                
[05/10/22 17:38:36] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5002, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950513.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             513.nc' to cache                                                
[05/10/22 17:38:38] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5003, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950514.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             514.nc' to cache                                                
[05/10/22 17:38:40] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5004, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950515.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             515.nc' to cache                                                
[05/10/22 17:38:42] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5005, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950516.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             516.nc' to cache                                                
[05/10/22 17:38:44] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5006, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950517.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             517.nc' to cache                                                
[05/10/22 17:38:46] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5007, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950518.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             518.nc' to cache                                                
[05/10/22 17:38:48] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5008, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950519.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             519.nc' to cache                                                
[05/10/22 17:38:50] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=5009, sequence_len=14372,                                 
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/199505/oisst-avhrr-v02r01.19950520.nc'               
                    INFO     Copying remote file 'https://www.ncei.noaa.gov/da storage.py:172
                             ta/sea-surface-temperature-optimum-interpolation/               
                             v2.1/access/avhrr/199505/oisst-avhrr-v02r01.19950               
                             520.nc' to cache                                                
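
The log shows chunk 500 drawing on inputs 5000 through 5009, which correspond to the files for 1995-05-11 through 1995-05-20. The sketch below reproduces that mapping with plain arithmetic, assuming ten consecutive daily inputs per chunk and a first input dated 1981-09-01, as the log output suggests. The helpers `input_indices_for_chunk` and `url_for_input` are hypothetical names used for illustration; they are not part of pangeo-forge-recipes.

```python
import math
from datetime import date, timedelta

INPUTS_PER_CHUNK = 10         # assumed from the logs: ten inputs per chunk
N_INPUTS = 14372              # sequence_len reported for the inputs
FIRST_DAY = date(1981, 9, 1)  # date of input 0
URL_PATTERN = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/v2.1/access/avhrr/{day:%Y%m}/oisst-avhrr-v02r01.{day:%Y%m%d}.nc"
)


def input_indices_for_chunk(chunk_number):
    """Return the input indices that make up one chunk (the last chunk may be short)."""
    start = chunk_number * INPUTS_PER_CHUNK
    stop = min(start + INPUTS_PER_CHUNK, N_INPUTS)
    return list(range(start, stop))


def url_for_input(input_index):
    """Build the file URL for an input, assuming one file per day with no gaps."""
    day = FIRST_DAY + timedelta(days=input_index)
    return URL_PATTERN.format(day=day)


n_chunks = math.ceil(N_INPUTS / INPUTS_PER_CHUNK)
indices = input_indices_for_chunk(500)

print(n_chunks)                    # number of chunks implied by the input count
print(indices[0], indices[-1])     # first and last input index of chunk 500
print(url_for_input(indices[0]))   # URL of the first file in chunk 500
```

Under these assumptions the chunk count comes out to 1438, matching the sequence_len the log reports for chunks, and the first URL for chunk 500 matches the file for 1995-05-11 that is cached above.
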
with open_chunk(chunk_key, config=recipe) as ds_chunk:
    ds_chunk.load()
ds_chunk
[05/10/22 17:38:52] INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=500,                         
                             sequence_len=1438,                                              
                             operation=<CombineOp.CONCAT: 2>)})                              
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5000,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950511.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950511.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5001,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950512.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950512.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5002,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950513.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950513.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5003,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950514.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950514.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5004,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950515.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950515.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5005,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950516.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950516.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5006,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950517.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950517.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5007,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950518.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950518.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5008,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950519.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950519.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=5009,                        
                             sequence_len=14372,                                             
                             operation=<CombineOp.CONCAT: 2>)}): 'https://                   
                             www.ncei.noaa.gov/data/sea-surface-temperatur                   
                             e-optimum-interpolation/v2.1/access/avhrr/199                   
                             505/oisst-avhrr-v02r01.19950520.nc'                             
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/199505/oisst-avhrr-v02r01.19950520.nc' from               
                             cache                                                           
                    INFO     Combining inputs for chunk                    xarray_zarr.py:408
                             'Index({DimIndex(name='time', index=500,                        
                             sequence_len=1438,                                              
                             operation=<CombineOp.CONCAT: 2>)})'                             
<xarray.Dataset>
Dimensions:  (time: 10, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 1995-05-11T12:00:00 ... 1995-05-20T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 nan nan nan nan ... 0.11 0.11 0.11
    err      (time, zlev, lat, lon) float32 nan nan nan nan ... 0.3 0.3 0.3 0.3
    ice      (time, zlev, lat, lon) float32 nan nan nan nan ... 0.97 0.97 0.97
    sst      (time, zlev, lat, lon) float32 nan nan nan ... -1.69 -1.69 -1.69
Attributes: (12/37)
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.19950511.nc
    naming_authority:           gov.noaa.ncei
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    cdm_data_type:              Grid
    ...                         ...
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
    Conventions:                CF-1.6, ACDD-1.3
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
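
The log above pairs input index 5009 (of 14372) with chunk index 500 (of 1438). The arithmetic behind that pairing can be sketched as follows. This is a minimal illustration, not the pangeo-forge-recipes implementation: it assumes ten inputs per chunk (consistent with the time: 10 dimension shown above) and one file per day starting on 1981-09-01, and the helper names are hypothetical.

```python
from datetime import date, timedelta
import math

FIRST_DAY = date(1981, 9, 1)   # date of the first file in the dataset
INPUTS_PER_CHUNK = 10          # assumption: matches `time: 10` in the output above


def chunk_for_input(input_index, inputs_per_chunk=INPUTS_PER_CHUNK):
    """Return the index of the chunk that holds a given input (hypothetical helper)."""
    return input_index // inputs_per_chunk


def n_chunks(n_inputs, inputs_per_chunk=INPUTS_PER_CHUNK):
    """Number of chunks needed to hold all inputs; the last one may be partial."""
    return math.ceil(n_inputs / inputs_per_chunk)


def dates_in_chunk(chunk_index, inputs_per_chunk=INPUTS_PER_CHUNK):
    """First and last calendar day covered by a chunk, assuming one input per day."""
    first = FIRST_DAY + timedelta(days=chunk_index * inputs_per_chunk)
    last = first + timedelta(days=inputs_per_chunk - 1)
    return first, last


print(chunk_for_input(5009))   # 500
print(n_chunks(14372))         # 1438
print(dates_in_chunk(500))     # (datetime.date(1995, 5, 11), datetime.date(1995, 5, 20))
```

The date range agrees with the time coordinate of the dataset printed above, which runs from 1995-05-11 to 1995-05-20.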

Step 7: Try writing data#

Now that we can see our chunks opening correctly, we are ready to try writing data to our target.

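We can write a Zarr store containing only the first two timesteps of our dataset. Calling copy_pruned() returns a copy of the recipe restricted to its first few inputs (two here, as the sequence_len=2 entries in the log show), and to_function() turns the recipe into a single Python function, which we call immediately: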

pruned_recipe = recipe.copy_pruned()
pruned_recipe.to_function()()
[05/10/22 17:38:53] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=0, sequence_len=2,                                        
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'               
                    INFO     File 'https://www.ncei.noaa.gov/data/sea-surface- storage.py:167
                             temperature-optimum-interpolation/v2.1/access/avh               
                             rr/198109/oisst-avhrr-v02r01.19810901.nc' is                    
                             already cached                                                  
                    INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=1, sequence_len=2,                                        
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 'https://www.ncei.noaa.gov/data/sea- storage.py:161
                             surface-temperature-optimum-interpolation/v2.1/ac               
                             cess/avhrr/198109/oisst-avhrr-v02r01.19810902.nc'               
[05/10/22 17:38:54] INFO     File 'https://www.ncei.noaa.gov/data/sea-surface- storage.py:167
                             temperature-optimum-interpolation/v2.1/access/avh               
                             rr/198109/oisst-avhrr-v02r01.19810902.nc' is                    
                             already cached                                                  
//pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py:115: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  return xr.open_zarr(target.get_mapper())
                    INFO     Creating a new dataset in target              xarray_zarr.py:511
                    INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)})                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 'https://www.ncei.noaa.gov/data/sea-su                   
                             rface-temperature-optimum-interpolation/v2.1/                   
                             access/avhrr/198109/oisst-avhrr-v02r01.198109                   
                             01.nc'                                                          
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810901.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 'https://www.ncei.noaa.gov/data/sea-su                   
                             rface-temperature-optimum-interpolation/v2.1/                   
                             access/avhrr/198109/oisst-avhrr-v02r01.198109                   
                             02.nc'                                                          
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810902.nc' from               
                             cache                                                           
                    INFO     Combining inputs for chunk                    xarray_zarr.py:408
                             'Index({DimIndex(name='time', index=0,                          
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)})'                                                          
                    INFO     Storing dataset in /var/folders/tt/4f941hdn0z xarray_zarr.py:553
                             q549zdwhcgg98c0000gn/T/tmp649pmwcu/ooL74qvo                     
                    INFO     Expanding target concat dim 'time' to size 2  xarray_zarr.py:569
                    INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)})                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 'https://www.ncei.noaa.gov/data/sea-su                   
                             rface-temperature-optimum-interpolation/v2.1/                   
                             access/avhrr/198109/oisst-avhrr-v02r01.198109                   
                             01.nc'                                                          
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810901.nc' from               
                             cache                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 'https://www.ncei.noaa.gov/data/sea-su                   
                             rface-temperature-optimum-interpolation/v2.1/                   
                             access/avhrr/198109/oisst-avhrr-v02r01.198109                   
                             02.nc'                                                          
                    INFO     Opening 'https://www.ncei.noaa.gov/data/sea-surfa storage.py:267
                             ce-temperature-optimum-interpolation/v2.1/access/               
                             avhrr/198109/oisst-avhrr-v02r01.19810902.nc' from               
                             cache                                                           
                    INFO     Combining inputs for chunk                    xarray_zarr.py:408
                             'Index({DimIndex(name='time', index=0,                          
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)})'                                                          
//pangeo-forge-recipes/pangeo_forge_recipes/chunk_grid.py:51: UserWarning: chunksize (10) > dimsize (2). Decreasing chunksize to 2
  warnings.warn(
                    INFO     Storing variable anom chunk                   xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 2, None),                        
                             slice(None, None, None), slice(None, None,                      
                             None), slice(None, None, None))                                 
                    INFO     Storing variable err chunk                    xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 2, None),                        
                             slice(None, None, None), slice(None, None,                      
                             None), slice(None, None, None))                                 
                    INFO     Storing variable ice chunk                    xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 2, None),                        
                             slice(None, None, None), slice(None, None,                      
                             None), slice(None, None, None))                                 
                    INFO     Storing variable sst chunk                    xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 2, None),                        
                             slice(None, None, None), slice(None, None,                      
                             None), slice(None, None, None))                                 
                    INFO     Storing variable time chunk                   xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=1, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 2, None),)                       
                    INFO     Consolidating dimension coordinate arrays     xarray_zarr.py:649
                    INFO     Consolidating Zarr metadata                   xarray_zarr.py:673
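
Two details in this log are worth a closer look: the warning that the chunk size was decreased from 10 to 2, and the Zarr region slice(0, 2, None) that each variable was written to. Both follow from the pruned dataset having only two timesteps. A minimal sketch of that logic, with hypothetical helper names that are not taken from pangeo-forge-recipes:

```python
def clamp_chunksize(chunksize, dimsize):
    """Shrink the chunk size when the dimension is shorter than one chunk."""
    return min(chunksize, dimsize)


def chunk_region(chunk_index, chunksize, dimsize):
    """Slice of the target's concat dimension that a chunk would be written to."""
    start = chunk_index * chunksize
    stop = min(start + chunksize, dimsize)
    return slice(start, stop)


# Pruned recipe: two timesteps, nominal chunk size of ten.
print(clamp_chunksize(10, 2))    # 2
print(chunk_region(0, 10, 2))    # slice(0, 2, None)
```

Under the same assumptions, the second chunk of the full dataset would map to chunk_region(1, 10, 14372), that is, slice(10, 20, None).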

Now we can examine the output of our pruned execution test:

ds = xr.open_zarr(recipe.target_mapper, consolidated=True)
ds
<xarray.Dataset>
Dimensions:  (time: 2, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 1981-09-01T12:00:00 1981-09-02T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    err      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    ice      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
    sst      (time, zlev, lat, lon) float32 dask.array<chunksize=(2, 1, 720, 1440), meta=np.ndarray>
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    cdm_data_type:              Grid
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    creator_email:              oisst-help@noaa.gov
    creator_url:                https://www.ncei.noaa.gov/
    date_created:               2020-05-08T19:05:13Z
    ...                         ...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    standard_name_vocabulary:   CF Standard Name Table (v40, 25 January 2017)
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    time_coverage_end:          1981-09-01T23:59:59Z
    time_coverage_start:        1981-09-01T00:00:00Z
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
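
The chunk shape reported above, (2, 1, 720, 1440), also tells us how much memory one chunk of each variable occupies. The sketch below computes the uncompressed size for the pruned store and for a ten-timestep chunk. The ten-timestep shape is an assumption based on the chunksize (10) warning in the log, and the helper name is hypothetical.

```python
import math

FLOAT32_BYTES = 4  # every data variable in the output above is float32


def chunk_nbytes(chunk_shape, itemsize=FLOAT32_BYTES):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return math.prod(chunk_shape) * itemsize


pruned = chunk_nbytes((2, 1, 720, 1440))    # chunk shape of the pruned store
full = chunk_nbytes((10, 1, 720, 1440))     # assumed chunk shape of the full store
print(pruned, full)  # 8294400 41472000
```

That is roughly 8 MB per variable chunk in the pruned store and roughly 41 MB with ten timesteps per chunk. These are in-memory figures; the size on disk is typically smaller when the store is compressed.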

Postscript: Execute the full recipe#

We are now confident that our recipe works as we expect. At this point we could either:

Hopefully, you now have a better understanding of how Pangeo Forge recipes work.