NetCDF Zarr Sequential Recipe: CMIP6

This tutorial describes how to create a suitable recipe for many of the CMIP6 datasets. The source data is a sequence of NetCDF files accessed from the ‘s3://esgf-world’ bucket. The target is a Zarr store.

Background

  • The s3://esgf-world bucket holds about 240,000 datasets stored in roughly 1,050,000 netcdf files (an average of about four netcdf files per dataset). This is a small subset of the WCRP-CMIP6 collection available at the Federated ESGF-COG nodes such as https://esgf-node.llnl.gov/search/cmip6, but it is faster and easier to work with.

  • Each CMIP6 dataset can be identified by a 6-tuple consisting of:

      (model, experiment, ensemble_member, mip_table, variable, grid_label)
    

and so a convenient name for a particular dataset is a string of these values joined with a ‘.’ separator:

  dataset = model.experiment.ensemble_member.mip_table.variable.grid_label
  • There can be multiple versions of a dataset, designated by a string beginning with ‘v’ followed by an 8-digit date, loosely associated with its creation time
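
To make the naming scheme concrete, here is a minimal sketch (plain Python; the example name is the dataset we will pick later in this tutorial) of assembling and splitting such a string:

# join the six components into a dataset name ...
parts = ('GFDL-CM4', 'historical', 'r1i1p1f1', 'Amon', 'tas', 'gr1')
dataset = '.'.join(parts)   # 'GFDL-CM4.historical.r1i1p1f1.Amon.tas.gr1'

# ... and split it back apart
model, experiment, ensemble_member, mip_table, variable, grid_label = dataset.split('.')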

import pandas as pd
import xarray as xr
import s3fs

Step 1: Get to know your source data

The CMIP6 collection is very heterogeneous, so getting to know the source data is rather complicated. We first need to identify a dataset and learn how to list the set of netcdf files which are associated with it. Fortunately, you can explore the data here: https://esgf-world.s3.amazonaws.com/index.html#CMIP6/ or download a CSV file listing all of the netcdf files, one per line.

Here we will read the CSV file into a pandas dataframe so we can search, sort and subset the available datasets and their netcdf files.

netcdf_cat = 's3://cmip6-nc/esgf-world.csv.gz'
df_s3 = pd.read_csv(netcdf_cat, dtype='unicode')
df_s3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056266 entries, 0 to 1056265
Data columns (total 13 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   project          1056266 non-null  object
 1   institution_id   1056266 non-null  object
 2   source_id        1056266 non-null  object
 3   experiment_id    1056266 non-null  object
 4   frequency        559718 non-null   object
 5   modeling_realm   559718 non-null   object
 6   table_id         1056266 non-null  object
 7   member_id        1056266 non-null  object
 8   grid_label       1056266 non-null  object
 9   variable_id      1056266 non-null  object
 10  temporal_subset  1027893 non-null  object
 11  version          1056266 non-null  object
 12  path             1056266 non-null  object
dtypes: object(13)
memory usage: 104.8+ MB
# So there are 1,056,266 entries, one for each netcdf file. We can see the first five here:
# The 'path' column is the most important - you may need to scroll the window to see it!

df_s3.head()
project institution_id source_id experiment_id frequency modeling_realm table_id member_id grid_label variable_id temporal_subset version path
0 CMIP6 AS-RCEC TaiESM1 histSST-piNTCF NaN NaN AERmon r1i1p1f1 gn ps 185001-201412 v20200318 s3://esgf-world/CMIP6/AerChemMIP/AS-RCEC/TaiES...
1 CMIP6 AS-RCEC TaiESM1 histSST-piNTCF NaN NaN CFmon r1i1p1f1 gn ta 185001-201412 v20200318 s3://esgf-world/CMIP6/AerChemMIP/AS-RCEC/TaiES...
2 CMIP6 AS-RCEC TaiESM1 histSST-piNTCF NaN NaN LImon r1i1p1f1 gn snc 185002-201412 v20200318 s3://esgf-world/CMIP6/AerChemMIP/AS-RCEC/TaiES...
3 CMIP6 AS-RCEC TaiESM1 histSST-piNTCF NaN NaN LImon r1i1p1f1 gn snd 185002-201412 v20200318 s3://esgf-world/CMIP6/AerChemMIP/AS-RCEC/TaiES...
4 CMIP6 AS-RCEC TaiESM1 histSST-piNTCF NaN NaN LImon r1i1p1f1 gn snw 185002-201412 v20200318 s3://esgf-world/CMIP6/AerChemMIP/AS-RCEC/TaiES...
# We will add a new column with our short name for the datasets (may take a moment for all 1,056,266 rows)
df_s3['dataset'] = df_s3.apply(lambda row: '.'.join(row.path.split('/')[6:12]),axis=1)
# the number of unique dataset names can be found using the 'nunique' method
df_s3.dataset.nunique()
239268
# The value in the `path` column of the first row is:
df_s3.path.values[0]
's3://esgf-world/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST-piNTCF/r1i1p1f1/AERmon/ps/gn/v20200318/ps_AERmon_TaiESM1_histSST-piNTCF_r1i1p1f1_gn_185001-201412.nc'
# which has the short name:
df_s3.dataset.values[0]
'TaiESM1.histSST-piNTCF.r1i1p1f1.AERmon.ps.gn'
# some datasets have multiple versions: (we'll just check every 500th of them ...)
for dataset in df_s3.dataset.unique()[::500]:
    df_dataset = df_s3[df_s3.dataset==dataset]
    if df_dataset.version.nunique() > 1:
        print(dataset,df_dataset.version.unique())
EC-Earth3-LR.piControl.r1i1p1f1.Omon.mlotst.gn ['v20200409' 'v20200919']
FIO-ESM-2-0.piControl.r1i1p1f1.Amon.rsds.gn ['v20190911' 'v20191010']
IPSL-CM6A-LR.piControl.r1i1p1f1.Amon.o3.gr ['v20181022' 'v20181123']
CESM2.1pctCO2.r1i1p1f1.day.zg.gn ['v20190425' 'v20190826']
NorCPM1.historical.r1i1p1f1.Omon.thetao.gr ['v20190914' 'v20200724']
NorESM2-LM.piControl.r1i1p1f1.Ofx.areacello.gn ['v20190815' 'v20190920']
NorESM2-LM.hist-GHG.r1i1p1f1.Emon.va.gn ['v20190909' 'v20191108']
CESM2.deforest-globe.r1i1p1f1.Amon.rsuscs.gn ['v20190401' 'v20191122']
# So pick a dataset, any dataset, and try it!  N.B. some datasets are VERY large - especially at daily, 6-hourly and higher frequencies.
#dataset = df_s3.dataset[10450]
# or:
dataset = 'GFDL-CM4.historical.r1i1p1f1.Amon.tas.gr1'
df_dataset = df_s3[df_s3.dataset==dataset]
df_dataset
project institution_id source_id experiment_id frequency modeling_realm table_id member_id grid_label variable_id temporal_subset version path dataset
603842 CMIP6 NOAA-GFDL GFDL-CM4 historical mon atmos Amon r1i1p1f1 gr1 tas 185001-194912 v20180701 s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/... GFDL-CM4.historical.r1i1p1f1.Amon.tas.gr1
603843 CMIP6 NOAA-GFDL GFDL-CM4 historical mon atmos Amon r1i1p1f1 gr1 tas 195001-201412 v20180701 s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/... GFDL-CM4.historical.r1i1p1f1.Amon.tas.gr1

So is this what we expect?

  • this dataset is split over 3 netcdf files - see any trouble here?

  • let's do a quick sanity check (make sure one and only one variable is specified) and get only the latest version of the files

dvars = df_dataset.variable_id.unique()
assert len(dvars) > 0, 'no netcdf files found for this dataset'
assert len(dvars) == 1, f"trouble with this dataset, too many variables found: {dvars}"
    
var = dvars[0]
print('The variable is:',var)

# make sure we are looking at the last available version:
last_version = sorted(df_dataset.version.unique())[-1]
dze = df_dataset[df_dataset.version == last_version].reset_index(drop=True)

input_urls = sorted(dze.path.unique())
input_urls
The variable is: tas
['s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc',
 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_195001-201412.nc']

There are only two files - one netcdf file was from an older version!

  • We want to look at the first netcdf file to make sure we know what to expect

  • To use xarray.open_dataset, we need to turn the input url (starting with ‘s3://’) into an appropriate file-like object.

# Connect to AWS S3 storage
fs_s3 = s3fs.S3FileSystem(anon=True)

file_url = fs_s3.open(input_urls[0], mode='rb')
ds = xr.open_dataset(file_url)
print(ds)
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 180, lon: 288, time: 1200)
Coordinates:
  * bnds       (bnds) float64 1.0 2.0
    height     float64 ...
  * lat        (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
  * lon        (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
  * time       (time) object 1850-01-16 12:00:00 ... 1949-12-16 12:00:00
Data variables:
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tas        (time, lat, lon) float32 ...
    time_bnds  (time, bnds) object ...
Attributes: (12/46)
    external_variables:     areacella
    history:                File was processed by fremetar (GFDL analog of CM...
    table_id:               Amon
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    ...                     ...
    variable_id:            tas
    variant_info:           N/A
    references:             see further_info_url attribute
    variant_label:          r1i1p1f1
    branch_time_in_parent:  36500.0
    parent_time_units:      days since 0001-1-1

Step 2: Deciding how to chunk the dataset

  • For parallel I/O and subsetting the dataset in time, we will chunk the data in the time dimension

  • In order to figure out the number of time slices in each chunk, we do a small calculation on the first netcdf file

  • Here we set the desired chunk size to 50 MB; anything between 50 and 100 MB usually works well

ntime = len(ds.time)        # the number of time slices
chunksize_optimal = 50e6    # desired chunk size in bytes
ncfile_size = ds.nbytes     # the netcdf file size
chunksize = max(int(ntime * chunksize_optimal / ncfile_size), 1)

target_chunks = dict(ds.dims)  # copy, so we don't mutate the dataset's own dimension mapping
target_chunks['time'] = chunksize

target_chunks   # a dictionary giving the chunk sizes in each dimension
{'bnds': 2, 'lat': 180, 'lon': 288, 'time': 241}
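
As a quick sanity check on the arithmetic: this first file holds 1200 monthly time slices in roughly 250 MB, so int(1200 * 50e6 / ds.nbytes) works out to 241 slices per chunk, putting each chunk of tas close to the 50 MB target.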

Step 3: Define a pre-processing function

  • This is an optional step which we want to apply to each chunk

  • Here we change some data variables into coordinate variables, but you can define your own pre-processing step here

# the netcdf files list some of the coordinate variables as data variables. This is a fix which we want to apply to each chunk.
def set_bnds_as_coords(ds):
    new_coords_vars = [var for var in ds.data_vars if 'bnds' in var or 'bounds' in var]
    ds = ds.set_coords(new_coords_vars)
    return ds
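
We can sanity-check the function on the file we opened earlier; after the call, only tas should remain as a data variable, with the three *_bnds variables promoted to coordinates:

# optional check: the *_bnds variables should no longer appear here
print(set_bnds_as_coords(ds).data_vars)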

Step 4: Create a recipe

  • A FilePattern is the starting place for all recipes. These Python objects are the “raw ingredients” upon which the recipe will act. They describe how the individual source files are organized logically as part of a larger dataset. To create a file pattern, the first step is to define a function which takes any variable components of the source file path as inputs, and returns full file path strings.

  • Revisiting our input urls, we see that the only variable components of these paths are the 13-character numerical strings which immediately precede the .nc file extension:

for url in input_urls:
    print(f'''{url}
    ''')
s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc
    
s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_195001-201412.nc
    

What do these strings refer to?

  • If it was not immediately apparent, comparison with our dataset coordinates makes it clear that these numerical strings are time ranges; the string '185001-194912' from the first url, for example, represents the time range from Jan 1850 through Dec 1949:

print(ds.coords)
Coordinates:
  * bnds     (bnds) float64 1.0 2.0
    height   float64 ...
  * lat      (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * lon      (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
  * time     (time) object 1850-01-16 12:00:00 ... 1949-12-16 12:00:00

Let’s define a function that takes these strings as input

  • … and returns full file paths!

def make_full_path(time):
    '''
    Parameters
    ----------
    time : str
    
        A 13-character string, consisting of two 6-character dates delimited by a dash.
        The first four characters of each date are the year, and the final two are the month.
        
        e.g. The time range from Jan 1850 through Dec 1949 is expressed as '185001-194912'.
            
    '''
    base_url = 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/'
    return base_url + f'tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_{time}.nc'
# And let's be sure to test our function before moving on.

test_url = make_full_path('185001-194912')
print(test_url)

# If our function works, inputting '185001-194912' should return a url identical to
# the first of the two urls in the `input_urls` list defined above:

test_url == input_urls[0]
s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc
True

Combining dimensions

  • Before we initialize our file pattern, we need to define how we want files to be combined in our eventual zarr store

  • We have two options:

    1. Concatenating dimensions with a ConcatDim instance

    2. Merging dimensions with a MergeDim instance

  • Our current dataset requires only concatenation, which we can achieve by instantiating ConcatDim with the dimension name ("time") as a positional argument, followed by a keys kwarg: a list of all the values this component takes across our set of source file paths.

Note: This example reads from only two source files, so we can simply copy-and-paste their respective time-range strings into a list. If the number of source files were much larger, we would want to create this keys list programmatically, as sketched below.
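
For illustration, here is one way the keys could be derived from the urls themselves rather than typed by hand (a minimal sketch, assuming every file name ends with '_<time_range>.nc', as our input_urls do):

# strip the time-range strings out of the file names
keys = [url.split('_')[-1].replace('.nc', '') for url in input_urls]
print(keys)   # ['185001-194912', '195001-201412']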

from pangeo_forge_recipes.patterns import ConcatDim
time_concat_dim = ConcatDim("time", keys=['185001-194912', '195001-201412'])

Instantiating the file pattern

  • Now that we have both a file path function and our “combine dimensions” object, we can move on to instantiating the file pattern, passing these two objects as arguments.

  • Note that we will use fsspec.open under the hood for most file opening, so if there are any special keyword arguments we want to pass to this function, now is the time to do it.

  • Here we specify fsspec_open_kwargs={'anon':True} as a keyword argument in the FilePattern, because we want to access the source files anonymously.

from pangeo_forge_recipes.patterns import FilePattern
pattern = FilePattern(make_full_path, time_concat_dim, fsspec_open_kwargs={'anon':True})
pattern
<FilePattern {'time': 2}>

Inspecting the instantiated pattern, we see that it has indexed our two files chronologically according to the concatenation keys we provided, and assigned the correct url to each file using the file path function:

for index, fname in pattern.items():
    print(index, fname)
Index({DimIndex(name='time', index=0, sequence_len=2, operation=<CombineOp.CONCAT: 2>)}) s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc
Index({DimIndex(name='time', index=1, sequence_len=2, operation=<CombineOp.CONCAT: 2>)}) s3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_195001-201412.nc

Time to make the recipe!

  • In its most basic form, XarrayZarrRecipe can be instantiated with a file pattern as the only argument. Here we’ll be using some of the optional arguments to specify a few additional preferences:

from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe

recipe = XarrayZarrRecipe(
    pattern, 
    target_chunks=target_chunks,
    process_chunk=set_bnds_as_coords,
    xarray_concat_kwargs={'join':'exact'},
)

Step 5: Execute the recipe

Here we use the basic function executor. For details on all execution modes, see:

https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_recipes/recipe_user_guide/execution.html

from pangeo_forge_recipes.recipes import setup_logging

setup_logging()  # setup execution logs

recipe.to_function()()  # compile and execute recipe as single Python function
[05/10/22 16:27:20] INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=0, sequence_len=2,                                        
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 's3://esgf-world/CMIP6/CMIP/NOAA-GFD storage.py:161
                             L/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v2018               
                             0701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_18               
                             5001-194912.nc'                                                 
                    INFO     Copying remote file 's3://esgf-world/CMIP6/CMIP/N storage.py:172
                             OAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr               
                             1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1               
                             _gr1_185001-194912.nc' to cache                                 
[05/10/22 16:27:47] INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/                   
                             GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20                   
                             180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_                   
                             gr1_185001-194912.nc'                                           
                    INFO     Opening 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFD storage.py:267
                             L-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/               
                             tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-               
                             194912.nc' from cache                                           
                    INFO     Caching metadata for input                    xarray_zarr.py:167
                             'Index({DimIndex(name='time', index=0,                          
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})'                                                          
                    INFO     Caching input 'Index({DimIndex(name='time',   xarray_zarr.py:153
                             index=1, sequence_len=2,                                        
                             operation=<CombineOp.CONCAT: 2>)})'                             
                    INFO     Caching file 's3://esgf-world/CMIP6/CMIP/NOAA-GFD storage.py:161
                             L/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v2018               
                             0701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_19               
                             5001-201412.nc'                                                 
                    INFO     Copying remote file 's3://esgf-world/CMIP6/CMIP/N storage.py:172
                             OAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr               
                             1/v20180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1               
                             _gr1_195001-201412.nc' to cache                                 
[05/10/22 16:28:02] INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/                   
                             GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20                   
                             180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_                   
                             gr1_195001-201412.nc'                                           
                    INFO     Opening 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFD storage.py:267
                             L-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/               
                             tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_195001-               
                             201412.nc' from cache                                           
                    INFO     Caching metadata for input                    xarray_zarr.py:167
                             'Index({DimIndex(name='time', index=1,                          
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})'                                                          
//pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py:115: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  return xr.open_zarr(target.get_mapper())
                    INFO     Creating a new dataset in target              xarray_zarr.py:511
                    INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/                   
                             GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20                   
                             180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_                   
                             gr1_185001-194912.nc'                                           
                    INFO     Opening 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFD storage.py:267
                             L-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/               
                             tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-               
                             194912.nc' from cache                                           
                    INFO     Combining inputs for chunk                    xarray_zarr.py:408
                             'Index({DimIndex(name='time', index=0,                          
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})'                                                          
                    INFO     Storing dataset in /var/folders/tt/4f941hdn0z xarray_zarr.py:553
                             q549zdwhcgg98c0000gn/T/tmpkxb_4ar2/rP3aoHSm                     
                    INFO     Expanding target concat dim 'time' to size    xarray_zarr.py:569
                             1980                                                            
                    INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/                   
                             GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20                   
                             180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_                   
                             gr1_185001-194912.nc'                                           
                    INFO     Opening 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFD storage.py:267
                             L-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/               
                             tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-               
                             194912.nc' from cache                                           
                    INFO     Combining inputs for chunk                    xarray_zarr.py:408
                             'Index({DimIndex(name='time', index=0,                          
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})'                                                          
[05/10/22 16:28:03] INFO     Storing variable tas chunk                    xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 1200, None),                     
                             slice(None, None, None), slice(None, None,                      
                             None))                                                          
                    INFO     Storing variable time chunk                   xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 1200, None),)                    
                    INFO     Storing variable time_bnds chunk              xarray_zarr.py:632
                             Index({DimIndex(name='time', index=0,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(0, 1200, None),                     
                             slice(None, None, None))                                        
                    INFO     Opening inputs for chunk                      xarray_zarr.py:390
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})                                                           
                    INFO     Opening input with Xarray                     xarray_zarr.py:253
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}): 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/                   
                             GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20                   
                             180701/tas_Amon_GFDL-CM4_historical_r1i1p1f1_                   
                             gr1_195001-201412.nc'                                           
                    INFO     Opening 's3://esgf-world/CMIP6/CMIP/NOAA-GFDL/GFD storage.py:267
                             L-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/               
                             tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_195001-               
                             201412.nc' from cache                                           
[05/10/22 16:28:04] INFO     Combining inputs for chunk                    xarray_zarr.py:408
                             'Index({DimIndex(name='time', index=1,                          
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)})'                                                          
                    INFO     Storing variable tas chunk                    xarray_zarr.py:632
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(1200, 1980,                         
                             None), slice(None, None, None), slice(None,                     
                             None, None))                                                    
                    INFO     Storing variable time chunk                   xarray_zarr.py:632
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(1200, 1980,                         
                             None),)                                                         
                    INFO     Storing variable time_bnds chunk              xarray_zarr.py:632
                             Index({DimIndex(name='time', index=1,                           
                             sequence_len=2, operation=<CombineOp.CONCAT:                    
                             2>)}) to Zarr region (slice(1200, 1980,                         
                             None), slice(None, None, None))                                 
                    INFO     Consolidating dimension coordinate arrays     xarray_zarr.py:649
                    INFO     Consolidating Zarr metadata                   xarray_zarr.py:673
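
The function executor ran the whole recipe eagerly in this process. The same recipe can also be compiled for parallel execution; the following sketch assumes the Dask executor available in this release of pangeo-forge-recipes and a dask.distributed cluster:

from dask.distributed import Client

client = Client()            # local cluster; point this at a real cluster for large datasets
delayed = recipe.to_dask()   # a dask.Delayed wrapping the whole recipe
delayed.compute()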

Step 6: Check the resulting Zarr store

# Check to see if it worked:
ds = xr.open_zarr(recipe.target_mapper)
print(ds)
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 180, lon: 288, time: 1980)
Coordinates:
  * bnds       (bnds) float64 1.0 2.0
    height     float64 ...
  * lat        (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(180, 2), meta=np.ndarray>
  * lon        (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(288, 2), meta=np.ndarray>
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(241, 2), meta=np.ndarray>
Data variables:
    tas        (time, lat, lon) float32 dask.array<chunksize=(241, 180, 288), meta=np.ndarray>
Attributes: (12/46)
    Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  36500.0
    comment:                <null ref>
    ...                     ...
    table_id:               Amon
    title:                  NOAA GFDL GFDL-CM4 model output prepared for CMIP...
    tracking_id:            hdl:21.14100/e4193a02-6405-49b6-8ad3-65def741a4dd
    variable_id:            tas
    variant_info:           N/A
    variant_label:          r1i1p1f1
ds[var][-1].plot()
<matplotlib.collections.QuadMesh at 0x7f7c80eb4e20>
[figure: global map of tas at the final time step]

Postscript

  • If you find a CMIP6 dataset for which this recipe does not work, please report it at issue#105 so we can refine the recipe, if possible.

# Troubles found:

dataset = 'IPSL-CM6A-LR.abrupt-4xCO2.r1i1p1f1.Lmon.cLeaf.gr'  # needs decode_coords=False in xr.open_dataset, but even with xarray_open_kwargs = {'decode_coords': False} the recipe still throws an error when caching the input
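
For reference, extra keyword arguments for xr.open_dataset are forwarded through the recipe's xarray_open_kwargs parameter; the attempted workaround from the note above would be wired in like this (a sketch only, since it still fails at the caching step for this dataset):

recipe = XarrayZarrRecipe(
    pattern,
    target_chunks=target_chunks,
    process_chunk=set_bnds_as_coords,
    xarray_concat_kwargs={'join': 'exact'},
    xarray_open_kwargs={'decode_coords': False},  # workaround attempt; does not resolve the caching error here
)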