File patterns#

Introduction#

File patterns define recipe inputs#

FilePatterns are the starting point for any Pangeo Forge recipe. They are the raw inputs (or “ingredients”) upon which the recipe will act. File patterns describe:

  • Where individual source files are located; and

  • How they should be organized logically as part of an aggregate dataset. (In this respect, file patterns are conceptually similar to NcML documents.)

Note

API Reference is available here: pangeo_forge_recipes.patterns.FilePattern

Pangeo Forge Pulls Data#

A central concept in Pangeo Forge is that data are “pulled”, not “pushed” to the storage location. A file pattern describes where to find your data; when you execute the recipe, the data will automatically be fetched and transformed. You cannot “upload” data to Pangeo Forge. This is deliberate.

For recipes built from public, open data, it’s always best to try to get the data from its original, authoritative source. For example, if you want to use satellite data from NASA, you need to find the URLs which point to that data on NASA’s servers.

Pangeo Forge supports a huge range of different transfer protocols for accessing URL-based data files, thanks to the filesystem-spec framework. A full list of protocols can be found in the fsspec docs (built-in implementations | other implementations). In order for Pangeo Forge to pull your data, it should be accessible over the public internet via one of these protocols.

Tip

To access data stored on HPC filesystems, Globus may be useful.

Create a file pattern#

Let’s explore a simple example of how to create a file pattern for an imaginary dataset with file paths which look like this:

http://data-provider.org/data/temperature/temperature_01.txt
http://data-provider.org/data/temperature/temperature_02.txt
...
http://data-provider.org/data/temperature/temperature_10.txt
http://data-provider.org/data/humidity/humidity_01.txt
http://data-provider.org/data/humidity/humidity_02.txt
...
http://data-provider.org/data/humidity/humidity_10.txt

This is a relatively common way to organize data files:

  • There are two different “variables” (temperature and humidity), stored in separate files.

  • There is a sequence of 10 files for each variable. We will assume that this represents the “time” axis of the data.

We observe that there are essentially two dimensions to the file organization: variable (2) and time (10). The product of these (2 x 10 = 20) determines the total number of files in our dataset. We refer to the unique identifiers for each dimension (temperature, humidity; 1, 2, ..., 10) as the keys for our file pattern. At this point, we don’t really care what is inside these files. We are just interested in the logical organization of the files themselves; this is what a FilePattern is meant to describe.
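The 2 × 10 organization described above can be sketched with plain Python, using nothing beyond the standard library. This is just an illustration of how the keys combine, not part of the recipe itself:

```python
from itertools import product

variables = ["temperature", "humidity"]
times = list(range(1, 11))

# Every (variable, time) pair identifies exactly one source file.
keys = list(product(variables, times))
print(len(keys))  # 2 variables x 10 time steps = 20 files
```

A `FilePattern` formalizes exactly this Cartesian product of keys, as we will see below.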

Sneak peek: the full code#

Here is the full code we will be writing below to describe a file pattern for this imaginary dataset, provided upfront for reference:

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim

def make_full_path(variable, time):
    return f"http://data-provider.org/data/{variable}/{variable}_{time:02d}.txt"

variable_merge_dim = MergeDim("variable", ["temperature", "humidity"])
time_concat_dim = ConcatDim("time", list(range(1, 11)))

kws = {}  # no keyword arguments used for this example
pattern = FilePattern(make_full_path, variable_merge_dim, time_concat_dim, **kws)
pattern
<FilePattern {'variable': 2, 'time': 10}>

In what follows we will look at each element of this code one-by-one.

Format function#

The starting point for creating a file pattern is to write a function which maps the keys for each dimension into full file paths. This function might look something like this:

def make_full_path(variable, time):
    return f"http://data-provider.org/data/{variable}/{variable}_{time:02d}.txt"

# check that it works
make_full_path("humidity", 3)
'http://data-provider.org/data/humidity/humidity_03.txt'

Important

Argument names in your Format function must match the names used in your Combine dimensions. Here, the function make_full_path has two arguments: variable, and time. These are the same as the names used in our Combine dimensions.

Combine dimensions#

We now need to define the “combine dimensions” of the file pattern. Combine dimensions are one of two types:

  • ConcatDim: a dimension along which files will be concatenated (for example, a sequence of time steps stored in separate files).

  • MergeDim: a dimension along which files will be merged (for example, distinct variables stored in separate files).

File patterns permit us to combine multiple combine dims into a single pattern. For the present example, we have one MergeDim:

from pangeo_forge_recipes.patterns import MergeDim
variable_merge_dim = MergeDim("variable", ["temperature", "humidity"])

…and one ConcatDim:

from pangeo_forge_recipes.patterns import ConcatDim
time_concat_dim = ConcatDim("time", list(range(1, 11)))

Keyword arguments#

FilePattern objects carry all of the information needed to open source files, which may include server-specific arguments and/or authentication information. These options can be specified via keyword arguments. Please refer to the API Reference for more on these optional parameters: pangeo_forge_recipes.patterns.FilePattern.

Warning

Secrets including login credentials and API tokens should never be committed to a public repository. As such, we strongly suggest that you do not instantiate your FilePattern with these or any other secrets when developing your recipe. If your source files require authentication via Keyword arguments, it is advisable to provide these values as variables in the Deployment environment, and not as literal values in the recipe file itself.
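For example, rather than hard-coding a token, you might read it from the deployment environment at runtime. The following is a hedged sketch: the environment variable name `MY_SOURCE_TOKEN` is illustrative, and the exact keyword arguments your version of FilePattern accepts should be confirmed against the API Reference:

```python
import os

# Read the secret from the deployment environment, never from the recipe file.
# The variable name MY_SOURCE_TOKEN is illustrative, not a real convention.
token = os.environ.get("MY_SOURCE_TOKEN", "")

# Pass authentication options through to fsspec when opening source files.
# Consult the FilePattern API reference for the exact parameter names.
kws = {"fsspec_open_kwargs": {"headers": {"Authorization": f"Bearer {token}"}}}
```

This keeps the recipe file free of literal secrets, so it can safely live in a public repository.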

Putting it all together#

We are now ready to create our file pattern. We do this by bringing together the Format function, Combine dimensions, and (optionally) any Keyword arguments.

from pangeo_forge_recipes.patterns import FilePattern

kws = {}  # no keyword arguments used for this example
pattern = FilePattern(make_full_path, variable_merge_dim, time_concat_dim, **kws)
pattern
<FilePattern {'variable': 2, 'time': 10}>

To see the full code in one place, please refer back to Sneak peek: the full code.

Inspect a FilePattern#

We can inspect file patterns manually to understand how they work. This is not necessary to create a recipe; however, digging into a FilePattern’s internals may be helpful when debugging a complex recipe. Internally, the file pattern maps the keys of the Combine dimensions to logical indices. We can see all of these keys by iterating over the pattern using the items() method:

for index, fname in pattern.items():
    print(index, fname)
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=0, indexed=False)} http://data-provider.org/data/temperature/temperature_01.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=1, indexed=False)} http://data-provider.org/data/temperature/temperature_02.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=2, indexed=False)} http://data-provider.org/data/temperature/temperature_03.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=3, indexed=False)} http://data-provider.org/data/temperature/temperature_04.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=4, indexed=False)} http://data-provider.org/data/temperature/temperature_05.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=5, indexed=False)} http://data-provider.org/data/temperature/temperature_06.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=6, indexed=False)} http://data-provider.org/data/temperature/temperature_07.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=7, indexed=False)} http://data-provider.org/data/temperature/temperature_08.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=8, indexed=False)} http://data-provider.org/data/temperature/temperature_09.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=9, indexed=False)} http://data-provider.org/data/temperature/temperature_10.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=0, indexed=False)} http://data-provider.org/data/humidity/humidity_01.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=1, indexed=False)} http://data-provider.org/data/humidity/humidity_02.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=2, indexed=False)} http://data-provider.org/data/humidity/humidity_03.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=3, indexed=False)} http://data-provider.org/data/humidity/humidity_04.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=4, indexed=False)} http://data-provider.org/data/humidity/humidity_05.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=5, indexed=False)} http://data-provider.org/data/humidity/humidity_06.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=6, indexed=False)} http://data-provider.org/data/humidity/humidity_07.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=7, indexed=False)} http://data-provider.org/data/humidity/humidity_08.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=8, indexed=False)} http://data-provider.org/data/humidity/humidity_09.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=9, indexed=False)} http://data-provider.org/data/humidity/humidity_10.txt

Hint

This items() method will come up again in From file pattern to PCollection.

The index is a pangeo_forge_recipes.patterns.Index used internally by pangeo-forge-recipes to align the source files in the aggregate dataset. Users generally will not need to interact with indexes manually, but it may be interesting to note that we can retrieve source filenames from the file pattern via “getitem” syntax, using an index as the key (here, the final index from the loop above):

pattern[index]
'http://data-provider.org/data/humidity/humidity_10.txt'
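To build intuition for what items() and the getitem syntax do, here is a minimal stdlib-only stand-in. It is not the real FilePattern implementation (which uses Dimension and Position objects rather than bare tuples), but it maps logical positions to URLs in the same spirit:

```python
def make_full_path(variable, time):
    return f"http://data-provider.org/data/{variable}/{variable}_{time:02d}.txt"

variables = ["temperature", "humidity"]
times = list(range(1, 11))

# A toy index: (variable position, time position) -> URL.
# The real FilePattern wraps these positions in Dimension/Position objects.
toy_pattern = {
    (vi, ti): make_full_path(v, t)
    for vi, v in enumerate(variables)
    for ti, t in enumerate(times)
}

print(toy_pattern[(1, 9)])
# -> http://data-provider.org/data/humidity/humidity_10.txt
```

As in the real pattern, the index records *positions* along each combine dimension (0-based), while the format function translates the corresponding *keys* into a URL.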

From file pattern to PCollection#

As covered in Recipe Composition, a recipe is composed of a sequence of Apache Beam transforms, and the data Apache Beam transforms operate on are PCollections. Therefore, to bring the contents of a FilePattern into a recipe, we pass the index:url pairs generated by the file pattern’s items() method into Beam’s Create constructor as follows:

import apache_beam as beam

recipe = (
  beam.Create(pattern.items())
  # ... continue with additional transforms here
)

We now have our data properly initialized, and can begin composing the recipe with Transforms.