File patterns#
Introduction#
File patterns define recipe inputs#
FilePatterns are the starting point for any Pangeo Forge recipe. They are the raw
inputs (or “ingredients”) upon which the recipe will act. File patterns describe:
Where individual source files are located; and
How they should be organized logically as part of an aggregate dataset. (In this respect, file patterns are conceptually similar to NcML documents.)
Note
API Reference is available here: pangeo_forge_recipes.patterns.FilePattern
Pangeo Forge Pulls Data#
A central concept in Pangeo Forge is that data are “pulled”, not “pushed” to the storage location. A file pattern describes where to find your data; when you execute the recipe, the data will automatically be fetched and transformed. You cannot “upload” data to Pangeo Forge. This is deliberate.
For recipes built from public, open data, it’s always best to try to get the data from its original, authoritative source. For example, if you want to use satellite data from NASA, you need to find the URLs which point to that data on NASA’s servers.
Pangeo Forge supports a huge range of different transfer protocols for accessing URL-based data files, thanks to the filesystem-spec framework. A full list of protocols can be found in the fsspec docs (built-in implementations | other implementations). In order for Pangeo Forge to pull your data, it should be accessible over the public internet via one of these protocols.
Tip
To access data stored on HPC filesystems, Globus may be useful.
Create a file pattern#
Let’s explore a simple example of how to create a file pattern for an imaginary dataset with file paths which look like this:
http://data-provider.org/data/temperature/temperature_01.txt
http://data-provider.org/data/temperature/temperature_02.txt
...
http://data-provider.org/data/temperature/temperature_10.txt
http://data-provider.org/data/humidity/humidity_01.txt
http://data-provider.org/data/humidity/humidity_02.txt
...
http://data-provider.org/data/humidity/humidity_10.txt
This is a relatively common way to organize data files:
There are two different “variables” (temperature and humidity), stored in separate files.
There is a sequence of 10 files for each variable. We will assume that this represents the “time” axis of the data.
We observe that there are essentially two dimensions to the file organization:
variable (2) and time (10). The product of these (2 x 10 = 20) determines the total
number of files in our dataset.
We refer to the unique identifiers for each dimension (temperature, humidity; 1, 2, ..., 10)
as the keys for our file pattern.
At this point, we don’t really care what is inside these files.
We are just interested in the logical organization of the files themselves;
this is what a FilePattern is meant to describe.
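The 2 x 10 organization described above can be sketched in plain Python, without any Pangeo Forge code. This is only an illustration of how the keys of the two dimensions combine; the names variables, times, and key_pairs are chosen for this sketch and are not part of the library.

```python
from itertools import product

# Keys for each dimension of the imaginary dataset described above.
variables = ["temperature", "humidity"]
times = list(range(1, 11))

# Every (variable, time) pair identifies exactly one source file.
key_pairs = list(product(variables, times))

print(len(key_pairs))  # 20
print(key_pairs[0])    # ('temperature', 1)
print(key_pairs[-1])   # ('humidity', 10)
```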
Sneak peek: the full code#
Here is the full code we will be writing below to describe a file pattern for this imaginary dataset, provided upfront for reference:
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim
def make_full_path(variable, time):
    return f"http://data-provider.org/data/{variable}/{variable}_{time:02d}.txt"
variable_merge_dim = MergeDim("variable", ["temperature", "humidity"])
time_concat_dim = ConcatDim("time", list(range(1, 11)))
kws = {} # no keyword arguments used for this example
pattern = FilePattern(make_full_path, variable_merge_dim, time_concat_dim, **kws)
pattern
<FilePattern {'variable': 2, 'time': 10}>
In what follows we will look at each element of this code one-by-one.
Format function#
The starting point for creating a file pattern is to write a function which maps the keys for each dimension into full file paths. This function might look something like this:
def make_full_path(variable, time):
    return f"http://data-provider.org/data/{variable}/{variable}_{time:02d}.txt"
# check that it works
make_full_path("humidity", 3)
'http://data-provider.org/data/humidity/humidity_03.txt'
Important
Argument names in your Format function must match the names
used in your Combine dimensions.
Here, the function make_full_path has two arguments: variable and time.
These are the same as the names used in our Combine dimensions.
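One way to check this naming requirement before building a pattern is to compare the function's argument names against the dimension names you intend to use. The sketch below uses only the standard library; dim_names is a hypothetical list written out by hand for this example, and the check itself is not something pangeo-forge-recipes provides.

```python
import inspect


def make_full_path(variable, time):
    return f"http://data-provider.org/data/{variable}/{variable}_{time:02d}.txt"


# Names we intend to give our combine dimensions (assumed for this sketch).
dim_names = ["variable", "time"]

# Argument names the format function actually accepts.
arg_names = list(inspect.signature(make_full_path).parameters)

# Any mismatch in either direction signals a naming problem.
missing = set(dim_names) - set(arg_names)
unexpected = set(arg_names) - set(dim_names)
assert not missing and not unexpected, (missing, unexpected)
```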
Combine dimensions#
We now need to define the “combine dimensions” of the file pattern. Combine dimensions are one of two types:
pangeo_forge_recipes.patterns.ConcatDim: The files should be combined by concatenating the same variables sequentially along an axis. This is conceptually similar to Xarray’s concat operation.
pangeo_forge_recipes.patterns.MergeDim: The files should be combined by merging multiple distinct variables into a single dataset. This is conceptually similar to Xarray’s merge operation.
File patterns let us use multiple combine dimensions in a single pattern.
For the present example, we have one MergeDim:
from pangeo_forge_recipes.patterns import MergeDim
variable_merge_dim = MergeDim("variable", ["temperature", "humidity"])
…and one ConcatDim:
from pangeo_forge_recipes.patterns import ConcatDim
time_concat_dim = ConcatDim("time", list(range(1, 11)))
Keyword arguments#
FilePattern objects carry all of the information needed to open source files, which may include
source-server specific arguments and/or authentication information. These options can be specified
via keyword arguments. Please refer to the API Reference for more on these optional parameters:
pangeo_forge_recipes.patterns.FilePattern.
Warning
Secrets including login credentials and API tokens should never be committed to a public repository. As such,
we strongly suggest that you do not instantiate your FilePattern with these or any other secrets when
developing your recipe. If your source files require authentication via Keyword arguments, it is advisable
to provide these values as variables in the Deployment environment, and not
as literal values in the recipe file itself.
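As a rough sketch of the advice above, secrets can be read from environment variables at run time rather than written into the recipe file. The helper secrets_from_env and the variable name DATA_PROVIDER_TOKEN are hypothetical and exist only for this illustration. How the resulting values are passed to FilePattern depends on the keyword arguments documented in the API Reference.

```python
import os


def secrets_from_env(names, environ=os.environ):
    """Collect the named environment variables, failing loudly if any is unset."""
    missing = [name for name in names if name not in environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: environ[name] for name in names}


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {"DATA_PROVIDER_TOKEN": "not-a-real-token"}
secrets = secrets_from_env(["DATA_PROVIDER_TOKEN"], environ=fake_env)
```

In a real deployment the environ argument would be left at its default, so the values come from the deployment environment and never appear in the repository.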
Putting it all together#
We are now ready to create our file pattern. We do this by bringing together the Format function, Combine dimensions, and (optionally) any Keyword arguments.
from pangeo_forge_recipes.patterns import FilePattern
kws = {} # no keyword arguments used for this example
pattern = FilePattern(make_full_path, variable_merge_dim, time_concat_dim, **kws)
pattern
<FilePattern {'variable': 2, 'time': 10}>
To see the full code in one place, please refer back to Sneak peek: the full code.
Inspect a FilePattern#
We can inspect file patterns manually to understand how they work. This is not necessary
to create a recipe; however, digging into a FilePattern’s internals may be helpful in
debugging a complex recipe. Internally, the file pattern maps the keys of the
Combine dimensions to logical indices. We can see all of these keys by iterating over
the pattern using the items() method:
for index, fname in pattern.items():
    print(index, fname)
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=0, indexed=False)} http://data-provider.org/data/temperature/temperature_01.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=1, indexed=False)} http://data-provider.org/data/temperature/temperature_02.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=2, indexed=False)} http://data-provider.org/data/temperature/temperature_03.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=3, indexed=False)} http://data-provider.org/data/temperature/temperature_04.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=4, indexed=False)} http://data-provider.org/data/temperature/temperature_05.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=5, indexed=False)} http://data-provider.org/data/temperature/temperature_06.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=6, indexed=False)} http://data-provider.org/data/temperature/temperature_07.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=7, indexed=False)} http://data-provider.org/data/temperature/temperature_08.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=8, indexed=False)} http://data-provider.org/data/temperature/temperature_09.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=0, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=9, indexed=False)} http://data-provider.org/data/temperature/temperature_10.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=0, indexed=False)} http://data-provider.org/data/humidity/humidity_01.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=1, indexed=False)} http://data-provider.org/data/humidity/humidity_02.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=2, indexed=False)} http://data-provider.org/data/humidity/humidity_03.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=3, indexed=False)} http://data-provider.org/data/humidity/humidity_04.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=4, indexed=False)} http://data-provider.org/data/humidity/humidity_05.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=5, indexed=False)} http://data-provider.org/data/humidity/humidity_06.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=6, indexed=False)} http://data-provider.org/data/humidity/humidity_07.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=7, indexed=False)} http://data-provider.org/data/humidity/humidity_08.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=8, indexed=False)} http://data-provider.org/data/humidity/humidity_09.txt
{Dimension(name='variable', operation=<CombineOp.MERGE: 1>): Position(value=1, indexed=False), Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=9, indexed=False)} http://data-provider.org/data/humidity/humidity_10.txt
Hint
This items() method will come up again in From file pattern to PCollection.
The index is a pangeo_forge_recipes.patterns.Index used internally by pangeo-forge-recipes
to align the source files in the aggregate dataset.
Users generally will not need to interact with indexes manually, but it may be interesting to
note that we can retrieve source filenames from the file pattern via “getitem” syntax, using
an index as key:
pattern[index]
'http://data-provider.org/data/humidity/humidity_10.txt'
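To build intuition for this lookup, here is a toy model in plain Python. It is not how pangeo-forge-recipes implements Index: the real index uses Dimension and Position objects, as shown in the output above, whereas this sketch substitutes simple (variable position, time position) tuples. The name toy_index is made up for this example.

```python
def make_full_path(variable, time):
    return f"http://data-provider.org/data/{variable}/{variable}_{time:02d}.txt"


variables = ["temperature", "humidity"]
times = list(range(1, 11))

# Map zero-based positions along each dimension to the corresponding file path.
toy_index = {
    (v_pos, t_pos): make_full_path(variable, time)
    for v_pos, variable in enumerate(variables)
    for t_pos, time in enumerate(times)
}

# Position 1 along "variable" and position 9 along "time" is the last humidity file.
print(toy_index[(1, 9)])
```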
From file pattern to PCollection#
As covered in Recipe Composition, a recipe is composed of a sequence of Apache Beam transforms.
The data collection that Apache Beam transforms operate on is a PCollection.
Therefore, to bring the contents of a FilePattern into a recipe, we pass the index:url
pairs generated by the file pattern’s items() method into Beam’s Create constructor
as follows:
import apache_beam as beam
recipe = (
    beam.Create(pattern.items())
    # ... continue with additional transforms here
)
We now have our data properly initialized, and can begin composing the recipe with Transforms.