Context
@singofwalls, @betolink, and I are at a hackday working on virtualizing data typically accessed via xopr, which reads CReSIS polar radar .mat files (MATLAB v7.3, i.e. HDF5). These files contain a mix of:
- Numeric data arrays that virtualize cleanly: Data, GPS_time, Time, Latitude, Longitude, Elevation, Roll, Pitch, Heading, Surface
- MATLAB metadata structs that cannot be virtualized: param_records, param_csarp, param_array, param_sar, param_combine, param_qlook, etc. These are deeply nested HDF5 groups containing cell arrays, char arrays (uint16), and object references.
The parameter groups vary across CReSIS campaigns and seasons, so there's no fixed list of names to drop. But they consistently match param_*.
Currently, drop_variables requires exact string matches:
parser = HDFParser(drop_variables=["param_records", "param_csarp", "param_array",
                                   "param_sar", "param_combine", "param_qlook",
                                   "param_radar", "file_type"])
This is fragile — a new campaign season might introduce param_collate or param_post and break the parse.
Proposal
Allow drop_variables to accept compiled regex patterns (or glob strings) alongside plain strings:
import re
parser = HDFParser(drop_variables=["file_type", re.compile(r"param_.*")])
This would apply to both the array and group filtering in _construct_manifest_group.
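If glob strings were supported as well, one possible approach is to normalize every entry to a compiled pattern up front. The sketch below is only an illustration: `normalize_drop_spec` is a hypothetical name, and treating any string containing `*`, `?`, or `[` as a glob is an assumption, not existing behavior.

```python
import fnmatch
import re
from typing import Union

GLOB_CHARS = frozenset("*?[")


def normalize_drop_spec(spec: Union[str, re.Pattern]) -> re.Pattern:
    """Turn one drop_variables entry into a compiled pattern (hypothetical helper)."""
    if isinstance(spec, re.Pattern):
        # Already a compiled regex: use it unchanged.
        return spec
    if GLOB_CHARS.intersection(spec):
        # Assumption: strings with glob metacharacters are shell-style globs.
        return re.compile(fnmatch.translate(spec))
    # Plain string: escape it so it only matches that exact name.
    return re.compile(re.escape(spec))


patterns = [normalize_drop_spec(s) for s in ["file_type", "param_*"]]


def is_dropped(name: str) -> bool:
    # fullmatch means the whole name must match, not just a prefix.
    return any(p.fullmatch(name) for p in patterns)
```

One trade-off: a variable whose real name contains `*`, `?`, or `[` could no longer be dropped by exact string. Accepting only compiled `re.Pattern` objects alongside plain strings avoids that ambiguity.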
Alternatives considered
- drop_failures=True: Catch and skip any variable/group that raises during parsing. This works but silently swallows errors, so a bug in the parser for a legitimate variable would be hidden. Pattern matching is more explicit about intent.
- Pre-inspecting with h5py: Users can open the file with h5py first, filter keys, and pass an exact drop_variables list. This works for root-level variables but doesn't help with nested groups that fail during recursive traversal.
- A dedicated MATLAB parser: Overkill — the numeric arrays in MATLAB HDF5 files are standard HDF5 datasets. The only problem is the metadata cruft around them.
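For reference, the h5py pre-inspection workaround might look roughly like the sketch below. The function names are hypothetical, and it only inspects root-level keys, which is the limitation noted above.

```python
import re
from typing import Iterable, List


def names_to_drop(root_keys: Iterable[str], pattern: str = r"param_.*") -> List[str]:
    """Return the root-level names that fully match `pattern` (hypothetical helper)."""
    regex = re.compile(pattern)
    return sorted(k for k in root_keys if regex.fullmatch(k))


def drop_list_for_file(path: str, pattern: str = r"param_.*") -> List[str]:
    """Open an HDF5 file read-only and build an exact drop_variables list."""
    import h5py  # imported lazily so names_to_drop works without h5py installed

    with h5py.File(path, "r") as f:
        # f.keys() lists only the members of the root group.
        return names_to_drop(f.keys(), pattern)
```

The resulting list could then be passed as drop_variables, at the cost of opening every file twice.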
Scope
The change would be in _construct_manifest_group (parsers/hdf/hdf.py). The key not in drop_variables check would become a helper function that tests both exact membership and regex matching. The type of drop_variables would widen from Iterable[str] | None to Iterable[str | re.Pattern] | None.
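A minimal sketch of what such a helper could look like. The name `should_drop` is hypothetical, and using `fullmatch` (rather than `match` or `search`) is an assumption about the desired semantics; it is not taken from the existing code.

```python
import re
from typing import Iterable, Optional, Union

DropSpec = Union[str, re.Pattern]


def should_drop(key: str, drop_variables: Optional[Iterable[DropSpec]]) -> bool:
    """Return True if `key` matches any entry, by exact string or by regex."""
    if drop_variables is None:
        return False
    for spec in drop_variables:
        if isinstance(spec, re.Pattern):
            # Compiled pattern: the whole key must match.
            if spec.fullmatch(key):
                return True
        elif spec == key:
            # Plain string: exact membership, as in the current check.
            return True
    return False


drop = ["file_type", re.compile(r"param_.*")]
kept = [k for k in ["Data", "param_collate", "file_type", "Surface"]
        if not should_drop(k, drop)]
# kept == ["Data", "Surface"]
```

Since the helper would be called once per key, drop_variables should probably be materialized into a list or tuple once at construction time, so that a one-shot iterable is not exhausted after the first key.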