Feature: Support pattern matching in drop_variables for HDFParser #970

Description

@maxrjones

Context

@singofwalls, @betolink, and I are at a hackday working on virtualizing data typically accessed via xopr, which reads CReSIS polar radar .mat files (MATLAB v7.3, i.e. HDF5). These files contain a mix of:

  • Numeric data arrays that virtualize cleanly: Data, GPS_time, Time, Latitude, Longitude, Elevation, Roll, Pitch, Heading, Surface
  • MATLAB metadata structs that cannot be virtualized: param_records, param_csarp, param_array, param_sar, param_combine, param_qlook, etc. — deeply nested HDF5 groups containing cell arrays, char arrays (uint16), and object references

The parameter groups vary across CReSIS campaigns and seasons, so there's no fixed list of names to drop. But they consistently match param_*.

Currently, drop_variables requires exact string matches:

parser = HDFParser(drop_variables=["param_records", "param_csarp", "param_array", 
                                    "param_sar", "param_combine", "param_qlook",
                                    "param_radar", "file_type"])

This is fragile — a new campaign season might introduce param_collate or param_post and break the parse.

Proposal

Allow drop_variables to accept compiled regex patterns (or glob strings) alongside plain strings:

import re

parser = HDFParser(drop_variables=["file_type", re.compile(r"param_.*")])

This would apply to both the array and group filtering in _construct_manifest_group.
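If glob strings were supported as well, one possible approach is to normalize them to compiled patterns up front, so the filtering code only ever sees plain strings and `re.Pattern` objects. The sketch below uses the standard-library `fnmatch.translate`; `compile_glob` is a hypothetical helper name, not an existing function in the parser.

```python
import fnmatch
import re


def compile_glob(pattern: str) -> re.Pattern:
    """Convert a shell-style glob such as "param_*" into a compiled regex.

    fnmatch.translate anchors the end of the expression, so using
    .match() or .fullmatch() on the result tests the whole name.
    Matching is case-sensitive.
    """
    return re.compile(fnmatch.translate(pattern))


# Hypothetical usage: mix an exact name with a glob-derived pattern.
drop_variables = ["file_type", compile_glob("param_*")]
```

Requiring the caller to compile globs explicitly avoids guessing whether a plain string like `param_*` is meant as an exact name or as a pattern.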

Alternatives considered

  • drop_failures=True: Catch and skip any variable/group that raises during parsing. This works but silently swallows errors — a bug in the parser for a legitimate variable would be hidden. Pattern matching is more explicit about intent.
  • Pre-inspecting with h5py: Users can open the file with h5py first, filter keys, and pass an exact drop_variables list. This works for root-level variables but doesn't help with nested groups that fail during recursive traversal.
  • A dedicated MATLAB parser: Overkill — the numeric arrays in MATLAB HDF5 files are standard HDF5 datasets. The only problem is the metadata cruft around them.
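For reference, the h5py pre-inspection workaround from the list above can be sketched roughly as follows. `expand_drop_list` is a hypothetical helper, and the h5py step appears only in comments so the snippet runs without a file. As noted, this only covers names visible at the root level.

```python
from fnmatch import fnmatchcase


def expand_drop_list(names, patterns):
    """Return the names that match any glob pattern, preserving input order."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# In practice the root-level names would come from the file itself, e.g.:
#     with h5py.File(path, "r") as f:
#         root_keys = list(f.keys())
# A hard-coded list stands in for that here.
root_keys = ["Data", "GPS_time", "param_records", "param_csarp", "file_type"]

drop = expand_drop_list(root_keys, ["param_*", "file_type"])
# drop == ["param_records", "param_csarp", "file_type"]
# The result could then be passed as HDFParser(drop_variables=drop).
```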

Scope

The change would be in _construct_manifest_group (parsers/hdf/hdf.py). The key not in drop_variables check would become a helper function that tests both exact membership and regex matching. The type of drop_variables would widen from Iterable[str] | None to Iterable[str | re.Pattern] | None.
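A minimal sketch of what such a helper might look like is shown below. The name `should_drop` is hypothetical, and full-name matching via `fullmatch` is an assumption; the actual implementation could reasonably choose `match` or `search` semantics instead.

```python
from __future__ import annotations

import re
from collections.abc import Iterable


def should_drop(key: str, drop_variables: Iterable[str | re.Pattern] | None) -> bool:
    """Return True if `key` equals a string entry or fully matches a compiled pattern."""
    if drop_variables is None:
        return False
    for entry in drop_variables:
        if isinstance(entry, re.Pattern):
            # Assumption: the whole name must match, not just a prefix or substring.
            if entry.fullmatch(key):
                return True
        elif entry == key:
            return True
    return False


# Hypothetical usage mirroring the proposal above.
drops = ["file_type", re.compile(r"param_.*")]
kept = [k for k in ["Data", "param_post", "file_type", "Surface"]
        if not should_drop(k, drops)]
# kept == ["Data", "Surface"]
```

Because the helper iterates over `drop_variables` on every call, a caller that receives a one-shot iterator would likely want to materialize it into a list or tuple once before filtering.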

Labels: HDF parser (Non-kerchunk-based HDF parser)
