Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ All notable changes to this project are documented in this file.
  [#21](https://github.com/leroyvn/xarray-validate/pull/21)
- Allow making `DatasetSchema`'s data variables optional, and allow unknown
  data variables [#22](https://github.com/leroyvn/xarray-validate/pull/22)
- Support pattern matching for coordinate and data variable keys
  [#23](https://github.com/leroyvn/xarray-validate/pull/23)

## 0.0.4 — 2025-12-17

12 changes: 6 additions & 6 deletions TODO.md
@@ -1,8 +1,8 @@
 # TODO list

-- DOCS: Add thorough YAML schema writing guide
-- DatasetSchema: Allow defining optional variables or coordinates
-- DatasetSchema: Allow regex-based variable or coordinate name matching
-- AttrSchema: Support string input in the type field when deserializing
-- AttrSchema: Add regex-based string validation for attributes
-- AttrSchema: Add pint-based unit validation system
+- [ ] DOCS: Add thorough YAML schema writing guide
+- [x] DatasetSchema: Allow defining optional variables or coordinates
+- [x] DatasetSchema: Allow regex-based variable or coordinate name matching
+- [ ] AttrSchema: Support string input in the type field when deserializing
+- [ ] AttrSchema: Add regex-based string validation for attributes
+- [ ] AttrSchema: Add pint-based unit validation system
88 changes: 88 additions & 0 deletions docs/getting_started.rst
@@ -122,6 +122,94 @@ errors will be collected and reported after running all subschemas. For example:
('dims', SchemaError('dimension mismatch in axis 0: got y, expected x')),
('dims', SchemaError('dimension mismatch in axis 1: got x, expected y'))])

Pattern matching for coordinates and data variables
----------------------------------------------------

Coordinate and data variable keys in schemas support pattern matching, allowing
you to validate multiple similarly-named items with a single schema definition.
Two pattern types are supported:

**Glob patterns** use wildcards (``*`` and ``?``) for simple matching:

.. doctest::

    >>> ds = xr.Dataset(
    ...     {
    ...         "x_0": xr.DataArray([1, 2, 3], dims="x"),
    ...         "x_1": xr.DataArray([4, 5, 6], dims="x"),
    ...         "x_2": xr.DataArray([7, 8, 9], dims="x"),
    ...     }
    ... )
    >>> schema = DatasetSchema(
    ...     data_vars={
    ...         "x_*": DataArraySchema(dtype=np.int64, dims=["x"], shape=(3,))
    ...     }
    ... )
    >>> schema.validate(ds)

**Regex patterns** use regular expressions enclosed in curly braces for precise
matching:

.. doctest::

    >>> ds = xr.Dataset(
    ...     {
    ...         "x_0": xr.DataArray([1, 2, 3], dims="x"),
    ...         "x_1": xr.DataArray([4, 5, 6], dims="x"),
    ...         "x_foo": xr.DataArray([7, 8, 9], dims="x"),  # Won't match
    ...     }
    ... )
    >>> schema = DatasetSchema(
    ...     data_vars={
    ...         "{x_\\d+}": DataArraySchema(dtype=np.int64, dims=["x"], shape=(3,))
    ...     },
    ...     allow_extra_keys=True,  # Allow x_foo to exist
    ... )
    >>> schema.validate(ds)

Pattern matching also works with :class:`.CoordsSchema`:

.. doctest::

    >>> da = xr.DataArray(
    ...     np.ones((3, 3)),
    ...     dims=["x", "y"],
    ...     coords={
    ...         "x": np.arange(3),
    ...         "x_label_0": ("x", np.array(["a", "b", "c"], dtype=object)),
    ...         "x_label_1": ("x", np.array(["d", "e", "f"], dtype=object)),
    ...     },
    ... )
    >>> schema = DataArraySchema(
    ...     coords=CoordsSchema(
    ...         {
    ...             "x": DataArraySchema(dtype=np.int64),
    ...             "x_label_*": DataArraySchema(dtype=object),
    ...         }
    ...     )
    ... )
    >>> schema.validate(da)

**Pattern matching rules:**

- Exact keys take precedence over patterns
- When ``require_all_keys=True`` (default), only exact keys are required;
  pattern keys are optional
- When ``allow_extra_keys=False``, keys must match either an exact key or a
  pattern
- Multiple patterns can match the same key; all matching schemas will validate
  it
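
The rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: ``is_pattern``, ``compile_key`` and ``applicable_keys`` are hypothetical names, and the key convention (braces for regex, ``*`` or ``?`` for glob) is taken from the text above.

```python
import fnmatch
import re


def is_pattern(key):
    # Convention from the docs: braces mark a regex, * or ? mark a glob.
    return (key.startswith("{") and key.endswith("}")) or "*" in key or "?" in key


def compile_key(key):
    # Hypothetical helper: turn a pattern key into a compiled regex.
    if key.startswith("{") and key.endswith("}"):
        return re.compile(key[1:-1])
    return re.compile(fnmatch.translate(key))


def applicable_keys(name, schema_keys):
    """Return the schema keys whose schema would be applied to ``name``."""
    exact = [k for k in schema_keys if not is_pattern(k)]
    if name in exact:
        return [name]  # exact keys take precedence over patterns
    # Otherwise every matching pattern applies.
    return [
        k for k in schema_keys
        if is_pattern(k) and compile_key(k).fullmatch(name)
    ]


schema_keys = ["x_0", "x_*", "{x_\\d+}"]
for name in ["x_0", "x_1", "x_foo", "y"]:
    print(name, applicable_keys(name, schema_keys))
```

Here ``x_0`` is handled by its exact key only, ``x_1`` by both patterns, ``x_foo`` by the glob only, and ``y`` by nothing, so it would count as an extra key.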

.. admonition:: Tips
   :class: tip

   * Learn more about Python's Unix shell-style wildcards in the :mod:`fnmatch`
     module documentation.
   * Learn more about Python's regular expressions in the :mod:`re` module
     documentation.
   * Internally, Unix-style wildcards are converted to regular expressions
     using the :func:`fnmatch.translate` function.
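
As a quick illustration of the last tip, the snippet below uses only the standard library to show how a translated glob behaves when applied to whole names. The sample names are made up for the example.

```python
import fnmatch
import re

# fnmatch.translate() returns regex source text for a shell-style pattern.
star = re.compile(fnmatch.translate("x_*"))
qmark = re.compile(fnmatch.translate("x_?"))

names = ["x_0", "x_10", "x_foo", "y_0"]

# "*" matches any run of characters, "?" matches exactly one character.
star_hits = [n for n in names if star.fullmatch(n)]
qmark_hits = [n for n in names if qmark.fullmatch(n)]

print(star_hits)   # ['x_0', 'x_10', 'x_foo']
print(qmark_hits)  # ['x_0']
```

When a glob is too permissive, as ``x_*`` is for ``x_foo`` here, a regex key such as ``{x_\\d+}`` is the tighter alternative.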

Loading schemas from serialized data structures
-----------------------------------------------

120 changes: 120 additions & 0 deletions src/xarray_validate/_match.py
@@ -0,0 +1,120 @@
"""Pattern matching support functions."""

import fnmatch
import re
from typing import Any, Dict, Mapping, Set, Tuple


def is_regex_pattern(key: str) -> bool:
    """Check if a key is a regex pattern (enclosed in curly braces)."""
    return key.startswith("{") and key.endswith("}")


def is_glob_pattern(key: str) -> bool:
    """Check if a key is a glob pattern (contains * or ?)."""
    return "*" in key or "?" in key


def is_pattern_key(key: str) -> bool:
    """Check if a key is any kind of pattern (glob or regex)."""
    return is_glob_pattern(key) or is_regex_pattern(key)


def pattern_to_regex(pattern: str) -> re.Pattern:
    r"""
    Convert a pattern key to a compiled regex.

    Supports two pattern types:

    - glob patterns: ``'x_*'`` matches ``x_0``, ``x_1``, ``x_foo``, etc.
    - regex patterns: ``'{x_\\d+}'`` matches ``x_0``, ``x_1``, but not ``x_foo``

    Parameters
    ----------
    pattern : str
        The pattern string (regex in curly braces or glob).

    Returns
    -------
    re.Pattern
        Compiled regex pattern
    """
    if is_regex_pattern(pattern):
        # Remove curly braces and compile as regex
        regex_str = pattern[1:-1]
        return re.compile(regex_str)

    elif is_glob_pattern(pattern):
        # Convert glob to regex
        regex_str = fnmatch.translate(pattern)
        return re.compile(regex_str)

    else:
        # Exact match
        return re.compile(re.escape(pattern) + "$")
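
The compiled patterns returned here are applied with ``fullmatch`` elsewhere in this change, so a regex key has to cover the whole name and an exact key is escaped before compiling. A small standard-library sketch of why both details matter (the sample strings are illustrative only):

```python
import re

# Regex key "{x_\\d+}" with the braces stripped.
regex = re.compile(r"x_\d+")

# match() only anchors at the start; fullmatch() requires the whole name.
prefix_only = bool(regex.match("x_1_extra"))
whole_key = bool(regex.fullmatch("x_1_extra"))
print(prefix_only, whole_key)  # True False

# Exact keys are escaped, so "." is a literal dot rather than a wildcard.
exact = re.compile(re.escape("x.y") + "$")
print(bool(exact.fullmatch("x.y")))  # True
print(bool(exact.fullmatch("xzy")))  # False
```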


def separate_keys(
    schema_keys: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, re.Pattern]]:
    """
    Separate schema keys into exact and pattern keys, and compile patterns.

    Parameters
    ----------
    schema_keys : dict
        Dictionary with string keys (exact or pattern) and schema values.

    Returns
    -------
    exact_keys : dict
        Dictionary with exact (non-pattern) keys.

    pattern_keys : dict
        Dictionary with pattern keys.

    compiled_patterns : dict
        Dictionary mapping pattern keys to compiled regex objects.
    """
    exact_keys = {k: v for k, v in schema_keys.items() if not is_pattern_key(k)}
    pattern_keys = {k: v for k, v in schema_keys.items() if is_pattern_key(k)}
    compiled_patterns = {k: pattern_to_regex(k) for k in pattern_keys}
    return exact_keys, pattern_keys, compiled_patterns


def find_matched_keys(
    actual_keys: Mapping[str, Any],
    exact_keys: Dict[str, Any],
    compiled_patterns: Dict[str, re.Pattern],
) -> Set[str]:
    """
    Find all actual keys that match either exact or pattern keys.

    Parameters
    ----------
    actual_keys : mapping
        The actual keys to check (*e.g.* ``coords`` or ``data_vars``).

    exact_keys : dict
        Dictionary with exact (non-pattern) keys.

    compiled_patterns : dict
        Dictionary mapping pattern keys to compiled regex objects.

    Returns
    -------
    set
        Set of actual keys that match either exact or pattern keys.
    """
    matched_keys = set()
    for key_name in actual_keys:
        # Check exact match
        if key_name in exact_keys:
            matched_keys.add(key_name)
            continue
        # Check pattern match
        for pattern, regex in compiled_patterns.items():
            if regex.fullmatch(key_name):
                matched_keys.add(key_name)
                break
    return matched_keys
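
A sketch of how a caller might combine these helpers to compute the missing and extra key sets behind ``require_all_keys`` and ``allow_extra_keys``. To stay self-contained it uses condensed stand-ins instead of importing this module; ``is_pattern``, ``compile_key`` and ``check_keys`` are hypothetical names, and the sample keys are invented for the example.

```python
import fnmatch
import re


def is_pattern(key):
    # Condensed stand-in for is_pattern_key: braces mark a regex, * or ? a glob.
    return (key.startswith("{") and key.endswith("}")) or "*" in key or "?" in key


def compile_key(key):
    # Condensed stand-in for pattern_to_regex (pattern keys only).
    if key.startswith("{") and key.endswith("}"):
        return re.compile(key[1:-1])
    return re.compile(fnmatch.translate(key))


def check_keys(actual_keys, schema_keys):
    """Hypothetical helper: return the (missing, extra) key sets."""
    exact = {k for k in schema_keys if not is_pattern(k)}
    regexes = [compile_key(k) for k in schema_keys if is_pattern(k)]
    # Only exact keys can be missing; pattern keys are optional.
    missing = exact - set(actual_keys)
    # A key is extra if it matches neither an exact key nor any pattern.
    extra = {
        name
        for name in actual_keys
        if name not in exact and not any(r.fullmatch(name) for r in regexes)
    }
    return missing, extra


missing, extra = check_keys(
    actual_keys=["x", "x_label_0", "x_label_1", "time"],
    schema_keys=["x", "y", "x_label_*"],
)
print(missing)  # {'y'}
print(extra)    # {'time'}
```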
37 changes: 32 additions & 5 deletions src/xarray_validate/dataarray.py
@@ -14,6 +14,7 @@
 import attrs as _attrs
 import xarray as xr

+from . import _match
 from .base import (
     BaseSchema,
     SchemaError,
@@ -34,19 +35,26 @@

 @_attrs.define(on_setattr=[_attrs.setters.convert, _attrs.setters.validate])
 class CoordsSchema(BaseSchema):
-    """
+    r"""
     Schema container for Coordinates

     Parameters
     ----------
     coords : dict
-        Dict of coordinate keys and ``DataArraySchema`` objects.
+        Dict of coordinate keys and ``DataArraySchema`` objects. Keys can be
+        either exact coordinate names or patterns:
+
+        - Exact match: ``'time'`` matches only 'time'
+        - Glob pattern: ``'x_*'`` matches x_0, x_1, x_foo, etc.
+        - Regex pattern: ``'{x_\\d+}'`` matches x_0, x_1, but not x_foo

     require_all_keys : bool, default: True
         Whether to require all coordinates included in ``coords``.
+        Only applies to exact keys, not pattern keys.

     allow_extra_keys : bool, default: True
         Whether to allow coordinates not included in ``coords`` dict.
+        Coordinates matching pattern keys are not considered "extra".
     """

     coords: Dict[str, DataArraySchema] = _attrs.field()
@@ -78,8 +86,12 @@ def validate(
     ) -> None:
         # Inherit docstring

+        # Separate exact keys from pattern keys and compile patterns
+        exact_keys, pattern_keys, compiled_patterns = _match.separate_keys(self.coords)
+
         if self.require_all_keys:
-            missing_keys = set(self.coords) - set(coords)
+            # Only check exact keys for require_all_keys
+            missing_keys = set(exact_keys) - set(coords)
             if missing_keys:
                 error = SchemaError(f"coords has missing keys: {missing_keys}")
                 if context:
@@ -88,15 +100,20 @@ def validate(
                     raise error

         if not self.allow_extra_keys:
-            extra_keys = set(coords) - set(self.coords)
+            # Check that all coordinates match either exact or pattern keys
+            matched_coords = _match.find_matched_keys(
+                coords, exact_keys, compiled_patterns
+            )
+            extra_keys = set(coords) - matched_coords
             if extra_keys:
                 error = SchemaError(f"coords has extra keys: {extra_keys}")
                 if context:
                     context.handle_error(error)
                 else:
                     raise error

-        for key, da_schema in self.coords.items():
+        # Validate coordinates matching exact keys
+        for key, da_schema in exact_keys.items():
             if key not in coords:
                 error = SchemaError(f"key {key} not in coords")
                 if context:
@@ -107,6 +124,16 @@ def validate(
                 child_context = context.push(f"coords.{key}") if context else None
                 da_schema.validate(coords[key], child_context)

+        # Validate coordinates matching pattern keys
+        for pattern_key, da_schema in pattern_keys.items():
+            regex = compiled_patterns[pattern_key]
+            for coord_name in coords:
+                if regex.fullmatch(coord_name) and coord_name not in exact_keys:
+                    child_context = (
+                        context.push(f"coords.{coord_name}") if context else None
+                    )
+                    da_schema.validate(coords[coord_name], child_context)
+

 @_attrs.define(on_setattr=[_attrs.setters.convert, _attrs.setters.validate])
 class DataArraySchema(BaseSchema):