
[Roadmap] Fill / sentinel-value handling in VirtualiZarr #371

Description

@maxrjones

Tracking the work to resolve the 15 open issues labelled fill-sentinel-values plus the related upstream gaps in xarray's FillValueCoder.

Framing

VirtualiZarr correctness is measured against the Zarr spec, not against xarray equivalence. Several recent reports (zarr-developers/VirtualiZarr#989, zarr-developers/VirtualiZarr#485, zarr-developers/VirtualiZarr#628) show xarray's FillValueCoder.decode failing on JSON-native scalars in zarr metadata: the parser produces spec-compliant output that xarray's HDF5-style coder can't consume. Those are upstream xarray issues, not virtualizarr bugs, and are tracked at pydata/xarray#11332.

The property-test infrastructure added in zarr-developers/VirtualiZarr#990 distinguishes failure categories:

| Failure shape | Attribution | Action |
| --- | --- | --- |
| Both engines fail identically | Upstream xarray / zarr-python | Track upstream; no virtualizarr PR |
| Observed (virtualizarr) fails; reference ok | Virtualizarr-specific bug | Fix in VirtualiZarr |
| Both succeed but differ | Real correctness gap | Fix in VirtualiZarr |
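
The triage rule in the table can be sketched as a small classifier. This is a minimal illustration, not the property-test code from zarr-developers/VirtualiZarr#990: `Outcome`, `FailureShape`, and `classify` are hypothetical names, and "fail identically" is assumed here to mean the same exception type and message.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureShape(Enum):
    BOTH_FAIL_IDENTICALLY = "track upstream"
    OBSERVED_FAILS_ONLY = "fix in VirtualiZarr (parser bug)"
    BOTH_SUCCEED_BUT_DIFFER = "fix in VirtualiZarr (correctness gap)"
    AGREE = "no action"
    UNCLASSIFIED = "needs manual triage"


@dataclass
class Outcome:
    """Result of opening one dataset through one engine (hypothetical type)."""
    value: object = None
    error: Optional[BaseException] = None


def classify(observed: Outcome, reference: Outcome) -> FailureShape:
    """Map a pair of outcomes onto the rows of the attribution table."""
    if observed.error is not None and reference.error is not None:
        # Assumption: "identical" means same exception type and message.
        same = (type(observed.error) is type(reference.error)
                and str(observed.error) == str(reference.error))
        return (FailureShape.BOTH_FAIL_IDENTICALLY if same
                else FailureShape.UNCLASSIFIED)
    if observed.error is not None:
        return FailureShape.OBSERVED_FAILS_ONLY
    if reference.error is not None:
        # Reference fails while observed succeeds: not covered by the table.
        return FailureShape.UNCLASSIFIED
    if observed.value != reference.value:
        return FailureShape.BOTH_SUCCEED_BUT_DIFFER
    return FailureShape.AGREE
```

Cases the table does not cover (different errors on each side, or only the reference failing) are deliberately routed to manual triage rather than guessed at.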

Root-cause clusters

The 15 open issues plus two new findings collapse into 8 underlying problems. Each issue is listed under its primary cluster; cross-cluster cascades are noted inline.

A. Parser crashes during fill extraction — local parser fixes, ~5-20 lines each.

B. HDF parser _FillValue encoding gaps — local parser fix; emit base64 for kind S per docs/custom_parsers.md.

C. xarray FillValueCoder lacking branches — upstream, tracked at pydata/xarray#11332. Out of virtualizarr scope.

D. h5py default fillvalue propagated indiscriminately — parser fix: use dataset.id.get_create_plist().fill_value_defined() to skip propagating defaults. Fixing D removes the cascade into C for vlen-string-without-_FillValue cases.

E. Cross-parser inconsistency — different parsers produce different fill defaults / metadata for the same source. Architectural fix.

F. Writer-side fill semantics — writer-API design questions, distinct from parser fixes.

G. Attribute serialization fidelity — zarr v3 metadata is JSON; lossy for some attribute shapes.

H. Cross-cutting encoding model — meta-discussion; closes via the totality of the other clusters.
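
For cluster D, the distinction between a writer-chosen fill and an h5py default might look like the sketch below. It uses h5py's low-level `fill_value_defined()` call named above; the helper name `explicit_fill_value` and the choice to return `None` for non-user-defined fills are assumptions for illustration, not the parser's actual code.

```python
import h5py


def explicit_fill_value(dataset: h5py.Dataset):
    """Return the fill value only if the file's writer set one explicitly.

    Returns None when HDF5 reports a library default or an undefined fill,
    so a caller can skip propagating it into zarr metadata.
    """
    plist = dataset.id.get_create_plist()
    status = plist.fill_value_defined()
    if status == h5py.h5d.FILL_VALUE_USER_DEFINED:
        return dataset.fillvalue
    # FILL_VALUE_DEFAULT or FILL_VALUE_UNDEFINED: the writer chose nothing.
    return None


# Usage with an in-memory file (nothing is written to disk).
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    implicit = f.create_dataset("implicit", shape=(4,), dtype="f4")
    explicit = f.create_dataset("explicit", shape=(4,), dtype="f4",
                                fillvalue=-999.0)
    implicit_fill = explicit_fill_value(implicit)
    explicit_fill = explicit_fill_value(explicit)
```

Under this sketch, only `explicit` yields a fill to propagate; what the parser should emit in place of `None` is left open here.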

Phases

  1. Local parser fixes (low risk): correct handling of the HDF5 fillvalue for string-dtype arrays (zarr-developers/VirtualiZarr#988), a structured-dtype guard at _extract_attrs, ZarrParser default lookup, and S-dtype base64 encoding. ~50 lines total across several small PRs.
  2. Upstream advocacy (parallel track, no virtualizarr PRs): pydata/xarray#11332 tracks the FillValueCoder JSON-native-scalar gap; engage with zarr-specs#351, zarr-extensions#33.
  3. fill_value_defined() distinction: stop propagating h5py-default fills to zarr storage.
  4. Cross-parser consistency: extend the property-test suite to Kerchunk, TIFF; port HDFParser conventions; document the contract in docs/custom_parsers.md.
  5. Writer-side round-trips: Icechunk / Kerchunk writers preserve fill semantics.
  6. Attribute fidelity: a policy for non-JSON-serializable attrs (zarr-developers/VirtualiZarr#715, "How to handle non-JSON serializable attributes?") and scalar dtype preservation across JSON metadata.

Phase 2 runs in parallel with all others. BothEnginesFailedIdenticallyError cases auto-resolve when xarray ships the fix; no virtualizarr code change required.
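
As an illustration of the S-dtype base64 item in phase 1: raw bytes are not JSON-serializable, so a bytes fill value has to be encoded before it can live in JSON metadata. The sketch below uses only the standard library; the helper names and the `fill_value` key layout are hypothetical, and the exact encoding contract is whatever docs/custom_parsers.md specifies.

```python
import base64
import json


def encode_bytes_fill(fill: bytes) -> str:
    """Base64-encode a bytes fill value so it can be stored as a JSON string."""
    return base64.standard_b64encode(fill).decode("ascii")


def decode_bytes_fill(encoded: str) -> bytes:
    """Inverse of encode_bytes_fill."""
    return base64.standard_b64decode(encoded)


# Round-trip through JSON metadata, including bytes that are not valid UTF-8.
metadata = json.dumps({"fill_value": encode_bytes_fill(b"\xff\x00abc")})
restored = decode_bytes_fill(json.loads(metadata)["fill_value"])
```

Base64 keeps arbitrary byte patterns intact, which a naive `bytes.decode()` into a JSON string would not for non-UTF-8 content.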

Status

References
