Skip to content

Correctly handle HDF5 fillvalue for string dtype arrays.#988

Open
sharkinsspatial wants to merge 8 commits into
mainfrom
fix/problem_fillvalues
Open

Correctly handle HDF5 fillvalue for string dtype arrays.#988
sharkinsspatial wants to merge 8 commits into
mainfrom
fix/problem_fillvalues

Conversation

@sharkinsspatial

Copy link
Copy Markdown
Collaborator

What I did

Updated the HDF5 parser to support string dtype arrays with a bytes or str fillvalue.

Acceptance criteria:

@codecov

codecov Bot commented May 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.36364% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.86%. Comparing base (122d959) to head (b38e56c).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
virtualizarr/parsers/hdf/hdf.py 83.33% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #988      +/-   ##
==========================================
- Coverage   89.91%   89.86%   -0.06%     
==========================================
  Files          36       36              
  Lines        2212     2230      +18     
==========================================
+ Hits         1989     2004      +15     
- Misses        223      226       +3     
Files with missing lines Coverage Δ
virtualizarr/codecs.py 93.44% <100.00%> (+0.22%) ⬆️
virtualizarr/parsers/hdf/hdf.py 94.07% <83.33%> (-1.51%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread virtualizarr/parsers/hdf/hdf.py Outdated
abarciauskas-bgse and others added 3 commits May 18, 2026 15:40
…llback (#993)

* fix: handle H5Dataset missing fillvalue attribute with dtype-based fallback

Some h5py Dataset objects do not expose a fillvalue attribute. Guard against
AttributeError in _get_fill_value and fall back to np.ma.default_fill_value
for the dataset's dtype. Adds a unit test covering this path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add release note

* fix: catch RuntimeError (not AttributeError) from h5py fillvalue

h5py.Dataset.fillvalue delegates to the HDF5 C library via Cython and
raises RuntimeError for unsupported dtypes (e.g. variable-length strings,
compound types). AttributeError can never fire on a real h5py.Dataset
since fillvalue is always a defined property.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@TomNicholas

Copy link
Copy Markdown
Member

Good to merge now so I can include it in the release?

@maxrjones

Copy link
Copy Markdown
Member

Good to merge now so I can include it in the release?

let me take a quick look at the codecs changes. Assigning vlen bytes to all object dtypes seems a bit weird to me.

Comment thread virtualizarr/codecs.py Outdated
try:
raw = dataset.fillvalue
except RuntimeError:
return np.ma.default_fill_value(dataset.dtype)

@abarciauskas-bgse abarciauskas-bgse May 18, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this could be introducing a different type of bug where now we are introducing a value for the zarr metadata fill_value that could differ from what is in the dataset's original metadata. I realized today that for GPM IMERG HH precipitation the attrs from the h5py Dataset include both _FillValue and CodeMissingValue which both are -9999.0:

h5ds = h5py.File(s3fs.open(gpm_imerg_s3_url))
h5ds['/Grid/precipitation'].attrs['CodeMissingValue']
# np.bytes_(b'-9999.9')
h5ds['/Grid/precipitation'].attrs['_FillValue']
# np.float32(-9999.9)

Whereas this fallback returns 1e+20:

np.ma.default_fill_value(h5ds['/Grid/precipitation'].dtype)
# 1e+20

I recognize that CodeMissingValue is not a standard attribute (as far as I can tell), but should the zarr fill_value attribute fall back to the _FillValue attribute picked up by h5py?

Looking at NASA-IMPACT/veda-odd#371 would this be handled by the more general purpose phase 3 "fill_value_defined() distinction: stop propagating h5py-default fills to zarr storage."

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxrjones curious what you think

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abarciauskas-bgse For clarification, fill_value and _FillValue are distinct semantic concepts https://virtualizarr.readthedocs.io/en/stable/custom_parsers.html#fill-values. fill_value is always set by HDF5 but as you noted above, not all types of fill_value that HDF5 can produce are supported by Zarr v3.

@abarciauskas-bgse abarciauskas-bgse May 19, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ok, thanks for reminding me about that distinction -- I recalled there were 2 semantic concepts but forgot about the details and that excellent documentation. In that case, since this function is for the first fill value concept -- Value for uninitialized chunks - (e.g., Zarr fill_value) -- then probably this fallback is reasonable?

And I recognize that the h5ds['/Grid/precipitation'].attrs['_FillValue'] is for the 2nd concept Sentinel value - (e.g., CF _FillValue ))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CodeMissingValue is a different issue, but this is a good flag since Zarr's default fill values could differ from numpy or support dtypes not defined in np.ma.default_fill_value

@sharkinsspatial why did you chose to use numpy's default fill values here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a change I made, to be clear, and in discussion with @maxrjones it sounds like there may be a Zarr native fallback to use to look into

sharkinsspatial and others added 2 commits May 19, 2026 11:29
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fill-sentinel-values

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failure to read HDF5 data with data arrays of dtype='<U1'

4 participants