Correctly handle HDF5 fillvalue for string dtype arrays.#988
Correctly handle HDF5 fillvalue for string dtype arrays.#988sharkinsspatial wants to merge 8 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #988 +/- ##
==========================================
- Coverage 89.91% 89.86% -0.06%
==========================================
Files 36 36
Lines 2212 2230 +18
==========================================
+ Hits 1989 2004 +15
- Misses 223 226 +3
🚀 New features to boost your workflow:
|
…llback (#993) * fix: handle H5Dataset missing fillvalue attribute with dtype-based fallback Some h5py Dataset objects do not expose a fillvalue attribute. Guard against AttributeError in _get_fill_value and fall back to np.ma.default_fill_value for the dataset's dtype. Adds a unit test covering this path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add release note * fix: catch RuntimeError (not AttributeError) from h5py fillvalue h5py.Dataset.fillvalue delegates to the HDF5 C library via Cython and raises RuntimeError for unsupported dtypes (e.g. variable-length strings, compound types). AttributeError can never fire on a real h5py.Dataset since fillvalue is always a defined property. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Good to merge now so I can include it in the release? |
let me take a quick look at the codecs changes. Assigning vlen bytes to all object dtypes seems a bit weird to me. |
| try: | ||
| raw = dataset.fillvalue | ||
| except RuntimeError: | ||
| return np.ma.default_fill_value(dataset.dtype) |
There was a problem hiding this comment.
I'm wondering if this could be introducing a different type of bug where now we are introducing a value for the zarr metadata fill_value that could differ from what is in the dataset's original metadata. I realized today that for GPM IMERG HH precipitation the attrs from the h5py Dataset include both _FillValue and CodeMissingValue which both are -9999.0:
h5ds = h5py.File(s3fs.open(gpm_imerg_s3_url))
h5ds['/Grid/precipitation'].attrs['CodeMissingValue']
# np.bytes_(b'-9999.9')
h5ds['/Grid/precipitation'].attrs['_FillValue']
# np.float32(-9999.9)Whereas this fallback returns 1e+20:
np.ma.default_fill_value(h5ds['/Grid/precipitation'].dtype)
# 1e+20
I recognize that CodeMissingValue is not a standard attribute (as far as I can tell), but should the zarr fill_value attribute fall back to the _FillValue attribute picked up by h5py?
Looking at NASA-IMPACT/veda-odd#371 would this be handled by the more general purpose phase 3 "fill_value_defined() distinction: stop propagating h5py-default fills to zarr storage."
There was a problem hiding this comment.
@abarciauskas-bgse For clarification, fill_value and _FillValue are distinct semantic concepts https://virtualizarr.readthedocs.io/en/stable/custom_parsers.html#fill-values. fill_value is always set by HDF5 but as you noted above, not all types of fill_value that HDF5 can produce are supported by Zarr v3.
There was a problem hiding this comment.
Right ok, thanks for reminding me about that distinction -- I recalled there were 2 semantic concepts but forgot about the details and that excellent documentation. In that case, since this function is for the first fill value concept -- Value for uninitialized chunks - (e.g., Zarr fill_value) -- then probably this fallback is reasonable?
And I recognize that the h5ds['/Grid/precipitation'].attrs['_FillValue'] is for the 2nd concept Sentinel value - (e.g., CF _FillValue ))
There was a problem hiding this comment.
The CodeMissingValue is a different issue, but this is a good flag since Zarr's default fill values could differ from numpy or support dtypes not defined in np.ma.default_fill_value
@sharkinsspatial why did you chose to use numpy's default fill values here?
There was a problem hiding this comment.
That was a change I made, to be clear, and in discussion with @maxrjones it sounds like there may be a Zarr native fallback to use to look into
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
What I did
Updated the HDF5 parser to support string dtype arrays with a
bytesorstrfillvalue.Acceptance criteria: