Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions docs/releases.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,9 @@ Adds an `IcechunkParser` for reading existing icechunk repositories as virtual d

### Bug fixes

- HDFParser now correctly parses datasets with either no fill value or a string dtype fill value.
([#988](https://github.com/zarr-developers/VirtualiZarr/pull/938)).
By [Sean Harkins](https://github.com/sharkinsspatial) and [Aimee Barciauskas](https://github.com/abarciauskas-bgse).
- Fix `vds.vz.to_icechunk()` raising `IcechunkError("invalid zarr key format")` when the manifest contains inlined chunks. The icechunk writer now always emits `c/0/0/0`-form chunk keys regardless of the manifest's stored chunk-key separator. Mainly surfaces with `IcechunkParser` (icechunk inlines small chunks aggressively); existing parsers don't produce inlined chunks and aren't affected.
([#991](https://github.com/zarr-developers/VirtualiZarr/pull/991)).
By [Tom Nicholas](https://github.com/TomNicholas).
Expand Down
6 changes: 4 additions & 2 deletions virtualizarr/codecs.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
import zarr
from zarr.abc.codec import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec
from zarr.abc.codec import Codec as ZarrCodec
from zarr.codecs import BytesCodec
from zarr.codecs import BytesCodec, VLenUTF8Codec
from zarr.core.codec_pipeline import BatchedCodecPipeline
from zarr.core.metadata.v3 import ArrayV3Metadata

Expand Down Expand Up @@ -106,7 +106,9 @@ def convert_to_codec_pipeline(
arrayarray_codecs, arraybytes_codec, bytesbytes_codecs = extract_codecs(zarr_codecs)

if arraybytes_codec is None:
if dtype.byteorder == ">":
if isinstance(dtype, np.dtypes.StringDType):
arraybytes_codec = VLenUTF8Codec()
elif dtype.byteorder == ">":
arraybytes_codec = BytesCodec(endian="big")
else:
arraybytes_codec = BytesCodec()
Expand Down
39 changes: 37 additions & 2 deletions virtualizarr/parsers/hdf/hdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,25 @@
from h5py import Group as H5Group


def _get_fill_value(dataset: H5Dataset):
"""
Extract the fill value from an h5py dataset, handling string/bytes dtypes
that don't return numpy scalars from dataset.fillvalue.
"""
try:
raw = dataset.fillvalue
except RuntimeError:
return np.ma.default_fill_value(dataset.dtype)

@abarciauskas-bgse abarciauskas-bgse May 18, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this could be introducing a different type of bug where now we are introducing a value for the zarr metadata fill_value that could differ from what is in the dataset's original metadata. I realized today that for GPM IMERG HH precipitation the attrs from the h5py Dataset include both _FillValue and CodeMissingValue which both are -9999.0:

h5ds = h5py.File(s3fs.open(gpm_imerg_s3_url))
h5ds['/Grid/precipitation'].attrs['CodeMissingValue']
# np.bytes_(b'-9999.9')
h5ds['/Grid/precipitation'].attrs['_FillValue']
# np.float32(-9999.9)

Whereas this fallback returns 1e+20:

np.ma.default_fill_value(h5ds['/Grid/precipitation'].dtype)
# 1e+20

I recognize that CodeMissingValue is not a standard attribute (as far as I can tell), but should the zarr fill_value attribute fall back to the _FillValue attribute picked up by h5py?

Looking at NASA-IMPACT/veda-odd#371 would this be handled by the more general purpose phase 3 "fill_value_defined() distinction: stop propagating h5py-default fills to zarr storage."

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxrjones curious what you think

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abarciauskas-bgse For clarification, fill_value and _FillValue are distinct semantic concepts https://virtualizarr.readthedocs.io/en/stable/custom_parsers.html#fill-values. fill_value is always set by HDF5 but as you noted above, not all types of fill_value that HDF5 can produce are supported by Zarr v3.

@abarciauskas-bgse abarciauskas-bgse May 19, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ok, thanks for reminding me about that distinction -- I recalled there were 2 semantic concepts but forgot about the details and that excellent documentation. In that case, since this function is for the first fill value concept -- Value for uninitialized chunks - (e.g., Zarr fill_value) -- then probably this fallback is reasonable?

And I recognize that the h5ds['/Grid/precipitation'].attrs['_FillValue'] is for the 2nd concept Sentinel value - (e.g., CF _FillValue ))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CodeMissingValue is a different issue, but this is a good flag since Zarr's default fill values could differ from numpy or support dtypes not defined in np.ma.default_fill_value

@sharkinsspatial why did you chose to use numpy's default fill values here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a change I made, to be clear, and in discussion with @maxrjones it sounds like there may be a Zarr native fallback to use to look into

if isinstance(raw, bytes):
return raw.decode("utf-8", errors="replace")
elif isinstance(raw, str):
return raw
elif isinstance(raw, np.generic):
return raw.item()
else:
return raw


def _construct_manifest_array(
filepath: str,
dataset: H5Dataset,
Expand Down Expand Up @@ -64,6 +83,22 @@ def _construct_manifest_array(
attrs = _extract_attrs(dataset)
dtype = dataset.dtype

# HDF5 variable-length strings use numpy object dtype, which zarr v3 cannot
# resolve automatically. Map to StringDType which zarr maps to VariableLengthUTF8.
# Discriminate against other object-kind HDF5 dtypes (vlen arrays, object/
# region references) that would silently be coerced to StringDType and
# produce garbage downstream — those aren't supported yet, so fail loudly.
if h5py.check_string_dtype(dtype) is not None:
dtype = np.dtypes.StringDType()
elif dtype.kind == "O":
raise NotImplementedError(
f"HDF5 object dtype {dtype!r} is not a variable-length string and "
f"is not yet supported by HDFParser. h5py exposes vlen arrays "
f"(`h5py.vlen_dtype`) and object/region references "
f"(`h5py.ref_dtype`, `h5py.regionref_dtype`) as numpy object dtype; "
f"please open an issue if your file needs one of these."
)

# Temporarily disable use CF->Codecs - TODO re-enable in subsequent PR.
# cfcodec = cfcodec_from_dataset(dataset)
# if cfcodec:
Expand All @@ -74,13 +109,13 @@ def _construct_manifest_array(
# else:
# dtype = dataset.dtype

if "_FillValue" in attrs:
if "_FillValue" in attrs and dtype.kind not in ("S", "U", "O", "T"):
encoded_cf_fill_value = encode_cf_fill_value(attrs["_FillValue"], dtype)
attrs["_FillValue"] = encoded_cf_fill_value

codec_configs = [zarr_codec_config_to_v3(codec.get_config()) for codec in codecs]

fill_value = dataset.fillvalue.item()
fill_value = _get_fill_value(dataset)
dims = tuple(_dataset_dims(dataset, group=group))
metadata = create_v3_array_metadata(
shape=dataset.shape,
Expand Down
32 changes: 32 additions & 0 deletions virtualizarr/tests/test_parsers/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -459,6 +459,38 @@ def chunked_roundtrip_hdf5_s3_file(minio_bucket, cf_array_fill_value_hdf5_file):
return f"s3://{minio_bucket['bucket']}/{filepath}"


@pytest.fixture(
params=[
{"dtype": "S10", "data": np.array([b"hello", b"world"], dtype="S10")},
{
"dtype": None,
"data": np.array(["hello", "world"], dtype=h5py.string_dtype()),
},
],
ids=["fixed-length-bytes", "variable-length-string"],
)
def string_dtype_hdf5_url(tmp_path: Path, request) -> str:
filepath = str(tmp_path / "string_dtype.nc")

with h5py.File(filepath, "w") as f:
f.create_dataset(name="data", data=request.param["data"])

return f"file://{filepath}"


@pytest.fixture
def string_dtype_with_fillvalue_hdf5_url(tmp_path: Path) -> str:
filepath = str(tmp_path / "string_dtype_fillvalue.nc")

with h5py.File(filepath, "w") as f:
dset = f.create_dataset(
name="data", data=np.array(["hello", "world"], dtype=h5py.string_dtype())
)
dset.attrs["_FillValue"] = ""

return f"file://{filepath}"


@pytest.fixture()
def big_endian_dtype_hdf5_file(tmpdir):
filepath = f"{tmpdir}/big_endian.nc"
Expand Down
30 changes: 30 additions & 0 deletions virtualizarr/tests/test_parsers/test_hdf/test_hdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,19 @@ def test_cf_fill_value(self, cf_fill_value_hdf5_file):
metadata = manifest_store._group.arrays["data"].metadata
assert "_FillValue" in metadata.attributes

def test_string_dtype_fill_value(self, string_dtype_hdf5_url):
manifest_store = manifest_store_from_hdf_url(string_dtype_hdf5_url)
metadata = manifest_store._group.arrays["data"].metadata
assert isinstance(metadata.fill_value, (str, bytes, np.bytes_))

def test_string_dtype_cf_fill_value(self, string_dtype_with_fillvalue_hdf5_url):
manifest_store = manifest_store_from_hdf_url(
string_dtype_with_fillvalue_hdf5_url
)
metadata = manifest_store._group.arrays["data"].metadata
assert "_FillValue" in metadata.attributes
assert isinstance(metadata.attributes["_FillValue"], str)

def test_cf_array_fill_value(self, cf_array_fill_value_hdf5_file):
cf_array_fill_value_hdf5_url = f"file://{cf_array_fill_value_hdf5_file}"
manifest_store = manifest_store_from_hdf_url(cf_array_fill_value_hdf5_url)
Expand Down Expand Up @@ -228,3 +241,20 @@ def test_netcdf_over_https():
):
np.testing.assert_allclose(ds["z"].min().to_numpy(), -6)
np.testing.assert_allclose(ds["z"].max().to_numpy(), 817)


def test_fillvalue_runtime_error():
from virtualizarr.parsers.hdf.hdf import _get_fill_value

dtype = np.dtype("float32")

class _RuntimeErrorDataset:
@property
def fillvalue(self):
raise RuntimeError("Unable to get fill value")

dataset = _RuntimeErrorDataset()
dataset.dtype = dtype # type: ignore[attr-defined]

result = _get_fill_value(dataset)
assert result == np.ma.default_fill_value(dtype)
Loading