Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions docs/releases.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,12 @@
- `open_virtual_dataset` and `open_virtual_datatree` now populate `ds.encoding["source"]` with the normalized source URI, mirroring [`xarray.open_dataset`][]'s behaviour. Parsers that have already set `encoding["source"]` are left untouched.
By [Tom Nicholas](https://github.com/TomNicholas).

### Bug fixes

- `open_virtual_mfdataset(..., combine="nested")` now raises a clear, actionable `ValueError` when a `concat_dim` is a scalar (non-dimension) coordinate that is left virtual (e.g. when each file holds a single timestep so `time` is a scalar). Previously this produced an opaque `TypeError: Could not find a Chunk Manager which recognises type ManifestArray` from xarray, because concatenating along a new dimension requires building an index from the coordinate's values, which a virtual `ManifestArray` cannot provide. The new error tells the user to load the coordinate via `loadable_variables`. Related to [#114](https://github.com/zarr-developers/VirtualiZarr/issues/114).
([#1015](https://github.com/zarr-developers/VirtualiZarr/pull/1015)).
By [Max Jones](https://github.com/maxrjones).

### Documentation

- Document that virtual concatenation also requires homogeneous CF encoding (`scale_factor`/`add_offset`) across files — xarray's default attribute-merging silently drops mismatched values and produces incorrectly-decoded data on read. Added a new FAQ bullet and a warning admonition under "Combining virtual datasets" in the usage docs. See [#1004](https://github.com/zarr-developers/VirtualiZarr/issues/1004).
Expand Down
22 changes: 22 additions & 0 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -345,6 +345,28 @@ combined_vds = open_virtual_mfdataset(

We passed `combine='nested'` to specify that we want the datasets to be combined in the order they appear, using `xr.combine_nested` under the hood.

!!! warning "Concatenating along a scalar coordinate"
The examples above concatenate along `time`, which is already a *dimension* in each file (so each file holds many timesteps). Concatenating along an existing dimension just stacks the virtual references end-to-end and requires no in-memory data.

A common alternative layout stores **one timestep per file**, so that `time` is a *scalar* (0-dimensional) coordinate rather than a dimension. To combine these, xarray must promote that scalar to a new dimension and build an index for it, which requires the coordinate's values in memory. Because `time` is left virtual by default (the automatic loading only covers 1D dimension coordinates), the combine will fail unless you load it explicitly:

```python
combined_vds = open_virtual_mfdataset(
urls,
registry=registry,
parser=parser,
combine="nested",
concat_dim="time",
loadable_variables=["time"], # load the scalar concat coordinate into memory
coords="minimal",
compat="override",
)
```

Note that passing `loadable_variables` *replaces* the default set, so the other dimension coordinates (e.g. `lat`/`lon`) are no longer loaded automatically. The `coords="minimal"` and `compat="override"` arguments (see the note above) stop xarray from trying to load those still-virtual coordinates in order to compare them. Alternatively, list every coordinate you want in memory, e.g. `loadable_variables=["time", "lat", "lon"]`.

More generally, any coordinate you concatenate along that is *not already a dimension* in each source must be listed in `loadable_variables`.

### Ordering by coordinate values

If you're happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!
Expand Down
113 changes: 113 additions & 0 deletions virtualizarr/tests/test_xarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@
slow_test,
)
from virtualizarr.tests.utils import obstore_http, obstore_s3
from virtualizarr.xarray import _raise_on_virtual_scalar_concat_dim


def test_wrapping(array_v3_metadata):
Expand Down Expand Up @@ -968,6 +969,118 @@ def test_parallel_open(
)
xrt.assert_identical(combined_vds, expected_vds)

def test_nested_concat_dim_scalar_virtual_coord_raises(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should delete all the other tests from this PR except the first part of this one. Every other test here either tests a private internal, or tests a behaviour that should already be tested.

self, tmp_path, local_registry
):
# When each file holds a single timestep, `time` is a scalar coordinate that
# is left virtual by default. xarray.concat must build an index for the new
# dimension, which can't be done from a virtual ManifestArray. Raise a clear,
# actionable error pointing at loadable_variables instead of a cryptic
# chunk-manager TypeError.
filepath1 = tmp_path / "scalar_time1.nc"
filepath2 = tmp_path / "scalar_time2.nc"
with xr.tutorial.open_dataset("air_temperature") as ds:
ds.isel(time=0).to_netcdf(filepath1, format="NETCDF4")
ds.isel(time=1).to_netcdf(filepath2, format="NETCDF4")

parser = HDFParser()
with pytest.raises(ValueError, match="loadable_variables"):
open_virtual_mfdataset(
[str(filepath1), str(filepath2)],
registry=local_registry,
parser=parser,
combine="nested",
concat_dim="time",
)

# ...and that loading the scalar concat coordinate makes it work. The other
# coordinates (lat/lon) are left virtual here, so we also pass
# coords="minimal"/compat="override" to avoid xarray loading them for
# comparison (the documented pattern for combining virtual datasets).
combined_vds = open_virtual_mfdataset(
[str(filepath1), str(filepath2)],
registry=local_registry,
parser=parser,
combine="nested",
concat_dim="time",
loadable_variables=["time"],
coords="minimal",
compat="override",
)
assert combined_vds.sizes["time"] == 2

def test_nested_concat_dim_none_merges_without_raising(
self, netcdf4_file, local_registry
):
# combine="nested" with concat_dim=None is the merge-only path: datasets are
# merged rather than concatenated, so no new dimension/index is built and the
# scalar-concat guard must not fire. Merging a file with itself shares an
# identical grid, so the virtual variables need no reindexing; compat="override"
# avoids comparing (and thus loading) the duplicate virtual data variable.
parser = HDFParser()

combined_vds = open_virtual_mfdataset(
[netcdf4_file, netcdf4_file],
registry=local_registry,
parser=parser,
combine="nested",
concat_dim=None,
compat="override",
)

with open_virtual_dataset(
url=netcdf4_file, registry=local_registry, parser=parser
) as expected_vds:
xrt.assert_identical(combined_vds, expected_vds)


class TestRaiseOnVirtualScalarConcatDim:
"""Unit tests for the pre-combine guard against virtual scalar concat dimensions."""

def _scalar_virtual_coord_ds(self, array_v3_metadata):
metadata = array_v3_metadata(chunks=(), shape=())
manifest = ChunkManifest(
entries={"0": {"path": "/foo.nc", "offset": 0, "length": 8}}
)
marr = ManifestArray(metadata=metadata, chunkmanifest=manifest)
return xr.Dataset(coords=xr.Coordinates({"time": ((), marr)}, indexes={}))

def test_raises_for_scalar_virtual_concat_dim(self, array_v3_metadata):
ds = self._scalar_virtual_coord_ds(array_v3_metadata)
with pytest.raises(ValueError, match="loadable_variables"):
_raise_on_virtual_scalar_concat_dim([ds], ["time"])

def test_no_raise_for_loaded_scalar_concat_dim(self):
ds = xr.Dataset(coords={"time": ((), np.array(5.0))})
_raise_on_virtual_scalar_concat_dim([ds], ["time"])

def test_no_raise_for_existing_virtual_dim_coord(self, array_v3_metadata):
# Concatenating along an existing dimension does not build a new index, so a
# virtual dimension coordinate is fine.
metadata = array_v3_metadata(chunks=(2,), shape=(2,))
manifest = ChunkManifest(
entries={"0": {"path": "/foo.nc", "offset": 0, "length": 16}}
)
marr = ManifestArray(metadata=metadata, chunkmanifest=manifest)
ds = xr.Dataset(coords=xr.Coordinates({"time": (("time",), marr)}, indexes={}))
_raise_on_virtual_scalar_concat_dim([ds], ["time"])

def test_no_raise_for_non_string_or_absent_concat_dim(self):
ds = xr.Dataset(coords={"time": ((), np.array(5.0))})
_raise_on_virtual_scalar_concat_dim([ds], [None])
_raise_on_virtual_scalar_concat_dim([ds], ["nonexistent"])

def test_raises_when_only_one_dataset_is_offending(self, array_v3_metadata):
# The dimension may be scalar+virtual in only some of the files; a single
# offending source is enough for concat to fail.
good = xr.Dataset(coords={"time": ((), np.array(5.0))})
bad = self._scalar_virtual_coord_ds(array_v3_metadata)
with pytest.raises(ValueError, match="loadable_variables"):
_raise_on_virtual_scalar_concat_dim([good, bad], ["time"])

def test_no_raise_for_empty_datasets(self):
_raise_on_virtual_scalar_concat_dim([], ["time"])


def test_drop_variables(netcdf4_file, local_registry):
parser = HDFParser()
Expand Down
54 changes: 54 additions & 0 deletions virtualizarr/xarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -383,6 +383,7 @@ def open_virtual_mfdataset(
# Combine all datasets, closing them in case of a ValueError
try:
if combine == "nested":
_raise_on_virtual_scalar_concat_dim(virtual_datasets, concat_dim)
# Combined nested list by successive concat and merge operations
# along each dimension, using structure given by "ids"
combined_vds = _nested_combine(
Expand Down Expand Up @@ -430,6 +431,59 @@ def open_virtual_mfdataset(
return combined_vds


def _raise_on_virtual_scalar_concat_dim(
virtual_datasets: Sequence[Dataset],
concat_dims: (
str
| DataArray
| Index
| Sequence[str]
| Sequence[DataArray]
| Sequence[Index]
| None
),
) -> None:
"""Raise a clear error when a concatenation dimension is an unloaded scalar coordinate.

When ``concat_dim`` names a scalar (non-dimension) coordinate, ``xarray.concat``
promotes it to a new dimension and builds a pandas index from its values. That
requires materializing the coordinate as a numpy array, which is impossible for a
virtual [ManifestArray][virtualizarr.manifests.ManifestArray]; the result is an
opaque ``TypeError`` from xarray's chunk-manager lookup. Such a coordinate must
instead be loaded into memory via ``loadable_variables``.

Index creation happens in ``expand_dims`` on each dataset individually, so a single
offending source is enough to fail the concat. We therefore check every dataset, not
just the first, since the datasets may be heterogeneous (e.g. the dimension could be
scalar in only one file).

Concatenating along an *existing* dimension does not create a new index, so a virtual
dimension coordinate is fine in that case and is left untouched here.
"""
if concat_dims is None:
return
dims: Sequence[Any] = (
[concat_dims]
if isinstance(concat_dims, str | DataArray | Index)
else concat_dims
)
concat_dim_names = [dim for dim in dims if isinstance(dim, str)]
for ds in virtual_datasets:
for dim in concat_dim_names:
if dim in ds.dims:
continue
var = ds.variables.get(dim)
if var is not None and isinstance(var.data, ManifestArray):
raise ValueError(
f"Cannot concatenate along {dim!r} because it is a scalar "
"coordinate that has not been loaded into memory in at least one "
"source. xarray builds an index for the new dimension when "
"concatenating, which requires the coordinate's values, but it is "
"backed by a virtual ManifestArray. Load it by passing "
f"loadable_variables=[{dim!r}, ...] to open_virtual_mfdataset."
)


def construct_fully_virtual_dataset(
virtual_vars: Mapping[str, xr.Variable],
coord_names: Iterable[str] | None = None,
Expand Down
Loading