diff --git a/docs/data_structures.md b/docs/data_structures.md index dd1654b9..94e0979a 100644 --- a/docs/data_structures.md +++ b/docs/data_structures.md @@ -244,8 +244,9 @@ NotImplementedError: ManifestArrays can't be converted into numpy arrays or pand The whole point is to manipulate references to the data without actually loading any data. !!! note - You also cannot currently index into a `ManifestArray`, as arbitrary indexing would require loading data values to create the new array. - We could imagine supporting indexing without loading data when slicing only along chunk boundaries, but this has not yet been implemented (see [GH issue #51](https://github.com/zarr-developers/VirtualiZarr/issues/51)). + You can index into a `ManifestArray` as long as the selection aligns with chunk boundaries — slicing through the interior of a chunk would require loading the chunk's bytes, which a virtual array deliberately cannot do. + Chunk-aligned integer and slice indexing is supported, including mixed integer + slice indexers; integer indexers drop the indexed axis as in numpy. Misaligned selections raise `SubChunkIndexingError`. + Arbitrary fancy indexing (e.g. with a boolean mask or integer array) is not supported, since it would generally require loading data. ## Zarr Groups diff --git a/docs/faq.md b/docs/faq.md index 4bc9782e..5d4e3476 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -174,7 +174,7 @@ Users of Kerchunk may find the following comparison table useful, which shows wh | Renaming dimensions | ❌ | `xarray.Dataset.rename_dims` | | Renaming manifest file paths | `kerchunk.utils.rename_target` | `vds.vz.rename_paths` | | Splitting uncompressed data into chunks | `kerchunk.utils.subchunk` | `xarray.Dataset.chunk` (❌ Not yet implemented - see [PR #199](https://github.com/zarr-developers/VirtualiZarr/pull/199)) -| Selecting specific chunks | ❌ | `xarray.Dataset.isel` (❌ Not yet implemented - see [issue #51](https://github.com/zarr-developers/VirtualiZarr/issues/51)) | +| Selecting specific chunks | ❌ | `xarray.Dataset.isel` (✅ chunk-aligned selections only) | **Parallelization** | | | | Parallelized generation of references | Wrapping kerchunk's opener inside `dask.delayed` | Wrapping `open_virtual_dataset` inside `dask.delayed` | Parallelized combining of references (tree-reduce) | `kerchunk.combine.auto_dask` | Wrapping `ManifestArray` objects within `dask.array.Array` objects inside `xarray.Dataset` to use dask's `concatenate` (⚠️ Untested, but also unnecessary) | diff --git a/docs/releases.md b/docs/releases.md index 0f1ce158..7b12814f 100644 --- a/docs/releases.md +++ b/docs/releases.md @@ -1,5 +1,18 @@ # Release notes +## Unreleased + +### New Features + +- `ManifestArray` now supports chunk-aligned integer and slice indexing along each axis, including multi-chunk slices, mixed integer + slice indexers, and selections that include a partial final chunk. Integer indexers drop the indexed axis (numpy / array-API semantics) and are legal only when `chunk_size == 1` along that axis; slice indexers preserve the axis. This makes `xarray.Dataset.isel` work end-to-end on virtual datasets for any chunk-aligned selection. Indexers that would split individual chunks raise a new `SubChunkIndexingError` (a `ValueError` subclass) — a permanent constraint of a virtual array, not a missing feature. Previously slice misalignment silently no-op'd while integer indexing unconditionally raised `NotImplementedError`. Closes [#51](https://github.com/zarr-developers/VirtualiZarr/issues/51), supersedes [#499](https://github.com/zarr-developers/VirtualiZarr/pull/499). + By [Tom Nicholas](https://github.com/TomNicholas). + +### Bug fixes + +### Documentation + +### Internal changes + ## v2.6.1 (3rd May 2026) Adds end-to-end support for inlined chunk references in `ChunkManifest` (read via Kerchunk parsers, write via Kerchunk and Icechunk writers), plus Zarr-Python 3.2.0 compatibility and several bug fixes. diff --git a/docs/scaling.md b/docs/scaling.md index 1ef04fe6..9ca7b940 100644 --- a/docs/scaling.md +++ b/docs/scaling.md @@ -308,6 +308,37 @@ for i, batch in enumerate(file_batches): Notice this workflow could also be used for appending data only as it becomes available, e.g. by replacing the for loop with a cron job. +### Splitting a single large virtual dataset across commits + +A single Icechunk commit cannot include more than 50 million chunk references at once. +If a single source — typically a massive Zarr store opened via [`ZarrParser`][virtualizarr.parsers.ZarrParser] — produces a virtual dataset whose arrays together exceed that, you can't write it in one transaction even after all the references are already in memory. + +In that case you can slice the virtual dataset along an axis where the slicing falls on chunk boundaries (often `time`), and commit each slice with `append_dim`. Chunk-aligned slicing on a `ManifestArray` (and therefore on the variables of a virtual `xarray.Dataset`) only subsets the manifest, so this is cheap — no chunks are loaded. + +```python +import icechunk as ic + +# Parse the giant Zarr store once, producing a virtual dataset that exceeds +# 50M refs in total but whose `time` axis is chunked. +vds = vz.open_virtual_dataset(, parser=ZarrParser(), registry=registry) + +chunk_size_time = vds.chunksizes["time"] # must align the splits to chunk boundaries +step = chunk_size_time * N # pick N so that each slice has < 50M refs + +repo = ic.Repository.open() + +for i, start in enumerate(range(0, vds.sizes["time"], step)): + session = repo.writable_session("main") + slice_vds = vds.isel(time=slice(start, start + step)) + append_dim = "time" if i > 0 else None + slice_vds.vz.to_icechunk(session.store, append_dim=append_dim) + session.commit(f"wrote virtual references for time slice {i}") +``` + +If the slice boundaries don't align with chunk edges along that axis, the indexing call raises `SubChunkIndexingError`. + +(Remember you can also subset the Dataset to specific variables and commit those separately too if necessary.) + ### Retries Sometimes an [`open_virtual_dataset`][virtualizarr.open_virtual_dataset] call might fail for a transient reason, such as a failed HTTP response from a server. diff --git a/virtualizarr/manifests/array.py b/virtualizarr/manifests/array.py index de42c853..9e581471 100644 --- a/virtualizarr/manifests/array.py +++ b/virtualizarr/manifests/array.py @@ -155,8 +155,6 @@ def __array_function__(self, func, types, args, kwargs) -> Any: return MANIFESTARRAY_HANDLED_ARRAY_FUNCTIONS[func](*args, **kwargs) - # Everything beyond here is basically just to make this array class wrappable by xarray # - def __array_ufunc__(self, ufunc, method, *inputs, **kwargs) -> Any: """We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs.""" if ufunc == np.isnan: @@ -227,16 +225,37 @@ def __getitem__( /, ) -> "ManifestArray": """ - Perform numpy-style indexing on this ManifestArray. + Index into this ManifestArray, returning a new ManifestArray view over a subset of chunks. + + Supports only chunk-aligned selections. A ManifestArray only stores references to where + each chunk's bytes live, never their decoded values, so any indexer that would split into + the interior of a chunk would require loading the underlying data — which defeats the + point of a virtual array. Selections that would do so raise ``SubChunkIndexingError`` + (a ``ValueError`` subclass); this is a permanent constraint, not a missing feature. - Only supports limited indexing, because in general you cannot slice inside of a compressed chunk. - Mainly required because Xarray uses this instead of expand dims (by passing Nones) and often will index with a no-op. + Supported indexers (and tuples thereof): - Could potentially support indexing with slices aligned along chunk boundaries, but currently does not. + - ``Ellipsis`` and ``None`` — no-ops and new-axis insertion. + - ``slice`` with ``step == 1`` whose start and stop land on chunk boundaries + (``stop == axis_length`` is also allowed, so a partial final chunk can be selected). + Slice indexers preserve the axis. + - ``int`` — drops the indexed axis, following numpy / array-API semantics. Only legal + when ``chunk_size == 1`` along that axis; otherwise picking a single element would + require splitting a chunk. + + Anything else — fancy indexing with arrays, misaligned slices, ``step != 1`` — + raises ``SubChunkIndexingError`` or ``NotImplementedError``. Parameters ---------- key + A basic indexer or tuple of basic indexers, one per array axis (with ``Ellipsis`` + and ``None`` allowed as per the array API). + + Returns + ------- + ManifestArray + A new array whose ``ChunkManifest`` references only the selected chunks. """ return index(self, key) diff --git a/virtualizarr/manifests/indexing.py b/virtualizarr/manifests/indexing.py index 5d69e1f9..580808c4 100644 --- a/virtualizarr/manifests/indexing.py +++ b/virtualizarr/manifests/indexing.py @@ -1,9 +1,11 @@ from types import EllipsisType -from typing import TYPE_CHECKING, TypeAlias, cast +from typing import TYPE_CHECKING, Any, TypeAlias, cast import numpy as np from virtualizarr.manifests.array_api import expand_dims +from virtualizarr.manifests.manifest import ChunkManifest +from virtualizarr.manifests.utils import copy_and_replace_metadata # indexer with only basic selectors, no new axes or ellipsis T_BasicIndexer_1d: TypeAlias = int | slice | np.ndarray @@ -14,6 +16,12 @@ T_Indexer: TypeAlias = T_Indexer_1d | tuple[T_Indexer_1d, ...] +class SubChunkIndexingError(ValueError): + """ + Raised when an indexer would split individual chunks of a compressed ManifestArray. + """ + + if TYPE_CHECKING: from virtualizarr.manifests.array import ManifestArray @@ -138,56 +146,180 @@ def apply_indexer( def apply_selection( marr: "ManifestArray", indexer_without_newaxes: tuple[T_BasicIndexer_1d, ...] ) -> "ManifestArray": - """Applies indexes to subset along each dimension.""" + """ + Apply chunk-aligned subsetting along each dimension. + + Slices and integer indexers are only supported if they align with chunk boundaries, since + splitting individual chunks would require loading their bytes. See GitHub issue #51. + + Integer indexers drop the indexed axis (numpy / array-API semantics); slice indexers + preserve it. Integer indexing is only legal when ``chunk_size == 1`` along that axis, + since picking a single element of a larger chunk would require splitting it. + """ + from virtualizarr.manifests.array import ManifestArray # at this point there should be no ellipsis, no Nones, and one 1D indexer for each axis. assert len(indexer_without_newaxes) == marr.ndim - output_arr = marr - for axis, (length, indexer_1d) in enumerate( - zip(marr.shape, indexer_without_newaxes) + # validate types and reject anything we can never support per-axis + narrowed_indexers: list[int | slice] = [] + for indexer_1d in indexer_without_newaxes: + if isinstance(indexer_1d, np.ndarray): + raise NotImplementedError( + f"Unsupported indexer. So-called 'fancy indexing' via numpy arrays is not supported, but received {indexer_1d}" + ) + if not isinstance(indexer_1d, (int, slice)): + raise TypeError(f"Invalid indexer type: {indexer_1d}") + narrowed_indexers.append(indexer_1d) + + new_shape: list[int] = [] + new_chunks: list[int] = [] + chunk_grid_selectors: list[int | slice] = [] + kept_axes: list[int] = [] + for axis, (axis_length, chunk_size, indexer_1d) in enumerate( + zip(marr.shape, marr.chunks, narrowed_indexers, strict=True) ): - output_arr = apply_selection_1d(output_arr, indexer_1d, length) - - return output_arr + chunk_grid_selector, new_axis_length = _compute_chunk_aligned_selection_1d( + indexer_1d, axis_length=axis_length, chunk_size=chunk_size + ) + chunk_grid_selectors.append(chunk_grid_selector) + # int selectors drop the axis from the output array + if not isinstance(indexer_1d, int): + new_shape.append(new_axis_length) + new_chunks.append(chunk_size) + kept_axes.append(axis) + + chunk_grid_selectors_tuple = tuple(chunk_grid_selectors) + + # short-circuit if every axis selects the whole chunk grid via a slice (a no-op) + if all( + isinstance(cgs, slice) and cgs == slice(0, dim, 1) + for cgs, dim in zip(chunk_grid_selectors_tuple, marr.manifest.shape_chunk_grid) + ): + return marr + + new_manifest = _subset_manifest(marr.manifest, chunk_grid_selectors_tuple) + old_dimension_names = marr.metadata.dimension_names + # zarr's dimension_names is tuple[str | None, ...] but copy_and_replace_metadata's + # type hint says Iterable[str]; the runtime handles None entries fine, so cast through. + new_dimension_names: Any + if old_dimension_names is None: + new_dimension_names = "default" # sentinel: leave as None + else: + new_dimension_names = tuple(old_dimension_names[a] for a in kept_axes) + new_metadata = copy_and_replace_metadata( + marr.metadata, + new_shape=new_shape, + new_chunks=new_chunks, + new_dimension_names=new_dimension_names, + ) + return ManifestArray(metadata=new_metadata, chunkmanifest=new_manifest) -def apply_selection_1d( - marr: "ManifestArray", indexer_1d: T_BasicIndexer_1d, length: int -) -> "ManifestArray": +def _compute_chunk_aligned_selection_1d( + indexer_1d: int | slice, axis_length: int, chunk_size: int +) -> tuple[int | slice, int]: """ - Actually index the ManifestArray along 1 dimension. + Translate a 1D array-space indexer (int or slice) into a chunk-grid selector plus the + new axis length. The selector is an ``int`` for int indexers (so the chunk-grid axis + is dropped) and a ``slice`` for slice indexers (so the chunk-grid axis is preserved). - Notice that none of these options actually do any indexing right now! + Raises SubChunkIndexingError if the selection would require splitting individual chunks. """ - - if isinstance(indexer_1d, slice): - if slice_is_no_op(indexer_1d, axis_length=length): - pass - else: - NotImplementedError( - f"Unsupported indexer. Indexing within a ManifestArray using ints or slices is not yet supported (see GitHub issue #51), but received {indexer_1d}" + if isinstance(indexer_1d, int): + i = indexer_1d + if i < 0: + # Allow negative indexing - this makes it wrap around + i += axis_length + if not (0 <= i < axis_length): + raise IndexError( + f"index {indexer_1d} is out of bounds for axis with size {axis_length}" ) - elif isinstance(indexer_1d, int): - # TODO cover possibility of indexing into a length-1 dimension (which just removes that dimension)? - raise NotImplementedError( - f"Unsupported indexer. Indexing within a ManifestArray using ints or slices is not yet supported (see GitHub issue #51), but received {indexer_1d}" - ) - elif isinstance(indexer_1d, np.ndarray): - raise NotImplementedError( - f"Unsupported indexer. So-called 'fancy indexing' via numpy arrays is not supported, but received {indexer_1d}" - ) + start, stop, step = i, i + 1, 1 else: - # should never get here - raise TypeError(f"Invalid indexer type: {indexer_1d}") + start, stop, step = indexer_1d.indices(axis_length) + + # The final chunk may legitimately be partial, so allow stop == axis_length even if + # axis_length % chunk_size != 0; otherwise both endpoints must land on chunk boundaries. + # TODO step != 1 we can actually support for uncompressed arrays along the first axis, + # see https://github.com/zarr-developers/VirtualiZarr/issues/86 + if ( + step != 1 + or start % chunk_size != 0 + or (stop != axis_length and stop % chunk_size != 0) + ): + raise SubChunkIndexingError( + f"Cannot index ManifestArray axis of length {axis_length} and chunk length " + f"{chunk_size} with {indexer_1d!r}: slice would split individual chunks, " + "which a ManifestArray cannot do without loading the underlying data." + ) + + chunk_start = start // chunk_size - return marr + if isinstance(indexer_1d, int): + # int indexer drops the array axis, so the chunk-grid axis is dropped too via an int selector + return chunk_start, 1 + # slice indexer: ceil-divide stop so a partial final chunk is included when stop == axis_length + chunk_stop = -(-stop // chunk_size) + return slice(chunk_start, chunk_stop, 1), stop - start -def slice_is_no_op(slice_indexer_1d: slice, axis_length: int) -> bool: - if slice_indexer_1d == slice(None): - return True - elif slice_indexer_1d == slice(0, axis_length, 1): - return True + +def _subset_manifest( + manifest: ChunkManifest, chunk_grid_selectors: tuple[int | slice, ...] +) -> ChunkManifest: + """ + Subset a ChunkManifest by indexing its underlying chunk-grid arrays. Each entry of + ``chunk_grid_selectors`` is either an int (which drops the corresponding chunk-grid axis) + or a slice (which keeps it). + """ + # When every axis is int-indexed, numpy returns a 0D scalar (Python str for StringDType, + # numpy scalar for the numeric arrays). Wrap back into a 0D ndarray so from_arrays accepts it. + # np.asarray's return type erases the dtype param, so cast back to keep from_arrays happy. + new_paths = cast( + "np.ndarray[Any, np.dtypes.StringDType]", + np.asarray(manifest._paths[chunk_grid_selectors], dtype=manifest._paths.dtype), + ) + new_offsets = cast( + "np.ndarray[Any, np.dtype[np.uint64]]", + np.asarray( + manifest._offsets[chunk_grid_selectors], dtype=manifest._offsets.dtype + ), + ) + new_lengths = cast( + "np.ndarray[Any, np.dtype[np.uint64]]", + np.asarray( + manifest._lengths[chunk_grid_selectors], dtype=manifest._lengths.dtype + ), + ) + + if manifest._inlined: + # For each old chunk-grid key, keep it only if int-indexed axes match exactly and + # slice-indexed axes fall inside the slice. Re-map the surviving key by omitting + # int-indexed positions and offsetting slice-indexed positions by the slice start. + new_inlined: dict[tuple[int, ...], bytes] = {} + for coords, data in manifest._inlined.items(): + new_coord: list[int] = [] + keep = True + for coord, sel in zip(coords, chunk_grid_selectors): + if isinstance(sel, int): + if coord != sel: + keep = False + break + else: + if not (sel.start <= coord < sel.stop): + keep = False + break + new_coord.append(coord - sel.start) + if keep: + new_inlined[tuple(new_coord)] = data else: - return False + new_inlined = {} + + return ChunkManifest.from_arrays( + paths=new_paths, + offsets=new_offsets, + lengths=new_lengths, + inlined=new_inlined, + validate_paths=False, + ) diff --git a/virtualizarr/tests/test_manifests/test_array.py b/virtualizarr/tests/test_manifests/test_array.py index ba46e001..08187ec1 100644 --- a/virtualizarr/tests/test_manifests/test_array.py +++ b/virtualizarr/tests/test_manifests/test_array.py @@ -7,6 +7,7 @@ ZLIB_CODEC, ) from virtualizarr.manifests import ChunkManifest, ManifestArray +from virtualizarr.manifests.indexing import SubChunkIndexingError class TestInit: @@ -843,88 +844,35 @@ def test_raise_on_multiple_ellipses( [ # obvious no-ops ((2,), (1,), slice(0, 2), (2,), (1,)), - # reduces shape - pytest.param( - (1,), - (1,), - 0, - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), - # requires chunk-aligned selection - pytest.param( - (2,), - (1,), - 0, - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), - pytest.param( - (2,), - (1,), - 1, - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), - pytest.param( - (2,), - (1,), - (0, ...), - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), - pytest.param( - (2,), - (1,), - (..., 0), - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), - pytest.param( - (2,), - (1,), - slice(0, 1), - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), - pytest.param( - (2,), - (1,), - (..., slice(0, 1)), - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), - pytest.param( - (2,), - (1,), - (slice(0, 1), ...), - (1,), - (1,), - marks=pytest.mark.xfail( - reason="Chunk-aligned indexing not yet implemented" - ), - ), + # integer indexing drops the indexed axis (numpy / array-API semantics). + # Only legal when chunk_size == 1 along that axis: otherwise picking a + # single element would require splitting a chunk. + ((1,), (1,), 0, (), ()), + ((2,), (1,), 0, (), ()), + ((2,), (1,), 1, (), ()), + ((2,), (1,), (0, ...), (), ()), + ((2,), (1,), (..., 0), (), ()), + # multi-axis integer indexing drops every indexed axis + ((2, 2), (1, 1), (0, 0), (), ()), + ((3, 3), (1, 1), (1, 2), (), ()), + # chunk-aligned slicing preserves the axis + ((2,), (1,), slice(0, 1), (1,), (1,)), + ((2,), (1,), (..., slice(0, 1)), (1,), (1,)), + ((2,), (1,), (slice(0, 1), ...), (1,), (1,)), + # multi-chunk slices and slices over multi-chunk axes + ((8,), (2,), slice(0, 4), (4,), (2,)), + ((8,), (2,), slice(2, 6), (4,), (2,)), + ((8,), (2,), slice(6, 8), (2,), (2,)), + # multi-dim slicing + ((4, 4), (2, 2), (slice(0, 2), slice(0, 2)), (2, 2), (2, 2)), + ((4, 4), (2, 2), (slice(2, 4), slice(0, 4)), (2, 4), (2, 2)), + # mixed integer + slice indexing — the integer-indexed axis drops, the + # slice-indexed axis stays. + ((2, 4), (1, 2), (0, slice(0, 2)), (2,), (2,)), + ((4, 4), (1, 2), (2, slice(0, 4)), (4,), (2,)), + # partial final chunk along an axis + ((5,), (2,), slice(0, 4), (4,), (2,)), + ((5,), (2,), slice(2, 5), (3,), (2,)), ], ) def test_chunk_selection_cases( @@ -935,6 +883,55 @@ def test_chunk_selection_cases( assert indexed.shape == out_shape assert indexed.chunks == out_chunks + def test_integer_indexing_subsets_manifest_to_one_chunk(self, array_v3_metadata): + # Make sure dropping the axis also drops the manifest's chunk-grid axis and + # keeps just the referenced chunk's bytes. + metadata = array_v3_metadata(shape=(4, 2), chunks=(1, 2)) + manifest = ChunkManifest( + entries={ + "0.0": {"path": "/a.nc", "offset": 0, "length": 16}, + "1.0": {"path": "/a.nc", "offset": 100, "length": 16}, + "2.0": {"path": "/a.nc", "offset": 200, "length": 16}, + "3.0": {"path": "/a.nc", "offset": 300, "length": 16}, + } + ) + marr = ManifestArray(metadata=metadata, chunkmanifest=manifest) + + result = marr[2, ...] + + assert result.shape == (2,) + assert result.chunks == (2,) + assert result.manifest.shape_chunk_grid == (1,) + assert result.manifest.dict() == { + "0": {"path": "file:///a.nc", "offset": 200, "length": 16}, + } + + @pytest.mark.parametrize( + "in_shape, in_chunks, indexer", + [ + # int on a multi-element chunk — only chunk_size == 1 permits int indexing + ((4,), (2,), 0), + ((4,), (2,), 1), + ((4,), (2,), 2), + ((4,), (2,), 3), + # slice start misaligned + ((4,), (2,), slice(1, 3)), + # slice stop misaligned (and stop != axis_length) + ((4,), (2,), slice(0, 3)), + ((4,), (2,), slice(0, 1)), + # step != 1 + ((4,), (2,), slice(0, 4, 2)), + # only one axis misaligned in a multi-dim selection + ((4, 4), (2, 2), (slice(0, 4), slice(0, 1))), + # int on a chunk_size > 1 axis even when other axes are fine + ((4, 4), (2, 2), (0, slice(0, 4))), + ], + ) + def test_misaligned_with_chunks(self, manifest_array, in_shape, in_chunks, indexer): + marr = manifest_array(shape=in_shape, chunks=in_chunks) + with pytest.raises(SubChunkIndexingError, match="split individual chunks"): + marr[indexer] + def test_to_xarray(array_v3_metadata): chunks = (5, 10) diff --git a/virtualizarr/tests/test_xarray.py b/virtualizarr/tests/test_xarray.py index d3095a44..631b711b 100644 --- a/virtualizarr/tests/test_xarray.py +++ b/virtualizarr/tests/test_xarray.py @@ -18,6 +18,7 @@ open_virtual_mfdataset, ) from virtualizarr.manifests import ChunkManifest, ManifestArray +from virtualizarr.manifests.indexing import SubChunkIndexingError from virtualizarr.parsers import HDFParser from virtualizarr.tests import ( requires_dask, @@ -1006,3 +1007,115 @@ def test_to_xarray_nonscalar_no_dimension_names(array_v3_metadata): with pytest.raises(ValueError, match="without dimension names"): marr.to_virtual_variable() + + +class TestIsel: + # Verifies the workflow documented in docs/scaling.md under + # "Splitting a single large virtual dataset across commits": slicing a virtual + # xarray.Dataset with .isel along a chunk-aligned axis subsets the underlying + # ChunkManifest without touching the data, and misaligned splits raise. + + def _virtual_dataset(self, array_v3_metadata): + # shape=(8, 4), chunks=(2, 4): 4 chunks along "time", 1 chunk along "x" + metadata = array_v3_metadata( + shape=(8, 4), chunks=(2, 4), dimension_names=["time", "x"] + ) + manifest = ChunkManifest( + entries={ + "0.0": {"path": "/a.nc", "offset": 0, "length": 64}, + "1.0": {"path": "/a.nc", "offset": 100, "length": 64}, + "2.0": {"path": "/a.nc", "offset": 200, "length": 64}, + "3.0": {"path": "/a.nc", "offset": 300, "length": 64}, + } + ) + marr = ManifestArray(metadata=metadata, chunkmanifest=manifest) + return xr.Dataset({"foo": (["time", "x"], marr)}) + + def test_isel_along_chunk_boundary(self, array_v3_metadata): + vds = self._virtual_dataset(array_v3_metadata) + + sliced = vds.isel(time=slice(2, 6)) + + assert sliced.sizes == {"time": 4, "x": 4} + assert isinstance(sliced["foo"].data, ManifestArray) + assert sliced["foo"].data.shape == (4, 4) + assert sliced["foo"].data.chunks == (2, 4) + # only the two middle chunks should remain, re-indexed from 0 + assert sliced["foo"].data.manifest.dict() == { + "0.0": {"path": "file:///a.nc", "offset": 100, "length": 64}, + "1.0": {"path": "file:///a.nc", "offset": 200, "length": 64}, + } + + def test_isel_single_chunk_via_length_1_slice(self, array_v3_metadata): + # A length-1 slice picks a single chunk while preserving the axis as length 1 + # (the array stays 2D). Useful when the caller wants to keep dimensions stable. + vds = self._virtual_dataset(array_v3_metadata) + + sliced = vds.isel(time=slice(2, 4)) + + assert sliced.sizes == {"time": 2, "x": 4} + assert isinstance(sliced["foo"].data, ManifestArray) + # slice(2, 4) on chunks=(2, 4) picks chunk index 1 (the second chunk) + assert sliced["foo"].data.manifest.dict() == { + "0.0": {"path": "file:///a.nc", "offset": 100, "length": 64}, + } + + def test_isel_integer_drops_axis(self, array_v3_metadata): + # Integer .isel drops the indexed axis (numpy / array-API semantics). Only + # works when chunk_size == 1 along that axis; otherwise it's sub-chunk indexing. + # shape=(4, 4), chunks=(1, 4): 4 single-row chunks along "time". + metadata = array_v3_metadata( + shape=(4, 4), chunks=(1, 4), dimension_names=["time", "x"] + ) + manifest = ChunkManifest( + entries={ + "0.0": {"path": "/a.nc", "offset": 0, "length": 32}, + "1.0": {"path": "/a.nc", "offset": 100, "length": 32}, + "2.0": {"path": "/a.nc", "offset": 200, "length": 32}, + "3.0": {"path": "/a.nc", "offset": 300, "length": 32}, + } + ) + marr = ManifestArray(metadata=metadata, chunkmanifest=manifest) + vds = xr.Dataset({"foo": (["time", "x"], marr)}) + + sliced = vds.isel(time=2) + + assert sliced["foo"].dims == ("x",) + assert sliced.sizes == {"x": 4} + assert isinstance(sliced["foo"].data, ManifestArray) + assert sliced["foo"].data.shape == (4,) + assert sliced["foo"].data.chunks == (4,) + assert sliced["foo"].data.manifest.dict() == { + "0": {"path": "file:///a.nc", "offset": 200, "length": 32}, + } + + def test_isel_integer_misaligned_raises(self, array_v3_metadata): + # chunk_size > 1 along the indexed axis — picking one element would split a chunk. + vds = self._virtual_dataset(array_v3_metadata) + + with pytest.raises(SubChunkIndexingError, match="split individual chunks"): + vds.isel(time=1) + + def test_isel_misaligned_raises(self, array_v3_metadata): + vds = self._virtual_dataset(array_v3_metadata) + + with pytest.raises(SubChunkIndexingError, match="split individual chunks"): + vds.isel(time=slice(1, 5)) + + def test_isel_iterative_append_simulation(self, array_v3_metadata): + # Simulate the scaling.md recipe: walk a chunk-aligned step across "time" + # and confirm each slice yields a valid ManifestArray with the expected refs. + vds = self._virtual_dataset(array_v3_metadata) + step = 4 # two chunks of size 2 per slice + + seen_refs = [] + for start in range(0, vds.sizes["time"], step): + slice_vds = vds.isel(time=slice(start, start + step)) + assert isinstance(slice_vds["foo"].data, ManifestArray) + seen_refs.extend( + entry["offset"] + for entry in slice_vds["foo"].data.manifest.dict().values() + ) + + # every original chunk should have been visited exactly once + assert sorted(seen_refs) == [0, 100, 200, 300]