Merged
5 changes: 3 additions & 2 deletions docs/data_structures.md
@@ -244,8 +244,9 @@ NotImplementedError: ManifestArrays can't be converted into numpy arrays or pand
The whole point is to manipulate references to the data without actually loading any data.

!!! note
You also cannot currently index into a `ManifestArray`, as arbitrary indexing would require loading data values to create the new array.
We could imagine supporting indexing without loading data when slicing only along chunk boundaries, but this has not yet been implemented (see [GH issue #51](https://github.com/zarr-developers/VirtualiZarr/issues/51)).
You can index into a `ManifestArray` as long as the selection aligns with chunk boundaries — slicing through the interior of a chunk would require loading the chunk's bytes, which a virtual array deliberately cannot do.
Chunk-aligned integer and slice indexing is supported, including mixed integer + slice indexers. Integer indexers drop the indexed axis as in numpy and are legal only when the chunk size along that axis is 1. Misaligned selections raise `SubChunkIndexingError`.
Arbitrary fancy indexing (e.g. with a boolean mask or integer array) is not supported, since it would generally require loading data.
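
As a rough illustration of what "chunk-aligned" means along a single axis, the rule can be sketched in plain Python. The helper names `is_chunk_aligned` and `int_index_is_allowed` are hypothetical and exist only for this example; this is not VirtualiZarr's implementation, and it assumes a regular chunk grid with a step of 1.

```python
def is_chunk_aligned(start: int, stop: int, chunk_size: int, axis_length: int) -> bool:
    """Return True if the half-open range [start, stop) selects only whole chunks.

    Hypothetical helper for illustration; assumes regular chunks and step == 1.
    """
    if not (0 <= start <= stop <= axis_length):
        return False
    starts_on_boundary = start % chunk_size == 0
    # The stop may land on a chunk boundary, or on the end of the axis,
    # which is what allows a partial final chunk to be selected.
    stops_on_boundary = stop % chunk_size == 0 or stop == axis_length
    return starts_on_boundary and stops_on_boundary


def int_index_is_allowed(chunk_size: int) -> bool:
    """An integer picks one element, which is a whole chunk only if chunk_size == 1."""
    return chunk_size == 1


# Axis of length 10 with chunk size 4: chunks cover [0, 4), [4, 8), [8, 10).
print(is_chunk_aligned(0, 8, chunk_size=4, axis_length=10))   # True
print(is_chunk_aligned(4, 10, chunk_size=4, axis_length=10))  # True (partial final chunk)
print(is_chunk_aligned(2, 8, chunk_size=4, axis_length=10))   # False (starts inside a chunk)
print(is_chunk_aligned(4, 9, chunk_size=4, axis_length=10))   # False (stops inside a chunk)
```

A selection for which a check like this fails is the kind the note describes as raising `SubChunkIndexingError`.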

## Zarr Groups

2 changes: 1 addition & 1 deletion docs/faq.md
@@ -174,7 +174,7 @@ Users of Kerchunk may find the following comparison table useful, which shows wh
| Renaming dimensions | ❌ | `xarray.Dataset.rename_dims` |
| Renaming manifest file paths | `kerchunk.utils.rename_target` | `vds.vz.rename_paths` |
| Splitting uncompressed data into chunks | `kerchunk.utils.subchunk` | `xarray.Dataset.chunk` (❌ Not yet implemented - see [PR #199](https://github.com/zarr-developers/VirtualiZarr/pull/199))
| Selecting specific chunks | ❌ | `xarray.Dataset.isel` (❌ Not yet implemented - see [issue #51](https://github.com/zarr-developers/VirtualiZarr/issues/51)) |
| Selecting specific chunks | ❌ | `xarray.Dataset.isel` (✅ chunk-aligned selections only) |
**Parallelization** | | |
| Parallelized generation of references | Wrapping kerchunk's opener inside `dask.delayed` | Wrapping `open_virtual_dataset` inside `dask.delayed`
| Parallelized combining of references (tree-reduce) | `kerchunk.combine.auto_dask` | Wrapping `ManifestArray` objects within `dask.array.Array` objects inside `xarray.Dataset` to use dask's `concatenate` (⚠️ Untested, but also unnecessary) |
13 changes: 13 additions & 0 deletions docs/releases.md
@@ -1,5 +1,18 @@
# Release notes

## Unreleased

### New Features

- `ManifestArray` now supports chunk-aligned integer and slice indexing along each axis, including multi-chunk slices, mixed integer + slice indexers, and selections that include a partial final chunk. Integer indexers drop the indexed axis (numpy / array-API semantics) and are legal only when `chunk_size == 1` along that axis; slice indexers preserve the axis. This makes `xarray.Dataset.isel` work end-to-end on virtual datasets for any chunk-aligned selection. Indexers that would split individual chunks raise a new `SubChunkIndexingError` (a `ValueError` subclass) — a permanent constraint of a virtual array, not a missing feature. Previously slice misalignment silently no-op'd while integer indexing unconditionally raised `NotImplementedError`. Closes [#51](https://github.com/zarr-developers/VirtualiZarr/issues/51), supersedes [#499](https://github.com/zarr-developers/VirtualiZarr/pull/499).
By [Tom Nicholas](https://github.com/TomNicholas).
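
The axis-dropping rule referred to above is the same one plain numpy follows. The sketch below uses an ordinary `numpy` array rather than a `ManifestArray`, purely to show the shape semantics; it says nothing about chunk alignment.

```python
import numpy as np

arr = np.zeros((4, 3))

int_indexed = arr[0]        # integer indexer: axis 0 is dropped
slice_indexed = arr[0:1]    # slice indexer: axis 0 is kept, with length 1
mixed = arr[0, 0:2]         # mixed integer + slice: only the sliced axis survives

print(int_indexed.shape)    # (3,)
print(slice_indexed.shape)  # (1, 3)
print(mixed.shape)          # (2,)
```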

### Bug fixes

### Documentation

### Internal changes

## v2.6.1 (3rd May 2026)

Adds end-to-end support for inlined chunk references in `ChunkManifest` (read via Kerchunk parsers, write via Kerchunk and Icechunk writers), plus Zarr-Python 3.2.0 compatibility and several bug fixes.
31 changes: 31 additions & 0 deletions docs/scaling.md
@@ -308,6 +308,37 @@ for i, batch in enumerate(file_batches):

Notice this workflow could also be used for appending data only as it becomes available, e.g. by replacing the for loop with a cron job.

### Splitting a single large virtual dataset across commits

A single Icechunk commit cannot include more than 50 million chunk references at once.
If a single source (typically a massive Zarr store opened via [`ZarrParser`][virtualizarr.parsers.ZarrParser]) produces a virtual dataset whose arrays together exceed that limit, you cannot write it in one transaction, even though all the references are already in memory.

In that case you can slice the virtual dataset along an axis where the slicing falls on chunk boundaries (often `time`), and commit each slice with `append_dim`. Chunk-aligned slicing on a `ManifestArray` (and therefore on the variables of a virtual `xarray.Dataset`) only subsets the manifest, so this is cheap — no chunks are loaded.

```python
import icechunk as ic

# Parse the giant Zarr store once, producing a virtual dataset that exceeds
# 50M refs in total but whose `time` axis is chunked.
vds = vz.open_virtual_dataset(<zarr_store>, parser=ZarrParser(), registry=registry)

chunk_size_time = vds.chunksizes["time"] # must align the splits to chunk boundaries
step = chunk_size_time * N # pick N so that each slice has < 50M refs

repo = ic.Repository.open(<repo_url>)

for i, start in enumerate(range(0, vds.sizes["time"], step)):
    session = repo.writable_session("main")
    slice_vds = vds.isel(time=slice(start, start + step))
    append_dim = "time" if i > 0 else None
    slice_vds.vz.to_icechunk(session.store, append_dim=append_dim)
    session.commit(f"wrote virtual references for time slice {i}")
```

If the slice boundaries don't align with chunk edges along that axis, the indexing call raises `SubChunkIndexingError`.

(Remember you can also subset the Dataset to specific variables and commit those separately too if necessary.)
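
One way to pick `N` in the snippet above is to count how many chunk references a single step of one time-chunk contributes across all variables, then divide the limit by that number. The sketch below is plain arithmetic under stated assumptions: the helper `time_chunks_per_commit` and the `chunk_grids` shapes are hypothetical, every variable is assumed to have `time` as its first axis, and the limit is applied as a strict "fewer than 50 million" to stay on the safe side.

```python
import math

MAX_REFS_PER_COMMIT = 50_000_000  # limit quoted in the text above


def time_chunks_per_commit(refs_per_time_chunk: int, max_refs: int = MAX_REFS_PER_COMMIT) -> int:
    """Largest number of time-chunks whose references stay strictly under max_refs.

    Hypothetical helper for illustration only.
    """
    if refs_per_time_chunk <= 0:
        raise ValueError("refs_per_time_chunk must be positive")
    n = (max_refs - 1) // refs_per_time_chunk
    if n < 1:
        raise ValueError(
            "a single time-chunk already reaches the limit; "
            "split along another axis or commit variables separately"
        )
    return n


# Hypothetical chunk-grid shapes (number of chunks along each axis) per variable,
# with `time` assumed to be the first axis of each.
chunk_grids = {
    "temperature": (1000, 200, 400),
    "pressure": (1000, 200, 400),
}

# References added by one time-chunk: the product of the non-time grid axes, summed over variables.
refs_per_time_chunk = sum(math.prod(grid[1:]) for grid in chunk_grids.values())
N = time_chunks_per_commit(refs_per_time_chunk)

print(refs_per_time_chunk)  # 160000
print(N)                    # 312
```

With these made-up shapes each commit would carry 312 × 160,000 = 49,920,000 references, just under the limit.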

### Retries

Sometimes an [`open_virtual_dataset`][virtualizarr.open_virtual_dataset] call might fail for a transient reason, such as a failed HTTP response from a server.
31 changes: 25 additions & 6 deletions virtualizarr/manifests/array.py
@@ -155,8 +155,6 @@ def __array_function__(self, func, types, args, kwargs) -> Any:

return MANIFESTARRAY_HANDLED_ARRAY_FUNCTIONS[func](*args, **kwargs)

# Everything beyond here is basically just to make this array class wrappable by xarray #

def __array_ufunc__(self, ufunc, method, *inputs, **kwargs) -> Any:
"""We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs."""
if ufunc == np.isnan:
@@ -227,16 +225,37 @@ def __getitem__(
/,
) -> "ManifestArray":
"""
Perform numpy-style indexing on this ManifestArray.
Index into this ManifestArray, returning a new ManifestArray view over a subset of chunks.

Supports only chunk-aligned selections. A ManifestArray only stores references to where
each chunk's bytes live, never their decoded values, so any indexer that cuts through
the interior of a chunk would require loading the underlying data, which defeats the
point of a virtual array. Such selections raise ``SubChunkIndexingError``
(a ``ValueError`` subclass); this is a permanent constraint, not a missing feature.

Only supports limited indexing, because in general you cannot slice inside of a compressed chunk.
Mainly required because Xarray uses this instead of expand dims (by passing Nones) and often will index with a no-op.
Supported indexers (and tuples thereof):

Could potentially support indexing with slices aligned along chunk boundaries, but currently does not.
- ``Ellipsis`` and ``None``: a no-op and new-axis insertion, respectively.
- ``slice`` with ``step == 1`` whose start and stop land on chunk boundaries
(``stop == axis_length`` is also allowed, so a partial final chunk can be selected).
Slice indexers preserve the axis.
- ``int`` — drops the indexed axis, following numpy / array-API semantics. Only legal
when ``chunk_size == 1`` along that axis; otherwise picking a single element would
require splitting a chunk.

Anything else — fancy indexing with arrays, misaligned slices, ``step != 1`` —
raises ``SubChunkIndexingError`` or ``NotImplementedError``.

Parameters
----------
key
A basic indexer or tuple of basic indexers, one per array axis (with ``Ellipsis``
and ``None`` allowed as per the array API).

Returns
-------
ManifestArray
A new array whose ``ChunkManifest`` references only the selected chunks.
"""
return index(self, key)
