Skip to content

perf: eager identity-transform fast-path + lazy-path speedups#5

Open
d-v-b-agent wants to merge 4 commits into
refactor/simplify-indexingfrom
perf/eager-fast-path-and-lazy-speedups
Open

perf: eager identity-transform fast-path + lazy-path speedups#5
d-v-b-agent wants to merge 4 commits into
refactor/simplify-indexingfrom
perf/eager-fast-path-and-lazy-speedups

Conversation

@d-v-b-agent

Copy link
Copy Markdown
Owner

Builds on zarr-developers#3906. Targets the performance regression that PR reported (CodSpeed ~18%, 14 benchmarks) and speeds up the lazy path. Stacks on refactor/simplify-indexing; the diff here is just the 4 commits below.

Root cause (profiled deterministically via per-op call counts — same metric class as CodSpeed)

The transform resolver is +55% function calls vs the legacy indexers for an equivalent read. Crucially, eager tuple selections already fall back to the legacy path (flat, no regression) — only selections routed through get_basic_selection (bare slices) and .lazy views hit the resolver. So the regression is isolated to the transform path, not the eager fallback.

Separately, AsyncArray.shape/ndim had become ~10× slower (recomputed from transform.domain.shape per access).

Changes

  1. cache shape/ndim; reuse _chunk_grid in the transform resolvers (was rebuilding ChunkGrid.from_metadata per call).
  2. identity-transform fast-path — a freshly-opened array has the identity transform, so eager indexing routes straight to the original Basic/Orthogonal/Mask/Coordinate indexers; only non-identity .lazy[...] views use the resolver. Restores the incompatible-shape ValueError the fields-only coordinate branch had dropped (now exercised by the fast-path).
  3. lazy: one chunk_grid[coords] lookup per chunk (was 2× — in _is_complete_chunk and _get_chunk_spec; the single most expensive line in the per-chunk loop).
  4. lazy: single-pass sub_transform_to_selections with hoisted domain bounds.

Results (deterministic call-count proxy)

  • Eager regression eliminated: bare-slice read 322,494 → 207,820 calls/op, back within 0.3% of main; tuple selections unchanged; .lazy unchanged.
  • Lazy slice reads ~7% fewer ops (chunk_grid dedup is the measurable win).
  • Tests: test_transforms/ + test_lazy_indexing.py + test_indexing.py310 passed, 1 skipped, 5 xfailed.

Not done / honest ceiling

A lazy read's floor is async/codec dispatch — identical to the eager path, not reducible here. Removing the per-chunk bridge entirely needs the codec pipeline to consume IndexTransform natively (the direction chunk_resolution.py already flags). An interning/from_shape/resolution-memo caching layer was prototyped and dropped — it interned cleanly (78% hit rate) but gave ~0 measurable win because construction/resolution were never the bottleneck.

🤖 Generated with Claude Code

d-v-b-agent and others added 4 commits June 29, 2026 16:21
`AsyncArray.shape` recomputed `self._transform.domain.shape` (a zip+genexpr)
on every access — ~10x slower than returning the stored tuple — and `ndim`
went through it as well. Cache `_shape` wherever the transform is set and read
it directly.

Also thread the array's already-built `_chunk_grid` into the transform
read/write resolvers instead of rebuilding it via `ChunkGrid.from_metadata`
on every call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
…xing

A freshly-opened array carries the identity transform, so eager indexing
yields exactly the coordinates the original indexers compute. Route such
arrays straight to the legacy Basic/Orthogonal/Mask/Coordinate indexer path;
only non-identity transforms (opt-in `.lazy[...]` views) go through the
transform resolver. This removes the per-chunk transform-resolution overhead
from the common eager path while preserving lazy semantics.

Also restore the incompatible-shape ValueError in the coordinate-selection
legacy branch — the transform path carried it but the fields-only branch had
dropped it, and the eager fast-path now exercises that branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
…ad/write

`_is_complete_chunk` and `_get_chunk_spec` each looked up
`chunk_grid[chunk_coords]` — the single most expensive line in the per-chunk
loop. Split the ArraySpec builder out (`_array_spec_from_chunk_spec`), look the
ChunkSpec up once in the read/write loops, and pass it to both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
…in bounds

Build chunk_sel and out_sel in a single pass over the output maps instead of
two, with the domain min/max tuples hoisted to locals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant