perf: eager identity-transform fast-path + lazy-path speedups by d-v-b-agent · Pull Request #5 · d-v-b-agent/zarr-python

d-v-b-agent · 2026-06-29T16:36:54Z

Builds on zarr-developers#3906. Targets the performance regression that PR reported (CodSpeed ~18%, 14 benchmarks) and speeds up the lazy path. Stacks on refactor/simplify-indexing; the diff here is just the 4 commits below.

Root cause (profiled deterministically via per-op call counts — same metric class as CodSpeed)

The transform resolver is +55% function calls vs the legacy indexers for an equivalent read. Crucially, eager tuple selections already fall back to the legacy path (flat, no regression) — only selections routed through get_basic_selection (bare slices) and .lazy views hit the resolver. So the regression is isolated to the transform path, not the eager fallback.

Separately, AsyncArray.shape/ndim had become ~10× slower (recomputed from transform.domain.shape per access).

Changes

cache shape/ndim; reuse _chunk_grid in the transform resolvers (was rebuilding ChunkGrid.from_metadata per call).
identity-transform fast-path — a freshly-opened array has the identity transform, so eager indexing routes straight to the original Basic/Orthogonal/Mask/Coordinate indexers; only non-identity .lazy[...] views use the resolver. Restores the incompatible-shape ValueError the fields-only coordinate branch had dropped (now exercised by the fast-path).
lazy: one chunk_grid[coords] lookup per chunk (was 2× — in _is_complete_chunk and _get_chunk_spec; the single most expensive line in the per-chunk loop).
lazy: single-pass sub_transform_to_selections with hoisted domain bounds.

Results (deterministic call-count proxy)

Eager regression eliminated: bare-slice read 322,494 → 207,820 calls/op, back within 0.3% of main; tuple selections unchanged; .lazy unchanged.
Lazy slice reads ~7% fewer ops (chunk_grid dedup is the measurable win).
Tests: test_transforms/ + test_lazy_indexing.py + test_indexing.py → 310 passed, 1 skipped, 5 xfailed.

Not done / honest ceiling

A lazy read's floor is async/codec dispatch — identical to the eager path, not reducible here. Removing the per-chunk bridge entirely needs the codec pipeline to consume IndexTransform natively (the direction chunk_resolution.py already flags). An interning/from_shape/resolution-memo caching layer was prototyped and dropped — it interned cleanly (78% hit rate) but gave ~0 measurable win because construction/resolution were never the bottleneck.

🤖 Generated with Claude Code

`AsyncArray.shape` recomputed `self._transform.domain.shape` (a zip+genexpr) on every access — ~10x slower than returning the stored tuple — and `ndim` went through it as well. Cache `_shape` wherever the transform is set and read it directly. Also thread the array's already-built `_chunk_grid` into the transform read/write resolvers instead of rebuilding it via `ChunkGrid.from_metadata` on every call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4

…xing A freshly-opened array carries the identity transform, so eager indexing yields exactly the coordinates the original indexers compute. Route such arrays straight to the legacy Basic/Orthogonal/Mask/Coordinate indexer path; only non-identity transforms (opt-in `.lazy[...]` views) go through the transform resolver. This removes the per-chunk transform-resolution overhead from the common eager path while preserving lazy semantics. Also restore the incompatible-shape ValueError in the coordinate-selection legacy branch — the transform path carried it but the fields-only branch had dropped it, and the eager fast-path now exercises that branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4

…ad/write `_is_complete_chunk` and `_get_chunk_spec` each looked up `chunk_grid[chunk_coords]` — the single most expensive line in the per-chunk loop. Split the ArraySpec builder out (`_array_spec_from_chunk_spec`), look the ChunkSpec up once in the read/write loops, and pass it to both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4

…in bounds Build chunk_sel and out_sel in a single pass over the output maps instead of two, with the domain min/max tuples hoisted to locals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4

d-v-b-agent and others added 4 commits June 29, 2026 16:21

github-actions Bot added the needs release notes label Jun 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: eager identity-transform fast-path + lazy-path speedups#5

perf: eager identity-transform fast-path + lazy-path speedups#5
d-v-b-agent wants to merge 4 commits into
refactor/simplify-indexingfrom
perf/eager-fast-path-and-lazy-speedups

d-v-b-agent commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

d-v-b-agent commented Jun 29, 2026

Root cause (profiled deterministically via per-op call counts — same metric class as CodSpeed)

Changes

Results (deterministic call-count proxy)

Not done / honest ceiling

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant