perf: eager identity-transform fast-path + lazy-path speedups#5
Open
d-v-b-agent wants to merge 4 commits into
Open
perf: eager identity-transform fast-path + lazy-path speedups#5d-v-b-agent wants to merge 4 commits into
d-v-b-agent wants to merge 4 commits into
Conversation
`AsyncArray.shape` recomputed `self._transform.domain.shape` (a zip+genexpr) on every access — ~10x slower than returning the stored tuple — and `ndim` went through it as well. Cache `_shape` wherever the transform is set and read it directly. Also thread the array's already-built `_chunk_grid` into the transform read/write resolvers instead of rebuilding it via `ChunkGrid.from_metadata` on every call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
…xing A freshly-opened array carries the identity transform, so eager indexing yields exactly the coordinates the original indexers compute. Route such arrays straight to the legacy Basic/Orthogonal/Mask/Coordinate indexer path; only non-identity transforms (opt-in `.lazy[...]` views) go through the transform resolver. This removes the per-chunk transform-resolution overhead from the common eager path while preserving lazy semantics. Also restore the incompatible-shape ValueError in the coordinate-selection legacy branch — the transform path carried it but the fields-only branch had dropped it, and the eager fast-path now exercises that branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
…ad/write `_is_complete_chunk` and `_get_chunk_spec` each looked up `chunk_grid[chunk_coords]` — the single most expensive line in the per-chunk loop. Split the ArraySpec builder out (`_array_spec_from_chunk_spec`), look the ChunkSpec up once in the read/write loops, and pass it to both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
…in bounds Build chunk_sel and out_sel in a single pass over the output maps instead of two, with the domain min/max tuples hoisted to locals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011D1uGKtHP9s7E4WqiUSYG4
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Builds on zarr-developers#3906. Targets the performance regression that PR reported (CodSpeed ~18%, 14 benchmarks) and speeds up the lazy path. Stacks on
refactor/simplify-indexing; the diff here is just the 4 commits below.Root cause (profiled deterministically via per-op call counts — same metric class as CodSpeed)
The transform resolver is +55% function calls vs the legacy indexers for an equivalent read. Crucially, eager tuple selections already fall back to the legacy path (flat, no regression) — only selections routed through
get_basic_selection(bare slices) and.lazyviews hit the resolver. So the regression is isolated to the transform path, not the eager fallback.Separately,
AsyncArray.shape/ndimhad become ~10× slower (recomputed fromtransform.domain.shapeper access).Changes
shape/ndim; reuse_chunk_gridin the transform resolvers (was rebuildingChunkGrid.from_metadataper call)..lazy[...]views use the resolver. Restores the incompatible-shapeValueErrorthe fields-only coordinate branch had dropped (now exercised by the fast-path).chunk_grid[coords]lookup per chunk (was 2× — in_is_complete_chunkand_get_chunk_spec; the single most expensive line in the per-chunk loop).sub_transform_to_selectionswith hoisted domain bounds.Results (deterministic call-count proxy)
main; tuple selections unchanged;.lazyunchanged.test_transforms/ + test_lazy_indexing.py + test_indexing.py→ 310 passed, 1 skipped, 5 xfailed.Not done / honest ceiling
A lazy read's floor is async/codec dispatch — identical to the eager path, not reducible here. Removing the per-chunk bridge entirely needs the codec pipeline to consume
IndexTransformnatively (the directionchunk_resolution.pyalready flags). An interning/from_shape/resolution-memo caching layer was prototyped and dropped — it interned cleanly (78% hit rate) but gave ~0 measurable win because construction/resolution were never the bottleneck.🤖 Generated with Claude Code