Skip to content

Vectorise _partition_bumps to fix O(n_chunks x count) dask graph build (#3612)#3617

Merged
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-bump-2026-07-02
Jul 2, 2026
Merged

Vectorise _partition_bumps to fix O(n_chunks x count) dask graph build (#3612)#3617
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-bump-2026-07-02

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #3612.

_partition_bumps assigned each bump to the dask chunk holding its centre pixel by looping over every chunk and rescanning the full locs array with a fresh boolean mask. That is O(n_chunks x n_bumps), and it runs eagerly during graph construction, before any .compute(). On a finely chunked raster the rescan dominates the runtime.

This replaces the double loop with a single np.searchsorted pass that buckets every bump into its chunk, then groups by chunk. Complexity drops to about O(n_bumps log n_chunks). The output is byte-identical to the old code, including chunk-local coordinates and intra-chunk order, so compute results do not change.

Measured on the standalone helper (count = 100k):

n_chunks old new speedup
100 0.036 s 0.010 s 3.4x
1,600 0.505 s 0.021 s 24.5x
6,400 2.156 s 0.041 s 52.1x
25,600 8.325 s 0.097 s 85.5x

Backends

The change is in _partition_bumps, shared by _bump_dask_numpy and _bump_dask_cupy. The numpy and cupy (non-dask) paths do not touch this helper and are unaffected. Verified on a CUDA host: the cupy and dask+cupy tests pass.

Benchmark

Adds benchmarks/benchmarks/bump.py. The dask case is finely chunked and times graph construction only, so a return of the per-chunk rescan would show up as a regression.

Test plan

  • pytest xrspatial/tests/test_bump.py (19 passed, includes cupy and dask+cupy on GPU)
  • New test_partition_bumps_matches_bruteforce checks the vectorised partition against the original boolean-mask reference across an irregular chunk layout
  • Benchmark imports and runs for numpy and dask

Found by /sweep-performance.

…3612)

_partition_bumps rescanned the full locs array with a boolean mask once
per chunk, so dask graph construction ran in O(n_chunks * n_bumps) before
any compute. Bucket each bump into its chunk in a single searchsorted pass
instead (byte-identical partitions, ~85x faster at 25.6k chunks).

Add a benchmark (benchmarks/benchmarks/bump.py) and a regression test that
locks the partition against the original boolean-mask reference.

Sweep: performance. Backends: dask+numpy, dask+cupy (shared helper).
@brendancol

Copy link
Copy Markdown
Contributor Author

PR Review: Vectorise _partition_bumps to fix O(n_chunks x count) dask graph build (#3612)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • xrspatial/bump.py _partition_bumps: the searchsorted(..., side='right') - 1 bucketing assumes every loc is in [0, W) and [0, H). An out-of-range coordinate (x >= W or negative) would map to a chunk index outside the grid and silently land in the wrong (a, b) bucket or a nonexistent key, whereas the old boolean-mask version simply dropped it. This is not reachable from bump() (locs come from np.random.choice(range(w/h)), so they are always in range), and this is a private helper, so it is not a real bug. A one-line comment stating the in-range precondition would make the assumption explicit for future callers.
  • heights[idx].copy(): fancy indexing already returns a fresh copy, so the trailing .copy() is redundant. The original code had the same redundant .copy() on the masked path, so keeping it preserves style; only worth changing if you are touching the line anyway.

What looks good

  • The vectorised partition is byte-identical to the original, including chunk-local coordinates and intra-chunk order (stable argsort), so compute results do not change. The new test_partition_bumps_matches_bruteforce locks this against the original boolean-mask reference over an irregular chunk layout with empty chunks.
  • Empty locs is guarded and returns an all-None dict, matching the old behaviour.
  • The change is confined to the shared _partition_bumps helper; the numpy and cupy non-dask paths are untouched. dask+numpy and dask+cupy both use the helper.
  • A benchmark was added; its dask case is finely chunked and times graph construction only, so a return of the per-chunk rescan would register as a regression.

Checklist

  • Algorithm matches reference (byte-identical to prior partition, verified)
  • All implemented backends produce consistent results (cupy + dask+cupy pass on GPU)
  • NaN handling is correct (unchanged; helper does not touch NaN semantics)
  • Edge cases are covered by tests (irregular chunks, empty chunks)
  • Dask chunk boundaries handled correctly (half-open assignment preserved)
  • No premature materialization or unnecessary copies
  • Benchmark exists
  • README feature matrix updated (not applicable, no new public function)
  • Docstrings present and accurate (helper docstring unchanged and still accurate)

@brendancol

Copy link
Copy Markdown
Contributor Author

Follow-up on the review nits:

  • Nit 1 (in-range precondition): applied in eb32e89 as a comment on _partition_bumps documenting that callers pass locs clamped to the raster bounds.
  • Nit 2 (redundant heights[idx].copy()): kept intentionally. The copy keeps the dask graph from holding a view into the caller's heights array; the cosmetic gain is not worth the aliasing risk.

No blockers or suggestions were raised. Tests still green (19 passed, includes cupy + dask+cupy on GPU).

@brendancol brendancol merged commit 05fb3f7 into main Jul 2, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bump(): dask graph construction is O(n_chunks x count) in _partition_bumps

1 participant