diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv
index dfc282a06..a8b0809d7 100644
--- a/.claude/sweep-performance-state.csv
+++ b/.claude/sweep-performance-state.csv
@@ -1,51 +1,51 @@
-module,last_inspected,oom_verdict,bottleneck,high_count,issue,notes
-aspect,2026-05-29,SAFE,compute-bound,1,2688,"dask+cupy geodesic densified full lat/lon on one GPU at graph build (OOM at scale); fixed via per-block map_blocks cupy conversion. planar/numpy/dask SAFE; geodesic GPU kernel ~184 regs, mitigated by 16x16 blocks."
-balanced_allocation,2026-04-16T12:00:00Z,WILL OOM,memory-bound,8,1114,"Re-audit 2026-04-16 after PR 1203 float32 fix. 8 HIGH found (friction.compute L339, argmin.compute in iter loop L182, double all_nan recompute L206, stacked cost_surfaces allocation). Covered by existing documented limitation on #1114. Not refiled."
-bilateral,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-bump,2026-04-16T12:00:00Z,SAFE,compute-bound,0,1206,Re-audit 2026-04-16: fix verified SAFE. No HIGH findings. MEDIUM: CuPy backend runs CPU kernel then transfers to GPU (documented limitation).
-classify,2026-04-16T18:00:00Z,SAFE,compute-bound,0,fixed-in-tree,"Fixed-in-tree 2026-04-16: _run_dask_head_tail_breaks now persists data_clean once and fuses mean+head_count per iter (912ms -> 339ms, 0.37x IMPROVED); added _run_dask_box_plot that samples via _generate_sample_indices instead of boolean fancy indexing on dask array; _run_dask_cupy_box_plot likewise. 85 existing classify tests pass."
-contour,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-convolution,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-corridor,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-cost_distance,2026-04-16T12:00:00Z,WILL OOM,memory-bound,4,1118,"Re-audit 2026-04-16 after PR 1192 Bellman-Ford fix. 4 HIGH re-surface in iterative tile_cache path (L645 full-dataset materialization, L1015 da.from_delayed wrapping computed tiles). Finite max_cost path remains SAFE. Unbounded path is fundamentally O(dataset) driver memory — covered by #1118."
-curvature,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-dasymetric,2026-03-31T18:00:00Z,SAFE,memory-bound,0,1126,Memory guard added to validate_disaggregation. Core disaggregate uses map_blocks.
-diffusion,2026-03-31T18:00:00Z,WILL OOM,memory-bound,2,1116,Scalar diffusivity now passed as float to chunks. DataArray diffusivity passed as dask array via map_overlap.
-edge_detection,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-emerging_hotspots,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-erosion,2026-03-31T18:00:00Z,WILL OOM,memory-bound,2,1120,Memory guard added. Algorithm inherently global.
-fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-focal,2026-05-29,SAFE,compute-bound,1,2734,"HIGH: _hotspots_dask_cupy chunk fn round-tripped each chunk host<->GPU (cupy.asnumpy classify cupy.asarray); fixed PR 2739 to reuse _run_gpu_hotspots on device. LOW (not fixed): _apply_numpy/_hotspots_cupy use zeros_like where empty would suffice. CUDA kernels regs<=62, no register-pressure issue."
-geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
-geotiff,2026-06-11,SAFE,IO-bound,0,3235,"Pass 15 (2026-06-11): 1 MEDIUM found and fixed. _pack (_attrs.py:~1795) guarded the no-sentinel integer restore with an eager bool(out.isnull().any()), which executed the whole upstream dask graph at to_geotiff(pack=True) call time; the streaming writer then executed it again, so every source chunk computed twice (measured 32 decode-task executions for 16 chunks on a 512x512 int16 SCALE/OFFSET no-GDAL_NODATA source; 71->33 total task starts post-fix). Filed #3235, fixed by mapping a per-chunk NaN guard (_pack_guard_no_nan) into the graph for dask-backed data (raises from the write's single compute; numpy keeps the eager call-time check; meta= preserves cupy backing). 9 new tests in test_pack_lazy_nan_guard_3235.py incl. fusion-proof execution counter and cupy-chunk guard unit test (dask+cupy e2e still blocked upstream by #3112). Scrutinised all 16 commits since 2026-06-08 (pack/unpack series #3065/#3075/#3079/#3129/#3174/#3175, VRT placement #3135, compression_level gate #3176, streaming banding #3136, dask+cupy writer order fix #3171): no other regressions; #3171's get-then-asarray order is intentional D2H for gpu=False. GPU validated on-device this pass: eager GPU unpack returns cupy with exact parity (387ms incl warmup, only 0-d scalar .get()s -- no bulk host round trip), dask+GPU unpack lazy (112 tasks/16 chunks, cupy meta, compute returns cupy, parity 0.0), GDS fast path intact without unpack (4 tasks/chunk); unpack disqualifying GDS is documented intentional. Dask CPU probe 4 tasks/chunk, 50k-task cap intact. Note: #1714 (_write_vrt_tiled synchronous scheduler) is now FIXED+CLOSED (scheduler='threads' at _writers/eager.py:1517) -- drop from the open-issue list. LOW noted (no fix): _pack does identity (data-0.0)/1.0 arithmetic allocating two full-array temporaries when scale==1/offset==0 (masked_nodata-only pack); prior deferred LOWs unchanged. SAFE/IO-bound holds. | Pass 14 (2026-06-09): MEDIUM found and fixed -- _write_streaming ran one dask .compute() per 256-row tile-row/strip, so a source chunk taller than the band re-executed once per band it overlapped (measured 2x at chunks=512, 4x at chunks=1024, whole upstream graph re-runs for computed pipelines). Filed #3117, fixed via _stream_row_bands: consecutive tile-rows/strips group into row bands sized by the source chunk-row span (one-chunk halo, #3007 accounting) under streaming_buffer_bytes; each band computes once and tiles/strips are carved from the materialised band. Wide rasters needing column segmentation keep the per-tile-row path. Post-fix per-chunk executions == 1 on the default read->write round trip. 5 new tests (TestRowBandRecompute3117 + _stream_row_bands unit); write/integration/parity suites pass (2195). LOW deferred (no fix): _read_geotiff_gpu_chunked parses header+all IFDs twice at graph build (_backends/gpu.py ~1367-1419, cap check then GDS probe; build-time only). GPU paths validated on-device this pass: eager gpu read returns cupy with parity, dask+GPU chunked read lazy (17 tasks/4 chunks) with parity; GPU writer full materialisation is documented intentional (streaming_buffer_bytes no-op). Read path keeps 50k-task graph cap; dask read probe 4 tasks/chunk. SAFE/IO-bound holds. | Pass 13 (2026-05-20): 1 MEDIUM found and fixed. _nvjpeg_batch_encode (_gpu_decode.py:~L1560) and _nvjpeg2k_batch_encode (~L2958) called cupy.cuda.Device().synchronize() inside the per-tile encode loops, a whole-device fence that blocked every CUDA stream and serialised concurrent work (e.g. predictor encodes on other streams). The decode-side counterpart _try_nvjpeg_batch_decode already used cupy.cuda.Stream.null.synchronize() at L1442; the encoder side was inconsistent. Filed #2212 and fixed both encoders to use Stream.null.synchronize(), scoping the per-tile sync to the default stream the encode/retrieve calls were issued on. nvJPEG / nvJPEG2000 encoders maintain a single shared state per encoder so encodes within a batch are inherently serial; the fix removes the device-wide blocker without changing the API ordering contract. 5 new tests in test_nvjpeg_encode_stream_sync_2212.py (AST checks that neither encoder contains Device().synchronize() inside a for-loop, that both call Stream.null.synchronize() in the loop, and that the decoder reference pattern stays pinned). All 5 new tests + 19 existing related encode/decode tests pass. nvjpeg/nvjpeg2k shared libs not present on this host so end-to-end encode verification is gated; add cuda-unavailable-libs note to re-validate on a host with the RAPIDS conda env. SAFE/IO-bound verdict holds; no change in dask graph cost. Dask probe: 2560x2560 deflate-tiled file via read_geotiff_dask(chunks=256) yields 400 tasks for 100 chunks (4 tasks/chunk), well under the 50K cap. LOW deferred (no fix in this PR): _build_ifd called twice per IFD level in _assemble_standard_layout (_writer.py:1531+1543), _assemble_cog_layout (1582+1625), and the COG overview path (2519+2546+2740) -- the first call's bytes are discarded; only the overflow byte length is used to compute pixel_data_offset. Cost is bounded by IFD count (typically 1-5 overview levels) so absolute impact is minor. Pre-existing pattern. | Pass 12 (2026-05-18): 1 MEDIUM found and fixed. _try_nvjpeg2k_batch_decode at _gpu_decode.py:~L2725-2778 allocated per-tile per-component cupy.empty buffers (N*S round-trips through the cupy memory pool) and called cupy.cuda.Device().synchronize() once per tile, forcing default-stream serialisation that defeats nvJPEG2000's internal pipelining. Filed #2107 and fixed: pre-allocate a single d_comp_pool sized n_tiles*samples*tile_height*pitch under a _check_gpu_memory guard, derive per-tile/per-component views as slab offsets, and replace the per-tile sync with a single batch-end sync. Same pattern as #1659 (_try_nvcomp_from_device_bufs), #1688 (_try_kvikio_read_tiles), #1712 (_nvcomp_batch_compress). 7 new tests in test_nvjpeg2k_single_alloc_2107.py: AST-level structural assertions confirm no cupy.empty inside the for-loop and no Device().synchronize() inside the loop, plus pool/per_tile_comp_bytes presence and _check_gpu_memory guard checks; lib-absent short-circuit; unsupported-dtype cleanup contract; cupy-only pool slab-non-overlap test (gpu-marked). libnvjpeg2k.so not present on this host so the end-to-end nvJPEG2000 decode is gated -- note added to re-validate on a host with the RAPIDS conda env. All 30 jpeg2000/compression tests + 7 new tests pass. SAFE/IO-bound verdict holds (no change in dask graph cost). Dask probe: 4096x4096 deflate-tiled file via read_geotiff_dask(chunks=512) yields 256 tasks for 64 chunks (4 tasks/chunk), well under the 50K cap. | Pass 11 (2026-05-18): 1 MEDIUM found and fixed. _read_strips (_reader.py:~L1972) and _fetch_decode_cog_http_strips (_reader.py:~L2670) decoded strips sequentially in a Python for-loop while the tile counterparts (_read_tiles L2146, _fetch_decode_cog_http_tiles L2898) gated parallel decode on _PARALLEL_DECODE_PIXEL_THRESHOLD via ThreadPoolExecutor. Filed #2100 and fixed: both strip paths now collect jobs, parallel-decode when n_strips > 1 and strip_pixels >= 64K, then place sequentially. Measured (uint16, 4-core): 4096x4096 deflate 130ms->34ms (3.82x), 8192x8192 deflate 531ms->146ms (3.63x), 8192x8192 zstd 211ms->85ms (2.48x), uncompressed 25ms->22ms (1.14x). 5 new tests in test_parallel_strip_decode_2100.py (parallel/serial parity, pool-engaged on multi-strip, serial-path for single-strip, windowed cross-strip read, HTTP COG strip parity). 3998 tests pass; 8 pre-existing failures predating this change (predictor2 BE + size_param_validation_gpu_vrt reference now-private read_to_array attr). SAFE/IO-bound verdict holds. | Pass 10 (2026-05-15): 1 new MEDIUM found and fixed; 2 LOW noted. MEDIUM (_reader.py:2737): _fetch_decode_cog_http_tiles decoded tiles sequentially in a Python for-loop after the concurrent fetch landed (issue #1480). Local _read_tiles parallelises decode whenever tile_pixels >= 64K via ThreadPoolExecutor (_reader.py:2017); the HTTP path was structurally similar but never picked up the same gate, so wide windowed reads of multi-tile COGs left deflate/zstd decode single-threaded. Mirrored the local-path threshold + pool. 5 new tests in test_cog_http_parallel_decode_2026_05_15.py (parallel + serial round-trip correctness, pool-instantiation branch selection above the threshold, single-tile path skips the pool, structural _decode_strip_or_tile call count == n_tiles). All 262 COG/HTTP tests pass; 3162 of 3164 selected geotiff tests pass overall (2 pre-existing failures predating Pass 9 per prior notes -- test_predictor2_big_endian_gpu_1517 references the now-private read_to_array attr, and the test_size_param_validation_gpu_vrt_1776 tile_size=4 validator failure). LOW deferred (no fix in this PR): (1) _block_reduce_2d_gpu (_gpu_decode.py:3142/3163/3189) does bool(mask.any().item()) per overview level when nodata is set, paying one device sync per level; the alternative (unconditional cupy.putmask) always pays the work cost and the short-circuit is correct under the current API. (2) _nvcomp_batch_compress adler32 staging (_gpu_decode.py:2543-2546) issues n_tiles slice-assign kernels into a fresh contig buffer despite all callers passing slices of a single underlying d_tile_buf; an API refactor to accept the source buffer directly would skip the rebuild. SAFE/IO-bound verdict holds. Dask probe: 2560x2560 chunks=256 yields 400 tasks (4 per chunk), well under the 50000 cap. GPU probe: 1024x1024 float32 zstd read returns CuPy-backed in 236 ms with no host round-trip. | Rockout 2026-05-15: LOW filed #1934 -- _apply_nodata_mask_gpu used cupy.where (allocating); switched to cupy.putmask on the already-owned buffer (float path) and on the post-astype float64 buffer (int path). Saves one chunk-sized device allocation per call. 7 new tests in test_apply_nodata_mask_gpu_inplace_1934.py; 52 related nodata tests pass. | Pass 8 (2026-05-12): 1 new MEDIUM found and fixed. _assemble_standard_layout/_assemble_cog_layout returned bytes(bytearray), doubling peak memory transiently during eager writes. Filed #1756, fixed by returning the bytearray directly. Measured: 95 MB uint8 raster peak drops 202 MB -> 107 MB. _write_bytes / parse_header already accepted the buffer protocol so the change is transparent to callers. 6 new tests in test_assemble_layout_no_bytes_copy_1756.py. 2123 existing geotiff tests pass; the 10 unrelated failures (test_no_georef_windowed_coords_1710, test_predictor2_big_endian_gpu_1517) reference the now-private read_to_array attribute (commit 8adb749, issue #1708) and predate this change. SAFE/IO-bound verdict holds. | Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
-glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
-hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
-hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
-interpolate,2026-06-12,SAFE,compute-bound,0,3298,"3 MEDIUM fixed via #3298: kriging no-variance path now uses dual form k0 @ (K_inv @ z_aug) (1.6x, drops (P,N+1) w temp); dask variance computed in one map_blocks pass (was 2.06x); dask+cupy chunk-invariant uploads (idw x/y/z, kriging x/y/z/K_inv) hoisted and cKDTree built once for dask k-nearest. 1 LOW documented, not fixed: _experimental_variogram bins pairs with per-lag boolean masks, O(nlags*N^2) passes where np.bincount would do one. Dask graphs are plain map_blocks, 2 tasks/chunk, no fan-in; memory guards cover host allocations. GPU paths executed on this host (CUDA available)."
-interpolate-kriging,2026-06-04,SAFE,graph-bound,0,2923,"MEDIUM: memory guard used full-grid k0 term on dask templates -> spurious MemoryError (issue #2923, fixed). LOW: _experimental_variogram nlags python loop vectorizable via bincount (~1.2x, pair-array materialization dominates) - doc only. Dask graph clean (2 tasks/chunk); cupy returns device arrays; no .values/.compute/.data.get materialization."
-interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty."
-kde,2026-04-14T12:00:00Z,SAFE,compute-bound,0,,Graph construction serialized per-tile. _filter_points_to_tile scans all points per tile. No HIGH findings.
-mahalanobis,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,False positive. Numpy path materializes by design. Dask path uses lazy reductions + map_blocks.
-mcda,2026-06-10,SAFE,memory-bound,2,3150,"2 HIGH fixed in PR #3158: owa() dask path crashed (da.sort does not exist; memory guard pointed users at the crashing path) and wpm validation ran one compute() per criterion. MEDIUM fixed in PR #3159 (#3151): cupy piecewise + dask+cupy piecewise/categorical raised TypeError via np.asarray on cupy chunks. MEDIUM fixed in PR #3160 (#3152): monte_carlo sensitivity materialized full dask dataset (now chunk-bounded map_blocks, ~8 tasks/chunk at n_samples=1000) and crashed on cupy via per-sample .values; constrain() deep copy dropped. LOW documented, not fixed: fuzzy_overlay builds ones via layers[0]*0+1; _categorical does one full-array pass per mapping key. Verdict SAFE assumes the 3 PRs merge (pre-fix: WILL OOM for MC-on-dask, owa dask broken). GPU paths validated on CUDA host (cupy 13.6)."
-morphology,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-multispectral,2026-05-02,SAFE,compute-bound,0,,"Re-audit 2026-05-02 after PRs 1292 (true_color memory guard) and 1301 (validate_arrays in true_color). Verified SAFE. No HIGH. MEDIUM: da.stack in _true_color_dask/_true_color_dask_cupy at L1702/L1731 creates (1,1,1,1) chunks along band axis (4 bands so impact is minor, scheduling overhead not OOM). LOW: np.zeros((h,w,4)) at L1681 then full overwrite -- np.empty would suffice. All 17 indices use plain map_blocks with no halo; 8192x8192 ndvi graph is 80 tasks, evi/arvi/ebbi 112 tasks."
-normalize,2026-03-31T18:00:00Z,SAFE,compute-bound,0,1124,Boolean indexing replaced with lazy nanmin/nanmax/nanmean/nanstd.
-pathfinding,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. CuPy .get() is required -- A* has no GPU kernel. Per-pixel .compute() is only 2 calls for start/goal validation. seg.values in multi_stop_search collects already-computed results for stitching.
-perlin,2026-03-31T18:00:00Z,WILL OOM,memory-bound,0,,
-polygon_clip,2026-06-10,SAFE,graph-bound,0,3191,"crop=True picked tiny leading edge chunk as rasterize mask size -> 13169-task graph; fixed to max(rc),max(cc) -> 1045 tasks. crop=False/numpy/cupy clean. Cat1-5 clean. GPU+dask+cupy run-validated."
-polygonize,2026-06-12,RISKY,compute-bound,0,3303,"Pass 3 (2026-06-12): re-audit after #2817/#2913/#3041. 0 HIGH. 1 MEDIUM fixed (#3303): _compute_region_value_ranges ran a pure-Python per-pixel loop (95% of float chunk time; 0.283s of 0.299s on 1024x1024, float chunks ~30x int) and re-ran _calculate_regions on an already-labelled block; moved to jitted _region_ranges_scan + _polygonize_numpy_regions label reuse (0.299s -> 0.015s/chunk). Side fix: w_match/s_match flags were always-truthy (_is_close numba overload generator called from pure Python returns impl function); output-neutral by chunk geometry, now computed correctly in jit. Cat1/2 clean (dask.compute batching is the documented #2673 design). Cat3 validated on GPU: cupy int/float + dask+cupy run end-to-end, single documented transfer, no round-trip. Cat4/5 LOW unchanged: _calculate_regions_cupy per-unique-value labeling (low impact); per-polygon Python classify loop in _polygonize_chunk dominates only on pathological many-polygon chunks (788K polys -> 7.8s). Cat6 RISKY unchanged: driver accumulates O(total polygons); 32-chunk batches bound transient peak. 527 polygonize tests + 40 new pass."
-proximity,2026-06-09,RISKY,graph-bound,0,3103,"Pass 2 (2026-06-09): re-audit after 16 fix commits since 2026-03-31. 0 HIGH, 2 MEDIUM found and fixed: (1) #3103/PR #3126 line-sweep @ngjit closure inside _process recompiled per call (~0.42s constant overhead; 10x10 warm call 0.44s->1ms after module-level hoist with explicit args, 1000x1000 0.49s->35ms); (2) #3132/PR #3137 dask xs/ys coordinate grids built via da.tile/da.repeat+rechunk cost ~185 tasks/chunk with the ys term scaling O(raster height) (~4.3 tasks/row, 44K tasks at 10240 rows); chunk-aligned da.broadcast_to gives identical values, bounded graph 18535->5554 tasks (3.3x) on 2560^2/256 chunks; regression test bounds tasks/chunk<80 (old 100.4, new 58.7) + ragged-chunk parity. LOW not fixed: zeros+fill(-1) row buffers in line-sweep; numpy backend materializes full float64 xs/ys grids (guarded since #1111); unbounded KDTree streaming count pass computes chunks on driver by design (gh-879). GPU validated on CUDA host: cupy 1024^2 proximity 6ms device-resident with exact numpy parity, dask+cupy bounded parity exact, _proximity_cuda_kernel 56 regs/thread (no register pressure). _halo_depth python loop measured 58ms at 100K coords - not a finding. Verdict RISKY (was WILL OOM): unbounded paths either guarded (MemoryError at 80% mem) or stream via kdtree; bounded map_overlap peak scales with chunk size."
-rasterize,2026-06-09,SAFE,compute-bound,0,3107,"Pass 4 (2026-06-09): re-audit found 2 MEDIUM Cat-4 allocation findings, 0 HIGH. (a) all four backends return via astype(dtype) which copies the float64 work buffer even when dtype is already float64 (the default) -- _run_numpy L1237, _run_cupy L2211, _rasterize_tile_numpy L2460, _rasterize_tile_cupy L2688; fix astype(dtype, copy=False). (b) CPU paths allocate order as full-raster int64 (8 B/px) for every merge mode but only first/last predicates read it; for _should_write_any merges (max/min/sum/count, user callables) an int8 buffer suffices (numba wraps the dead int64 store) -- _run_numpy L1188, _rasterize_tile_numpy L2420. tracemalloc 4000x4000 numpy merge=sum: peak 25 B/px -> 10 B/px expected (out 8 + written 1 + order 1); merge=last 25 -> 17 B/px. Filed #3107, fixed via deep-sweep rockout. GPU validated on host (CUDA available): cupy 512x512 last/sum/max returns cupy.ndarray with CPU parity, dask+cupy sum parity True, no host round trip. Dask graph probe: 2560x2560 chunks=256 -> 400 tasks / 100 chunks (4.0 tasks/chunk, unchanged). LOW (not fixed, documented): _extract_polygon_boundary_segments int variant L702 is dead code (only the _float variant is called). SAFE/compute-bound: per-tile buffers scale with chunk size; scanline/burn JIT kernels dominate runtime. | Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
-reproject,2026-06-12,SAFE,compute-bound,0,3267 3268,"Pass 7 (2026-06-12, deep-sweep): 0 HIGH, 2 MEDIUM found and fixed. #3267: in-memory numpy path held ~7 output-sized float64 temporaries and dask promotion keyed on input size only -- measured 8.4 GB peak RSS for a 1.15 GB output (7.3x); fix computes the grid first, promotes on output size too (mirrors merge #3048 pattern), re-applies the pixel guard for materializing paths, and fixes the 3-D chunks tuple in the dask wrap (post-fix peak 0.5 GB, lazy). #3268: multi-band cupy CPU-fallback transform path re-uploaded local_row/local_col per band via cp.asarray inside _resample_cupy_native (measured 12 H2D uploads for 6 bands, 26.1 MB vs 4.4 MB needed); fix hoists the device conversion before the band loop. LOW (documented, not fixed): _resample_cupy_native redundant copy+scan when the caller pre-converted nodata (+149% on the resample step, hits _reproject_dask_cupy fast path with non-NaN nodata); geoid_height_raster allocates full HxW meshgrid x2 plus output from dims (no strips, no dask path). Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, single blockwise layer, lazy); merge 2 inputs -> 16 tasks. GPU validated on host (CUDA available): cupy 2048^2 4326->3857 in 23 ms on-device; dask+cupy eager fast path matches in-memory exactly. SAFE/compute-bound holds. | Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
-resample,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. GPU-CPU-GPU round-trip only in aggregate path for non-integer scale factors. Interpolation (nearest/bilinear/cubic) stays on GPU. No GPU kernel exists for irregular per-pixel binning.
-sieve,2026-04-14T12:00:00Z,WILL OOM,memory-bound,0,false-positive,False positive. Memory guards already in place on both dask paths. CCL is inherently global — documented limitation. CuPy CPU fallback is deliberate and documented.
-sky_view_factor,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-slope,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-surface_distance,2026-03-31T18:00:00Z,SAFE,memory-bound,0,1128,Memory guard added to dd_grid allocation.
-terrain,2026-03-31T18:00:00Z,RISKY,compute-bound,0,,
-terrain_metrics,2026-03-31T18:00:00Z,SAFE,memory-bound,0,,
-viewshed,2026-04-05T12:00:00Z,SAFE,memory-bound,0,fixed-in-tree,Tier B memory estimate tightened from 280 to 368 bytes/pixel (accounts for lexsort double-alloc + computed raster). astype copy=False avoids needless float64 copy.
-visibility,2026-06-10,RISKY,compute-bound,0,3185,"cumulative_viewshed recomputed dask source per observer (fixed #3185: materialise once when no max_distance); graph grows ~64 tasks/observer with N; line_of_sight single-transect cheap; MEDIUM count temp .astype per observer (LOW, not fixed)"
-worley,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
-zonal,2026-05-27,SAFE,compute-bound,0,2526,"Pass 2 (2026-05-27): re-audit identified 3 MEDIUM findings. (1) zonal_apply 3D dask path: da.stack(layers, axis=2) left output chunks at size 1 along axis 2 -- filed #2526 and fixed by rechunking back to values_data.chunks[2] in _apply_dask_numpy (zonal.py:1691) and _apply_dask_cupy (zonal.py:1731). Confirmed via graph probe: 256x256 raster chunks=(64,64) 3 bands previously yielded chunks[2]=(1,1,1); now (3,). 1 new test (test_apply_dask_3d_axis2_rechunked_2526). 126 existing zonal tests pass. (2) _stats_cupy (zonal.py:588-608): per-zone x per-stat Python loop with cupy.float_(result) forces O(n_zones * n_stats) GPU<->CPU sync points; not fixed in this pass (CUDA-native rewrite needed, larger refactor). (3) _parallel_variance @delayed reduce iterates over all blocks in driver memory; for very large block counts the single-task merge becomes scheduler-bound but is not OOM since per-block arrays are O(n_zones). Not fixed (algorithmic refactor needed). Dask graph probe: stats(7 stats) on 2560x2560 chunks=256 -> 4449 tasks (44/chunk); stats(mean only) -> 823 tasks (8/chunk); crosstab -> 304 (3/chunk); hypsometric_integral -> 300 (3/chunk). All under 50K cap. SAFE/compute-bound verdict holds. | Fixed-in-tree 2026-04-16: rewrote hypsometric_integral dask path. Eliminated double-compute (_unique_finite_zones removed, each block discovers own zones). Replaced np.stack (O(n_blocks * n_zones) scheduler memory) with streaming dict-merge (O(n_zones)). 29 existing tests pass."
+module,last_inspected,oom_verdict,bottleneck,high_count,issue,notes
+aspect,2026-05-29,SAFE,compute-bound,1,2688,"dask+cupy geodesic densified full lat/lon on one GPU at graph build (OOM at scale); fixed via per-block map_blocks cupy conversion. planar/numpy/dask SAFE; geodesic GPU kernel ~184 regs, mitigated by 16x16 blocks."
+balanced_allocation,2026-04-16T12:00:00Z,WILL OOM,memory-bound,8,1114,"Re-audit 2026-04-16 after PR 1203 float32 fix. 8 HIGH found (friction.compute L339, argmin.compute in iter loop L182, double all_nan recompute L206, stacked cost_surfaces allocation). Covered by existing documented limitation on #1114. Not refiled."
+bilateral,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+bump,2026-04-16T12:00:00Z,SAFE,compute-bound,0,1206,Re-audit 2026-04-16: fix verified SAFE. No HIGH findings. MEDIUM: CuPy backend runs CPU kernel then transfers to GPU (documented limitation).
+classify,2026-04-16T18:00:00Z,SAFE,compute-bound,0,fixed-in-tree,"Fixed-in-tree 2026-04-16: _run_dask_head_tail_breaks now persists data_clean once and fuses mean+head_count per iter (912ms -> 339ms, 0.37x IMPROVED); added _run_dask_box_plot that samples via _generate_sample_indices instead of boolean fancy indexing on dask array; _run_dask_cupy_box_plot likewise. 85 existing classify tests pass."
+contour,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+convolution,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+corridor,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+cost_distance,2026-06-15,RISKY,memory-bound,1,3342,"Perf sweep 2026-06-15. HIGH: bounded map_overlap branch in _cost_distance_dask gated on full dims (pad>=height/width) not chunk size; pad>chunk collapses to single chunk (#880-class OOM, verified npartitions=1 at chunks=10/pad=96). Fixed: compare pad vs max chunk dim, route to iterative when pad>=chunk (matches GPU path L484). dask+cupy path already correct. Register count 37 (no pressure). nanmin().compute() L478/L1149 intentional scalar. iterative tile_cache full-dataset materialization is documented MemoryError-guarded design (#1118). All 56 tests pass incl GPU."
+curvature,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+dasymetric,2026-03-31T18:00:00Z,SAFE,memory-bound,0,1126,Memory guard added to validate_disaggregation. Core disaggregate uses map_blocks.
+diffusion,2026-03-31T18:00:00Z,WILL OOM,memory-bound,2,1116,Scalar diffusivity now passed as float to chunks. DataArray diffusivity passed as dask array via map_overlap.
+edge_detection,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+emerging_hotspots,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+erosion,2026-03-31T18:00:00Z,WILL OOM,memory-bound,2,1120,Memory guard added. Algorithm inherently global.
+fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+focal,2026-05-29,SAFE,compute-bound,1,2734,"HIGH: _hotspots_dask_cupy chunk fn round-tripped each chunk host<->GPU (cupy.asnumpy classify cupy.asarray); fixed PR 2739 to reuse _run_gpu_hotspots on device. LOW (not fixed): _apply_numpy/_hotspots_cupy use zeros_like where empty would suffice. CUDA kernels regs<=62, no register-pressure issue."
+geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
+geotiff,2026-06-11,SAFE,IO-bound,0,3235,"Pass 15 (2026-06-11): 1 MEDIUM found and fixed. _pack (_attrs.py:~1795) guarded the no-sentinel integer restore with an eager bool(out.isnull().any()), which executed the whole upstream dask graph at to_geotiff(pack=True) call time; the streaming writer then executed it again, so every source chunk computed twice (measured 32 decode-task executions for 16 chunks on a 512x512 int16 SCALE/OFFSET no-GDAL_NODATA source; 71->33 total task starts post-fix). Filed #3235, fixed by mapping a per-chunk NaN guard (_pack_guard_no_nan) into the graph for dask-backed data (raises from the write's single compute; numpy keeps the eager call-time check; meta= preserves cupy backing). 9 new tests in test_pack_lazy_nan_guard_3235.py incl. fusion-proof execution counter and cupy-chunk guard unit test (dask+cupy e2e still blocked upstream by #3112). Scrutinised all 16 commits since 2026-06-08 (pack/unpack series #3065/#3075/#3079/#3129/#3174/#3175, VRT placement #3135, compression_level gate #3176, streaming banding #3136, dask+cupy writer order fix #3171): no other regressions; #3171's get-then-asarray order is intentional D2H for gpu=False. GPU validated on-device this pass: eager GPU unpack returns cupy with exact parity (387ms incl warmup, only 0-d scalar .get()s -- no bulk host round trip), dask+GPU unpack lazy (112 tasks/16 chunks, cupy meta, compute returns cupy, parity 0.0), GDS fast path intact without unpack (4 tasks/chunk); unpack disqualifying GDS is documented intentional. Dask CPU probe 4 tasks/chunk, 50k-task cap intact. Note: #1714 (_write_vrt_tiled synchronous scheduler) is now FIXED+CLOSED (scheduler='threads' at _writers/eager.py:1517) -- drop from the open-issue list. LOW noted (no fix): _pack does identity (data-0.0)/1.0 arithmetic allocating two full-array temporaries when scale==1/offset==0 (masked_nodata-only pack); prior deferred LOWs unchanged. SAFE/IO-bound holds. | Pass 14 (2026-06-09): MEDIUM found and fixed -- _write_streaming ran one dask .compute() per 256-row tile-row/strip, so a source chunk taller than the band re-executed once per band it overlapped (measured 2x at chunks=512, 4x at chunks=1024, whole upstream graph re-runs for computed pipelines). Filed #3117, fixed via _stream_row_bands: consecutive tile-rows/strips group into row bands sized by the source chunk-row span (one-chunk halo, #3007 accounting) under streaming_buffer_bytes; each band computes once and tiles/strips are carved from the materialised band. Wide rasters needing column segmentation keep the per-tile-row path. Post-fix per-chunk executions == 1 on the default read->write round trip. 5 new tests (TestRowBandRecompute3117 + _stream_row_bands unit); write/integration/parity suites pass (2195). LOW deferred (no fix): _read_geotiff_gpu_chunked parses header+all IFDs twice at graph build (_backends/gpu.py ~1367-1419, cap check then GDS probe; build-time only). GPU paths validated on-device this pass: eager gpu read returns cupy with parity, dask+GPU chunked read lazy (17 tasks/4 chunks) with parity; GPU writer full materialisation is documented intentional (streaming_buffer_bytes no-op). Read path keeps 50k-task graph cap; dask read probe 4 tasks/chunk. SAFE/IO-bound holds. | Pass 13 (2026-05-20): 1 MEDIUM found and fixed. _nvjpeg_batch_encode (_gpu_decode.py:~L1560) and _nvjpeg2k_batch_encode (~L2958) called cupy.cuda.Device().synchronize() inside the per-tile encode loops, a whole-device fence that blocked every CUDA stream and serialised concurrent work (e.g. predictor encodes on other streams). The decode-side counterpart _try_nvjpeg_batch_decode already used cupy.cuda.Stream.null.synchronize() at L1442; the encoder side was inconsistent. Filed #2212 and fixed both encoders to use Stream.null.synchronize(), scoping the per-tile sync to the default stream the encode/retrieve calls were issued on. nvJPEG / nvJPEG2000 encoders maintain a single shared state per encoder so encodes within a batch are inherently serial; the fix removes the device-wide blocker without changing the API ordering contract. 5 new tests in test_nvjpeg_encode_stream_sync_2212.py (AST checks that neither encoder contains Device().synchronize() inside a for-loop, that both call Stream.null.synchronize() in the loop, and that the decoder reference pattern stays pinned). All 5 new tests + 19 existing related encode/decode tests pass. nvjpeg/nvjpeg2k shared libs not present on this host so end-to-end encode verification is gated; add cuda-unavailable-libs note to re-validate on a host with the RAPIDS conda env. SAFE/IO-bound verdict holds; no change in dask graph cost. Dask probe: 2560x2560 deflate-tiled file via read_geotiff_dask(chunks=256) yields 400 tasks for 100 chunks (4 tasks/chunk), well under the 50K cap. LOW deferred (no fix in this PR): _build_ifd called twice per IFD level in _assemble_standard_layout (_writer.py:1531+1543), _assemble_cog_layout (1582+1625), and the COG overview path (2519+2546+2740) -- the first call's bytes are discarded; only the overflow byte length is used to compute pixel_data_offset. Cost is bounded by IFD count (typically 1-5 overview levels) so absolute impact is minor. Pre-existing pattern. | Pass 12 (2026-05-18): 1 MEDIUM found and fixed. _try_nvjpeg2k_batch_decode at _gpu_decode.py:~L2725-2778 allocated per-tile per-component cupy.empty buffers (N*S round-trips through the cupy memory pool) and called cupy.cuda.Device().synchronize() once per tile, forcing default-stream serialisation that defeats nvJPEG2000's internal pipelining. Filed #2107 and fixed: pre-allocate a single d_comp_pool sized n_tiles*samples*tile_height*pitch under a _check_gpu_memory guard, derive per-tile/per-component views as slab offsets, and replace the per-tile sync with a single batch-end sync. Same pattern as #1659 (_try_nvcomp_from_device_bufs), #1688 (_try_kvikio_read_tiles), #1712 (_nvcomp_batch_compress). 7 new tests in test_nvjpeg2k_single_alloc_2107.py: AST-level structural assertions confirm no cupy.empty inside the for-loop and no Device().synchronize() inside the loop, plus pool/per_tile_comp_bytes presence and _check_gpu_memory guard checks; lib-absent short-circuit; unsupported-dtype cleanup contract; cupy-only pool slab-non-overlap test (gpu-marked). libnvjpeg2k.so not present on this host so the end-to-end nvJPEG2000 decode is gated -- note added to re-validate on a host with the RAPIDS conda env. All 30 jpeg2000/compression tests + 7 new tests pass. SAFE/IO-bound verdict holds (no change in dask graph cost). Dask probe: 4096x4096 deflate-tiled file via read_geotiff_dask(chunks=512) yields 256 tasks for 64 chunks (4 tasks/chunk), well under the 50K cap. | Pass 11 (2026-05-18): 1 MEDIUM found and fixed. _read_strips (_reader.py:~L1972) and _fetch_decode_cog_http_strips (_reader.py:~L2670) decoded strips sequentially in a Python for-loop while the tile counterparts (_read_tiles L2146, _fetch_decode_cog_http_tiles L2898) gated parallel decode on _PARALLEL_DECODE_PIXEL_THRESHOLD via ThreadPoolExecutor. Filed #2100 and fixed: both strip paths now collect jobs, parallel-decode when n_strips > 1 and strip_pixels >= 64K, then place sequentially. Measured (uint16, 4-core): 4096x4096 deflate 130ms->34ms (3.82x), 8192x8192 deflate 531ms->146ms (3.63x), 8192x8192 zstd 211ms->85ms (2.48x), uncompressed 25ms->22ms (1.14x). 5 new tests in test_parallel_strip_decode_2100.py (parallel/serial parity, pool-engaged on multi-strip, serial-path for single-strip, windowed cross-strip read, HTTP COG strip parity). 3998 tests pass; 8 pre-existing failures predating this change (predictor2 BE + size_param_validation_gpu_vrt reference now-private read_to_array attr). SAFE/IO-bound verdict holds. | Pass 10 (2026-05-15): 1 new MEDIUM found and fixed; 2 LOW noted. MEDIUM (_reader.py:2737): _fetch_decode_cog_http_tiles decoded tiles sequentially in a Python for-loop after the concurrent fetch landed (issue #1480). Local _read_tiles parallelises decode whenever tile_pixels >= 64K via ThreadPoolExecutor (_reader.py:2017); the HTTP path was structurally similar but never picked up the same gate, so wide windowed reads of multi-tile COGs left deflate/zstd decode single-threaded. Mirrored the local-path threshold + pool. 5 new tests in test_cog_http_parallel_decode_2026_05_15.py (parallel + serial round-trip correctness, pool-instantiation branch selection above the threshold, single-tile path skips the pool, structural _decode_strip_or_tile call count == n_tiles). All 262 COG/HTTP tests pass; 3162 of 3164 selected geotiff tests pass overall (2 pre-existing failures predating Pass 9 per prior notes -- test_predictor2_big_endian_gpu_1517 references the now-private read_to_array attr, and the test_size_param_validation_gpu_vrt_1776 tile_size=4 validator failure). LOW deferred (no fix in this PR): (1) _block_reduce_2d_gpu (_gpu_decode.py:3142/3163/3189) does bool(mask.any().item()) per overview level when nodata is set, paying one device sync per level; the alternative (unconditional cupy.putmask) always pays the work cost and the short-circuit is correct under the current API. (2) _nvcomp_batch_compress adler32 staging (_gpu_decode.py:2543-2546) issues n_tiles slice-assign kernels into a fresh contig buffer despite all callers passing slices of a single underlying d_tile_buf; an API refactor to accept the source buffer directly would skip the rebuild. SAFE/IO-bound verdict holds. Dask probe: 2560x2560 chunks=256 yields 400 tasks (4 per chunk), well under the 50000 cap. GPU probe: 1024x1024 float32 zstd read returns CuPy-backed in 236 ms with no host round-trip. | Rockout 2026-05-15: LOW filed #1934 -- _apply_nodata_mask_gpu used cupy.where (allocating); switched to cupy.putmask on the already-owned buffer (float path) and on the post-astype float64 buffer (int path). Saves one chunk-sized device allocation per call. 7 new tests in test_apply_nodata_mask_gpu_inplace_1934.py; 52 related nodata tests pass. | Pass 8 (2026-05-12): 1 new MEDIUM found and fixed. _assemble_standard_layout/_assemble_cog_layout returned bytes(bytearray), doubling peak memory transiently during eager writes. Filed #1756, fixed by returning the bytearray directly. Measured: 95 MB uint8 raster peak drops 202 MB -> 107 MB. _write_bytes / parse_header already accepted the buffer protocol so the change is transparent to callers. 6 new tests in test_assemble_layout_no_bytes_copy_1756.py. 2123 existing geotiff tests pass; the 10 unrelated failures (test_no_georef_windowed_coords_1710, test_predictor2_big_endian_gpu_1517) reference the now-private read_to_array attribute (commit 8adb749, issue #1708) and predate this change. SAFE/IO-bound verdict holds. | Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
+glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
+hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
+hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
+interpolate,2026-06-12,SAFE,compute-bound,0,3298,"3 MEDIUM fixed via #3298: kriging no-variance path now uses dual form k0 @ (K_inv @ z_aug) (1.6x, drops (P,N+1) w temp); dask variance computed in one map_blocks pass (was 2.06x); dask+cupy chunk-invariant uploads (idw x/y/z, kriging x/y/z/K_inv) hoisted and cKDTree built once for dask k-nearest. 1 LOW documented, not fixed: _experimental_variogram bins pairs with per-lag boolean masks, O(nlags*N^2) passes where np.bincount would do one. Dask graphs are plain map_blocks, 2 tasks/chunk, no fan-in; memory guards cover host allocations. GPU paths executed on this host (CUDA available)."
+interpolate-kriging,2026-06-04,SAFE,graph-bound,0,2923,"MEDIUM: memory guard used full-grid k0 term on dask templates -> spurious MemoryError (issue #2923, fixed). LOW: _experimental_variogram nlags python loop vectorizable via bincount (~1.2x, pair-array materialization dominates) - doc only. Dask graph clean (2 tasks/chunk); cupy returns device arrays; no .values/.compute/.data.get materialization."
+interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty."
+kde,2026-04-14T12:00:00Z,SAFE,compute-bound,0,,Graph construction serialized per-tile. _filter_points_to_tile scans all points per tile. No HIGH findings.
+mahalanobis,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,False positive. Numpy path materializes by design. Dask path uses lazy reductions + map_blocks.
+mcda,2026-06-10,SAFE,memory-bound,2,3150,"2 HIGH fixed in PR #3158: owa() dask path crashed (da.sort does not exist; memory guard pointed users at the crashing path) and wpm validation ran one compute() per criterion. MEDIUM fixed in PR #3159 (#3151): cupy piecewise + dask+cupy piecewise/categorical raised TypeError via np.asarray on cupy chunks. MEDIUM fixed in PR #3160 (#3152): monte_carlo sensitivity materialized full dask dataset (now chunk-bounded map_blocks, ~8 tasks/chunk at n_samples=1000) and crashed on cupy via per-sample .values; constrain() deep copy dropped. LOW documented, not fixed: fuzzy_overlay builds ones via layers[0]*0+1; _categorical does one full-array pass per mapping key. Verdict SAFE assumes the 3 PRs merge (pre-fix: WILL OOM for MC-on-dask, owa dask broken). GPU paths validated on CUDA host (cupy 13.6)."
+morphology,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+multispectral,2026-05-02,SAFE,compute-bound,0,,"Re-audit 2026-05-02 after PRs 1292 (true_color memory guard) and 1301 (validate_arrays in true_color). Verified SAFE. No HIGH. MEDIUM: da.stack in _true_color_dask/_true_color_dask_cupy at L1702/L1731 creates (1,1,1,1) chunks along band axis (4 bands so impact is minor, scheduling overhead not OOM). LOW: np.zeros((h,w,4)) at L1681 then full overwrite -- np.empty would suffice. All 17 indices use plain map_blocks with no halo; 8192x8192 ndvi graph is 80 tasks, evi/arvi/ebbi 112 tasks."
+normalize,2026-03-31T18:00:00Z,SAFE,compute-bound,0,1124,Boolean indexing replaced with lazy nanmin/nanmax/nanmean/nanstd.
+pathfinding,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. CuPy .get() is required -- A* has no GPU kernel. Per-pixel .compute() is only 2 calls for start/goal validation. seg.values in multi_stop_search collects already-computed results for stitching.
+perlin,2026-03-31T18:00:00Z,WILL OOM,memory-bound,0,,
+polygon_clip,2026-06-10,SAFE,graph-bound,0,3191,"crop=True picked tiny leading edge chunk as rasterize mask size -> 13169-task graph; fixed to max(rc),max(cc) -> 1045 tasks. crop=False/numpy/cupy clean. Cat1-5 clean. GPU+dask+cupy run-validated."
+polygonize,2026-06-12,RISKY,compute-bound,0,3303,"Pass 3 (2026-06-12): re-audit after #2817/#2913/#3041. 0 HIGH. 1 MEDIUM fixed (#3303): _compute_region_value_ranges ran a pure-Python per-pixel loop (95% of float chunk time; 0.283s of 0.299s on 1024x1024, float chunks ~30x int) and re-ran _calculate_regions on an already-labelled block; moved to jitted _region_ranges_scan + _polygonize_numpy_regions label reuse (0.299s -> 0.015s/chunk). Side fix: w_match/s_match flags were always-truthy (_is_close numba overload generator called from pure Python returns impl function); output-neutral by chunk geometry, now computed correctly in jit. Cat1/2 clean (dask.compute batching is the documented #2673 design). Cat3 validated on GPU: cupy int/float + dask+cupy run end-to-end, single documented transfer, no round-trip. Cat4/5 LOW unchanged: _calculate_regions_cupy per-unique-value labeling (low impact); per-polygon Python classify loop in _polygonize_chunk dominates only on pathological many-polygon chunks (788K polys -> 7.8s). Cat6 RISKY unchanged: driver accumulates O(total polygons); 32-chunk batches bound transient peak. 527 polygonize tests + 40 new pass."
+proximity,2026-06-09,RISKY,graph-bound,0,3103,"Pass 2 (2026-06-09): re-audit after 16 fix commits since 2026-03-31. 0 HIGH, 2 MEDIUM found and fixed: (1) #3103/PR #3126 line-sweep @ngjit closure inside _process recompiled per call (~0.42s constant overhead; 10x10 warm call 0.44s->1ms after module-level hoist with explicit args, 1000x1000 0.49s->35ms); (2) #3132/PR #3137 dask xs/ys coordinate grids built via da.tile/da.repeat+rechunk cost ~185 tasks/chunk with the ys term scaling O(raster height) (~4.3 tasks/row, 44K tasks at 10240 rows); chunk-aligned da.broadcast_to gives identical values, bounded graph 18535->5554 tasks (3.3x) on 2560^2/256 chunks; regression test bounds tasks/chunk<80 (old 100.4, new 58.7) + ragged-chunk parity. LOW not fixed: zeros+fill(-1) row buffers in line-sweep; numpy backend materializes full float64 xs/ys grids (guarded since #1111); unbounded KDTree streaming count pass computes chunks on driver by design (gh-879). GPU validated on CUDA host: cupy 1024^2 proximity 6ms device-resident with exact numpy parity, dask+cupy bounded parity exact, _proximity_cuda_kernel 56 regs/thread (no register pressure). _halo_depth python loop measured 58ms at 100K coords - not a finding. Verdict RISKY (was WILL OOM): unbounded paths either guarded (MemoryError at 80% mem) or stream via kdtree; bounded map_overlap peak scales with chunk size."
+rasterize,2026-06-09,SAFE,compute-bound,0,3107,"Pass 4 (2026-06-09): re-audit found 2 MEDIUM Cat-4 allocation findings, 0 HIGH. (a) all four backends return via astype(dtype) which copies the float64 work buffer even when dtype is already float64 (the default) -- _run_numpy L1237, _run_cupy L2211, _rasterize_tile_numpy L2460, _rasterize_tile_cupy L2688; fix astype(dtype, copy=False). (b) CPU paths allocate order as full-raster int64 (8 B/px) for every merge mode but only first/last predicates read it; for _should_write_any merges (max/min/sum/count, user callables) an int8 buffer suffices (numba wraps the dead int64 store) -- _run_numpy L1188, _rasterize_tile_numpy L2420. tracemalloc 4000x4000 numpy merge=sum: peak 25 B/px -> 10 B/px expected (out 8 + written 1 + order 1); merge=last 25 -> 17 B/px. Filed #3107, fixed via deep-sweep rockout. GPU validated on host (CUDA available): cupy 512x512 last/sum/max returns cupy.ndarray with CPU parity, dask+cupy sum parity True, no host round trip. Dask graph probe: 2560x2560 chunks=256 -> 400 tasks / 100 chunks (4.0 tasks/chunk, unchanged). LOW (not fixed, documented): _extract_polygon_boundary_segments int variant L702 is dead code (only the _float variant is called). SAFE/compute-bound: per-tile buffers scale with chunk size; scanline/burn JIT kernels dominate runtime. | Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
+reproject,2026-06-12,SAFE,compute-bound,0,3267 3268,"Pass 7 (2026-06-12, deep-sweep): 0 HIGH, 2 MEDIUM found and fixed. #3267: in-memory numpy path held ~7 output-sized float64 temporaries and dask promotion keyed on input size only -- measured 8.4 GB peak RSS for a 1.15 GB output (7.3x); fix computes the grid first, promotes on output size too (mirrors merge #3048 pattern), re-applies the pixel guard for materializing paths, and fixes the 3-D chunks tuple in the dask wrap (post-fix peak 0.5 GB, lazy). #3268: multi-band cupy CPU-fallback transform path re-uploaded local_row/local_col per band via cp.asarray inside _resample_cupy_native (measured 12 H2D uploads for 6 bands, 26.1 MB vs 4.4 MB needed); fix hoists the device conversion before the band loop. LOW (documented, not fixed): _resample_cupy_native redundant copy+scan when the caller pre-converted nodata (+149% on the resample step, hits _reproject_dask_cupy fast path with non-NaN nodata); geoid_height_raster allocates full HxW meshgrid x2 plus output from dims (no strips, no dask path). Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, single blockwise layer, lazy); merge 2 inputs -> 16 tasks. GPU validated on host (CUDA available): cupy 2048^2 4326->3857 in 23 ms on-device; dask+cupy eager fast path matches in-memory exactly. SAFE/compute-bound holds. | Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
+resample,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. GPU-CPU-GPU round-trip only in aggregate path for non-integer scale factors. Interpolation (nearest/bilinear/cubic) stays on GPU. No GPU kernel exists for irregular per-pixel binning.
+sieve,2026-04-14T12:00:00Z,WILL OOM,memory-bound,0,false-positive,False positive. Memory guards already in place on both dask paths. CCL is inherently global — documented limitation. CuPy CPU fallback is deliberate and documented.
+sky_view_factor,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+slope,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+surface_distance,2026-03-31T18:00:00Z,SAFE,memory-bound,0,1128,Memory guard added to dd_grid allocation.
+terrain,2026-03-31T18:00:00Z,RISKY,compute-bound,0,,
+terrain_metrics,2026-03-31T18:00:00Z,SAFE,memory-bound,0,,
+viewshed,2026-04-05T12:00:00Z,SAFE,memory-bound,0,fixed-in-tree,Tier B memory estimate tightened from 280 to 368 bytes/pixel (accounts for lexsort double-alloc + computed raster). astype copy=False avoids needless float64 copy.
+visibility,2026-06-10,RISKY,compute-bound,0,3185,"cumulative_viewshed recomputed dask source per observer (fixed #3185: materialise once when no max_distance); graph grows ~64 tasks/observer with N; line_of_sight single-transect cheap; MEDIUM count temp .astype per observer (LOW, not fixed)"
+worley,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
+zonal,2026-05-27,SAFE,compute-bound,0,2526,"Pass 2 (2026-05-27): re-audit identified 3 MEDIUM findings. (1) zonal_apply 3D dask path: da.stack(layers, axis=2) left output chunks at size 1 along axis 2 -- filed #2526 and fixed by rechunking back to values_data.chunks[2] in _apply_dask_numpy (zonal.py:1691) and _apply_dask_cupy (zonal.py:1731). Confirmed via graph probe: 256x256 raster chunks=(64,64) 3 bands previously yielded chunks[2]=(1,1,1); now (3,). 1 new test (test_apply_dask_3d_axis2_rechunked_2526). 126 existing zonal tests pass. (2) _stats_cupy (zonal.py:588-608): per-zone x per-stat Python loop with cupy.float_(result) forces O(n_zones * n_stats) GPU<->CPU sync points; not fixed in this pass (CUDA-native rewrite needed, larger refactor). (3) _parallel_variance @delayed reduce iterates over all blocks in driver memory; for very large block counts the single-task merge becomes scheduler-bound but is not OOM since per-block arrays are O(n_zones). Not fixed (algorithmic refactor needed). Dask graph probe: stats(7 stats) on 2560x2560 chunks=256 -> 4449 tasks (44/chunk); stats(mean only) -> 823 tasks (8/chunk); crosstab -> 304 (3/chunk); hypsometric_integral -> 300 (3/chunk). All under 50K cap. SAFE/compute-bound verdict holds. | Fixed-in-tree 2026-04-16: rewrote hypsometric_integral dask path. Eliminated double-compute (_unique_finite_zones removed, each block discovers own zones). Replaced np.stack (O(n_blocks * n_zones) scheduler memory) with streaming dict-merge (O(n_zones)). 29 existing tests pass."
diff --git a/benchmarks/benchmarks/cost_distance.py b/benchmarks/benchmarks/cost_distance.py
index 910f2d726..a24f72b00 100644
--- a/benchmarks/benchmarks/cost_distance.py
+++ b/benchmarks/benchmarks/cost_distance.py
@@ -21,5 +21,16 @@ def setup(self, nx, type):
             friction.data = np.abs(friction.data) + 0.1
         self.friction = friction
 
+        # Pick a finite max_cost whose pixel radius stays well inside one
+        # chunk (chunks are ny//2 x nx//2), so the dask path exercises the
+        # bounded map_overlap branch rather than the unbounded iterative
+        # one. radius_px = max_cost / (f_min * cellsize); friction >= 0.1
+        # so using 0.1 keeps the true radius at or below the target.
+        cellsize = min(360.0 / (nx - 1), 180.0 / (ny - 1))
+        self.max_cost = 10 * 0.1 * cellsize
+
     def time_cost_distance(self, nx, type):
         cost_distance(self.agg, self.friction)
+
+    def time_cost_distance_bounded(self, nx, type):
+        cost_distance(self.agg, self.friction, max_cost=self.max_cost)
diff --git a/xrspatial/cost_distance.py b/xrspatial/cost_distance.py
index a080b06cd..20015e4e6 100644
--- a/xrspatial/cost_distance.py
+++ b/xrspatial/cost_distance.py
@@ -1160,7 +1160,15 @@ def _cost_distance_dask(source_da, friction_da, cellsize_x, cellsize_y,
 
     pad = int(max_radius + 1) if np.isfinite(max_radius) else max_dim
 
-    if not np.isfinite(max_radius) or pad >= height or pad >= width:
+    # map_overlap's depth must not exceed the chunk size; a larger depth makes
+    # dask rechunk toward bigger (eventually single) blocks, defeating
+    # out-of-core processing.  Compare pad against the chunk size, not the full
+    # array dimensions (matches the dask+cupy guard).
+    max_chunk_y = max(source_da.chunks[0])
+    max_chunk_x = max(source_da.chunks[1])
+
+    if (not np.isfinite(max_radius)
+            or pad >= max_chunk_y or pad >= max_chunk_x):
         # Use iterative tile Dijkstra — bounded memory, no single-chunk rechunk
         import warnings
         warnings.warn(
diff --git a/xrspatial/tests/test_cost_distance.py b/xrspatial/tests/test_cost_distance.py
index 309a515e2..6766d50aa 100644
--- a/xrspatial/tests/test_cost_distance.py
+++ b/xrspatial/tests/test_cost_distance.py
@@ -710,6 +710,40 @@ def _tracking_rechunk(self, *args, **kwargs):
         "cost_distance rechunked to a single chunk (OOM risk)"
 
 
+@pytest.mark.skipif(da is None, reason="dask not installed")
+def test_bounded_large_radius_no_chunk_collapse():
+    """Finite max_cost with radius > chunk size must not collapse chunks.
+
+    Regression for the #880-class OOM on the bounded path: when the pad
+    radius exceeds the chunk size (but is below the full array dimension),
+    map_overlap would silently rechunk toward a single block. The branch
+    must instead route to the iterative tile Dijkstra.
+    """
+    source = np.zeros((100, 100))
+    source[50, 50] = 1.0
+
+    friction_data = np.ones((100, 100))
+
+    raster_np = _make_raster(source, backend='numpy')
+    friction_np = _make_raster(friction_data, backend='numpy')
+    result_np = cost_distance(raster_np, friction_np, max_cost=95.0)
+
+    raster = _make_raster(source, backend='dask+numpy', chunks=(10, 10))
+    friction = _make_raster(friction_data, backend='dask+numpy', chunks=(10, 10))
+
+    # pad = 96 > chunk size 10 but < full dim 100
+    with pytest.warns(UserWarning, match="iterative tile Dijkstra"):
+        result = cost_distance(raster, friction, max_cost=95.0)
+
+    # The result must keep more than one chunk per axis (no collapse).
+    assert result.data.npartitions > 1, \
+        "bounded large-radius path collapsed to a single chunk (OOM risk)"
+
+    np.testing.assert_allclose(
+        _compute(result), result_np.data, equal_nan=True, atol=1e-4,
+    )
+
+
 # -----------------------------------------------------------------------
 # Memory guard on numpy path (Issue #1252)
 # -----------------------------------------------------------------------