from_template: guard dask grids against task-graph explosion (#3557)#3559
Merged
Conversation
Supplying a fixed chunk size with a typo-level resolution builds one task per chunk, so the block count grows with the cell count and the graph can bog down the client during construction even though no data is computed. Guard on the estimated block count (via dask's own normalize_chunks, so 'auto'/int/tuple/dict all resolve to the real grid) rather than cell count, and raise when it exceeds 1,000,000 chunks. A legitimate large grid with sensible chunks (new_england@10m ~26k) or 'auto' sizing still passes; the runaway fixed-chunk case raises before da.full builds the graph. Claude-Session: https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81
brendancol
commented
Jun 27, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
PR Review: from_template task-graph guard (#3559)
Targeted guard, scoped to the dask path, with the design questions from the issue answered in the code and PR body. No blockers.
Blockers
None.
Suggestions
xrspatial/templates.py:33-- the comment says the new_england@10m repro is "~12k blocks", but it is actually 26,425 (~26k). The test comment attest_templates.py:454and the PR body both say ~26k, so this number is the odd one out. Fix it to ~26k so the headroom rationale reads correctly.
Nits
xrspatial/tests/test_templates.py:421--pytest.raises(ValueError, match="chunk")is loose: the eager cell-cap message also contains the word "chunk" (it suggests "pass chunks=..."). The test does hit the dask path here becausechunks=512is supplied, so it is correct as written, but matching on something specific to the chunk-count cap (e.g."chunk limit"or the count) would keep it from silently passing on the wrong error if the paths ever shift.- Optional: every raising test reaches the guard through the numpy->dask promotion path. An explicit
backend="dask+numpy", chunks=512case would pin the non-promotion path too. Low value since the guard is backend-agnostic and runs before array construction.
What looks good
- Guarding on estimated block count via dask's own
normalize_chunksis the right call -- it answers thechunks='auto'open question for free, since dask sizes the blocks and the count follows.test_chunk_count_estimate_matches_daskpins the estimate to the realnpartitionsacross int, auto, and tuple forms. - Raise-before-build is correct for the stated failure mode. A warn would fire and then still hang in graph construction.
- The estimate is cheap even in the runaway case (only the block-size tuples are built, never the graph), so the guard does not reintroduce the very cost it prevents.
effective_chunksis computed once and reused for both the estimate and_make_data, so the guard and the real array cannot disagree on the chunking.
Checklist
- Behaviour matches the issue's design decisions (count not cells, raise not warn, auto handled)
- All four backends covered (numpy/cupy promoted, dask+numpy, dask+cupy)
- NaN fill unchanged
- Edge cases covered (auto under cap, runaway over cap, estimate parity, legit large grid)
- No premature materialization; estimate stays lazy
- Benchmark -- not needed (guard is O(blocks-per-axis), not a compute path)
- README feature matrix -- not applicable (no new public function)
- Docstring updated with the new raise behaviour
…3557) - Correct the _MAX_CHUNKS comment: new_england@10m is ~26k blocks, not ~12k (the test comment and PR body already said ~26k). - Tighten the runaway-guard tests to match "chunk limit" so they cannot pass on the eager cell-cap message, which also mentions chunks. - Add an explicit dask+numpy (non-promotion) raise test. Claude-Session: https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81
brendancol
commented
Jun 27, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
Follow-up review (#3559)
The three findings from the first pass are resolved in 980e76b:
- The
_MAX_CHUNKScomment now says ~26k blocks for new_england@10m, matching the test comment and PR body (templates.py:33). - Both runaway-guard tests now match
"chunk limit", so they cannot pass on the eager cell-cap message (test_templates.py). - Added
test_explicit_dask_backend_chunk_count_raisesto pin the non-promotionbackend="dask+numpy"path.
No new findings. Affected tests green (13 passed), flake8 clean. Nothing left blocking.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3557
What
#3556 exempted dask grids from the 500-million-cell cap, since a lazy
da.fullgrid never materializes. The data stays lazy, but the task graph does not: with a fixed chunk size the block count grows with the cell count, so a typo-level resolution likefrom_template('conus', resolution=1, chunks=512)builds a graph big enough to bog down the client during construction, even with nothing computed.This adds the guard #3556 deferred.
It keys on the estimated block count, not the cell count, so a legitimate large grid with sensible chunks still passes. The estimate runs through dask's own
normalize_chunks, so'auto', int, tuple, and dict chunk forms all resolve to the block gridda.fullwould build. That covers thechunks='auto'open question from the issue: dask picks the shape and the count follows.The threshold is 1,000,000 chunks. The new_england@10m repro is ~26k blocks and
'auto'sizing stays well under it, so there is wide headroom; the runaway fixed-chunk case lands in the tens of millions.It raises rather than warns. The failure mode is graph construction itself hanging, so a warn-then-build would still hang. The guard raises from the cheap estimate before
da.fullever constructs the graph.Backend coverage
The guard looks only at shape and chunks, ahead of array construction, so it covers all four backends the same way: numpy and cupy (promoted to their dask variants when
chunksis supplied), dask+numpy, and dask+cupy.Test plan
'auto'chunks on a fine resolution stays under the cap and buildsda.full(...).npartitionsacross chunk formstest_templates.pysuite green (260 passed); cupy and dask+cupy paths run locallyhttps://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81