from_template: guard dask grids against task-graph explosion (#3557) by brendancol · Pull Request #3559 · xarray-contrib/xarray-spatial

brendancol · 2026-06-27T16:55:29Z

What

#3556 exempted dask grids from the 500-million-cell cap, since a lazy da.full grid never materializes. The data stays lazy, but the task graph does not: with a fixed chunk size the block count grows with the cell count, so a typo-level resolution like from_template('conus', resolution=1, chunks=512) builds a graph big enough to bog down the client during construction, even with nothing computed.

This adds the guard #3556 deferred.

It keys on the estimated block count, not the cell count, so a legitimate large grid with sensible chunks still passes. The estimate runs through dask's own normalize_chunks, so 'auto', int, tuple, and dict chunk forms all resolve to the block grid da.full would build. That covers the chunks='auto' open question from the issue: dask picks the shape and the count follows.

The threshold is 1,000,000 chunks. The new_england@10m repro is ~26k blocks and 'auto' sizing stays well under it, so there is wide headroom; the runaway fixed-chunk case lands in the tens of millions.

It raises rather than warns. The failure mode is graph construction itself hanging, so a warn-then-build would still hang. The guard raises from the cheap estimate before da.full ever constructs the graph.

Backend coverage

The guard looks only at shape and chunks, ahead of array construction, so it covers all four backends the same way: numpy and cupy (promoted to their dask variants when chunks is supplied), dask+numpy, and dask+cupy.

Test plan

runaway fixed chunks raises before building the graph (issue repro)
'auto' chunks on a fine resolution stays under the cap and builds
estimate matches da.full(...).npartitions across chunk forms
legitimate large grid (new_england@10m, ~26k blocks) still passes
full test_templates.py suite green (260 passed); cupy and dask+cupy paths run locally

https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81

Supplying a fixed chunk size with a typo-level resolution builds one task per chunk, so the block count grows with the cell count and the graph can bog down the client during construction even though no data is computed. Guard on the estimated block count (via dask's own normalize_chunks, so 'auto'/int/tuple/dict all resolve to the real grid) rather than cell count, and raise when it exceeds 1,000,000 chunks. A legitimate large grid with sensible chunks (new_england@10m ~26k) or 'auto' sizing still passes; the runaway fixed-chunk case raises before da.full builds the graph. Claude-Session: https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81

brendancol

PR Review: from_template task-graph guard (#3559)

Targeted guard, scoped to the dask path, with the design questions from the issue answered in the code and PR body. No blockers.

Blockers

None.

Suggestions

xrspatial/templates.py:33 -- the comment says the new_england@10m repro is "~12k blocks", but it is actually 26,425 (~26k). The test comment at test_templates.py:454 and the PR body both say ~26k, so this number is the odd one out. Fix it to ~26k so the headroom rationale reads correctly.

Nits

xrspatial/tests/test_templates.py:421 -- pytest.raises(ValueError, match="chunk") is loose: the eager cell-cap message also contains the word "chunk" (it suggests "pass chunks=..."). The test does hit the dask path here because chunks=512 is supplied, so it is correct as written, but matching on something specific to the chunk-count cap (e.g. "chunk limit" or the count) would keep it from silently passing on the wrong error if the paths ever shift.
Optional: every raising test reaches the guard through the numpy->dask promotion path. An explicit backend="dask+numpy", chunks=512 case would pin the non-promotion path too. Low value since the guard is backend-agnostic and runs before array construction.

What looks good

Guarding on estimated block count via dask's own normalize_chunks is the right call -- it answers the chunks='auto' open question for free, since dask sizes the blocks and the count follows. test_chunk_count_estimate_matches_dask pins the estimate to the real npartitions across int, auto, and tuple forms.
Raise-before-build is correct for the stated failure mode. A warn would fire and then still hang in graph construction.
The estimate is cheap even in the runaway case (only the block-size tuples are built, never the graph), so the guard does not reintroduce the very cost it prevents.
effective_chunks is computed once and reused for both the estimate and _make_data, so the guard and the real array cannot disagree on the chunking.

Checklist

Behaviour matches the issue's design decisions (count not cells, raise not warn, auto handled)
All four backends covered (numpy/cupy promoted, dask+numpy, dask+cupy)
NaN fill unchanged
Edge cases covered (auto under cap, runaway over cap, estimate parity, legit large grid)
No premature materialization; estimate stays lazy
Benchmark -- not needed (guard is O(blocks-per-axis), not a compute path)
README feature matrix -- not applicable (no new public function)
Docstring updated with the new raise behaviour

…3557) - Correct the _MAX_CHUNKS comment: new_england@10m is ~26k blocks, not ~12k (the test comment and PR body already said ~26k). - Tighten the runaway-guard tests to match "chunk limit" so they cannot pass on the eager cell-cap message, which also mentions chunks. - Add an explicit dask+numpy (non-promotion) raise test. Claude-Session: https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81

brendancol

Follow-up review (#3559)

The three findings from the first pass are resolved in 980e76b:

The _MAX_CHUNKS comment now says ~26k blocks for new_england@10m, matching the test comment and PR body (templates.py:33).
Both runaway-guard tests now match "chunk limit", so they cannot pass on the eager cell-cap message (test_templates.py).
Added test_explicit_dask_backend_chunk_count_raises to pin the non-promotion backend="dask+numpy" path.

No new findings. Affected tests green (13 passed), flake8 clean. Nothing left blocking.

brendancol commented Jun 27, 2026

View reviewed changes

brendancol merged commit 242d454 into main Jun 27, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

from_template: guard dask grids against task-graph explosion (#3557)#3559

from_template: guard dask grids against task-graph explosion (#3557)#3559
brendancol merged 2 commits into
mainfrom
issue-3557

brendancol commented Jun 27, 2026

Uh oh!

brendancol left a comment

Uh oh!

brendancol left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

brendancol commented Jun 27, 2026

What

Backend coverage

Test plan

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review: from_template task-graph guard (#3559)

Blockers

Suggestions

Nits

What looks good

Checklist

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

Follow-up review (#3559)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant