Skip to content

from_template: guard dask grids against task-graph explosion (#3557)#3559

Merged
brendancol merged 2 commits into
mainfrom
issue-3557
Jun 27, 2026
Merged

from_template: guard dask grids against task-graph explosion (#3557)#3559
brendancol merged 2 commits into
mainfrom
issue-3557

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #3557

What

#3556 exempted dask grids from the 500-million-cell cap, since a lazy da.full grid never materializes. The data stays lazy, but the task graph does not: with a fixed chunk size the block count grows with the cell count, so a typo-level resolution like from_template('conus', resolution=1, chunks=512) builds a graph big enough to bog down the client during construction, even with nothing computed.

This adds the guard #3556 deferred.

It keys on the estimated block count, not the cell count, so a legitimate large grid with sensible chunks still passes. The estimate runs through dask's own normalize_chunks, so 'auto', int, tuple, and dict chunk forms all resolve to the block grid da.full would build. That covers the chunks='auto' open question from the issue: dask picks the shape and the count follows.

The threshold is 1,000,000 chunks. The new_england@10m repro is ~26k blocks and 'auto' sizing stays well under it, so there is wide headroom; the runaway fixed-chunk case lands in the tens of millions.

It raises rather than warns. The failure mode is graph construction itself hanging, so a warn-then-build would still hang. The guard raises from the cheap estimate before da.full ever constructs the graph.

Backend coverage

The guard looks only at shape and chunks, ahead of array construction, so it covers all four backends the same way: numpy and cupy (promoted to their dask variants when chunks is supplied), dask+numpy, and dask+cupy.

Test plan

  • runaway fixed chunks raises before building the graph (issue repro)
  • 'auto' chunks on a fine resolution stays under the cap and builds
  • estimate matches da.full(...).npartitions across chunk forms
  • legitimate large grid (new_england@10m, ~26k blocks) still passes
  • full test_templates.py suite green (260 passed); cupy and dask+cupy paths run locally

https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81

Supplying a fixed chunk size with a typo-level resolution builds one task
per chunk, so the block count grows with the cell count and the graph can
bog down the client during construction even though no data is computed.

Guard on the estimated block count (via dask's own normalize_chunks, so
'auto'/int/tuple/dict all resolve to the real grid) rather than cell count,
and raise when it exceeds 1,000,000 chunks. A legitimate large grid with
sensible chunks (new_england@10m ~26k) or 'auto' sizing still passes; the
runaway fixed-chunk case raises before da.full builds the graph.

Claude-Session: https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: from_template task-graph guard (#3559)

Targeted guard, scoped to the dask path, with the design questions from the issue answered in the code and PR body. No blockers.

Blockers

None.

Suggestions

  • xrspatial/templates.py:33 -- the comment says the new_england@10m repro is "~12k blocks", but it is actually 26,425 (~26k). The test comment at test_templates.py:454 and the PR body both say ~26k, so this number is the odd one out. Fix it to ~26k so the headroom rationale reads correctly.

Nits

  • xrspatial/tests/test_templates.py:421 -- pytest.raises(ValueError, match="chunk") is loose: the eager cell-cap message also contains the word "chunk" (it suggests "pass chunks=..."). The test does hit the dask path here because chunks=512 is supplied, so it is correct as written, but matching on something specific to the chunk-count cap (e.g. "chunk limit" or the count) would keep it from silently passing on the wrong error if the paths ever shift.
  • Optional: every raising test reaches the guard through the numpy->dask promotion path. An explicit backend="dask+numpy", chunks=512 case would pin the non-promotion path too. Low value since the guard is backend-agnostic and runs before array construction.

What looks good

  • Guarding on estimated block count via dask's own normalize_chunks is the right call -- it answers the chunks='auto' open question for free, since dask sizes the blocks and the count follows. test_chunk_count_estimate_matches_dask pins the estimate to the real npartitions across int, auto, and tuple forms.
  • Raise-before-build is correct for the stated failure mode. A warn would fire and then still hang in graph construction.
  • The estimate is cheap even in the runaway case (only the block-size tuples are built, never the graph), so the guard does not reintroduce the very cost it prevents.
  • effective_chunks is computed once and reused for both the estimate and _make_data, so the guard and the real array cannot disagree on the chunking.

Checklist

  • Behaviour matches the issue's design decisions (count not cells, raise not warn, auto handled)
  • All four backends covered (numpy/cupy promoted, dask+numpy, dask+cupy)
  • NaN fill unchanged
  • Edge cases covered (auto under cap, runaway over cap, estimate parity, legit large grid)
  • No premature materialization; estimate stays lazy
  • Benchmark -- not needed (guard is O(blocks-per-axis), not a compute path)
  • README feature matrix -- not applicable (no new public function)
  • Docstring updated with the new raise behaviour

…3557)

- Correct the _MAX_CHUNKS comment: new_england@10m is ~26k blocks, not ~12k
  (the test comment and PR body already said ~26k).
- Tighten the runaway-guard tests to match "chunk limit" so they cannot pass
  on the eager cell-cap message, which also mentions chunks.
- Add an explicit dask+numpy (non-promotion) raise test.

Claude-Session: https://claude.ai/code/session_01YUNyNJkagqnTT3qEc7bt81

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review (#3559)

The three findings from the first pass are resolved in 980e76b:

  • The _MAX_CHUNKS comment now says ~26k blocks for new_england@10m, matching the test comment and PR body (templates.py:33).
  • Both runaway-guard tests now match "chunk limit", so they cannot pass on the eager cell-cap message (test_templates.py).
  • Added test_explicit_dask_backend_chunk_count_raises to pin the non-promotion backend="dask+numpy" path.

No new findings. Affected tests green (13 passed), flake8 clean. Nothing left blocking.

@brendancol brendancol merged commit 242d454 into main Jun 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

from_template: guard against task-graph explosion on dask with very fine resolution

1 participant