[Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap #655
Conversation
Force-pushed af6d08e to 886abbe
Suggested change under `### Evaluation paths`:

Old: The compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure:

New: On GPU backends the compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure; on CPU the sequential path is always taken since the runtime's CPU max-reducer is single-threaded and the parallel dispatch's per-launch setup would be pure overhead:
There was a problem hiding this comment.
I think I will reformulate a bit. Mentioning parallel vs. sequential before introducing them is not good.
There was a problem hiding this comment.
Done! Sorry I amended by mistake :/
Edit: Fixed.
Suggested change:

Old: - **Parallel:** the maximum is computed with a tiny parallel reduction kernel for efficiency. The reducer accepts a common subset of bound expressions:

New: - **Parallel:** the maximum is computed with a tiny parallel reduction kernel on the GPU for efficiency. The reducer accepts a common subset of bound expressions:
There was a problem hiding this comment.
I would argue that "on a GPU" is redundant. But if we do want to include it, it should be the indefinite article, not the definite article, I feel.
[x] doc looks ok

Will check once agent jobs run, and also ponder whether I feel we need to run Genesis unit tests and/or benchmarks at that point.
Force-pushed 59efed0 to a54acb1
This PR only modifies adstack files.
…ap; simplify cache invalidation to gen counter
…`_parallel` to `runtime_eval_adstack_max_reduce`
Force-pushed a54acb1 to 93b0dbc
[x] line diff report looks ok
Some non-adstack files modified => let's get Genesis unit tests and benchmarks, please.

Adstack: skip max-reducer recognizer on CPU + lift host-eval cap; simplify cache invalidation to gen counter
TL;DR

- Skip `recognize_adstack_max_reducer_specs` on CPU. The CPU runtime's `runtime_eval_adstack_max_reduce_serial` is a single-thread loop, so the dispatch (params blob encode, body bytecode encode, observation snapshot, JIT call) is per-launch setup overhead with no parallelism to amortize. The host evaluator does the same serial walk without any of the setup. Measured +25% on the CPU rigid-step auto-diff bench (Apple M5).
- Lift the `MaxOverRange` cap from `1 << 24` to `UINT32_MAX` on CPU (matching the runtime CPU max-reducer's natural range). Walk-time observation memory drops from `O(N x body leaves)` to `O(body leaves)` via a structural pre-walk that registers one observation per static `FieldLoad`/`ExternalRead`/`ExternalShape` leaf in the body subtree; the per-iteration evaluation then runs without pushing observations. At the lifted cap (`N` near `UINT32_MAX`) with a one-leaf body, the observation buffer drops from ~340 GB to ~100 B, six orders of magnitude. The pre-walk is independent of which iterations execute, so a nested `MaxOverRange` whose body is conditionally visited (e.g., empty inner range on some outer iterations) still gets its leaves registered.
- Simplify cache invalidation to a gen counter: replay freshness is a gen-counter check for `FieldLoadObs`/`ExternalReadObs` and a value mismatch for `ExternalShapeObs`. Empirically the slow path never fires on real workloads (Genesis substep bench: 58M+ cache lookups, 0 gen-counter advances).

Why
Three coupled motivations:
1. **Recognizer + dispatch overhead is pure cost on CPU.** The CPU max-reducer is `runtime_eval_adstack_max_reduce_serial`, a single-thread serial loop that does the same work as the host evaluator's `MaxOverRange` walk in `program/adstack/eval.cpp`. Dispatching it pays per-launch setup cost (params blob encode, body bytecode encode, observation bookkeeping, JIT call) without compute parallelism to offset the cost.
2. **The `1 << 24` host-eval cap blocks legitimate workloads on CPU.** With the recognizer skipped, the host evaluator now handles every shape that previously dispatched, including shapes the runtime CPU max-reducer would walk unbounded. Matching that range needs the cap lifted to `UINT32_MAX`. The naive lift would OOM the host: each `SizeExprReadObservation` is ~85 B and the original recording grew the vector by one entry per leaf per iteration, so a single `MaxOverRange` whose end approaches `UINT32_MAX` would allocate hundreds of GB before the walk finishes (concretely: with one body leaf at `N = UINT32_MAX`, ~340 GB of observations, a guaranteed OOM-kill on any host). Pre-walking the body structurally and recording one observation per static leaf bounds the observation vector at `O(body leaves)`, independent of `N`, so the lift is memory-safe at any iteration count.
3. **The cache's value-comparison fallback was dead code.** `replay_one_observation` returned `obs.observed_value` verbatim on gen-counter match (fast path) and re-read the cell on gen-counter mismatch (slow path). The slow path existed to "save" the cache when a buffer was written but the read cell happened to be unchanged. Empirically it never fires: a steady-state Genesis CPU substep run with 58M+ cache lookups records 0 invalidations, meaning every cache entry's gen counter still matched and the slow path was never reached. Removing it drops ~50 lines of replay logic and the per-cell `observed_value` recording.

Mechanism end-to-end
1. Codegen LLVM gate
In `quadrants/codegen/llvm/codegen_llvm.cpp::TaskCodeGenLLVM::finalize_offloaded_task_function`, `OffloadedTask::ad_stack.max_reducer_specs` stays empty on CPU. `dispatch_max_reducers` in the LLVM launcher gates on the existing `any_max_reducer_task` predicate, so the entire dispatch loop is skipped automatically.

2. Host-eval cap conditional
`quadrants/program/adstack/eval.cpp::evaluate_node`'s `MaxOverRange` arm:
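A minimal sketch of the cap selection with the PR's constants; the Python helper and arch strings are illustrative stand-ins for the C++ arm, not the actual code:

```python
# Illustrative model of the per-backend MaxOverRange cap. The helper
# name and arch strings are hypothetical; the constants are the PR's.
UINT32_MAX = 2**32 - 1

def max_over_range_cap(arch: str) -> int:
    if arch == "cpu":
        # Host evaluator matches the serial CPU max-reducer's
        # natural uint32 range.
        return UINT32_MAX
    # GPU backends: keep the on-device sizer inside the driver's
    # TDR (timeout detection and recovery) window.
    return 1 << 24

print(max_over_range_cap("cpu"))   # 4294967295
print(max_over_range_cap("cuda"))  # 16777216
```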
UINT32_MAX. GPU stays at1 << 24to keep the on-device sizer within the driver's TDR window.3. Structural pre-walk for cache observations
Before the iteration loop runs, walk the body subtree once to register every static leaf with the cache. New helper
`enumerate_static_observations(expr, body_node_idx, prog, ctx, reads)` recurses through the IR structurally (visits every node regardless of any enclosing `MaxOverRange`'s `begin`/`end` semantics) and pushes one observation per `FieldLoad`/`ExternalTensorRead`/`ExternalTensorShape` leaf. The per-iteration evaluation then runs with `reads = nullptr`. The structural walk is independent of which iterations actually execute. A nested `MaxOverRange` whose body is conditionally visited (e.g., empty inner range on some outer iterations) still gets its leaves registered, so a subsequent launch where that range becomes non-empty correctly invalidates on a buffer mutation.

4. Cache replay simplification
`quadrants/program/adstack/cache.cpp::replay_observation_is_fresh` (renamed from `replay_one_observation`): returns `bool` instead of `int64_t`. `try_size_expr_cache_hit`, `try_max_reducer_cache_hit`, and `try_spirv_bytecode_cache_hit` switch from `now != obs.observed_value` to `!replay_observation_is_fresh(...)`. The deref slow path is gone. `populate_max_reducer_body_observations` drops the `INT64_MIN` sentinel write to `observed_value` (it was the stand-in that made the now-removed deref path produce a self-equal cache hit on gen match). `evaluate_field_load` and `evaluate_external_tensor_read` drop their `obs.observed_value = v` writes since the field is unused for these kinds. `ExternalTensorShape` keeps the write because shapes have no gen counter and fall back to value comparison.

5. Test arch markers
Five existing tests in `tests/python/test_adstack.py` (`test_max_reducer_pins_stride_for_oversized_axis`, `test_max_reducer_dispatch_counts_advance_on_input_mutation`, `test_max_reducer_field_load_bound_var_dispatch`, `test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation`, `test_above_cap_out_of_grammar_kernel_raises`) gain `arch=[qd.cuda, qd.amdgpu, qd.vulkan, qd.metal]` plus a docstring note pointing at the host-evaluator equivalent on CPU. With the recognizer skipped on CPU the dispatch counter stays at 0 there, and the cap-trigger test finds the lifted `UINT32_MAX` cap, so its `1 << 24 + 1` shape resolves without raising.

6. Docs
`docs/source/user_guide/autodiff.md` Appendix C: layer the parallel-vs-sequential evaluation paths with explicit GPU/CPU qualifiers; drop the "and the read-tracking memory" rationale from the sequential-walk cap paragraph (the cap is now purely a walk-time guard since observation memory is independently bounded).

Per-backend coverage matrix
| Backend | `MaxOverRange` cap |
| --- | --- |
| CPU | `UINT32_MAX` |
| CUDA | `1 << 24` |
| AMDGPU | `1 << 24` |
| Vulkan | `1 << 24` |
| Metal | `1 << 24` |

Tests
The five GPU-only test markers pin (a) the recognizer being skipped on CPU (no dispatch counter advance) and (b) the cap-trigger test resolving on CPU where the cap is lifted.
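As a sanity check on the memory figures behind the lifted cap (~85 B per observation, one-leaf body, `N = UINT32_MAX`), the arithmetic is quick to reproduce:

```python
# Observation-memory comparison at the lifted cap, using the PR's
# approximate per-observation size.
OBS_BYTES = 85        # ~sizeof(SizeExprReadObservation)
N = 2**32 - 1         # UINT32_MAX iterations

old_recording = OBS_BYTES * N   # one observation per leaf per iteration
new_recording = OBS_BYTES * 1   # structural pre-walk: one per static leaf

print(f"{old_recording / 2**30:.0f} GiB")  # ~340 GiB: OOM territory
print(f"{new_recording} B")                # 85 B, independent of N
```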
`test_adstack_metadata_cache_invalidates_on_host_mutation` and `test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation` continue to pin gen-counter-driven invalidation; they pass under the simplified replay because their writes go through `Ndarray.write`/`SNodeRwAccessorsBank`, which bumps the gen counter.
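For intuition, the simplified replay these tests exercise can be modeled as a boolean freshness check; the field names and toy cell store below are illustrative, not the actual C++ structures:

```python
# Toy model of gen-counter cache invalidation: replay is a boolean
# freshness check, with shapes falling back to value comparison.
def replay_observation_is_fresh(obs, cells):
    cell = cells[obs["cell"]]
    if obs["kind"] == "ExternalTensorShapeObs":
        # Shapes have no gen counter: compare the observed value.
        return cell["value"] == obs["observed_value"]
    # FieldLoadObs / ExternalReadObs: fresh iff the gen counter matches.
    return cell["gen"] == obs["observed_gen"]

cells = {"x": {"gen": 7, "value": 3}}
obs = {"kind": "FieldLoadObs", "cell": "x", "observed_gen": 7}
print(replay_observation_is_fresh(obs, cells))  # True: gen still matches
cells["x"]["gen"] += 1  # an Ndarray.write-style mutation bumps the counter
print(replay_observation_is_fresh(obs, cells))  # False: entry invalidated
```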