[Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap #655
Conversation
Force-pushed af6d08e to 886abbe
Suggested change under `### Evaluation paths`:

Old: The compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure:

New: On GPU backends the compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure; on CPU the sequential path is always taken since the runtime's CPU max-reducer is single-threaded and the parallel dispatch's per-launch setup would be pure overhead:
There was a problem hiding this comment.
I think I will reformulate a bit. Mentioning parallel vs. sequential before introducing them is not good.
There was a problem hiding this comment.
Done! Sorry I amended by mistake :/
Edit: Fixed.
Suggested change:

Old: - **Parallel:** the maximum is computed with a tiny parallel reduction kernel for efficiency. The reducer accepts a common subset of bound expressions:

New: - **Parallel:** the maximum is computed with a tiny parallel reduction kernel on the GPU for efficiency. The reducer accepts a common subset of bound expressions:
There was a problem hiding this comment.
I would argue that "on a GPU" is redundant. But if we do want to include it, it should be the indefinite article, not the definite article, I feel.
[x] doc looks ok

Will check once agent jobs run, and also ponder whether I feel we need to run Genesis unit tests and/or benchmarks at that point.
Force-pushed 59efed0 to a54acb1
This PR only modifies adstack files.
…ap; simplify cache invalidation to gen counter
…`_parallel` to `runtime_eval_adstack_max_reduce`
Force-pushed a54acb1 to 93b0dbc
[x] line diff report looks ok
Some non-adstack files modified => let's get Genesis unit tests and benchmarks, please.

Adstack: skip max-reducer recognizer on CPU + lift host-eval cap; simplify cache invalidation to gen counter
TL;DR

- Skip `recognize_adstack_max_reducer_specs` on CPU. The CPU runtime's `runtime_eval_adstack_max_reduce_serial` is a single-thread loop, so the dispatch (params blob encode, body bytecode encode, observation snapshot, JIT call) is per-launch setup overhead with no parallelism to amortize. The host evaluator does the same serial walk without any of the setup. Measured +25% on the CPU rigid-step auto-diff bench (Apple M5).
- Lift the `MaxOverRange` cap from `1 << 24` to `UINT32_MAX` on CPU (matching the runtime CPU max-reducer's natural range). Walk-time observation memory drops from `O(N x body leaves)` to `O(body leaves)` via a structural pre-walk that registers one observation per static `FieldLoad`/`ExternalRead`/`ExternalShape` leaf in the body subtree; the per-iteration evaluation then runs without pushing observations. At the lifted cap (`N` near `UINT32_MAX`) with a one-leaf body, the observation buffer drops from ~340 GB to ~100 B, six orders of magnitude. The pre-walk is independent of which iterations execute, so a nested `MaxOverRange` whose body is conditionally visited (e.g., empty inner range on some outer iterations) still gets its leaves registered.
- Simplify cache invalidation to a gen counter: replay freshness is a gen-counter check for `FieldLoadObs`/`ExternalReadObs` and a value mismatch for `ExternalShapeObs`. Empirically the slow path never fires on real workloads (Genesis substep bench: 58M+ cache lookups, 0 gen-counter advances).

Why
Three coupled motivations:
1. **Recognizer + dispatch overhead is pure cost on CPU.** The CPU max-reducer is `runtime_eval_adstack_max_reduce_serial`, a single-thread serial loop that does the same work as the host evaluator's `MaxOverRange` walk in `program/adstack/eval.cpp`. Dispatching it pays per-launch setup cost (params blob encode, body bytecode encode, observation bookkeeping, JIT call) without compute parallelism to offset the cost.
2. **The `1 << 24` host-eval cap blocks legitimate workloads on CPU.** With the recognizer skipped, the host evaluator now handles every shape that previously dispatched, including shapes the runtime CPU max-reducer would walk unbounded. Matching that range needs the cap lifted to `UINT32_MAX`. The naive lift would OOM the host: each `SizeExprReadObservation` is ~85 B and the original recording grew the vector by one entry per leaf per iteration, so a single `MaxOverRange` whose end approaches `UINT32_MAX` would allocate hundreds of GB before the walk finishes (concretely: with one body leaf at `N = UINT32_MAX`, ~340 GB of observations, a guaranteed OOM-kill on any host). Pre-walking the body structurally and recording one observation per static leaf bounds the observation vector at `O(body leaves)`, independent of `N`, so the lift is memory-safe at any iteration count.
3. **The cache's value-comparison fallback was dead code.** `replay_one_observation` returned `obs.observed_value` verbatim on gen-counter match (fast path) and re-read the cell on gen-counter mismatch (slow path). The slow path existed to "save" the cache when a buffer was written but the read cell happened to be unchanged. Empirically it never fires: a steady-state Genesis CPU substep run with 58M+ cache lookups records 0 invalidations, meaning every cache entry's gen counter still matched and the slow path was never reached. Removing it drops ~50 lines of replay logic and the per-cell `observed_value` recording.

Mechanism end-to-end
1. Codegen LLVM gate
In `quadrants/codegen/llvm/codegen_llvm.cpp::TaskCodeGenLLVM::finalize_offloaded_task_function`, `OffloadedTask::ad_stack.max_reducer_specs` stays empty on CPU. `dispatch_max_reducers` in the LLVM launcher gates on the existing `any_max_reducer_task` predicate, so the entire dispatch loop is skipped automatically.

2. Host-eval cap conditional
`quadrants/program/adstack/eval.cpp::evaluate_node`'s `MaxOverRange` arm:
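A minimal sketch of the cap selection with the PR's constants; the Python helper and arch strings are illustrative stand-ins for the C++ arm, not the actual code:

```python
# Illustrative model of the per-backend MaxOverRange cap. The helper
# name and arch strings are hypothetical; the constants are the PR's.
UINT32_MAX = 2**32 - 1

def max_over_range_cap(arch: str) -> int:
    if arch == "cpu":
        # Host evaluator matches the serial CPU max-reducer's
        # natural uint32 range.
        return UINT32_MAX
    # GPU backends: keep the on-device sizer inside the driver's
    # TDR (timeout detection and recovery) window.
    return 1 << 24

print(max_over_range_cap("cpu"))   # 4294967295
print(max_over_range_cap("cuda"))  # 16777216
```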
UINT32_MAX. GPU stays at1 << 24to keep the on-device sizer within the driver's TDR window.3. Structural pre-walk for cache observations
Before the iteration loop runs, walk the body subtree once to register every static leaf with the cache. New helper
`enumerate_static_observations(expr, body_node_idx, prog, ctx, reads)` recurses through the IR structurally (visits every node regardless of any enclosing `MaxOverRange`'s `begin`/`end` semantics) and pushes one observation per `FieldLoad`/`ExternalTensorRead`/`ExternalTensorShape` leaf. The per-iteration evaluation then runs with `reads = nullptr`. The structural walk is independent of which iterations actually execute. A nested `MaxOverRange` whose body is conditionally visited (e.g., empty inner range on some outer iterations) still gets its leaves registered, so a subsequent launch where that range becomes non-empty correctly invalidates on a buffer mutation.

4. Cache replay simplification
`quadrants/program/adstack/cache.cpp::replay_observation_is_fresh` (renamed from `replay_one_observation`): returns `bool` instead of `int64_t`. `try_size_expr_cache_hit`, `try_max_reducer_cache_hit`, and `try_spirv_bytecode_cache_hit` switch from `now != obs.observed_value` to `!replay_observation_is_fresh(...)`. The deref slow path is gone. `populate_max_reducer_body_observations` drops the `INT64_MIN` sentinel write to `observed_value` (it was the stand-in that made the now-removed deref path produce a self-equal cache hit on gen match). `evaluate_field_load` and `evaluate_external_tensor_read` drop their `obs.observed_value = v` writes since the field is unused for these kinds. `ExternalTensorShape` keeps the write because shapes have no gen counter and fall back to value comparison.

5. Test arch markers
Five existing tests in `tests/python/test_adstack.py` (`test_max_reducer_pins_stride_for_oversized_axis`, `test_max_reducer_dispatch_counts_advance_on_input_mutation`, `test_max_reducer_field_load_bound_var_dispatch`, `test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation`, `test_above_cap_out_of_grammar_kernel_raises`) gain `arch=[qd.cuda, qd.amdgpu, qd.vulkan, qd.metal]` plus a docstring note pointing at the host-evaluator equivalent on CPU. With the recognizer skipped on CPU the dispatch counter stays at 0 there, and the cap-trigger test finds the lifted `UINT32_MAX` cap, so its `1 << 24 + 1` shape resolves without raising.

6. Docs
`docs/source/user_guide/autodiff.md` Appendix C: layer the parallel-vs-sequential evaluation paths with explicit GPU/CPU qualifiers; drop the "and the read-tracking memory" rationale from the sequential-walk cap paragraph (the cap is now purely a walk-time guard since observation memory is independently bounded).

Per-backend coverage matrix
| Backend | `MaxOverRange` cap |
| --- | --- |
| CPU | `UINT32_MAX` |
| CUDA | `1 << 24` |
| AMDGPU | `1 << 24` |
| Vulkan | `1 << 24` |
| Metal | `1 << 24` |

Tests
The five GPU-only test markers pin (a) the recognizer being skipped on CPU (no dispatch counter advance) and (b) the cap-trigger test resolving on CPU where the cap is lifted.
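As a sanity check on the memory figures behind the lifted cap (~85 B per observation, one-leaf body, `N = UINT32_MAX`), the arithmetic is quick to reproduce:

```python
# Observation-memory comparison at the lifted cap, using the PR's
# approximate per-observation size.
OBS_BYTES = 85        # ~sizeof(SizeExprReadObservation)
N = 2**32 - 1         # UINT32_MAX iterations

old_recording = OBS_BYTES * N   # one observation per leaf per iteration
new_recording = OBS_BYTES * 1   # structural pre-walk: one per static leaf

print(f"{old_recording / 2**30:.0f} GiB")  # ~340 GiB: OOM territory
print(f"{new_recording} B")                # 85 B, independent of N
```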
`test_adstack_metadata_cache_invalidates_on_host_mutation` and `test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation` continue to pin gen-counter-driven invalidation; they pass under the simplified replay because their writes go through `Ndarray.write`/`SNodeRwAccessorsBank`, which bumps the gen counter.
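For intuition, the simplified replay these tests exercise can be modeled as a boolean freshness check; the field names and toy cell store below are illustrative, not the actual C++ structures:

```python
# Toy model of gen-counter cache invalidation: replay is a boolean
# freshness check, with shapes falling back to value comparison.
def replay_observation_is_fresh(obs, cells):
    cell = cells[obs["cell"]]
    if obs["kind"] == "ExternalTensorShapeObs":
        # Shapes have no gen counter: compare the observed value.
        return cell["value"] == obs["observed_value"]
    # FieldLoadObs / ExternalReadObs: fresh iff the gen counter matches.
    return cell["gen"] == obs["observed_gen"]

cells = {"x": {"gen": 7, "value": 3}}
obs = {"kind": "FieldLoadObs", "cell": "x", "observed_gen": 7}
print(replay_observation_is_fresh(obs, cells))  # True: gen still matches
cells["x"]["gen"] += 1  # an Ndarray.write-style mutation bumps the counter
print(replay_observation_is_fresh(obs, cells))  # False: entry invalidated
```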