[Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap#655

Open
duburcqa wants to merge 2 commits into main from
duburcqa/adstack_cache_simplify

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented May 8, 2026

Adstack: skip max-reducer recognizer on CPU + lift host-eval cap; simplify cache invalidation to gen counter

Single commit, three coupled changes that together let the host-side adstack sizer handle MaxOverRange iteration counts up to UINT32_MAX on CPU without OOM, while dropping dead value-comparison logic from the cache replay path that the gen-counter check already subsumed.

TL;DR

  • Codegen LLVM: skip recognize_adstack_max_reducer_specs on CPU. The CPU runtime's runtime_eval_adstack_max_reduce_serial is a single-thread loop, so the dispatch (params blob encode, body bytecode encode, observation snapshot, JIT call) is per-launch setup overhead with no parallelism to amortize. The host evaluator does the same serial walk without any of the setup. Measured +25% on the CPU rigid-step auto-diff bench (Apple M5).
  • Host eval: lift the MaxOverRange cap from 1 << 24 to UINT32_MAX on CPU (matches the runtime CPU max-reducer's natural range). Walk-time observation memory drops from O(N x body leaves) to O(body leaves) via a structural pre-walk that registers one observation per static FieldLoad / ExternalRead / ExternalShape leaf in the body subtree; the per-iteration evaluation then runs without pushing observations. At the lifted cap (N near UINT32_MAX) with a one-leaf body the observation buffer drops from ~340 GB to ~100 B - more than nine orders of magnitude. The pre-walk is independent of which iterations execute, so a nested MaxOverRange whose body is conditionally visited (e.g., empty inner range on some outer iterations) still gets its leaves registered.
  • Cache replay: drop the per-cell deref slow path. Each observation now invalidates on a single criterion: gen-counter advance for FieldLoadObs / ExternalReadObs, value mismatch for ExternalShapeObs. Empirically the slow path never fires on real workloads (Genesis substep bench: 58M+ cache lookups, 0 gen-counter advances).
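The single-criterion invalidation in the last bullet can be sketched as follows. This is a hypothetical, heavily simplified model (the types and table are invented for illustration; the real code lives in quadrants/program/adstack/cache.cpp): every write to a buffer bumps a generation counter, and a cached observation is fresh iff the counter it recorded still matches.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical gen-counter table: snode_id -> write generation.
struct GenTable {
  std::unordered_map<int, uint64_t> gen;
  void on_write(int snode_id) { ++gen[snode_id]; }
  uint64_t current(int snode_id) { return gen[snode_id]; }
};

// Simplified FieldLoad-style observation: records the generation at read time.
struct FieldLoadObs {
  int snode_id;
  uint64_t observed_gen;
};

// Fresh iff no writes hit the observed snode since the observation was taken.
inline bool is_fresh(const GenTable &t, const FieldLoadObs &obs) {
  auto it = t.gen.find(obs.snode_id);
  const uint64_t now = (it == t.gen.end()) ? 0 : it->second;
  return now == obs.observed_gen;
}
```

With no value re-read in the picture, a gen mismatch simply means a cache miss; there is no slow path that derefs the cell to rescue the entry.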

Why

Three coupled motivations:

  1. Recognizer + dispatch overhead is pure cost on CPU. The CPU max-reducer is runtime_eval_adstack_max_reduce_serial, a single-thread serial loop that does the same work as the host evaluator's MaxOverRange walk in program/adstack/eval.cpp. Dispatching it pays per-launch setup cost (params blob encode, body bytecode encode, observation bookkeeping, JIT call) without compute parallelism to offset the cost.

  2. The 1 << 24 host-eval cap blocks legitimate workloads on CPU. With the recognizer skipped, the host evaluator now handles every shape that previously dispatched, including shapes the runtime CPU max-reducer would walk unbounded. Matching that range needs the cap lifted to UINT32_MAX. The naive lift would OOM the host: each SizeExprReadObservation is ~85 B and the original recording grew the vector by one entry per leaf per iteration, so a single MaxOverRange whose end approaches UINT32_MAX would allocate hundreds of GB before the walk finishes (concretely: with one body leaf at N = UINT32_MAX, ~340 GB of observations - guaranteed OOM-kill on any host). Pre-walking the body structurally and recording one observation per static leaf bounds the observation vector at O(body leaves), independent of N, so the lift is memory-safe at any iteration count.

  3. The cache's value-comparison fallback was dead code. replay_one_observation returned obs.observed_value verbatim on gen-counter match (fast path) and re-read the cell on gen-counter mismatch (slow path). The slow path existed to "save" the cache when a buffer was written but the read cell happened to be unchanged. Empirically it never fires: a steady-state Genesis CPU substep run with 58M+ cache lookups records 0 invalidations, meaning every cache entry's gen counter still matched and the slow path was never reached. Removing it drops ~50 lines of replay logic and the per-cell observed_value recording.
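To make the OOM arithmetic in point 2 concrete, here is a back-of-the-envelope sketch. The ~85 B per-observation figure is taken from the text above; the function names are invented for illustration, and everything else is plain arithmetic.

```cpp
#include <cstdint>

// ~sizeof(SizeExprReadObservation), per the PR description.
constexpr uint64_t kObsBytes = 85;
// The lifted CPU cap on MaxOverRange iteration counts.
constexpr uint64_t kIters = UINT32_MAX;

// Original recording: one observation per body leaf per iteration, O(N x leaves).
constexpr uint64_t per_iteration_bytes(uint64_t leaves) {
  return kObsBytes * leaves * kIters;
}

// Structural pre-walk: one observation per leaf, independent of N, O(leaves).
constexpr uint64_t pre_walk_bytes(uint64_t leaves) {
  return kObsBytes * leaves;
}
```

With one body leaf, per_iteration_bytes(1) is roughly 365 billion bytes (~340 GiB) while pre_walk_bytes(1) is 85 bytes, which is why the cap lift is only safe after the pre-walk change.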

Mechanism end-to-end

1. Codegen LLVM gate

quadrants/codegen/llvm/codegen_llvm.cpp::TaskCodeGenLLVM::finalize_offloaded_task_function:

if (!arch_is_cpu(compile_config.arch)) {
  current_task->ad_stack.max_reducer_specs = recognize_adstack_max_reducer_specs(ad_stack_size_exprs_);
}

OffloadedTask::ad_stack.max_reducer_specs stays empty on CPU. dispatch_max_reducers in the LLVM launcher gates on the existing any_max_reducer_task predicate, so the entire dispatch loop is skipped automatically.
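The launcher-side gating described above can be sketched like this. The struct shapes are hypothetical simplifications modeled on the names in the text (the real predicate is any_max_reducer_task in the LLVM launcher):

```cpp
#include <vector>

// Hypothetical, simplified shapes of the structures named above.
struct MaxReducerSpec { /* body bytecode, params layout, ... */ };
struct OffloadedTask {
  std::vector<MaxReducerSpec> max_reducer_specs;  // stays empty on CPU
};

// Sketch of the any_max_reducer_task predicate: with the recognizer skipped
// on CPU every task's spec list is empty, so the dispatch loop never runs.
inline bool any_max_reducer_task(const std::vector<OffloadedTask> &tasks) {
  for (const auto &t : tasks) {
    if (!t.max_reducer_specs.empty()) return true;
  }
  return false;
}
```

No launcher change is needed: leaving the spec list empty is enough to short-circuit the whole dispatch path.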

2. Host-eval cap conditional

quadrants/program/adstack/eval.cpp::evaluate_node's MaxOverRange arm:

const bool prog_is_cpu = (prog != nullptr) && arch_is_cpu(prog->compile_config().arch);
const int64_t kMaxOverRangeIterations = prog_is_cpu ? int64_t{UINT32_MAX} : (int64_t{1} << 24);

CPU lifts to UINT32_MAX. GPU stays at 1 << 24 to keep the on-device sizer within the driver's TDR window.

3. Structural pre-walk for cache observations

Before the iteration loop runs, walk the body subtree once to register every static leaf with the cache. New helper enumerate_static_observations(expr, body_node_idx, prog, ctx, reads) recurses through the IR structurally (visits every node regardless of any enclosing MaxOverRange's begin/end semantics) and pushes one observation per FieldLoad / ExternalTensorRead / ExternalTensorShape leaf. The per-iteration evaluation then runs with reads = nullptr:

enumerate_static_observations(expr, node.body_node_idx, prog, ctx, reads);
for (int64_t i = begin; i < end; ++i) {
  bound_vars[node.var_id] = i;
  evaluate_node(expr, node.body_node_idx, bound_vars, prog, ctx, /*reads=*/nullptr);
  ...
}

The structural walk is independent of which iterations actually execute. Nested MaxOverRange whose body is conditionally visited (e.g., empty inner range on some outer iterations) still gets its leaves registered, so a subsequent launch where that range becomes non-empty correctly invalidates on a buffer mutation.
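A minimal sketch of that structural walk, with a hypothetical flattened node model (the real helper is enumerate_static_observations in quadrants/program/adstack/eval.cpp and also distinguishes leaf kinds):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical IR node: leaves stand in for FieldLoad-style reads; inner
// nodes (including nested MaxOverRange bodies) just hold child indices.
struct Node {
  bool is_static_leaf = false;
  std::vector<size_t> children;  // indices into the expression's node array
};

// Structural walk: visits every node regardless of any enclosing range's
// begin/end, so conditionally-executed subtrees still register their leaves.
inline void enumerate_static_leaves(const std::vector<Node> &nodes,
                                    size_t idx,
                                    std::vector<size_t> &out) {
  const Node &n = nodes[idx];
  if (n.is_static_leaf) out.push_back(idx);  // one observation per leaf
  for (size_t c : n.children) enumerate_static_leaves(nodes, c, out);
}
```

Because the recursion ignores iteration semantics entirely, the observation count is a function of the body's shape alone, never of N.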

4. Cache replay simplification

quadrants/program/adstack/cache.cpp::replay_observation_is_fresh (renamed from replay_one_observation):

case Obs::FieldLoadObs:
  return prog != nullptr && prog->adstack_cache().snode_write_gen(obs.snode_id) == obs.observed_gen;
case Obs::ExternalReadObs:
  return data_ptr == obs.observed_devalloc &&
         prog->adstack_cache().ndarray_data_gen(data_ptr) == obs.observed_gen;
case Obs::ExternalShapeObs:
  return static_cast<int64_t>(ctx->get_struct_arg_host<int32_t>(arg_indices)) == obs.observed_value;

Returns bool instead of int64_t. try_size_expr_cache_hit, try_max_reducer_cache_hit, and try_spirv_bytecode_cache_hit switch from now != obs.observed_value to !replay_observation_is_fresh(...). The deref slow path is gone.

populate_max_reducer_body_observations drops the INT64_MIN sentinel write to observed_value (was the stand-in that made the now-removed deref path produce a self-equal cache hit on gen match).

evaluate_field_load and evaluate_external_tensor_read drop their obs.observed_value = v writes since the field is unused for these kinds. ExternalTensorShape keeps the write because shapes have no gen counter and fall back to value comparison.

5. Test arch markers

Five existing tests in tests/python/test_adstack.py (test_max_reducer_pins_stride_for_oversized_axis, test_max_reducer_dispatch_counts_advance_on_input_mutation, test_max_reducer_field_load_bound_var_dispatch, test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation, test_above_cap_out_of_grammar_kernel_raises) gain arch=[qd.cuda, qd.amdgpu, qd.vulkan, qd.metal] plus a docstring note pointing at the host-evaluator equivalent on CPU. With the recognizer skipped on CPU the dispatch counter stays at 0 there, and the cap-trigger test finds the lifted UINT32_MAX cap so its 1 << 24 + 1 shape resolves without raising.

6. Docs

docs/source/user_guide/autodiff.md Appendix C: layer the parallel-vs-sequential evaluation paths with explicit GPU / CPU qualifiers; drop the "and the read-tracking memory" rationale from the sequential walk cap paragraph (the cap is now purely a walk-time guard since observation memory is independently bounded).

Per-backend coverage matrix

Backend Recognizer Host-eval cap Cache replay
CPU skipped UINT32_MAX gen counter only
CUDA active 1 << 24 gen counter only
AMDGPU active 1 << 24 gen counter only
Vulkan active 1 << 24 gen counter only
Metal active 1 << 24 gen counter only

Tests

The five GPU-only test markers pin (a) the recognizer being skipped on CPU (no dispatch counter advance) and (b) the cap-trigger test resolving on CPU where the cap is lifted. test_adstack_metadata_cache_invalidates_on_host_mutation and test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation continue to pin gen-counter-driven invalidation; they pass under the simplified replay because their writes go through Ndarray.write / SNodeRwAccessorsBank which bump the gen counter.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af6d08ec2b


Comment thread quadrants/program/adstack/eval.cpp Outdated
@duburcqa duburcqa changed the title [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap; simplify cache invalidation to gen counter [Perf] Adstack: skip max-reducer recognizer on CPU + lift host-eval cap May 8, 2026
@duburcqa duburcqa force-pushed the duburcqa/adstack_cache_simplify branch from af6d08e to 886abbe Compare May 8, 2026 16:09
Comment thread docs/source/user_guide/autodiff.md Outdated
### Evaluation paths

The compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure:
On GPU backends the compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure; on CPU the sequential path is always taken since the runtime's CPU max-reducer is single-threaded and the parallel dispatch's per-launch setup would be pure overhead:
Collaborator

replace ; with .

Contributor Author

I think I will reformulate a bit. Mentioning parallel vs. sequential before introducing them is not good.

Contributor Author

@duburcqa duburcqa May 8, 2026

Done! Sorry I amended by mistake :/

Edit: Fixed.

Comment thread docs/source/user_guide/autodiff.md Outdated
On GPU backends the compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure; on CPU the sequential path is always taken since the runtime's CPU max-reducer is single-threaded and the parallel dispatch's per-launch setup would be pure overhead:

- **Parallel:** the maximum is computed with a tiny parallel reduction kernel for efficiency. The reducer accepts a common subset of bound expressions:
- **Parallel:** the maximum is computed with a tiny parallel reduction kernel on the GPU for efficiency. The reducer accepts a common subset of bound expressions:
Collaborator

I would argue that "on the GPU" is redundant. But if we do want to include it, it should be the indefinite article, not the definite article, I feel.

Contributor Author

removed;

@hughperkins
Collaborator

[x] doc looks ok

Will check once agent jobs run:
[ ] factorization

And also ponder whether I feel we need to run Genesis unit tests and/or benchmarks at that point.

@duburcqa duburcqa force-pushed the duburcqa/adstack_cache_simplify branch 2 times, most recently from 59efed0 to a54acb1 Compare May 8, 2026 16:37
@duburcqa
Contributor Author

duburcqa commented May 8, 2026

And also ponder whether I feel we need to run Genesis unit tests and/or benchmarks at that point.

This PR only modifies adstack files.

@github-actions

github-actions Bot commented May 8, 2026

@github-actions

github-actions Bot commented May 8, 2026

Base automatically changed from duburcqa/adstack_max_reducer_shader to main May 8, 2026 18:03
duburcqa added 2 commits May 8, 2026 20:31
…ap; simplify cache invalidation to gen counter
…`_parallel` to `runtime_eval_adstack_max_reduce`
@duburcqa duburcqa force-pushed the duburcqa/adstack_cache_simplify branch from a54acb1 to 93b0dbc Compare May 8, 2026 18:35
@github-actions

github-actions Bot commented May 8, 2026

@hughperkins
Collaborator

[x] line diff report looks ok

@hughperkins
Collaborator

some non-adstack files modified

=> lets get Genesis unit tests and benchmarks please

@duburcqa
Contributor Author

duburcqa commented May 8, 2026

some non-adstack files modified

A single line in quadrants/codegen/llvm/codegen_llvm.cpp, which obviously cannot cause any regression:

(screenshot: the one-line diff in quadrants/codegen/llvm/codegen_llvm.cpp)
