Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f95788e1a4
…allel max-reducer dispatch (option D)
…code interpreter for kConst/kBoundVariable/kExternalTensorRead/kAdd/kSub/kMul/kMax
…cer (single-thread serial walk over body bytecode)
…bytecode encoder (encode_max_reducer_body_bytecode)
…eplaces captured MaxOverRange nodes with Const after reducer dispatch)
…x_reducer_launch.cpp (cmdlist + buffer binding for the option-D max reducer)
…nd substitution into the SPIR-V sizer eval paths
…drop redundant per-helper cap gates
…r_tasks; strip plan-specific noise from comments
…-hit tripwires, regression tests
… GPU TDR on out-of-grammar shapes)
…id clobbering task_id under contention
…icit device-sizer signals
@claude review
Force-pushed from 23a7daf to eaf4ba9
Force-pushed from eaf4ba9 to 2ef85c1
Force-pushed from 2ef85c1 to 75800cc
…ross distinct bound variables (multi-axis)
Force-pushed from 75800cc to e80f9dd
…ured deps before host-eval
…bound_eval / metadata_publish / heap_grow)
- *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that is smaller than the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (this bypasses the sizer).
- *Out-of-memory before the kernel even runs.* A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. This surfaces as an allocator OOM at launch time. The remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
- *Loop bounds backed by a mutated ndarray.* A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call will either trigger an `Adstack overflow` exception or silently compute a wrong gradient (see the sketch after this list). The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. The reason is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, which includes both the forward and the backward call. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
- *Inner reverse-mode loop with a complex bound at very large extent.* An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that. Workaround: rewrite the trip count to stay within the supported subset, or shrink the enclosing loop below the threshold.
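A hedged sketch of the mutated-loop-bound pitfall and the `ad_stack_size` workaround. The `@qd.kernel` decorator and `qd.types.ndarray` annotations are illustrative guesses at the API; only `qd.init(ad_stack_size=...)`, `qd.ndarray`, and `.grad()` come from the text above.

```python
import quadrants as qd  # assumed import alias

qd.init(arch=qd.gpu, ad_stack_size=4096)   # workaround for the sizer bug:
                                           # a generous manual bound
                                           # bypasses the sizer entirely

n = qd.ndarray(dtype=qd.i32, shape=8)      # loop-bound ndarray
x = qd.ndarray(dtype=qd.f32, shape=8, needs_grad=True)

@qd.kernel
def f(n: qd.types.ndarray(), x: qd.types.ndarray()):
    for j in range(8):
        for i in range(n[j]):              # trip count read from n[j]
            x[j] += 0.5 * x[j]

f(n, x)                                    # forward: n[j] read here
# n[0] += 1                                # writing n here desynchronizes the
                                           # forward/backward trip counts:
                                           # adstack overflow, or a silently
                                           # wrong gradient at f.grad(...)
x.grad.fill(1.0)
f.grad(n, x)                               # backward: n[j] read again
```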
can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?
Note: I feel this probably deserves its own section, rather than bullet points, since this is very dense, and contains multiple very dense child bullet points.
can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?
`for j in range(arr[i // 2])` with `arr[0] > (1 << 24)`. Nothing more.
Note: I feel this probably deserves its own section, rather than bullet points, since this is very dense, and contains multiple very dense child bullet points.
This bullet should be fairly simple. Maybe I should remove details that are just confusing?
I feel these bullets are all pretty long, tbh. I'm not sure if they gradually 'boiled-frog' grew over time?
My hunch is that it might be better to reformat this more in the style of an FAQ, with a subsection heading for each current bullet point.
Understood. I can do this.
I'm not in favour of refactoring this file in this PR. It is a central part of Quadrants' kernel launch orchestration. Could be nice to refactor it though.
It is a central part of Quadrants' kernel launch orchestration, yes. The ask is not to refactor the file en masse, but to find a way to move autodiff-specific things outside of it (at least, the new autodiff-related things you are adding in this PR). Please.
Ok, I will refactor this file.
…nit linked via llvm-link
… complex-bound example
…tep finds it on the CI LLVM toolchain
…link step runs on macOS / clang-tidy paths
…ap at the 120-col limit instead of 74-80
…k collapses RootMeta/DenseMeta named struct types and breaks runtime type lookup
Force-pushed from 5b3dbd2 to 95cdf06
…e helper kernel reads ndarray data unreliably; expose attribute 100 (uses_host_page_tables) and route launcher staging through it
Force-pushed from 95cdf06 to f3deef7
…reducer_shader
# Conflicts:
#	quadrants/python/export_lang.cpp
#	quadrants/rhi/cuda/cuda_context.h
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
#	quadrants/runtime/cuda/kernel_launcher.cpp
### Inner reverse-mode loop with a complex bound at very large extent
An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that.
Can you give an example of what this means? When I read this, things my brain stumbles on:
- "enclosing range"
- I assume this means some kind of for loop over range, but I ahve to think about it
- 'inner trip count'
- inner, I suppose means an inner loop
- count, counts something, but not iterations, but ... trips
- not sure waht a 'trip' is
- 'shapes' again is not a term I'm familiar with in this context
- 'enclosing iterations'
- does this mean the iterations ofr the 'enclosing range'
- the iteraitons of the 'inner' loop?
- something else?
I think it would be nice to have an example, that illustrates what this is talking about clealry.
Two categories of bound expression:
You haven't used 'bound expression' in this subsection yet. Not sure what it refers to. Again, it would be nice if the example showed this, I feel.
(I kind of feel maybe this deserves its own section, outside of 'what can go wrong', potentially. Or... if you've explained all the above concepts before, then maybe refer me back to a concise definition of each concept earlier in the readme, perhaps?)
- *Works at any enclosing-range size:* integer ndarray reads up to 32 bits wide (single- or multi-axis, indexed by literal constants or enclosing loop variables), field reads of the same width indexed by literal constants or enclosing loop variables (`my_field[None]`, `my_field[k]` for a constant `k`, `my_field[i]` where `i` is an enclosing loop variable), `arr.shape[k]` shape terms, literal integer constants, and `+`, `-`, `*`, `max` of those.
- *Caps at the threshold:* 64-bit integer ndarray or field reads, arithmetic-indexed reads (`arr[i // 2]`, `arr[i % 4]`), and ragged inner ranges whose own bound depends on an enclosing loop variable through an unsupported leaf shape.
Which threshold? Again, there's no mention of a threshold in this subsection.
Also, what is a 'leaf shape'? (Again, feel free to link back to where it's concisely explained, potentially.)
A concrete example that hits the cap is `for j in range(arr[i // 2]):` with `arr[0] = (1 << 24) + 1`.
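Written out as a kernel (a hedged sketch; the decorator and annotation spellings are illustrative, not the exact Quadrants API):

```python
# Cap-hitting shape: arithmetic-indexed inner bound (arr[i // 2]) is outside
# the supported subset, so the 1<<24 guard applies to this loop nest.
@qd.kernel
def g(arr: qd.types.ndarray(), x: qd.types.ndarray()):
    for i in range(arr.shape[0]):          # enclosing range
        for j in range(arr[i // 2]):       # arithmetic index: unsupported leaf
            x[i] += 0.5 * x[i]

# With arr[0] = (1 << 24) + 1 this raises:
# RuntimeError: ... iteration count ... exceeds the 16777216 guard
```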
I would recommend putting this sooner rather than later, and using it to illustrate what you are saying. Refer to it concretely. Label it.

Adstack max-reducer: parallel `MaxOverRange` dispatch with `1<<24` cap-hit tripwires

TL;DR
A reverse-mode kernel like
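```python
# (reconstructed sketch -- the original snippet was elided from this view;
# the decorator and annotation spellings are illustrative)
@qd.kernel
def k(a: qd.types.ndarray(), x: qd.types.ndarray()):
    for var in range(a.shape[0]):
        for i in range(a[var]):        # data-dependent inner trip count
            x[var] += 0.5 * x[var]
```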
lowers to a per-stack `SizeExpr` containing `MaxOverRange(0, a.shape[0], a[var])`. Before this PR the adstack sizer enumerated that range linearly on every launch, with a hard `1<<24` cap above which the host evaluator raised `RuntimeError`, the LLVM device sizer silently truncated, and the SPIR-V on-device sizer silently clamped. Above-cap axes therefore either failed loud-but-confusing on CPU or produced wrong heap strides and corrupted gradients on GPU.

After this PR a `recognize_adstack_max_reducer_specs` pre-pass captures shapes that fit a deliberately narrow grammar (chains of nested `MaxOverRange`s across distinct bound variables; integer ndarray and field reads up to 32 bits wide indexed by literal constants or any captured chain bound variable; integer arithmetic combinators), the launcher dispatches a generic parallel-max compute kernel per captured spec at launch time, and `substitute_precomputed_max_over_range` rewrites the captured `MaxOverRange` to a `Const` carrying the dispatched value before any sizer eval path walks the tree. Out-of-grammar shapes whose iteration count exceeds the cap now raise via three explicit tripwires (host evaluator `QD_ERROR_IF`; SPIR-V on-device sizer metadata-trailing overflow-flag slot; LLVM device sizer cap-hit short-circuit + indirect `stack_push` overflow) instead of silently undersizing the heap.

Why
`compute_bounded_adstack_size` in `quadrants/transforms/determine_ad_stack_size.cpp` emits `MaxOverRange(begin, end, body)` nodes whose iteration count is bounded only by the underlying ndarray axis. Three eval paths consume the resulting trees per launch:

- Host evaluator (`adstack/eval.cpp::evaluate_node`): hard `QD_ERROR_IF` at `end - begin > 1<<24`, on by default through `evaluate_adstack_size_expr` on the CPU host fast path.
- LLVM device sizer (`runtime_eval_adstack_size_expr` in `quadrants/runtime/llvm/runtime_module/runtime.cpp`): `break` at the same threshold (silent truncation on CUDA / AMDGPU LLVM-GPU).
- SPIR-V on-device sizer (`adstack_sizer_shader.cpp`): silent clamp `effective_end = min(end, begin + (1<<24))` on Metal / Vulkan.

When the gating ndarray axis exceeds `1<<24` cells, every device path returned an under-bound on per-thread stack depth. The heap then either overflowed at `qd.sync()` with an opaque message naming the wrong kernel, or silently corrupted gradients with no error at all. The host path's hard error was the loud version, opt-in via `QD_DEBUG_ADSTACK=1`, and used as a tripwire today; it does not cover the GPU paths.

The fix preserves the cap as an internal safety latch (the per-thread sizer's serial walk is still bounded) but moves the actual evaluation of recognized shapes onto a parallel-dispatch path that scales past the cap, and turns cap-hits on the remaining out-of-grammar shapes into hard errors instead of silent truncation.
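Schematically, the three pre-PR behaviours look like this (a Python stand-in for the C++/SPIR-V code paths named above; illustrative only):

```python
CAP = 1 << 24

def host_eval(begin, end, body):            # adstack/eval.cpp: loud error
    if end - begin > CAP:
        raise RuntimeError("iteration count exceeds the 16777216 guard")
    return max((body(i) for i in range(begin, end)), default=0)

def llvm_device_eval(begin, end, body):     # runtime.cpp: silent break
    best = 0
    for i in range(begin, end):
        if i - begin >= CAP:
            break                           # silent truncation
        best = max(best, body(i))
    return best

def spirv_device_eval(begin, end, body):    # sizer shader: silent clamp
    end = min(end, begin + CAP)
    return max((body(i) for i in range(begin, end)), default=0)
```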
Surface API
None. The change is purely internal to the adstack-sizer pipeline. Users who never tripped the cap see no behaviour change; users whose recognized kernels did trip the cap stop seeing wrong gradients; users whose out-of-grammar kernels would have tripped the cap now see a `RuntimeError` / `QuadrantsAssertionError` at the next `qd.sync()` instead of silent truncation.

Mechanism end-to-end
1. Pre-pass shape recognition
`quadrants/program/adstack/max_reducer.{h,cpp}::recognize_adstack_max_reducer_specs(size_exprs)` walks each per-stack `SerializedSizeExpr` post-order and returns a `std::vector<StaticAdStackMaxReducerSpec>` describing every `MaxOverRange` node whose:

- `begin` and `end` subtrees are closed-form (`Const` / `ExternalTensorShape` / `Add` / `Sub` / `Mul` / `Max`, plus any `MaxOverRange` already captured deeper in the same tree), and
- `body` subtree references only `Const`, `ExternalTensorRead(arg, [...])` (single- or multi-axis, indexed by literal constants or any captured chain bound variable, leaf dtype restricted to 32-bit-or-narrower integer), `FieldLoad(snode, [...])` (same index restriction; the literal-only path host-folds to `Const` at encode time, the bound-var path emits a `kFieldLoad` device node), `ExternalTensorShape`, and `Add` / `Sub` / `Mul` / `Max` of those.

Multi-axis support: the recognizer descends through nested `MaxOverRange`s as long as each inner `[begin, end)` is closed-form (`Const` / `ExternalTensorShape` / captured-deeper MORs); each layer adds one axis to the captured spec, and the dispatch enumerates the cross-product of every axis. Specs come back in dependency order (deepest first); each dispatch's result becomes the substituted `Const` an outer spec's `begin` / `end` may reference. Captured ids are stored in `task_attribs.ad_stack.max_reducer_specs` (SPIR-V) and `current_task->ad_stack.max_reducer_specs` (LLVM); both backends populate the field at codegen time (`spirv_codegen.cpp`, `codegen_llvm.cpp`).

The integer-leaf dtype restriction (`i8`/`i16`/`i32`/`u8`/`u16`/`u32` only) gates the cache-revalidation sentinel: `populate_max_reducer_body_observations` records `INT64_MIN` as the observed value, and the replay path's gen-mismatch dereference must return a value strictly greater than the sentinel to force invalidation. A 64-bit leaf could legally hold `INT64_MIN` and false-hit on a mutated entry, so those leaves fall through to the per-task sizer's capped path.

`StaticAdStackMaxReducerSpec` lives in `quadrants/transforms/static_adstack_analysis.h` with a `QD_IO_DEF` so the spec round-trips through the offline cache. The struct carries `axis_var_ids` / `axis_begin_node_idxs` / `axis_end_node_idxs` (one entry per captured axis, outermost-first) plus `dependent_mor_node_idxs` listing the captured deeper-MOR keys the spec's `begin` / `end` references.
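A hedged sketch of the two grammar checks, as a Python stand-in (all names here are hypothetical; the real pass is C++ operating on `SerializedSizeExpr` nodes):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)
    var_id: int = -1
    leaf_bits: int = 32
    indices: list = field(default_factory=list)

CLOSED_FORM = {"Const", "ExternalTensorShape", "Add", "Sub", "Mul", "Max"}

def closed_form(node, captured):
    # A MaxOverRange counts as closed-form once captured deeper in the tree.
    if node.kind == "MaxOverRange":
        return id(node) in captured
    return node.kind in CLOSED_FORM and all(
        closed_form(c, captured) for c in node.children)

def body_ok(node, chain_vars):
    if node.kind == "BoundVariable":
        return node.var_id in chain_vars          # chain bound vars only
    if node.kind in ("ExternalTensorRead", "FieldLoad"):
        # dtype restricted to <= 32-bit integers; indices must be literal
        # constants or captured chain bound variables
        return node.leaf_bits <= 32 and all(
            i.kind == "Const" or i.var_id in chain_vars for i in node.indices)
    if node.kind in ("Const", "ExternalTensorShape"):
        return True
    return node.kind in ("Add", "Sub", "Mul", "Max") and all(
        body_ok(c, chain_vars) for c in node.children)
```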
2. Generic max-reducer kernels - one per backend family

- SPIR-V (`quadrants/codegen/spirv/adstack_max_reducer_shader.{h,cpp}`): `kAdStackMaxReducerWorkgroupSize = 128`, strided `kElementsPerThread = 64` per-thread iteration to keep `num_workgroups_x` under `maxComputeWorkGroupCount[0] = 65535` for spec lengths up to ~536M. Body bytecode interpreter (`kConst` / `kBoundVariable` / `kExternalTensorRead` / `kFieldLoad` / `kAdd` / `kSub` / `kMul` / `kMax`). Per-spec output is two u32 slots: `[2*k] = OpAtomicUMax` running max, `[2*k+1] = OpAtomicOr` overflow flag. The u32+overflow split sidesteps spirv-cross's MSL backend gap on i64 atomics (MSL currently does not support 64-bit atomics), unlocking Metal and Vulkan-via-MoltenVK.
- LLVM (`quadrants/runtime/llvm/runtime_module/runtime.cpp::runtime_eval_adstack_max_reduce`): `params.per_axis_length[]` iterations, atomic-max into `runtime->adstack_max_reducer_outputs[output_slot]`. Dispatched as a host call on CPU and as a `1x1x1` JIT-launched kernel on CUDA / AMDGPU. POD device params live in `quadrants/ir/static_adstack_max_reducer_device.h`.

The body bytecode reuses the existing `AdStackSizeExprDeviceNode` POD format from `quadrants/ir/adstack_size_expr_device.h`. `encode_max_reducer_body_bytecode` in `quadrants/program/adstack/max_reducer.cpp` extracts the body subtree, renumbers nodes to dense `[0, body_node_count)` indices, copies referenced index entries, and resolves `kExternalTensorRead` `arg_buffer_offset` via a closure passed by the per-backend launcher. Bound-var-indexed `kFieldLoad` leaves take a backend-specific base resolution: SPIR-V passes a `FieldLoadDeviceEmitter` whose `fetch` returns `root_psb + place_byte_offset_in_root` (pre-baked PSB address), LLVM passes a null emitter and the encoder stores `(snode_root_id, place_byte_offset)` in the device-node POD's `arg_buffer_offset` / `const_value` slots which the LLVM device interpreter resolves at runtime via `runtime->roots[snode_root_id] + place_byte_offset`.
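A schematic host-side model of the per-spec parallel max reduction (a Python stand-in for the SPIR-V shader / LLVM runtime function; names mirror the constants above, but the control flow is illustrative only):

```python
WORKGROUP_SIZE = 128          # kAdStackMaxReducerWorkgroupSize
ELEMS_PER_THREAD = 64         # kElementsPerThread

def reduce_spec(total_len, eval_body_at, outputs, slot):
    per_group = WORKGROUP_SIZE * ELEMS_PER_THREAD
    num_workgroups = min((total_len + per_group - 1) // per_group, 65535)
    threads = num_workgroups * WORKGROUP_SIZE
    for tid in range(threads):                       # parallel on the device
        local_max = 0
        for idx in range(tid, total_len, threads):   # strided iteration
            local_max = max(local_max, eval_body_at(idx))
        # device: OpAtomicUMax into [2*slot]; [2*slot+1] is the overflow flag
        outputs[2 * slot] = max(outputs[2 * slot], local_max)
```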
3. Launch sequencing

- gfx (`quadrants/runtime/gfx/adstack_max_reducer_launch.cpp`): `GfxRuntime::dispatch_max_reducers(...)`
- LLVM (`quadrants/runtime/llvm/llvm_adstack_lazy_claim.cpp`): `LlvmRuntimeExecutor::dispatch_max_reducers_for_tasks(...)` (overload taking `std::vector<OffloadedTask>`; per-arch launchers in `runtime/cpu/`, `runtime/cuda/`, `runtime/amdgpu/` call into it as a one-liner)

Both helpers share a level-based round dispatch:

- Cache lookup: each captured spec is keyed by `(registry_id, stack_id, mor_node_idx)` packed into a single `uint64_t` via `pack_max_reducer_key` in `adstack/max_reducer.cpp`. Hits drop straight into the result map; misses go to the pending list with back-references to the source `SerializedSizeExpr` and `StaticAdStackMaxReducerSpec`.
- Round dispatch: each round selects the pending specs whose `dependent_mor_node_idxs` are all already in the result map (cache hits + earlier rounds), substitutes those values into the working tree via `substitute_precomputed_max_over_range`, host-evaluates `begin` / `end` against the substituted tree, encodes the body bytecode, and dispatches the round as one cmdlist (gfx) / one batched runtime-function call sequence (LLVM). Most kernels finish in one round; nested patterns (e.g. an outer `MaxOverRange` whose end contains a captured inner max-of-array) take one round per dependency depth. A no-progress round drops every remaining pending spec and falls back to the per-task sizer's cap-hit path.
- Recording: results land in `AdStackCache::record_max_reducer_eval` so the next launch can short-circuit. The recorded read observations come from `populate_max_reducer_body_observations` which snapshots `observed_devalloc` + `observed_gen` (ndarray) and `snode_write_gen` (field) so a host-side mutation of either source invalidates the cache cleanly.
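A hedged sketch of that round loop (names like `run_parallel_max` are hypothetical; one "round" is one cmdlist / one batched runtime-function call sequence in the real C++ helpers):

```python
def dispatch_rounds(pending, results):
    while pending:
        ready = [s for s in pending
                 if all(d in results for d in s.dependent_mor_keys)]
        if not ready:
            # no-progress round: drop the rest onto the per-task sizer's
            # capped fallback path
            return pending
        for spec in ready:
            # substitute earlier-round results, host-eval begin/end, encode
            # the body bytecode, then dispatch the whole round at once
            results[spec.key] = run_parallel_max(spec, results)
        pending = [s for s in pending if s not in ready]
    return []
```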
publish_adstack_metadata_spirv(gfx) /publish_adstack_metadata(LLVM) so the substitutedConsts are in place before the sizer eval pipeline runs.On Apple Silicon Metal the body interpreter loads ndarray data buffers and SNode tree root buffers via PSB (raw
bufferDeviceAddress), bypassing the descriptor-bound resource tracking, so the gfx launcher callstrack_physical_buffer(...)once per cmdlist for everyndarray_allocand everyroot_buffer_(theuseResource:hint Metal needs to mark those buffers resident for the dispatch).4. Substitution into per-stack trees
`quadrants/program/adstack/max_reducer.cpp::substitute_precomputed_max_over_range(expr, registry_id, stack_id, results)` walks `expr.nodes` and replaces every captured `MaxOverRange` whose key is in `results` with a `Const(dispatched_value)`. Empty-input fast path: when no captured spec matches, it returns `expr` unchanged with no allocation.

Three eval paths consume the substituted tree:

- Host eval (`eval_per_task_metadata_on_host` in `runtime/gfx/adstack_sizer_launch.cpp`; the LLVM host-eval branch in `llvm_adstack_lazy_claim.cpp`): the host evaluator's pointer-keyed `size_expr_cache_` cannot accept a stack-local substituted tree (a transient stack address would alias unrelated cache entries across launches and return wrong cached values), so the substitution-active branch routes through a dedicated `evaluate_adstack_size_expr_no_cache(...)` variant; the empty-results fast path keeps the live `a.size_expr` reference and the cache stays warm for kernels that never trigger the recognizer.
- SPIR-V device bytecode (`encode_adstack_size_expr_device_bytecode_for_spirv`): the encoder walks the substituted tree where each captured `MaxOverRange` is already a `Const`, so the body's `ExternalTensorRead` / `FieldLoad` leaves are not in the encoder's `reads` list; `AdStackCache::lookup_max_reducer_reads(...)` returns the recorded body observations for each captured spec, and the encoder appends them to its `reads` list before recording into `spirv_bytecode_cache_`. A mutation to the gating ndarray / field then invalidates the cached bytecode via the same gen-counter replay path the existing per-task metadata cache uses.
- LLVM device bytecode (`encode_adstack_size_expr_device_bytecode`): same substitution; same downstream `llvm_per_task_ad_stack_cache_` machinery.

5. Cap-hit tripwires (`1<<24`)

The `1<<24` per-task sizer cap is structurally unreachable for max-reducer-recognized shapes (those are dispatched in parallel and substituted to `Const` before the sizer walks). It is reachable only for out-of-grammar shapes whose iteration count exceeds the cap. Three explicit tripwires:

- Host evaluator (`evaluate_node`): `QD_ERROR_IF`; surfaces as `RuntimeError` to Python on the CPU host fast path.
- SPIR-V on-device sizer (`adstack_sizer_shader.cpp`): a metadata-trailing overflow-flag slot at offset `2 + 2*n_stacks`. The shader writes 1 there on `end - begin > cap`, and clamps `effective_end = begin` so the walk stays bounded. The host post-readback in `publish_adstack_metadata_spirv` raises `QD_ERROR_IF` when the slot is non-zero.
- LLVM device sizer (`device_eval_node`): `kMaxOverRange` returns 0 immediately on `end - begin > cap` to keep the single-thread on-device dispatch within the driver's TDR window. The cap-hit then surfaces indirectly through the existing `stack_push` overflow infrastructure on the subsequent main-kernel launch. The diagnostic message attribution depends on the kernel layout.
6. Cache invalidation

The per-spec result cache integrates into the existing `AdStackCache` four-layer cascade:

1. `try_max_reducer_cache_hit` (one entry per captured `(registry_id, stack_id, mor_node_idx)`). Hit -> no max-reducer dispatch; the cached `Const` is substituted into the per-stack tree.
2. `try_size_expr_cache_hit` (per-`SerializedSizeExpr`, after substitution). Hit -> no per-thread sizer eval call.
3. `try_per_task_ad_stack_cache_hit` / `try_llvm_per_task_ad_stack_cache_hit` (per-task metadata blob). Hit -> no per-task sizer dispatch.
4. `try_spirv_bytecode_cache_hit` (per-task bytecode blob). Hit -> no SPIR-V bytecode encode + upload.

In steady state with an unchanged gating source every layer hits and the per-launch overhead of the option-D pipeline collapses to zero. A host-side `Ndarray.write` bumps `ndarray_data_gen_`; a host-side field write bumps `snode_write_gen`. Either bump propagates through every layer's gen-counter replay walk and forces a fresh dispatch.

`FieldLoadObs` records produced by the bound-var FieldLoad encoder path carry `indices = {}`, since the body is evaluated at every cross-product iteration and there is no canonical scalar to re-read; `replay_one_observation`'s `FieldLoadObs` arm treats the gen counter as the sole staleness signal in that mode and unconditionally invalidates on a gen mismatch.
Per-backend coverage matrix

| Backend | Parallel `MaxOverRange` dispatch | Cap-hit tripwire (out-of-grammar `MaxOverRange`) |
| --- | --- | --- |
| CPU | `runtime_eval_adstack_max_reduce` host call ✓ | `evaluate_node` `QD_ERROR_IF` ✓ (raised as `RuntimeError`) |
| CUDA | `1x1x1` kernel ✓ | `device_eval_node` short-circuit + indirect `stack_push` overflow |
| AMDGPU | `1x1x1` kernel ✓ | `device_eval_node` short-circuit + indirect `stack_push` overflow |
| Metal / Vulkan | compute shader dispatch ✓ | on-device overflow-flag slot ✓ (raised as `QuadrantsAssertionError`) |

Tests - `tests/python/test_adstack.py`

Six new regression tests, all parametrized over every available backend.
test_max_reducer_pins_stride_for_oversized_axis

Parametrized over a `(shape, body_kind)` matrix that exercises the recognizer's accepted body grammar (single-axis ETR, ETR + `ExternalTensorShape` host-fold, closed `FieldLoad` host-fold, and the `Add`/`Sub`/`Mul`/`Max` arithmetic combinator). For each shape the dispatch + substitution produces the correct heap stride and the kernel runs to completion; the above-cap variants additionally pin the contract that a recognized spec ranges over an arbitrarily large axis. Uses `qd.ndarray` rather than numpy passthrough so the device buffer is not capped at backend-specific H2D-blit limits.

test_max_reducer_dispatch_counts_advance_on_input_mutation

Pins the dispatch + cache invalidation pipeline via a new
`Program.get_max_reducer_dispatch_count` / `reset_max_reducer_dispatch_count` python binding (counter on `AdStackCache`, bumped at every `record_max_reducer_eval`). The first launch fires at least one dispatch; a host mutation of the gating ndarray bumps `ndarray_data_gen` and the next launch re-dispatches.
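A hedged sketch of how such a test can exercise the counter (the binding names come from the description above; the `prog` accessor path, the kernel `f`, and the ndarrays `n` / `x` are assumed for illustration):

```python
prog = qd.get_runtime().prog                 # assumed accessor to Program
prog.reset_max_reducer_dispatch_count()

f(n, x); x.grad.fill(1.0); f.grad(n, x)      # first launch: dispatches fire
assert prog.get_max_reducer_dispatch_count() >= 1

first = prog.get_max_reducer_dispatch_count()
n[0] = n[0] + 1                              # host write bumps ndarray_data_gen
f(n, x); x.grad.fill(1.0); f.grad(n, x)      # cache invalidated: re-dispatch
assert prog.get_max_reducer_dispatch_count() > first
```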
MaxOverRangeand the recognizer captures nothing. The dispatch counter stays at zero; the kernel still produces the correct gradient. Pins the contract that any kernel outside the captured grammar runs unchanged so future grammar broadening cannot silently drop the fallback path.test_max_reducer_field_load_bound_var_dispatchEight-variant parametrized test pinning the bound-var-indexed
FieldLoadbody grammar. Body shapes coverfield[i]on its own,field[i] + arr[i](mixed FieldLoad + ETR viaAdd),arr[i] + field[i](commuted),max(field[i], arr[i]),max(field[i], const),max(field[i] + 0, field[i] * 1 - 0)(full arithmetic combinator), and the conservative-wrapper pathfield[field[i]]/arr[field[i]](the trip-count builder substitutesMaxOverRange(var, 0, leaf_snode.shape, body=Load(snode, [var]))for any nested-load index that does not reduce to a single bound-var or const). Across all variants the body's max value over the indexed range isN_Xand the gradient assertion is uniform.test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutationPins the cache invalidation contract for the bound-var
FieldLoadbody path: the encoder pushes aFieldLoadObskeyed on the snode's write generation, mutatingfield_a[M-1]from Python bumpssnode_write_gen, and the next launch redispatches.test_above_cap_out_of_grammar_kernel_raisesA kernel whose inner range bound is an
i64ndarray read fails the recognizer's dtype restriction (theINT64_MINcache-revalidation sentinel is unreachable for sub-i64 dtypes; fori64a mutated cell could legally hold the sentinel and false-hit on revalidation). The whole spec is dropped and the per-task sizer walks the outerMaxOverRangeitself. Witha.shape[0] > 1<<24the cap fires on every adstack-sizer eval path:RuntimeErrorfrom the host evaluator on CPU,QuadrantsAssertionErrorfrom the SPIR-V on-device sizer on Metal / Vulkan, and an indirect raise viastack_pushoverflow on CUDA / AMDGPU LLVM-GPU.Side-effect audit
- Offline cache round-trip: `StaticAdStackMaxReducerSpec` round-trips via `QD_IO_DEF`; `max_reducer_specs` added to the `QD_IO_DEF` of the SPIR-V `AdStackSizingAttribs` and the LLVM `AdStackSizingInfo`.
- `size_expr_cache_` pointer aliasing: `evaluate_adstack_size_expr_no_cache` for the substitution-active branch only.
- `spirv_bytecode_cache_` observations: `lookup_max_reducer_reads` accessor + encoder appends body reads to the cache entry's read list.
- `per_task_ad_stack_cache_` deps: `collect_size_expr_dep_keys` walks the original tree (pre-substitution) so body reads' arg-ids are still tracked.
- Bytecode kind mapping: `encode_max_reducer_body_bytecode` maps `SizeExpr::Kind` -> `AdStackSizeExprDeviceKind` per kind explicitly.
- Dependency ordering: a spec dispatches only once its `dependent_mor_node_idxs` are all resolved; earlier-round results are substituted into the working tree before host-evaluating `begin` / `end` and encoding the body.
- Metal residency: `track_physical_buffer` called once per cmdlist on every ndarray data buffer and every `root_buffers_` SNode tree root buffer (covers both `kExternalTensorRead` and `kFieldLoad` body leaves).
- Capability gating: `QD_ERROR_IF(!spirv_has_physical_storage_buffer ...)` + `QD_ERROR_IF(!spirv_has_int64 ...)` at the entry of `publish_adstack_metadata_spirv`; drops redundant per-helper cap gates.
- Workgroup sizing: `kElementsPerThread` shader strided iteration keeps `num_workgroups_x = ceil(length / (kAdStackMaxReducerWorkgroupSize * kElementsPerThread))`, capped at 65535 in the launcher.
- Steady state: no re-dispatch while neither `ndarray_data_gen_` nor `snode_write_gen` advances.