Changes from all commits

32 commits
c56d388
[AutoDiff] Stage 1.1: recognize MaxOverRange specs reducible by a par…
duburcqa May 2, 2026
5b7c464
[AutoDiff] Stage 1.2: SPIR-V max-reducer shader (option D); body byte…
duburcqa May 5, 2026
cf1fdd2
[AutoDiff] Stage 1.3: LLVM runtime function for the option-D max redu…
duburcqa May 5, 2026
872e2a6
[AutoDiff] Stage 1.4a: AdStackCache max-reducer cache methods + body …
duburcqa May 5, 2026
73f1f36
[AutoDiff] Stage 1.6: substitute_precomputed_max_over_range helper (r…
duburcqa May 5, 2026
961349f
[AutoDiff] Stage 1.4b: GfxRuntime::dispatch_max_reducers + adstack_ma…
duburcqa May 6, 2026
639e1cd
[AutoDiff] Stage 1.4+1.6: launch_kernel wires dispatch_max_reducers a…
duburcqa May 6, 2026
0df62e1
[AutoDiff] Hard-require PSB+Int64 at the adstack reverse-mode entry; …
duburcqa May 6, 2026
a9c9b95
[AutoDiff] Stage 1.5 + comment cleanup: LLVM dispatch_max_reducers_fo…
duburcqa May 6, 2026
c03016f
[AutoDiff] Adstack max-reducer: dispatch fixes, Metal u32 atomic, cap…
duburcqa May 6, 2026
f73c157
[AutoDiff] Adstack: short-circuit MaxOverRange walk on cap-hit (avoid…
duburcqa May 6, 2026
3e6a03e
[AutoDiff] Adstack: drop LLVM device sizer overflow-flag write to avo…
duburcqa May 6, 2026
d0b908f
[AutoDiff] Adstack: scope cap-hit tripwire test to backends with expl…
duburcqa May 6, 2026
9748cc9
[AutoDiff] Adstack: drop arch restriction on cap-hit tripwire test
duburcqa May 6, 2026
98dd82d
[Docs] Document the per-task sizer iteration cap and its parallel-eva…
duburcqa May 6, 2026
9a8bc2d
[AutoDiff] Adstack max-reducer: capture nested MaxOverRange chains ac…
duburcqa May 6, 2026
47fc8d2
[AutoDiff] Adstack max-reducer: round-based dispatch substitutes capt…
duburcqa May 7, 2026
df42498
[AutoDiff] Adstack max-reducer: support bound-var-indexed FieldLoad i…
duburcqa May 7, 2026
f6c146b
[AutoDiff] LLVM adstack lazy-claim: split into stage-grouped subdir (…
duburcqa May 7, 2026
91aa148
[Runtime] Split adstack runtime helpers into a separate translation u…
duburcqa May 7, 2026
3dc7253
[Docs] Reformat 'What can go wrong' as FAQ-style subsections; tighten…
duburcqa May 7, 2026
f34db99
[CI] Search $LLVM_DIR/bin for llvm-link so the runtime bitcode link s…
duburcqa May 7, 2026
85ceb31
[CI] chmod 0755 LLVM toolchain binaries after extract so the bitcode …
duburcqa May 7, 2026
d92fee3
[Docs] Reflow three comment blocks in adstack max-reducer files to wr…
duburcqa May 7, 2026
279baf6
[Runtime] Revert separate-TU build to single-TU include-cpp; llvm-lin…
duburcqa May 7, 2026
1d695c8
[AutoDiff] Skip LLVM max-reducer dispatch on pre-Ampere CUDA where th…
May 7, 2026
e830d60
Fix CUDA Graph grad for adstack.
duburcqa May 7, 2026
d9397bb
[AutoDiff] Pin max-reducer dispatch to nullptr stream on CUDA to matc…
duburcqa May 7, 2026
157ddef
[AutoDiff] LLVM max-reducer: split CPU serial vs CUDA/AMDGPU parallel…
duburcqa May 7, 2026
fbe4c6e
[Docs] Reword 'Inner reverse-mode loop with a complex bound' to use c…
duburcqa May 7, 2026
f19244c
[Perf] Adstack max-reducer: gate per-launch dispatch on captured spec…
duburcqa May 8, 2026
4460b66
[Perf] Adstack max-reducer: skip recognizer on CPU; lift host-eval ca…
duburcqa May 8, 2026
1 change: 1 addition & 0 deletions cmake/QuadrantsCore.cmake
@@ -66,6 +66,7 @@ file(GLOB QUADRANTS_CORE_SOURCE
        "quadrants/jit/*"
        "quadrants/math/*"
        "quadrants/program/*"
        "quadrants/program/adstack/*"
        "quadrants/struct/*"
        "quadrants/system/*"
        "quadrants/transforms/*"
48 changes: 43 additions & 5 deletions docs/source/user_guide/autodiff.md
@@ -349,11 +349,43 @@ A large `ndrange` combined with several loop-carried variables multiplies quickly

## What can go wrong

### Adstack overflow

Surfaces as `QuadrantsAssertionError: Adstack overflow ...` at the next Quadrants Python entry. The message names the offending kernel + offload task and the most likely cause.

The two cases the runtime distinguishes:

- *Untracked tensor mutation between launches.* A tensor backing a data-dependent loop bound was written to outside Quadrants' tracking - typically a DLPack zero-copy mutation through a torch tensor sharing storage with a Quadrants ndarray, or a raw pointer write through a non-torch consumer. The cached adstack capacity was sized against the value before the mutation; if the mutation grew the bound, the next launch overflows. Workaround: route the write through a Quadrants API (`Ndarray.write` / `Ndarray.fill` / a kernel that writes the value). Alternatively, catch the exception and re-launch (see the sketch after this list) - Quadrants invalidates the cached bound on raise, so the retry runs against the live state. Kernel state may be inconsistent after an overflow; do not retry the same step without restarting from a clean state.
- *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that is mathematically tighter than the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).
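A minimal sketch of the catch-and-relaunch path from the first bullet. It assumes the exception type is reachable as `qd.QuadrantsAssertionError`; `k_forward` and `reset_state` are illustrative names, not Quadrants APIs:

```python
import quadrants as qd

try:
    k_forward(n, x)  # kernel with a data-dependent loop bound backed by n
except qd.QuadrantsAssertionError as e:
    if "Adstack overflow" not in str(e):
        raise
    # Quadrants invalidated the cached bound on raise, so a retry re-sizes
    # against the live state. Kernel state may be inconsistent after the
    # overflow, so restore a clean pre-step state before re-running.
    reset_state()        # hypothetical helper: rewind tensors to pre-step values
    k_forward(n, x)
```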

### Out-of-memory before the kernel even runs

A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Surfaces as an allocator OOM at launch time. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.

### Loop bounds backed by a mutated ndarray

A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call will either trigger an `Adstack overflow` exception or silently produce a wrong gradient.

The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns, as sketched below. The reason is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, forward and backward alike. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
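A sketch of that ordering. `Ndarray.write` / `Ndarray.fill` are the tracked-write APIs named above; the kernel body, the `trip_counts` data, and the `k.grad(...)` entry point are illustrative assumptions:

```python
import quadrants as qd

@qd.kernel
def k(n, x):
    for j in range(n.shape[0]):
        for i in range(n[j]):  # n[j] must be identical at forward and backward
            x[j] += 0.5 * x[j]

n.write(trip_counts)  # 1. populate the loop-bound ndarray first
k(n, x)               # 2. forward call: the sizer reads n[j] at this dispatch
k.grad(n, x)          # 3. backward call: the sizer reads n[j] again - same values required
n.fill(0)             # 4. only now is mutating n safe again
```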

### Inner reverse-mode loop with a complex bound at very large extent

Consider a reverse-mode kernel with two nested loops where the enclosed loop's iteration count depends on the outer loop variable through an arithmetic expression on an ndarray index:

```python
for i in range(arr.shape[0]):     # outer loop
    for j in range(arr[i // 2]): # enclosed loop: for <var> in range(<bound expression>)
        ...
```

The enclosed loop's iteration count `arr[i // 2]` is what we call the enclosed loop's *bound expression*. Reverse-mode autodiff needs an upper bound on how many times the enclosed loop body executes across the whole kernel. To obtain it, the compiler analyses the bound expression at launch time, taking one of two evaluation paths based on its structure:

- **Parallel:** integer ndarray reads up to 32 bits wide, single- or multi-axis, indexed by literal constants or outer loop variables, are evaluated in parallel. Field reads of the same width under the same indexing rules also qualify: `my_field[None]`, `my_field[k]` for a constant `k`, or `my_field[i]` where `i` is an outer loop variable. So do the shape term `arr.shape[k]`, literal integer constants, and any `+`, `-`, `*`, `max` of those. The outer loop can run any number of iterations.
- **Sequential:** 64-bit integer ndarray or field reads, arithmetic-indexed reads (`arr[i // 2]`, `arr[i % 4]`), or any nested read whose index is itself an ndarray or field read result (e.g. `arr1[arr2[i]]`, `my_field[arr[i]]`) fall back to sequential evaluation. Nested loops are supported, but the classification propagates outward across loop nesting: if any enclosed loop's bound is sequential, the enclosing bound is sequential too. The outer loop is capped at 2^24 = 16 777 216 iterations; past that the kernel raises `RuntimeError: ... iteration count ... exceeds the 16777216 guard`. This cap is artificial; it keeps the single-threaded GPU evaluation time tractable.

In the example above, the enclosed loop's iteration count takes the sequential path because of the `i // 2` index, so the kernel raises at launch once the outer extent exceeds the cap, e.g. for `arr.shape[0] = (1 << 24) + 1`.
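For reference, a few more bound expressions and the path each takes under the classification above (a sketch of kernel-body fragments; array and field names are illustrative):

```python
# Parallel path: 32-bit reads indexed by constants or outer loop variables,
# shape terms, literals, and any +, -, *, max of those.
for j in range(arr[i]): ...                 # outer-loop-variable index
for j in range(arr[3] + arr.shape[0]): ...  # constant index plus shape term
for j in range(max(a[i], b[i]) * 2): ...    # max/multiply of parallel terms

# Sequential path: arithmetic or nested indices, or 64-bit reads.
for j in range(arr[i % 4]): ...             # arithmetic-indexed read
for j in range(arr1[arr2[i]]): ...          # nested read
for j in range(arr_i64[i]): ...             # 64-bit integer ndarray read
```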

Workaround: rewrite the bound expression so it takes the parallel path (e.g. precompute `bounds[i] = arr[i // 2]` into a persistent separate buffer, pass `bounds` in as an input, and use `for j in range(bounds[i]):`, as sketched below), or keep the outer loop count below 2^24.
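A sketch of the precompute workaround under the same illustrative names; `bounds` is assumed to be a separate persistent 32-bit integer ndarray filled outside the differentiable region:

```python
@qd.kernel
def fill_bounds(arr, bounds):
    for i in range(arr.shape[0]):
        bounds[i] = arr[i // 2]  # hoist the arithmetic-indexed read out of the bound

@qd.kernel
def k(arr, bounds, out):
    for i in range(arr.shape[0]):
        for j in range(bounds[i]):  # plain outer-loop-variable index: parallel path
            out[i] += arr[i]

fill_bounds(arr, bounds)  # run once before the differentiable launch
k(arr, bounds, out)       # the launch-time sizer now evaluates bounds[i] in parallel
```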

## Performance characteristics

@@ -394,6 +426,12 @@ def k_data_dependent(a):
    for i in range(a.shape[0]):
        while a[i] < 10:  # bound that can only be known by running the loop body
            a[i] = a[i] + 1

@qd.kernel
def k_inner_struct_for(a, field):
    for i in range(a.shape[0]):
        for j in field:  # struct-for as the enclosed loop with reverse-mode pushes
            ...
```

## Appendix B: gate-index shapes that capture vs fall back to the worst-case heap
13 changes: 13 additions & 0 deletions quadrants/codegen/llvm/codegen_llvm.cpp
@@ -17,6 +17,7 @@
#include "quadrants/codegen/llvm/struct_llvm.h"
#include "quadrants/util/file_sequence_writer.h"
#include "quadrants/codegen/codegen_utils.h"
#include "quadrants/program/adstack_size_expr_eval.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/AsmParser/Parser.h"
#include "quadrants/codegen/ir_dump.h"
@@ -1993,6 +1994,18 @@ void TaskCodeGenLLVM::finalize_offloaded_task_function() {
  current_task->ad_stack.allocas = ad_stack_allocas_info_;
  current_task->ad_stack.size_exprs = ad_stack_size_exprs_;
  current_task->ad_stack.bound_expr = ad_stack_static_bound_expr_;
  // Recognize `MaxOverRange` nodes the runtime can reduce in parallel via the dedicated max-reducer dispatch instead
  // of letting the per-thread sizer enumerate. Indexing matches `ad_stack_size_exprs_` (same iteration order as the
  // pre-scan above). Skip on CPU: `runtime_eval_adstack_max_reduce_serial` walks single-threaded just like the host
  // evaluator's `MaxOverRange` loop in `program/adstack/eval.cpp`, so the dispatch's per-launch setup overhead
  // (params blob encode, body bytecode encode, observation bookkeeping, JIT call) is pure cost without compute
  // parallelism to offset it - measured ~28% wallclock regression on the rigid-step CPU bench. The host evaluator
  // handles every iteration count up to its own cap (raised to UINT32_MAX on CPU in `eval.cpp`) so above-cap shapes
  // still resolve correctly. On CUDA / AMDGPU the parallel reducer is the whole point of the dispatch and the
  // recognizer stays active.
  if (!arch_is_cpu(compile_config.arch)) {
    current_task->ad_stack.max_reducer_specs = recognize_adstack_max_reducer_specs(ad_stack_size_exprs_);
  }
  // Snodes the task body mutates. Persisted on `OffloadedTask::snode_writes` so the LLVM
  // launcher can invalidate the per-task adstack metadata cache when a kernel that runs in
  // between mutates a SNode that an enclosing `size_expr::FieldLoad` reads. Mirrors the SPIR-V
8 changes: 7 additions & 1 deletion quadrants/codegen/llvm/llvm_compiled_data.h
@@ -81,6 +81,11 @@ struct AdStackSizingInfo {
  // ids are assigned per `Program` lifetime, not per-kernel-content; a deserialised task re-registers
  // itself at the next launch.
  uint32_t registry_id{0};
  // Per-task list of `MaxOverRange` nodes the runtime reduces in parallel via a dedicated max-reducer dispatch (see the
  // max-reducer recognizer). Empty when no captured `size_expr` contains a recognized shape. Each entry references one
  // alloca's `size_expr` by `(stack_id, mor_node_idx)`; the runtime substitutes the dispatched value as a `Const` into
  // the tree before the per-thread sizer walks it.
  std::vector<StaticAdStackMaxReducerSpec> max_reducer_specs;
  QD_IO_DEF(per_thread_stride,
            per_thread_stride_float,
            per_thread_stride_int,
@@ -92,7 +97,8 @@
            end_offset_bytes,
            allocas,
            size_exprs,
            bound_expr);
            bound_expr,
            max_reducer_specs);
};

class OffloadedTask {
1 change: 1 addition & 0 deletions quadrants/codegen/spirv/CMakeLists.txt
@@ -4,6 +4,7 @@ add_library(spirv_codegen)
target_sources(spirv_codegen
  PRIVATE
    adstack_bound_reducer_shader.cpp
    adstack_max_reducer_shader.cpp
    adstack_sizer_shader.cpp
    kernel_utils.cpp
    snode_struct_compiler.cpp