Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f95788e1a4
…allel max-reducer dispatch (option D)
…code interpreter for kConst/kBoundVariable/kExternalTensorRead/kAdd/kSub/kMul/kMax
…cer (single-thread serial walk over body bytecode)
…bytecode encoder (encode_max_reducer_body_bytecode)
…eplaces captured MaxOverRange nodes with Const after reducer dispatch)
…x_reducer_launch.cpp (cmdlist + buffer binding for the option-D max reducer)
…nd substitution into the SPIR-V sizer eval paths
…drop redundant per-helper cap gates
…r_tasks; strip plan-specific noise from comments
…-hit tripwires, regression tests
… GPU TDR on out-of-grammar shapes)
…id clobbering task_id under contention
…icit device-sizer signals
@claude review
Force-pushed from 23a7daf to eaf4ba9
Force-pushed from eaf4ba9 to 2ef85c1
Force-pushed from 2ef85c1 to 75800cc
…ross distinct bound variables (multi-axis)
Force-pushed from 75800cc to e80f9dd
…ured deps before host-eval
…bound_eval / metadata_publish / heap_grow)
- *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that is smaller than the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (this bypasses the sizer).
- *Out-of-memory before the kernel even runs.* A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. This surfaces as an allocator OOM at launch time. The remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
- *Loop bounds backed by a mutated ndarray.* A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call will either trigger an `Adstack overflow` exception or silently compute a wrong gradient (see the sketch after this list). The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. The reason is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, which includes both the forward and the backward call. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
- *Inner reverse-mode loop with a complex bound at very large extent.* An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that. Workaround: rewrite the trip count to stay within the supported subset, or shrink the enclosing loop below the threshold.
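A hedged sketch of the mutated-loop-bound pitfall and the `ad_stack_size` workaround. The `@qd.kernel` decorator and `qd.types.ndarray` annotations are illustrative guesses at the API; only `qd.init(ad_stack_size=...)`, `qd.ndarray`, and `.grad()` come from the text above.

```python
import quadrants as qd  # assumed import alias

qd.init(arch=qd.gpu, ad_stack_size=4096)   # workaround for the sizer bug:
                                           # a generous manual bound
                                           # bypasses the sizer entirely

n = qd.ndarray(dtype=qd.i32, shape=8)      # loop-bound ndarray
x = qd.ndarray(dtype=qd.f32, shape=8, needs_grad=True)

@qd.kernel
def f(n: qd.types.ndarray(), x: qd.types.ndarray()):
    for j in range(8):
        for i in range(n[j]):              # trip count read from n[j]
            x[j] += 0.5 * x[j]

f(n, x)                                    # forward: n[j] read here
# n[0] += 1                                # writing n here desynchronizes the
                                           # forward/backward trip counts:
                                           # adstack overflow, or a silently
                                           # wrong gradient at f.grad(...)
x.grad.fill(1.0)
f.grad(n, x)                               # backward: n[j] read again
```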
can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?
Note: I feel this probably deserves its own section, rather than bullet points, since this is very dense, and contains multiple very dense child bullet points.
can you give an example of a 'reverse-mode loop with a complex bound at very large extent'?
`for j in range(arr[i // 2])` with `arr[0] > (1 << 24)`. Nothing more.
Note: I feel this probably deserves its own section, rather than bullet points, since this is very dense, and contains multiple very dense child bullet points.
This bullet should be fairly simple. Maybe I should remove details that are just confusing?
I feel these bullets are all pretty long, tbh. I'm not sure if they gradually 'boiled-frog' grew over time?
My hunch is that it might be better to reformat this more in the style of an FAQ, with a subsection heading for each current bullet point.
Understood. I can do this.
I'm not in favour of refactoring this file in this PR. It is a central part of Quadrants' kernel launch orchestration. Could be nice to refactor it though.
It is a central part of Quadrants' kernel launch orchestration, yes. The ask is not to refactor the file en masse, but to find a way to move autodiff-specific things outside of it (at least, the new autodiff-related things you are adding in this PR). Please.
Ok, I will refactor this file.
…nit linked via llvm-link
… complex-bound example
…tep finds it on the CI LLVM toolchain
…link step runs on macOS / clang-tidy paths
…ap at the 120-col limit instead of 74-80
…k collapses RootMeta/DenseMeta named struct types and breaks runtime type lookup
Force-pushed from 5b3dbd2 to 95cdf06
…e helper kernel reads ndarray data unreliably; expose attribute 100 (uses_host_page_tables) and route launcher staging through it
Force-pushed from 95cdf06 to f3deef7
…reducer_shader
# Conflicts:
#	quadrants/python/export_lang.cpp
#	quadrants/rhi/cuda/cuda_context.h
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
#	quadrants/runtime/cuda/kernel_launcher.cpp
### Inner reverse-mode loop with a complex bound at very large extent
An arbitrarily large enclosing range works only when the inner trip count fits a fixed subset of expressions; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` past that.
Can you give an example of what this means? When I read this, things my brain stumbles on:
- "enclosing range"
- I assume this means some kind of for loop over range, but I ahve to think about it
- 'inner trip count'
- inner, I suppose means an inner loop
- count, counts something, but not iterations, but ... trips
- not sure waht a 'trip' is
- 'shapes' again is not a term I'm familiar with in this context
- 'enclosing iterations'
- does this mean the iterations ofr the 'enclosing range'
- the iteraitons of the 'inner' loop?
- something else?
I think it would be nice to have an example, that illustrates what this is talking about clealry.
Two categories of bound expression:
You haven't used 'bound expression' in this subsection yet. Not sure what it refers to. Again, it would be nice if the example showed this, I feel.
(I kind of feel maybe this deserves its own section, outside of 'what can go wrong', potentially. Or... if you've explained all the above concepts before, then maybe refer me back to a concise definition of each concept earlier in the readme, perhaps?)
- *Works at any enclosing-range size:* integer ndarray reads up to 32 bits wide (single- or multi-axis, indexed by literal constants or enclosing loop variables), field reads of the same width indexed by literal constants or enclosing loop variables (`my_field[None]`, `my_field[k]` for a constant `k`, `my_field[i]` where `i` is an enclosing loop variable), `arr.shape[k]` shape terms, literal integer constants, and `+`, `-`, `*`, `max` of those.
- *Caps at the threshold:* 64-bit integer ndarray or field reads, arithmetic-indexed reads (`arr[i // 2]`, `arr[i % 4]`), and ragged inner ranges whose own bound depends on an enclosing loop variable through an unsupported leaf shape.
Which threshold? Again, there's no mention of a threshold in this subsection.
Also, what is a 'leaf shape'? (Again, feel free to link back to where it's concisely explained, potentially.)
A concrete example that hits the cap is `for j in range(arr[i // 2]):` with `arr[0] = (1 << 24) + 1`.
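Written out as a kernel (a hedged sketch; the decorator and annotation spellings are illustrative, not the exact Quadrants API):

```python
# Cap-hitting shape: arithmetic-indexed inner bound (arr[i // 2]) is outside
# the supported subset, so the 1<<24 guard applies to this loop nest.
@qd.kernel
def g(arr: qd.types.ndarray(), x: qd.types.ndarray()):
    for i in range(arr.shape[0]):          # enclosing range
        for j in range(arr[i // 2]):       # arithmetic index: unsupported leaf
            x[i] += 0.5 * x[i]

# With arr[0] = (1 << 24) + 1 this raises:
# RuntimeError: ... iteration count ... exceeds the 16777216 guard
```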
I would recommend putting this sooner rather than later, and using it to illustrate what you are saying. Refer to it concretely. Label it.

Adstack max-reducer: parallel `MaxOverRange` dispatch with `1<<24` cap-hit tripwires

TL;DR
A reverse-mode kernel like
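```python
# (reconstructed sketch -- the original snippet was elided from this view;
# the decorator and annotation spellings are illustrative)
@qd.kernel
def k(a: qd.types.ndarray(), x: qd.types.ndarray()):
    for var in range(a.shape[0]):
        for i in range(a[var]):        # data-dependent inner trip count
            x[var] += 0.5 * x[var]
```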
lowers to a per-stack `SizeExpr` containing `MaxOverRange(0, a.shape[0], a[var])`. Before this PR the adstack sizer enumerated that range linearly on every launch, with a hard `1<<24` cap above which the host evaluator raised `RuntimeError`, the LLVM device sizer silently truncated, and the SPIR-V on-device sizer silently clamped. Above-cap axes therefore either failed loud-but-confusing on CPU or produced wrong heap strides and corrupted gradients on GPU.

After this PR a `recognize_adstack_max_reducer_specs` pre-pass captures shapes that fit a deliberately narrow grammar (chains of nested `MaxOverRange`s across distinct bound variables; integer ndarray and field reads up to 32 bits wide indexed by literal constants or any captured chain bound variable; integer arithmetic combinators), the launcher dispatches a generic parallel-max compute kernel per captured spec at launch time, and `substitute_precomputed_max_over_range` rewrites the captured `MaxOverRange` to a `Const` carrying the dispatched value before any sizer eval path walks the tree. Out-of-grammar shapes whose iteration count exceeds the cap now raise via three explicit tripwires (host evaluator `QD_ERROR_IF`; SPIR-V on-device sizer metadata-trailing overflow-flag slot; LLVM device sizer cap-hit short-circuit + indirect `stack_push` overflow) instead of silently undersizing the heap.

Why
`compute_bounded_adstack_size` in `quadrants/transforms/determine_ad_stack_size.cpp` emits `MaxOverRange(begin, end, body)` nodes whose iteration count is bounded only by the underlying ndarray axis. Three eval paths consume the resulting trees per launch:

- Host evaluator (`adstack/eval.cpp::evaluate_node`): hard `QD_ERROR_IF` at `end - begin > 1<<24`, on by default through `evaluate_adstack_size_expr` on the CPU host fast path.
- LLVM device sizer (`runtime_eval_adstack_size_expr` in `quadrants/runtime/llvm/runtime_module/runtime.cpp`): `break` at the same threshold (silent truncation on CUDA / AMDGPU LLVM-GPU).
- SPIR-V on-device sizer (`adstack_sizer_shader.cpp`): silent clamp `effective_end = min(end, begin + (1<<24))` on Metal / Vulkan.

When the gating ndarray axis exceeds `1<<24` cells, every device path returned an under-bound on per-thread stack depth. The heap then either overflowed at `qd.sync()` with an opaque message naming the wrong kernel, or silently corrupted gradients with no error at all. The host path's hard error was the loud version, opt-in via `QD_DEBUG_ADSTACK=1`, and used as a tripwire today; it does not cover the GPU paths.

The fix preserves the cap as an internal safety latch (the per-thread sizer's serial walk is still bounded) but moves the actual evaluation of recognized shapes onto a parallel-dispatch path that scales past the cap, and turns cap-hits on the remaining out-of-grammar shapes into hard errors instead of silent truncation.
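Schematically, the three pre-PR behaviours look like this (a Python stand-in for the C++/SPIR-V code paths named above; illustrative only):

```python
CAP = 1 << 24

def host_eval(begin, end, body):            # adstack/eval.cpp: loud error
    if end - begin > CAP:
        raise RuntimeError("iteration count exceeds the 16777216 guard")
    return max((body(i) for i in range(begin, end)), default=0)

def llvm_device_eval(begin, end, body):     # runtime.cpp: silent break
    best = 0
    for i in range(begin, end):
        if i - begin >= CAP:
            break                           # silent truncation
        best = max(best, body(i))
    return best

def spirv_device_eval(begin, end, body):    # sizer shader: silent clamp
    end = min(end, begin + CAP)
    return max((body(i) for i in range(begin, end)), default=0)
```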
Surface API
None. The change is purely internal to the adstack-sizer pipeline. Users who never tripped the cap see no behaviour change; users whose recognized kernels did trip the cap stop seeing wrong gradients; users whose out-of-grammar kernels would have tripped the cap now see a `RuntimeError` / `QuadrantsAssertionError` at the next `qd.sync()` instead of silent truncation.

Mechanism end-to-end
1. Pre-pass shape recognition
`quadrants/program/adstack/max_reducer.{h,cpp}::recognize_adstack_max_reducer_specs(size_exprs)` walks each per-stack `SerializedSizeExpr` post-order and returns a `std::vector<StaticAdStackMaxReducerSpec>` describing every `MaxOverRange` node whose:

- `begin` and `end` subtrees are closed-form (`Const` / `ExternalTensorShape` / `Add` / `Sub` / `Mul` / `Max`, plus any `MaxOverRange` already captured deeper in the same tree), and
- `body` subtree references only `Const`, `ExternalTensorRead(arg, [...])` (single- or multi-axis, indexed by literal constants or any captured chain bound variable, leaf dtype restricted to 32-bit-or-narrower integer), `FieldLoad(snode, [...])` (same index restriction; the literal-only path host-folds to `Const` at encode time, the bound-var path emits a `kFieldLoad` device node), `ExternalTensorShape`, and `Add` / `Sub` / `Mul` / `Max` of those.

Multi-axis support: the recognizer descends through nested `MaxOverRange`s as long as each inner `[begin, end)` is closed-form (`Const` / `ExternalTensorShape` / captured-deeper MORs); each layer adds one axis to the captured spec, and the dispatch enumerates the cross-product of every axis. Specs come back in dependency order (deepest first); each dispatch's result becomes the substituted `Const` an outer spec's `begin` / `end` may reference. Captured ids are stored in `task_attribs.ad_stack.max_reducer_specs` (SPIR-V) and `current_task->ad_stack.max_reducer_specs` (LLVM); both backends populate the field at codegen time (`spirv_codegen.cpp`, `codegen_llvm.cpp`).

The integer-leaf dtype restriction (`i8`/`i16`/`i32`/`u8`/`u16`/`u32` only) gates the cache-revalidation sentinel: `populate_max_reducer_body_observations` records `INT64_MIN` as the observed value, and the replay path's gen-mismatch dereference must return a value strictly greater than the sentinel to force invalidation. A 64-bit leaf could legally hold `INT64_MIN` and false-hit on a mutated entry, so those leaves fall through to the per-task sizer's capped path.

`StaticAdStackMaxReducerSpec` lives in `quadrants/transforms/static_adstack_analysis.h` with a `QD_IO_DEF` so the spec round-trips through the offline cache. The struct carries `axis_var_ids` / `axis_begin_node_idxs` / `axis_end_node_idxs` (one entry per captured axis, outermost-first) plus `dependent_mor_node_idxs` listing the captured deeper-MOR keys the spec's `begin` / `end` references.
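A hedged sketch of the two grammar checks, as a Python stand-in (all names here are hypothetical; the real pass is C++ operating on `SerializedSizeExpr` nodes):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)
    var_id: int = -1
    leaf_bits: int = 32
    indices: list = field(default_factory=list)

CLOSED_FORM = {"Const", "ExternalTensorShape", "Add", "Sub", "Mul", "Max"}

def closed_form(node, captured):
    # A MaxOverRange counts as closed-form once captured deeper in the tree.
    if node.kind == "MaxOverRange":
        return id(node) in captured
    return node.kind in CLOSED_FORM and all(
        closed_form(c, captured) for c in node.children)

def body_ok(node, chain_vars):
    if node.kind == "BoundVariable":
        return node.var_id in chain_vars          # chain bound vars only
    if node.kind in ("ExternalTensorRead", "FieldLoad"):
        # dtype restricted to <= 32-bit integers; indices must be literal
        # constants or captured chain bound variables
        return node.leaf_bits <= 32 and all(
            i.kind == "Const" or i.var_id in chain_vars for i in node.indices)
    if node.kind in ("Const", "ExternalTensorShape"):
        return True
    return node.kind in ("Add", "Sub", "Mul", "Max") and all(
        body_ok(c, chain_vars) for c in node.children)
```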
2. Generic max-reducer kernels - one per backend family

- SPIR-V (`quadrants/codegen/spirv/adstack_max_reducer_shader.{h,cpp}`): `kAdStackMaxReducerWorkgroupSize = 128`, strided `kElementsPerThread = 64` per-thread iteration to keep `num_workgroups_x` under `maxComputeWorkGroupCount[0] = 65535` for spec lengths up to ~536M. Body bytecode interpreter (`kConst` / `kBoundVariable` / `kExternalTensorRead` / `kFieldLoad` / `kAdd` / `kSub` / `kMul` / `kMax`). Per-spec output is two u32 slots: `[2*k] = OpAtomicUMax` running max, `[2*k+1] = OpAtomicOr` overflow flag. The u32+overflow split sidesteps spirv-cross's MSL backend gap on i64 atomics (MSL currently does not support 64-bit atomics), unlocking Metal and Vulkan-via-MoltenVK.
- LLVM (`quadrants/runtime/llvm/runtime_module/runtime.cpp::runtime_eval_adstack_max_reduce`): `params.per_axis_length[]` iterations, atomic-max into `runtime->adstack_max_reducer_outputs[output_slot]`. Dispatched as a host call on CPU and as a `1x1x1` JIT-launched kernel on CUDA / AMDGPU. POD device params live in `quadrants/ir/static_adstack_max_reducer_device.h`.

The body bytecode reuses the existing `AdStackSizeExprDeviceNode` POD format from `quadrants/ir/adstack_size_expr_device.h`. `encode_max_reducer_body_bytecode` in `quadrants/program/adstack/max_reducer.cpp` extracts the body subtree, renumbers nodes to dense `[0, body_node_count)` indices, copies referenced index entries, and resolves `kExternalTensorRead` `arg_buffer_offset` via a closure passed by the per-backend launcher. Bound-var-indexed `kFieldLoad` leaves take a backend-specific base resolution: SPIR-V passes a `FieldLoadDeviceEmitter` whose `fetch` returns `root_psb + place_byte_offset_in_root` (pre-baked PSB address), LLVM passes a null emitter and the encoder stores `(snode_root_id, place_byte_offset)` in the device-node POD's `arg_buffer_offset` / `const_value` slots which the LLVM device interpreter resolves at runtime via `runtime->roots[snode_root_id] + place_byte_offset`.
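A schematic host-side model of the per-spec parallel max reduction (a Python stand-in for the SPIR-V shader / LLVM runtime function; names mirror the constants above, but the control flow is illustrative only):

```python
WORKGROUP_SIZE = 128          # kAdStackMaxReducerWorkgroupSize
ELEMS_PER_THREAD = 64         # kElementsPerThread

def reduce_spec(total_len, eval_body_at, outputs, slot):
    per_group = WORKGROUP_SIZE * ELEMS_PER_THREAD
    num_workgroups = min((total_len + per_group - 1) // per_group, 65535)
    threads = num_workgroups * WORKGROUP_SIZE
    for tid in range(threads):                       # parallel on the device
        local_max = 0
        for idx in range(tid, total_len, threads):   # strided iteration
            local_max = max(local_max, eval_body_at(idx))
        # device: OpAtomicUMax into [2*slot]; [2*slot+1] is the overflow flag
        outputs[2 * slot] = max(outputs[2 * slot], local_max)
```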
3. Launch sequencing

- gfx (`quadrants/runtime/gfx/adstack_max_reducer_launch.cpp`): `GfxRuntime::dispatch_max_reducers(...)`
- LLVM (`quadrants/runtime/llvm/llvm_adstack_lazy_claim.cpp`): `LlvmRuntimeExecutor::dispatch_max_reducers_for_tasks(...)` (overload taking `std::vector<OffloadedTask>`; per-arch launchers in `runtime/cpu/`, `runtime/cuda/`, `runtime/amdgpu/` call into it as a one-liner)

Both helpers share a level-based round dispatch:

- Cache lookup: each captured spec is keyed by `(registry_id, stack_id, mor_node_idx)` packed into a single `uint64_t` via `pack_max_reducer_key` in `adstack/max_reducer.cpp`. Hits drop straight into the result map; misses go to the pending list with back-references to the source `SerializedSizeExpr` and `StaticAdStackMaxReducerSpec`.
- Round dispatch: each round selects the pending specs whose `dependent_mor_node_idxs` are all already in the result map (cache hits + earlier rounds), substitutes those values into the working tree via `substitute_precomputed_max_over_range`, host-evaluates `begin` / `end` against the substituted tree, encodes the body bytecode, and dispatches the round as one cmdlist (gfx) / one batched runtime-function call sequence (LLVM). Most kernels finish in one round; nested patterns (e.g. an outer `MaxOverRange` whose end contains a captured inner max-of-array) take one round per dependency depth. A no-progress round drops every remaining pending spec and falls back to the per-task sizer's cap-hit path.
- Recording: results land in `AdStackCache::record_max_reducer_eval` so the next launch can short-circuit. The recorded read observations come from `populate_max_reducer_body_observations` which snapshots `observed_devalloc` + `observed_gen` (ndarray) and `snode_write_gen` (field) so a host-side mutation of either source invalidates the cache cleanly.
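A hedged sketch of that round loop (names like `run_parallel_max` are hypothetical; one "round" is one cmdlist / one batched runtime-function call sequence in the real C++ helpers):

```python
def dispatch_rounds(pending, results):
    while pending:
        ready = [s for s in pending
                 if all(d in results for d in s.dependent_mor_keys)]
        if not ready:
            # no-progress round: drop the rest onto the per-task sizer's
            # capped fallback path
            return pending
        for spec in ready:
            # substitute earlier-round results, host-eval begin/end, encode
            # the body bytecode, then dispatch the whole round at once
            results[spec.key] = run_parallel_max(spec, results)
        pending = [s for s in pending if s not in ready]
    return []
```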
publish_adstack_metadata_spirv(gfx) /publish_adstack_metadata(LLVM) so the substitutedConsts are in place before the sizer eval pipeline runs.On Apple Silicon Metal the body interpreter loads ndarray data buffers and SNode tree root buffers via PSB (raw
bufferDeviceAddress), bypassing the descriptor-bound resource tracking, so the gfx launcher callstrack_physical_buffer(...)once per cmdlist for everyndarray_allocand everyroot_buffer_(theuseResource:hint Metal needs to mark those buffers resident for the dispatch).4. Substitution into per-stack trees
`quadrants/program/adstack/max_reducer.cpp::substitute_precomputed_max_over_range(expr, registry_id, stack_id, results)` walks `expr.nodes` and replaces every captured `MaxOverRange` whose key is in `results` with a `Const(dispatched_value)`. Empty-input fast path: when no captured spec matches, it returns `expr` unchanged with no allocation.

Three eval paths consume the substituted tree:

- Host eval (`eval_per_task_metadata_on_host` in `runtime/gfx/adstack_sizer_launch.cpp`; the LLVM host-eval branch in `llvm_adstack_lazy_claim.cpp`): the host evaluator's pointer-keyed `size_expr_cache_` cannot accept a stack-local substituted tree (a transient stack address would alias unrelated cache entries across launches and return wrong cached values), so the substitution-active branch routes through a dedicated `evaluate_adstack_size_expr_no_cache(...)` variant; the empty-results fast path keeps the live `a.size_expr` reference and the cache stays warm for kernels that never trigger the recognizer.
- SPIR-V device bytecode (`encode_adstack_size_expr_device_bytecode_for_spirv`): the encoder walks the substituted tree where each captured `MaxOverRange` is already a `Const`, so the body's `ExternalTensorRead` / `FieldLoad` leaves are not in the encoder's `reads` list; `AdStackCache::lookup_max_reducer_reads(...)` returns the recorded body observations for each captured spec, and the encoder appends them to its `reads` list before recording into `spirv_bytecode_cache_`. A mutation to the gating ndarray / field then invalidates the cached bytecode via the same gen-counter replay path the existing per-task metadata cache uses.
- LLVM device bytecode (`encode_adstack_size_expr_device_bytecode`): same substitution; same downstream `llvm_per_task_ad_stack_cache_` machinery.

5. Cap-hit tripwires (`1<<24`)

The `1<<24` per-task sizer cap is structurally unreachable for max-reducer-recognized shapes (those are dispatched in parallel and substituted to `Const` before the sizer walks). It is reachable only for out-of-grammar shapes whose iteration count exceeds the cap. Three explicit tripwires:

- Host evaluator (`evaluate_node`): `QD_ERROR_IF`; surfaces as `RuntimeError` to Python on the CPU host fast path.
- SPIR-V on-device sizer (`adstack_sizer_shader.cpp`): a metadata-trailing overflow-flag slot at offset `2 + 2*n_stacks`. The shader writes 1 there on `end - begin > cap`, and clamps `effective_end = begin` so the walk stays bounded. The host post-readback in `publish_adstack_metadata_spirv` raises `QD_ERROR_IF` when the slot is non-zero.
- LLVM device sizer (`device_eval_node`): `kMaxOverRange` returns 0 immediately on `end - begin > cap` to keep the single-thread on-device dispatch within the driver's TDR window. The cap-hit then surfaces indirectly through the existing `stack_push` overflow infrastructure on the subsequent main-kernel launch. The diagnostic message attribution depends on the kernel layout.
6. Cache invalidation

The per-spec result cache integrates into the existing `AdStackCache` four-layer cascade:

1. `try_max_reducer_cache_hit` (one entry per captured `(registry_id, stack_id, mor_node_idx)`). Hit -> no max-reducer dispatch; the cached `Const` is substituted into the per-stack tree.
2. `try_size_expr_cache_hit` (per-`SerializedSizeExpr`, after substitution). Hit -> no per-thread sizer eval call.
3. `try_per_task_ad_stack_cache_hit` / `try_llvm_per_task_ad_stack_cache_hit` (per-task metadata blob). Hit -> no per-task sizer dispatch.
4. `try_spirv_bytecode_cache_hit` (per-task bytecode blob). Hit -> no SPIR-V bytecode encode + upload.

In steady state with an unchanged gating source every layer hits and the per-launch overhead of the option-D pipeline collapses to zero. A host-side `Ndarray.write` bumps `ndarray_data_gen_`; a host-side field write bumps `snode_write_gen`. Either bump propagates through every layer's gen-counter replay walk and forces a fresh dispatch.

`FieldLoadObs` records produced by the bound-var FieldLoad encoder path carry `indices = {}`, since the body is evaluated at every cross-product iteration and there is no canonical scalar to re-read; `replay_one_observation`'s `FieldLoadObs` arm treats the gen counter as the sole staleness signal in that mode and unconditionally invalidates on a gen mismatch.
Per-backend coverage matrix

| Backend | Parallel `MaxOverRange` dispatch | Cap-hit tripwire (out-of-grammar `MaxOverRange`) |
| --- | --- | --- |
| CPU | `runtime_eval_adstack_max_reduce` host call ✓ | `evaluate_node` `QD_ERROR_IF` ✓ (raised as `RuntimeError`) |
| CUDA | `1x1x1` kernel ✓ | `device_eval_node` short-circuit + indirect `stack_push` overflow |
| AMDGPU | `1x1x1` kernel ✓ | `device_eval_node` short-circuit + indirect `stack_push` overflow |
| Metal / Vulkan | compute shader dispatch ✓ | on-device overflow-flag slot ✓ (raised as `QuadrantsAssertionError`) |

Tests - `tests/python/test_adstack.py`

Six new regression tests, all parametrized over every available backend.
test_max_reducer_pins_stride_for_oversized_axis

Parametrized over a `(shape, body_kind)` matrix that exercises the recognizer's accepted body grammar (single-axis ETR, ETR + `ExternalTensorShape` host-fold, closed `FieldLoad` host-fold, and the `Add`/`Sub`/`Mul`/`Max` arithmetic combinator). For each shape the dispatch + substitution produces the correct heap stride and the kernel runs to completion; the above-cap variants additionally pin the contract that a recognized spec ranges over an arbitrarily large axis. Uses `qd.ndarray` rather than numpy passthrough so the device buffer is not capped at backend-specific H2D-blit limits.

test_max_reducer_dispatch_counts_advance_on_input_mutation

Pins the dispatch + cache invalidation pipeline via a new
`Program.get_max_reducer_dispatch_count` / `reset_max_reducer_dispatch_count` python binding (counter on `AdStackCache`, bumped at every `record_max_reducer_eval`). The first launch fires at least one dispatch; a host mutation of the gating ndarray bumps `ndarray_data_gen` and the next launch re-dispatches.
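A hedged sketch of how such a test can exercise the counter (the binding names come from the description above; the `prog` accessor path, the kernel `f`, and the ndarrays `n` / `x` are assumed for illustration):

```python
prog = qd.get_runtime().prog                 # assumed accessor to Program
prog.reset_max_reducer_dispatch_count()

f(n, x); x.grad.fill(1.0); f.grad(n, x)      # first launch: dispatches fire
assert prog.get_max_reducer_dispatch_count() >= 1

first = prog.get_max_reducer_dispatch_count()
n[0] = n[0] + 1                              # host write bumps ndarray_data_gen
f(n, x); x.grad.fill(1.0); f.grad(n, x)      # cache invalidated: re-dispatch
assert prog.get_max_reducer_dispatch_count() > first
```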
MaxOverRangeand the recognizer captures nothing. The dispatch counter stays at zero; the kernel still produces the correct gradient. Pins the contract that any kernel outside the captured grammar runs unchanged so future grammar broadening cannot silently drop the fallback path.test_max_reducer_field_load_bound_var_dispatchEight-variant parametrized test pinning the bound-var-indexed
FieldLoadbody grammar. Body shapes coverfield[i]on its own,field[i] + arr[i](mixed FieldLoad + ETR viaAdd),arr[i] + field[i](commuted),max(field[i], arr[i]),max(field[i], const),max(field[i] + 0, field[i] * 1 - 0)(full arithmetic combinator), and the conservative-wrapper pathfield[field[i]]/arr[field[i]](the trip-count builder substitutesMaxOverRange(var, 0, leaf_snode.shape, body=Load(snode, [var]))for any nested-load index that does not reduce to a single bound-var or const). Across all variants the body's max value over the indexed range isN_Xand the gradient assertion is uniform.test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutationPins the cache invalidation contract for the bound-var
FieldLoadbody path: the encoder pushes aFieldLoadObskeyed on the snode's write generation, mutatingfield_a[M-1]from Python bumpssnode_write_gen, and the next launch redispatches.test_above_cap_out_of_grammar_kernel_raisesA kernel whose inner range bound is an
i64ndarray read fails the recognizer's dtype restriction (theINT64_MINcache-revalidation sentinel is unreachable for sub-i64 dtypes; fori64a mutated cell could legally hold the sentinel and false-hit on revalidation). The whole spec is dropped and the per-task sizer walks the outerMaxOverRangeitself. Witha.shape[0] > 1<<24the cap fires on every adstack-sizer eval path:RuntimeErrorfrom the host evaluator on CPU,QuadrantsAssertionErrorfrom the SPIR-V on-device sizer on Metal / Vulkan, and an indirect raise viastack_pushoverflow on CUDA / AMDGPU LLVM-GPU.Side-effect audit
- Offline cache round-trip: `StaticAdStackMaxReducerSpec` round-trips via `QD_IO_DEF`; `max_reducer_specs` added to the `QD_IO_DEF` of the SPIR-V `AdStackSizingAttribs` and the LLVM `AdStackSizingInfo`.
- `size_expr_cache_` pointer aliasing: `evaluate_adstack_size_expr_no_cache` for the substitution-active branch only.
- `spirv_bytecode_cache_` observations: `lookup_max_reducer_reads` accessor + encoder appends body reads to the cache entry's read list.
- `per_task_ad_stack_cache_` deps: `collect_size_expr_dep_keys` walks the original tree (pre-substitution) so body reads' arg-ids are still tracked.
- Bytecode kind mapping: `encode_max_reducer_body_bytecode` maps `SizeExpr::Kind` -> `AdStackSizeExprDeviceKind` per kind explicitly.
- Dependency ordering: a spec dispatches only once its `dependent_mor_node_idxs` are all resolved; earlier-round results are substituted into the working tree before host-evaluating `begin` / `end` and encoding the body.
- Metal residency: `track_physical_buffer` called once per cmdlist on every ndarray data buffer and every `root_buffers_` SNode tree root buffer (covers both `kExternalTensorRead` and `kFieldLoad` body leaves).
- Capability gating: `QD_ERROR_IF(!spirv_has_physical_storage_buffer ...)` + `QD_ERROR_IF(!spirv_has_int64 ...)` at the entry of `publish_adstack_metadata_spirv`; drops redundant per-helper cap gates.
- Workgroup sizing: `kElementsPerThread` shader strided iteration keeps `num_workgroups_x = ceil(length / (kAdStackMaxReducerWorkgroupSize * kElementsPerThread))`, capped at 65535 in the launcher.
- Steady state: no re-dispatch while neither `ndarray_data_gen_` nor `snode_write_gen` advances.