From 8d289450df29bc96fa2f55a31a4014de7e8c0b53 Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 00:43:06 -0700
Subject: [PATCH 1/5] [Docs] Add user-guide page for qd.simt.grid.* primitives
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents qd.simt.grid.memfence() — currently the sole public op in the qd.simt.grid namespace. Covers semantics (device-scope memory fence, no thread convergence), backend support (CUDA only today), the producer-fence + consumer-fence pattern that decoupled-look-back scans and Onesweep build on, and how to pick between subgroup / block / grid scopes.

Also surfaces the asymmetry: there is no qd.simt.grid.sync() (grid-scope barrier) — full thread synchronization across blocks requires a kernel relaunch.

Adds grid.md to the SIMT-primitives toctree.
---
 docs/source/user_guide/grid.md  | 140 ++++++++++++++++++++++++++++++++
 docs/source/user_guide/index.md |   1 +
 2 files changed, 141 insertions(+)
 create mode 100644 docs/source/user_guide/grid.md

diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
new file mode 100644
index 0000000000..dedbf202a6
--- /dev/null
+++ b/docs/source/user_guide/grid.md
@@ -0,0 +1,140 @@
# Grid primitives

Grid-level primitives operate across **all blocks of a single kernel launch** — i.e. the entire device for the duration of one kernel. They sit one tier above block-scope primitives and one tier below "finish the kernel and launch a new one" (the only fully cross-block thread synchronization Quadrants offers).

Grid ops live under `qd.simt.grid`. The namespace currently contains a single op, the device-scope memory fence:

## What's available

| Op                         | CUDA | AMDGPU | SPIR-V (Vulkan / Metal) |
|----------------------------|------|--------|-------------------------|
| `qd.simt.grid.memfence()`  | yes  | no     | no                      |

Calling the op on a backend marked "no" raises a runtime / link-time error. AMDGPU and SPIR-V lowerings are tracked as future work; until they land, kernels that need a grid-scope fence are CUDA-only.

## Barrier vs fence at grid scope

There is no `grid.sync()` — Quadrants does not expose a thread-converging barrier across blocks within a single kernel launch. The reasons are practical: CUDA cooperative groups need launch-time opt-in, AMDGPU and SPIR-V either lack a comparable primitive or expose it only under non-portable extensions, and the latency of an in-kernel grid barrier is comparable to a kernel relaunch on most hardware.

What Quadrants does provide at grid scope is a pure **memory fence**:

- **`qd.simt.grid.memfence()`** orders the calling thread's memory operations: writes it issued before the fence become visible to threads **anywhere in the grid** (across blocks) before any memory operation it issues after the fence. It does **not** synchronize threads.
- For full thread synchronization across the grid, finish the current kernel and launch a new one — the implicit kernel-end barrier is the canonical cross-block synchronization in Quadrants.

The corresponding distinctions at narrower scopes:

- Block scope: `qd.simt.block.sync()` (barrier) vs `qd.simt.block.mem_sync()` (fence).
- Subgroup scope: `qd.simt.subgroup.barrier()` (barrier) vs `qd.simt.subgroup.memory_barrier()` (fence).

A useful mental model: barriers converge threads, fences order memory; grid scope only offers the fence.

## Semantics

### `qd.simt.grid.memfence()`

A device-scope memory fence.
Lowers to `__threadfence()` (`nvvm_membar_gl`) on CUDA. No convergence requirement — safe to call from divergent control flow (e.g. inside `if tid == 0`).

Use this when one block (or one thread per block) needs to publish data to global memory and have the publication be visible to other blocks **without** waiting at a kernel boundary. The canonical use case is the decoupled-look-back pattern in Onesweep-style device scans:

```python
@qd.kernel
def lookback_scan(...) -> None:
    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
    tid = qd.simt.block.global_thread_idx() % BLOCK_SIZE

    block_sum = ...  # this block's local reduction (elided)

    # Producer side: publish the partial, then the flag.
    if tid == 0:
        partials[bid] = block_sum
        qd.simt.grid.memfence()
        flags[bid] = STATE_AGGREGATE

    # Consumer side: walk predecessors until a usable state is seen.
    if tid == 0:
        prev = bid - 1
        while prev >= 0:
            while flags[prev] == STATE_INVALID:
                pass
            qd.simt.grid.memfence()
            block_sum += partials[prev]
            ...  # on AGGREGATE continue to prev - 1; on PREFIX stop (elided)
```

The two `grid.memfence()` calls are doing different jobs:

1. The first orders the publication: any block reading `flags[bid] == STATE_AGGREGATE` is guaranteed to also see the published `partials[bid]`.
2. The second is the symmetric reader-side fence: it keeps the read of `partials[prev]` from being satisfied by a stale value fetched before the flag check. The data already lives in global memory; what the fence adds is the ordering between the flag read and the partial read.

The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.memfence()` is safe.

## When to use which fence

| Scope you need to publish to        | Use                                 |
|-------------------------------------|-------------------------------------|
| Other lanes in the same subgroup    | `qd.simt.subgroup.memory_barrier()` |
| Other threads in the same block     | `qd.simt.block.mem_sync()`          |
| Other blocks in the same grid       | `qd.simt.grid.memfence()`           |
| Threads of a *future* kernel launch | (implicit at the kernel boundary; no explicit fence required from Python) |

A wider scope is always sound — `grid.memfence()` is a strict superset of `block.mem_sync()`'s ordering — but slower, because more caches need to be drained. Pick the narrowest scope that matches your sharing pattern.

## Examples

### Cross-block reduction with a single kernel

```python
NUM_BLOCKS = 64
BLOCK_SIZE = 32  # one 32-lane subgroup per block, so a subgroup reduction covers the whole block

partials = qd.field(qd.f32, shape=(NUM_BLOCKS,))
flags = qd.field(qd.i32, shape=(NUM_BLOCKS,))

@qd.kernel
def reduce_one_pass(input: qd.types.NDArray[qd.f32, 1], result: qd.types.NDArray[qd.f32, 1]) -> None:
    qd.loop_config(block_dim=BLOCK_SIZE)

    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
    tid = qd.simt.block.global_thread_idx() % BLOCK_SIZE

    local = qd.f32(0.0)
    i = bid * BLOCK_SIZE + tid
    if i < input.shape[0]:
        local = input[i]

    block_sum = qd.simt.subgroup.reduce_all_add(local, 5)  # 5 = log2(32): sums across all 32 lanes, i.e. the whole block

    if tid == 0:
        partials[bid] = block_sum
        qd.simt.grid.memfence()
        flags[bid] = 1

    if bid == NUM_BLOCKS - 1 and tid == 0:
        for b in range(NUM_BLOCKS - 1):
            while flags[b] == 0:
                pass  # spin-wait: assumes the producer blocks are resident or already finished
        qd.simt.grid.memfence()
        total = qd.f32(0.0)
        for b in range(NUM_BLOCKS):
            total += partials[b]
        result[0] = total
```

The two-stage publish-then-flag pattern is exactly what Onesweep, decoupled-look-back scan, and persistent-thread reductions are built on. Without the `grid.memfence()`, a reader could observe `flags[b] == 1` while still seeing the old value of `partials[b]`.
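
### The same reduction via the kernel boundary

For contrast, the fence-free structure leans on the kernel boundary instead. The sketch below reuses the hypothetical `qd` API from the example above; the kernel names and the host-side launch calls are illustrative assumptions, not documented Quadrants API.

```python
@qd.kernel
def reduce_partials(input: qd.types.NDArray[qd.f32, 1]) -> None:
    qd.loop_config(block_dim=BLOCK_SIZE)

    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
    tid = qd.simt.block.global_thread_idx() % BLOCK_SIZE

    local = qd.f32(0.0)
    i = bid * BLOCK_SIZE + tid
    if i < input.shape[0]:
        local = input[i]

    # All 32 lanes participate: the subgroup collective must not be called
    # under divergent control flow.
    block_sum = qd.simt.subgroup.reduce_all_add(local, 5)

    if tid == 0:
        # No fence, no flag: the implicit barrier at the end of this kernel
        # publishes the write to every thread of any later launch.
        partials[bid] = block_sum

@qd.kernel
def combine_partials(result: qd.types.NDArray[qd.f32, 1]) -> None:
    if qd.simt.block.global_thread_idx() == 0:
        total = qd.f32(0.0)
        for b in range(NUM_BLOCKS):
            total += partials[b]
        result[0] = total

# Host side: the launch boundary between the two kernels is the synchronization.
reduce_partials(x)
combine_partials(out)
```

The cost is one extra trip through the launch queue; the payoff is that no fences, flags, or spin-waits are needed, and the structure works on every backend.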
### When to *not* use it

If you only need to publish across threads of the same block, `qd.simt.block.mem_sync()` is several times cheaper. Use `grid.memfence()` only for true cross-block coordination.

If you need every thread in the grid to *converge* (not just see each other's writes), there is no in-kernel primitive — finish the kernel and launch a new one. Quadrants kernels order themselves relative to one another only at the launch boundary, which doubles as the cross-grid barrier.

## Performance and portability notes

- **CUDA-only today.** AMDGPU and SPIR-V lowerings are not implemented; calling `grid.memfence()` on those backends raises at trace time. Cross-platform code that needs a grid-scope fence must currently restrict the kernel to CUDA, or restructure to use the kernel-launch boundary.
- **Cost scales with the global-cache invalidation domain.** A grid fence makes the calling thread's pending writes visible at a cache level shared by all SMs / CUs (the L2 on current GPUs) and, on some GPUs, invalidates stale L1 copies. On A100 / H100 the cost is on the order of tens to low hundreds of nanoseconds per call; in tight loops, prefer batching multiple cross-block updates per fence.
- **Pair with the right ordering of memory ops.** The fence orders the *calling thread*'s memory ops; readers in other blocks need their own fence (or an atomic load) to refresh their view. The producer-fence + consumer-fence pattern in the example above is the canonical idiom.
- **Not a substitute for atomics on contended locations.** A fence orders writes but does not serialize them. If multiple blocks write to the same location, you need an atomic regardless of how the fence is placed.

## Related

- `qd.simt.block.*` — the block-scope counterpart, including `qd.simt.block.mem_sync()` (block-scope fence) and `qd.simt.block.sync()` (block-scope barrier).
- `qd.simt.subgroup.*` — subgroup-scope barriers, fences, shuffles, and reductions.
- [parallelization](parallelization.md) — the broader synchronization story; explains how grid-scope fences fit relative to atomics, block barriers, and the kernel boundary.

diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md
index f783e84264..e16919fdd0 100644
--- a/docs/source/user_guide/index.md
+++ b/docs/source/user_guide/index.md
@@ -47,6 +47,7 @@ autodiff
:maxdepth: 1
:titlesonly:

+grid
subgroup
tile16
```

From 646eec8136be59dbea17c9ada608587a163043cb Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 08:21:47 -0700
Subject: [PATCH 2/5] [Docs] Note planned rename grid.memfence -> grid.mem_fence

---
 docs/source/user_guide/grid.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
index dedbf202a6..6f0e5638f6 100644
--- a/docs/source/user_guide/grid.md
+++ b/docs/source/user_guide/grid.md
@@ -12,6 +12,8 @@ Grid ops live under `qd.simt.grid`. The namespace currently contains a single op

Calling the op on a backend marked "no" raises a runtime / link-time error. AMDGPU and SPIR-V lowerings are tracked as future work; until they land, kernels that need a grid-scope fence are CUDA-only.

+Naming note: `qd.simt.grid.memfence()` will be renamed to `qd.simt.grid.mem_fence()` (note the underscore) in the near future, for consistency with the underscore-separated fence names at other scopes (`block.mem_sync`, `subgroup.memory_barrier`). The new name is not yet available; this page uses the current name throughout.
+
## Barrier vs fence at grid scope

There is no `grid.sync()` — Quadrants does not expose a thread-converging barrier across blocks within a single kernel launch.
The reasons are practical: CUDA cooperative groups need launch-time opt-in, AMDGPU and SPIR-V either lack a comparable primitive or expose it only under non-portable extensions, and the latency of an in-kernel grid barrier is comparable to a kernel relaunch on most hardware.

@@ -32,6 +34,8 @@ A useful mental model: barriers converge threads, fences order memory; grid scop

### `qd.simt.grid.memfence()`

+**Planned rename: `qd.simt.grid.mem_fence()`** (with underscore). The op will be renamed in a future release for consistency with the underscore-separated fence names at other scopes (`block.mem_sync`, `subgroup.memory_barrier`); the current `memfence` name remains the only spelling available today, and the rest of this section uses it.
+
A device-scope memory fence. Lowers to `__threadfence()` (`nvvm_membar_gl`) on CUDA. No convergence requirement — safe to call from divergent control flow (e.g. inside `if tid == 0`).

Use this when one block (or one thread per block) needs to publish data to global memory and have the publication be visible to other blocks **without** waiting at a kernel boundary. The canonical use case is the decoupled-look-back pattern in Onesweep-style device scans:

From ce273e47d90af40ef156092d59603d046bc33205 Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 08:22:08 -0700
Subject: [PATCH 3/5] [Docs] grid: drop narrower-scope cross-references in barrier-vs-fence section

---
 docs/source/user_guide/grid.md | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
index 6f0e5638f6..14497f4569 100644
--- a/docs/source/user_guide/grid.md
+++ b/docs/source/user_guide/grid.md
@@ -23,13 +23,6 @@ What Quadrants does provide at grid scope is a pure **memory fence**:

- **`qd.simt.grid.memfence()`** orders the calling thread's memory operations: writes it issued before the fence become visible to threads **anywhere in the grid** (across blocks) before any memory operation it issues after the fence. It does **not** synchronize threads.
- For full thread synchronization across the grid, finish the current kernel and launch a new one — the implicit kernel-end barrier is the canonical cross-block synchronization in Quadrants.

-The corresponding distinctions at narrower scopes:
-
-- Block scope: `qd.simt.block.sync()` (barrier) vs `qd.simt.block.mem_sync()` (fence).
-- Subgroup scope: `qd.simt.subgroup.barrier()` (barrier) vs `qd.simt.subgroup.memory_barrier()` (fence).
-
-A useful mental model: barriers converge threads, fences order memory; grid scope only offers the fence.
-
## Semantics

### `qd.simt.grid.memfence()`

From ea32c95033cb1e0169a64da6354b17ecf620c1ee Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 08:23:11 -0700
Subject: [PATCH 4/5] [Docs] grid: remove 'When to use which fence' section

---
 docs/source/user_guide/grid.md | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
index 14497f4569..d0589f2926 100644
--- a/docs/source/user_guide/grid.md
+++ b/docs/source/user_guide/grid.md
@@ -63,17 +63,6 @@ The two `grid.memfence()` calls are doing different jobs:

The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.memfence()` is safe.
-## When to use which fence
-
-| Scope you need to publish to        | Use                                 |
-|-------------------------------------|-------------------------------------|
-| Other lanes in the same subgroup    | `qd.simt.subgroup.memory_barrier()` |
-| Other threads in the same block     | `qd.simt.block.mem_sync()`          |
-| Other blocks in the same grid       | `qd.simt.grid.memfence()`           |
-| Threads of a *future* kernel launch | (implicit at the kernel boundary; no explicit fence required from Python) |
-
-A wider scope is always sound — `grid.memfence()` is a strict superset of `block.mem_sync()`'s ordering — but slower, because more caches need to be drained. Pick the narrowest scope that matches your sharing pattern.
-
## Examples

### Cross-block reduction with a single kernel

From 8a3f1718b191985958744947d6542e68c06adc3f Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 08:23:52 -0700
Subject: [PATCH 5/5] [Docs] grid: remove '## Examples' section

---
 docs/source/user_guide/grid.md | 51 +---------------------------------
 1 file changed, 1 insertion(+), 50 deletions(-)

diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
index d0589f2926..cb434b5ca9 100644
--- a/docs/source/user_guide/grid.md
+++ b/docs/source/user_guide/grid.md
@@ -63,60 +63,11 @@ The two `grid.memfence()` calls are doing different jobs:

The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.memfence()` is safe.

-## Examples
-
-### Cross-block reduction with a single kernel
-
-```python
-NUM_BLOCKS = 64
-BLOCK_SIZE = 32  # one 32-lane subgroup per block, so a subgroup reduction covers the whole block
-
-partials = qd.field(qd.f32, shape=(NUM_BLOCKS,))
-flags = qd.field(qd.i32, shape=(NUM_BLOCKS,))
-
-@qd.kernel
-def reduce_one_pass(input: qd.types.NDArray[qd.f32, 1], result: qd.types.NDArray[qd.f32, 1]) -> None:
-    qd.loop_config(block_dim=BLOCK_SIZE)
-
-    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
-    tid = qd.simt.block.global_thread_idx() % BLOCK_SIZE
-
-    local = qd.f32(0.0)
-    i = bid * BLOCK_SIZE + tid
-    if i < input.shape[0]:
-        local = input[i]
-
-    block_sum = qd.simt.subgroup.reduce_all_add(local, 5)  # 5 = log2(32): sums across all 32 lanes, i.e. the whole block
-
-    if tid == 0:
-        partials[bid] = block_sum
-        qd.simt.grid.memfence()
-        flags[bid] = 1
-
-    if bid == NUM_BLOCKS - 1 and tid == 0:
-        for b in range(NUM_BLOCKS - 1):
-            while flags[b] == 0:
-                pass  # spin-wait: assumes the producer blocks are resident or already finished
-        qd.simt.grid.memfence()
-        total = qd.f32(0.0)
-        for b in range(NUM_BLOCKS):
-            total += partials[b]
-        result[0] = total
-```
-
-The two-stage publish-then-flag pattern is exactly what Onesweep, decoupled-look-back scan, and persistent-thread reductions are built on. Without the `grid.memfence()`, a reader could observe `flags[b] == 1` while still seeing the old value of `partials[b]`.
-
-### When to *not* use it
-
-If you only need to publish across threads of the same block, `qd.simt.block.mem_sync()` is several times cheaper. Use `grid.memfence()` only for true cross-block coordination.
-
-If you need every thread in the grid to *converge* (not just see each other's writes), there is no in-kernel primitive — finish the kernel and launch a new one. Quadrants kernels order themselves relative to one another only at the launch boundary, which doubles as the cross-grid barrier.
-
## Performance and portability notes

- **CUDA-only today.** AMDGPU and SPIR-V lowerings are not implemented; calling `grid.memfence()` on those backends raises at trace time.
Cross-platform code that needs a grid-scope fence must currently restrict the kernel to CUDA, or restructure to use the kernel-launch boundary.
- **Cost scales with the global-cache invalidation domain.** A grid fence makes the calling thread's pending writes visible at a cache level shared by all SMs / CUs (the L2 on current GPUs) and, on some GPUs, invalidates stale L1 copies. On A100 / H100 the cost is on the order of tens to low hundreds of nanoseconds per call; in tight loops, prefer batching multiple cross-block updates per fence.
- **Pair with the right ordering of memory ops.** The fence orders the *calling thread*'s memory ops; readers in other blocks need their own fence (or an atomic load) to refresh their view. The producer-fence + consumer-fence pattern is the canonical idiom.
- **Not a substitute for atomics on contended locations.** A fence orders writes but does not serialize them. If multiple blocks write to the same location, you need an atomic regardless of how the fence is placed.

## Related

- `qd.simt.block.*` — the block-scope counterpart, including `qd.simt.block.mem_sync()` (block-scope fence) and `qd.simt.block.sync()` (block-scope barrier).
- `qd.simt.subgroup.*` — subgroup-scope barriers, fences, shuffles, and reductions.
- [parallelization](parallelization.md) — the broader synchronization story; explains how grid-scope fences fit relative to atomics, block barriers, and the kernel boundary.