diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
new file mode 100644
index 0000000000..cb434b5ca9
--- /dev/null
+++ b/docs/source/user_guide/grid.md
@@ -0,0 +1,100 @@
+# Grid primitives
+
+Grid-level primitives operate across **all blocks of a single kernel launch** — i.e. the entire device for the duration of one kernel. They sit one tier above block-scope primitives and one tier below "finish the kernel and launch a new one" (the only fully cross-block thread synchronization Quadrants offers).
+
+Grid ops live under `qd.simt.grid`. The namespace currently contains a single op, the device-scope memory fence.
+
+## What's available
+
+| Op                         | CUDA | AMDGPU | SPIR-V (Vulkan / Metal) |
+|----------------------------|------|--------|-------------------------|
+| `qd.simt.grid.memfence()`  | yes  | no     | no                      |
+
+Calling the op on a backend marked "no" raises an error (see the portability notes below). AMDGPU and SPIR-V lowerings are tracked as future work; until they land, kernels that need a grid-scope fence are CUDA-only.
+
+Naming note: `qd.simt.grid.memfence()` will be renamed to `qd.simt.grid.mem_fence()` (note the underscore) in the near future, for consistency with the `mem_fence` spelling used at other scopes. The new name is not yet available; this page uses the current name throughout.
+
+## Barrier vs fence at grid scope
+
+There is no `grid.sync()` — Quadrants does not expose a thread-converging barrier across blocks within a single kernel launch. The reasons are practical: CUDA cooperative groups require launch-time opt-in, AMDGPU and SPIR-V either lack a comparable primitive or expose it only behind non-portable extensions, and the latency of an in-kernel grid barrier is comparable to a kernel relaunch on most hardware.
+
+What Quadrants does provide at grid scope is a pure **memory fence**:
+
+- **`qd.simt.grid.memfence()`** orders the calling thread's memory operations at device scope: writes issued before the fence become visible to threads **anywhere in the grid** (across blocks) before writes issued after it, and reads issued after the fence are not reordered ahead of it. It does **not** synchronize threads.
+- For full thread synchronization across the grid, finish the current kernel and launch a new one — the implicit kernel-end barrier is the canonical cross-block synchronization in Quadrants (see the sketch below).
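+
+The kernel-boundary alternative in the second bullet deserves a concrete picture. The sketch below splits a reduction across two launches: everything the first kernel writes to `partials` is visible to every block of the second, with no fence anywhere, because the gap between launches is a full grid-wide barrier. The `qd.launch(...)` calls and their arguments are illustrative assumptions rather than documented Quadrants API, and the signatures are elided in the style of the scan example further down this page.
+
+```python
+@qd.kernel
+def partial_sums(...) -> None:
+    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
+    block_sum = ...  # each block reduces its slice of the input
+    if qd.simt.block.global_thread_idx() % BLOCK_SIZE == 0:
+        partials[bid] = block_sum  # plain store; no grid fence needed
+
+@qd.kernel
+def combine(...) -> None:
+    # Every write partial_sums made is visible here: the implicit barrier
+    # at the end of the first launch both ordered and synchronized it all.
+    ...
+
+# Hypothetical launch API, for illustration only:
+qd.launch(partial_sums, grid=NUM_BLOCKS)
+qd.launch(combine, grid=1)
+```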
+
+## Semantics
+
+### `qd.simt.grid.memfence()`
+
+**Planned rename:** this op will become `qd.simt.grid.mem_fence()` (with underscore) in a future release, for consistency with the `mem_fence` spelling at other scopes (see the naming note above); `memfence` is the only spelling available today, and the rest of this section uses it.
+
+A device-scope memory fence. Lowers to `__threadfence()` (`nvvm_membar_gl`) on CUDA. No convergence requirement — safe to call from divergent control flow (e.g. inside `if tid == 0`).
+
+Use this when one block (or one thread per block) needs to publish data to global memory and have the publication be visible to other blocks **without** waiting at a kernel boundary. The canonical use case is the decoupled look-back pattern in Onesweep-style device scans:
+
+```python
+@qd.kernel
+def lookback_scan(...) -> None:
+    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
+    tid = qd.simt.block.global_thread_idx() % BLOCK_SIZE
+
+    block_sum = ...  # this block's local reduction
+
+    # Publish: make partials[bid] visible device-wide before the flag flips.
+    if tid == 0:
+        partials[bid] = block_sum
+        qd.simt.grid.memfence()
+        flags[bid] = STATE_AGGREGATE
+
+    # Look back: spin on each predecessor's flag, then read its partial.
+    if tid == 0:
+        prev = bid - 1
+        while prev >= 0:
+            while flags[prev] == STATE_INVALID:
+                pass
+            qd.simt.grid.memfence()
+            block_sum += partials[prev]
+            ...  # advance to the next predecessor (elided)
+```
+
+The two `grid.memfence()` calls are doing different jobs:
+
+1. The first orders the publication: any block that reads `flags[bid] == STATE_AGGREGATE` is guaranteed to also see the published `partials[bid]`.
+2. The second is the symmetric reader-side fence: having observed the predecessor's flag, the reader fences before loading `partials[prev]`, so the load cannot return a stale value ordered before the flag observation.
+
+The fence does not require thread convergence, which is why it can appear inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.memfence()` is safe.
+
+## Performance and portability notes
+
+- **CUDA-only today.** AMDGPU and SPIR-V lowerings are not implemented; calling `grid.memfence()` on those backends raises at trace time. Cross-platform code that needs a grid-scope fence must currently restrict the kernel to CUDA, or restructure around the kernel-launch boundary.
+- **Cost scales with the global-cache invalidation domain.** A grid fence drains the L2 (and on some GPUs the L1) caches of all SMs / CUs touching the address. On A100 / H100 the cost is on the order of tens to low hundreds of nanoseconds per call; in tight loops, prefer batching multiple cross-block updates per fence.
+- **Pair with the right ordering of memory ops.** The fence orders the *calling thread*'s memory ops; readers in other blocks need their own fence (or an atomic load) to refresh their view. The producer-fence + consumer-fence pattern is the canonical idiom.
+- **Not a substitute for atomics on contended locations.** A fence orders writes but does not serialize them. If multiple blocks write to the same location, you need an atomic regardless of where the fence is placed.
+
+## Related
+
+- `qd.simt.block.*` — the block-scope counterpart, including `qd.simt.block.mem_sync()` (block-scope fence) and `qd.simt.block.sync()` (block-scope barrier).
+- `qd.simt.subgroup.*` — subgroup-scope barriers, fences, shuffles, and reductions.
+- [parallelization](parallelization.md) — the broader synchronization story; explains how grid-scope fences fit relative to atomics, block barriers, and the kernel boundary.
diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md
index f783e84264..e16919fdd0 100644
--- a/docs/source/user_guide/index.md
+++ b/docs/source/user_guide/index.md
@@ -47,6 +47,7 @@ autodiff
 :maxdepth: 1
 :titlesonly:
 
+grid
 subgroup
 tile16
 ```