77 changes: 77 additions & 0 deletions docs/source/user_guide/atomics.md
@@ -0,0 +1,77 @@
# Atomics

Atomics are read-modify-write operations that act on a single memory location as one indivisible step. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads, see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces. Bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`) are documented in [math](math.md).

## What's available

All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y)` atomically and returns the **old** value of `x`. `x` must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed.

| Op | Semantics | i32 | u32 | i64 | u64 | f32 | f64 |
|----------------|----------------------------------------|-----|-----|-----|-----|-----|-----|
| `atomic_add` | `x += y` | yes | yes | yes | yes | yes | \* |
| `atomic_sub` | `x -= y` | yes | yes | yes | yes | yes | \* |
| `atomic_mul` | `x *= y` | yes | yes | yes | yes | yes | \* |
| `atomic_min` | `x = min(x, y)` | yes | yes | yes | yes | yes | \* |
| `atomic_max` | `x = max(x, y)` | yes | yes | yes | yes | yes | \* |
| `atomic_and` | `x &= y` | yes | yes | yes | yes | — | — |
| `atomic_or` | `x \|= y` | yes | yes | yes | yes | — | — |
| `atomic_xor` | `x ^= y` | yes | yes | yes | yes | — | — |

\* `f64` atomic add / sub / mul / min / max support is hardware-dependent: `add` is native on CUDA sm_60+; elsewhere these ops fall back to a CAS loop, or raise at codegen time on older targets and on backends that do not lower a CAS loop. Prefer `f32` on hot paths if portability matters.
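
How a CAS-loop fallback works can be sketched in pure Python. This is a reference model of the lowering, not the actual runtime code; on real hardware the compare-and-swap is a single atomic instruction and the loop retries until no other thread intervened:

```python
def cas_loop_atomic(mem, idx, y, op):
    """Reference model of lowering a non-native atomic op to a
    compare-and-swap loop. `mem[idx]` stands in for the memory
    location; atomics return the OLD value."""
    while True:
        old = mem[idx]        # atomic load
        new = op(old, y)      # compute the updated value
        # compare_and_swap(mem, idx, expected=old, desired=new):
        # succeeds only if no other thread wrote in between.
        if mem[idx] == old:
            mem[idx] = new    # publish the new value
            return old        # return the value before the update

mem = [2.0]
old = cas_loop_atomic(mem, 0, 3.0, lambda a, b: a * b)  # models atomic_mul
# old == 2.0, mem[0] == 6.0
```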

There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.

All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
**Backend caveat for the fence-pair pattern.** Both fence helpers have current portability gaps that affect the patterns recommended on this page:

- `qd.simt.block.mem_sync()` is supported on CUDA and SPIR-V; on AMDGPU it raises `ValueError("qd.block.mem_sync is not supported for arch ...")` at trace time.
- `qd.simt.grid.memfence()` is fully implemented only on CUDA. On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. See [grid](grid.md) for the per-backend details.

On AMDGPU specifically, neither fence-pair recipe works as documented yet; cross-platform code that needs an atomic plus a fence must restructure around the kernel-launch boundary or be CUDA-bound until the AMDGPU lowerings land.

## Semantics

### `qd.atomic_add(x, y)` — and the rest of the family

```python
old = qd.atomic_add(x, y)
# Effect:
# tmp = load(x)
# store(x, op(tmp, y))
# old = tmp
# all three steps execute as a single atomic transaction on x.
```

Properties common to every `qd.atomic_*`:

- **Returns the old value**, not the new one. This matches CUDA's `atomicAdd` and is what enables building reservation patterns: `slot = qd.atomic_add(counter, 1)` gives every thread a unique index.
- **Per-location atomicity, no fence on the rest of memory.** Writes you issued before an atomic on `x` are not necessarily visible to other threads after they observe the new `x`. Pair the atomic with `qd.simt.block.mem_sync()` or `qd.simt.grid.memfence()` if you need that ordering.
- **Vector / matrix arguments fan out element-wise.** `qd.atomic_add(field_of_vec3, qd.Vector([1.0, 2.0, 3.0]))` issues three independent scalar atomic-adds, one per component. There is no all-or-nothing guarantee across the components.
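
The reservation pattern from the first bullet can be modeled with a lock-guarded counter in pure Python (a sketch of the semantics only; a real kernel would call `qd.atomic_add` on a device-side counter):

```python
import threading

class AtomicCounter:
    """Models qd.atomic_add(counter, 1): fetch-and-add returning
    the old value, so every caller observes a unique index."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def fetch_add(self, y):
        with self._lock:
            old = self.value
            self.value += y
            return old

counter = AtomicCounter()
slots = []
slots_lock = threading.Lock()

def worker():
    slot = counter.fetch_add(1)   # each thread reserves a unique slot
    with slots_lock:
        slots.append(slot)

threads = [threading.Thread(target=worker) for _ in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# slots is a permutation of 0..63: no index is handed out twice.
```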

### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)`

Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max use **`minNum` / `maxNum`-style** semantics: if exactly one input is `NaN`, the **non-`NaN`** value is written back. This matches the f16 path's use of LLVM `llvm.minnum` / `llvm.maxnum` intrinsics (`quadrants/codegen/llvm/codegen_llvm.cpp:1337-1342`) and the GPU-native paths (CUDA sm_80+ `atomicMin`/`atomicMax` for floats, SPIR-V `FMin` / `FMax`). The f32 / f64 CPU CAS-loop path (`quadrants/runtime/llvm/runtime_module/atomic.h::min_f32` / `max_f32`) uses naive `<` / `>` comparisons, which give asymmetric NaN behaviour depending on operand order — do not rely on a particular result when either input is `NaN` on the CPU backend. Behaviour when *both* inputs are `NaN` is backend-dependent across the board.
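
The difference between `minNum`-style semantics and the naive CPU comparison can be seen in a pure-Python reference (these helpers are illustrative, not part of the `qd` API):

```python
import math

def min_num(a, b):
    # minNum-style: if exactly one input is NaN, the non-NaN value wins.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)

def min_naive(a, b):
    # Naive `<` comparison, as in the CPU CAS-loop path: every
    # comparison with NaN is False, so the result depends on
    # operand order.
    return a if a < b else b

nan = float("nan")
min_num(nan, 1.0)    # 1.0 — non-NaN value wins
min_num(1.0, nan)    # 1.0 — symmetric
min_naive(nan, 1.0)  # 1.0 (nan < 1.0 is False, falls through to b)
min_naive(1.0, nan)  # nan (1.0 < nan is False, falls through to b)
```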

### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)`

Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type error at trace time.

### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)`

Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer reducing to a different scheme on hot paths.

## Performance and portability notes

- **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block.
- **Pair atomics with the right fence scope.** A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block-scope (`qd.simt.block.mem_sync()`) for shared-memory publishing, or grid-scope (`qd.simt.grid.memfence()`) for cross-block coordination.
- **`f64` atomics fall off the fast path** on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block.
- **`atomic_mul` is generally a CAS loop** under the hood; don't put it on the hot path.
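
The two-stage scheme from the first bullet can be modeled in pure Python: each "block" reduces its elements locally and issues a single atomic, so contention on the shared counter drops from one atomic per element to one per block. A sequential sketch of the counting, not a kernel:

```python
def two_stage_sum(values, block_size):
    """Model of per-block pre-aggregation before a global atomic.
    Returns the total and the number of atomics that would be issued."""
    total = 0            # stands in for the globally shared accumulator
    atomics_issued = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        partial = sum(block)   # per-block reduction: no atomics at all
        total += partial       # one atomic_add per block
        atomics_issued += 1
    return total, atomics_issued

total, n_atomics = two_stage_sum(list(range(1024)), 256)
# 1024 elements collapse to 4 atomics instead of 1024.
```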

## Related

- [math](math.md) — `qd.math.*`, including the bit-counting helpers (`popcnt`, `clz`) commonly paired with atomics in select / compact patterns.
- `qd.simt.block.*` — block-scope barriers and memory fences (`qd.simt.block.mem_sync()`).
- `qd.simt.subgroup.*` — warp-scope reductions and shuffles, the recommended pre-aggregation step before an atomic.
- `qd.simt.grid.*` — device-scope memory fence (`qd.simt.grid.memfence()`).
- [parallelization](parallelization.md) — thread-synchronization patterns and how atomics fit into the broader synchronization story.
2 changes: 2 additions & 0 deletions docs/source/user_guide/index.md
@@ -47,6 +47,8 @@ autodiff
:maxdepth: 1
:titlesonly:
atomics
math
subgroup
tile16
```
61 changes: 61 additions & 0 deletions docs/source/user_guide/math.md
@@ -0,0 +1,61 @@
# Math

`qd.math` is the quadrants standard library of math helpers.

This page currently documents only the bit-counting helpers. The broader `qd.math` surface is exported and usable today but is not yet documented here.

## Bit operations

Single-thread integer-register operations. They do not access memory and do not synchronize threads — each thread independently transforms a value in its own register.

| Op | CUDA | AMDGPU | SPIR-V (Vulkan / Metal) |
|---------------------|---------------------|------------------------------|----------------------------------------------------|
| `qd.math.popcnt(x)` | i32, u32, i64, u64 | unsupported (codegen FIXME) | any int (`OpBitCount`) |
| `qd.math.clz(x)` | i32, i64 only \* | unsupported (codegen FIXME) | 32-bit only (`FindMSB`); 64-bit input is silently truncated |

\* On CUDA, `qd.math.clz` rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED` in `quadrants/codegen/cuda/codegen_cuda.cpp`); use `bit_cast` through the matching signed type as a workaround: `qd.math.clz(qd.bit_cast(x, qd.i32))`. CUDA `popcnt` accepts u32 / u64 directly; only `clz` has the signed-only restriction. On unsupported integer widths (e.g. `i8`, `i16`, `u16`) both ops also hit `QD_NOT_IMPLEMENTED`.

**FIXME (AMDGPU):** the AMDGPU `emit_extra_unary` override (`quadrants/codegen/amdgpu/codegen_amdgpu.cpp`) has no `popcnt` or `clz` branch; both fall through to `QD_NOT_IMPLEMENTED`. The test suite already records this (`tests/python/test_unary_ops.py::test_popcnt` and `::test_clz` both `xfail` on AMDGPU). Until lowerings are added, AMDGPU users hit a hard codegen failure.

The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32 on CUDA, the `bit_cast` workaround above is the canonical approach.

### `qd.math.popcnt(x)`

Counts set bits in `x` and returns an `i32`. On CUDA, lowers to `__nv_popc` for 32-bit inputs and `__nv_popcll` for 64-bit inputs (i32 / u32 / i64 / u64 only; narrower widths and AMDGPU are unsupported). On SPIR-V, lowers to `OpBitCount`.

### `qd.math.clz(x)`

Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. On CUDA, lowers to `__nv_clz` (i32 only) and `__nv_clzll` (i64 only); u32 / u64 must be `bit_cast` to the matching signed type. On SPIR-V, lowers to `FindMSB` with `bitwidth - 1 - FindMSB` to convert MSB index into a leading-zero count; the implementation is hard-coded to 32-bit, so 64-bit input silently truncates. AMDGPU is unsupported. See the cross-backend caveats in the support table.
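
The semantics of both ops can be pinned down with a pure-Python reference (illustrative only, not the `qd` implementation):

```python
def popcnt(x, bits=32):
    # Reference semantics: count set bits in the low `bits` bits.
    return bin(x & ((1 << bits) - 1)).count("1")

def clz(x, bits=32):
    # Reference semantics: leading zero count; clz(0) == bits.
    x &= (1 << bits) - 1
    return bits - x.bit_length()

popcnt(0b1011)       # 3
clz(1)               # 31 — only bit 0 is set
clz(0)               # 32 — all 32 bits are zero
clz(0x80000000)      # 0  — MSB is set
```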

## Examples

### Bitset population count

```python
@qd.kernel
def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None:
n = 0
for i in range(masks.shape[0]):
n += qd.math.popcnt(masks[i])
qd.atomic_add(total[0], n)
```

### Highest set bit (Morton-code depth)

```python
@qd.func
def msb(x: qd.i32) -> qd.i32:
return 31 - qd.math.clz(x)
```

For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`.
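
The `bit_cast` workaround only reinterprets the bit pattern, so the leading-zero count is unchanged. A pure-Python model of the u32 → i32 reinterpretation (the helper name is illustrative, not part of the `qd` API):

```python
def bit_cast_u32_to_i32(x):
    """Reinterpret a 32-bit unsigned value as two's-complement signed.
    The bits are untouched, so clz over the result is identical."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

bit_cast_u32_to_i32(0xFFFFFFFF)  # -1
bit_cast_u32_to_i32(0x7FFFFFFF)  # 2147483647
```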

## Performance and portability notes

- `qd.math.popcnt` is supported on CUDA (i32 / u32 / i64 / u64) and SPIR-V (any integer width). AMDGPU is unsupported (FIXME above).
- `qd.math.clz` has the dtype and backend caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability on CUDA, and avoid 64-bit input on SPIR-V.

## Related

- [atomics](atomics.md) — atomic read-modify-write operations on global / shared memory; commonly paired with bit-counting in select / compact patterns.
- `qd.bit_cast` — reinterprets a value's bit pattern as another dtype, used as a workaround for the `clz` u32 / u64 caveats above.