diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md
new file mode 100644
index 0000000000..a3689c6372
--- /dev/null
+++ b/docs/source/user_guide/atomics.md
@@ -0,0 +1,77 @@
+# Atomics
+
+Atomic read-modify-write operations on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces. Bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`) are documented in [math](math.md).
+
+## What's available
+
+All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y)` atomically and returns the **old** value of `x`. `x` must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed.
+
+| Op             | Semantics        | i32 | u32 | i64 | u64 | f32 | f64 |
+|----------------|------------------|-----|-----|-----|-----|-----|-----|
+| `atomic_add`   | `x += y`         | yes | yes | yes | yes | yes | \*  |
+| `atomic_sub`   | `x -= y`         | yes | yes | yes | yes | yes | \*  |
+| `atomic_mul`   | `x *= y`         | yes | yes | yes | yes | yes | \*  |
+| `atomic_min`   | `x = min(x, y)`  | yes | yes | yes | yes | yes | \*  |
+| `atomic_max`   | `x = max(x, y)`  | yes | yes | yes | yes | yes | \*  |
+| `atomic_and`   | `x &= y`         | yes | yes | yes | yes | —   | —   |
+| `atomic_or`    | `x \|= y`        | yes | yes | yes | yes | —   | —   |
+| `atomic_xor`   | `x ^= y`         | yes | yes | yes | yes | —   | —   |
+
+\* `f64` atomic add / sub / mul / min / max is hardware-dependent: `add` has a native instruction on CUDA sm_60+; everything else falls back to a CAS loop, and raises at codegen time on older targets and on backends that do not lower a CAS loop. Prefer `f32` on hot paths if portability matters.
+
+There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it would require extending `AtomicOpType`.
+
+All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
+
+**Backend caveat for the fence-pair pattern.** Both fence helpers have portability gaps that affect the patterns recommended on this page:
+
+- `qd.simt.block.mem_sync()` is supported on CUDA and SPIR-V; on AMDGPU it raises `ValueError("qd.block.mem_sync is not supported for arch ...")` at trace time.
+- `qd.simt.grid.memfence()` is fully implemented only on CUDA. On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. See [grid](grid.md) for the per-backend details.
+
+On AMDGPU specifically, neither fence-pair recipe works as documented yet; cross-platform code that needs an atomic plus a fence must restructure around the kernel-launch boundary, or stay CUDA-bound until the AMDGPU lowerings land.
+
+## Semantics
+
+### `qd.atomic_add(x, y)` — and the rest of the family
+
+```python
+old = qd.atomic_add(x, y)
+# Effect:
+#   tmp = load(x)
+#   store(x, op(tmp, y))
+#   old = tmp
+# All three steps execute as a single atomic transaction on x.
+```
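+
+The returned old value is what turns an atomic into a reservation primitive. A hedged sketch of the pattern (the kernel and array names are illustrative; only `qd.atomic_add` is from this page):
+
+```python
+@qd.kernel
+def compact_positive(src: qd.types.NDArray[qd.f32, 1],
+                     dst: qd.types.NDArray[qd.f32, 1],
+                     count: qd.types.NDArray[qd.i32, 1]) -> None:
+    for i in range(src.shape[0]):
+        if src[i] > 0.0:
+            # The old value of the counter is this element's unique slot;
+            # the read and the increment happen as one atomic transaction,
+            # so no two claims collide.
+            slot = qd.atomic_add(count[0], 1)
+            dst[slot] = src[i]
+```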
+
+Properties common to every `qd.atomic_*`:
+
+- **Returns the old value**, not the new one. This matches CUDA's `atomicAdd` and is what enables reservation patterns: `slot = qd.atomic_add(counter, 1)` gives every thread a unique index, as in the sketch above.
+- **Per-location atomicity, no fence on the rest of memory.** Writes you issued before an atomic on `x` are not necessarily visible to other threads after they observe the new `x`. Pair the atomic with `qd.simt.block.mem_sync()` or `qd.simt.grid.memfence()` if you need that ordering.
+- **Vector / matrix arguments fan out element-wise.** `qd.atomic_add(field_of_vec3, qd.Vector([1.0, 2.0, 3.0]))` issues three independent scalar atomic-adds, one per component. There is no all-or-nothing guarantee across the components.
+
+### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)`
+
+Atomically writes back `min(x, y)` (resp. `max(x, y)`) and returns the old value of `x`. Floating-point min / max use **`minNum` / `maxNum`-style** semantics: if exactly one input is `NaN`, the **non-`NaN`** value is written back. This matches the f16 path's use of the LLVM `llvm.minnum` / `llvm.maxnum` intrinsics (`quadrants/codegen/llvm/codegen_llvm.cpp:1337-1342`) and the GPU-native paths (CUDA sm_80+ `atomicMin` / `atomicMax` for floats, SPIR-V `FMin` / `FMax`). The f32 / f64 CPU CAS-loop path (`quadrants/runtime/llvm/runtime_module/atomic.h::min_f32` / `max_f32`) uses naive `<` / `>` comparisons, which give asymmetric NaN behaviour depending on operand order — do not rely on a particular result when either input is `NaN` on the CPU backend. Behaviour when *both* inputs are `NaN` is backend-dependent across the board.
+
+### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)`
+
+Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type error at trace time.
+
+### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)`
+
+Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer a different reduction scheme on hot paths.
+
+## Performance and portability notes
+
+- **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block (see the sketch after this list).
+- **Pair atomics with the right fence scope.** A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block scope (`qd.simt.block.mem_sync()`) for shared-memory publishing, grid scope (`qd.simt.grid.memfence()`) for cross-block coordination.
+- **`f64` atomics fall off the fast path** on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic add at the end of the block.
+- **`atomic_mul` is generally a CAS loop** under the hood; don't put it on the hot path.
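+
+A hedged sketch of that two-stage scheme. `qd.simt.subgroup.reduce_add` is the warp reduction named above; `qd.simt.subgroup.thread_rank()`, used here to elect one lane per warp, is a hypothetical name for a subgroup lane-ID query (see [subgroup](subgroup.md) for what the namespace actually exports):
+
+```python
+@qd.kernel
+def sum_all(vals: qd.types.NDArray[qd.f32, 1],
+            total: qd.types.NDArray[qd.f32, 1]) -> None:
+    for i in range(vals.shape[0]):
+        # Stage 1: aggregate across the warp in registers; no memory
+        # traffic, no contention.
+        warp_sum = qd.simt.subgroup.reduce_add(vals[i])
+        # Stage 2: one atomic per warp instead of one per thread.
+        if qd.simt.subgroup.thread_rank() == 0:  # hypothetical lane-ID helper
+            qd.atomic_add(total[0], warp_sum)
+```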
+
+## Related
+
+- [math](math.md) — `qd.math.*`, including the bit-counting helpers (`popcnt`, `clz`) commonly paired with atomics in select / compact patterns.
+- `qd.simt.block.*` — block-scope barriers and memory fences (`qd.simt.block.mem_sync()`).
+- `qd.simt.subgroup.*` — warp-scope reductions and shuffles, the recommended pre-aggregation step before an atomic.
+- `qd.simt.grid.*` — device-scope memory fence (`qd.simt.grid.memfence()`).
+- [parallelization](parallelization.md) — thread-synchronization patterns and how atomics fit into the broader synchronization story.
diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md
index f783e84264..3fab71fd50 100644
--- a/docs/source/user_guide/index.md
+++ b/docs/source/user_guide/index.md
@@ -47,6 +47,8 @@ autodiff
 :maxdepth: 1
 :titlesonly:
 
+atomics
+math
 subgroup
 tile16
 ```
diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md
new file mode 100644
index 0000000000..088fa9f08f
--- /dev/null
+++ b/docs/source/user_guide/math.md
@@ -0,0 +1,61 @@
+# Math
+
+`qd.math` is the quadrants standard library of math helpers.
+
+This page currently documents only the bit-counting helpers. The broader `qd.math` surface is exported and usable today but is not yet documented here.
+
+## Bit operations
+
+Single-thread, integer-register operations. They do not access memory and do not synchronize threads — each thread independently transforms a value in its own register.
+
+| Op                  | CUDA                | AMDGPU                       | SPIR-V (Vulkan / Metal)                                     |
+|---------------------|---------------------|------------------------------|-------------------------------------------------------------|
+| `qd.math.popcnt(x)` | i32, u32, i64, u64  | unsupported (codegen FIXME)  | any int (`OpBitCount`)                                      |
+| `qd.math.clz(x)`    | i32, i64 only \*    | unsupported (codegen FIXME)  | 32-bit only (`FindMSB`); 64-bit input is silently truncated |
+
+\* On CUDA, `qd.math.clz` rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED` in `quadrants/codegen/cuda/codegen_cuda.cpp`); as a workaround, `bit_cast` through the matching signed type: `qd.math.clz(qd.bit_cast(x, qd.i32))`. CUDA `popcnt` accepts u32 / u64 directly; only `clz` has the signed-only restriction. On unsupported integer widths (e.g. `i8`, `i16`, `u16`) both ops also hit `QD_NOT_IMPLEMENTED`.
+
+**FIXME (AMDGPU):** the AMDGPU `emit_extra_unary` override (`quadrants/codegen/amdgpu/codegen_amdgpu.cpp`) has no `popcnt` or `clz` branch; both fall through to `QD_NOT_IMPLEMENTED`. The test suite already records this (`tests/python/test_unary_ops.py::test_popcnt` and `::test_clz` both `xfail` on AMDGPU). Until lowerings are added, AMDGPU users hit a hard codegen failure.
+
+The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32 on CUDA, the `bit_cast` workaround above is the canonical approach.
+
+### `qd.math.popcnt(x)`
+
+Counts set bits in `x` and returns an `i32`. On CUDA, lowers to `__nv_popc` for 32-bit inputs and `__nv_popcll` for 64-bit inputs (i32 / u32 / i64 / u64 only; narrower widths and AMDGPU are unsupported). On SPIR-V, lowers to `OpBitCount`.
+
+### `qd.math.clz(x)`
+
+Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. On CUDA, lowers to `__nv_clz` (i32 only) or `__nv_clzll` (i64 only); u32 / u64 must be `bit_cast` to the matching signed type. On SPIR-V, lowers to `FindMSB`, computing `bitwidth - 1 - FindMSB(x)` to convert the MSB index into a leading-zero count; the implementation is hard-coded to 32-bit, so 64-bit input is silently truncated. AMDGPU is unsupported. See the cross-backend caveats in the support table.
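+
+A sketch of the CUDA workaround wrapped as a reusable helper; the name `clz_u32` is ours, and the `@qd.func` usage mirrors the `msb` example below:
+
+```python
+@qd.func
+def clz_u32(x: qd.u32) -> qd.i32:
+    # Reinterpret the bits as i32 (no value conversion), then count.
+    # Safe because clz only inspects the bit pattern, which bit_cast
+    # preserves exactly.
+    return qd.math.clz(qd.bit_cast(x, qd.i32))
+```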
+
+## Examples
+
+### Bitset population count
+
+```python
+@qd.kernel
+def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None:
+    # Accumulate in a register, then publish with a single atomic.
+    n = 0
+    for i in range(masks.shape[0]):
+        n += qd.math.popcnt(masks[i])
+    qd.atomic_add(total[0], n)
+```
+
+### Highest set bit (Morton-code depth)
+
+```python
+@qd.func
+def msb(x: qd.i32) -> qd.i32:
+    # Index of the highest set bit; yields -1 for x == 0,
+    # since clz(0) == 32 and 31 - 32 == -1.
+    return 31 - qd.math.clz(x)
+```
+
+For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`.
+
+## Performance and portability notes
+
+- `qd.math.popcnt` is supported on CUDA (i32 / u32 / i64 / u64) and SPIR-V (any integer width). AMDGPU is unsupported (see the FIXME above).
+- `qd.math.clz` has the dtype and backend caveats noted above. Code that depends on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability on CUDA, and avoid 64-bit input on SPIR-V.
+
+## Related
+
+- [atomics](atomics.md) — atomic read-modify-write operations on global / shared memory; commonly paired with bit-counting in select / compact patterns.
+- `qd.bit_cast` — reinterprets a value's bit pattern as another dtype; used as the workaround for the `clz` u32 / u64 caveats above.
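+
+The select / compact pairing mentioned in both Related lists, in its smallest form: `popcnt` sizes each mask's expansion, and a single `atomic_add` reserves that many contiguous output slots. A sketch; the kernel and array names are illustrative:
+
+```python
+@qd.kernel
+def reserve_slots(masks: qd.types.NDArray[qd.u32, 1],
+                  offsets: qd.types.NDArray[qd.i32, 1],
+                  cursor: qd.types.NDArray[qd.i32, 1]) -> None:
+    for i in range(masks.shape[0]):
+        # popcnt tells us how many items mask i expands to; the atomic
+        # add both reserves that many contiguous slots and records
+        # where they start (it returns the old cursor value).
+        offsets[i] = qd.atomic_add(cursor[0], qd.math.popcnt(masks[i]))
+```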