From 96eaa1904b25b8a50f2eb9fa0658229af6bf7a8a Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 00:41:39 -0700
Subject: [PATCH 1/7] [Docs] Add user-guide page for atomics and bit operations

Documents qd.atomic_* (add, sub, mul, min, max, and, or, xor) and the
bit-counting helpers qd.math.popcnt / qd.math.clz. Covers semantics
(returns old value, per-location atomicity, no fence on the rest of
memory), backend / dtype support, performance pitfalls (contention, f64
fast path, atomic_mul CAS loop), and worked examples (slot reservation,
histogram, popcnt sum, MSB). Adds atomics.md to the SIMT-primitives
toctree in index.md.
---
 docs/source/user_guide/atomics.md | 145 ++++++++++++++++++++++++++++++
 docs/source/user_guide/index.md   |   1 +
 2 files changed, 146 insertions(+)
 create mode 100644 docs/source/user_guide/atomics.md

diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md
new file mode 100644
index 0000000000..e9257a2351
--- /dev/null
+++ b/docs/source/user_guide/atomics.md
@@ -0,0 +1,145 @@
+# Atomics and bit operations
+
+Per-thread primitives — operations a single thread executes on its own registers or on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces.
+
+This page covers two groups: atomic read-modify-write operations on global / shared memory (`qd.atomic_*`), and bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`).
+
+## What's available
+
+### Atomics
+
+All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y)` atomically and returns the **old** value of `x`. `x` must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed.
+
+| Op             | Semantics                              | i32 | u32 | i64 | u64 | f32 | f64 |
+|----------------|----------------------------------------|-----|-----|-----|-----|-----|-----|
+| `atomic_add`   | `x += y`                               | yes | yes | yes | yes | yes | \* |
+| `atomic_sub`   | `x -= y`                               | yes | yes | yes | yes | yes | \* |
+| `atomic_mul`   | `x *= y`                               | yes | yes | yes | yes | yes | \* |
+| `atomic_min`   | `x = min(x, y)`                        | yes | yes | yes | yes | yes | \* |
+| `atomic_max`   | `x = max(x, y)`                        | yes | yes | yes | yes | yes | \* |
+| `atomic_and`   | `x &= y`                               | yes | yes | yes | yes | — | — |
+| `atomic_or`    | `x \|= y`                              | yes | yes | yes | yes | — | — |
+| `atomic_xor`   | `x ^= y`                               | yes | yes | yes | yes | — | — |
+
+\* `f64` atomic add / sub / mul / min / max is hardware-dependent: `add` is native on CUDA sm_60+; the other ops, and other targets, fall back to a CAS loop where the backend lowers one, and raise at codegen time where it does not (including CUDA targets older than sm_60). Prefer `f32` on hot paths if portability matters.
+
+There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.
+
+All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
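+
+A minimal sketch of the fence-pair publish pattern described above, assuming a hypothetical single-producer queue (`head`, `payload`, `ready`, and `CAP` are illustrative names, not library API):
+
+```python
+head = qd.field(qd.i32, shape=())         # next free slot
+payload = qd.field(qd.f32, shape=(CAP,))  # data being published; CAP assumed defined
+ready = qd.field(qd.i32, shape=(CAP,))    # 0 = empty, 1 = published
+
+@qd.kernel
+def publish(values: qd.types.NDArray[qd.f32, 1]) -> None:
+    for i in range(values.shape[0]):
+        slot = qd.atomic_add(head[None], 1)  # reserve a unique slot (old value)
+        payload[slot] = values[i]            # plain store: not ordered by the atomic alone
+        qd.simt.grid.memfence()              # make the store visible device-wide...
+        qd.atomic_or(ready[slot], 1)         # ...before the flag can be observed
+```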
+ +### Bit-counting helpers + +| Op | What it returns | i32 | u32 | i64 | u64 | +|---------------------|----------------------------------------------|-----|-----|-----|-----| +| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | +| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | + +\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. + +The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. + +## Semantics + +### `qd.atomic_add(x, y)` — and the rest of the family + +```python +old = qd.atomic_add(x, y) +# Effect: +# tmp = load(x) +# store(x, op(tmp, y)) +# old = tmp +# all three steps execute as a single atomic transaction on x. +``` + +Properties common to every `qd.atomic_*`: + +- **Returns the old value**, not the new one. This matches CUDA's `atomicAdd` and is what enables building reservation patterns: `slot = qd.atomic_add(counter, 1)` gives every thread a unique index. +- **Per-location atomicity, no fence on the rest of memory.** Writes you issued before an atomic on `x` are not necessarily visible to other threads after they observe the new `x`. Pair the atomic with `qd.simt.block.mem_sync()` or `qd.simt.grid.memfence()` if you need that ordering. +- **Vector / matrix arguments fan out element-wise.** `qd.atomic_add(field_of_vec3, qd.Vector([1.0, 2.0, 3.0]))` issues three independent scalar atomic-adds, one per component. There is no all-or-nothing guarantee across the components. + +### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)` + +Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max follow IEEE rules — `NaN` propagates: if either input is `NaN`, the result is `NaN`. + +### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)` + +Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type error at trace time. + +### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)` + +Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer reducing to a different scheme on hot paths. + +### `qd.math.popcnt(x)` + +Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. + +### `qd.math.clz(x)` + +Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. 
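+
+As a quick sanity check of the definitions above (`check_bits` is an illustrative kernel, not library API):
+
+```python
+@qd.kernel
+def check_bits(out: qd.types.NDArray[qd.i32, 1]) -> None:
+    out[0] = qd.math.popcnt(qd.i32(0b1011))  # 3: three bits set
+    out[1] = qd.math.clz(qd.i32(1))          # 31: only bit 0 is set
+    out[2] = qd.math.clz(qd.i32(0))          # 32, by the definition above
+```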
+ +## Examples + +### Reserving a slot in an output array + +```python +counter = qd.field(qd.i32, shape=()) +output = qd.field(qd.f32, shape=(MAX_OUTPUTS,)) + +@qd.kernel +def emit(values: qd.types.NDArray[qd.f32, 1], threshold: qd.f32) -> None: + for i in range(values.shape[0]): + if values[i] > threshold: + slot = qd.atomic_add(counter[None], 1) + output[slot] = values[i] +``` + +Every thread that passes the predicate gets a unique `slot` from the counter. The pattern is the workhorse of select / compact and contact-pair generation. + +### Histogram + +```python +hist = qd.field(qd.i32, shape=(NBINS,)) + +@qd.kernel +def histogram(samples: qd.types.NDArray[qd.f32, 1]) -> None: + for i in range(samples.shape[0]): + b = qd.i32(samples[i] * NBINS) + if 0 <= b < NBINS: + qd.atomic_add(hist[b], 1) +``` + +### Bitset population count + +```python +@qd.kernel +def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None: + n = 0 + for i in range(masks.shape[0]): + n += qd.math.popcnt(masks[i]) + qd.atomic_add(total[0], n) +``` + +### Highest set bit (Morton-code depth) + +```python +@qd.func +def msb(x: qd.i32) -> qd.i32: + return 31 - qd.math.clz(x) +``` + +For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. + +## Performance and portability notes + +- **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block. +- **Pair atomics with the right fence scope.** A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block-scope (`qd.simt.block.mem_sync()`) for shared-memory publishing, or grid-scope (`qd.simt.grid.memfence()`) for cross-block coordination. +- **`f64` atomics fall off the fast path** on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block. +- **`atomic_mul` is generally a CAS loop** under the hood; don't put it on the hot path. +- **Bit-trick portability is uneven.** `qd.math.popcnt` is fully cross-backend; `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. + +## Related + +- `qd.simt.block.*` — block-scope barriers and memory fences (`qd.simt.block.mem_sync()`). +- `qd.simt.subgroup.*` — warp-scope reductions and shuffles, the recommended pre-aggregation step before an atomic. +- `qd.simt.grid.*` — device-scope memory fence (`qd.simt.grid.memfence()`). +- [parallelization](parallelization.md) — thread-synchronization patterns and how atomics fit into the broader synchronization story. 
diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md index f783e84264..95467a5728 100644 --- a/docs/source/user_guide/index.md +++ b/docs/source/user_guide/index.md @@ -47,6 +47,7 @@ autodiff :maxdepth: 1 :titlesonly: +atomics subgroup tile16 ``` From 7bc43c0f0de0fc6290f24c1ec8086b310571fd6c Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:31:47 -0700 Subject: [PATCH 2/7] [Docs] Split atomics.md into atomics.md + math.md The previous single page covered two unrelated topics: the qd.atomic_* read-modify-write ops, and the qd.math.popcnt / qd.math.clz bit-counting helpers. They share no concerns (memory coordination vs single-thread register ops) and live in different namespaces. Split into: - atomics.md -- title shortened to "Atomics". Drops the bit-counting table, the popcnt / clz semantics subsections, the bitset-popcount and Morton-MSB examples, and the bit-trick portability bullet. Adds a one-line cross-reference to math.md in the intro and the Related section. - math.md (new) -- framed as a qd.math.* page that currently documents only the bit operations. Lists the broader qd.math surface (vec / mat / GLSL ops / trig / rotations / float-classification / constants) in the intro per the project's own python/quadrants/math/ __init__.py docstring, and notes explicitly that those are exported but not yet documented here. Reuses the bit-counting table, semantics, examples, and portability notes verbatim from the previous atomics.md. Toctree gains a "math" entry next to "atomics" under the SIMT-primitives caption, preserving the existing placement of atomics rather than re-categorising as part of this split. --- docs/source/user_guide/atomics.md | 50 ++------------------------ docs/source/user_guide/index.md | 1 + docs/source/user_guide/math.md | 59 +++++++++++++++++++++++++++++++ 3 files changed, 63 insertions(+), 47 deletions(-) create mode 100644 docs/source/user_guide/math.md diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index e9257a2351..ff13288dad 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -1,13 +1,9 @@ -# Atomics and bit operations +# Atomics -Per-thread primitives — operations a single thread executes on its own registers or on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces. - -This page covers two groups: atomic read-modify-write operations on global / shared memory (`qd.atomic_*`), and bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`). +Atomic read-modify-write operations on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces. Bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`) are documented in [math](math.md). ## What's available -### Atomics - All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y)` atomically and returns the **old** value of `x`. `x` must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed. 
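+
+For instance (a sketch; `f` is an illustrative 1-d `i32` field, not library API):
+
+```python
+old = qd.atomic_add(f[i], 2)   # f[i] becomes old + 2; `old` is the value before the update
+# qd.atomic_add(qd.i32(3), 1)  # not allowed: a constant is not a writable memory target
+```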
| Op | Semantics | i32 | u32 | i64 | u64 | f32 | f64 | @@ -27,17 +23,6 @@ There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ run All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). -### Bit-counting helpers - -| Op | What it returns | i32 | u32 | i64 | u64 | -|---------------------|----------------------------------------------|-----|-----|-----|-----| -| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | -| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | - -\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. - -The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. - ## Semantics ### `qd.atomic_add(x, y)` — and the rest of the family @@ -69,14 +54,6 @@ Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type err Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer reducing to a different scheme on hot paths. -### `qd.math.popcnt(x)` - -Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. - -### `qd.math.clz(x)` - -Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. - ## Examples ### Reserving a slot in an output array @@ -108,37 +85,16 @@ def histogram(samples: qd.types.NDArray[qd.f32, 1]) -> None: qd.atomic_add(hist[b], 1) ``` -### Bitset population count - -```python -@qd.kernel -def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None: - n = 0 - for i in range(masks.shape[0]): - n += qd.math.popcnt(masks[i]) - qd.atomic_add(total[0], n) -``` - -### Highest set bit (Morton-code depth) - -```python -@qd.func -def msb(x: qd.i32) -> qd.i32: - return 31 - qd.math.clz(x) -``` - -For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. - ## Performance and portability notes - **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block. 
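+
+  A sketch of the two-stage scheme (`qd.simt.subgroup.reduce_add` is named above, but its exact signature, the election helper `qd.simt.subgroup.elect()`, and the 0-d `counter` field are assumptions for illustration):
+
+  ```python
+  total = qd.simt.subgroup.reduce_add(qd.i32(1))  # per-warp sum of contributions
+  if qd.simt.subgroup.elect():                    # hypothetical: true on exactly one lane
+      qd.atomic_add(counter[None], total)         # one atomic per warp, not per thread
+  ```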
- **Pair atomics with the right fence scope.** A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block-scope (`qd.simt.block.mem_sync()`) for shared-memory publishing, or grid-scope (`qd.simt.grid.memfence()`) for cross-block coordination. - **`f64` atomics fall off the fast path** on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block. - **`atomic_mul` is generally a CAS loop** under the hood; don't put it on the hot path. -- **Bit-trick portability is uneven.** `qd.math.popcnt` is fully cross-backend; `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. ## Related +- [math](math.md) — `qd.math.*`, including the bit-counting helpers (`popcnt`, `clz`) commonly paired with atomics in select / compact patterns. - `qd.simt.block.*` — block-scope barriers and memory fences (`qd.simt.block.mem_sync()`). - `qd.simt.subgroup.*` — warp-scope reductions and shuffles, the recommended pre-aggregation step before an atomic. - `qd.simt.grid.*` — device-scope memory fence (`qd.simt.grid.memfence()`). diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md index 95467a5728..3fab71fd50 100644 --- a/docs/source/user_guide/index.md +++ b/docs/source/user_guide/index.md @@ -48,6 +48,7 @@ autodiff :titlesonly: atomics +math subgroup tile16 ``` diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md new file mode 100644 index 0000000000..ea013c8ffc --- /dev/null +++ b/docs/source/user_guide/math.md @@ -0,0 +1,59 @@ +# Math + +`qd.math` is the quadrants standard library of math helpers. Per the module docstring it "supports glsl-style vectors, matrices and functions"; in practice the surface includes vector / matrix constructors (`vec2`, `vec3`, `vec4`, `mat2`, `mat3`, `mat4`, etc.), GLSL-style scalar / vector ops (`mix`, `clamp`, `step`, `smoothstep`, `sign`, `normalize`, `length`, `distance`, `dot`, `cross`, `reflect`, `refract`, `mod`), trig (`sin`, `cos`, `tan`, `asin`, `acos`, `atan2`, `tanh`, `exp`, `log`, `log2`, `pow`, `sqrt`, `floor`, `ceil`, `round`), rotation builders (`rot_by_axis`, `rot_yaw_pitch_roll`, `rotation2d`, `rotation3d`, `translate`, `scale`, `eye`), float-classification helpers (`isinf`, `isnan`), the constants `e`, `pi`, `inf`, `nan`, and the integer bit-counting helpers documented below. + +This page currently documents only the bit-counting helpers. The broader `qd.math` surface is exported and usable today but is not yet documented here. + +## Bit operations + +Single-thread integer-register operations. They do not access memory and do not synchronize threads — each thread independently transforms a value in its own register. + +| Op | What it returns | i32 | u32 | i64 | u64 | +|---------------------|----------------------------------------------|-----|-----|-----|-----| +| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | +| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | + +\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. 
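+
+A minimal portability shim for the u32 case, using only the `bit_cast` workaround named above (`clz_u32` is an illustrative helper, not library API):
+
+```python
+@qd.func
+def clz_u32(x: qd.u32) -> qd.i32:
+    # Reinterpreting the bits as i32 leaves the bit pattern unchanged,
+    # so the leading-zero count is identical and CUDA accepts the input.
+    return qd.math.clz(qd.bit_cast(x, qd.i32))
+```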
+ +The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. + +### `qd.math.popcnt(x)` + +Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. + +### `qd.math.clz(x)` + +Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. + +## Examples + +### Bitset population count + +```python +@qd.kernel +def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None: + n = 0 + for i in range(masks.shape[0]): + n += qd.math.popcnt(masks[i]) + qd.atomic_add(total[0], n) +``` + +### Highest set bit (Morton-code depth) + +```python +@qd.func +def msb(x: qd.i32) -> qd.i32: + return 31 - qd.math.clz(x) +``` + +For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. + +## Performance and portability notes + +- `qd.math.popcnt` is fully cross-backend. +- `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. + +## Related + +- [atomics](atomics.md) — atomic read-modify-write operations on global / shared memory; commonly paired with bit-counting in select / compact patterns. +- `qd.bit_cast` — reinterprets a value's bit pattern as another dtype, used as a workaround for the `clz` u32 / u64 caveats above. From 5e2877262760f4e457e81f1783df0acd51d5b172 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:35:20 -0700 Subject: [PATCH 3/7] [Docs] Reword atomic-ops scope sentence in atomics.md Change "All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (qd.simt.block.SharedArray)." to "All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (qd.simt.block.SharedArray)." The earlier wording read as "they work in both places", which is true but a little off-target; "can be called on either" matches how a user picks one memory kind per call site. --- docs/source/user_guide/atomics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index ff13288dad..3448e84287 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -21,7 +21,7 @@ All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`. -All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). 
+All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). ## Semantics From e3f89db3c39e2404cceb85e788584ca9fccd9917 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:37:04 -0700 Subject: [PATCH 4/7] [Docs] Drop Examples section from atomics.md Removes "## Examples" and its two subsections ("Reserving a slot in an output array" and "Histogram"). No other content changes; no in-page anchors referenced the dropped section. --- docs/source/user_guide/atomics.md | 31 ------------------------------- 1 file changed, 31 deletions(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index 3448e84287..309117ffb1 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -54,37 +54,6 @@ Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type err Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer reducing to a different scheme on hot paths. -## Examples - -### Reserving a slot in an output array - -```python -counter = qd.field(qd.i32, shape=()) -output = qd.field(qd.f32, shape=(MAX_OUTPUTS,)) - -@qd.kernel -def emit(values: qd.types.NDArray[qd.f32, 1], threshold: qd.f32) -> None: - for i in range(values.shape[0]): - if values[i] > threshold: - slot = qd.atomic_add(counter[None], 1) - output[slot] = values[i] -``` - -Every thread that passes the predicate gets a unique `slot` from the counter. The pattern is the workhorse of select / compact and contact-pair generation. - -### Histogram - -```python -hist = qd.field(qd.i32, shape=(NBINS,)) - -@qd.kernel -def histogram(samples: qd.types.NDArray[qd.f32, 1]) -> None: - for i in range(samples.shape[0]): - b = qd.i32(samples[i] * NBINS) - if 0 <= b < NBINS: - qd.atomic_add(hist[b], 1) -``` - ## Performance and portability notes - **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block. From 74ff985659c396af21b2a025ee57f6c1285ec775 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:38:56 -0700 Subject: [PATCH 5/7] [Docs] Drop enumerative qd.math surface sentence from math.md intro Removes the second sentence of the math.md intro, which listed every category of qd.math.* helpers (vec / mat constructors, GLSL-style ops, trig, rotation builders, float-classification, constants). The remaining intro is just "qd.math is the quadrants standard library of math helpers", followed by the existing note that this page currently documents only the bit-counting helpers. 
--- docs/source/user_guide/math.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md index ea013c8ffc..cf4eecb568 100644 --- a/docs/source/user_guide/math.md +++ b/docs/source/user_guide/math.md @@ -1,6 +1,6 @@ # Math -`qd.math` is the quadrants standard library of math helpers. Per the module docstring it "supports glsl-style vectors, matrices and functions"; in practice the surface includes vector / matrix constructors (`vec2`, `vec3`, `vec4`, `mat2`, `mat3`, `mat4`, etc.), GLSL-style scalar / vector ops (`mix`, `clamp`, `step`, `smoothstep`, `sign`, `normalize`, `length`, `distance`, `dot`, `cross`, `reflect`, `refract`, `mod`), trig (`sin`, `cos`, `tan`, `asin`, `acos`, `atan2`, `tanh`, `exp`, `log`, `log2`, `pow`, `sqrt`, `floor`, `ceil`, `round`), rotation builders (`rot_by_axis`, `rot_yaw_pitch_roll`, `rotation2d`, `rotation3d`, `translate`, `scale`, `eye`), float-classification helpers (`isinf`, `isnan`), the constants `e`, `pi`, `inf`, `nan`, and the integer bit-counting helpers documented below. +`qd.math` is the quadrants standard library of math helpers. This page currently documents only the bit-counting helpers. The broader `qd.math` surface is exported and usable today but is not yet documented here. From fef151022159d4eea1a9f37ee37c92bc77025c41 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:56:22 -0700 Subject: [PATCH 6/7] [Docs] atomics/math: address PR #640 review (AMDGPU support, dtype scope, NaN semantics) --- docs/source/user_guide/atomics.md | 9 ++++++++- docs/source/user_guide/math.md | 22 ++++++++++++---------- 2 files changed, 20 insertions(+), 11 deletions(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index 309117ffb1..a3689c6372 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -23,6 +23,13 @@ There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ run All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). +**Backend caveat for the fence-pair pattern.** Both fence helpers have current portability gaps that affect the patterns recommended on this page: + +- `qd.simt.block.mem_sync()` is supported on CUDA and SPIR-V; on AMDGPU it raises `ValueError("qd.block.mem_sync is not supported for arch ...")` at trace time. +- `qd.simt.grid.memfence()` is fully implemented only on CUDA. On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. See [grid](grid.md) for the per-backend details. + +On AMDGPU specifically, neither fence-pair recipe works as documented yet; cross-platform code that needs an atomic plus a fence must restructure around the kernel-launch boundary or be CUDA-bound until the AMDGPU lowerings land. + ## Semantics ### `qd.atomic_add(x, y)` — and the rest of the family @@ -44,7 +51,7 @@ Properties common to every `qd.atomic_*`: ### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)` -Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. 
Floating-point min/max follow IEEE rules — `NaN` propagates: if either input is `NaN`, the result is `NaN`. +Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max use **`minNum` / `maxNum`-style** semantics: if exactly one input is `NaN`, the **non-`NaN`** value is written back. This matches the f16 path's use of LLVM `llvm.minnum` / `llvm.maxnum` intrinsics (`quadrants/codegen/llvm/codegen_llvm.cpp:1337-1342`) and the GPU-native paths (CUDA sm_80+ `atomicMin`/`atomicMax` for floats, SPIR-V `FMin` / `FMax`). The f32 / f64 CPU CAS-loop path (`quadrants/runtime/llvm/runtime_module/atomic.h::min_f32` / `max_f32`) uses naive `<` / `>` comparisons, which give asymmetric NaN behaviour depending on operand order — do not rely on a particular result when either input is `NaN` on the CPU backend. Behaviour when *both* inputs are `NaN` is backend-dependent across the board. ### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)` diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md index cf4eecb568..088fa9f08f 100644 --- a/docs/source/user_guide/math.md +++ b/docs/source/user_guide/math.md @@ -8,22 +8,24 @@ This page currently documents only the bit-counting helpers. The broader `qd.mat Single-thread integer-register operations. They do not access memory and do not synchronize threads — each thread independently transforms a value in its own register. -| Op | What it returns | i32 | u32 | i64 | u64 | -|---------------------|----------------------------------------------|-----|-----|-----|-----| -| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | -| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | +| Op | CUDA | AMDGPU | SPIR-V (Vulkan / Metal) | +|---------------------|---------------------|------------------------------|----------------------------------------------------| +| `qd.math.popcnt(x)` | i32, u32, i64, u64 | unsupported (codegen FIXME) | any int (`OpBitCount`) | +| `qd.math.clz(x)` | i32, i64 only \* | unsupported (codegen FIXME) | 32-bit only (`FindMSB`); 64-bit input is silently truncated | -\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. +\* On CUDA, `qd.math.clz` rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED` in `quadrants/codegen/cuda/codegen_cuda.cpp`); `bit_cast` through the matching signed type as a workaround: `qd.math.clz(qd.bit_cast(x, qd.i32))`. CUDA `popcnt` accepts u32 / u64 directly; only `clz` has the signed-only restriction. On unsupported integer widths (e.g. `i8`, `i16`, `u16`) both ops also hit `QD_NOT_IMPLEMENTED`. -The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. +**FIXME (AMDGPU):** the AMDGPU `emit_extra_unary` override (`quadrants/codegen/amdgpu/codegen_amdgpu.cpp`) has no `popcnt` or `clz` branch; both fall through to `QD_NOT_IMPLEMENTED`. The test suite already records this (`tests/python/test_unary_ops.py::test_popcnt` and `::test_clz` both `xfail` on AMDGPU). Until lowerings are added, AMDGPU users hit a hard codegen failure. 
+ +The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32 on CUDA, the `bit_cast` workaround above is the canonical approach. ### `qd.math.popcnt(x)` -Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. +Counts set bits in `x` and returns an `i32`. On CUDA, lowers to `__nv_popc` for 32-bit inputs and `__nv_popcll` for 64-bit inputs (i32 / u32 / i64 / u64 only; narrower widths and AMDGPU are unsupported). On SPIR-V, lowers to `OpBitCount`. ### `qd.math.clz(x)` -Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. +Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. On CUDA, lowers to `__nv_clz` (i32 only) and `__nv_clzll` (i64 only); u32 / u64 must be `bit_cast` to the matching signed type. On SPIR-V, lowers to `FindMSB` with `bitwidth - 1 - FindMSB` to convert MSB index into a leading-zero count; the implementation is hard-coded to 32-bit, so 64-bit input silently truncates. AMDGPU is unsupported. See the cross-backend caveats in the support table. ## Examples @@ -50,8 +52,8 @@ For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. ## Performance and portability notes -- `qd.math.popcnt` is fully cross-backend. -- `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. +- `qd.math.popcnt` is supported on CUDA (i32 / u32 / i64 / u64) and SPIR-V (any integer width). AMDGPU is unsupported (FIXME above). +- `qd.math.clz` has the dtype and backend caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability on CUDA, and avoid 64-bit input on SPIR-V. ## Related From 438cfcb2ad74ffea63647b4da484c095fb15042f Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Fri, 8 May 2026 07:59:31 -0700 Subject: [PATCH 7/7] [Docs] atomics: drop link to grid.md (file lives on hp/docs-grid, not main yet) --- docs/source/user_guide/atomics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index a3689c6372..0ca5ed71ec 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -26,7 +26,7 @@ All atomic ops can be called on either global memory (fields, ndarrays) or block **Backend caveat for the fence-pair pattern.** Both fence helpers have current portability gaps that affect the patterns recommended on this page: - `qd.simt.block.mem_sync()` is supported on CUDA and SPIR-V; on AMDGPU it raises `ValueError("qd.block.mem_sync is not supported for arch ...")` at trace time. -- `qd.simt.grid.memfence()` is fully implemented only on CUDA. On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. See [grid](grid.md) for the per-backend details. +- `qd.simt.grid.memfence()` is fully implemented only on CUDA. 
On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. On AMDGPU specifically, neither fence-pair recipe works as documented yet; cross-platform code that needs an atomic plus a fence must restructure around the kernel-launch boundary or be CUDA-bound until the AMDGPU lowerings land.