From 96eaa1904b25b8a50f2eb9fa0658229af6bf7a8a Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 00:41:39 -0700
Subject: [PATCH 1/7] [Docs] Add user-guide page for atomics and bit operations

Documents qd.atomic_* (add, sub, mul, min, max, and, or, xor) and the
bit-counting helpers qd.math.popcnt / qd.math.clz. Covers semantics
(returns old value, per-location atomicity, no fence on the rest of
memory), backend / dtype support, performance pitfalls (contention, f64
fast path, atomic_mul CAS loop), and worked examples (slot reservation,
histogram, popcnt sum, MSB). Adds atomics.md to the SIMT-primitives
toctree in index.md.
---
 docs/source/user_guide/atomics.md | 145 ++++++++++++++++++++++++++++++
 docs/source/user_guide/index.md   |   1 +
 2 files changed, 146 insertions(+)
 create mode 100644 docs/source/user_guide/atomics.md

diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md
new file mode 100644
index 0000000000..e9257a2351
--- /dev/null
+++ b/docs/source/user_guide/atomics.md
@@ -0,0 +1,145 @@
+# Atomics and bit operations
+
+Per-thread primitives — operations a single thread executes on its own registers or on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces.
+
+This page covers two groups: atomic read-modify-write operations on global / shared memory (`qd.atomic_*`), and bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`).
+
+## What's available
+
+### Atomics
+
+All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y)` atomically and returns the **old** value of `x`. `x` must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed.
+
+| Op             | Semantics                              | i32 | u32 | i64 | u64 | f32 | f64 |
+|----------------|----------------------------------------|-----|-----|-----|-----|-----|-----|
+| `atomic_add`   | `x += y`                               | yes | yes | yes | yes | yes | \* |
+| `atomic_sub`   | `x -= y`                               | yes | yes | yes | yes | yes | \* |
+| `atomic_mul`   | `x *= y`                               | yes | yes | yes | yes | yes | \* |
+| `atomic_min`   | `x = min(x, y)`                        | yes | yes | yes | yes | yes | \* |
+| `atomic_max`   | `x = max(x, y)`                        | yes | yes | yes | yes | yes | \* |
+| `atomic_and`   | `x &= y`                               | yes | yes | yes | yes | — | — |
+| `atomic_or`    | `x \|= y`                              | yes | yes | yes | yes | — | — |
+| `atomic_xor`   | `x ^= y`                               | yes | yes | yes | yes | — | — |
+
+\* `f64` atomic add / sub / mul / min / max is hardware-dependent: `add` is native on CUDA sm_60+; the other ops, and other targets, fall back to a CAS loop where the backend lowers one, and raise at codegen time where it does not (including CUDA targets older than sm_60). Prefer `f32` on hot paths if portability matters.
+
+There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.
+
+All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
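+
+A minimal sketch of the fence-pair publish pattern described above, assuming a hypothetical single-producer queue (`head`, `payload`, `ready`, and `CAP` are illustrative names, not library API):
+
+```python
+head = qd.field(qd.i32, shape=())         # next free slot
+payload = qd.field(qd.f32, shape=(CAP,))  # data being published; CAP assumed defined
+ready = qd.field(qd.i32, shape=(CAP,))    # 0 = empty, 1 = published
+
+@qd.kernel
+def publish(values: qd.types.NDArray[qd.f32, 1]) -> None:
+    for i in range(values.shape[0]):
+        slot = qd.atomic_add(head[None], 1)  # reserve a unique slot (old value)
+        payload[slot] = values[i]            # plain store: not ordered by the atomic alone
+        qd.simt.grid.memfence()              # make the store visible device-wide...
+        qd.atomic_or(ready[slot], 1)         # ...before the flag can be observed
+```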
+ +### Bit-counting helpers + +| Op | What it returns | i32 | u32 | i64 | u64 | +|---------------------|----------------------------------------------|-----|-----|-----|-----| +| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | +| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | + +\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. + +The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. + +## Semantics + +### `qd.atomic_add(x, y)` — and the rest of the family + +```python +old = qd.atomic_add(x, y) +# Effect: +# tmp = load(x) +# store(x, op(tmp, y)) +# old = tmp +# all three steps execute as a single atomic transaction on x. +``` + +Properties common to every `qd.atomic_*`: + +- **Returns the old value**, not the new one. This matches CUDA's `atomicAdd` and is what enables building reservation patterns: `slot = qd.atomic_add(counter, 1)` gives every thread a unique index. +- **Per-location atomicity, no fence on the rest of memory.** Writes you issued before an atomic on `x` are not necessarily visible to other threads after they observe the new `x`. Pair the atomic with `qd.simt.block.mem_sync()` or `qd.simt.grid.memfence()` if you need that ordering. +- **Vector / matrix arguments fan out element-wise.** `qd.atomic_add(field_of_vec3, qd.Vector([1.0, 2.0, 3.0]))` issues three independent scalar atomic-adds, one per component. There is no all-or-nothing guarantee across the components. + +### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)` + +Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max follow IEEE rules — `NaN` propagates: if either input is `NaN`, the result is `NaN`. + +### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)` + +Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type error at trace time. + +### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)` + +Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer reducing to a different scheme on hot paths. + +### `qd.math.popcnt(x)` + +Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. + +### `qd.math.clz(x)` + +Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. 
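+
+As a quick sanity check of the definitions above (`check_bits` is an illustrative kernel, not library API):
+
+```python
+@qd.kernel
+def check_bits(out: qd.types.NDArray[qd.i32, 1]) -> None:
+    out[0] = qd.math.popcnt(qd.i32(0b1011))  # 3: three bits set
+    out[1] = qd.math.clz(qd.i32(1))          # 31: only bit 0 is set
+    out[2] = qd.math.clz(qd.i32(0))          # 32, by the definition above
+```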
+ +## Examples + +### Reserving a slot in an output array + +```python +counter = qd.field(qd.i32, shape=()) +output = qd.field(qd.f32, shape=(MAX_OUTPUTS,)) + +@qd.kernel +def emit(values: qd.types.NDArray[qd.f32, 1], threshold: qd.f32) -> None: + for i in range(values.shape[0]): + if values[i] > threshold: + slot = qd.atomic_add(counter[None], 1) + output[slot] = values[i] +``` + +Every thread that passes the predicate gets a unique `slot` from the counter. The pattern is the workhorse of select / compact and contact-pair generation. + +### Histogram + +```python +hist = qd.field(qd.i32, shape=(NBINS,)) + +@qd.kernel +def histogram(samples: qd.types.NDArray[qd.f32, 1]) -> None: + for i in range(samples.shape[0]): + b = qd.i32(samples[i] * NBINS) + if 0 <= b < NBINS: + qd.atomic_add(hist[b], 1) +``` + +### Bitset population count + +```python +@qd.kernel +def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None: + n = 0 + for i in range(masks.shape[0]): + n += qd.math.popcnt(masks[i]) + qd.atomic_add(total[0], n) +``` + +### Highest set bit (Morton-code depth) + +```python +@qd.func +def msb(x: qd.i32) -> qd.i32: + return 31 - qd.math.clz(x) +``` + +For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. + +## Performance and portability notes + +- **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block. +- **Pair atomics with the right fence scope.** A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block-scope (`qd.simt.block.mem_sync()`) for shared-memory publishing, or grid-scope (`qd.simt.grid.memfence()`) for cross-block coordination. +- **`f64` atomics fall off the fast path** on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block. +- **`atomic_mul` is generally a CAS loop** under the hood; don't put it on the hot path. +- **Bit-trick portability is uneven.** `qd.math.popcnt` is fully cross-backend; `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. + +## Related + +- `qd.simt.block.*` — block-scope barriers and memory fences (`qd.simt.block.mem_sync()`). +- `qd.simt.subgroup.*` — warp-scope reductions and shuffles, the recommended pre-aggregation step before an atomic. +- `qd.simt.grid.*` — device-scope memory fence (`qd.simt.grid.memfence()`). +- [parallelization](parallelization.md) — thread-synchronization patterns and how atomics fit into the broader synchronization story. 
diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md index f783e84264..95467a5728 100644 --- a/docs/source/user_guide/index.md +++ b/docs/source/user_guide/index.md @@ -47,6 +47,7 @@ autodiff :maxdepth: 1 :titlesonly: +atomics subgroup tile16 ``` From 7bc43c0f0de0fc6290f24c1ec8086b310571fd6c Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:31:47 -0700 Subject: [PATCH 2/7] [Docs] Split atomics.md into atomics.md + math.md The previous single page covered two unrelated topics: the qd.atomic_* read-modify-write ops, and the qd.math.popcnt / qd.math.clz bit-counting helpers. They share no concerns (memory coordination vs single-thread register ops) and live in different namespaces. Split into: - atomics.md -- title shortened to "Atomics". Drops the bit-counting table, the popcnt / clz semantics subsections, the bitset-popcount and Morton-MSB examples, and the bit-trick portability bullet. Adds a one-line cross-reference to math.md in the intro and the Related section. - math.md (new) -- framed as a qd.math.* page that currently documents only the bit operations. Lists the broader qd.math surface (vec / mat / GLSL ops / trig / rotations / float-classification / constants) in the intro per the project's own python/quadrants/math/ __init__.py docstring, and notes explicitly that those are exported but not yet documented here. Reuses the bit-counting table, semantics, examples, and portability notes verbatim from the previous atomics.md. Toctree gains a "math" entry next to "atomics" under the SIMT-primitives caption, preserving the existing placement of atomics rather than re-categorising as part of this split. --- docs/source/user_guide/atomics.md | 50 ++------------------------ docs/source/user_guide/index.md | 1 + docs/source/user_guide/math.md | 59 +++++++++++++++++++++++++++++++ 3 files changed, 63 insertions(+), 47 deletions(-) create mode 100644 docs/source/user_guide/math.md diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index e9257a2351..ff13288dad 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -1,13 +1,9 @@ -# Atomics and bit operations +# Atomics -Per-thread primitives — operations a single thread executes on its own registers or on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces. - -This page covers two groups: atomic read-modify-write operations on global / shared memory (`qd.atomic_*`), and bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`). +Atomic read-modify-write operations on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces. Bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`) are documented in [math](math.md). ## What's available -### Atomics - All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y)` atomically and returns the **old** value of `x`. `x` must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed. 
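+
+For instance (a sketch; `f` is an illustrative 1-d `i32` field, not library API):
+
+```python
+old = qd.atomic_add(f[i], 2)   # f[i] becomes old + 2; `old` is the value before the update
+# qd.atomic_add(qd.i32(3), 1)  # not allowed: a constant is not a writable memory target
+```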
| Op | Semantics | i32 | u32 | i64 | u64 | f32 | f64 | @@ -27,17 +23,6 @@ There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ run All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). -### Bit-counting helpers - -| Op | What it returns | i32 | u32 | i64 | u64 | -|---------------------|----------------------------------------------|-----|-----|-----|-----| -| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | -| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | - -\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. - -The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. - ## Semantics ### `qd.atomic_add(x, y)` — and the rest of the family @@ -69,14 +54,6 @@ Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type err Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer reducing to a different scheme on hot paths. -### `qd.math.popcnt(x)` - -Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. - -### `qd.math.clz(x)` - -Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. - ## Examples ### Reserving a slot in an output array @@ -108,37 +85,16 @@ def histogram(samples: qd.types.NDArray[qd.f32, 1]) -> None: qd.atomic_add(hist[b], 1) ``` -### Bitset population count - -```python -@qd.kernel -def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None: - n = 0 - for i in range(masks.shape[0]): - n += qd.math.popcnt(masks[i]) - qd.atomic_add(total[0], n) -``` - -### Highest set bit (Morton-code depth) - -```python -@qd.func -def msb(x: qd.i32) -> qd.i32: - return 31 - qd.math.clz(x) -``` - -For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. - ## Performance and portability notes - **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block. 
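+
+  A sketch of the two-stage scheme (`qd.simt.subgroup.reduce_add` is named above, but its exact signature, the election helper `qd.simt.subgroup.elect()`, and the 0-d `counter` field are assumptions for illustration):
+
+  ```python
+  total = qd.simt.subgroup.reduce_add(qd.i32(1))  # per-warp sum of contributions
+  if qd.simt.subgroup.elect():                    # hypothetical: true on exactly one lane
+      qd.atomic_add(counter[None], total)         # one atomic per warp, not per thread
+  ```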
- **Pair atomics with the right fence scope.** A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block-scope (`qd.simt.block.mem_sync()`) for shared-memory publishing, or grid-scope (`qd.simt.grid.memfence()`) for cross-block coordination. - **`f64` atomics fall off the fast path** on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block. - **`atomic_mul` is generally a CAS loop** under the hood; don't put it on the hot path. -- **Bit-trick portability is uneven.** `qd.math.popcnt` is fully cross-backend; `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. ## Related +- [math](math.md) — `qd.math.*`, including the bit-counting helpers (`popcnt`, `clz`) commonly paired with atomics in select / compact patterns. - `qd.simt.block.*` — block-scope barriers and memory fences (`qd.simt.block.mem_sync()`). - `qd.simt.subgroup.*` — warp-scope reductions and shuffles, the recommended pre-aggregation step before an atomic. - `qd.simt.grid.*` — device-scope memory fence (`qd.simt.grid.memfence()`). diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md index 95467a5728..3fab71fd50 100644 --- a/docs/source/user_guide/index.md +++ b/docs/source/user_guide/index.md @@ -48,6 +48,7 @@ autodiff :titlesonly: atomics +math subgroup tile16 ``` diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md new file mode 100644 index 0000000000..ea013c8ffc --- /dev/null +++ b/docs/source/user_guide/math.md @@ -0,0 +1,59 @@ +# Math + +`qd.math` is the quadrants standard library of math helpers. Per the module docstring it "supports glsl-style vectors, matrices and functions"; in practice the surface includes vector / matrix constructors (`vec2`, `vec3`, `vec4`, `mat2`, `mat3`, `mat4`, etc.), GLSL-style scalar / vector ops (`mix`, `clamp`, `step`, `smoothstep`, `sign`, `normalize`, `length`, `distance`, `dot`, `cross`, `reflect`, `refract`, `mod`), trig (`sin`, `cos`, `tan`, `asin`, `acos`, `atan2`, `tanh`, `exp`, `log`, `log2`, `pow`, `sqrt`, `floor`, `ceil`, `round`), rotation builders (`rot_by_axis`, `rot_yaw_pitch_roll`, `rotation2d`, `rotation3d`, `translate`, `scale`, `eye`), float-classification helpers (`isinf`, `isnan`), the constants `e`, `pi`, `inf`, `nan`, and the integer bit-counting helpers documented below. + +This page currently documents only the bit-counting helpers. The broader `qd.math` surface is exported and usable today but is not yet documented here. + +## Bit operations + +Single-thread integer-register operations. They do not access memory and do not synchronize threads — each thread independently transforms a value in its own register. + +| Op | What it returns | i32 | u32 | i64 | u64 | +|---------------------|----------------------------------------------|-----|-----|-----|-----| +| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | +| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | + +\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. 
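+
+A minimal portability shim for the u32 case, using only the `bit_cast` workaround named above (`clz_u32` is an illustrative helper, not library API):
+
+```python
+@qd.func
+def clz_u32(x: qd.u32) -> qd.i32:
+    # Reinterpreting the bits as i32 leaves the bit pattern unchanged,
+    # so the leading-zero count is identical and CUDA accepts the input.
+    return qd.math.clz(qd.bit_cast(x, qd.i32))
+```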
+ +The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. + +### `qd.math.popcnt(x)` + +Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. + +### `qd.math.clz(x)` + +Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. + +## Examples + +### Bitset population count + +```python +@qd.kernel +def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None: + n = 0 + for i in range(masks.shape[0]): + n += qd.math.popcnt(masks[i]) + qd.atomic_add(total[0], n) +``` + +### Highest set bit (Morton-code depth) + +```python +@qd.func +def msb(x: qd.i32) -> qd.i32: + return 31 - qd.math.clz(x) +``` + +For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. + +## Performance and portability notes + +- `qd.math.popcnt` is fully cross-backend. +- `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. + +## Related + +- [atomics](atomics.md) — atomic read-modify-write operations on global / shared memory; commonly paired with bit-counting in select / compact patterns. +- `qd.bit_cast` — reinterprets a value's bit pattern as another dtype, used as a workaround for the `clz` u32 / u64 caveats above. From 5e2877262760f4e457e81f1783df0acd51d5b172 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:35:20 -0700 Subject: [PATCH 3/7] [Docs] Reword atomic-ops scope sentence in atomics.md Change "All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (qd.simt.block.SharedArray)." to "All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (qd.simt.block.SharedArray)." The earlier wording read as "they work in both places", which is true but a little off-target; "can be called on either" matches how a user picks one memory kind per call site. --- docs/source/user_guide/atomics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index ff13288dad..3448e84287 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -21,7 +21,7 @@ All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`. -All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). 
+All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). ## Semantics From e3f89db3c39e2404cceb85e788584ca9fccd9917 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:37:04 -0700 Subject: [PATCH 4/7] [Docs] Drop Examples section from atomics.md Removes "## Examples" and its two subsections ("Reserving a slot in an output array" and "Histogram"). No other content changes; no in-page anchors referenced the dropped section. --- docs/source/user_guide/atomics.md | 31 ------------------------------- 1 file changed, 31 deletions(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index 3448e84287..309117ffb1 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -54,37 +54,6 @@ Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type err Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer reducing to a different scheme on hot paths. -## Examples - -### Reserving a slot in an output array - -```python -counter = qd.field(qd.i32, shape=()) -output = qd.field(qd.f32, shape=(MAX_OUTPUTS,)) - -@qd.kernel -def emit(values: qd.types.NDArray[qd.f32, 1], threshold: qd.f32) -> None: - for i in range(values.shape[0]): - if values[i] > threshold: - slot = qd.atomic_add(counter[None], 1) - output[slot] = values[i] -``` - -Every thread that passes the predicate gets a unique `slot` from the counter. The pattern is the workhorse of select / compact and contact-pair generation. - -### Histogram - -```python -hist = qd.field(qd.i32, shape=(NBINS,)) - -@qd.kernel -def histogram(samples: qd.types.NDArray[qd.f32, 1]) -> None: - for i in range(samples.shape[0]): - b = qd.i32(samples[i] * NBINS) - if 0 <= b < NBINS: - qd.atomic_add(hist[b], 1) -``` - ## Performance and portability notes - **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block. From 74ff985659c396af21b2a025ee57f6c1285ec775 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 06:38:56 -0700 Subject: [PATCH 5/7] [Docs] Drop enumerative qd.math surface sentence from math.md intro Removes the second sentence of the math.md intro, which listed every category of qd.math.* helpers (vec / mat constructors, GLSL-style ops, trig, rotation builders, float-classification, constants). The remaining intro is just "qd.math is the quadrants standard library of math helpers", followed by the existing note that this page currently documents only the bit-counting helpers. 
--- docs/source/user_guide/math.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md index ea013c8ffc..cf4eecb568 100644 --- a/docs/source/user_guide/math.md +++ b/docs/source/user_guide/math.md @@ -1,6 +1,6 @@ # Math -`qd.math` is the quadrants standard library of math helpers. Per the module docstring it "supports glsl-style vectors, matrices and functions"; in practice the surface includes vector / matrix constructors (`vec2`, `vec3`, `vec4`, `mat2`, `mat3`, `mat4`, etc.), GLSL-style scalar / vector ops (`mix`, `clamp`, `step`, `smoothstep`, `sign`, `normalize`, `length`, `distance`, `dot`, `cross`, `reflect`, `refract`, `mod`), trig (`sin`, `cos`, `tan`, `asin`, `acos`, `atan2`, `tanh`, `exp`, `log`, `log2`, `pow`, `sqrt`, `floor`, `ceil`, `round`), rotation builders (`rot_by_axis`, `rot_yaw_pitch_roll`, `rotation2d`, `rotation3d`, `translate`, `scale`, `eye`), float-classification helpers (`isinf`, `isnan`), the constants `e`, `pi`, `inf`, `nan`, and the integer bit-counting helpers documented below. +`qd.math` is the quadrants standard library of math helpers. This page currently documents only the bit-counting helpers. The broader `qd.math` surface is exported and usable today but is not yet documented here. From fef151022159d4eea1a9f37ee37c92bc77025c41 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:56:22 -0700 Subject: [PATCH 6/7] [Docs] atomics/math: address PR #640 review (AMDGPU support, dtype scope, NaN semantics) --- docs/source/user_guide/atomics.md | 9 ++++++++- docs/source/user_guide/math.md | 22 ++++++++++++---------- 2 files changed, 20 insertions(+), 11 deletions(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index 309117ffb1..a3689c6372 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -23,6 +23,13 @@ There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ run All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope). +**Backend caveat for the fence-pair pattern.** Both fence helpers have current portability gaps that affect the patterns recommended on this page: + +- `qd.simt.block.mem_sync()` is supported on CUDA and SPIR-V; on AMDGPU it raises `ValueError("qd.block.mem_sync is not supported for arch ...")` at trace time. +- `qd.simt.grid.memfence()` is fully implemented only on CUDA. On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. See [grid](grid.md) for the per-backend details. + +On AMDGPU specifically, neither fence-pair recipe works as documented yet; cross-platform code that needs an atomic plus a fence must restructure around the kernel-launch boundary or be CUDA-bound until the AMDGPU lowerings land. + ## Semantics ### `qd.atomic_add(x, y)` — and the rest of the family @@ -44,7 +51,7 @@ Properties common to every `qd.atomic_*`: ### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)` -Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. 
Floating-point min/max follow IEEE rules — `NaN` propagates: if either input is `NaN`, the result is `NaN`. +Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max use **`minNum` / `maxNum`-style** semantics: if exactly one input is `NaN`, the **non-`NaN`** value is written back. This matches the f16 path's use of LLVM `llvm.minnum` / `llvm.maxnum` intrinsics (`quadrants/codegen/llvm/codegen_llvm.cpp:1337-1342`) and the GPU-native paths (CUDA sm_80+ `atomicMin`/`atomicMax` for floats, SPIR-V `FMin` / `FMax`). The f32 / f64 CPU CAS-loop path (`quadrants/runtime/llvm/runtime_module/atomic.h::min_f32` / `max_f32`) uses naive `<` / `>` comparisons, which give asymmetric NaN behaviour depending on operand order — do not rely on a particular result when either input is `NaN` on the CPU backend. Behaviour when *both* inputs are `NaN` is backend-dependent across the board. ### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)` diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md index cf4eecb568..088fa9f08f 100644 --- a/docs/source/user_guide/math.md +++ b/docs/source/user_guide/math.md @@ -8,22 +8,24 @@ This page currently documents only the bit-counting helpers. The broader `qd.mat Single-thread integer-register operations. They do not access memory and do not synchronize threads — each thread independently transforms a value in its own register. -| Op | What it returns | i32 | u32 | i64 | u64 | -|---------------------|----------------------------------------------|-----|-----|-----|-----| -| `qd.math.popcnt(x)` | Number of set bits in `x` | yes | yes | yes | yes | -| `qd.math.clz(x)` | Number of leading zero bits in `x` | yes | \* | yes | \* | +| Op | CUDA | AMDGPU | SPIR-V (Vulkan / Metal) | +|---------------------|---------------------|------------------------------|----------------------------------------------------| +| `qd.math.popcnt(x)` | i32, u32, i64, u64 | unsupported (codegen FIXME) | any int (`OpBitCount`) | +| `qd.math.clz(x)` | i32, i64 only \* | unsupported (codegen FIXME) | 32-bit only (`FindMSB`); 64-bit input is silently truncated | -\* `qd.math.clz` on CUDA currently rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED`); cast through `qd.bit_cast(x, qd.i32)` / `qd.i64` as a workaround. On SPIR-V, `qd.math.clz` is hard-coded to 32-bit (`FindMSB`); 64-bit input is silently truncated. +\* On CUDA, `qd.math.clz` rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED` in `quadrants/codegen/cuda/codegen_cuda.cpp`); `bit_cast` through the matching signed type as a workaround: `qd.math.clz(qd.bit_cast(x, qd.i32))`. CUDA `popcnt` accepts u32 / u64 directly; only `clz` has the signed-only restriction. On unsupported integer widths (e.g. `i8`, `i16`, `u16`) both ops also hit `QD_NOT_IMPLEMENTED`. -The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32, the `bit_cast` workaround above is the canonical approach. +**FIXME (AMDGPU):** the AMDGPU `emit_extra_unary` override (`quadrants/codegen/amdgpu/codegen_amdgpu.cpp`) has no `popcnt` or `clz` branch; both fall through to `QD_NOT_IMPLEMENTED`. The test suite already records this (`tests/python/test_unary_ops.py::test_popcnt` and `::test_clz` both `xfail` on AMDGPU). Until lowerings are added, AMDGPU users hit a hard codegen failure. 
+ +The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32 on CUDA, the `bit_cast` workaround above is the canonical approach. ### `qd.math.popcnt(x)` -Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes. +Counts set bits in `x` and returns an `i32`. On CUDA, lowers to `__nv_popc` for 32-bit inputs and `__nv_popcll` for 64-bit inputs (i32 / u32 / i64 / u64 only; narrower widths and AMDGPU are unsupported). On SPIR-V, lowers to `OpBitCount`. ### `qd.math.clz(x)` -Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table. +Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. On CUDA, lowers to `__nv_clz` (i32 only) and `__nv_clzll` (i64 only); u32 / u64 must be `bit_cast` to the matching signed type. On SPIR-V, lowers to `FindMSB` with `bitwidth - 1 - FindMSB` to convert MSB index into a leading-zero count; the implementation is hard-coded to 32-bit, so 64-bit input silently truncates. AMDGPU is unsupported. See the cross-backend caveats in the support table. ## Examples @@ -50,8 +52,8 @@ For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`. ## Performance and portability notes -- `qd.math.popcnt` is fully cross-backend. -- `qd.math.clz` has the dtype caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability. +- `qd.math.popcnt` is supported on CUDA (i32 / u32 / i64 / u64) and SPIR-V (any integer width). AMDGPU is unsupported (FIXME above). +- `qd.math.clz` has the dtype and backend caveats noted above. Tests that depend on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability on CUDA, and avoid 64-bit input on SPIR-V. ## Related From 438cfcb2ad74ffea63647b4da484c095fb15042f Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Fri, 8 May 2026 07:59:31 -0700 Subject: [PATCH 7/7] [Docs] atomics: drop link to grid.md (file lives on hp/docs-grid, not main yet) --- docs/source/user_guide/atomics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md index a3689c6372..0ca5ed71ec 100644 --- a/docs/source/user_guide/atomics.md +++ b/docs/source/user_guide/atomics.md @@ -26,7 +26,7 @@ All atomic ops can be called on either global memory (fields, ndarrays) or block **Backend caveat for the fence-pair pattern.** Both fence helpers have current portability gaps that affect the patterns recommended on this page: - `qd.simt.block.mem_sync()` is supported on CUDA and SPIR-V; on AMDGPU it raises `ValueError("qd.block.mem_sync is not supported for arch ...")` at trace time. -- `qd.simt.grid.memfence()` is fully implemented only on CUDA. On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. See [grid](grid.md) for the per-backend details. +- `qd.simt.grid.memfence()` is fully implemented only on CUDA. 
On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. On AMDGPU specifically, neither fence-pair recipe works as documented yet; cross-platform code that needs an atomic plus a fence must restructure around the kernel-launch boundary or be CUDA-bound until the AMDGPU lowerings land.