[Docs] Add user-guide page for atomics and bit operations #640

# Atomics

Atomic read-modify-write operations on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative operations across threads, see the `qd.simt.block.*`, `qd.simt.subgroup.*`, and `qd.simt.grid.*` namespaces. Bit-counting helpers on integer registers (`qd.math.popcnt`, `qd.math.clz`) are documented in [math](math.md).

## What's available

All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y)` atomically and returns the **old** value of `x`. `x` must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed.
| Op           | Semantics       | i32 | u32 | i64 | u64 | f32 | f64 |
|--------------|-----------------|-----|-----|-----|-----|-----|-----|
| `atomic_add` | `x += y`        | yes | yes | yes | yes | yes | \*  |
| `atomic_sub` | `x -= y`        | yes | yes | yes | yes | yes | \*  |
| `atomic_mul` | `x *= y`        | yes | yes | yes | yes | yes | \*  |
| `atomic_min` | `x = min(x, y)` | yes | yes | yes | yes | yes | \*  |
| `atomic_max` | `x = max(x, y)` | yes | yes | yes | yes | yes | \*  |
| `atomic_and` | `x &= y`        | yes | yes | yes | yes | —   | —   |
| `atomic_or`  | `x \|= y`       | yes | yes | yes | yes | —   | —   |
| `atomic_xor` | `x ^= y`        | yes | yes | yes | yes | —   | —   |

\* `f64` atomic add / sub / mul / min / max is hardware-dependent: `add` is supported on CUDA sm_60+; elsewhere these ops fall back to a CAS loop, or raise at codegen time on older targets and on backends that do not lower a CAS loop. Prefer `f32` on hot paths if portability matters.
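
For example, a minimal histogram-style accumulation (a sketch; the kernel and ndarray style follow the examples in [math](math.md), and the `values` / `hist` names and the modulo bucketing are purely illustrative):

```python
@qd.kernel
def histogram(values: qd.types.NDArray[qd.i32, 1], hist: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(values.shape[0]):
        # Many elements may map to the same bucket, so the increment has to be
        # an atomic read-modify-write on the ndarray element.
        bucket = values[i] % hist.shape[0]
        qd.atomic_add(hist[bucket], 1)
```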

There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.

All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).

**Backend caveat for the fence-pair pattern.** Both fence helpers have current portability gaps that affect the patterns recommended on this page:

- `qd.simt.block.mem_sync()` is supported on CUDA and SPIR-V; on AMDGPU it raises `ValueError("qd.block.mem_sync is not supported for arch ...")` at trace time.
- `qd.simt.grid.memfence()` is fully implemented only on CUDA. On AMDGPU it currently links as a silent no-op (cross-block ordering will fail without any diagnostic); on SPIR-V it fails at codegen. See [grid](grid.md) for the per-backend details.

On AMDGPU specifically, neither fence-pair recipe works as documented yet; cross-platform code that needs an atomic plus a fence must restructure around the kernel-launch boundary or be CUDA-bound until the AMDGPU lowerings land.
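
For reference, the grid-scope publish pattern on CUDA looks roughly like this (a sketch only; `data` and `done` are hypothetical ndarrays, and per the caveats above the fence is a silent no-op on AMDGPU and fails at codegen on SPIR-V):

```python
@qd.kernel
def publish(data: qd.types.NDArray[qd.f32, 1], done: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(data.shape[0]):
        data[i] = 2.0 * data[i]      # ordinary, non-atomic write
        # Device-scope fence: make the write above visible to other blocks
        # before the counter below is observed to have moved.
        qd.simt.grid.memfence()
        # The atomic itself only orders done[0]; the fence covers data[i].
        qd.atomic_add(done[0], 1)
```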

## Semantics

### `qd.atomic_add(x, y)` — and the rest of the family

```python
old = qd.atomic_add(x, y)
# Effect:
#   tmp = load(x)
#   store(x, op(tmp, y))
#   old = tmp
# All three steps execute as a single atomic transaction on x.
```

Properties common to every `qd.atomic_*`:

- **Returns the old value**, not the new one. This matches CUDA's `atomicAdd` and is what enables building reservation patterns: `slot = qd.atomic_add(counter, 1)` gives every thread a unique index (see the sketch after this list).
- **Per-location atomicity, no fence on the rest of memory.** Writes you issued before an atomic on `x` are not necessarily visible to other threads after they observe the new `x`. Pair the atomic with `qd.simt.block.mem_sync()` or `qd.simt.grid.memfence()` if you need that ordering.
- **Vector / matrix arguments fan out element-wise.** `qd.atomic_add(field_of_vec3, qd.Vector([1.0, 2.0, 3.0]))` issues three independent scalar atomic-adds, one per component. There is no all-or-nothing guarantee across the components.
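
The reservation pattern from the first bullet, written out as a stream-compaction sketch (the `src` / `dst` / `count` ndarrays and the positivity predicate are illustrative; `count` is assumed to be zeroed on the host beforehand):

```python
@qd.kernel
def compact_positive(src: qd.types.NDArray[qd.f32, 1],
                     dst: qd.types.NDArray[qd.f32, 1],
                     count: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(src.shape[0]):
        if src[i] > 0.0:
            # atomic_add returns the OLD counter value, so every thread that
            # takes this branch reserves a distinct slot in dst.
            slot = qd.atomic_add(count[0], 1)
            dst[slot] = src[i]
```

The output order is whatever order the atomics happen to serialize in; only uniqueness of the slots is guaranteed.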

### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)`

Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max use **`minNum` / `maxNum`-style** semantics: if exactly one input is `NaN`, the **non-`NaN`** value is written back. This matches the f16 path's use of the LLVM `llvm.minnum` / `llvm.maxnum` intrinsics (`quadrants/codegen/llvm/codegen_llvm.cpp:1337-1342`) and the GPU-native paths (CUDA sm_80+ `atomicMin` / `atomicMax` for floats, SPIR-V `FMin` / `FMax`). The f32 / f64 CPU CAS-loop path (`quadrants/runtime/llvm/runtime_module/atomic.h::min_f32` / `max_f32`) uses naive `<` / `>` comparisons, which give asymmetric NaN behaviour depending on operand order — do not rely on a particular result when either input is `NaN` on the CPU backend. Behaviour when *both* inputs are `NaN` is backend-dependent across the board.
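
For example, a global minimum over an array (a sketch; `out` is a hypothetical single-element ndarray pre-initialised on the host to a value larger than any input, and the contention note in the performance section applies):

```python
@qd.kernel
def global_min(x: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(x.shape[0]):
        # Fold x[i] into out[0]; the returned old value is not needed here.
        qd.atomic_min(out[0], x[i])
```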

### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)`

Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type error at trace time.
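
A typical use is setting individual bits in a packed bitset (a sketch; the `ids` / `bits` ndarrays and the word/bit arithmetic are illustrative, and integer division and modulo are assumed to lower as usual):

```python
@qd.kernel
def mark_seen(ids: qd.types.NDArray[qd.i32, 1], bits: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(ids.shape[0]):
        word = ids[i] // 32
        bit = ids[i] % 32
        # atomic_or sets one bit without clobbering concurrent updates to the
        # other 31 bits of the same word.
        qd.atomic_or(bits[word], 1 << bit)
```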

### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)`

Atomic subtract and atomic multiply. `atomic_sub` is supported natively on most backends; `atomic_mul` on integer types lowers to a CAS loop on hardware without a native multiply atomic and is intentionally not heavily optimised — prefer a different reduction scheme on hot paths.
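
One place the returned old value of `atomic_sub` is handy is a countdown over a shared work counter (a sketch; `todo` is a hypothetical counter pre-initialised to `items.shape[0]` on the host, and `last` is a one-element flag):

```python
@qd.kernel
def retire(items: qd.types.NDArray[qd.f32, 1],
           todo: qd.types.NDArray[qd.i32, 1],
           last: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(items.shape[0]):
        items[i] = items[i] + 1.0   # this thread's piece of work
        # The old value says how many items were still pending before this
        # decrement; 1 means this thread just retired the final item.
        if qd.atomic_sub(todo[0], 1) == 1:
            last[0] = 1
```

As discussed above, if readers of `last` also need to see the `items` writes, precede the decrement with the appropriate fence.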

## Performance and portability notes

- **Atomic contention is the silent killer of throughput.** The cost of `qd.atomic_add(counter, 1)` from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (`qd.simt.block.reduce` if available, or `qd.simt.subgroup.reduce_add`), then a single atomic per warp / block (see the sketch after this list).
- **Pair atomics with the right fence scope.** A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block-scope (`qd.simt.block.mem_sync()`) for shared-memory publishing, or grid-scope (`qd.simt.grid.memfence()`) for cross-block coordination.
- **`f64` atomics fall off the fast path** on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block.
- **`atomic_mul` is generally a CAS loop** under the hood; don't put it on the hot path.
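
A sketch of the two-stage scheme from the first bullet. The subgroup reduction is the one named above; the leader-election helper is a placeholder for however the subgroup API exposes "exactly one lane executes this" (check the subgroup page for the real name):

```python
@qd.kernel
def sum_two_stage(x: qd.types.NDArray[qd.f32, 1], total: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(x.shape[0]):
        # Stage 1: combine within the warp/subgroup, entirely in registers.
        partial = qd.simt.subgroup.reduce_add(x[i])
        # Stage 2: one atomic per subgroup instead of one per thread.
        # `elect_one()` is a placeholder name, not a documented API.
        if qd.simt.subgroup.elect_one():
            qd.atomic_add(total[0], partial)
```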

## Related

- [math](math.md) — `qd.math.*`, including the bit-counting helpers (`popcnt`, `clz`) commonly paired with atomics in select / compact patterns.
- `qd.simt.block.*` — block-scope barriers and memory fences (`qd.simt.block.mem_sync()`).
- `qd.simt.subgroup.*` — warp-scope reductions and shuffles, the recommended pre-aggregation step before an atomic.
- `qd.simt.grid.*` — device-scope memory fence (`qd.simt.grid.memfence()`).
- [parallelization](parallelization.md) — thread-synchronization patterns and how atomics fit into the broader synchronization story.

In the user-guide index (hunk `@@ -47,6 +47,8 @@ autodiff`), the toctree gains entries for the new `atomics` and `math` pages:

```
:maxdepth: 1
:titlesonly:
atomics
math
subgroup
tile16
```

# Math

`qd.math` is the quadrants standard library of math helpers.

This page currently documents only the bit-counting helpers. The broader `qd.math` surface is exported and usable today but is not yet documented here.

## Bit operations

Single-thread integer-register operations. They do not access memory and do not synchronize threads — each thread independently transforms a value in its own register.

| Op                  | CUDA               | AMDGPU                      | SPIR-V (Vulkan / Metal)                                     |
|---------------------|--------------------|-----------------------------|-------------------------------------------------------------|
| `qd.math.popcnt(x)` | i32, u32, i64, u64 | unsupported (codegen FIXME) | any int (`OpBitCount`)                                      |
| `qd.math.clz(x)`    | i32, i64 only \*   | unsupported (codegen FIXME) | 32-bit only (`FindMSB`); 64-bit input is silently truncated |

\* On CUDA, `qd.math.clz` rejects unsigned 32- and 64-bit inputs (`QD_NOT_IMPLEMENTED` in `quadrants/codegen/cuda/codegen_cuda.cpp`); as a workaround, `bit_cast` through the matching signed type: `qd.math.clz(qd.bit_cast(x, qd.i32))`. CUDA `popcnt` accepts u32 / u64 directly; only `clz` has the signed-only restriction. On unsupported integer widths (e.g. `i8`, `i16`, `u16`) both ops also hit `QD_NOT_IMPLEMENTED`.

**FIXME (AMDGPU):** the AMDGPU `emit_extra_unary` override (`quadrants/codegen/amdgpu/codegen_amdgpu.cpp`) has no `popcnt` or `clz` branch; both fall through to `QD_NOT_IMPLEMENTED`. The test suite already records this (`tests/python/test_unary_ops.py::test_popcnt` and `::test_clz` both `xfail` on AMDGPU). Until lowerings are added, AMDGPU users hit a hard codegen failure.

The classic CUDA bit-tricks `__ffs` (find first set bit) and `__fns` (find n-th set bit in a mask) are not exposed; for a leading-zero count of a u32 on CUDA, the `bit_cast` workaround above is the canonical approach.

### `qd.math.popcnt(x)`

Counts set bits in `x` and returns an `i32`. On CUDA, lowers to `__nv_popc` for 32-bit inputs and `__nv_popcll` for 64-bit inputs (i32 / u32 / i64 / u64 only; narrower widths and AMDGPU are unsupported). On SPIR-V, lowers to `OpBitCount`.

### `qd.math.clz(x)`

Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. On CUDA, lowers to `__nv_clz` (i32 only) and `__nv_clzll` (i64 only); u32 / u64 must be `bit_cast` to the matching signed type. On SPIR-V, lowers to `FindMSB`, with `bitwidth - 1 - FindMSB` converting the MSB index into a leading-zero count; the implementation is hard-coded to 32-bit, so 64-bit input silently truncates. AMDGPU is unsupported. See the cross-backend caveats in the support table.
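
On CUDA, a thin wrapper keeps the signed-only restriction out of call sites (a sketch built from the documented `bit_cast` workaround; the helper name is ours):

```python
@qd.func
def clz_u32(x: qd.u32) -> qd.i32:
    # CUDA's clz lowering only accepts signed input; reinterpreting the bits
    # through i32 preserves the bit pattern, so the count is unchanged.
    return qd.math.clz(qd.bit_cast(x, qd.i32))
```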

## Examples

### Bitset population count

```python
@qd.kernel
def count_bits(masks: qd.types.NDArray[qd.u32, 1], total: qd.types.NDArray[qd.i32, 1]) -> None:
    n = 0
    for i in range(masks.shape[0]):
        n += qd.math.popcnt(masks[i])  # per-register bit count, no extra memory traffic
    qd.atomic_add(total[0], n)         # publish the accumulated count
```

### Highest set bit (Morton-code depth)

```python
@qd.func
def msb(x: qd.i32) -> qd.i32:
    # Index of the most significant set bit; for x == 0, clz(0) == 32 gives -1.
    return 31 - qd.math.clz(x)
```

For `qd.u32` input on CUDA, cast first: `qd.math.clz(qd.bit_cast(x, qd.i32))`.

## Performance and portability notes

- `qd.math.popcnt` is supported on CUDA (i32 / u32 / i64 / u64) and SPIR-V (any integer width). AMDGPU is unsupported (see the FIXME above).
- `qd.math.clz` has the dtype and backend caveats noted above. Code that depends on `qd.math.clz` over u32 or u64 should `bit_cast` to the matching signed type for portability on CUDA, and avoid 64-bit input on SPIR-V.

## Related

- [atomics](atomics.md) — atomic read-modify-write operations on global / shared memory; commonly paired with bit-counting in select / compact patterns.
- `qd.bit_cast` — reinterprets a value's bit pattern as another dtype, used as a workaround for the `clz` u32 / u64 caveats above.

**Collaborator review comment:** This guidance points every block-scope publishing pattern at `qd.simt.block.mem_sync()`, but that helper is not available on AMDGPU: `python/quadrants/lang/simt/block.py` only returns for CUDA or SPIR-V-backed arches and otherwise raises `ValueError("qd.block.mem_sync is not supported...")`. On AMDGPU kernels, following this new atomics page to pair a shared-memory atomic with the documented fence will fail during tracing, so the doc should either caveat AMDGPU or point AMDGPU users at a supported synchronization primitive.