From 3c80d6083f28907588de6fab8080c327644275e3 Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 00:44:38 -0700
Subject: [PATCH 01/13] [Docs] Add user-guide page for qd.algorithms.*
 device-wide algorithms

Documents the two ops currently in qd.algorithms: parallel_sort
(odd-even merge sort, key or key-value, all backends, not stable) and
PrefixSumExecutor (Kogge-Stone hierarchical inclusive scan, i32 only,
CUDA + Vulkan only). Covers semantics, the i32 / CUDA + Vulkan
limitation that cross-platform code most commonly hits, the
allocate-once / run-many pattern, and worked examples (key-value sort,
scan-based compact).

Adds a new 'Algorithms' caption to the toctree in index.md.
---
 docs/source/user_guide/algorithms.md | 127 +++++++++++++++++++++++++++
 docs/source/user_guide/index.md      |   8 ++
 2 files changed, 135 insertions(+)
 create mode 100644 docs/source/user_guide/algorithms.md

diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md
new file mode 100644
index 0000000000..3f312e7a85
--- /dev/null
+++ b/docs/source/user_guide/algorithms.md
@@ -0,0 +1,127 @@
+# Algorithms
+
+Device-wide algorithms — primitives that consume and produce whole arrays, executed as one or more kernel launches under the hood. They sit one tier above grid-scope synchronization: they *use* block, subgroup, and grid primitives internally and expose a high-level entry point that the user calls from host (Python) code, not from inside a kernel.
+
+The current `qd.algorithms` namespace is small — two ops, both built around prefix-style internal kernels. The set is expected to grow over time (radix sort, reduce-by-key, select / compact).
+
+## What's available
+
+| Op | What it does | CUDA | AMDGPU | Vulkan | Metal |
+|---------------------------------|---------------------------------------------|------|--------|--------|-------|
+| `qd.algorithms.parallel_sort` | Odd-even merge sort (in-place, key or key-value) | yes | yes\* | yes | yes\* |
+| `qd.algorithms.PrefixSumExecutor` | Inclusive in-place prefix sum (i32 only) | yes | no | yes | no |
+
+\* `parallel_sort` runs anywhere a Quadrants kernel runs; portability is inherited from the underlying kernel infrastructure. AMDGPU and Metal coverage is exercised less heavily than CUDA / Vulkan; report any failures.
+
+## Semantics
+
+### `qd.algorithms.parallel_sort(keys, values=None)`
+
+In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `field` or `ndarray`.
+
+```python
+keys = qd.field(qd.i32, shape=(N,))
+qd.algorithms.parallel_sort(keys)
+```
+
+```python
+keys = qd.field(qd.i32, shape=(N,))
+vals = qd.field(qd.f32, shape=(N,))
+qd.algorithms.parallel_sort(keys, vals)
+```
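+
+`parallel_sort` is *not* stable (see the **Stability** bullet below). When stable ordering matters, one workaround is to pack each element's original index into the low bits of the key before sorting. The following is a minimal sketch, not a library feature — `raw_keys` is assumed to be a 1-D `i32` ndarray prepared elsewhere, with values small enough to leave room for the index:
+
+```python
+N = 1024  # the original index then fits in 10 bits
+packed = qd.field(qd.i32, shape=(N,))
+
+@qd.kernel
+def pack_keys(raw: qd.types.NDArray[qd.i32, 1]) -> None:
+    for i in range(N):
+        # Key in the high bits, original index as the low-bit tiebreaker.
+        # Assumes 0 <= raw[i] < 2**21 so the packed value stays positive.
+        packed[i] = (raw[i] << 10) | i
+
+pack_keys(raw_keys)
+qd.algorithms.parallel_sort(packed)
+# packed[k] >> 10 recovers the k-th smallest key; packed[k] & 0x3FF recovers
+# its original index. Equal keys now come out ordered by original index.
+```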
+
+- **Algorithm.** Batcher's odd-even merge sort. Time complexity `O(N log² N)` — not work-efficient, but its data-oblivious compare-exchange network parallelizes well, which is what makes it competitive for small / mid-sized arrays.
+- **Key dtype.** Whatever the key field's dtype is, as long as `<` is meaningful for it (integer and float types).
+- **Stability.** Odd-even merge sort is *not* a stable sort — equal keys may be reordered relative to one another. If stability matters, encode tiebreakers into the keys (e.g. pack the original index into the low bits).
+- **Memory.** Strictly in-place — no auxiliary buffers from the caller's perspective.
+- **Performance characteristic.** Beats radix-style sorts for small N (roughly N ≲ 4K), losing to them at large N. For million-element key sets prefer a radix sort (qipc / CUB-style); for thousands or tens of thousands, this is a fine choice.
+
+### `qd.algorithms.PrefixSumExecutor`
+
+Inclusive in-place prefix sum (scan) over a 1-D `i32` field. Construct once with the array length, then call `.run(field)` to scan.
+
+```python
+psum = qd.algorithms.PrefixSumExecutor(N)
+arr = qd.field(qd.i32, shape=(N,))
+# ... fill arr ...
+psum.run(arr)
+# arr now holds the inclusive prefix sum: arr[i] = sum(arr_original[0..=i]).
+```
+
+Constructor:
+
+- `length: int` — the maximum number of elements the executor can scan. Internally allocates an auxiliary `qd.field(i32, shape=padded_length)` sized to the Kogge-Stone hierarchy (block size = 64).
+
+`run(input_arr)`:
+
+- `input_arr` must be a 1-D `qd.field(qd.i32, shape=(L,))` with `L <= length`.
+- Returns nothing; `input_arr` is overwritten with the scan result.
+
+Constraints:
+
+- **Dtype:** `qd.i32` only. Calling with any other dtype raises `RuntimeError("Only qd.i32 type is supported for prefix sum.")`.
+- **Inclusive only.** No exclusive variant exposed. To convert to exclusive, post-process: `exclusive[i] = inclusive[i] - input_original[i]`.
+- **Backend coverage.** CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at trace time. Cross-platform code that needs a portable exclusive scan currently has to roll its own (see, for example, the qipc Onesweep / decoupled-look-back scan).
+
+The implementation is a Kogge-Stone hierarchical scan: per-block inclusive scan on shared memory, then a small recursive scan over per-block totals, then a uniform-add pass to propagate back. This means the executor reuses the underlying buffer across calls, which is why it's a class (allocate once, run many times) rather than a free function.
+
+## Examples
+
+### Sort indices by per-element key
+
+```python
+N = 1000
+keys = qd.field(qd.f32, shape=(N,))
+indices = qd.field(qd.i32, shape=(N,))
+
+@qd.kernel
+def init() -> None:
+    for i in range(N):
+        keys[i] = qd.random()
+        indices[i] = i
+
+init()
+qd.algorithms.parallel_sort(keys, indices)
+# keys is now ascending; indices[k] is the original index of the k-th smallest key.
+```
+
+### Compact-array offsets via prefix sum
+
+```python
+N = 100_000
+flags = qd.field(qd.i32, shape=(N,))  # 0 or 1 per element
+offsets = qd.field(qd.i32, shape=(N,))
+
+@qd.kernel
+def populate(input: qd.types.NDArray[qd.f32, 1], threshold: qd.f32) -> None:
+    for i in range(N):
+        flags[i] = 1 if input[i] > threshold else 0
+
+@qd.kernel
+def copy_flags() -> None:
+    for i in range(N):
+        offsets[i] = flags[i]
+
+scan = qd.algorithms.PrefixSumExecutor(N)
+
+# `input` is a 1-D f32 ndarray of length N, created and filled upstream.
+populate(input, 0.5)
+copy_flags()
+scan.run(offsets)
+# offsets[i] is now the 1-based output position of element i if it was selected.
+```
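+
+To complete the pattern, a final pass scatters the survivors into a packed destination. The sketch below is not part of `qd.algorithms` — `compacted` and `scatter` are hypothetical names, and `input` is the same ndarray passed to `populate` above:
+
+```python
+compacted = qd.field(qd.f32, shape=(N,))  # worst case: every element survives
+
+@qd.kernel
+def scatter(input: qd.types.NDArray[qd.f32, 1]) -> None:
+    for i in range(N):
+        if flags[i] == 1:
+            # offsets[i] is the 1-based output slot, so subtract 1 for 0-based.
+            compacted[offsets[i] - 1] = input[i]
+
+scatter(input)
+```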
+
+The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-based) to decide where to write surviving elements. This is the textbook scan-based select / compact pattern; the only Quadrants-specific note is the `i32`-only restriction.
+
+## Performance and portability notes
+
+- **`parallel_sort` is `O(N log² N)`**, which is fine up to a few thousand elements and noticeably slower than a radix sort beyond that. The algorithm is stable in *control flow* but not stable in element ordering — important for code that compacts after sorting.
+- **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** This is the most-often-hit limitation in cross-platform code. If you need `u32` / `i64` / `f32` / `f64` keys or AMDGPU / Metal coverage, you currently have to compose your own scan from `qd.simt.subgroup.inclusive_add` (per-block) plus an outer kernel that handles the multi-block roll-up — or use the qipc Onesweep / decoupled-look-back scan if you have a hard dependency on it.
+- **Allocate the executor once, run it many times.** The internal auxiliary buffer is sized to the constructor's `length`; constructing per call wastes allocation traffic. Each `.run()` is a sequence of kernel launches; the cost is `O(N / cache_line)` global memory bandwidth, not user-visible launch overhead.
+- **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels.
+
+## Related
+
+- `qd.simt.block.*` — the block-scope reductions and shared-memory primitives that algorithm kernels build on.
+- `qd.simt.subgroup.*` — `inclusive_add` and friends, what the per-block scan stage of `PrefixSumExecutor` actually calls.
+- `qd.simt.grid.memfence()` — the grid-scope memory fence that decoupled-look-back scans (a more efficient alternative to Kogge-Stone) require.
+- [parallelization](parallelization.md) — broader synchronization story, including how `qd.algorithms` operations compose with hand-written kernels.
diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md
index f783e84264..64c5bcd6ca 100644
--- a/docs/source/user_guide/index.md
+++ b/docs/source/user_guide/index.md
@@ -51,6 +51,14 @@ subgroup
 tile16
 ```
 
+```{toctree}
+:caption: Algorithms
+:maxdepth: 1
+:titlesonly:
+
+algorithms
+```
+
 ```{toctree}
 :caption: Performance
 :maxdepth: 1

From c4cf5ff558f73f2a0810fe4a9430e7cf75429b74 Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 11:02:20 -0700
Subject: [PATCH 02/13] [Docs] algorithms: drop namespace-size editorial
 sentence

---
 docs/source/user_guide/algorithms.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md
index 3f312e7a85..570ec940bc 100644
--- a/docs/source/user_guide/algorithms.md
+++ b/docs/source/user_guide/algorithms.md
@@ -2,8 +2,6 @@ # Algorithms
 
 Device-wide algorithms — primitives that consume and produce whole arrays, executed as one or more kernel launches under the hood. They sit one tier above grid-scope synchronization: they *use* block, subgroup, and grid primitives internally and expose a high-level entry point that the user calls from host (Python) code, not from inside a kernel.
 
-The current `qd.algorithms` namespace is small — two ops, both built around prefix-style internal kernels. The set is expected to grow over time (radix sort, reduce-by-key, select / compact).
- ## What's available | Op | What it does | CUDA | AMDGPU | Vulkan | Metal | From 7f2b1e3084fa9980046c2689dd79a00e7fab4102 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:03:08 -0700 Subject: [PATCH 03/13] [Docs] algorithms: drop qipc and CUB references --- docs/source/user_guide/algorithms.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index 570ec940bc..6c26708e19 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -32,7 +32,7 @@ qd.algorithms.parallel_sort(keys, vals) - **Key dtype.** Whatever the key field's dtype is, as long as `<` is meaningful for it (integer and float types). - **Stability.** Odd-even merge sort is *not* a stable sort — equal keys may be reordered relative to one another. If stability matters, encode tiebreakers into the keys (e.g. pack the original index into the low bits). - **Memory.** Strictly in-place — no auxiliary buffers from the caller's perspective. -- **Performance characteristic.** Beats radix-style sorts for small N (roughly N ≲ 4K), losing to them at large N. For million-element key sets prefer a radix sort (qipc / CUB-style); for thousands or tens of thousands, this is a fine choice. +- **Performance characteristic.** Beats radix-style sorts for small N (roughly N ≲ 4K), losing to them at large N. For million-element key sets prefer a radix sort; for thousands or tens of thousands, this is a fine choice. ### `qd.algorithms.PrefixSumExecutor` @@ -59,7 +59,7 @@ Constraints: - **Dtype:** `qd.i32` only. Calling with any other dtype raises `RuntimeError("Only qd.i32 type is supported for prefix sum.")`. - **Inclusive only.** No exclusive variant exposed. To convert to exclusive, post-process: `exclusive[i] = inclusive[i] - input_original[i]`. -- **Backend coverage.** CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at trace time. Cross-platform code that needs a portable exclusive scan currently has to roll its own (see, for example, the qipc Onesweep / decoupled-look-back scan). +- **Backend coverage.** CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at trace time. Cross-platform code that needs a portable exclusive scan currently has to roll its own (e.g. an Onesweep / decoupled-look-back scan). The implementation is a Kogge-Stone hierarchical scan: per-block inclusive scan on shared memory, then a small recursive scan over per-block totals, then a uniform-add pass to propagate back. This means the executor reuses the underlying buffer across calls, which is why it's a class (allocate once, run many times) rather than a free function. @@ -113,7 +113,7 @@ The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-b ## Performance and portability notes - **`parallel_sort` is `O(N log² N)`**, which is fine up to a few thousand elements and noticeably slower than a radix sort beyond that. The algorithm is stable in *control flow* but not stable in element ordering — important for code that compacts after sorting. -- **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** This is the most-often-hit limitation in cross-platform code. 
If you need `u32` / `i64` / `f32` / `f64` keys or AMDGPU / Metal coverage, you currently have to compose your own scan from `qd.simt.subgroup.inclusive_add` (per-block) plus an outer kernel that handles the multi-block roll-up — or use the qipc Onesweep / decoupled-look-back scan if you have a hard dependency on it. +- **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** This is the most-often-hit limitation in cross-platform code. If you need `u32` / `i64` / `f32` / `f64` keys or AMDGPU / Metal coverage, you currently have to compose your own scan from `qd.simt.subgroup.inclusive_add` (per-block) plus an outer kernel that handles the multi-block roll-up. - **Allocate the executor once, run it many times.** The internal auxiliary buffer is sized to the constructor's `length`; constructing per call wastes allocation traffic. Each `.run()` is a sequence of kernel launches; the cost is `O(N / cache_line)` global memory bandwidth, not user-visible launch overhead. - **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. From c44a3d93dfc7335aee0b9ef52d2c6751b115c476 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:03:33 -0700 Subject: [PATCH 04/13] [Docs] algorithms: trim performance bullet to crossover summary --- docs/source/user_guide/algorithms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index 6c26708e19..d7572d826a 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -32,7 +32,7 @@ qd.algorithms.parallel_sort(keys, vals) - **Key dtype.** Whatever the key field's dtype is, as long as `<` is meaningful for it (integer and float types). - **Stability.** Odd-even merge sort is *not* a stable sort — equal keys may be reordered relative to one another. If stability matters, encode tiebreakers into the keys (e.g. pack the original index into the low bits). - **Memory.** Strictly in-place — no auxiliary buffers from the caller's perspective. -- **Performance characteristic.** Beats radix-style sorts for small N (roughly N ≲ 4K), losing to them at large N. For million-element key sets prefer a radix sort; for thousands or tens of thousands, this is a fine choice. +- **Performance characteristic.** Beats radix-style sorts for small N (roughly N ≲ 4K), losing to them at large N. ### `qd.algorithms.PrefixSumExecutor` From 94bc159972ee9d75169cd1f7626392fc04daf1bb Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:04:12 -0700 Subject: [PATCH 05/13] [Docs] algorithms: drop large-N regression aside from performance bullet --- docs/source/user_guide/algorithms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index d7572d826a..cf3e08ca40 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -32,7 +32,7 @@ qd.algorithms.parallel_sort(keys, vals) - **Key dtype.** Whatever the key field's dtype is, as long as `<` is meaningful for it (integer and float types). - **Stability.** Odd-even merge sort is *not* a stable sort — equal keys may be reordered relative to one another. If stability matters, encode tiebreakers into the keys (e.g. pack the original index into the low bits). 
- **Memory.** Strictly in-place — no auxiliary buffers from the caller's perspective. -- **Performance characteristic.** Beats radix-style sorts for small N (roughly N ≲ 4K), losing to them at large N. +- **Performance characteristic.** Beats radix-style sorts for small N (roughly N ≲ 4K). ### `qd.algorithms.PrefixSumExecutor` From ea48dc1aea55e0d61e7040862ec5606f5d0b12b0 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:05:57 -0700 Subject: [PATCH 06/13] [Docs] algorithms: drop roll-your-own-scan aside from backend bullet --- docs/source/user_guide/algorithms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index cf3e08ca40..82d4057f2e 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -59,7 +59,7 @@ Constraints: - **Dtype:** `qd.i32` only. Calling with any other dtype raises `RuntimeError("Only qd.i32 type is supported for prefix sum.")`. - **Inclusive only.** No exclusive variant exposed. To convert to exclusive, post-process: `exclusive[i] = inclusive[i] - input_original[i]`. -- **Backend coverage.** CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at trace time. Cross-platform code that needs a portable exclusive scan currently has to roll its own (e.g. an Onesweep / decoupled-look-back scan). +- **Backend coverage.** CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at trace time. The implementation is a Kogge-Stone hierarchical scan: per-block inclusive scan on shared memory, then a small recursive scan over per-block totals, then a uniform-add pass to propagate back. This means the executor reuses the underlying buffer across calls, which is why it's a class (allocate once, run many times) rather than a free function. From df7f4336afe80a12114fd875890e330898514a6e Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:07:04 -0700 Subject: [PATCH 07/13] [Docs] algorithms: trim parallel_sort bullet to complexity statement --- docs/source/user_guide/algorithms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index 82d4057f2e..c6d3648044 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -112,7 +112,7 @@ The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-b ## Performance and portability notes -- **`parallel_sort` is `O(N log² N)`**, which is fine up to a few thousand elements and noticeably slower than a radix sort beyond that. The algorithm is stable in *control flow* but not stable in element ordering — important for code that compacts after sorting. +- **`parallel_sort` is `O(N log² N)`**. - **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** This is the most-often-hit limitation in cross-platform code. If you need `u32` / `i64` / `f32` / `f64` keys or AMDGPU / Metal coverage, you currently have to compose your own scan from `qd.simt.subgroup.inclusive_add` (per-block) plus an outer kernel that handles the multi-block roll-up. - **Allocate the executor once, run it many times.** The internal auxiliary buffer is sized to the constructor's `length`; constructing per call wastes allocation traffic. 
Each `.run()` is a sequence of kernel launches; the cost is `O(N / cache_line)` global memory bandwidth, not user-visible launch overhead. - **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. From 440b67365fa22cbd468098621fe256cf7d87ab38 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:07:24 -0700 Subject: [PATCH 08/13] [Docs] algorithms: trim PrefixSumExecutor bullet to limitation statement --- docs/source/user_guide/algorithms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index c6d3648044..ce6005a782 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -113,7 +113,7 @@ The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-b ## Performance and portability notes - **`parallel_sort` is `O(N log² N)`**. -- **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** This is the most-often-hit limitation in cross-platform code. If you need `u32` / `i64` / `f32` / `f64` keys or AMDGPU / Metal coverage, you currently have to compose your own scan from `qd.simt.subgroup.inclusive_add` (per-block) plus an outer kernel that handles the multi-block roll-up. +- **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** - **Allocate the executor once, run it many times.** The internal auxiliary buffer is sized to the constructor's `length`; constructing per call wastes allocation traffic. Each `.run()` is a sequence of kernel launches; the cost is `O(N / cache_line)` global memory bandwidth, not user-visible launch overhead. - **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. From 1f1fa100483965673e7dcabfac6bfcd9c2f8d849 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:07:44 -0700 Subject: [PATCH 09/13] [Docs] algorithms: remove executor-lifecycle bullet --- docs/source/user_guide/algorithms.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index ce6005a782..6a001014d1 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -114,7 +114,6 @@ The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-b - **`parallel_sort` is `O(N log² N)`**. - **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** -- **Allocate the executor once, run it many times.** The internal auxiliary buffer is sized to the constructor's `length`; constructing per call wastes allocation traffic. Each `.run()` is a sequence of kernel launches; the cost is `O(N / cache_line)` global memory bandwidth, not user-visible launch overhead. - **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. 
## Related From e76b31c6ba53e81442feb4597ed7f0e54167e5d1 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:08:03 -0700 Subject: [PATCH 10/13] [Docs] algorithms: remove PrefixSumExecutor backend-coverage bullet (already covered upstream) --- docs/source/user_guide/algorithms.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index 6a001014d1..2da3d35b75 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -113,7 +113,6 @@ The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-b ## Performance and portability notes - **`parallel_sort` is `O(N log² N)`**. -- **`PrefixSumExecutor` is `i32`-only and CUDA + Vulkan-only.** - **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. ## Related From 5e595c1a08387fe9aab328789362f0b47b3fd43b Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:08:38 -0700 Subject: [PATCH 11/13] [Docs] algorithms: remove parallel_sort complexity bullet --- docs/source/user_guide/algorithms.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index 2da3d35b75..e833b143d8 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -112,7 +112,6 @@ The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-b ## Performance and portability notes -- **`parallel_sort` is `O(N log² N)`**. - **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. ## Related From 0fd07ef95e30648454cde17b34709c5c5909e47c Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:09:45 -0700 Subject: [PATCH 12/13] [Docs] algorithms: move 'no fence required' note into PrefixSumExecutor section --- docs/source/user_guide/algorithms.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index e833b143d8..6d06d6f42a 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -63,6 +63,8 @@ Constraints: The implementation is a Kogge-Stone hierarchical scan: per-block inclusive scan on shared memory, then a small recursive scan over per-block totals, then a uniform-add pass to propagate back. This means the executor reuses the underlying buffer across calls, which is why it's a class (allocate once, run many times) rather than a free function. +No explicit fence is required between a kernel that writes the input and the subsequent `.run()` call. `.run()` launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. + ## Examples ### Sort indices by per-element key @@ -110,10 +112,6 @@ scan.run(offsets) The compact-output kernel reads `offsets[i]` (or `offsets[i] - flags[i]` for 0-based) to decide where to write surviving elements. This is the textbook scan-based select / compact pattern; the only Quadrants-specific note is the `i32`-only restriction. 
-## Performance and portability notes - -- **No fence required between `populate` and `scan.run`.** Each algorithm kernel launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels. - ## Related - `qd.simt.block.*` — the block-scope reductions and shared-memory primitives that algorithm kernels build on. From e86afb655dc139ffa33d46dd35c9bc8603e2374d Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 11:31:03 -0700 Subject: [PATCH 13/13] [Docs] algorithms: fix parallel_sort ndarray claim and PrefixSumExecutor length contract (PR #642 review) --- docs/source/user_guide/algorithms.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md index 6d06d6f42a..a99f47155d 100644 --- a/docs/source/user_guide/algorithms.md +++ b/docs/source/user_guide/algorithms.md @@ -15,7 +15,7 @@ Device-wide algorithms — primitives that consume and produce whole arrays, exe ### `qd.algorithms.parallel_sort(keys, values=None)` -In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `field` or `ndarray`. +In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `qd.field` — `parallel_sort` reaches into `snode.ptr.offset` internally, so `ndarray` is **not** supported and will fail at trace time with an `AttributeError`. ```python keys = qd.field(qd.i32, shape=(N,)) @@ -48,11 +48,11 @@ psum.run(arr) Constructor: -- `length: int` — the maximum number of elements the executor can scan. Internally allocates an auxiliary `qd.field(i32, shape=padded_length)` sized to the Kogge-Stone hierarchy (block size = 64). +- `length: int` — the **fixed** number of elements the executor will scan on every `.run()` call. Internally allocates an auxiliary `qd.field(i32, shape=padded_length)` sized to the Kogge-Stone hierarchy (block size = 64). `run(input_arr)`: -- `input_arr` must be a 1-D `qd.field(qd.i32, shape=(L,))` with `L <= length`. +- `input_arr` must be a 1-D `qd.field(qd.i32, shape=(length,))` — its length must match the constructor's `length` exactly. `run()` always blits `length` elements between `input_arr` and the internal buffer; passing a shorter field results in out-of-bounds reads / writes (no runtime check today). - Returns nothing; `input_arr` is overwritten with the scan result. Constraints: