From 5981faaea4937b917affd336f32852b3d8aa6e16 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 00:46:32 -0700 Subject: [PATCH 01/26] [Docs] Add user-guide page for matrix decompositions and solvers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Documents the per-thread small-matrix decomposition / solver entry points (qd.svd, qd.sym_eig, qd.polar_decompose, qd.eig, qd.solve). Companion to matrix_vector.md, which covers element-wise / arithmetic operations; this page covers the numerical-algorithm layer (Jacobi sweeps, Gauss elimination, etc.) sitting on top of those. Covers per-op semantics, the supported size set (2 / 3 only — 2 only for general qd.eig), the optional dt argument, sign-convention caveats for SVD / polar (det(R) handling), and what is missing (N>3 symmetric EVD, larger inverse, atomic_cas, 3×3 general eig). Worked examples include closest-rotation extraction (ARAP), 3×3 make_spd via sym_eig + clamp, per-thread linear solve, and 2×2 polar shape-matching. Adds decompositions.md to the Core concepts toctree (next to matrix_vector). --- docs/source/user_guide/decompositions.md | 154 +++++++++++++++++++++++ docs/source/user_guide/index.md | 1 + 2 files changed, 155 insertions(+) create mode 100644 docs/source/user_guide/decompositions.md diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md new file mode 100644 index 0000000000..39c166979b --- /dev/null +++ b/docs/source/user_guide/decompositions.md @@ -0,0 +1,154 @@ +# Matrix decompositions and solvers + +Small fixed-size matrix decompositions and linear solvers — the kinds of operations a thread does on a 2×2 or 3×3 matrix held in registers. 
They are a different category from the element-wise / arithmetic matrix operations covered in [matrix_vector](matrix_vector.md): each entry point here implements a *numerical algorithm* (Jacobi sweeps, Gauss elimination, Givens rotations) rather than a single closed-form formula. + +All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd.eig`, `qd.solve`) and are intended to be called from inside a `@qd.kernel` or `@qd.func`. They run per thread — each thread independently decomposes its own matrix. + +## What's available + +| Op | Operates on | Sizes | Returns | +|---------------------------------|------------------------|--------|--------------------------------------------------------| +| `qd.svd(A)` | square real matrix | 2, 3 | `(U, S, V)` such that `A = U @ S @ V.transpose()` | +| `qd.sym_eig(A)` | symmetric real matrix | 2, 3 | `(eigenvalues, eigenvectors)` (real) | +| `qd.polar_decompose(A)` | square real matrix | 2, 3 | `(R, S)` such that `A = R @ S`, `R` orthogonal, `S` SPD | +| `qd.eig(A)` | square real matrix | 2 | `(eigenvalues, eigenvectors)` (complex, packed) | +| `qd.solve(A, b)` | square `A` + vector `b`| 2, 3 | `x` such that `A @ x = b` | + +A few patterns to note: + +- **Sizes are fixed.** Calling any of these on a matrix outside the supported size set raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. +- **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant; passing `dt=qd.f32` explicitly is also fine if you want to be loud about it. 
+- **Output dimensions match the input dimensions.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs. +- **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output. + +## Semantics + +### `qd.svd(A, dt=None)` + +Singular value decomposition — produces `(U, S, V)` such that `A = U @ S @ V.transpose()`, with: + +- `U` orthogonal (`U @ U.transpose() = I`). +- `V` orthogonal. +- `S` diagonal, with non-negative singular values. + +Sizes 2 and 3 use closed-form / Jacobi-style implementations specialized per dimension. Sign convention for `U` and `V` is the implementation's natural one and is **not** guaranteed to enforce `det(U) = det(V) = +1`; if you depend on a particular handedness (e.g. for an ARAP rotation `R = U @ V.transpose()`), check it explicitly and flip a column if needed. + +Singular values come out in implementation order — confirm via test if you depend on `S[0,0] >= S[1,1] >= S[2,2]`. + +### `qd.sym_eig(A, dt=None)` + +Symmetric eigendecomposition — for a real symmetric `A`, returns `(eigenvalues, eigenvectors)` with: + +- `eigenvalues`: a `Vector(n)` of real eigenvalues. +- `eigenvectors`: a `Matrix(n, n)` whose columns are the corresponding orthonormal eigenvectors. + +For 3×3, eigenvalues come out sorted in ascending order (the implementation explicitly sorts at the end). The 2×2 path does not perform an explicit sort — verify with a test if you need a particular order in 2D. + +`A` is *assumed* symmetric; the implementation does not symmetrize first. If your matrix is only approximately symmetric (e.g. accumulated floating-point error), explicitly compute `(A + A.transpose()) * 0.5` before calling. + +### `qd.polar_decompose(A, dt=None)` + +Polar decomposition — produces `(R, S)` such that `A = R @ S`, with: + +- `R` orthogonal (the closest-rotation factor, modulo handedness). 
+- `S` symmetric positive semi-definite (the stretch factor). + +Built on top of SVD: `A = U @ Σ @ Vᵀ` ⇒ `R = U @ Vᵀ`, `S = V @ Σ @ Vᵀ`. The same caveats about sign convention as `qd.svd` apply — `R` is the closest orthogonal factor to `A` but is not guaranteed to be a proper rotation (positive determinant) without a manual fix-up. + +### `qd.eig(A, dt=None)` + +General eigendecomposition for **2×2 only**. Returns: + +- `eigenvalues`: a `Matrix(n, 2)` where row `i` is `(real_part, imaginary_part)` of the i-th eigenvalue. +- `eigenvectors`: a `Matrix(n*2, n)` where each column is an eigenvector with each entry expanded into two scalars (real, imaginary). For a 2×2 input, the eigenvector matrix is 4×2. + +Eigenvalues of a real 2×2 matrix come in complex-conjugate pairs when the discriminant is negative; the packed layout keeps the function single-return-shape. Calling with `n=3` raises — for 3×3 general (non-symmetric) eigendecomposition, there is no Quadrants entry point today; the canonical path is `qd.sym_eig` if your matrix happens to be symmetric. + +### `qd.solve(A, b, dt=None)` + +Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns the solution vector `x`. + +- Sizes 2 and 3. +- The implementation asserts `A.n == A.m` and `A.m == b.n`. +- Singular `A` triggers an `assert` failure (`"Matrix is singular in linear solve."`); the assert is active by default and is your only signal of singularity. +- Use this for one-off small solves; for systems that decompose once and back-solve many times, currently you have to bake the LU yourself (or just call `qd.solve` per b — for 2×2 / 3×3 the gain from caching the LU is negligible compared to the launch). 
+ +## Examples + +### Closest rotation to a 3×3 matrix (ARAP) + +```python +@qd.func +def closest_rotation(A: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64): + U, S, V = qd.svd(A, dt=qd.f64) + R = U @ V.transpose() + if R.determinant() < 0.0: + V[:, 2] *= -1.0 + R = U @ V.transpose() + return R +``` + +The `det(R) < 0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation — a standard ARAP / IPC trick. + +### Project to symmetric positive semi-definite (`make_spd`) + +```python +@qd.func +def make_spd_3x3(H: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64): + eigvals, Q = qd.sym_eig(H, dt=qd.f64) + for i in qd.static(range(3)): + if eigvals[i] < 0.0: + eigvals[i] = 0.0 + Lambda = qd.Matrix.zero(qd.f64, 3, 3) + for i in qd.static(range(3)): + Lambda[i, i] = eigvals[i] + return Q @ Lambda @ Q.transpose() +``` + +Used by IPC-style methods to project an indefinite Hessian to its closest SPD approximation per element. Note the size cap — only 3×3 today, since `qd.sym_eig` itself caps at 3×3. + +### Per-thread linear solve + +```python +@qd.kernel +def solve_each(A_field: qd.types.NDArray[qd.types.matrix(2, 2, qd.f32), 1], + b_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1], + x_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1]) -> None: + for i in range(A_field.shape[0]): + x_field[i] = qd.solve(A_field[i], b_field[i]) +``` + +Each thread does an independent Gauss elimination on its own 2×2 system. For larger systems a CG / PCG iteration over the whole array is the standard Quadrants pattern; `qd.solve` is for the per-element case. 
+ +### 2×2 polar decomposition for shape matching + +```python +@qd.func +def shape_match(A: qd.types.matrix(2, 2, qd.f32)) -> qd.types.matrix(2, 2, qd.f32): + R, _ = qd.polar_decompose(A) + return R +``` + +The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `‖R - A‖_F` — the building block of position-based dynamics shape-matching. + +## Sizes, performance, portability + +- **Size cap is the dominant constraint.** For matrices outside the supported sizes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). qipc maintains a Jacobi EVD that handles sizes up to 12; it is being considered for upstreaming but is not in `quadrants` today. +- **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine for these sizes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. +- **Numerical conditioning.** All implementations use `f32` by default, which is fine for graphics / soft-body simulation but not always sufficient for stiff IPC-style problems. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. +- **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. + +## What's missing + +For reference / planning purposes, the gaps users most often hit: + +- **Larger SVD / EVD.** Sizes > 3×3 are unsupported. 
For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach (qipc has one); for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach. +- **`Matrix.inverse` size cap.** Documented in [matrix_vector](matrix_vector.md): the closed-form cofactor inverse caps at 4×4. Larger inverses need an LU or Cholesky factorisation, neither of which is exposed today. +- **`atomic_cas`.** Unrelated to decompositions, but the building block for spinlocks and lock-free dictionaries; not exposed in Python — `qd.atomic_*` covers add / sub / mul / min / max / and / or / xor but does not currently include compare-and-swap. +- **3×3 `qd.eig`.** Only the 2×2 general (non-symmetric) eigendecomposition is provided. For 3×3, use `qd.sym_eig` if your matrix is symmetric. + +## Related + +- [matrix_vector](matrix_vector.md) — element-wise / arithmetic matrix operations (`@`, `inverse`, `determinant`, `transpose`, dot, cross, norm, `outer_product`). Covers the operations whose implementation is a single closed-form formula. +- `qd.math.*` — scalar math helpers (`qd.math.dot`, `qd.math.cross`, `qd.math.length`, etc.) that operate on vectors / matrices but are not decompositions. +- `qd.linalg.*` — sparse-matrix linear algebra (CG, sparse solvers); a different namespace and a different problem class. 
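For the "write your own Jacobi sweep" path mentioned under *What's missing*, here is a minimal host-side NumPy sketch of a cyclic Jacobi symmetric EVD — an illustration of the algorithm, not a Quadrants API; a device version would hold the matrix in registers and unroll the sweeps with `qd.static`:

```python
import numpy as np

def jacobi_sym_eig(A, sweeps=10, eps=1e-12):
    """Cyclic Jacobi eigendecomposition of a real symmetric matrix.

    Each pass annihilates every off-diagonal entry in turn with a
    Givens rotation; returns (eigenvalues ascending, eigenvectors
    in columns), like qd.sym_eig's 3x3 convention.
    """
    n = A.shape[0]
    A = A.copy().astype(float)
    V = np.eye(n)
    for _ in range(sweeps):
        off = np.sqrt(np.sum(A**2) - np.sum(np.diag(A)**2))
        if off < eps:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < eps:
                    continue
                # Rotation angle that zeroes A[p, q] after J^T A J.
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q] = s
                J[q, p] = -s
                A = J.T @ A @ J
                V = V @ J
    order = np.argsort(np.diag(A))
    return np.diag(A)[order], V[:, order]

# A 4x4 symmetric matrix -- one dimension past the built-in 3x3 cap.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
S = 0.5 * (M + M.T)
vals, vecs = jacobi_sym_eig(S)
assert np.allclose(S @ vecs, vecs * vals, atol=1e-8)
```

The same sweep structure scales to the ~12×12 symmetric case discussed above; beyond that, register pressure makes the per-thread approach impractical, matching the compile-time caveat in the performance section.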
diff --git a/docs/source/user_guide/index.md b/docs/source/user_guide/index.md index f783e84264..971c52cf67 100644 --- a/docs/source/user_guide/index.md +++ b/docs/source/user_guide/index.md @@ -17,6 +17,7 @@ supported_systems tensor_types scalar_tensors matrix_vector +decompositions tensor compound_types buffer_view From e6c6029d8084ba164b51d2c4e96954de98d68bfa Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 08:30:05 -0700 Subject: [PATCH 02/26] [Docs] decompositions: drop qipc references --- docs/source/user_guide/decompositions.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 39c166979b..a319cc6dcc 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -133,7 +133,7 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` ## Sizes, performance, portability -- **Size cap is the dominant constraint.** For matrices outside the supported sizes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). qipc maintains a Jacobi EVD that handles sizes up to 12; it is being considered for upstreaming but is not in `quadrants` today. +- **Size cap is the dominant constraint.** For matrices outside the supported sizes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine for these sizes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. 
- **Numerical conditioning.** All implementations use `f32` by default, which is fine for graphics / soft-body simulation but not always sufficient for stiff IPC-style problems. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. @@ -142,7 +142,7 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` For reference / planning purposes, the gaps users most often hit: -- **Larger SVD / EVD.** Sizes > 3×3 are unsupported. For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach (qipc has one); for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach. +- **Larger SVD / EVD.** Sizes > 3×3 are unsupported. For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach; for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach. - **`Matrix.inverse` size cap.** Documented in [matrix_vector](matrix_vector.md): the closed-form cofactor inverse caps at 4×4. Larger inverses need an LU or Cholesky factorisation, neither of which is exposed today. - **`atomic_cas`.** Unrelated to decompositions, but the building block for spinlocks and lock-free dictionaries; not exposed in Python — `qd.atomic_*` covers add / sub / mul / min / max / and / or / xor but does not currently include compare-and-swap. - **3×3 `qd.eig`.** Only the 2×2 general (non-symmetric) eigendecomposition is provided. For 3×3, use `qd.sym_eig` if your matrix is symmetric. 
From faa7fbbec22cde33d07a889efd09165940bbc1b7 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 08:33:23 -0700 Subject: [PATCH 03/26] [Docs] decompositions: replace ambiguous 'size' with 'dimension' (the matrix dim n) --- docs/source/user_guide/decompositions.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index a319cc6dcc..2ac6ca89d6 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -6,7 +6,7 @@ All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd ## What's available -| Op | Operates on | Sizes | Returns | +| Op | Operates on | Dim n | Returns | |---------------------------------|------------------------|--------|--------------------------------------------------------| | `qd.svd(A)` | square real matrix | 2, 3 | `(U, S, V)` such that `A = U @ S @ V.transpose()` | | `qd.sym_eig(A)` | symmetric real matrix | 2, 3 | `(eigenvalues, eigenvectors)` (real) | @@ -14,9 +14,9 @@ All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd | `qd.eig(A)` | square real matrix | 2 | `(eigenvalues, eigenvectors)` (complex, packed) | | `qd.solve(A, b)` | square `A` + vector `b`| 2, 3 | `x` such that `A @ x = b` | -A few patterns to note: +The "Dim n" column is the matrix dimension — `n = 2` means a 2×2 input, `n = 3` means 3×3. A few patterns to note: -- **Sizes are fixed.** Calling any of these on a matrix outside the supported size set raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. 
+- **Dimensions are fixed.** Calling any of these on a matrix outside the supported dimensions raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. - **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant; passing `dt=qd.f32` explicitly is also fine if you want to be loud about it. - **Output dimensions match the input dimensions.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs. - **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output. @@ -31,7 +31,7 @@ Singular value decomposition — produces `(U, S, V)` such that `A = U @ S @ V.t - `V` orthogonal. - `S` diagonal, with non-negative singular values. -Sizes 2 and 3 use closed-form / Jacobi-style implementations specialized per dimension. Sign convention for `U` and `V` is the implementation's natural one and is **not** guaranteed to enforce `det(U) = det(V) = +1`; if you depend on a particular handedness (e.g. for an ARAP rotation `R = U @ V.transpose()`), check it explicitly and flip a column if needed. +Dimensions 2 and 3 use closed-form / Jacobi-style implementations specialized per dimension. Sign convention for `U` and `V` is the implementation's natural one and is **not** guaranteed to enforce `det(U) = det(V) = +1`; if you depend on a particular handedness (e.g. for an ARAP rotation `R = U @ V.transpose()`), check it explicitly and flip a column if needed. Singular values come out in implementation order — confirm via test if you depend on `S[0,0] >= S[1,1] >= S[2,2]`. 
@@ -68,7 +68,7 @@ Eigenvalues of a real 2×2 matrix come in complex-conjugate pairs when the discr Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns the solution vector `x`. -- Sizes 2 and 3. +- Dimensions 2 and 3. - The implementation asserts `A.n == A.m` and `A.m == b.n`. - Singular `A` triggers an `assert` failure (`"Matrix is singular in linear solve."`); the assert is active by default and is your only signal of singularity. - Use this for one-off small solves; for systems that decompose once and back-solve many times, currently you have to bake the LU yourself (or just call `qd.solve` per b — for 2×2 / 3×3 the gain from caching the LU is negligible compared to the launch). @@ -105,7 +105,7 @@ def make_spd_3x3(H: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f return Q @ Lambda @ Q.transpose() ``` -Used by IPC-style methods to project an indefinite Hessian to its closest SPD approximation per element. Note the size cap — only 3×3 today, since `qd.sym_eig` itself caps at 3×3. +Used by IPC-style methods to project an indefinite Hessian to its closest SPD approximation per element. Note the dimension cap — only 3×3 today, since `qd.sym_eig` itself caps at 3×3. ### Per-thread linear solve @@ -131,10 +131,10 @@ def shape_match(A: qd.types.matrix(2, 2, qd.f32)) -> qd.types.matrix(2, 2, qd.f3 The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `‖R - A‖_F` — the building block of position-based dynamics shape-matching. -## Sizes, performance, portability +## Dimensions, performance, portability -- **Size cap is the dominant constraint.** For matrices outside the supported sizes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). -- **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. 
Compile time is generally fine for these sizes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. +- **Dimension cap is the dominant constraint.** For matrices outside the supported dimensions, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). +- **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these dimensions; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. - **Numerical conditioning.** All implementations use `f32` by default, which is fine for graphics / soft-body simulation but not always sufficient for stiff IPC-style problems. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. @@ -142,8 +142,8 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` For reference / planning purposes, the gaps users most often hit: -- **Larger SVD / EVD.** Sizes > 3×3 are unsupported. For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach; for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach. -- **`Matrix.inverse` size cap.** Documented in [matrix_vector](matrix_vector.md): the closed-form cofactor inverse caps at 4×4. Larger inverses need an LU or Cholesky factorisation, neither of which is exposed today. +- **Larger SVD / EVD.** Dimensions above 3 are unsupported. 
For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach; for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach. +- **`Matrix.inverse` dimension cap.** Documented in [matrix_vector](matrix_vector.md): the closed-form cofactor inverse caps at 4×4. Larger inverses need an LU or Cholesky factorisation, neither of which is exposed today. - **`atomic_cas`.** Unrelated to decompositions, but the building block for spinlocks and lock-free dictionaries; not exposed in Python — `qd.atomic_*` covers add / sub / mul / min / max / and / or / xor but does not currently include compare-and-swap. - **3×3 `qd.eig`.** Only the 2×2 general (non-symmetric) eigendecomposition is provided. For 3×3, use `qd.sym_eig` if your matrix is symmetric. From 4a729de2188db346cd84c79f90a94415b9bf53c1 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 08:34:13 -0700 Subject: [PATCH 04/26] [Docs] decompositions: drop redundant 'fixed-size' qualifier in intro --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 2ac6ca89d6..f88560ff3e 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -1,6 +1,6 @@ # Matrix decompositions and solvers -Small fixed-size matrix decompositions and linear solvers — the kinds of operations a thread does on a 2×2 or 3×3 matrix held in registers. They are a different category from the element-wise / arithmetic matrix operations covered in [matrix_vector](matrix_vector.md): each entry point here implements a *numerical algorithm* (Jacobi sweeps, Gauss elimination, Givens rotations) rather than a single closed-form formula. +Small matrix decompositions and linear solvers — the kinds of operations a thread does on a 2×2 or 3×3 matrix held in registers. 
They are a different category from the element-wise / arithmetic matrix operations covered in [matrix_vector](matrix_vector.md): each entry point here implements a *numerical algorithm* (Jacobi sweeps, Gauss elimination, Givens rotations) rather than a single closed-form formula. All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd.eig`, `qd.solve`) and are intended to be called from inside a `@qd.kernel` or `@qd.func`. They run per thread — each thread independently decomposes its own matrix. From 91a183c36c2cf85d7cfb4f890caebbff5989a442 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:45:16 -0700 Subject: [PATCH 05/26] [Docs] decompositions: switch to unambiguous 'shape' (e.g. 2x2, 3x3) terminology --- docs/source/user_guide/decompositions.md | 36 ++++++++++++------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index f88560ff3e..ab3b02c6ac 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -6,19 +6,19 @@ All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd ## What's available -| Op | Operates on | Dim n | Returns | -|---------------------------------|------------------------|--------|--------------------------------------------------------| -| `qd.svd(A)` | square real matrix | 2, 3 | `(U, S, V)` such that `A = U @ S @ V.transpose()` | -| `qd.sym_eig(A)` | symmetric real matrix | 2, 3 | `(eigenvalues, eigenvectors)` (real) | -| `qd.polar_decompose(A)` | square real matrix | 2, 3 | `(R, S)` such that `A = R @ S`, `R` orthogonal, `S` SPD | -| `qd.eig(A)` | square real matrix | 2 | `(eigenvalues, eigenvectors)` (complex, packed) | -| `qd.solve(A, b)` | square `A` + vector `b`| 2, 3 | `x` such that `A @ x = b` | +| Op | Operates on | Shapes | Returns | 
+|---------------------------------|------------------------|---------------|--------------------------------------------------------| +| `qd.svd(A)` | square real matrix | 2×2, 3×3 | `(U, S, V)` such that `A = U @ S @ V.transpose()` | +| `qd.sym_eig(A)` | symmetric real matrix | 2×2, 3×3 | `(eigenvalues, eigenvectors)` (real) | +| `qd.polar_decompose(A)` | square real matrix | 2×2, 3×3 | `(R, S)` such that `A = R @ S`, `R` orthogonal, `S` SPD | +| `qd.eig(A)` | square real matrix | 2×2 | `(eigenvalues, eigenvectors)` (complex, packed) | +| `qd.solve(A, b)` | square `A` + vector `b`| 2×2, 3×3 | `x` such that `A @ x = b` | -The "Dim n" column is the matrix dimension — `n = 2` means a 2×2 input, `n = 3` means 3×3. A few patterns to note: +A few patterns to note: -- **Dimensions are fixed.** Calling any of these on a matrix outside the supported dimensions raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. +- **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. - **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant; passing `dt=qd.f32` explicitly is also fine if you want to be loud about it. -- **Output dimensions match the input dimensions.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs. 
+- **Output shape matches the input shape.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs. - **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output. ## Semantics @@ -31,7 +31,7 @@ Singular value decomposition — produces `(U, S, V)` such that `A = U @ S @ V.t - `V` orthogonal. - `S` diagonal, with non-negative singular values. -Dimensions 2 and 3 use closed-form / Jacobi-style implementations specialized per dimension. Sign convention for `U` and `V` is the implementation's natural one and is **not** guaranteed to enforce `det(U) = det(V) = +1`; if you depend on a particular handedness (e.g. for an ARAP rotation `R = U @ V.transpose()`), check it explicitly and flip a column if needed. +Shapes 2×2 and 3×3 use closed-form / Jacobi-style implementations specialized per shape. Sign convention for `U` and `V` is the implementation's natural one and is **not** guaranteed to enforce `det(U) = det(V) = +1`; if you depend on a particular handedness (e.g. for an ARAP rotation `R = U @ V.transpose()`), check it explicitly and flip a column if needed. Singular values come out in implementation order — confirm via test if you depend on `S[0,0] >= S[1,1] >= S[2,2]`. @@ -68,7 +68,7 @@ Eigenvalues of a real 2×2 matrix come in complex-conjugate pairs when the discr Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns the solution vector `x`. -- Dimensions 2 and 3. +- Shapes 2×2 and 3×3. - The implementation asserts `A.n == A.m` and `A.m == b.n`. - Singular `A` triggers an `assert` failure (`"Matrix is singular in linear solve."`); the assert is active by default and is your only signal of singularity. 
- Use this for one-off small solves; for systems that decompose once and back-solve many times, currently you have to bake the LU yourself (or just call `qd.solve` per b — for 2×2 / 3×3 the gain from caching the LU is negligible compared to the launch). @@ -105,7 +105,7 @@ def make_spd_3x3(H: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f return Q @ Lambda @ Q.transpose() ``` -Used by IPC-style methods to project an indefinite Hessian to its closest SPD approximation per element. Note the dimension cap — only 3×3 today, since `qd.sym_eig` itself caps at 3×3. +Used by IPC-style methods to project an indefinite Hessian to its closest SPD approximation per element. Note the shape cap — only 3×3 today, since `qd.sym_eig` itself caps at 3×3. ### Per-thread linear solve @@ -131,10 +131,10 @@ def shape_match(A: qd.types.matrix(2, 2, qd.f32)) -> qd.types.matrix(2, 2, qd.f3 The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `‖R - A‖_F` — the building block of position-based dynamics shape-matching. -## Dimensions, performance, portability +## Shapes, performance, portability -- **Dimension cap is the dominant constraint.** For matrices outside the supported dimensions, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). -- **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these dimensions; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. +- **Shape cap is the dominant constraint.** For matrices outside the supported shapes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). 
+- **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. - **Numerical conditioning.** All implementations use `f32` by default, which is fine for graphics / soft-body simulation but not always sufficient for stiff IPC-style problems. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. @@ -142,8 +142,8 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` For reference / planning purposes, the gaps users most often hit: -- **Larger SVD / EVD.** Dimensions above 3 are unsupported. For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach; for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach. -- **`Matrix.inverse` dimension cap.** Documented in [matrix_vector](matrix_vector.md): the closed-form cofactor inverse caps at 4×4. Larger inverses need an LU or Cholesky factorisation, neither of which is exposed today. +- **Larger SVD / EVD.** Shapes above 3×3 are unsupported. For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach; for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach. +- **`Matrix.inverse` shape cap.** Documented in [matrix_vector](matrix_vector.md): the closed-form cofactor inverse caps at 4×4. Larger inverses need an LU or Cholesky factorisation, neither of which is exposed today. 
- **`atomic_cas`.** Unrelated to decompositions, but the building block for spinlocks and lock-free dictionaries; not exposed in Python — `qd.atomic_*` covers add / sub / mul / min / max / and / or / xor but does not currently include compare-and-swap. - **3×3 `qd.eig`.** Only the 2×2 general (non-symmetric) eigendecomposition is provided. For 3×3, use `qd.sym_eig` if your matrix is symmetric. From aa980f1010196da5c1660a9e930e67c97e6c05b6 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:47:55 -0700 Subject: [PATCH 06/26] [Docs] decompositions: FIXME note on misleading '2D/3D matrices' wording --- docs/source/user_guide/decompositions.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index ab3b02c6ac..0d50123a9f 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -17,6 +17,7 @@ All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd A few patterns to note: - **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. + - **FIXME (message wording):** these exception strings are misleading — "2D matrix" / "3D matrix" conventionally means "rank-2 / rank-3 tensor" (any matrix is rank-2), but here the intent is "matrix of shape 2×2 / 3×3". They should be updated to e.g. `"SVD only supports 2×2 and 3×3 matrices."`. Tracked separately; this page reproduces the messages as they are emitted today. - **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. 
Pass `dt=qd.f64` for the high-precision variant; passing `dt=qd.f32` explicitly is also fine if you want to be loud about it. - **Output shape matches the input shape.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs. - **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output. From 0d9ea6027c5ff610253882461d86ca7ed403a070 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:49:10 -0700 Subject: [PATCH 07/26] [Docs] decompositions: drop 'tracked separately' from FIXME note --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 0d50123a9f..b4fc29fb23 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -17,7 +17,7 @@ All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd A few patterns to note: - **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. - - **FIXME (message wording):** these exception strings are misleading — "2D matrix" / "3D matrix" conventionally means "rank-2 / rank-3 tensor" (any matrix is rank-2), but here the intent is "matrix of shape 2×2 / 3×3". They should be updated to e.g. `"SVD only supports 2×2 and 3×3 matrices."`. Tracked separately; this page reproduces the messages as they are emitted today. 
+ - **FIXME (message wording):** these exception strings are misleading — "2D matrix" / "3D matrix" conventionally means "rank-2 / rank-3 tensor" (any matrix is rank-2), but here the intent is "matrix of shape 2×2 / 3×3". They should be updated to e.g. `"SVD only supports 2×2 and 3×3 matrices."`. This page reproduces the messages as they are emitted today. - **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant; passing `dt=qd.f32` explicitly is also fine if you want to be loud about it. - **Output shape matches the input shape.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs. - **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output. From d7598efff3b78d4ed5b651967937ded46625cc57 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:49:53 -0700 Subject: [PATCH 08/26] [Docs] decompositions: drop chatty 'be loud about it' aside --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index b4fc29fb23..6fa3bf6460 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -18,7 +18,7 @@ A few patterns to note: - **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box. 
- **FIXME (message wording):** these exception strings are misleading — "2D matrix" / "3D matrix" conventionally means "rank-2 / rank-3 tensor" (any matrix is rank-2), but here the intent is "matrix of shape 2×2 / 3×3". They should be updated to e.g. `"SVD only supports 2×2 and 3×3 matrices."`. This page reproduces the messages as they are emitted today. -- **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant; passing `dt=qd.f32` explicitly is also fine if you want to be loud about it. +- **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant. - **Output shape matches the input shape.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs. - **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output. From aab8ac84cd00dc6a541cc70ad1bddd3679984ff8 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:52:53 -0700 Subject: [PATCH 09/26] [Docs] decompositions: state SVD ordering precisely (2x2 sorted, 3x3 not) --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 6fa3bf6460..0ffa799bb7 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -34,7 +34,7 @@ Singular value decomposition — produces `(U, S, V)` such that `A = U @ S @ V.t Shapes 2×2 and 3×3 use closed-form / Jacobi-style implementations specialized per shape. 
Sign convention for `U` and `V` is the implementation's natural one and is **not** guaranteed to enforce `det(U) = det(V) = +1`; if you depend on a particular handedness (e.g. for an ARAP rotation `R = U @ V.transpose()`), check it explicitly and flip a column if needed. -Singular values come out in implementation order — confirm via test if you depend on `S[0,0] >= S[1,1] >= S[2,2]`. +The 2×2 path returns singular values sorted descending (`S[0,0] >= S[1,1]`). The 3×3 path does **not** sort — singular values come out in whatever order the Sifakis algorithm produces them. If you depend on `S[0,0] >= S[1,1] >= S[2,2]` for 3×3, sort explicitly. ### `qd.sym_eig(A, dt=None)` From 0de99f305c5752efcd3ac3c6b270f71438c55c21 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:54:23 -0700 Subject: [PATCH 10/26] [Docs] decompositions: FIXME on SVD sort-consistency 2x2 vs 3x3 --- docs/source/user_guide/decompositions.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 0ffa799bb7..03e54cb255 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -36,6 +36,8 @@ Shapes 2×2 and 3×3 use closed-form / Jacobi-style implementations specialized The 2×2 path returns singular values sorted descending (`S[0,0] >= S[1,1]`). The 3×3 path does **not** sort — singular values come out in whatever order the Sifakis algorithm produces them. If you depend on `S[0,0] >= S[1,1] >= S[2,2]` for 3×3, sort explicitly. +**FIXME (sort consistency):** the 2×2 / 3×3 split is an inconsistency in the API — both shapes should either sort descending or both should leave the order to the algorithm. The 2×2 swap is essentially free; making 3×3 sort would cost a few comparisons and column swaps. Pick one and apply it across both shapes (and update this paragraph accordingly). 
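For reference, the "sort explicitly" step described above can be sketched in numpy (this is an illustration of the math only, outside the Quadrants API; `sort_svd_desc` is a hypothetical helper name): permuting the columns of `U` and `V` together with the entries of `s` preserves the reconstruction.

```python
import numpy as np

def sort_svd_desc(U, s, V):
    """Permute SVD factors so singular values come out descending.

    Reordering columns of U and V together with the entries of s
    leaves the reconstruction A = U @ diag(s) @ V.T unchanged.
    """
    order = np.argsort(-s)
    return U[:, order], s[order], V[:, order]
```

The same column-permutation trick is what a sorted 3×3 path would do internally, at the cost of a few comparisons and swaps.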
+ ### `qd.sym_eig(A, dt=None)` Symmetric eigendecomposition — for a real symmetric `A`, returns `(eigenvalues, eigenvectors)` with: From a259f734d44377a96d7ff0134c49d46570cdb465 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:54:47 -0700 Subject: [PATCH 11/26] [Docs] decompositions: mention boolean template option in sort-consistency FIXME --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 03e54cb255..3688cb1f04 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -36,7 +36,7 @@ Shapes 2×2 and 3×3 use closed-form / Jacobi-style implementations specialized The 2×2 path returns singular values sorted descending (`S[0,0] >= S[1,1]`). The 3×3 path does **not** sort — singular values come out in whatever order the Sifakis algorithm produces them. If you depend on `S[0,0] >= S[1,1] >= S[2,2]` for 3×3, sort explicitly. -**FIXME (sort consistency):** the 2×2 / 3×3 split is an inconsistency in the API — both shapes should either sort descending or both should leave the order to the algorithm. The 2×2 swap is essentially free; making 3×3 sort would cost a few comparisons and column swaps. Pick one and apply it across both shapes (and update this paragraph accordingly). +**FIXME (sort consistency):** the 2×2 / 3×3 split is an inconsistency in the API — both shapes should either sort descending or both should leave the order to the algorithm (or parametrize it with a boolean template parameter, e.g. `qd.svd(A, sorted=True)`). The 2×2 swap is essentially free; making 3×3 sort would cost a few comparisons and column swaps. Pick one and apply it across both shapes (and update this paragraph accordingly). 
### `qd.sym_eig(A, dt=None)` From 95037e2caaa5e92b0f0d06dc9faed4967278f88e Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:56:14 -0700 Subject: [PATCH 12/26] [Docs] decompositions: FIXME on sym_eig sort consistency (also vs svd direction) --- docs/source/user_guide/decompositions.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 3688cb1f04..cbd91be22e 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -47,6 +47,8 @@ Symmetric eigendecomposition — for a real symmetric `A`, returns `(eigenvalues For 3×3, eigenvalues come out sorted in ascending order (the implementation explicitly sorts at the end). The 2×2 path does not perform an explicit sort — verify with a test if you need a particular order in 2D. +**FIXME (sort consistency):** same kind of inconsistency as `qd.svd`, with the additional twist that the sort directions disagree across ops — `qd.sym_eig` 3×3 sorts *ascending*, while `qd.svd` 2×2 sorts *descending*. Both shapes of `qd.sym_eig` should sort the same way, and ideally `qd.sym_eig` and `qd.svd` should agree on a direction (or parametrize via a boolean template parameter, e.g. `qd.sym_eig(A, sorted=True)`). Pick one and apply it across both shapes and across both ops (and update this paragraph accordingly). + `A` is *assumed* symmetric; the implementation does not symmetrize first. If your matrix is only approximately symmetric (e.g. accumulated floating-point error), explicitly compute `(A + A.transpose()) * 0.5` before calling. 
### `qd.polar_decompose(A, dt=None)` From 43a6d4541d55a94f2f03defbfae7c90dc21e75b2 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 09:57:28 -0700 Subject: [PATCH 13/26] [Docs] decompositions: qualify qd.solve singular-A assert as debug-only --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index cbd91be22e..335ac598db 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -75,7 +75,7 @@ Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns - Shapes 2×2 and 3×3. - The implementation asserts `A.n == A.m` and `A.m == b.n`. -- Singular `A` triggers an `assert` failure (`"Matrix is singular in linear solve."`); the assert is active by default and is your only signal of singularity. +- Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with `qd.init(debug=True)` (see [debug](debug.md)) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case. - Use this for one-off small solves; for systems that decompose once and back-solve many times, currently you have to bake the LU yourself (or just call `qd.solve` per b — for 2×2 / 3×3 the gain from caching the LU is negligible compared to the launch). 
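Since the in-kernel singularity assert fires only under `debug=True`, the determinant pre-check suggested above can be sketched as a host-side numpy analogue (`guarded_solve` is a hypothetical helper, not part of the qd API):

```python
import numpy as np

def guarded_solve(A, b, tol=1e-12):
    """Solve A @ x = b with an explicit singularity pre-check.

    Analogue of testing abs(A.determinant()) against a tolerance before
    calling qd.solve. A fixed tolerance is scale-dependent; scale-aware
    code would compare against tol times a norm of A.
    """
    if abs(np.linalg.det(A)) < tol:
        return None                # caller decides how to handle a singular system
    return np.linalg.solve(A, b)
```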
## Examples From 12173047ece722ff97d68e77216c3e20eeb9a2a3 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:00:33 -0700 Subject: [PATCH 14/26] [Docs] decompositions: split qd.solve usage bullet into three clearer ones --- docs/source/user_guide/decompositions.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 335ac598db..d34da9f6c5 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -76,7 +76,9 @@ Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns - Shapes 2×2 and 3×3. - The implementation asserts `A.n == A.m` and `A.m == b.n`. - Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with `qd.init(debug=True)` (see [debug](debug.md)) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case. -- Use this for one-off small solves; for systems that decompose once and back-solve many times, currently you have to bake the LU yourself (or just call `qd.solve` per b — for 2×2 / 3×3 the gain from caching the LU is negligible compared to the launch). +- Intended for one-off small solves: each call factorises `A` from scratch and back-substitutes for the given `b`. +- There is no separate LU-factor / reuse API. If you want to amortise the factorisation across many right-hand-sides, you have to write the factorisation yourself in user code. 
+- At 2×2 / 3×3, the factorisation cost is dominated by kernel-launch overhead, so calling `qd.solve` once per `b` is usually fine in practice — the missing factor-reuse API matters more for larger shapes (which `qd.solve` does not support today anyway). ## Examples From 5370a357955533d93fbf198346f92cbdde0ee58a Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:01:20 -0700 Subject: [PATCH 15/26] [Docs] decompositions: drop LU-reuse and launch-overhead asides for qd.solve --- docs/source/user_guide/decompositions.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index d34da9f6c5..78b851932a 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -77,8 +77,6 @@ Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns - The implementation asserts `A.n == A.m` and `A.m == b.n`. - Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with `qd.init(debug=True)` (see [debug](debug.md)) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case. - Intended for one-off small solves: each call factorises `A` from scratch and back-substitutes for the given `b`. -- There is no separate LU-factor / reuse API. If you want to amortise the factorisation across many right-hand-sides, you have to write the factorisation yourself in user code. 
-- At 2×2 / 3×3, the factorisation cost is dominated by kernel-launch overhead, so calling `qd.solve` once per `b` is usually fine in practice — the missing factor-reuse API matters more for larger shapes (which `qd.solve` does not support today anyway). ## Examples From 5c0244f8d3e2324d201d831da4ed9f6de4791a72 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:02:19 -0700 Subject: [PATCH 16/26] [Docs] decompositions: align ARAP commentary with code form (R.determinant() < 0.0) --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 78b851932a..f0d637956a 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -93,7 +93,7 @@ def closest_rotation(A: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, return R ``` -The `det(R) < 0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation — a standard ARAP / IPC trick. +The `R.determinant() < 0.0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation — a standard ARAP / IPC trick. ### Project to symmetric positive semi-definite (`make_spd`) From a3722b04df00f29fcdc0d927559cddccd019771d Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:02:45 -0700 Subject: [PATCH 17/26] [Docs] decompositions: drop dangling 'one-off' qualifier on qd.solve --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index f0d637956a..dd0d336b8a 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -76,7 +76,7 @@ Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns - Shapes 2×2 and 3×3. 
- The implementation asserts `A.n == A.m` and `A.m == b.n`. - Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with `qd.init(debug=True)` (see [debug](debug.md)) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case. -- Intended for one-off small solves: each call factorises `A` from scratch and back-substitutes for the given `b`. +- Each call factorises `A` from scratch and back-substitutes for the given `b`. ## Examples From 2871e6ac600e0ca21acb81aaa0d79a57fa0f740b Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:03:28 -0700 Subject: [PATCH 18/26] [Docs] decompositions: drop IPC references --- docs/source/user_guide/decompositions.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index dd0d336b8a..d89a6f65ef 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -93,7 +93,7 @@ def closest_rotation(A: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, return R ``` -The `R.determinant() < 0.0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation — a standard ARAP / IPC trick. +The `R.determinant() < 0.0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation — a standard ARAP trick. 
### Project to symmetric positive semi-definite (`make_spd`) @@ -110,7 +110,7 @@ def make_spd_3x3(H: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f return Q @ Lambda @ Q.transpose() ``` -Used by IPC-style methods to project an indefinite Hessian to its closest SPD approximation per element. Note the shape cap — only 3×3 today, since `qd.sym_eig` itself caps at 3×3. +Used to project an indefinite Hessian to its closest SPD approximation per element. Note the shape cap — only 3×3 today, since `qd.sym_eig` itself caps at 3×3. ### Per-thread linear solve @@ -140,7 +140,7 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` - **Shape cap is the dominant constraint.** For matrices outside the supported shapes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. -- **Numerical conditioning.** All implementations use `f32` by default, which is fine for graphics / soft-body simulation but not always sufficient for stiff IPC-style problems. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. +- **Numerical conditioning.** All implementations use `f32` by default, which is fine for graphics / soft-body simulation but not always sufficient for stiff problems. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. 
Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. ## What's missing From 6243d97c85d0c1d83c700792b8512935111898cd Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:04:19 -0700 Subject: [PATCH 19/26] [Docs] decompositions: drop graphics/soft-body aside in conditioning bullet --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index d89a6f65ef..e7f21d0ef9 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -140,7 +140,7 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` - **Shape cap is the dominant constraint.** For matrices outside the supported shapes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. -- **Numerical conditioning.** All implementations use `f32` by default, which is fine for graphics / soft-body simulation but not always sufficient for stiff problems. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. +- **Numerical conditioning.** All implementations use `f32` by default. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. 
- **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. ## What's missing From 85004029cf179bde4ac6587237365af848653f9a Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:04:39 -0700 Subject: [PATCH 20/26] [Docs] decompositions: drop trailing 'standard ARAP trick' aside --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index e7f21d0ef9..8f297ca318 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -93,7 +93,7 @@ def closest_rotation(A: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, return R ``` -The `R.determinant() < 0.0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation — a standard ARAP trick. +The `R.determinant() < 0.0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation. 
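The handedness fix can be sketched in numpy (an illustration of the math, not the qd code on this page): when `det(U @ V.T) < 0`, negating the column of `U` paired with the smallest singular value (the last one, in numpy's descending order) turns the reflection into the nearest proper rotation.

```python
import numpy as np

def closest_rotation(A):
    """Nearest rotation to A via SVD, with the reflection fix described above."""
    U, _, Vt = np.linalg.svd(A)
    if np.linalg.det(U @ Vt) < 0.0:
        U[:, -1] *= -1.0           # flip the smallest singular direction
    return U @ Vt
```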
### Project to symmetric positive semi-definite (`make_spd`) From 2bda0a249326586512d8a503fba82913283aab01 Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:04:57 -0700 Subject: [PATCH 21/26] [Docs] decompositions: trim conditioning bullet to f32-default statement --- docs/source/user_guide/decompositions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 8f297ca318..41bc252e66 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -140,7 +140,7 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` - **Shape cap is the dominant constraint.** For matrices outside the supported shapes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. -- **Numerical conditioning.** All implementations use `f32` by default. Pass `dt=qd.f64` whenever conditioning matters; the cost on modern GPUs is a constant factor, not order-of-magnitude. +- **Numerical conditioning.** All implementations use `f32` by default. - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. 
## What's missing From 9d6098929dc43b065e12911d6faa4b94f344356f Mon Sep 17 00:00:00 2001 From: Hugh Perkins Date: Thu, 7 May 2026 10:06:26 -0700 Subject: [PATCH 22/26] [Docs] decompositions: remove 'Shape cap is the dominant constraint' bullet --- docs/source/user_guide/decompositions.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md index 41bc252e66..efd9e60067 100644 --- a/docs/source/user_guide/decompositions.md +++ b/docs/source/user_guide/decompositions.md @@ -138,7 +138,6 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises ` ## Shapes, performance, portability -- **Shape cap is the dominant constraint.** For matrices outside the supported shapes, you currently have to write your own Jacobi sweep (for symmetric EVD up to ~12×12) or LU / Cholesky (for general inverse / solve). - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly. - **Numerical conditioning.** All implementations use `f32` by default. - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices. 
From 0e33770c0954a3859bdaf86d7c732b4527d191 Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 10:06:47 -0700
Subject: [PATCH 23/26] [Docs] decompositions: remove 'Numerical conditioning' bullet

---
 docs/source/user_guide/decompositions.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md
index efd9e60067..e6db8443bc 100644
--- a/docs/source/user_guide/decompositions.md
+++ b/docs/source/user_guide/decompositions.md
@@ -139,7 +139,6 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `
 ## Shapes, performance, portability

 - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly.
-- **Numerical conditioning.** All implementations use `f32` by default.
 - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices.
 ## What's missing

From dffa6065d9eab36dd69a930aef176a53113c16c3 Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 10:08:38 -0700
Subject: [PATCH 24/26] [Docs] decompositions: drop overconfident bit-exactness aside

---
 docs/source/user_guide/decompositions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md
index e6db8443bc..8e702115fb 100644
--- a/docs/source/user_guide/decompositions.md
+++ b/docs/source/user_guide/decompositions.md
@@ -139,7 +139,7 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `
 ## Shapes, performance, portability

 - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly.
-- **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. Numerical behaviour is bit-exact across backends only for `f64`; `f32` may differ in the last bit because of fused-multiply-add ordering choices.
+- **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split.
 ## What's missing

From c5a50ca3cd053c5c3d0bfdcbfa799871cbbb9cbb Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 10:08:59 -0700
Subject: [PATCH 25/26] [Docs] decompositions: soften 'would not be' to 'may not be' for compile-time claim

---
 docs/source/user_guide/decompositions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md
index 8e702115fb..44e40fc998 100644
--- a/docs/source/user_guide/decompositions.md
+++ b/docs/source/user_guide/decompositions.md
@@ -138,7 +138,7 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `

 ## Shapes, performance, portability

-- **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap would not be — register pressure plus unrolling explode quickly.
+- **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap may not be — register pressure plus unrolling explode quickly.
 - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split.
 ## What's missing

From 3d0fa3ef1dd7dfb7005c8f36b0b4ce633aaf8b17 Mon Sep 17 00:00:00 2001
From: Hugh Perkins
Date: Thu, 7 May 2026 10:09:21 -0700
Subject: [PATCH 26/26] [Docs] decompositions: remove 'What's missing' section

---
 docs/source/user_guide/decompositions.md | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/docs/source/user_guide/decompositions.md b/docs/source/user_guide/decompositions.md
index 44e40fc998..e5cd3ba44a 100644
--- a/docs/source/user_guide/decompositions.md
+++ b/docs/source/user_guide/decompositions.md
@@ -141,15 +141,6 @@ The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `
 - **Compile time.** Each call is unrolled per thread, so a kernel that calls `qd.svd` on a 3×3 matrix per element compiles a moderately large block of straight-line code per thread. Compile time is generally fine at these shapes; matrices larger than the cap may not be — register pressure plus unrolling explode quickly.
 - **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split.

-## What's missing
-
-For reference / planning purposes, the gaps users most often hit:
-
-- **Larger SVD / EVD.** Shapes above 3×3 are unsupported. For symmetric EVD up to ~12×12, a Jacobi sweep is the standard approach; for general SVD, a one-sided Jacobi or QR-with-shifts is the standard approach.
-- **`Matrix.inverse` shape cap.** Documented in [matrix_vector](matrix_vector.md): the closed-form cofactor inverse caps at 4×4. Larger inverses need an LU or Cholesky factorisation, neither of which is exposed today.
-- **`atomic_cas`.** Unrelated to decompositions, but the building block for spinlocks and lock-free dictionaries; not exposed in Python — `qd.atomic_*` covers add / sub / mul / min / max / and / or / xor but does not currently include compare-and-swap.
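The section removed in this patch noted that inverses and solves past the closed-form cap want an LU or Cholesky factorisation. A NumPy sketch of LU with partial pivoting, the standard route for a general square system; the name `lu_solve` is illustrative and not a Quadrants entry point:

```python
import numpy as np

def lu_solve(A, b):
    """Solve A @ x = b via in-place LU with partial pivoting.
    After elimination, A holds U on and above the diagonal and the
    L multipliers (unit diagonal implied) below it."""
    A = np.array(A, dtype=np.float64)
    b = np.array(b, dtype=np.float64)
    n = A.shape[0]
    for k in range(n):
        # Pivot: bring the largest remaining entry in column k to row k.
        p = k + int(np.argmax(np.abs(A[k:, k])))
        if p != k:
            A[[k, p]] = A[[p, k]]
            b[[k, p]] = b[[p, k]]
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    for k in range(n):                    # forward substitution: L y = P b
        b[k] -= A[k, :k] @ b[:k]
    for k in range(n - 1, -1, -1):        # back substitution: U x = y
        b[k] = (b[k] - A[k, k + 1:] @ b[k + 1:]) / A[k, k]
    return b

# 4x4 system, one size above the qd.solve cap.
A = np.array([[2.0, 1.0, 1.0, 0.0],
              [4.0, 3.0, 3.0, 1.0],
              [8.0, 7.0, 9.0, 5.0],
              [6.0, 7.0, 9.0, 8.0]])
x = lu_solve(A, np.array([1.0, 2.0, 3.0, 4.0]))
```

Factor once, substitute twice: the same factored `A` can be reused against many right-hand sides, which is the main practical advantage over recomputing a cofactor inverse.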
-- **3×3 `qd.eig`.** Only the 2×2 general (non-symmetric) eigendecomposition is provided. For 3×3, use `qd.sym_eig` if your matrix is symmetric.
-
 ## Related

 - [matrix_vector](matrix_vector.md) — element-wise / arithmetic matrix operations (`@`, `inverse`, `determinant`, `transpose`, dot, cross, norm, `outer_product`). Covers the operations whose implementation is a single closed-form formula.
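The det(R) sign-convention caveat flagged in the patch-01 summary reduces to one determinant check. A NumPy sketch of closest-rotation extraction (the ARAP-style use of the polar factor), illustrative rather than the in-kernel `qd.polar_decompose` implementation:

```python
import numpy as np

def closest_rotation(A):
    """Rotation factor of the polar decomposition A = R @ S, from the
    SVD A = U @ diag(s) @ Vt: R = U @ Vt, with the last column of U
    sign-flipped when det(U @ Vt) < 0 so that det(R) = +1 (a proper
    rotation, never a reflection)."""
    U, s, Vt = np.linalg.svd(A)
    if np.linalg.det(U @ Vt) < 0:
        U[:, -1] = -U[:, -1]
    R = U @ Vt
    S = R.T @ A  # symmetric stretch factor; R @ S reassembles A
    return R, S

# A rotation composed with a pure stretch: polar recovers the rotation.
t = 0.3
R_true = np.array([[np.cos(t), -np.sin(t), 0.0],
                   [np.sin(t),  np.cos(t), 0.0],
                   [0.0,        0.0,       1.0]])
A = R_true @ np.diag([2.0, 1.5, 1.0])
R, S = closest_rotation(A)
```

Without the sign fix, an input with a reflection component would hand back `det(R) = -1`, and an ARAP or shape-matching solver fed that factor would happily mirror its rest shape.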