[Lang] Add qd.math.fma(...) single-rounding fused multiply-add #478
base: duburcqa/qd_precise
@@ -19,6 +19,7 @@ scalar_tensors
 matrix_vector
 compound_types
 static
+precise
 sub_functions
 parallelization
 ```
@@ -0,0 +1,129 @@
# qd.precise

`qd.precise(expr)` marks a floating-point expression as IEEE-strict. Every binary and unary FP op inside the wrapped subtree is evaluated in source order, with no reassociation, no FMA contraction, no approximate transcendental substitution, and no non-IEEE-exact algebraic simplification, regardless of the module-level `fast_math` setting. Folds that are IEEE-exact for every input (e.g. `a - 0 -> a`, `a > a -> false`) are still applied. It is equivalent to the `precise` keyword in MSL / HLSL.
## Why

Quadrants compiles kernels with `fast_math=True` by default. Under that mode the compiler is free to:

- **reassociate** FP ops (e.g. `(a + b) + c -> a + (b + c)`)
- **contract** mul-then-add into FMA
- **substitute approximations** for `sqrt`, `sin`, `cos`, `log`, `1/x`
- **algebraically simplify** (e.g. `a - a -> 0`, `a / a -> 1`)

This silently breaks compensated-arithmetic primitives (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators), whose correctness rests on expressions like `(a - aa) + (b - bb)` capturing the exact rounding error under IEEE arithmetic instead of being folded to zero. The traditional workaround is to flip the global `fast_math=False` switch, but that pays the performance cost everywhere, even when only a handful of lines need IEEE semantics.
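The reassociation hazard is easy to reproduce in plain Python, which always evaluates IEEE binary64 arithmetic strictly in source order, by spelling out the two parenthesizations by hand:

```python
# Floating-point addition is not associative: the two groupings of
# 0.1 + 0.2 + 0.3 round differently in IEEE binary64.
left = (0.1 + 0.2) + 0.3   # 0.30000000000000004 + 0.3
right = 0.1 + (0.2 + 0.3)  # 0.1 + 0.5
print(left, right, left == right)  # 0.6000000000000001 0.6 False
```

A fast-math compiler is free to pick either grouping, which is exactly what `qd.precise` forbids inside the wrapped subtree.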
`qd.precise(expr)` is the per-expression opt-in: keep `fast_math=True` globally for speed, and wrap only the expressions that must be IEEE-exact.
## Basic usage

```python
@qd.func
def fast_two_sum(a, b):
    s = qd.precise(a + b)
    e = qd.precise(b - (s - a))  # would fold to 0 under fast-math without precise
    return s, e
```
Any expression value can be wrapped. The wrapper returns the same expression with every reachable FP op tagged as precise; at codegen time the tagged ops opt out of the optimizations above.
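A plain-Python analogue shows the intended semantics. CPython never reassociates or folds float expressions, so it behaves like an interpreter with everything wrapped in `qd.precise`; this is a cross-check of the algorithm, not of the Quadrants codegen:

```python
def fast_two_sum(a, b):
    # Dekker's Fast2Sum: requires |a| >= |b|. CPython floats follow IEEE-754
    # binary64 with strict source-order evaluation, so the error term below
    # is never folded to zero.
    s = a + b
    e = b - (s - a)  # exact rounding error of s = fl(a + b)
    return s, e

s, e = fast_two_sum(1.0, 1e-17)
print(s, e)  # 1.0 1e-17: the addend lost in s is recovered exactly in e
```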
## What gets protected

`qd.precise` walks the wrapped expression tree and tags:

- Every `BinaryOp` (`+`, `-`, `*`, `/`, `%`, FP comparisons)
- Every `UnaryOp` (`neg`, `sqrt`, `sin`, `cos`, `log`, `exp`, `rsqrt`, casts, `bit_cast`, ...)

Bitwise operations (`bit_and`, `bit_or`, `bit_xor`, `bit_shl`, `bit_sar`) are integer-domain; the walker tags them for completeness, but the flag has no effect on integer IR.
The walker descends through `BinaryOp`, `UnaryOp`, and `TernaryOp` (e.g. `qd.select`) nodes, so wrapping a composite expression protects the inner ops too:

```python
# All four FP ops below are tagged: the outer sqrt, the inner add, and the two inner muls.
r = qd.precise(qd.sqrt(a * a + b * b))

# Ternary is traversed; both branches and the condition's inner ops are tagged.
r = qd.precise(qd.select(cond, a + b, a - b))
```
## Where the walker stops

`qd.precise` does not descend into:

- Loads (ndarray indexing, field access)
- Constants
- `qd.func` call sites
- Atomic ops
- Intermediate Python variable assignments (`tmp = a + b` wraps the RHS in an internal alloca, so `qd.precise(tmp)` sees the alloca, not the inner `BinaryOp`, and is a silent no-op)

Semantics inside a `qd.func` body are governed by that body's own ops. If you want IEEE-strict behavior inside a called function, wrap the relevant ops inside the function's body, not at the call site. Similarly, apply `qd.precise` directly to the expression rather than to a variable that was assigned earlier:
```python
@qd.func
def dot_precise(a, b, c, d):
    # Wrap inside the body, not at the caller.
    return qd.precise(a * b + c * d)

@qd.kernel
def k(...):
    r = dot_precise(x, y, z, w)  # inner ops are already precise
```
## Interaction with fast_math

`qd.precise` is a per-op override. It takes effect whether `fast_math` is on or off:

| Setting | Non-precise op | `qd.precise` op |
|---|---|---|
| `fast_math=True` | reassoc / contract / simplify | IEEE-strict |
| `fast_math=False` | IEEE-strict | IEEE-strict (redundant but harmless) |

The recommended workflow is to leave `fast_math=True` globally for throughput and reach for `qd.precise` only in the handful of spots that need IEEE behavior.
## Backend coverage

| Backend | Reassoc / contraction / algebraic folds | Approximate transcendentals (`sin` / `cos` / `log`) |
|---|---|---|
| CPU | LLVM FMF cleared | libc `sinf` is already correctly rounded |
| CUDA | LLVM FMF cleared | libdevice `__nv_<fn>f` (non-fast) selected |
| AMDGPU | LLVM FMF cleared | `__ocml_<fn>` already correctly rounded |
| Vulkan / MoltenVK | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) |
| Metal | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) |

On SPIR-V backends, the spec defines `NoContraction` as applying to arithmetic instructions only, and most consumers ignore it on the `OpExtInst` calls used for transcendentals. The decoration is still emitted (it is harmless and future-proofs against downstream toolchains that start honoring it), but the tag cannot guarantee correctness of `qd.precise(qd.sin(x))` / `qd.precise(qd.cos(x))` on Metal / Vulkan. The Vulkan precision requirement for GLSL.std.450 `Sin` / `Cos` is stated as 2^-11 absolute error, which for reference magnitudes below 1 amounts to thousands of ULPs, and drivers are within their rights to use that latitude. If you need correctly rounded sin / cos, use the CPU / CUDA / AMDGPU backends.
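To make "2^-11 absolute error is thousands of ULPs" concrete, here is the arithmetic in plain Python. The helper `f32_ulp` is illustrative only (not part of the Quadrants API) and assumes normal-range binary32 values:

```python
import math

def f32_ulp(x):
    # ULP of a normal binary32 value with the magnitude of x:
    # 2**(exponent - 23), where 23 is the number of mantissa bits.
    return 2.0 ** (math.floor(math.log2(abs(x))) - 23)

allowed = 2.0 ** -11  # Vulkan absolute-error bound for Sin/Cos
for ref in (0.5, 0.01):
    print(f"ref={ref}: {allowed / f32_ulp(ref):.0f} ULPs of latitude")
```

At a reference magnitude of 0.5 the bound already allows 2^13 = 8192 ULPs of error, and the latitude grows as the reference value shrinks.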
## Example: Dekker 2Sum

A textbook compensated addition that computes `s + e = a + b` exactly in f32:

```python
@qd.func
def two_sum(a, b):
    s = qd.precise(a + b)
    bb = qd.precise(s - a)
    aa = qd.precise(s - bb)
    e = qd.precise((a - aa) + (b - bb))
    return s, e
```

Without the `qd.precise` wrappers, under `fast_math=True` the compiler recognizes `(a - (s - (s - a))) + (b - (s - a))` as algebraically zero and folds `e` to `0`. The wrappers prevent that fold, and `s + e` reproduces `a + b` to full precision.
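The exactness claim can be checked mechanically in plain Python (IEEE-strict, binary64 here rather than f32) by comparing against rational arithmetic. This is a cross-check of the branch-free 2Sum algorithm itself, which is exact for any finite doubles absent overflow, not of the Quadrants codegen:

```python
from fractions import Fraction

def two_sum(a, b):
    # Branch-free 2Sum: s + e == a + b exactly, for any order of inputs.
    s = a + b
    bb = s - a
    aa = s - bb
    e = (a - aa) + (b - bb)
    return s, e

a, b = 0.1, 1e17
s, e = two_sum(a, b)
assert e != 0.0  # a genuine rounding error was captured
# Exactness: verify s + e == a + b in exact rational arithmetic.
assert Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b)
```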
## Caveats

- `qd.precise` is a scalar primitive. Passing a `Vector` / `Matrix` will raise. Apply it to individual components instead, or refactor your expression to use scalar ops inside.
- `qd.precise` does not mutate its input. It returns a fresh expression subtree with every reachable FP op tagged; the original expression is unchanged. Reusing the original elsewhere is safe and never inherits the tag.
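The clone-and-share behavior can be sketched with a toy expression IR. The names here (`Node`, `ARITY`, `tag_precise`) are illustrative stand-ins, not the real Quadrants internals:

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str
    args: tuple = ()
    precise: bool = False

# Walked node kinds; anything else (loads, constants, calls) is a leaf.
ARITY = {"+": "binary", "*": "binary", "sqrt": "unary", "select": "ternary"}

def tag_precise(n):
    # Clone every Binary/Unary/Ternary node with the precise flag set;
    # share leaves with the input by reference, exactly as described above.
    if n.op not in ARITY:
        return n
    return Node(n.op, tuple(tag_precise(a) for a in n.args), True)

a, b = Node("load"), Node("load")
expr = Node("sqrt", (Node("+", (Node("*", (a, a)), Node("*", (b, b)))),))
tagged = tag_precise(expr)
assert tagged.precise and not expr.precise  # original subtree untouched
assert tagged.args[0].args[0].args[0] is a  # leaf shared by reference
```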
## Companion: `qd.math.fma`

Compensated-arithmetic blocks typically need two things: (1) IEEE-strict ordering on ordinary ops, provided by `qd.precise`, and (2) a guaranteed single-rounding fused multiply-add for error-free transforms. The second is exposed separately as `qd.math.fma(a, b, c)`:

```python
# Two-product error-free transform (TwoProd).
# Returns p = round(a*b) and e such that a*b = p + e exactly.
p = a * b
e = qd.math.fma(a, b, -p)  # single rounding: exact residual of p
```

`qd.math.fma` is lowered to the native FMA on every backend: `llvm.fma` on CPU, `__nv_fma` / `__nv_fmaf` (libdevice) on CUDA, `GLSL.std.450 Fma` on Vulkan / Metal. Unlike relying on the compiler to contract `mul; add` into an FMA (which requires both fast-math flags that permit contraction *and* inputs that survive algebraic simplification), this is an explicit instruction, so the TwoProd residual, Fast2Sum, and double-single multiply patterns port over directly without per-backend contraction hints.

Backends without hardware FMA fall back to a regular mul-then-add and lose the single-rounding guarantee; on those targets, compensated algorithms should be rewritten in the Dekker / Veltkamp-split form.
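The Dekker / Veltkamp-split form mentioned above can be sketched in plain Python for binary64. It is exact absent overflow/underflow, and the split constant 2^27 + 1 is specific to the 53-bit significand:

```python
from fractions import Fraction

def split(a):
    # Veltkamp split: a == hi + lo, with hi and lo each fitting in 26 bits.
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    # Dekker's TwoProd without FMA: p + e == a * b exactly.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

p, e = two_prod(0.1, 0.3)
# Check exactness against rational arithmetic.
assert Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)
```

Where `qd.math.fma` is available, `e = qd.math.fma(a, b, -p)` replaces all of the split arithmetic with one instruction.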
@@ -95,6 +95,59 @@ def cast(obj, dtype):
    return expr.Expr(_qd_core.value_cast(expr.Expr(obj).ptr, dtype))
def precise(obj):
    """Mark a floating-point expression as IEEE-strict.

    Every binary and unary FP op inside ``obj`` is evaluated in source
    order with no reassociation, no FMA contraction, no approximate
    transcendental substitution, and no non-IEEE-exact algebraic
    simplification, regardless of the module-level :attr:`fast_math`
    setting. Folds that are IEEE-exact for every input (e.g.
    ``a - 0 -> a``, ``a > a -> false``) are still applied. This is
    equivalent to MSL's / HLSL's ``precise`` keyword and lets you keep
    ``fast_math=True`` globally while protecting compensated-arithmetic
    blocks (Dekker / Kahan 2Sum, Veltkamp split, etc.) from being folded
    away.

    Recursion descends through ``BinaryOp``, ``UnaryOp`` (cast, bit_cast,
    neg, sqrt, ...), and ``TernaryOp`` (select) wrappers so that inner
    binary ops are reached even when wrapped, e.g.
    ``qd.precise(qd.bit_cast(a + b, qd.f32))``. It stops at loads,
    constants, ``qd.func`` calls, ndarray accesses, etc.; semantics inside
    a ``qd.func`` body are governed by that body's own ops, so wrap calls
    separately if needed.

    Notes:
        * ``qd.precise`` does NOT mutate the input expression. It returns
          a fresh subtree that mirrors the input's structure, with every
          reachable Binary / Unary / Ternary node cloned and the new
          Binary / Unary nodes tagged as ``precise``. Non-walked nodes
          (loads, constants, ``qd.func`` calls, ndarray accesses, ...)
          are shared with the input by reference. The practical upshot:
          reusing the original (pre-``precise``) expression value
          elsewhere is safe; it will NOT pick up the tag.

    Args:
        obj: A scalar Quadrants expression (typically a chain of FP ops).

    Returns:
        A fresh expression subtree with every reachable binary and unary
        FP op tagged as ``precise``. The original ``obj`` is unchanged.

    Example::

        >>> @qd.func
        ... def fast_two_sum(a, b):
        ...     # Local IEEE region, survives even with fast_math=True.
        ...     s = qd.precise(a + b)
        ...     e = qd.precise(b - (s - a))
        ...     return s, e
    """
    if is_quadrants_class(obj):
        raise ValueError("Cannot apply precise on Quadrants classes")
    return expr.Expr(_qd_core.precise(expr.Expr(obj).ptr))
def bit_cast(obj, dtype):
    """Copy and cast a scalar to a specified data type with its underlying
    bits preserved. Must be called in quadrants scope.

@@ -1117,6 +1170,42 @@ def py_select(cond, x1, x2):
    return _ternary_operation(_qd_core.expr_select, py_select, cond, x1, x2)
def fma(a, b, c):
    """Fused multiply-add: return ``a * b + c`` computed as a single rounded
    operation.

    Unlike a plain ``a * b + c``, the intermediate product is not rounded:
    the exact value of ``a * b + c`` is rounded just once, at the end.
    This is the hardware FMA available on every modern FP pipeline (x86
    FMA3, ARM, Apple Silicon, NVIDIA ``fma``, AMD, RISC-V Zfa). It is
    exposed here primarily to let compensated-arithmetic primitives
    (TwoProd, Fast2Sum + FMA, double-single accumulators) get the
    single-rounding guarantee without relying on backend-specific FMF
    contraction.

    Classic two-product error-free transform::

        p = a * b
        e = qd.math.fma(a, b, -p)  # exact residual of p

    Each backend maps this to its native FMA (LLVM ``llvm.fma`` intrinsic
    on CPU, ``__nv_fma`` / ``__nv_fmaf`` on CUDA via libdevice,
    GLSL.std.450 ``Fma`` on Vulkan / Metal). Backends without hardware
    FMA fall back to a regular mul-then-add and lose the single-rounding
    guarantee.

    Args:
        a, b, c: Homogeneous FP scalars (``f16`` / ``f32`` / ``f64``).
            Integer inputs are rejected.

    Returns:
        ``a * b + c`` computed as a single rounded operation.
    """

    def py_fma(a, b, c):
        return a * b + c

    return _ternary_operation(_qd_core.expr_fma, py_fma, a, b, c)
def ifte(cond, x1, x2):
    """Evaluate and return `x1` if `cond` is true; otherwise evaluate and return `x2`. This operator
    guarantees short-circuit semantics: exactly one of `x1` or `x2` will be evaluated.

@@ -1535,4 +1624,5 @@ def min(*args):  # pylint: disable=W0622
     "select",
     "abs",
     "pow",
+    "precise",
 ]
Review comment on lines 1624 to 1628:
🟡 The `fma` docstring in `ops.py` shows `e = qd.fma(a, b, -p)` in its TwoProd example, but `fma` is not in `ops.__all__` and is therefore not re-exported to the top-level `qd` namespace. Copying this example verbatim raises `AttributeError: module 'quadrants' has no attribute 'fma'`. The correct public API is `qd.math.fma`, which is what the companion documentation in `precise.md` correctly uses.

Extended reasoning:

What the bug is and how it manifests: the `fma` function added in `python/quadrants/lang/ops.py` has a docstring that demonstrates the classic TwoProd error-free transform with `e = qd.fma(a, b, -p)`. A user reading the docstring and copying this example will immediately encounter `AttributeError: module 'quadrants' has no attribute 'fma'`.

The specific code path that triggers it: the top-level `qd` namespace is populated via `quadrants/__init__.py`, which does `from quadrants.lang import *`, which chains through `from quadrants.lang.ops import *`. Only names listed in `ops.__all__` (lines 1592-1628) reach the top level. The `fma` function defined in `ops.py` is deliberately NOT added to `ops.__all__`; it is an internal function wrapped by `qd.math.fma`.

Why existing code does not prevent it: the docstring was written using the wrong namespace prefix. The omission of `fma` from `ops.__all__` is correct and intentional (it keeps `qd.fma` from polluting the top-level namespace), but the docstring example was never updated to reflect that the public entry point is `qd.math.fma` rather than `qd.fma`.

What the impact would be: the bug is documentation-only. The runtime implementation is correct and `qd.math.fma` works as advertised, but any user reading the `ops.fma` docstring (via `help()`, an IDE, or generated API docs) and copying the example verbatim gets a runtime `AttributeError`. It also leaves the docstring inconsistent with `docs/source/user_guide/precise.md` (added in the same PR), which correctly uses `qd.math.fma(a, b, -p)`.

How to fix it: change the docstring example in `python/quadrants/lang/ops.py` (around line 1188) from `e = qd.fma(a, b, -p)` to `e = qd.math.fma(a, b, -p)`.