39 commits
a9942ed
[Lang] Add qd.precise(...) for per-op IEEE-strict FP
duburcqa Apr 13, 2026
8f02070
[Lang] qd.precise: cover UnaryOpStmt as well
duburcqa Apr 13, 2026
5438801
[Lang] qd.precise: address self-review feedback
duburcqa Apr 13, 2026
2669dc5
[Lang] qd.precise: gate alg_simp folds, cover sqrt, DRY CUDA libdevice
duburcqa Apr 13, 2026
48ce67b
[Lang] qd.precise: scrub non-ASCII from comments
duburcqa Apr 13, 2026
44c99f8
[Lang] qd.precise: replace -- with single - in comments
duburcqa Apr 13, 2026
97d7fb6
[Doc] User guide entry for qd.precise
duburcqa Apr 13, 2026
c5fbab6
[Lang] qd.precise: factor disable_fast_math helper, add Vector/select…
duburcqa Apr 13, 2026
a44208d
[Lang] qd.precise: propagate tag in 2*a rewrite, narrow zero-fold gat…
duburcqa Apr 13, 2026
b229ca5
[Lang] qd.precise: use make_typed to avoid downcast on synthesized 2*…
duburcqa Apr 13, 2026
d3eb88f
Cleanup doc.
duburcqa Apr 13, 2026
c59c542
[Lang] qd.precise: cover walker boundaries (qd.func, bit_cast, alias,…
duburcqa Apr 13, 2026
68b7c17
[Lang] qd.precise: fix docstring to mention unary FP ops and approxim…
duburcqa Apr 13, 2026
8f71366
[Lang] qd.precise: unify precise field comments via canonical referen…
duburcqa Apr 13, 2026
3d641ac
[Lang] qd.precise: propagate tag through synthesized stmts in alg_sim…
duburcqa Apr 13, 2026
9f121bb
[Lang] qd.precise: clear LLVM FMF on intermediate and pre-FPTrunc values
duburcqa Apr 13, 2026
94020dc
[Lang] qd.precise: SPIR-V inv forwards precise, inline maybe_no_contr…
duburcqa Apr 13, 2026
31cab48
[Lang] qd.precise: drop bit-ops-on-FP from doc; align __all__ positio…
duburcqa Apr 13, 2026
40aba10
[Lang] qd.precise: clone input subtree instead of mutating in-place; …
duburcqa Apr 13, 2026
3455440
[Lang] qd.precise: parametrize unary rounding test per op for per-op …
duburcqa Apr 13, 2026
8e98e16
[Lang] qd.precise: SPIR-V visit(BinaryOpStmt) tags FP transcendental …
duburcqa Apr 13, 2026
4425393
[Lang] qd.precise: reflow PR-introduced C++ comments to 120 cols
duburcqa Apr 13, 2026
a03801f
[Lang] qd.precise: propagate tag through cast in 2*a rewrite (and ref…
duburcqa Apr 13, 2026
bfadc9b
[Lang] qd.precise: CUDA emit_extra_unary clears FMF on libdevice call…
duburcqa Apr 13, 2026
e3203cd
[Lang] qd.precise: skip sin/cos unary-rounding on SPIR-V, drop redund…
duburcqa Apr 13, 2026
fbd6e40
[Lang] qd.precise: unary-rounding test restricts to LLVM via arch dec…
duburcqa Apr 13, 2026
4d4539d
[Lang] qd.precise: type_check propagates tag through implicit operand…
duburcqa Apr 13, 2026
ab0b576
[Lang] qd.precise: document SPIR-V arithmetic/post-hoc two-layer deco…
duburcqa Apr 13, 2026
e391298
[Lang] qd.precise: scalarize propagates tag onto per-element scalar B…
duburcqa Apr 13, 2026
bfdf37f
[Lang] qd.precise: SPIR-V decorates FP ops once via post-hoc block; d…
duburcqa Apr 13, 2026
4b85bf8
[Lang] qd.precise: idempotency test also covers AMDGPU (also an LLVM …
duburcqa Apr 13, 2026
e3a4795
[Lang] qd.precise: AMDGPU i32 pow clears FMF on __ocml_pow_f64 call b…
duburcqa Apr 13, 2026
7998277
[Lang] qd.precise: exclude cmp_gt/cmp_lt from precise guard (IEEE-fal…
duburcqa Apr 13, 2026
1837821
[Lang] qd.precise: iterative worklist in clone_and_tag_precise (O(1) …
duburcqa Apr 13, 2026
e412219
[Lang] qd.precise: precise_fp_add requires FP operand type; integer a…
duburcqa Apr 13, 2026
e9a55c1
[Lang] qd.precise: fix same_operation comment, document IdExpression …
duburcqa Apr 14, 2026
e98c703
[Lang] qd.precise: IR printer annotates [precise] on Unary/BinaryOpSt…
duburcqa Apr 14, 2026
bf232b6
[Lang] qd.precise: fix op count in precise.md example comment (three …
duburcqa Apr 14, 2026
b66c9d0
[Lang] Add qd.math.fma(...) single-rounding fused multiply-add
duburcqa Apr 13, 2026
1 change: 1 addition & 0 deletions docs/source/user_guide/index.md
@@ -19,6 +19,7 @@
scalar_tensors
matrix_vector
compound_types
static
precise
sub_functions
parallelization
```
129 changes: 129 additions & 0 deletions docs/source/user_guide/precise.md
@@ -0,0 +1,129 @@
# qd.precise

`qd.precise(expr)` marks a floating-point expression as IEEE-strict. Every binary and unary FP op inside the wrapped subtree is evaluated in source order with no reassociation, no FMA contraction, and no non-IEEE-exact algebraic simplification, regardless of the module-level `fast_math` setting. Folds that are IEEE-exact for every input (e.g. `a - 0 -> a`, `a > a -> false`) are still applied. It is equivalent to the `precise` keyword in MSL / HLSL.

## Why

Quadrants compiles kernels with `fast_math=True` by default. Under that mode the compiler is free to:

- **reassociate** FP ops (e.g. `(a + b) + c -> a + (b + c)`)
- **contract** mul-then-add into FMA
- **substitute approximations** for `sqrt`, `sin`, `cos`, `log`, `1/x`
- **algebraically simplify** (e.g. `a - a -> 0`, `a / a -> 1`)

This silently destroys compensated-arithmetic primitives (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators), whose correctness rests on residuals like `(a - aa) + (b - bb)` evaluating to the exact rounding error under IEEE arithmetic instead of folding to zero. The traditional workaround is to flip the global `fast_math=False` switch, but that pays the performance cost everywhere, even when only a handful of lines need IEEE semantics.

`qd.precise(expr)` is the per-expression opt-in: keep `fast_math=True` globally for speed, and wrap the expressions that must be IEEE-exact.

## Basic usage

```python
@qd.func
def fast_two_sum(a, b):
    s = qd.precise(a + b)
    e = qd.precise(b - (s - a))  # would fold to 0 under fast-math without precise
    return s, e
```
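Because plain Python floats are IEEE-754 doubles and CPython never reassociates, the same algorithm can be sanity-checked outside Quadrants. The sketch below is the plain-Python analogue of the kernel above (not the Quadrants API); it shows the behavior `qd.precise` preserves under `fast_math=True`:

```python
# Plain-Python Fast2Sum: CPython evaluates IEEE-754 doubles strictly, so the
# compensation term survives exactly as qd.precise guarantees in a kernel.
def fast_two_sum(a, b):
    # Precondition: |a| >= |b|, otherwise the error term is not exact.
    s = a + b
    e = b - (s - a)  # algebraically zero, numerically the rounding error of s
    return s, e

s, e = fast_two_sum(1.0, 2.0 ** -60)
# b is below half an ULP of 1.0, so s rounds to 1.0 and e recovers all of b.
assert s == 1.0 and e == 2.0 ** -60
```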

Any scalar expression value can be wrapped. The wrapper returns an equivalent expression with every reachable FP op tagged as precise; at codegen time the tagged ops opt out of the optimizations above.

## What gets protected

`qd.precise` walks the wrapped expression tree and tags:

- Every `BinaryOp` (`+`, `-`, `*`, `/`, `%`, FP comparisons)
- Every `UnaryOp` (`neg`, `sqrt`, `sin`, `cos`, `log`, `exp`, `rsqrt`, casts, bit_cast, ...)

Bitwise operations (`bit_and`, `bit_or`, `bit_xor`, `bit_shl`, `bit_sar`) are integer-domain; the walker tags them for completeness but the flag has no effect on integer IR.

The walker descends through `BinaryOp`, `UnaryOp`, and `TernaryOp` (e.g. `qd.select`) nodes, so wrapping a composite expression protects the inner ops too:

```python
# All four FP ops below are tagged: the outer sqrt, the inner add, and the two inner muls.
r = qd.precise(qd.sqrt(a * a + b * b))

# Ternary is traversed through; the two branches and the condition's inner ops are tagged.
r = qd.precise(qd.select(cond, a + b, a - b))
```

## Where the walker stops

`qd.precise` does not descend into:

- Loads (ndarray indexing, field access)
- Constants
- `qd.func` call sites
- Atomic ops
- Intermediate Python variable assignments (`tmp = a + b` stores the RHS through an internal alloca, so `qd.precise(tmp)` sees only the alloca load, not the inner `BinaryOp`, and is a silent no-op)

Semantics inside a `qd.func` body are governed by that body's own ops. If you want IEEE-strict behavior inside a called function, wrap the relevant ops inside the function's body, not at the call site. Similarly, wrap `qd.precise` directly around the expression rather than around a variable that was assigned earlier:

```python
@qd.func
def dot_precise(a, b, c, d):
    # Wrap inside the body, not at the caller.
    return qd.precise(a * b + c * d)

@qd.kernel
def k(...):
    r = dot_precise(x, y, z, w)  # inner ops are already precise
```

## Interaction with fast_math

`qd.precise` is a per-op override. It takes effect whether `fast_math` is on or off:

| Setting | Non-precise op | `qd.precise` op |
|---|---|---|
| `fast_math=True` | reassoc / contract / simplify | IEEE-strict |
| `fast_math=False` | IEEE-strict | IEEE-strict (redundant but harmless) |

The recommended workflow is to leave `fast_math=True` globally for throughput and reach for `qd.precise` only in the handful of spots that need IEEE behavior.

## Backend coverage

| Backend | Reassoc / contraction / algebraic folds | Approximate transcendentals (`sin` / `cos` / `log`) |
|---|---|---|
| CPU | LLVM FMF cleared | libc `sinf` used as-is (no approximate substitute) |
| CUDA | LLVM FMF cleared | libdevice `__nv_<fn>f` (non-fast) selected |
| AMDGPU | LLVM FMF cleared | `__ocml_<fn>` already correctly rounded |
| Vulkan / MoltenVK | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) |
| Metal | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) |

On SPIR-V backends, `NoContraction` is defined by the spec to apply to arithmetic instructions only; most consumers ignore it on the `OpExtInst` calls used for transcendentals. The decoration is still emitted (it is harmless and future-proofs against downstream toolchains that start honoring it), but correctness of `qd.precise(qd.sin(x))` / `qd.precise(qd.cos(x))` on Metal / Vulkan cannot be guaranteed through the tag: the Vulkan precision requirements for GLSL.std.450 `Sin`/`Cos` are stated as 2^-11 absolute error, which on inputs whose reference magnitude is smaller than 1 is thousands of ULPs, and drivers are within their rights to saturate that latitude. If you need correctly-rounded sin/cos, use the CPU / CUDA / AMDGPU backends.
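A quick back-of-envelope check of the "thousands of ULPs" figure, assuming f32's 24-bit significand:

```python
# One f32 ULP for values in [0.5, 1.0) is 2**-24; the Vulkan-permitted 2**-11
# absolute error for Sin/Cos is therefore 2**13 ULPs at that magnitude.
ulp_at_half = 2.0 ** -24
allowed_abs_error = 2.0 ** -11
ulps = allowed_abs_error / ulp_at_half
print(ulps)  # 8192.0
```

At smaller reference magnitudes the ULP shrinks while the absolute bound stays fixed, so the latitude only grows from there.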

## Example: Dekker 2Sum

A textbook compensated addition that computes `s + e = a + b` exactly in f32:

```python
@qd.func
def two_sum(a, b):
    s = qd.precise(a + b)
    bb = qd.precise(s - a)
    aa = qd.precise(s - bb)
    e = qd.precise((a - aa) + (b - bb))
    return s, e
```

Without the `qd.precise` wrappers, under `fast_math=True` the compiler recognizes `(a - (s - (s - a))) + (b - (s - a))` as algebraically zero and folds `e` to `0`. The wrappers prevent that fold, and `s + e` reproduces `a + b` to full precision.
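The exactness claim can be checked in plain Python (IEEE doubles, strict evaluation) using `fractions.Fraction`, which represents every float exactly. This is a sketch of the same sequence, not the Quadrants kernel:

```python
from fractions import Fraction

def two_sum(a, b):
    # Same op sequence as the kernel above; no |a| >= |b| precondition.
    s = a + b
    bb = s - a
    aa = s - bb
    e = (a - aa) + (b - bb)
    return s, e

s, e = two_sum(0.1, 0.2)
# Fraction converts floats exactly, so this verifies s + e == a + b with no error.
assert Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2)
assert e != 0.0  # the compensation term fast-math would have folded away
```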

## Caveats

- `qd.precise` is a scalar primitive. Passing a `Vector` / `Matrix` will raise. Apply it to individual components instead, or refactor your expression to use scalar ops inside.
- `qd.precise` does not mutate its input. It returns a fresh expression subtree with every reachable FP op tagged; the original expression is unchanged. Reusing the original elsewhere is safe and never inherits the tag.
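For intuition, the clone-and-tag behavior described above can be modeled in a few lines of plain Python. The `Node` class and `clone_and_tag_precise` here are hypothetical stand-ins, not the real Quadrants IR, and the sketch is recursive where the commit log says the real pass uses an iterative worklist:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Node:
    kind: str            # "binary" / "unary" / "ternary" or a leaf kind
    operands: tuple = ()
    precise: bool = False

WALKED = {"binary", "unary", "ternary"}

def clone_and_tag_precise(node):
    if node.kind not in WALKED:
        return node  # leaf (load, constant, call): shared by reference
    cloned = tuple(clone_and_tag_precise(op) for op in node.operands)
    # Ternary nodes are traversed, but only binary/unary ops carry the tag.
    tag = node.precise or node.kind in ("binary", "unary")
    return replace(node, operands=cloned, precise=tag)

root = Node("binary", (Node("const"), Node("unary", (Node("load"),))))
tagged = clone_and_tag_precise(root)
assert tagged.precise and not root.precise     # input left untouched
assert tagged.operands[0] is root.operands[0]  # leaf shared, not cloned
```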

## Companion: `qd.math.fma`

Compensated-arithmetic blocks typically need two things: (1) IEEE-strict ordering on ordinary ops (provided by `qd.precise`) and (2) a guaranteed single-rounding fused multiply-add for error-free transforms. The second is exposed separately as `qd.math.fma(a, b, c)`:

```python
# Two-product error-free transform (TwoProd).
# Returns p = round(a*b) and e such that a*b = p + e exactly.
p = a * b
e = qd.math.fma(a, b, -p) # single rounding: exact residual of p
```

`qd.math.fma` is lowered to the native FMA on every backend: `llvm.fma` on CPU, `__nv_fma` / `__nv_fmaf` (libdevice) on CUDA, `GLSL.std.450 Fma` on Vulkan / Metal. Unlike relying on the compiler to contract `mul; add` into FMA (which requires both fast-math flags to permit contraction *and* the inputs to survive algebraic simplification), this is an explicit instruction - so the TwoProd residual, Fast2Sum, and double-single multiply patterns port over directly without needing per-backend contraction hints.

Backends without hardware FMA fall back to a regular mul-then-add and lose the single-rounding guarantee; on those targets compensated algorithms should be rewritten to the Dekker / Veltkamp-split form.
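The FMA-free fallback form mentioned above is Dekker's TwoProd built on a Veltkamp split. A plain-Python sketch for IEEE doubles (not the Quadrants implementation), verified exactly with `fractions.Fraction`:

```python
from fractions import Fraction

SPLITTER = 2.0 ** 27 + 1.0  # Veltkamp constant for the 53-bit double significand

def veltkamp_split(x):
    # Split x into hi + lo with each half fitting in 26 significand bits.
    t = SPLITTER * x
    hi = t - (t - x)
    return hi, x - hi

def two_prod_dekker(a, b):
    p = a * b
    a_hi, a_lo = veltkamp_split(a)
    b_hi, b_lo = veltkamp_split(b)
    # Exact residual of the rounded product using only mul/add (no FMA).
    e = a_lo * b_lo - (((p - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return p, e

p, e = two_prod_dekker(0.1, 0.3)
assert Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)  # exact
```

Note that every subtraction in `two_prod_dekker` is itself exact under IEEE arithmetic, which is precisely the kind of sequence that must be protected from reassociation.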
90 changes: 90 additions & 0 deletions python/quadrants/lang/ops.py
@@ -95,6 +95,59 @@ def cast(obj, dtype):
    return expr.Expr(_qd_core.value_cast(expr.Expr(obj).ptr, dtype))


def precise(obj):
    """Mark a floating-point expression as IEEE-strict.

    Every binary and unary FP op inside ``obj`` is evaluated in source
    order with no reassociation, no FMA contraction, no approximate
    transcendental substitution, and no non-IEEE-exact algebraic
    simplification, regardless of the module-level :attr:`fast_math`
    setting. Folds that are IEEE-exact for every input (e.g.
    ``a - 0 -> a``, ``a > a -> false``) are still applied. This is
    equivalent to MSL's / HLSL's ``precise`` keyword and lets you keep
    ``fast_math=True`` globally while protecting compensated-arithmetic
    blocks (Dekker / Kahan 2Sum, Veltkamp split, etc.) from being folded
    away.

    Recursion descends through ``BinaryOp``, ``UnaryOp`` (cast, bit_cast,
    neg, sqrt, ...), and ``TernaryOp`` (select) wrappers so that inner
    binary ops are reached even when wrapped, e.g.
    ``qd.precise(qd.bit_cast(a + b, qd.f32))``. It stops at loads,
    constants, ``qd.func`` calls, ndarray accesses, etc.; semantics inside
    a ``qd.func`` body are governed by that body's own ops - wrap calls
    separately if needed.

    Notes:
        * ``qd.precise`` does NOT mutate the input expression. It returns
          a fresh subtree that mirrors the input's structure, with every
          reachable Binary / Unary / Ternary node cloned and the new
          Binary / Unary nodes tagged as ``precise``. Non-walked nodes
          (loads, constants, ``qd.func`` calls, ndarray accesses, ...)
          are shared with the input by reference. The practical upshot:
          reusing the original (pre-``precise``) expression value
          elsewhere is safe - it will NOT pick up the tag.

    Args:
        obj: A scalar Quadrants expression (typically a chain of FP ops).

    Returns:
        A fresh expression subtree with every reachable binary and unary
        FP op tagged as ``precise``. The original ``obj`` is unchanged.

    Example::

        >>> @qd.func
        ... def fast_two_sum(a, b):
        ...     # Local IEEE region, survives even with fast_math=True.
        ...     s = qd.precise(a + b)
        ...     e = qd.precise(b - (s - a))
        ...     return s, e
    """
    if is_quadrants_class(obj):
        raise ValueError("Cannot apply precise on Quadrants classes")
    return expr.Expr(_qd_core.precise(expr.Expr(obj).ptr))


def bit_cast(obj, dtype):
    """Copy and cast a scalar to a specified data type with its underlying
    bits preserved. Must be called in quadrants scope.
@@ -1117,6 +1170,42 @@ def py_select(cond, x1, x2):
    return _ternary_operation(_qd_core.expr_select, py_select, cond, x1, x2)


def fma(a, b, c):
    """Fused multiply-add: return ``a * b + c`` computed as a single rounded
    operation.

    Unlike a plain ``a * b + c``, the intermediate product is not rounded:
    the result is ``round(a * b + c, 1 ULP)``. This is the hardware FMA
    available on every modern FP pipeline (x86 FMA3, ARM, Apple Silicon,
    NVIDIA ``fma``, AMD, RISC-V Zfa). Exposed here primarily to let
    compensated-arithmetic primitives (TwoProd, Fast2Sum + FMA,
    double-single accumulators) get the single-rounding guarantee without
    relying on backend-specific FMF contraction.

    Classic two-product error-free transform::

        p = a * b
        e = qd.math.fma(a, b, -p)  # exact residual of p

    Each backend maps this to its native FMA (LLVM ``llvm.fma`` intrinsic
    on CPU, ``__nv_fma/__nv_fmaf`` on CUDA via libdevice, GLSL.std.450
    ``Fma`` on Vulkan/Metal). Backends without hardware FMA fall back to
    a regular mul-then-add and lose the single-rounding guarantee.

    Args:
        a, b, c: Homogeneous FP scalars (``f16``/``f32``/``f64``). Integer
            inputs are rejected.
Comment on lines +1183 to +1197

🟡 The fma docstring in ops.py shows 'e = qd.fma(a, b, -p)' in its TwoProd example, but fma is not in ops.__all__ and is therefore not re-exported to the top-level qd namespace. Copying this example verbatim raises AttributeError: module 'quadrants' has no attribute 'fma'. The correct public API is qd.math.fma, which is what the companion documentation in precise.md correctly uses.

Extended reasoning...

What the bug is and how it manifests

The fma function added in python/quadrants/lang/ops.py has a docstring that demonstrates the classic TwoProd error-free transform:

p = a * b
e = qd.fma(a, b, -p)        # exact residual of p

A user reading the docstring and copying this example will immediately encounter AttributeError: module 'quadrants' has no attribute 'fma'.

The specific code path that triggers it

The top-level qd namespace is populated via quadrants/__init__.py which does 'from quadrants.lang import *', which chains through 'from quadrants.lang.ops import *'. Only names listed in ops.__all__ (lines 1592-1628) reach the top level. The fma function defined in ops.py is deliberately NOT added to ops.__all__ - it is an internal function wrapped by qd.math.fma.

Why existing code does not prevent it

The docstring was written using the wrong namespace prefix. The omission of fma from ops.__all__ is correct and intentional (preventing qd.fma from polluting the top-level namespace), but the docstring example was never updated to reflect that the public entry point is qd.math.fma rather than qd.fma.

What the impact would be

Any user reading the ops.fma docstring (via help(), an IDE, or generated API docs) and copying the example verbatim will get a runtime AttributeError. The bug is documentation-only: the runtime implementation is correct, and qd.math.fma works as advertised. The companion documentation in docs/source/user_guide/precise.md (added in the same PR) correctly uses qd.math.fma(a, b, -p), creating an inconsistency between the two sources.

How to fix it

Change the docstring example in python/quadrants/lang/ops.py (around line 1188) from:

e = qd.fma(a, b, -p)        # exact residual of p

to:

e = qd.math.fma(a, b, -p)   # exact residual of p

Step-by-step proof

  1. User reads the fma docstring in ops.py and finds the TwoProd example showing qd.fma(a, b, -p).
  2. User writes a kernel using that pattern.
  3. At runtime, Python evaluates qd.fma - attribute lookup on the quadrants module.
  4. quadrants/__init__.py populates the module via from quadrants.lang import * -> from quadrants.lang.ops import *.
  5. ops.__all__ (lines 1592-1628) lists: acos, asin, atan2, atomic_*, bit_cast, bit_shr, cast, ceil, cos, exp, floor, frexp, log, random, raw_mod, raw_div, round, rsqrt, sin, sqrt, tan, tanh, max, min, select, abs, pow, precise - fma is absent.
  6. AttributeError: module 'quadrants' has no attribute 'fma' is raised.
  7. The correct call qd.math.fma(a, b, -p) works, as mathimpl.py explicitly adds fma to its __all__ (line 854) and quadrants/__init__.py exports the math submodule.


    Returns:
        ``round(a * b + c, 1 ULP)`` as a single rounded operation.
    """

    def py_fma(a, b, c):
        return a * b + c

    return _ternary_operation(_qd_core.expr_fma, py_fma, a, b, c)


def ifte(cond, x1, x2):
    """Evaluate and return `x1` if `cond` is true; otherwise evaluate and return `x2`. This operator guarantees
    short-circuit semantics: exactly one of `x1` or `x2` will be evaluated.
@@ -1535,4 +1624,5 @@ def min(*args):  # pylint: disable=W0622
    "select",
    "abs",
    "pow",
    "precise",
]
Comment on lines 1624 to 1628

P2: Export fma in lang ops public symbol list

fma() is implemented in this module but omitted from __all__, so from quadrants.lang.ops import * does not re-export it. Because quadrants.lang and top-level quadrants rely on star-importing this list, users cannot call the documented qd.fma(...) alias and only qd.math.fma(...) works, which is a public API regression.


12 changes: 12 additions & 0 deletions python/quadrants/math/mathimpl.py
@@ -823,6 +823,17 @@ def clz(x):
    return ops.clz(x)


@func
def fma(a, b, c):
    """Fused multiply-add: ``a * b + c`` as a single rounded operation.

    Wraps :func:`quadrants.lang.ops.fma`. Primary use case is compensated
    FP arithmetic (TwoProd error-free transform, Fast2Sum + FMA for
    double-single types); see the ``ops.fma`` docstring for details.
    """
    return ops.fma(a, b, c)


__all__ = [
    "acos",
    "asin",
@@ -840,6 +851,7 @@ def clz(x):
    "exp",
    "eye",
    "floor",
    "fma",
    "fract",
    "inf",
    "inverse",
3 changes: 3 additions & 0 deletions quadrants/analysis/gen_offline_cache_key.cpp
@@ -88,6 +88,7 @@ class ASTSerializer : public IRVisitor, public ExpressionVisitor {
  void visit(UnaryOpExpression *expr) override {
    emit(ExprOpCode::UnaryOpExpression);
    emit(expr->type);
    emit(expr->precise);
    if (expr->is_cast()) {
      emit(expr->cast_type);
    }

@@ -97,13 +98,15 @@ class ASTSerializer : public IRVisitor, public ExpressionVisitor {
  void visit(BinaryOpExpression *expr) override {
    emit(ExprOpCode::BinaryOpExpression);
    emit(expr->type);
    emit(expr->precise);
    emit(expr->lhs);
    emit(expr->rhs);
  }

  void visit(TernaryOpExpression *expr) override {
    emit(ExprOpCode::TernaryOpExpression);
    emit(expr->type);
    emit(expr->precise);
    emit(expr->op1);
    emit(expr->op2);
    emit(expr->op3);
15 changes: 15 additions & 0 deletions quadrants/codegen/amdgpu/codegen_amdgpu.cpp
@@ -389,6 +389,11 @@ class TaskCodeGenAMDGPU : public TaskCodeGenLLVM {
    if (op != BinaryOpType::atan2 && op != BinaryOpType::pow) {
      return TaskCodeGenLLVM::visit(stmt);
    }
    // The base-class `visit(BinaryOpStmt*)` terminates with `if (stmt->precise) disable_fast_math(...)` so LLVM
    // cannot substitute approximate variants for precise-tagged FP ops. The AMDGPU override below returns without
    // chaining to the base, so we mirror that same guard on the __ocml_* call results. AMDGPU's `__ocml_*`
    // transcendentals are currently correctly-rounded (no `__ocml_fast_*` variants), so this is defensive against
    // future libocml changes rather than a bug today.
    auto lhs = llvm_val[stmt->lhs];
    auto rhs = llvm_val[stmt->rhs];

@@ -403,6 +408,13 @@ class TaskCodeGenAMDGPU : public TaskCodeGenLLVM {
      auto sitofp_lhs_ = builder->CreateSIToFP(lhs, llvm::Type::getDoubleTy(*llvm_context));
      auto sitofp_rhs_ = builder->CreateSIToFP(rhs, llvm::Type::getDoubleTy(*llvm_context));
      auto ret_ = call("__ocml_pow_f64", {sitofp_lhs_, sitofp_rhs_});
      // FPToSI is not an FPMathOperator, so the post-hoc `disable_fast_math(llvm_val[stmt])` below would be a
      // no-op on it and leave the `__ocml_pow_f64` CallInst still carrying the IRBuilder's `afn` / `reassoc` / ...
      // Clear FMF here on the actual call before its handle is overwritten by the FPToSI. Mirrors the f16 FPTrunc
      // guards in `codegen_llvm.cpp` and `codegen_cuda.cpp::emit_extra_unary`.
      if (stmt->precise) {
        disable_fast_math(ret_);
      }
      llvm_val[stmt] = builder->CreateFPToSI(ret_, llvm::Type::getInt32Ty(*llvm_context));
    } else {
      QD_NOT_IMPLEMENTED

@@ -418,6 +430,9 @@ class TaskCodeGenAMDGPU : public TaskCodeGenLLVM {
        QD_NOT_IMPLEMENTED
      }
    }
    if (stmt->precise) {
      disable_fast_math(llvm_val[stmt]);
    }
  }

 private: