Fix master CI: expv zero-input NaN, JET-on-1.12 QA, GPU-in-All #229

Merged
ChrisRackauckas merged 2 commits into SciML:master from ChrisRackauckas-Claude:fix-master-ci-1.12-nan-jet-gpu
Jun 26, 2026

@ChrisRackauckas-Claude

Contributor

Fixes three independent failures on the master grouped-tests CI.

1. Core: NaN == 0.0 at basictests.jl:307 (zero-input expv)

The real expv!(w, t::Real, Ks) method was missing the iszero(beta) guard the complex method already has. For a zero input vector firststep! skips initializing the Krylov basis V (it only fills V[:,1] when beta != 0), so the final lmul!(beta, mul!(w, @view(V[:,1:m]), expHe)) computes 0 * <uninitialized memory>, which is NaN whenever V holds garbage — explaining why the failure was flaky (heap-dependent: green on some OS/runs, NaN on others). Added the same early-return guard so expv of a zero vector is exactly zero.

Verified locally: full GROUP=Core Pkg.test passes on Julia 1.10 and 1.12 (it reliably produced NaN on 1.10 before).
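
The failure mode and the guard can be reproduced without the package. The sketch below is a simplified stand-in, not the actual expv! source: `scaled_combination!` and its `guard` keyword are hypothetical names, and the basis is deliberately filled with NaN to play the role of uninitialized memory.

```julia
using LinearAlgebra

# Hypothetical stand-in for the last step of expv!: w = beta * (V * c).
# With guard = true, a zero beta returns an exactly-zero w without reading V.
function scaled_combination!(w, beta, V, c; guard::Bool)
    if guard && iszero(beta)
        fill!(w, zero(eltype(w)))
        return w
    end
    mul!(w, V, c)     # w = V * c, reads every entry of V
    lmul!(beta, w)    # w = beta * w; 0 * NaN is still NaN
    return w
end

V = fill(NaN, 4, 3)   # NaN stands in for garbage in a never-initialized basis
c = ones(3)
beta = 0.0            # norm of a zero input vector

unguarded = scaled_combination!(zeros(4), beta, V, c; guard = false)
guarded   = scaled_combination!(zeros(4), beta, V, c; guard = true)

println(all(isnan, unguarded))   # the unguarded path propagates NaN
println(guarded == zeros(4))     # the guarded path is exactly zero
```

With real uninitialized memory the unguarded result depends on whatever bytes happen to be in V, which is consistent with the flakiness described above.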

2. QA: 6 JET failures on the 1 (= Julia 1.12) channel

lts (1.10) was green; only 1 (1.12) failed. On 1.12 JET traces into LinearAlgebra/Base internals (norm(::Vector) -> norm_recursive_check -> iterate(::Nothing), and the broadcast unalias/copyto_unaliased! path over Adjoint{T, Union{}}) and reports abstract-interpretation artifacts there that this package does not control. Scoped the QA report_call invocations to target_modules = (ExponentialUtilities,) (the standard JET-as-package-QA configuration), which keeps full coverage of this package's own code.

That scoping surfaced two genuine may be undefined findings, which are fixed here so the scoped analysis is clean (not silenced):

  • si in exponential! (exp_baseexp.jl) — conditionally assigned inside if s > 0, used inside a separate if s > 0; now initialized to 0 unconditionally.
  • order / kest in kiops (kiops.jl) — carried across loop iterations via the orderold/kestold "reuse" flags but only conditionally assigned; now seeded with their first-iteration defaults.

Verified locally: QA passes 17/17 on Julia 1.10 and 1.12.
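
The shape of the two "may be undefined" findings can be shown in isolation. This is a hypothetical reduction of the pattern, not the actual exponential! or kiops code; `before_fix` and `after_fix` are illustrative names. Both functions return the same values, but only the second gives a static analyzer an unconditional definition of `si`.

```julia
# Hypothetical reduction of the pattern; not the package source.
function before_fix(s::Int)
    if s > 0
        si = 2^s      # assigned only on this branch
    end
    acc = 1
    if s > 0
        acc *= si     # an analyzer may not correlate this test with the one above
    end
    return acc
end

function after_fix(s::Int)
    si = 0            # unconditional seed; never read when s <= 0
    if s > 0
        si = 2^s
    end
    acc = 1
    if s > 0
        acc *= si
    end
    return acc
end

println(before_fix(3) == after_fix(3))
println(before_fix(0) == after_fix(0))
```

The seed value is never observed at runtime, so the change affects only what the analysis can prove, not the results.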

3. Core (windows): "CUDA driver not functional"

On Windows the Core job runs the run_tests "All" aggregate, which pulled in the GPU group, and using CUDA errored on the non-GPU runner. Marked the GPU group in_all = false so it only ever runs under an explicit GROUP=GPU on the self-hosted CUDA runner. Verified locally: GROUP=All now runs only Core/basictests.jl, never GPU/gputests.jl.
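
As a rough model of what an opt-out flag like in_all does, the sketch below selects test files for a requested group. It is purely illustrative: the named-tuple layout and the `selected_files` function are assumptions made for this example and are not the API of the test harness used by the package.

```julia
# Hypothetical model of grouped test selection; names are illustrative only.
groups = [
    (name = "Core", files = ["Core/basictests.jl"], in_all = true),
    (name = "GPU",  files = ["GPU/gputests.jl"],    in_all = false),  # opt-in only
]

# "All" expands to every group that has not opted out; any other value
# selects exactly the group with that name.
function selected_files(requested::AbstractString, groups)
    chosen = requested == "All" ?
        filter(g -> g.in_all, groups) :
        filter(g -> g.name == requested, groups)
    return String[f for g in chosen for f in g.files]
end

println(selected_files("All", groups))
println(selected_files("GPU", groups))
```

Under this model the GPU file is reachable only through an explicit request for the GPU group, which matches the behavior described above.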

Not addressed (reported separately)

  • Core (julia pre, macos-latest): Static Arrays tolerance failure at basictests.jl:265 (expv(t,A,b) ≈ exp(t*A)*b). On linux Julia 1.13-rc1 the worst relative error is 1.25e-15; the macOS-pre failure shows ~1e-7. This is a macOS/1.13-rc-specific accuracy difference I could not reproduce or correctly fix on linux, and I will not loosen the tolerance without being able to prove the macOS deviation is benign.
  • GPU (self-hosted): requires CUDA hardware (infra), out of scope here.

Please ignore until reviewed by @ChrisRackauckas.

ChrisRackauckas and others added 2 commits June 19, 2026 05:17
Three independent master-CI failures on the grouped-tests workflow:

1. Core (NaN == 0.0 at basictests.jl:307, flaky across OS/version).
   The real `expv!(w, t::Real, Ks)` method lacked the `iszero(beta)`
   guard that the complex method already has. For a zero input vector
   `firststep!` skips initializing the Krylov basis V (it only fills it
   when beta != 0), so `lmul!(beta, mul!(w, V, expHe))` computes
   `0 * <uninitialized memory>`, which is NaN whenever V holds garbage.
   Add the same early-return guard, making expv of a zero vector exactly
   zero (matching the complex method). Verified: full Core suite now
   passes on Julia 1.10 and 1.12 (was reliably NaN on 1.10).

2. QA (6 JET failures on the Julia "1" = 1.12 channel; lts/1.10 was
   green). On 1.12 JET traces into LinearAlgebra/Base internals
   (`norm(::Vector)` -> `norm_recursive_check` -> `iterate(::Nothing)`,
   and the broadcast `unalias`/`copyto_unaliased!` path over
   `Adjoint{T, Union{}}`) and reports artifacts there that this package
   does not control. Scope the QA `report_call`s to
   `target_modules = (ExponentialUtilities,)` — the standard JET-as-QA
   configuration — which keeps full coverage of this package's own code.
   That scoping surfaced two genuine `may be undefined` findings, fixed
   here so the scoped analysis is clean: `si` in `exponential!` and
   `order`/`kest` in `kiops` are now unconditionally initialized before
   use. Verified: QA passes 17/17 on Julia 1.10 and 1.12.

3. Core (windows, all versions: "CUDA driver not functional"). On
   Windows the Core job runs the run_tests "All" aggregate, which pulled
   in the GPU group and `using CUDA` errored on the non-GPU runner. Mark
   the GPU group `in_all = false` so it only runs under an explicit
   GROUP=GPU on the self-hosted CUDA runner. Verified locally: GROUP=All
   now runs only Core/basictests.jl, never GPU/gputests.jl.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "Static Arrays" testset compared `expv(t, A, b)` against `exp(t * A) * b`
where `exp(t * A)` is StaticArrays' SMatrix matrix exponential. That reference
uses an unbalanced scaling-and-squaring Padé path which loses ~7-9 digits for
the larger non-normal N=8 cases on macOS + Julia prerelease (relerr ~1e-7..1e-5),
tripping the default-tolerance isapprox in "Core (julia pre, macos-latest)".

Verified against a 512-bit BigFloat ground truth that the macOS `expv` output is
correct to ~1e-16 on both platforms; it was the StaticArrays `exp` reference, not
`expv`, that drifted. Switching the reference to the dense LAPACK `exp`, which is
balanced and accurate on every platform, keeps this a machine-precision assertion
that still catches real `expv` regressions (no tolerance loosening).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ChrisRackauckas-Claude

Contributor Author

Resolved the last red — Core (julia pre, macos-latest) failing at test/basictests.jl:265 in the "Static Arrays" testset.

Root cause: not an expv bug. The assertion was expv(t, A, b) ≈ exp(t * A) * b, where exp(t * A) dispatches to StaticArrays.jl's own SMatrix matrix exponential. I reconstructed the two exact failing matrices (N=8, t=1.0 and N=8, t=10.0; RNG seed 0) and computed a 512-bit BigFloat ground truth:

| Quantity | Relerr vs BigFloat truth (N=8, t=1.0) | Relerr vs BigFloat truth (N=8, t=10.0) | Verdict |
|---|---|---|---|
| macOS expv output (under test) | 3.2e-16 | 7.7e-16 | correct |
| macOS exp(t*A) StaticArrays reference | 4.0e-7 | 7.5e-6 | wrong |
| Linux expv | 3.2e-16 | 7.3e-16 | |
| Linux exp(t*A) StaticArrays reference | 1.9e-16 | 7.3e-16 | |

So expv is machine-accurate on both platforms. It was the reference exp(t*A) (StaticArrays' unbalanced scaling-and-squaring Padé path — the source even notes "omitted: matrix balancing") that drifted ~7-9 digits on macOS + Julia 1.13-rc1. The test was comparing a correct value against a platform-fragile reference that is less accurate than the thing under test.

Default tolerances for context. The SMatrix expv extension targets eps(T)/2 ≈ 1.1e-16 (default_tolerance), and the Krylov expv path's happy-breakdown tol is 1e-7; neither is the issue here. The test gate is the default isapprox (rtol ≈ 1.49e-8). The macOS error of ~1e-7..1e-5 is far above any plausible expv FP floor (I confirmed across 400 seeds and forced mo/s/break-tol perturbations that faithful expv stays ≤5e-14), which is what pointed at the reference, not expv.

Fix (no tolerance loosening): compare expv against the dense LAPACK exponential exp(t * Matrix(A)) * Vector(b), which is balanced and accurate on every platform. This keeps a machine-precision (default-tolerance) assertion that still catches real expv regressions.
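
The idea of checking a reference against a high-precision ground truth can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script used for the numbers above: `exp_bigfloat` is a hypothetical helper (a scaling-and-squaring Taylor series in BigFloat), and the random matrix here is not one of the failing test matrices even though it also uses seed 0.

```julia
using LinearAlgebra, Random

# Hypothetical helper: matrix exponential in BigFloat via Taylor series with
# scaling and squaring. Intended only as an illustrative ground truth.
function exp_bigfloat(A::AbstractMatrix{<:Real}; bits::Int = 512)
    setprecision(BigFloat, bits) do
        n = size(A, 1)
        nrm = opnorm(Float64.(A), 1)
        # Choose s so the scaled matrix has 1-norm at most 1/2.
        s = nrm > 0.5 ? ceil(Int, log2(nrm)) + 1 : 0
        B = BigFloat.(A) ./ (big(2)^s)
        term = Matrix{BigFloat}(I, n, n)
        total = copy(term)
        for k in 1:400
            term = term * B / k                  # B^k / k!
            total += term
            maximum(abs, term) < eps(BigFloat) && break
        end
        for _ in 1:s
            total = total * total                # undo the scaling by squaring
        end
        total
    end
end

relerr(x, truth) = Float64(norm(x - truth) / norm(truth))

Random.seed!(0)
n, t = 8, 1.0
A = randn(n, n)
b = randn(n)

truth = exp_bigfloat(t * A) * b
dense = exp(t * A) * b          # dense matrix exponential from LinearAlgebra
err = relerr(dense, truth)
println(err)
```

Measuring both the value under test and the candidate reference against the same high-precision truth is what separates "the method drifted" from "the reference drifted".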

Verified locally (Pkg.test GROUP=Core, full basictests.jl):

  • Julia 1.13.0-rc1 (= CI pre): 329 pass, 1 broken (pre-existing @test_broken)
  • Julia 1.12.6: 329 pass, 1 broken
  • Julia 1.10.11 (lts): 329 pass, 1 broken

The "Static Arrays" testset itself: 12 pass / 12 on rc.

Ignore until reviewed by @ChrisRackauckas.

@ChrisRackauckas ChrisRackauckas marked this pull request as ready for review June 26, 2026 11:32
@ChrisRackauckas ChrisRackauckas merged commit ee91ce6 into SciML:master Jun 26, 2026
21 of 36 checks passed
ChrisRackauckas-Claude pushed a commit to ChrisRackauckas-Claude/ExponentialUtilities.jl that referenced this pull request Jun 29, 2026
Convert the hand-rolled test/qa/qa.jl (raw Aqua.test_* + per-function
JET.report_call) to the SciMLTesting 1.6 `run_qa` form and enable the
ExplicitImports checks.

ExplicitImports findings (run vs released SciMLTesting 1.6.0):
  * no_stale_explicit_imports: removed the genuinely stale
    `ArrayInterface.allowed_getindex` import (never referenced; only
    `ismutable`/`allowed_setindex!` are used).
  * Made the `for i in 1:13 include("exp_generated/exp_$i.jl")` dynamic
    include in exp_noalloc.jl static (13 literal includes) so the module is
    analyzable — this unblocked no_implicit_imports and
    no_stale_explicit_imports (previously UnanalyzableModuleException).
    Verified Higham2005 matrix-exp still matches Base `exp` to ~6.7e-16.
  * all_qualified_accesses_via_owners / all_qualified_accesses_are_public /
    all_explicit_imports_are_public: ignore-listed other packages' non-public
    names (Base / LinearAlgebra(.BLAS/.LAPACK, incl. Stegr submodule) /
    ArrayInterface / libblastrampoline_jll); they go public as the base libs
    declare them.
  * no_implicit_imports: ~31 implicit names from `using LinearAlgebra,
    SparseArrays, Printf, PrecompileTools`. Making them explicit is a large
    refactor; marked ei_broken and tracked in SciML#231 (auto-flags when fixed).

Deps: test/qa/Project.toml SciMLTesting compat -> "1.6" (Aqua + ExplicitImports
are transitive via SciMLTesting; Aqua kept a direct dep so the ambiguities
sub-check's child process can resolve it; JET kept for the JET check). Root
Project.toml SciMLTesting compat -> "1.6".

QA group on Julia 1.10 (lts), released SciMLTesting 1.6.0:
  Quality Assurance | 17 Pass, 1 Broken, 0 Fail, 0 Error (no_implicit_imports
  broken per SciML#231). On Julia 1.12 the JET typo check reports pre-existing
  "may be undefined" findings (kiops order/kest, Higham2005 ilo/ihi/scale/bal);
  master is already red there and the source fixes live in draft PR SciML#229.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ChrisRackauckas added a commit that referenced this pull request Jul 3, 2026
* QA: run_qa v1.6 form + ExplicitImports

Convert the hand-rolled test/qa/qa.jl (raw Aqua.test_* + per-function
JET.report_call) to the SciMLTesting 1.6 `run_qa` form and enable the
ExplicitImports checks.

ExplicitImports findings (run vs released SciMLTesting 1.6.0):
  * no_stale_explicit_imports: removed the genuinely stale
    `ArrayInterface.allowed_getindex` import (never referenced; only
    `ismutable`/`allowed_setindex!` are used).
  * Made the `for i in 1:13 include("exp_generated/exp_$i.jl")` dynamic
    include in exp_noalloc.jl static (13 literal includes) so the module is
    analyzable — this unblocked no_implicit_imports and
    no_stale_explicit_imports (previously UnanalyzableModuleException).
    Verified Higham2005 matrix-exp still matches Base `exp` to ~6.7e-16.
  * all_qualified_accesses_via_owners / all_qualified_accesses_are_public /
    all_explicit_imports_are_public: ignore-listed other packages' non-public
    names (Base / LinearAlgebra(.BLAS/.LAPACK, incl. Stegr submodule) /
    ArrayInterface / libblastrampoline_jll); they go public as the base libs
    declare them.
  * no_implicit_imports: ~31 implicit names from `using LinearAlgebra,
    SparseArrays, Printf, PrecompileTools`. Making them explicit is a large
    refactor; marked ei_broken and tracked in #231 (auto-flags when fixed).

Deps: test/qa/Project.toml SciMLTesting compat -> "1.6" (Aqua + ExplicitImports
are transitive via SciMLTesting; Aqua kept a direct dep so the ambiguities
sub-check's child process can resolve it; JET kept for the JET check). Root
Project.toml SciMLTesting compat -> "1.6".

QA group on Julia 1.10 (lts), released SciMLTesting 1.6.0:
  Quality Assurance | 17 Pass, 1 Broken, 0 Fail, 0 Error (no_implicit_imports
  broken per #231). On Julia 1.12 the JET typo check reports pre-existing
  "may be undefined" findings (kiops order/kest, Higham2005 ilo/ihi/scale/bal);
  master is already red there and the source fixes live in draft PR #229.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* QA: fix latent undefined-balancing locals in exponential!(::ExpMethodHigham2005)

The run_qa v1.6 conversion runs JET in report_package typo mode, which analyzes
each method signature in isolation. The hand-rolled qa.jl this replaced used
JET.report_call(exponential!, (Matrix{Float64},)), where ExpMethodHigham2005(A)
sets do_balancing = (A isa StridedMatrix) as a constant that JET could
constant-propagate, so both `if method.do_balancing` blocks folded to true and
ilo/ihi/scale/bal were seen as always defined. In report_package the method is
analyzed with an abstract ExpMethodHigham2005, so do_balancing is a runtime
Bool, the two balancing blocks are not provably correlated, and the undo block
reads ilo/ihi/scale/bal as possibly-undefined locals (20 JET typo reports on
Julia 1.12; 1.10 abstract-interp did not reach them).

Seed ilo=1/ihi=n/scale=_scale as no-op defaults and lift the GenericSchur
row/col permutations into prow/pcol locals (nothing on the BLAS path, which
never reads them), so every local read in the symmetric undo block is
unconditionally defined. Behavior is unchanged: the seeds are only live when
do_balancing is false (where the undo block does not run), and the BLAS vs
GenericSchur branches use exactly the values they used before.

Verified Julia 1.12.6 (released SciMLTesting 1.7.0, JET 0.11.5): report_package
typo mode goes from 20 reports to 0. Verified Julia 1.10.11 numerics unchanged:
strided-BLAS balancing relerr 3.3e-16, GenericSchur (BigFloat) balancing relerr
1.1e-16, no-balancing relerr 1.6e-16 vs reference exp.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>

---------

Co-authored-by: ChrisRackauckas-Claude <accounts@chrisrackauckas.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>