Add O(p m³) Al-Mohy–Liu scaling-and-recovering algorithm for the matrix φ-functions by ChrisRackauckas-Claude · Pull Request #236 · SciML/ExponentialUtilities.jl

ChrisRackauckas-Claude · 2026-07-02T17:31:03Z

Please ignore until reviewed by @ChrisRackauckas. Draft; opened by an agent.

Closes #235.

Summary

phi(A, k) / phi! currently compute φ₀(A), …, φ_k(A) by calling phiv_dense! on each of the m basis vectors, each exponentiating an (m+k)×(m+k) block matrix — overall O(m (m+k)³), as noted in #235.

This PR implements Algorithm 5.1 of

Awad H. Al-Mohy and Xiaobo Liu, A scaling and recovering algorithm for the matrix φ-functions, arXiv:2506.01193 (2025),

which computes all of φ₀, …, φ_p simultaneously in O(p m³):

Scale A ← A/2^s.
Evaluate the [m/m] diagonal Padé approximant to φ_p via Paterson–Stockmeyer.
Recover the lower-index approximants with the recurrence R⁽ʲ⁾ = z R⁽ʲ⁺¹⁾ + 1/j!.
Undo the scaling with the double-argument formula φ_j(2A) = 2⁻ʲ(φ₀φ_j + Σ_{k=1}^{j} φ_k/(j−k)!).

The backward-error-optimal Padé degree m and scaling s are selected from the paper's θ_{m,p} table (Table 3.1) together with the α_r/t cost analysis (§3–4). Implementation follows the authors' reference MATLAB code phi_funm.

Scope / compatibility

New path is the default for Float64/ComplexF64 dense inputs (the Padé tables are tuned for IEEE double precision).
Other element types, and callers passing the legacy caches bundle, keep the existing basis-vector path.
Diagonal inputs and the public API are unchanged.
The quasi-triangular structure exploitation from the paper (§5.1, an accuracy refinement for already-triangular inputs) is not included here; the general path is used for all inputs. Could be a follow-up.

Validation (run locally)

Against a high-precision (BigFloat) block-augmented-exponential reference and the legacy algorithm, over normal / non-normal / large-norm / complex / Hessenberg / zero matrices for k = 1…10:

Worst-case relative error ~1e-14, matching or beating the old path (e.g. on a diagonally-dominant 12×12 the new path is ~1e-16 vs ~2e-14 for the old).
Speedups over the basis-vector path (k=3): 18× at n=50, 51× at n=100, 72× at n=200, growing with n as expected from the complexity change.

Tests

New testset in test/basictests.jl (Phi (Al-Mohy–Liu vs reference), 212 assertions) checking the new default against an independent exp-of-block-matrix reference and the legacy phiv_dense! path, plus the in-place entry point and φ₀ == exp(A).

The full basictests.jl passes with these changes. Note: two pre-existing failures in the unrelated Issue 42 testset (ForwardDiff derivatives of exp_generic) reproduce on unmodified master under Julia 1.12 and are not touched by this PR — being investigated separately.

🤖 Generated with Claude Code

…nctions `phi(A, k)` / `phi!` previously computed phi_0(A), ..., phi_k(A) by calling `phiv_dense!` on each of the m basis vectors, each exponentiating an (m+k) x (m+k) block matrix -- overall O(m (m+k)^3). This implements Algorithm 5.1 of Al-Mohy & Liu, "A scaling and recovering algorithm for the matrix phi-functions" (arXiv:2506.01193, 2025), which computes all phi functions simultaneously in O(k m^3): scale A by 2^-s, evaluate the [m/m] diagonal Pade approximant to phi_p via Paterson--Stockmeyer, recover the lower-index approximants with the recurrence R^{(j)} = z R^{(j+1)} + 1/j!, and undo the scaling with the double-argument formula. Backward-error-optimal Pade degree m and scaling s are selected from the paper's theta table (Table 3.1) and the alpha_r / t cost analysis. The new path is the default for Float64/ComplexF64 dense inputs (its Pade tables are tuned for double precision); other element types and callers passing the legacy `caches` bundle keep the basis-vector path. Diagonal inputs are unchanged. Validated against a high-precision block-exponential reference and the legacy algorithm: worst-case relative error ~1e-14 across normal, non-normal, large-norm, complex, Hessenberg and zero matrices for k = 1..10, matching or beating the old accuracy. Measured speedups over the basis-vector path: 18x at n=50, 51x at n=100, 72x at n=200 (k=3). Closes SciML#235. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

stevengj · 2026-07-03T01:42:12Z

+    n = size(A, 1)
+    Apow = Vector{Matrix{T}}(undef, tau + 1)
+    Apow[1] = Matrix{T}(I, n, n)
+    Apow[2] = Matrix(A)


Seems unfortunate to use Matrix here if you start with SMatrix

Addressed in f72fa0c. There are now two paths: for immutable dense inputs (e.g. SMatrix) a container-preserving _phi_almohy_generic keeps everything static — an SMatrix in yields SMatrix results, no Matrix conversion (tested static-in/static-out). The Matrix-based buffers are only used on the strided Float64/ComplexF64 path. Routing uses ismutable so sparse/other mutable non-strided matrices still take the legacy path.

stevengj · 2026-07-03T01:44:39Z

+        else
+            N .+= blkN * Atau
+            D .+= blkD * Atau
+            Atau = Atau * Apow[tau + 1]


Do in-place, maybe?

Done in f72fa0c. The strided path now runs fully in place through a reusable PhiPadeCache workspace (all powers, N/D, the LU solve via getrf!/getrs!, and the recovery temporaries), so repeated calls allocate nothing — verified 0 bytes with @allocated across sizes/p. Pass the cache as the caches keyword to phi! to reuse it.

@allocated

Address review feedback on SciML#236: * PhiPadeCache: a reusable workspace holding every scratch buffer. The strided Float64/ComplexF64 path (`_phi_almohy!`) now runs fully in place via mul!/getrf!/getrs! and broadcasting, allocating nothing once a cache exists (verified 0 bytes with @allocated across sizes). Pass it as the `caches` keyword to `phi!` to reuse across calls. (@stevengj: "do in-place, maybe?") * Container-preserving generic path (`_phi_almohy_generic`) for immutable dense matrices: an SMatrix input stays an SMatrix throughout instead of being converted to a Matrix. Routed only for `!ismutable` inputs, so sparse/other mutable non-strided matrices keep the legacy path. (@stevengj: "unfortunate to use Matrix here if you start with SMatrix") * Export and document PhiPadeCache and phi!; bump to 1.31.0 (new public API, SemVer minor). Tests: workspace reuse + zero-allocation assertion, complex-matrix workspace, and SMatrix static-in/static-out accuracy vs the block-exponential reference. Full basictests + Aqua + JET pass locally (the only failures are the pre-existing ForwardDiff-1.x Issue 42 asserts, green under CI's pinned ForwardDiff 0.10.x). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

ChrisRackauckas-Claude · 2026-07-03T07:50:26Z

Pushed f72fa0c addressing the review (thanks @stevengj):

Reusable zero-alloc workspace PhiPadeCache. The strided Float64/ComplexF64 path now runs entirely in place (powers, N/D via Paterson–Stockmeyer, LU solve through getrf!/getrs!, and the double-argument recovery), so a preallocated cache reused across calls allocates 0 bytes (checked with @allocated across sizes and p). Reuse it via phi!(out, A, k; caches = cache).
Container-preserving static path _phi_almohy_generic for immutable dense inputs: an SMatrix stays an SMatrix throughout — no internal Matrix conversion. Gated on ismutable, so sparse and other mutable non-strided matrices keep the legacy path (fixes a sparse regression I caught in the local suite).
Exported + documented PhiPadeCache and phi!; version bump to 1.31.0 (new public API, SemVer-minor).

New tests: workspace reuse + zero-allocation assertion, complex-matrix workspace, and SMatrix static-in/static-out accuracy. Locally green: full basictests.jl, Aqua (10/10), and JET (7/7, including report_call(phi, ...)). The only failing asserts in the suite are the pre-existing ForwardDiff-1.x Issue 42 ones, which pass under CI's compat-pinned ForwardDiff 0.10.x.

Still marked draft / ignore until reviewed.

`LAPACK.getrf!(A, ipiv)` with a preallocated pivot vector only exists on newer Julia; on the LTS (1.10) it errored, breaking the workspace solve. ccall `getrf!` directly through libblastrampoline (as `StegrWork` does for `stegr!`) so the factorization stays zero-allocation on all supported versions; the 4-argument `getrs!` used for the solve is available everywhere. Verified on Julia 1.10 and 1.12: correctness ~2.5e-15 vs the BigFloat reference and 0 bytes allocated on cache reuse. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

The workspace solve used qualified `LinearAlgebra.BlasInt` and `LinearAlgebra.LAPACK.getrs!`, which SciMLTesting's ExplicitImports `all_qualified_accesses_are_public` check flags as reaching into non-public internals (they are not on its ignore list). Mirror `exp_noalloc.jl` instead: import `BlasInt` (an allow-listed explicit import) and ccall both `getrf!` and `getrs!` through `BLAS.@blasfunc`/`BLAS.libblastrampoline` (allow-listed names). No qualified access to non-public names remains in `phi_almohy.jl`. Behavior unchanged: verified correctness ~2.5e-15 and 0-byte cache reuse on Julia 1.10 and 1.12; Aqua and JET still pass. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

Re-analysis pass over the Al-Mohy--Liu implementation: * Paterson--Stockmeyer rewritten in Horner form with the delta trick. The previous block scheme (mirroring the reference MATLAB) used explicit Atau powers plus two products per block -- up to 12 multiplications for m = 12 where Fasi's pi_m(tau) = 7, so the parameter selection was optimizing a cost the evaluator did not achieve. The Horner form's count now equals pi_m(tau) exactly. Measured 10-20% end-to-end speedup on the workspace path across n = 25..200. * Numerical failure returns codes instead of throwing, so adaptive integrators can reject a step rather than abort: cache.info[] is 0 on success, the LAPACK getrf info for a singular Pade denominator, or -1 for non-finite results (LAPACK does not flag NaN pivots, so a post-solve finiteness check backs the code); outputs are NaN-filled on failure. Fixes a latent InexactError where NaN/Inf input crashed in ceil(Int, t) during parameter selection before reaching the solve, and the generic path's SingularException from `\`. Wrongly sized workspaces (programmer errors) still throw, with informative messages. * Workspace vectors sized from p at construction (was a fixed cap that BoundsError'd for p >~ 48) with an explicit capacity check; opnorm(A, 1) hoisted out of the per-degree ell loop (was recomputed 8x); dead Atau/Id buffers dropped (2n^2 less memory). * Fix a closure-capture leak in the test helpers: phi_block_reference assigned to `n`, clobbering the enclosing testset's `n` (Julia closures assign to existing outer locals); the helpers now declare their temporaries `local`. Verified: worst relative error 8.8e-15 over 33 BigFloat-referenced cases (real/complex, normal/nonnormal, p = 1..8); 0 bytes allocated on cache reuse (Float64 and ComplexF64, n = 10..100, p = 1..8, Julia 1.10 and 1.12); NaN/Inf inputs return info != 0 with NaN outputs and no throw; full basictests 574 pass on 1.10 and 1.12 (only the pre-existing ForwardDiff-1.x Issue 42 failures remain, which pass under CI's pinned ForwardDiff 0.10); ExplicitImports clean. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

ChrisRackauckas-Claude · 2026-07-03T09:58:06Z

Pushed 46cad11 — a re-analysis pass over the implementation (performance + failure semantics):

Faster. Paterson–Stockmeyer rewritten in Horner form with the δ trick. The previous block scheme (mirroring the reference MATLAB) used up to 12 multiplications for m=12 where Fasi's π_m(τ) = 7 — so the parameter selection was minimizing a cost model the evaluator didn't achieve. The multiplication count now equals π_m(τ) exactly; measured 10–20% end-to-end speedup on the workspace path (n = 25…200), on top of the 18–72× over the old basis-vector algorithm.

Return codes instead of exceptions. Numerical failure never throws, so adaptive integrators can reject a step instead of aborting: cache.info[] is 0 on success, the LAPACK getrf info for a singular Padé denominator, or -1 for non-finite results (backed by a post-solve finiteness check since LAPACK doesn't flag NaN pivots); outputs are NaN-filled on failure. This also fixes a latent InexactError where NaN/Inf input crashed in ceil(Int, t) during parameter selection — an integrator-killer that predates this PR's review round. Wrongly sized workspaces (programmer errors) still throw with informative messages.

Robustness. Workspace vectors are now sized from p at construction (the fixed cap silently BoundsError'd for p ≳ 48); opnorm(A,1) hoisted out of the per-degree loop; two dead n² buffers dropped; a closure-capture leak in the test helpers fixed (phi_block_reference was assigning to the enclosing testset's n).

Verified locally (Julia 1.10 + 1.12): worst relerr 8.8e-15 over 33 BigFloat-referenced cases (real/complex, normal/nonnormal, p = 1…8); 0 bytes allocated on cache reuse for Float64 and ComplexF64; NaN/Inf inputs → info != 0, NaN outputs, no throw; full basictests 574 pass ×2 versions; ExplicitImports clean.

Still draft / ignore until reviewed.

stevengj · 2026-07-03T12:19:56Z

+
+# First-factor coefficient of the backward-error series h_{m,p} (Eq. (3.4)).
+@inline function _phi_be_coeff(m::Integer, p::Integer)
+    return (_factf(m + p) / _factf(2m + p)) * (_factf(m) / _factf(2m + p + 1))


This seems vulnerable to spurious overflow in the individual terms? (In general, when evaluating a series, it is good practice to evaluate each term in the series as a recurrence from the previous term.)

Good catch — fixed in b71efd7. The coefficient is now accumulated as a product of sub-unit factors, ∏ 1/(m+p+j) · ∏ 1/(m+j), so the running value is monotonically decreasing and can only underflow when the true quotient does (the old form could even hit Inf/Inf = NaN for large p). Verified against BigFloat-exact factorials to 1.1e-15 over m = 1:12, p = 1:60, and added a p = 30 regression test.

@stevengj

…alls Two review-driven changes: * _phi_be_coeff evaluated (m+p)!, m!, (2m+p)!, (2m+p+1)! separately before dividing, so intermediates could spuriously overflow (Inf/Inf = NaN for p >~ 159) even where the quotient is representable (@stevengj). Rewritten as a product of sub-unit factors, prod 1/(m+p+j) * prod 1/(m+j), which is monotonically decreasing and can only underflow when the true value does. Matches the BigFloat-exact value to 1.1e-15 over m = 1:12, p = 1:60; new p = 30 regression test. * The hand-rolled getrf!/getrs! ccalls are gone: the solve now uses the public LinearAlgebra API, lu!(Dfact; check = false) + issuccess + ldiv!, which supports the matrix RHS directly and keeps the no-throw return-code semantics (issuccess false -> info > 0, NaN-filled outputs). LinearSolve.jl was considered but its LinearCache only solves vector right-hand sides through the public path, which would force a column-by-column loop or reaching into cache internals; plain lu!/ldiv! achieves ccall-free with zero new dependencies. Cost: the pivot vector is now the one per-call allocation (8n + ~100 bytes, measured 112 B at n = 7 to 928 B at n = 100; all O(n^2) buffers still reused), since no public API preallocates lu! pivots. Allocation tests assert that O(n) bound. The BlasInt import and ipiv field are removed; phi_almohy.jl is now entirely public-API. Verified on Julia 1.10 and 1.12: 254-assertion Phi testset passes (including new large-p and allocation-bound tests), full basictests 605 pass (only the pre-existing ForwardDiff-1.x Issue 42 failures remain), ExplicitImports clean, NaN/Inf return codes intact. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

ChrisRackauckas-Claude · 2026-07-03T13:18:15Z

Pushed b71efd7:

@stevengj's overflow point: the backward-error coefficient is now computed as a recurrence of sub-unit factors instead of ratios of raw factorials — no intermediate can spuriously overflow. Verified to 1.1e-15 against BigFloat-exact over m = 1:12, p = 1:60, plus a p = 30 regression test.
LAPACK ccalls removed (@ChrisRackauckas). I looked at LinearSolve.jl first, but its LinearCache only solves vector right-hand sides through the public path (init(LinearProblem(A, B::Matrix)) → solve! hits a _ldiv!(::Vector, ::LU, ::Matrix) MethodError), so using it here would mean either an n-column solve loop (BLAS-2 triangular solves instead of one BLAS-3 solve) or reaching into cache.cacheval — the same non-public-internals access the QA check just made us remove. Plain public LinearAlgebra.lu!(D; check=false) + issuccess + ldiv! does everything needed: matrix RHS, no-throw return codes (issuccess → cache.info[]), zero new dependencies. If you'd still prefer the LinearSolve dependency (e.g. for algorithm swappability), happy to switch — it just needs one of those two workarounds.

One measurable tradeoff: lu! allocates its pivot vector internally and there's no public API to preallocate it, so the workspace path now allocates 8n + ~100 bytes per call (112 B at n=7, 928 B at n=100) instead of strictly 0 — all O(n²) buffers are still reused; no GC churn at any realistic size. Allocation tests assert that O(n) bound.

Verified on Julia 1.10 + 1.12: 254-assertion Phi testset, full basictests 605 pass, ExplicitImports clean, NaN/Inf return-code semantics intact.

stevengj · 2026-07-03T15:39:07Z

+    normTs = normT / scalefac
+    delta = (p - 1) * (p - phat) / p + 1
+    K = 2m + p + 1
+    c = (_phi_be_coeff(m, p) / normTs^delta)^(1 / K)


I'm still a little worried about the possibility of spurious underflow/overflow from intermediate terms in this expression, since the 1/K power greatly reduces the exponent of the final result.

That is, either _phi_be_coeff(m, p) or normTs^delta could potentially underflow or overflow, even though the final result would be representable thanks to the 1/K power.

ChrisRackauckas-Claude · 2026-07-03T16:41:01Z

Follow-up on the LinearSolve direction: SciML/LinearSolve.jl#1072 (draft) implements batched RHS as v4.0.0 — LinearProblem(A, B::Matrix) ≡ A\B through the cache path, per SciML/LinearSolve.jl#552. Once that lands and releases, the lu!/ldiv! block in _phi_solve! here can switch to a LinearSolve cache out of the box (the change is confined to that one function plus a cache field). The downstream audit in #1072 found OrdinaryDiffEq/NonlinearSolve only need compat bumps.

ChrisRackauckas-Claude force-pushed the almohy-liu-phi-functions branch from 95f9d70 to 4b0551e Compare July 2, 2026 17:32

stevengj reviewed Jul 3, 2026

View reviewed changes

ChrisRackauckas and others added 3 commits July 3, 2026 04:01

stevengj reviewed Jul 3, 2026

View reviewed changes

ChrisRackauckas-Claude mentioned this pull request Jul 3, 2026

Batched (matrix) right-hand sides: LinearProblem(A, B) ≡ A \ B (v4.0.0) SciML/LinearSolve.jl#1072

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add O(p m³) Al-Mohy–Liu scaling-and-recovering algorithm for the matrix φ-functions#236

Add O(p m³) Al-Mohy–Liu scaling-and-recovering algorithm for the matrix φ-functions#236
ChrisRackauckas-Claude wants to merge 6 commits into
SciML:masterfrom
ChrisRackauckas-Claude:almohy-liu-phi-functions

ChrisRackauckas-Claude commented Jul 2, 2026

Uh oh!

stevengj Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude Jul 3, 2026

Uh oh!

stevengj Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

stevengj Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

stevengj Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

Conversation

ChrisRackauckas-Claude commented Jul 2, 2026

Summary

Scope / compatibility

Validation (run locally)

Tests

Uh oh!

stevengj Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas-Claude Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

stevengj Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas-Claude Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

stevengj Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas-Claude Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

stevengj Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants