Add O(p m³) Al-Mohy–Liu scaling-and-recovering algorithm for the matrix φ-functions#236
Conversation
…nctions
`phi(A, k)` / `phi!` previously computed phi_0(A), ..., phi_k(A) by calling
`phiv_dense!` on each of the m basis vectors, each exponentiating an
(m+k) x (m+k) block matrix -- overall O(m (m+k)^3).
This implements Algorithm 5.1 of Al-Mohy & Liu, "A scaling and recovering
algorithm for the matrix phi-functions" (arXiv:2506.01193, 2025), which
computes all phi functions simultaneously in O(k m^3): scale A by 2^-s,
evaluate the [m/m] diagonal Pade approximant to phi_p via Paterson--Stockmeyer,
recover the lower-index approximants with the recurrence
R^{(j)} = z R^{(j+1)} + 1/j!, and undo the scaling with the double-argument
formula. Backward-error-optimal Pade degree m and scaling s are selected from
the paper's theta table (Table 3.1) and the alpha_r / t cost analysis.
The new path is the default for Float64/ComplexF64 dense inputs (its Pade
tables are tuned for double precision); other element types and callers passing
the legacy `caches` bundle keep the basis-vector path. Diagonal inputs are
unchanged.
Validated against a high-precision block-exponential reference and the legacy
algorithm: worst-case relative error ~1e-14 across normal, non-normal,
large-norm, complex, Hessenberg and zero matrices for k = 1..10, matching or
beating the old accuracy. Measured speedups over the basis-vector path: 18x at
n=50, 51x at n=100, 72x at n=200 (k=3).
Closes SciML#235.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
95f9d70 to
4b0551e
Compare
| n = size(A, 1) | ||
| Apow = Vector{Matrix{T}}(undef, tau + 1) | ||
| Apow[1] = Matrix{T}(I, n, n) | ||
| Apow[2] = Matrix(A) |
There was a problem hiding this comment.
Seems unfortunate to use Matrix here if you start with SMatrix
There was a problem hiding this comment.
Addressed in f72fa0c. There are now two paths: for immutable dense inputs (e.g. SMatrix) a container-preserving _phi_almohy_generic keeps everything static — an SMatrix in yields SMatrix results, no Matrix conversion (tested static-in/static-out). The Matrix-based buffers are only used on the strided Float64/ComplexF64 path. Routing uses ismutable so sparse/other mutable non-strided matrices still take the legacy path.
| else | ||
| N .+= blkN * Atau | ||
| D .+= blkD * Atau | ||
| Atau = Atau * Apow[tau + 1] |
There was a problem hiding this comment.
Done in f72fa0c. The strided path now runs fully in place through a reusable PhiPadeCache workspace (all powers, N/D, the LU solve via getrf!/getrs!, and the recovery temporaries), so repeated calls allocate nothing — verified 0 bytes with @allocated across sizes/p. Pass the cache as the caches keyword to phi! to reuse it.
Address review feedback on SciML#236: * PhiPadeCache: a reusable workspace holding every scratch buffer. The strided Float64/ComplexF64 path (`_phi_almohy!`) now runs fully in place via mul!/getrf!/getrs! and broadcasting, allocating nothing once a cache exists (verified 0 bytes with @allocated across sizes). Pass it as the `caches` keyword to `phi!` to reuse across calls. (@stevengj: "do in-place, maybe?") * Container-preserving generic path (`_phi_almohy_generic`) for immutable dense matrices: an SMatrix input stays an SMatrix throughout instead of being converted to a Matrix. Routed only for `!ismutable` inputs, so sparse/other mutable non-strided matrices keep the legacy path. (@stevengj: "unfortunate to use Matrix here if you start with SMatrix") * Export and document PhiPadeCache and phi!; bump to 1.31.0 (new public API, SemVer minor). Tests: workspace reuse + zero-allocation assertion, complex-matrix workspace, and SMatrix static-in/static-out accuracy vs the block-exponential reference. Full basictests + Aqua + JET pass locally (the only failures are the pre-existing ForwardDiff-1.x Issue 42 asserts, green under CI's pinned ForwardDiff 0.10.x). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
|
Pushed f72fa0c addressing the review (thanks @stevengj):
New tests: workspace reuse + zero-allocation assertion, complex-matrix workspace, and Still marked draft / ignore until reviewed. |
`LAPACK.getrf!(A, ipiv)` with a preallocated pivot vector only exists on newer Julia; on the LTS (1.10) it errored, breaking the workspace solve. ccall `getrf!` directly through libblastrampoline (as `StegrWork` does for `stegr!`) so the factorization stays zero-allocation on all supported versions; the 4-argument `getrs!` used for the solve is available everywhere. Verified on Julia 1.10 and 1.12: correctness ~2.5e-15 vs the BigFloat reference and 0 bytes allocated on cache reuse. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
The workspace solve used qualified `LinearAlgebra.BlasInt` and `LinearAlgebra.LAPACK.getrs!`, which SciMLTesting's ExplicitImports `all_qualified_accesses_are_public` check flags as reaching into non-public internals (they are not on its ignore list). Mirror `exp_noalloc.jl` instead: import `BlasInt` (an allow-listed explicit import) and ccall both `getrf!` and `getrs!` through `BLAS.@blasfunc`/`BLAS.libblastrampoline` (allow-listed names). No qualified access to non-public names remains in `phi_almohy.jl`. Behavior unchanged: verified correctness ~2.5e-15 and 0-byte cache reuse on Julia 1.10 and 1.12; Aqua and JET still pass. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
Re-analysis pass over the Al-Mohy--Liu implementation: * Paterson--Stockmeyer rewritten in Horner form with the delta trick. The previous block scheme (mirroring the reference MATLAB) used explicit Atau powers plus two products per block -- up to 12 multiplications for m = 12 where Fasi's pi_m(tau) = 7, so the parameter selection was optimizing a cost the evaluator did not achieve. The Horner form's count now equals pi_m(tau) exactly. Measured 10-20% end-to-end speedup on the workspace path across n = 25..200. * Numerical failure returns codes instead of throwing, so adaptive integrators can reject a step rather than abort: cache.info[] is 0 on success, the LAPACK getrf info for a singular Pade denominator, or -1 for non-finite results (LAPACK does not flag NaN pivots, so a post-solve finiteness check backs the code); outputs are NaN-filled on failure. Fixes a latent InexactError where NaN/Inf input crashed in ceil(Int, t) during parameter selection before reaching the solve, and the generic path's SingularException from `\`. Wrongly sized workspaces (programmer errors) still throw, with informative messages. * Workspace vectors sized from p at construction (was a fixed cap that BoundsError'd for p >~ 48) with an explicit capacity check; opnorm(A, 1) hoisted out of the per-degree ell loop (was recomputed 8x); dead Atau/Id buffers dropped (2n^2 less memory). * Fix a closure-capture leak in the test helpers: phi_block_reference assigned to `n`, clobbering the enclosing testset's `n` (Julia closures assign to existing outer locals); the helpers now declare their temporaries `local`. Verified: worst relative error 8.8e-15 over 33 BigFloat-referenced cases (real/complex, normal/nonnormal, p = 1..8); 0 bytes allocated on cache reuse (Float64 and ComplexF64, n = 10..100, p = 1..8, Julia 1.10 and 1.12); NaN/Inf inputs return info != 0 with NaN outputs and no throw; full basictests 574 pass on 1.10 and 1.12 (only the pre-existing ForwardDiff-1.x Issue 42 failures remain, which pass under CI's pinned ForwardDiff 0.10); ExplicitImports clean. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
|
Pushed 46cad11 — a re-analysis pass over the implementation (performance + failure semantics): Faster. Paterson–Stockmeyer rewritten in Horner form with the δ trick. The previous block scheme (mirroring the reference MATLAB) used up to 12 multiplications for m=12 where Fasi's π_m(τ) = 7 — so the parameter selection was minimizing a cost model the evaluator didn't achieve. The multiplication count now equals π_m(τ) exactly; measured 10–20% end-to-end speedup on the workspace path (n = 25…200), on top of the 18–72× over the old basis-vector algorithm. Return codes instead of exceptions. Numerical failure never throws, so adaptive integrators can reject a step instead of aborting: Robustness. Workspace vectors are now sized from Verified locally (Julia 1.10 + 1.12): worst relerr 8.8e-15 over 33 BigFloat-referenced cases (real/complex, normal/nonnormal, p = 1…8); 0 bytes allocated on cache reuse for Float64 and ComplexF64; NaN/Inf inputs → Still draft / ignore until reviewed. |
|
|
||
| # First-factor coefficient of the backward-error series h_{m,p} (Eq. (3.4)). | ||
| @inline function _phi_be_coeff(m::Integer, p::Integer) | ||
| return (_factf(m + p) / _factf(2m + p)) * (_factf(m) / _factf(2m + p + 1)) |
There was a problem hiding this comment.
This seems vulnerable to spurious overflow in the individual terms? (In general, when evaluating a series, it is good practice to evaluate each term in the series as a recurrence from the previous term.)
There was a problem hiding this comment.
Good catch — fixed in b71efd7. The coefficient is now accumulated as a product of sub-unit factors, ∏ 1/(m+p+j) · ∏ 1/(m+j), so the running value is monotonically decreasing and can only underflow when the true quotient does (the old form could even hit Inf/Inf = NaN for large p). Verified against BigFloat-exact factorials to 1.1e-15 over m = 1:12, p = 1:60, and added a p = 30 regression test.
…alls Two review-driven changes: * _phi_be_coeff evaluated (m+p)!, m!, (2m+p)!, (2m+p+1)! separately before dividing, so intermediates could spuriously overflow (Inf/Inf = NaN for p >~ 159) even where the quotient is representable (@stevengj). Rewritten as a product of sub-unit factors, prod 1/(m+p+j) * prod 1/(m+j), which is monotonically decreasing and can only underflow when the true value does. Matches the BigFloat-exact value to 1.1e-15 over m = 1:12, p = 1:60; new p = 30 regression test. * The hand-rolled getrf!/getrs! ccalls are gone: the solve now uses the public LinearAlgebra API, lu!(Dfact; check = false) + issuccess + ldiv!, which supports the matrix RHS directly and keeps the no-throw return-code semantics (issuccess false -> info > 0, NaN-filled outputs). LinearSolve.jl was considered but its LinearCache only solves vector right-hand sides through the public path, which would force a column-by-column loop or reaching into cache internals; plain lu!/ldiv! achieves ccall-free with zero new dependencies. Cost: the pivot vector is now the one per-call allocation (8n + ~100 bytes, measured 112 B at n = 7 to 928 B at n = 100; all O(n^2) buffers still reused), since no public API preallocates lu! pivots. Allocation tests assert that O(n) bound. The BlasInt import and ipiv field are removed; phi_almohy.jl is now entirely public-API. Verified on Julia 1.10 and 1.12: 254-assertion Phi testset passes (including new large-p and allocation-bound tests), full basictests 605 pass (only the pre-existing ForwardDiff-1.x Issue 42 failures remain), ExplicitImports clean, NaN/Inf return codes intact. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
|
Pushed b71efd7:
Verified on Julia 1.10 + 1.12: 254-assertion Phi testset, full basictests 605 pass, ExplicitImports clean, NaN/Inf return-code semantics intact. |
| normTs = normT / scalefac | ||
| delta = (p - 1) * (p - phat) / p + 1 | ||
| K = 2m + p + 1 | ||
| c = (_phi_be_coeff(m, p) / normTs^delta)^(1 / K) |
There was a problem hiding this comment.
I'm still a little worried about the possibility of spurious underflow/overflow from intermediate terms in this expression, since the 1/K power greatly reduces the exponent of the final result.
That is, either _phi_be_coeff(m, p) or normTs^delta could potentially underflow or overflow, even though the final result would be representable thanks to the 1/K power.
|
Follow-up on the LinearSolve direction: SciML/LinearSolve.jl#1072 (draft) implements batched RHS as v4.0.0 — |
Closes #235.
Summary
phi(A, k)/phi!currently computeφ₀(A), …, φ_k(A)by callingphiv_dense!on each of thembasis vectors, each exponentiating an(m+k)×(m+k)block matrix — overall O(m (m+k)³), as noted in #235.This PR implements Algorithm 5.1 of
which computes all of
φ₀, …, φ_psimultaneously in O(p m³):A ← A/2^s.[m/m]diagonal Padé approximant toφ_pvia Paterson–Stockmeyer.R⁽ʲ⁾ = z R⁽ʲ⁺¹⁾ + 1/j!.φ_j(2A) = 2⁻ʲ(φ₀φ_j + Σ_{k=1}^{j} φ_k/(j−k)!).The backward-error-optimal Padé degree
mand scalingsare selected from the paper'sθ_{m,p}table (Table 3.1) together with theα_r/tcost analysis (§3–4). Implementation follows the authors' reference MATLAB codephi_funm.Scope / compatibility
Float64/ComplexF64dense inputs (the Padé tables are tuned for IEEE double precision).cachesbundle, keep the existing basis-vector path.Diagonalinputs and the public API are unchanged.Validation (run locally)
Against a high-precision (
BigFloat) block-augmented-exponential reference and the legacy algorithm, over normal / non-normal / large-norm / complex / Hessenberg / zero matrices fork = 1…10:nas expected from the complexity change.Tests
New testset in
test/basictests.jl(Phi (Al-Mohy–Liu vs reference), 212 assertions) checking the new default against an independentexp-of-block-matrix reference and the legacyphiv_dense!path, plus the in-place entry point andφ₀ == exp(A).The full
basictests.jlpasses with these changes. Note: two pre-existing failures in the unrelatedIssue 42testset (ForwardDiff derivatives ofexp_generic) reproduce on unmodifiedmasterunder Julia 1.12 and are not touched by this PR — being investigated separately.🤖 Generated with Claude Code