Skip to content

Add O(p m³) Al-Mohy–Liu scaling-and-recovering algorithm for the matrix φ-functions#236

Draft
ChrisRackauckas-Claude wants to merge 6 commits into
SciML:masterfrom
ChrisRackauckas-Claude:almohy-liu-phi-functions
Draft

Add O(p m³) Al-Mohy–Liu scaling-and-recovering algorithm for the matrix φ-functions#236
ChrisRackauckas-Claude wants to merge 6 commits into
SciML:masterfrom
ChrisRackauckas-Claude:almohy-liu-phi-functions

Conversation

@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor

Please ignore until reviewed by @ChrisRackauckas. Draft; opened by an agent.

Closes #235.

Summary

phi(A, k) / phi! currently compute φ₀(A), …, φ_k(A) by calling phiv_dense! on each of the m basis vectors, each exponentiating an (m+k)×(m+k) block matrix — overall O(m (m+k)³), as noted in #235.

This PR implements Algorithm 5.1 of

Awad H. Al-Mohy and Xiaobo Liu, A scaling and recovering algorithm for the matrix φ-functions, arXiv:2506.01193 (2025),

which computes all of φ₀, …, φ_p simultaneously in O(p m³):

  1. Scale A ← A/2^s.
  2. Evaluate the [m/m] diagonal Padé approximant to φ_p via Paterson–Stockmeyer.
  3. Recover the lower-index approximants with the recurrence R⁽ʲ⁾ = z R⁽ʲ⁺¹⁾ + 1/j!.
  4. Undo the scaling with the double-argument formula φ_j(2A) = 2⁻ʲ(φ₀φ_j + Σ_{k=1}^{j} φ_k/(j−k)!).

The backward-error-optimal Padé degree m and scaling s are selected from the paper's θ_{m,p} table (Table 3.1) together with the α_r/t cost analysis (§3–4). Implementation follows the authors' reference MATLAB code phi_funm.

Scope / compatibility

  • New path is the default for Float64/ComplexF64 dense inputs (the Padé tables are tuned for IEEE double precision).
  • Other element types, and callers passing the legacy caches bundle, keep the existing basis-vector path.
  • Diagonal inputs and the public API are unchanged.
  • The quasi-triangular structure exploitation from the paper (§5.1, an accuracy refinement for already-triangular inputs) is not included here; the general path is used for all inputs. Could be a follow-up.

Validation (run locally)

Against a high-precision (BigFloat) block-augmented-exponential reference and the legacy algorithm, over normal / non-normal / large-norm / complex / Hessenberg / zero matrices for k = 1…10:

  • Worst-case relative error ~1e-14, matching or beating the old path (e.g. on a diagonally-dominant 12×12 the new path is ~1e-16 vs ~2e-14 for the old).
  • Speedups over the basis-vector path (k=3): 18× at n=50, 51× at n=100, 72× at n=200, growing with n as expected from the complexity change.

Tests

New testset in test/basictests.jl (Phi (Al-Mohy–Liu vs reference), 212 assertions) checking the new default against an independent exp-of-block-matrix reference and the legacy phiv_dense! path, plus the in-place entry point and φ₀ == exp(A).

The full basictests.jl passes with these changes. Note: two pre-existing failures in the unrelated Issue 42 testset (ForwardDiff derivatives of exp_generic) reproduce on unmodified master under Julia 1.12 and are not touched by this PR — being investigated separately.

🤖 Generated with Claude Code

…nctions

`phi(A, k)` / `phi!` previously computed phi_0(A), ..., phi_k(A) by calling
`phiv_dense!` on each of the m basis vectors, each exponentiating an
(m+k) x (m+k) block matrix -- overall O(m (m+k)^3).

This implements Algorithm 5.1 of Al-Mohy & Liu, "A scaling and recovering
algorithm for the matrix phi-functions" (arXiv:2506.01193, 2025), which
computes all phi functions simultaneously in O(k m^3): scale A by 2^-s,
evaluate the [m/m] diagonal Pade approximant to phi_p via Paterson--Stockmeyer,
recover the lower-index approximants with the recurrence
R^{(j)} = z R^{(j+1)} + 1/j!, and undo the scaling with the double-argument
formula. Backward-error-optimal Pade degree m and scaling s are selected from
the paper's theta table (Table 3.1) and the alpha_r / t cost analysis.

The new path is the default for Float64/ComplexF64 dense inputs (its Pade
tables are tuned for double precision); other element types and callers passing
the legacy `caches` bundle keep the basis-vector path. Diagonal inputs are
unchanged.

Validated against a high-precision block-exponential reference and the legacy
algorithm: worst-case relative error ~1e-14 across normal, non-normal,
large-norm, complex, Hessenberg and zero matrices for k = 1..10, matching or
beating the old accuracy. Measured speedups over the basis-vector path: 18x at
n=50, 51x at n=100, 72x at n=200 (k=3).

Closes SciML#235.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
Comment thread src/phi_almohy.jl Outdated
n = size(A, 1)
Apow = Vector{Matrix{T}}(undef, tau + 1)
Apow[1] = Matrix{T}(I, n, n)
Apow[2] = Matrix(A)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unfortunate to use Matrix here if you start with SMatrix

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f72fa0c. There are now two paths: for immutable dense inputs (e.g. SMatrix) a container-preserving _phi_almohy_generic keeps everything static — an SMatrix in yields SMatrix results, no Matrix conversion (tested static-in/static-out). The Matrix-based buffers are only used on the strided Float64/ComplexF64 path. Routing uses ismutable so sparse/other mutable non-strided matrices still take the legacy path.

Comment thread src/phi_almohy.jl Outdated
else
N .+= blkN * Atau
D .+= blkD * Atau
Atau = Atau * Apow[tau + 1]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do in-place, maybe?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f72fa0c. The strided path now runs fully in place through a reusable PhiPadeCache workspace (all powers, N/D, the LU solve via getrf!/getrs!, and the recovery temporaries), so repeated calls allocate nothing — verified 0 bytes with @allocated across sizes/p. Pass the cache as the caches keyword to phi! to reuse it.

Address review feedback on SciML#236:

* PhiPadeCache: a reusable workspace holding every scratch buffer. The
  strided Float64/ComplexF64 path (`_phi_almohy!`) now runs fully in place
  via mul!/getrf!/getrs! and broadcasting, allocating nothing once a cache
  exists (verified 0 bytes with @allocated across sizes). Pass it as the
  `caches` keyword to `phi!` to reuse across calls. (@stevengj: "do
  in-place, maybe?")

* Container-preserving generic path (`_phi_almohy_generic`) for immutable
  dense matrices: an SMatrix input stays an SMatrix throughout instead of
  being converted to a Matrix. Routed only for `!ismutable` inputs, so
  sparse/other mutable non-strided matrices keep the legacy path.
  (@stevengj: "unfortunate to use Matrix here if you start with SMatrix")

* Export and document PhiPadeCache and phi!; bump to 1.31.0 (new public
  API, SemVer minor).

Tests: workspace reuse + zero-allocation assertion, complex-matrix
workspace, and SMatrix static-in/static-out accuracy vs the block-exponential
reference. Full basictests + Aqua + JET pass locally (the only failures are
the pre-existing ForwardDiff-1.x Issue 42 asserts, green under CI's pinned
ForwardDiff 0.10.x).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

Pushed f72fa0c addressing the review (thanks @stevengj):

  • Reusable zero-alloc workspace PhiPadeCache. The strided Float64/ComplexF64 path now runs entirely in place (powers, N/D via Paterson–Stockmeyer, LU solve through getrf!/getrs!, and the double-argument recovery), so a preallocated cache reused across calls allocates 0 bytes (checked with @allocated across sizes and p). Reuse it via phi!(out, A, k; caches = cache).
  • Container-preserving static path _phi_almohy_generic for immutable dense inputs: an SMatrix stays an SMatrix throughout — no internal Matrix conversion. Gated on ismutable, so sparse and other mutable non-strided matrices keep the legacy path (fixes a sparse regression I caught in the local suite).
  • Exported + documented PhiPadeCache and phi!; version bump to 1.31.0 (new public API, SemVer-minor).

New tests: workspace reuse + zero-allocation assertion, complex-matrix workspace, and SMatrix static-in/static-out accuracy. Locally green: full basictests.jl, Aqua (10/10), and JET (7/7, including report_call(phi, ...)). The only failing asserts in the suite are the pre-existing ForwardDiff-1.x Issue 42 ones, which pass under CI's compat-pinned ForwardDiff 0.10.x.

Still marked draft / ignore until reviewed.

ChrisRackauckas and others added 3 commits July 3, 2026 04:01
`LAPACK.getrf!(A, ipiv)` with a preallocated pivot vector only exists on
newer Julia; on the LTS (1.10) it errored, breaking the workspace solve.
ccall `getrf!` directly through libblastrampoline (as `StegrWork` does for
`stegr!`) so the factorization stays zero-allocation on all supported
versions; the 4-argument `getrs!` used for the solve is available
everywhere.

Verified on Julia 1.10 and 1.12: correctness ~2.5e-15 vs the BigFloat
reference and 0 bytes allocated on cache reuse.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
The workspace solve used qualified `LinearAlgebra.BlasInt` and
`LinearAlgebra.LAPACK.getrs!`, which SciMLTesting's ExplicitImports
`all_qualified_accesses_are_public` check flags as reaching into non-public
internals (they are not on its ignore list). Mirror `exp_noalloc.jl`
instead: import `BlasInt` (an allow-listed explicit import) and ccall both
`getrf!` and `getrs!` through `BLAS.@blasfunc`/`BLAS.libblastrampoline`
(allow-listed names). No qualified access to non-public names remains in
`phi_almohy.jl`.

Behavior unchanged: verified correctness ~2.5e-15 and 0-byte cache reuse on
Julia 1.10 and 1.12; Aqua and JET still pass.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
Re-analysis pass over the Al-Mohy--Liu implementation:

* Paterson--Stockmeyer rewritten in Horner form with the delta trick. The
  previous block scheme (mirroring the reference MATLAB) used explicit
  Atau powers plus two products per block -- up to 12 multiplications for
  m = 12 where Fasi's pi_m(tau) = 7, so the parameter selection was
  optimizing a cost the evaluator did not achieve. The Horner form's count
  now equals pi_m(tau) exactly. Measured 10-20% end-to-end speedup on the
  workspace path across n = 25..200.

* Numerical failure returns codes instead of throwing, so adaptive
  integrators can reject a step rather than abort: cache.info[] is 0 on
  success, the LAPACK getrf info for a singular Pade denominator, or -1
  for non-finite results (LAPACK does not flag NaN pivots, so a post-solve
  finiteness check backs the code); outputs are NaN-filled on failure.
  Fixes a latent InexactError where NaN/Inf input crashed in
  ceil(Int, t) during parameter selection before reaching the solve, and
  the generic path's SingularException from `\`. Wrongly sized workspaces
  (programmer errors) still throw, with informative messages.

* Workspace vectors sized from p at construction (was a fixed cap that
  BoundsError'd for p >~ 48) with an explicit capacity check; opnorm(A, 1)
  hoisted out of the per-degree ell loop (was recomputed 8x); dead Atau/Id
  buffers dropped (2n^2 less memory).

* Fix a closure-capture leak in the test helpers: phi_block_reference
  assigned to `n`, clobbering the enclosing testset's `n` (Julia closures
  assign to existing outer locals); the helpers now declare their
  temporaries `local`.

Verified: worst relative error 8.8e-15 over 33 BigFloat-referenced cases
(real/complex, normal/nonnormal, p = 1..8); 0 bytes allocated on cache
reuse (Float64 and ComplexF64, n = 10..100, p = 1..8, Julia 1.10 and
1.12); NaN/Inf inputs return info != 0 with NaN outputs and no throw;
full basictests 574 pass on 1.10 and 1.12 (only the pre-existing
ForwardDiff-1.x Issue 42 failures remain, which pass under CI's pinned
ForwardDiff 0.10); ExplicitImports clean.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

Pushed 46cad11 — a re-analysis pass over the implementation (performance + failure semantics):

Faster. Paterson–Stockmeyer rewritten in Horner form with the δ trick. The previous block scheme (mirroring the reference MATLAB) used up to 12 multiplications for m=12 where Fasi's π_m(τ) = 7 — so the parameter selection was minimizing a cost model the evaluator didn't achieve. The multiplication count now equals π_m(τ) exactly; measured 10–20% end-to-end speedup on the workspace path (n = 25…200), on top of the 18–72× over the old basis-vector algorithm.

Return codes instead of exceptions. Numerical failure never throws, so adaptive integrators can reject a step instead of aborting: cache.info[] is 0 on success, the LAPACK getrf info for a singular Padé denominator, or -1 for non-finite results (backed by a post-solve finiteness check since LAPACK doesn't flag NaN pivots); outputs are NaN-filled on failure. This also fixes a latent InexactError where NaN/Inf input crashed in ceil(Int, t) during parameter selection — an integrator-killer that predates this PR's review round. Wrongly sized workspaces (programmer errors) still throw with informative messages.

Robustness. Workspace vectors are now sized from p at construction (the fixed cap silently BoundsError'd for p ≳ 48); opnorm(A,1) hoisted out of the per-degree loop; two dead n² buffers dropped; a closure-capture leak in the test helpers fixed (phi_block_reference was assigning to the enclosing testset's n).

Verified locally (Julia 1.10 + 1.12): worst relerr 8.8e-15 over 33 BigFloat-referenced cases (real/complex, normal/nonnormal, p = 1…8); 0 bytes allocated on cache reuse for Float64 and ComplexF64; NaN/Inf inputs → info != 0, NaN outputs, no throw; full basictests 574 pass ×2 versions; ExplicitImports clean.

Still draft / ignore until reviewed.

Comment thread src/phi_almohy.jl Outdated

# First-factor coefficient of the backward-error series h_{m,p} (Eq. (3.4)).
@inline function _phi_be_coeff(m::Integer, p::Integer)
return (_factf(m + p) / _factf(2m + p)) * (_factf(m) / _factf(2m + p + 1))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems vulnerable to spurious overflow in the individual terms? (In general, when evaluating a series, it is good practice to evaluate each term in the series as a recurrence from the previous term.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in b71efd7. The coefficient is now accumulated as a product of sub-unit factors, ∏ 1/(m+p+j) · ∏ 1/(m+j), so the running value is monotonically decreasing and can only underflow when the true quotient does (the old form could even hit Inf/Inf = NaN for large p). Verified against BigFloat-exact factorials to 1.1e-15 over m = 1:12, p = 1:60, and added a p = 30 regression test.

…alls

Two review-driven changes:

* _phi_be_coeff evaluated (m+p)!, m!, (2m+p)!, (2m+p+1)! separately before
  dividing, so intermediates could spuriously overflow (Inf/Inf = NaN for
  p >~ 159) even where the quotient is representable (@stevengj). Rewritten
  as a product of sub-unit factors,
  prod 1/(m+p+j) * prod 1/(m+j), which is monotonically decreasing and can
  only underflow when the true value does. Matches the BigFloat-exact value
  to 1.1e-15 over m = 1:12, p = 1:60; new p = 30 regression test.

* The hand-rolled getrf!/getrs! ccalls are gone: the solve now uses the
  public LinearAlgebra API, lu!(Dfact; check = false) + issuccess + ldiv!,
  which supports the matrix RHS directly and keeps the no-throw return-code
  semantics (issuccess false -> info > 0, NaN-filled outputs).
  LinearSolve.jl was considered but its LinearCache only solves vector
  right-hand sides through the public path, which would force a
  column-by-column loop or reaching into cache internals; plain lu!/ldiv!
  achieves ccall-free with zero new dependencies. Cost: the pivot vector is
  now the one per-call allocation (8n + ~100 bytes, measured 112 B at
  n = 7 to 928 B at n = 100; all O(n^2) buffers still reused), since no
  public API preallocates lu! pivots. Allocation tests assert that O(n)
  bound. The BlasInt import and ipiv field are removed; phi_almohy.jl is
  now entirely public-API.

Verified on Julia 1.10 and 1.12: 254-assertion Phi testset passes
(including new large-p and allocation-bound tests), full basictests 605
pass (only the pre-existing ForwardDiff-1.x Issue 42 failures remain),
ExplicitImports clean, NaN/Inf return codes intact.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

Pushed b71efd7:

  • @stevengj's overflow point: the backward-error coefficient is now computed as a recurrence of sub-unit factors instead of ratios of raw factorials — no intermediate can spuriously overflow. Verified to 1.1e-15 against BigFloat-exact over m = 1:12, p = 1:60, plus a p = 30 regression test.

  • LAPACK ccalls removed (@ChrisRackauckas). I looked at LinearSolve.jl first, but its LinearCache only solves vector right-hand sides through the public path (init(LinearProblem(A, B::Matrix))solve! hits a _ldiv!(::Vector, ::LU, ::Matrix) MethodError), so using it here would mean either an n-column solve loop (BLAS-2 triangular solves instead of one BLAS-3 solve) or reaching into cache.cacheval — the same non-public-internals access the QA check just made us remove. Plain public LinearAlgebra.lu!(D; check=false) + issuccess + ldiv! does everything needed: matrix RHS, no-throw return codes (issuccesscache.info[]), zero new dependencies. If you'd still prefer the LinearSolve dependency (e.g. for algorithm swappability), happy to switch — it just needs one of those two workarounds.

    One measurable tradeoff: lu! allocates its pivot vector internally and there's no public API to preallocate it, so the workspace path now allocates 8n + ~100 bytes per call (112 B at n=7, 928 B at n=100) instead of strictly 0 — all O(n²) buffers are still reused; no GC churn at any realistic size. Allocation tests assert that O(n) bound.

Verified on Julia 1.10 + 1.12: 254-assertion Phi testset, full basictests 605 pass, ExplicitImports clean, NaN/Inf return-code semantics intact.

Comment thread src/phi_almohy.jl
normTs = normT / scalefac
delta = (p - 1) * (p - phat) / p + 1
K = 2m + p + 1
c = (_phi_be_coeff(m, p) / normTs^delta)^(1 / K)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a little worried about the possibility of spurious underflow/overflow from intermediate terms in this expression, since the 1/K power greatly reduces the exponent of the final result.

That is, either _phi_be_coeff(m, p) or normTs^delta could potentially underflow or overflow, even though the final result would be representable thanks to the 1/K power.

@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

Follow-up on the LinearSolve direction: SciML/LinearSolve.jl#1072 (draft) implements batched RHS as v4.0.0 — LinearProblem(A, B::Matrix)A\B through the cache path, per SciML/LinearSolve.jl#552. Once that lands and releases, the lu!/ldiv! block in _phi_solve! here can switch to a LinearSolve cache out of the box (the change is confined to that one function plus a cache field). The downstream audit in #1072 found OrdinaryDiffEq/NonlinearSolve only need compat bumps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

faster Al-Mohy (2025) algorithm for matrix phi function

3 participants