Skip to content

Keep PureKLUFactorization off the direct dual AD path (fixes #1064)#1066

Closed
ChrisRackauckas-Claude wants to merge 1 commit into
SciML:mainfrom
ChrisRackauckas-Claude:pureklu-off-direct-dual-path-1064
Closed

Keep PureKLUFactorization off the direct dual AD path (fixes #1064)#1066
ChrisRackauckas-Claude wants to merge 1 commit into
SciML:mainfrom
ChrisRackauckas-Claude:pureklu-off-direct-dual-path-1064

Conversation

@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor

Fixes #1064.

Problem

PR #1041 added PureKLUFactorization to the direct dual solve path (_use_direct_dual_solve), taken whenever A carries duals. For ForwardDiff over an implicit ODE solve (e.g. Rodas5P), the Rosenbrock W matrix carries duals, so every solve factorized W in dual arithmetic via PureKLU's pure-Julia KLU.

This is the same problem #1052/#1057 fixed for RFLUFactorization, which was opted out at the same time. PureKLU was explicitly left on the direct path ("PureKLU are unaffected"), but #1064 shows it is affected identically.

Fix

Drop PureKLUFactorization from _use_direct_dual_solve so it stays on the split primal/partials path, which factorizes the primal W once in fast Float64 arithmetic and reuses that factorization across the partial back-solves — instead of re-running the whole numeric factorization in (2N+1)-wide scalar dual arithmetic. Adds a regression test asserting PureKLU is not routed to the direct path (mirrors the #1052 RFLU test).

Local verification (Julia 1.12, Linux)

Exact issue reproducer (tridiagonal heat equation, N=20, Rodas5P, single derivative):

time alloc
PureKLU direct (before) 3.79 ms 7.93 MiB
PureKLU split (this PR) 1.56 ms 227 KiB
C-backed KLUFactorization (always split) 1.49 ms 222 KiB

Identical derivatives; the split path matches the trusted C-KLU path. test/Core/forwarddiff_overloads.jl passes end-to-end (including the existing PureKLU correctness tests for duals in A / in b / both, and the new #1064 regression test). Runic --check clean.

Scope note (the tradeoff is real, not universal)

The split path is the better default for the sparse-Jacobian / AD-through-implicit-ODE workload PureKLU+ForwardDiff is actually used in. I swept direct-vs-split across sparsity patterns and dual chunk sizes:

  • High fill (general sparse / PDE stencils / random sparse): split wins, by a margin that grows with fill and chunk size (e.g. rand-1%, n=500, chunk=8: 27.9 ms → 7.7 ms).
  • Very low fill (tridiagonal / narrow band) with a large dual chunk: the direct path can edge ahead in a single shot (~2×), because the factorization is so cheap the split path's per-solve overhead dominates. This is not the regime PureKLU+AD hits.

There is also a separate latent inefficiency I found while digging: _solve_direct_dual! calls __init on a fresh cache every solve!, re-doing KLU symbolic analysis and re-allocating the whole cache each time — that is the real source of the 7.93 MiB in the issue. Fixing that is out of scope here; I'll file it separately.


Please ignore until reviewed by @ChrisRackauckas.

🤖 Generated with Claude Code

PR SciML#1041 added PureKLUFactorization to the direct dual solve path, taken
whenever A carries duals. For ForwardDiff over an implicit ODE solve (e.g.
Rodas5P), the Rosenbrock W matrix carries duals, so every solve factorized
W in dual arithmetic via PureKLU's pure-Julia KLU.

This is the same problem that SciML#1052/SciML#1057 fixed for RFLUFactorization, which
was opted out at the same time; PureKLU was left on the direct path ("PureKLU
are unaffected"), but SciML#1064 shows it is affected identically.

The split primal/partials path factorizes the primal W once in fast Float64
arithmetic and reuses that factorization across the partial back-solves,
instead of re-running the whole numeric factorization in (2N+1)-wide scalar
dual arithmetic. On the issue's reproducer (tridiagonal heat equation, N=20,
Rodas5P, single derivative) the direct path was ~2.4x slower (3.79ms ->
1.56ms) with ~35x the allocation (7.93 MiB -> 227 KiB), and the split path
matches the trusted C-backed KLUFactorization (1.49ms / 222 KiB), with
identical derivatives.

Drop PureKLU from _use_direct_dual_solve so it stays on the split path, and
add a regression test asserting it is not routed to the direct path.

Note: the split path is the better default for the sparse-Jacobian /
AD-through-implicit-ODE workload PureKLU+ForwardDiff is used in. For a single
solve of a very-low-fill matrix with a large dual chunk the direct path can
edge ahead, but that is not the regime this hits; a separate inefficiency in
_solve_direct_dual! (it re-inits a fresh cache on every solve, defeating KLU
symbolic reuse) is the real source of the large allocation and is left for a
follow-up.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
@hersle

hersle commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Is this the right fix to make? With this, PureKLU would effectively just do the same split path as KLU, so I don't think PureKLU would be used to its full potential. Also there is no performance gain.

With the different changes in #1065, I am able to make PureKLU continue using the direct path, but fix the unnecessary factorizations. This makes PureKLU faster than KLU, so it makes sense to have both.

@ChrisRackauckas

Copy link
Copy Markdown
Member

This isn't, no. But I'm also a bit skeptical of your PR. I think it needs to branch at a higher level and deal with non-promoting ldiv! in PureKLU itself.

@hersle

hersle commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Yeah I understand that. Please improve it when you find a better way :) At least it makes PureKLU perform (close) to its full potential through ForwardDiff, so any future improvements can check that performance against the reference.

@ChrisRackauckas

Copy link
Copy Markdown
Member

Yeah your PR had nothing wrong really, but it's the fact that it exposed it wasn't caching was like "oh, wait, if we need to work around that... that means we should just make the opt out path be one dispatch higher... and that just means we just need to not promote..." etc. etc., so a bit of a yak shave but it would solve it at its root cause and should be much better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ForwardDiff autodiff with PureKLUFactorization is very slow through implicit ODE solver

3 participants