Keep PureKLUFactorization off the direct dual AD path (fixes #1064)#1066
Keep PureKLUFactorization off the direct dual AD path (fixes #1064)#1066ChrisRackauckas-Claude wants to merge 1 commit into
Conversation
PR SciML#1041 added PureKLUFactorization to the direct dual solve path, taken whenever A carries duals. For ForwardDiff over an implicit ODE solve (e.g. Rodas5P), the Rosenbrock W matrix carries duals, so every solve factorized W in dual arithmetic via PureKLU's pure-Julia KLU. This is the same problem that SciML#1052/SciML#1057 fixed for RFLUFactorization, which was opted out at the same time; PureKLU was left on the direct path ("PureKLU are unaffected"), but SciML#1064 shows it is affected identically. The split primal/partials path factorizes the primal W once in fast Float64 arithmetic and reuses that factorization across the partial back-solves, instead of re-running the whole numeric factorization in (2N+1)-wide scalar dual arithmetic. On the issue's reproducer (tridiagonal heat equation, N=20, Rodas5P, single derivative) the direct path was ~2.4x slower (3.79ms -> 1.56ms) with ~35x the allocation (7.93 MiB -> 227 KiB), and the split path matches the trusted C-backed KLUFactorization (1.49ms / 222 KiB), with identical derivatives. Drop PureKLU from _use_direct_dual_solve so it stays on the split path, and add a regression test asserting it is not routed to the direct path. Note: the split path is the better default for the sparse-Jacobian / AD-through-implicit-ODE workload PureKLU+ForwardDiff is used in. For a single solve of a very-low-fill matrix with a large dual chunk the direct path can edge ahead, but that is not the regime this hits; a separate inefficiency in _solve_direct_dual! (it re-inits a fresh cache on every solve, defeating KLU symbolic reuse) is the real source of the large allocation and is left for a follow-up. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
|
Is this the right fix to make? With this, PureKLU would effectively just do the same split path as KLU, so I don't think PureKLU would be used to its full potential. Also there is no performance gain. With the different changes in #1065, I am able to make PureKLU continue using the direct path, but fix the unnecessary factorizations. This makes PureKLU faster than KLU, so it makes sense to have both. |
|
This isn't, no. But I'm also a bit skeptical of your PR. I think it needs to branch at a higher level and deal with non-promoting ldiv! in PureKLU itself. |
|
Yeah I understand that. Please improve it when you find a better way :) At least it makes PureKLU perform (close) to its full potential through ForwardDiff, so any future improvements can check that performance against the reference. |
|
Yeah your PR had nothing wrong really, but it's the fact that it exposed it wasn't caching was like "oh, wait, if we need to work around that... that means we should just make the opt out path be one dispatch higher... and that just means we just need to not promote..." etc. etc., so a bit of a yak shave but it would solve it at its root cause and should be much better. |
Fixes #1064.
Problem
PR #1041 added
PureKLUFactorizationto the direct dual solve path (_use_direct_dual_solve), taken wheneverAcarries duals. For ForwardDiff over an implicit ODE solve (e.g.Rodas5P), the RosenbrockWmatrix carries duals, so every solve factorizedWin dual arithmetic via PureKLU's pure-Julia KLU.This is the same problem #1052/#1057 fixed for
RFLUFactorization, which was opted out at the same time. PureKLU was explicitly left on the direct path ("PureKLU are unaffected"), but #1064 shows it is affected identically.Fix
Drop
PureKLUFactorizationfrom_use_direct_dual_solveso it stays on the split primal/partials path, which factorizes the primalWonce in fastFloat64arithmetic and reuses that factorization across the partial back-solves — instead of re-running the whole numeric factorization in(2N+1)-wide scalar dual arithmetic. Adds a regression test asserting PureKLU is not routed to the direct path (mirrors the #1052 RFLU test).Local verification (Julia 1.12, Linux)
Exact issue reproducer (tridiagonal heat equation,
N=20,Rodas5P, single derivative):KLUFactorization(always split)Identical derivatives; the split path matches the trusted C-KLU path.
test/Core/forwarddiff_overloads.jlpasses end-to-end (including the existing PureKLU correctness tests for duals in A / in b / both, and the new #1064 regression test).Runic --checkclean.Scope note (the tradeoff is real, not universal)
The split path is the better default for the sparse-Jacobian / AD-through-implicit-ODE workload PureKLU+ForwardDiff is actually used in. I swept direct-vs-split across sparsity patterns and dual chunk sizes:
rand-1%,n=500,chunk=8: 27.9 ms → 7.7 ms).There is also a separate latent inefficiency I found while digging:
_solve_direct_dual!calls__initon a fresh cache everysolve!, re-doing KLU symbolic analysis and re-allocating the whole cache each time — that is the real source of the 7.93 MiB in the issue. Fixing that is out of scope here; I'll file it separately.Please ignore until reviewed by @ChrisRackauckas.
🤖 Generated with Claude Code