CUDA solve: persistent kernel with cuBLASDx for device-side triangular solve #5

@robtaylor

Context

The CUDA LU solve takes 3.7ms versus 0.6ms for cuDSS (roughly a 6x gap) on c6288 (a 25380x25380 circuit Jacobian). Factorization is already close to cuDSS parity (2.85ms vs 2.5ms).

Current solve architecture:

  • Separate kernel dispatches for sparse-elim forward L / backward U phases
  • Per-lump CPU iteration for 16 dense lumps
  • Three flush() barriers (permutation → L solve → U solve)
  • Multiple cuBLAS calls for dense GEMV/TRSV

Proposed approach

Implement a persistent-kernel solve using cuBLASDx (device-side BLAS), following the cuDSS architecture presented at Sparse Days 2024:

  1. Single kernel launch for the entire triangular solve (forward L + backward U)
  2. Inter-CTA synchronization via atomicAdd on done[] counters — thread blocks spin-wait until dependencies are satisfied, then immediately process their supernode
  3. cuBLASDx for device-side GEMV/TRSV — no separate cuBLAS dispatch overhead
  4. Level-set parallelism — independent supernodes at the same tree level processed by different thread blocks simultaneously
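To make the done[] dependency scheme in the steps above concrete, here is a host-side C++ analogue, not the CUDA kernel itself. It assumes std::thread workers standing in for resident thread blocks, std::atomic counters standing in for atomicAdd on done[], and single rows standing in for supernodes. The names `Entry`, `Row`, and `forwardSolveUnitL` are hypothetical and do not exist in the codebase.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical sparse row of a unit lower-triangular L (diagonal is implicit 1).
struct Entry { int col; double val; };          // L(i, col) = val, with col < i
struct Row { std::vector<Entry> offDiag; };

// Solves L x = b with one "launch": every worker stays alive for the whole
// solve and spin-waits on done[j] before consuming x[j].
std::vector<double> forwardSolveUnitL(const std::vector<Row>& rows,
                                      std::vector<double> b, int numWorkers) {
    assert(numWorkers >= 1);
    const int n = static_cast<int>(rows.size());
    std::vector<std::atomic<int>> done(n);
    for (auto& d : done) d.store(0, std::memory_order_relaxed);

    auto worker = [&](int w) {
        // Rows are handed out in increasing order, so the lowest unfinished
        // row always has its dependencies satisfied and the solve cannot
        // deadlock as long as all workers keep making progress.
        for (int i = w; i < n; i += numWorkers) {
            double xi = b[i];
            for (const Entry& e : rows[i].offDiag) {
                assert(e.col < i);
                // Spin until the producer of x[e.col] has published it.
                while (done[e.col].load(std::memory_order_acquire) == 0) {
                    std::this_thread::yield();
                }
                xi -= e.val * b[e.col];
            }
            b[i] = xi;
            // Host analogue of atomicAdd(&done[i], 1): publish x[i].
            done[i].fetch_add(1, std::memory_order_release);
        }
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < numWorkers; ++w) pool.emplace_back(worker, w);
    for (auto& t : pool) t.join();
    return b;  // b now holds x
}
```

The release/acquire pair plays the role that a memory fence around the atomicAdd would play on the device. A real kernel would also need every spinning block to be resident at the same time, which is why the forward-progress requirement matters.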

Key requirements

  • cuBLASDx — compile-time template library for device-side BLAS
  • NVIDIA forward-progress guarantee for resident thread blocks
  • Pre-computed dependency graph (already available via LevelSetSchedule)
  • Shared memory sizing for per-supernode TRSV/GEMV working storage

Expected improvement

  • Solve: 3.7ms → <1ms (eliminate all kernel launch + cuBLAS dispatch overhead)
  • Total LU: 6.5ms → ~4ms (approaching cuDSS's 3.1ms)

Notes

  • This is CUDA-only; Metal lacks forward-progress guarantees and device-side BLAS
  • The existing modular solve (Solver.cpp internalSolveLRangeUnit / internalSolveURange) should remain as fallback for non-CUDA backends
  • cuBLASDx requires matrix sizes to be specified at compile time via templates, so this may need a few size specializations selected by runtime dispatch
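One way the size-specialization note could play out is sketched below in plain host C++. It does not use the cuBLASDx API: the template `solveUnitLowerTile<N>` only stands in for a routine whose size is fixed at compile time, and the tile sizes 8/16/32 are placeholder assumptions. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for a size-specialized device routine: solves L x = b in place for
// a unit lower-triangular, row-major N x N tile (diagonal is implicit 1).
template <int N>
void solveUnitLowerTile(const double* L, double* x) {
    for (int i = 1; i < N; ++i) {
        double acc = x[i];
        for (int j = 0; j < i; ++j) acc -= L[i * N + j] * x[j];
        x[i] = acc;
    }
}

// Smallest supported tile size that fits n, or 0 if none does.
// The sizes are illustrative, not measured choices.
inline int pickTileSize(int n) {
    if (n <= 8) return 8;
    if (n <= 16) return 16;
    if (n <= 32) return 32;
    return 0;
}

// Runtime dispatch: zero-pad the n x n system up to the chosen tile size and
// call the matching compile-time instantiation.
std::vector<double> solveUnitLower(const std::vector<double>& L,
                                   const std::vector<double>& b, int n) {
    assert(n > 0);
    const int t = pickTileSize(n);
    if (t == 0) throw std::invalid_argument("no specialization fits this size");

    std::vector<double> Lp(static_cast<std::size_t>(t) * t, 0.0);
    std::vector<double> x(static_cast<std::size_t>(t), 0.0);
    for (int i = 0; i < n; ++i) {
        x[i] = b[i];
        for (int j = 0; j < i; ++j) Lp[i * t + j] = L[i * n + j];
    }

    switch (t) {
        case 8:  solveUnitLowerTile<8>(Lp.data(), x.data());  break;
        case 16: solveUnitLowerTile<16>(Lp.data(), x.data()); break;
        default: solveUnitLowerTile<32>(Lp.data(), x.data()); break;
    }
    x.resize(static_cast<std::size_t>(n));  // drop the padded rows
    return x;
}
```

Padded rows have a zero right-hand side and zero off-diagonals, so they do not change the real entries. The cost is wasted work on small supernodes, which is the trade-off against compiling more specializations.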
