
Add SparseMatricesCSR.jl Ext #906

Merged: luraess merged 4 commits into JuliaGPU:main from Abdelrahman912:csr-ext on May 30, 2026

@Abdelrahman912
Contributor

Add support for CPU CSR matrices from SparseMatricesCSR.jl, e.g., ROCSparseMatrixCSR(::SparseMatrixCSR).
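For context, a CSR matrix is described by three arrays: row pointers, column indices and stored values (the rowptr/colval fields of SparseMatrixCSR, plus the values). These are what a CPU→GPU constructor such as ROCSparseMatrixCSR(::SparseMatrixCSR) has to copy to the device. The sketch below is not the extension's code and does not use SparseMatricesCSR.jl; it is an illustration built only on the SparseArrays standard library, relying on the fact that the CSC storage of transpose(A) is exactly the 1-based CSR storage of A. The function name csr_arrays is hypothetical.

```julia
using SparseArrays

# Return the 1-based CSR arrays (rowptr, colval, nzval) of a CSC matrix.
# The CSC storage of transpose(A) is exactly the CSR storage of A,
# so materializing the transpose and reading its fields is enough.
function csr_arrays(A::SparseMatrixCSC)
    At = copy(transpose(A))   # a new SparseMatrixCSC holding transpose(A)
    return (At.colptr, rowvals(At), nonzeros(At))
end
```

For A = sparse([1 0 2; 0 0 3; 4 0 0]) this gives rowptr = [1, 3, 4, 5], colval = [1, 3, 3, 1] and nzval = [1, 2, 3, 4].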

@luraess
Member

luraess commented May 18, 2026

Thanks for the contribution! The structure closely mirrors what CUDA.jl does in lib/cusparse/ext/SparseMatricesCSRExt.jl.

One thing to consider (unless I am wrong): SparseMatricesCSR.SparseMatrixCSR{Bi,Tv,Ti} has a type parameter Bi for the index base (0 or 1). The CPU→GPU constructors currently use Mat.rowptr/Mat.colval directly without checking Bi, which would silently produce incorrect index arrays for a 0-based SparseMatrixCSR{0}: the Julia wrappers for rocSPARSE expect 1-based pointers (matching the SparseMatrixCSR{1} used in the GPU→CPU direction). Restricting the dispatch to {1} would turn this into a clear error instead:

ROCSparseMatrixCSR{T}(Mat::SparseMatrixCSR{1}) where {T} = ...

Note that CUDA.jl has the same gap, but this could also be fixed there if need be.
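To make the index-base concern concrete, here is a minimal sketch under stated assumptions. BasedCSR, upload_arrays and to_one_based are hypothetical names, not part of SparseMatricesCSR.jl or AMDGPU.jl; BasedCSR only mimics the Bi type parameter described above and assumes the index arrays are stored in base Bi. Dispatching on BasedCSR{1} turns a 0-based input into a MethodError, and converting to base 1 amounts to adding one to every stored index.

```julia
# Hypothetical stand-in for SparseMatricesCSR.SparseMatrixCSR{Bi,Tv,Ti}:
# the index base Bi (0 or 1) is a type parameter, and the index arrays
# are assumed to be stored in that base.
struct BasedCSR{Bi,Tv,Ti<:Integer}
    m::Int
    n::Int
    rowptr::Vector{Ti}
    colval::Vector{Ti}
    nzval::Vector{Tv}
end

# Accept only 1-based input, like the suggested
# ROCSparseMatrixCSR{T}(Mat::SparseMatrixCSR{1}) signature.
# A BasedCSR{0} raises a MethodError instead of yielding shifted indices.
upload_arrays(A::BasedCSR{1}) = (copy(A.rowptr), copy(A.colval), copy(A.nzval))

# Explicit 0-based -> 1-based conversion: shift every stored index by one.
function to_one_based(A::BasedCSR{0,Tv,Ti}) where {Tv,Ti<:Integer}
    return BasedCSR{1,Tv,Ti}(A.m, A.n, A.rowptr .+ one(Ti), A.colval .+ one(Ti), copy(A.nzval))
end
```

Rejecting {0} outright, as suggested in the review, is the conservative choice: indices are never reinterpreted silently, and callers who hold 0-based data can convert explicitly first.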

@Abdelrahman912
Contributor Author

This should be ready for merge.

Contributor

@github-actions github-actions Bot left a comment


AMDGPU.jl Benchmarks

Benchmark suite Current: cd19e25 Previous: 77983ff Ratio
amdgpu/synchronization/context/device 580 ns 590 ns 0.98
amdgpu/synchronization/stream/blocking 240 ns 250 ns 0.96
amdgpu/synchronization/stream/nonblocking 340 ns 340 ns 1
array/accumulate/Float32/1d 86522 ns 87811 ns 0.99
array/accumulate/Float32/dims=1 389157 ns 403115 ns 0.97
array/accumulate/Float32/dims=1L 136393 ns 135562 ns 1.01
array/accumulate/Float32/dims=2 129533 ns 130651 ns 0.99
array/accumulate/Float32/dims=2L 2829561 ns 2830168 ns 1.00
array/accumulate/Int64/1d 94602 ns 95162 ns 0.99
array/accumulate/Int64/dims=1 404208 ns 287554 ns 1.41
array/accumulate/Int64/dims=1L 167973 ns 163062 ns 1.03
array/accumulate/Int64/dims=2 126082 ns 122662 ns 1.03
array/accumulate/Int64/dims=2L 3006764 ns 3009581 ns 1.00
array/broadcast 82021 ns 92571 ns 0.89
array/construct 1760 ns 1650 ns 1.07
array/copy 37901 ns 40541 ns 0.93
array/copyto!/cpu_to_gpu 123523 ns 101241 ns 1.22
array/copyto!/gpu_to_cpu 183374 ns 171042 ns 1.07
array/copyto!/gpu_to_gpu 59931 ns 67061 ns 0.89
array/iteration/findall/bool 183394 ns 186323 ns 0.98
array/iteration/findall/int 196534 ns 193182 ns 1.02
array/iteration/findfirst/bool 120442 ns 116491 ns 1.03
array/iteration/findfirst/int 115812 ns 116172 ns 1.00
array/iteration/findmin/1d 170134 ns 170953 ns 1.00
array/iteration/findmin/2d 156333 ns 157112 ns 1.00
array/iteration/logical 355797 ns 357464 ns 1.00
array/iteration/scalar 289836 ns 297354 ns 0.97
array/permutedims/2d 74041 ns 76211 ns 0.97
array/permutedims/3d 74062 ns 75451 ns 0.98
array/permutedims/4d 76681 ns 77631 ns 0.99
array/random/rand/Float32 52941 ns 52101 ns 1.02
array/random/rand/Int64 57841 ns 57911 ns 1.00
array/random/rand!/Float32 72201 ns 90721 ns 0.80
array/random/rand!/Int64 113362 ns 79251 ns 1.43
array/random/randn/Float32 88392 ns 94361 ns 0.94
array/random/randn!/Float32 109012 ns 111361 ns 0.98
array/reductions/mapreduce/Float32/1d 133433 ns 134212 ns 0.99
array/reductions/mapreduce/Float32/dims=1 94822 ns 95701 ns 0.99
array/reductions/mapreduce/Float32/dims=1L 773215 ns 781590 ns 0.99
array/reductions/mapreduce/Float32/dims=2 97082 ns 97911 ns 0.99
array/reductions/mapreduce/Float32/dims=2L 307446 ns 296814 ns 1.04
array/reductions/mapreduce/Int64/1d 133202 ns 131561 ns 1.01
array/reductions/mapreduce/Int64/dims=1 95131 ns 95531 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 783035 ns 785040 ns 1.00
array/reductions/mapreduce/Int64/dims=2 96311 ns 97011 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 305326 ns 300435 ns 1.02
array/reductions/reduce/Float32/1d 132643 ns 130022 ns 1.02
array/reductions/reduce/Float32/dims=1 95222 ns 95462 ns 1.00
array/reductions/reduce/Float32/dims=1L 773155 ns 776581 ns 1.00
array/reductions/reduce/Float32/dims=2 97282 ns 97971 ns 0.99
array/reductions/reduce/Float32/dims=2L 307396 ns 299134 ns 1.03
array/reductions/reduce/Int64/1d 133453 ns 134991 ns 0.99
array/reductions/reduce/Int64/dims=1 94811 ns 95171 ns 1.00
array/reductions/reduce/Int64/dims=1L 781526 ns 783650 ns 1.00
array/reductions/reduce/Int64/dims=2 96212 ns 97052 ns 0.99
array/reductions/reduce/Int64/dims=2L 299866 ns 298874 ns 1.00
array/reverse/1d 44531 ns 44560 ns 1.00
array/reverse/1dL 75841 ns 76361 ns 0.99
array/reverse/1dL_inplace 110452 ns 119381 ns 0.93
array/reverse/1d_inplace 77421 ns 79391 ns 0.98
array/reverse/2d 52031 ns 52601 ns 0.99
array/reverse/2dL 101522 ns 102402 ns 0.99
array/reverse/2dL_inplace 112843 ns 104651 ns 1.08
array/reverse/2d_inplace 121763 ns 125452 ns 0.97
array/sorting/1d 341026 ns 370525 ns 0.92
integration/byval/reference 38731 ns 39170 ns 0.99
integration/byval/slices=1 39990 ns 40441 ns 0.99
integration/byval/slices=2 162353 ns 147172 ns 1.10
integration/byval/slices=3 243344 ns 240543 ns 1.01
integration/volumerhs 5037498 ns 5026059 ns 1.00
kernel/indexing 73662 ns 65060 ns 1.13
kernel/indexing_checked 72222 ns 50911 ns 1.42
kernel/launch 1340 ns 1280 ns 1.05
kernel/rand 197324 ns 197633 ns 1.00
latency/import 1473069466 ns 1473646012 ns 1.00
latency/precompile 11897056494 ns 11890085969 ns 1.00
latency/ttfp 10343919358 ns 10356361296 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
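The Ratio column is the current time divided by the previous time, so values above 1 mean the new commit was slower on that benchmark (for example, 404208 / 287554 ≈ 1.41 for array/accumulate/Int64/dims=1). A small sketch that recomputes a few ratios from the table and flags those above a hypothetical 1.2 threshold could look like this. The threshold is an assumption, not something the benchmark workflow uses, and the check says nothing about whether a flagged row is a real regression or run-to-run noise.

```julia
# A few rows copied from the benchmark table: (suite, current_ns, previous_ns).
rows = [
    ("amdgpu/synchronization/context/device", 580, 590),
    ("array/accumulate/Int64/dims=1", 404208, 287554),
    ("array/copyto!/cpu_to_gpu", 123523, 101241),
    ("array/random/rand!/Float32", 72201, 90721),
    ("array/random/rand!/Int64", 113362, 79251),
    ("kernel/indexing_checked", 72222, 50911),
]

# Ratio = current / previous, rounded to two decimals; above 1 means slower.
ratio(cur, prev) = round(cur / prev; digits = 2)

# Hypothetical threshold for flagging a possible slowdown.
const THRESHOLD = 1.2

flagged = [name for (name, cur, prev) in rows if ratio(cur, prev) > THRESHOLD]
```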

@luraess
Member

luraess commented May 29, 2026

Thanks for the update. Some tests are failing now; once those are fixed, we can merge. Could you have a look?

Comment on lines +11 to +29
for (n, bd, p) in [(100, 5, 0.02), (5, 1, 0.8), (4, 2, 0.5)]
    @testset "conversions between ROCSparseMatrices (n, bd, p) = ($n, $bd, $p)" begin
        _A = sprand(n, n, p)
        A = SparseMatrixCSR(_A)
        blockdim = bd
        for ROCSparseMatrixType1 in (ROCSparseMatrixCSC, ROCSparseMatrixCSR, ROCSparseMatrixCOO, ROCSparseMatrixBSR)
            dA1 = ROCSparseMatrixType1 == ROCSparseMatrixBSR ? ROCSparseMatrixType1(A, blockdim) : ROCSparseMatrixType1(A)
            @testset "conversion $ROCSparseMatrixType1 --> SparseMatrixCSR" begin
                @test SparseMatrixCSR(dA1) ≈ A
            end
            for ROCSparseMatrixType2 in (ROCSparseMatrixCSC, ROCSparseMatrixCSR, ROCSparseMatrixCOO, ROCSparseMatrixBSR)
                ROCSparseMatrixType1 == ROCSparseMatrixType2 && continue
                dA2 = ROCSparseMatrixType2 == ROCSparseMatrixBSR ? ROCSparseMatrixType2(dA1, blockdim) : ROCSparseMatrixType2(dA1)
                @testset "conversion $ROCSparseMatrixType1 --> $ROCSparseMatrixType2" begin
                    @test collect(dA1) ≈ collect(dA2)
                end
            end
        end
    end
Contributor Author


Error During Test at /var/lib/buildkite-agent/builds/amdgpu1-luraess-com/julialang/amdgpu-dot-jl/test/hip_rocsparse/sparse_matrices_csr.jl:12
  Got exception outside of a @test
  MethodError: no method matching (AMDGPU.rocSPARSE.ROCSparseMatrixBSR{Float64})(::AMDGPU.rocSPARSE.ROCSparseMatrixCSC{Float64, Int32}, ::Int64)

  Closest candidates are:
    (AMDGPU.rocSPARSE.ROCSparseMatrixBSR{Float64})(::AMDGPU.rocSPARSE.ROCSparseMatrixCSR{Float64}, ::Integer; dir, inda, indc)
     @ AMDGPU /var/lib/buildkite-agent/builds/amdgpu1-luraess-com/julialang/amdgpu-dot-jl/src/sparse/conversions.jl:224
    (AMDGPU.rocSPARSE.ROCSparseMatrixBSR{T})(::SparseArrays.SparseMatrixCSC, ::Any) where T
     @ AMDGPU /var/lib/buildkite-agent/builds/amdgpu1-luraess-com/julialang/amdgpu-dot-jl/src/sparse/array.jl:410
    (AMDGPU.rocSPARSE.ROCSparseMatrixBSR{T})(::SparseMatricesCSR.SparseMatrixCSR{1}, ::Any) where T
     @ AMDGPUSparseMatricesCSRExt /var/lib/buildkite-agent/builds/amdgpu1-luraess-com/julialang/amdgpu-dot-jl/ext/AMDGPUSparseMatricesCSRExt.jl:26
    ...

  Stacktrace:
    [1] AMDGPU.rocSPARSE.ROCSparseMatrixBSR(x::AMDGPU.rocSPARSE.ROCSparseMatrixCSC{Float64, Int32}, blockdim::Int64)
      @ AMDGPU.rocSPARSE /var/lib/buildkite-agent/builds/amdgpu1-luraess-com/julialang/amdgpu-dot-jl/src/sparse/array.jl:417

This is probably a side tangent to this PR, but it happens to be the main cause of the CI failure: the nested loop in this test converts every sparse matrix type to every other one, and there is no ROCSparseMatrixBSR constructor that takes a ROCSparseMatrixCSC. That said, @luraess, is there a reason why there is no conversion between CSC and BSR, like this one in CUDA.jl?
https://github.com/JuliaGPU/CUDA.jl/blob/54c758682d909f32d656fcbe8c43c20062850927/lib/cusparse/src/conversions.jl#L710-L711

If they are not needed, I can just update the test; otherwise, I can add those conversions, which will hopefully make CI pass.
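The "Closest candidates" list in the error shows that a ROCSparseMatrixBSR{T} constructor accepting a ROCSparseMatrixCSR already exists, and the loop presumably got past the CSC→CSR and CSC→COO cases before failing. One possible way to add the missing conversion is therefore to route a CSC input through CSR first, roughly forwarding ROCSparseMatrixBSR(x::ROCSparseMatrixCSC, blockdim) to ROCSparseMatrixBSR(ROCSparseMatrixCSR(x), blockdim). This is an assumption about the fix, not necessarily what the linked CUDA.jl lines do. The GPU-free sketch below illustrates only the dispatch pattern, using hypothetical toy types (ToyCSC, ToyCSR, ToyBSR, each wrapping a dense matrix) and a hypothetical all_pairs_convert function that mirrors the nested all-pairs loop of the test.

```julia
# Hypothetical toy formats standing in for ROCSparseMatrixCSC/CSR/BSR.
# Each one just wraps a dense matrix, so this runs without a GPU.
struct ToyCSC
    data::Matrix{Float64}
end

struct ToyCSR
    data::Matrix{Float64}
end

struct ToyBSR
    data::Matrix{Float64}
    blockdim::Int
end

# Conversions assumed to exist already: everything goes to and from CSR.
ToyCSR(x::ToyCSC) = ToyCSR(x.data)
ToyCSC(x::ToyCSR) = ToyCSC(x.data)
ToyBSR(x::ToyCSR, blockdim::Integer) = ToyBSR(x.data, Int(blockdim))
ToyCSR(x::ToyBSR) = ToyCSR(x.data)

# The missing CSC <-> BSR pair, added by composing through CSR.
# Without these two methods, ToyBSR(::ToyCSC, ::Int) would throw a MethodError.
ToyBSR(x::ToyCSC, blockdim::Integer) = ToyBSR(ToyCSR(x), blockdim)
ToyCSC(x::ToyBSR) = ToyCSC(ToyCSR(x))

# Mirror of the nested test loop: build each format from a dense matrix,
# convert it to every other format, and check that the data survives.
function all_pairs_convert(A::Matrix{Float64}, blockdim::Int)
    make(T, x) = T === ToyBSR ? T(x, blockdim) : T(x)
    formats = (ToyCSC, ToyCSR, ToyBSR)
    for T1 in formats, T2 in formats
        T1 === T2 && continue
        x1 = make(T1, A)
        x2 = make(T2, x1)
        x2.data == A || return false
    end
    return true
end
```

Composing through an existing intermediate format keeps the number of hand-written kernels small, at the cost of an extra temporary copy for the CSC↔BSR pair.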

Member


otherwise I can add those conversions

That would be the best way if you're willing to do so. Thanks!

@luraess luraess merged commit 25220ef into JuliaGPU:main May 30, 2026
4 checks passed
@luraess
Member

luraess commented May 30, 2026

Thanks!

2 participants