The goal of this project is to examine whether there is a matrix reordering scheme that optimizes the Sparse Matrix-Vector Multiplication (SpMV) kernel. If such a scheme exists, we aim to understand how and why this optimization occurs.
- Does matrix reordering improve the performance of SpMV?
- Does matrix reordering affect the load balancing of SpMV?
- Which matrix reordering schemes provide the greatest performance benefits?
- Why do certain reordering schemes improve SpMV performance?
- How does the effectiveness of reordering depend on matrix characteristics?
- What is the trade-off between reordering cost and performance gain?
- Does running the benchmark on a machine with a different architecture (ARM) affect the results?
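
The trade-off question can be made concrete with a break-even count. This is only an illustrative formulation, and the symbols are assumptions introduced here: $T_{\text{reorder}}$ is the one-time cost of computing and applying the permutation, $t_{\text{orig}}$ is the time of one SpMV on the original matrix, and $t_{\text{reord}}$ is the time of one SpMV on the reordered matrix. Reordering pays off once the number of multiplications $N$ satisfies

$$
N \cdot t_{\text{orig}} > T_{\text{reorder}} + N \cdot t_{\text{reord}}
\quad\Longleftrightarrow\quad
N > \frac{T_{\text{reorder}}}{t_{\text{orig}} - t_{\text{reord}}},
\qquad t_{\text{reord}} < t_{\text{orig}}.
$$

If $t_{\text{reord}} \ge t_{\text{orig}}$, the reordering never amortizes, regardless of $N$.
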
The benchmark is written purely in C; Python is used for plotting and generating tables. A Makefile is provided to build and run the benchmark and the plotting scripts.
The input matrices come from the SuiteSparse Matrix Collection in Matrix Market (*.mtx) format. For SpMV, the matrix is stored in the Compressed Sparse Row (CSR) format.
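
For readers unfamiliar with CSR, a minimal sketch of the SpMV kernel on a CSR matrix may help. The struct `csr_matrix`, its field names, and the function `spmv_csr` are illustrative assumptions and are not taken from the benchmark's source code.

```c
#include <assert.h>

/* Hypothetical CSR container (names are assumptions, not the project's API). */
typedef struct {
    int n_rows;            /* number of rows */
    const int *row_ptr;    /* length n_rows + 1; row i spans [row_ptr[i], row_ptr[i+1]) */
    const int *col_idx;    /* length nnz; column index of each stored value */
    const double *val;     /* length nnz; the stored nonzero values */
} csr_matrix;

/* Computes y = A * x for a CSR matrix A. */
static void spmv_csr(const csr_matrix *A, const double *x, double *y)
{
    assert(A != 0 && x != 0 && y != 0);
    for (int i = 0; i < A->n_rows; ++i) {
        double sum = 0.0;
        /* Accesses to val and col_idx are sequential; accesses to x are
           indirect and depend on the sparsity pattern of the row. */
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}
```

The indirect reads of `x` are the accesses whose locality a reordering can change, since the permutation determines which column indices appear close together in each row.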
The benchmark evaluates the following matrix reordering techniques:
- Reverse Cuthill–McKee (RCM)
- Approximate Minimum Degree (AMD)
- Nested Dissection (ND)
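
RCM is a bandwidth-reducing ordering, whereas AMD and ND are fill-reducing orderings originally designed for sparse factorization. One simple way to see what a permutation does to the sparsity pattern is to measure the matrix bandwidth before and after applying it. The sketch below does this for a symmetric permutation; the function name `csr_bandwidth` and the `new_of` array are hypothetical and not part of the benchmark. Bandwidth is only one indicator of locality and is not claimed to be the metric the benchmark reports.

```c
#include <stdlib.h>

/* Bandwidth of the symmetrically permuted matrix P A P^T, i.e. the maximum
   |new_of[i] - new_of[j]| over all stored entries (i, j).
   new_of[i] is the new index of original row/column i (assumed layout).
   Pass new_of == NULL to measure the original ordering. */
static int csr_bandwidth(int n, const int *row_ptr, const int *col_idx,
                         const int *new_of)
{
    int bw = 0;
    for (int i = 0; i < n; ++i) {
        int pi = new_of ? new_of[i] : i;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_idx[k];
            int pj = new_of ? new_of[j] : j;
            int d = abs(pi - pj);   /* distance of this entry from the diagonal */
            if (d > bw)
                bw = d;
        }
    }
    return bw;
}
```

A smaller bandwidth means the column indices in each row stay closer to the row index, so consecutive rows tend to touch nearby entries of the input vector.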
To simulate different cache behaviors, the benchmark includes three measurement strategies:
- Repeated A * x (RAx)
- Input-Output Swap (IOs)
- Cold Measurement
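
The three strategies are only named here, so the sketch below shows one plausible reading of each and should not be taken as the benchmark's actual implementation. Assumptions: RAx multiplies with the same input and output vectors every repetition, IOs feeds the output of one multiplication in as the input of the next (which requires a square matrix), and the cold measurement evicts cached data before each multiplication by streaming through a scratch buffer larger than the last-level cache. All function names are hypothetical, and timing calls are omitted; in a real run only the `csr_spmv` call would be timed.

```c
#include <stddef.h>

/* Plain CSR kernel: y = A * x. */
static void csr_spmv(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* RAx (assumed meaning): reuse the same x and y, so both stay cache-warm. */
static void run_rax(int n, const int *row_ptr, const int *col_idx,
                    const double *val, const double *x, double *y, int reps)
{
    for (int r = 0; r < reps; ++r)
        csr_spmv(n, row_ptr, col_idx, val, x, y);
}

/* IOs (assumed meaning): swap input and output after every multiplication.
   Returns the buffer that holds the most recent result. */
static double *run_ios(int n, const int *row_ptr, const int *col_idx,
                       const double *val, double *x, double *y, int reps)
{
    for (int r = 0; r < reps; ++r) {
        csr_spmv(n, row_ptr, col_idx, val, x, y);
        double *tmp = x;   /* the output becomes the next input */
        x = y;
        y = tmp;
    }
    return x;   /* after the swap, x points at the latest output */
}

/* Touch one byte per 64-byte stride of a scratch buffer to push other data
   out of the caches. The 64-byte stride is an assumed cache-line size. */
static void evict_cache(volatile unsigned char *scratch, size_t bytes)
{
    for (size_t i = 0; i < bytes; i += 64)
        scratch[i]++;
}

/* Cold (assumed meaning): evict before every multiplication. */
static void run_cold(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y, int reps,
                     volatile unsigned char *scratch, size_t scratch_bytes)
{
    for (int r = 0; r < reps; ++r) {
        evict_cache(scratch, scratch_bytes);            /* not timed */
        csr_spmv(n, row_ptr, col_idx, val, x, y);       /* timed region */
    }
}
```

Under this reading, RAx gives a best-case figure with warm vectors, IOs changes which buffer is read and written on every repetition, and the cold variant approximates a first-touch multiplication.
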
- CPU: Intel Core i5-1035G1
- Microarchitecture: Ice Lake (10th Gen Intel)
- Cores / Threads: 4 cores / 8 threads
- Base / Boost Frequency: 1.00 GHz / 3.60 GHz
- Cache:
  - L1d: 48 KB per core (192 KB total)
  - L1i: 32 KB per core (128 KB total)
  - L2: 512 KB per core (2 MB total)
  - L3: 6 MB shared
- RAM: 20 GB
- Architecture: x86_64
- Compiler: GCC 15.2.1
- Compiler Flags: -O3 -march=native
- Operating System: Manjaro Linux (Kernel 6.12)
- CPU: ARM Neoverse-N1
- Microarchitecture: Neoverse N1
- Cores / Threads: 4 cores / 4 threads
- Cache:
  - L1d: 64 KB per core (256 KB total)
  - L1i: 64 KB per core (256 KB total)
  - L2: 1 MB per core (4 MB total)
  - L3: 32 MB shared
- RAM: 16 GB
- Architecture: ARM64 (AArch64)
- Compiler: GCC 12.2.0
- Compiler Flags: -O3 -mcpu=native
- Operating System: Debian Linux (Kernel 6.1, cloud ARM64)