# Accelerating GPU Inference of Large Language Models with Moderately Unstructured Sparse Weight Matrices
This library provides an efficient SpMM kernel for sparse LLM inference. It contains the source code for the paper *Accelerating GPU Inference of Large Language Models with Moderately Unstructured Sparse Weight Matrices*, submitted to DAC.
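For orientation, the operation being accelerated is sparse-matrix × dense-matrix multiplication (SpMM): C = A × B, where A is a pruned weight matrix and B holds the activations for a batch of tokens. The snippet below is a deliberately naive CSR-based CUDA sketch of that computation, written for illustration only; the kernel name and data layout are assumptions for this README and are not the library's actual API, which uses a much more heavily optimized scheme targeting sm_90a.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Naive CSR SpMM: C[M x N] = A[M x K] (sparse, CSR) * B[K x N] (dense, row-major).
// One thread computes one element of C. Illustration only.
__global__ void csr_spmm_naive(int M, int N,
                               const int* __restrict__ rowPtr,
                               const int* __restrict__ colIdx,
                               const float* __restrict__ vals,
                               const float* __restrict__ B,
                               float* __restrict__ C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    // Accumulate only over the nonzeros of row `row` of A.
    for (int i = rowPtr[row]; i < rowPtr[row + 1]; ++i)
        acc += vals[i] * B[colIdx[i] * N + col];
    C[row * N + col] = acc;
}

int main() {
    // A = [[1 0 2], [0 3 0]] (2 x 3, CSR); B is 3 x 2 dense, row-major.
    std::vector<int>   rowPtr = {0, 2, 3};
    std::vector<int>   colIdx = {0, 2, 1};
    std::vector<float> vals   = {1.f, 2.f, 3.f};
    std::vector<float> B      = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
    const int M = 2, N = 2;

    int *dRow, *dCol; float *dVal, *dB, *dC;
    cudaMalloc(&dRow, rowPtr.size() * sizeof(int));
    cudaMalloc(&dCol, colIdx.size() * sizeof(int));
    cudaMalloc(&dVal, vals.size() * sizeof(float));
    cudaMalloc(&dB,   B.size() * sizeof(float));
    cudaMalloc(&dC,   M * N * sizeof(float));
    cudaMemcpy(dRow, rowPtr.data(), rowPtr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, colIdx.data(), colIdx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, vals.data(), vals.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB,   B.data(),    B.size() * sizeof(float),    cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
    csr_spmm_naive<<<grid, block>>>(M, N, dRow, dCol, dVal, dB, dC);

    std::vector<float> C(M * N);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  // expect: 11 14 / 9 12
    cudaFree(dRow); cudaFree(dCol); cudaFree(dVal); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

This thread-per-element scheme only pins down the input/output semantics; the library replaces it with a tiled kernel designed for moderately unstructured sparsity.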
## Requirements

- NVIDIA sm_90a GPU (H100 or H800)
- CMake 3.20 or higher
- CUDA Toolkit 12.6
- cuSPARSELt 0.7.1 or higher (optional; used to compare performance with SparTA)
## Build

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
```

If you want to enable SparTA:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
    -DCUSPARSELT_LIBRARY=</path/to/cusparselt/library> \
    -DCUSPARSELT_HEADER=</path/to/cusparselt/header>
cmake --build build -j$(nproc)
```

## Run

Run the functional tests:
```bash
./build/test/spmm_test
```

Run the benchmark:

```bash
./build/benchmark/spmm_bench_random
```

## Benchmark Results

Below is a selection of benchmark results measured on NVIDIA H100 SXM5 GPUs. More results can be found in the paper.
We evaluate our SpMM kernel across a range of matrix shapes, primarily drawn from LLMs such as OPT-30B and OPT-66B, with batch sizes of 16 and 32, and measure kernel throughput at sparsity levels of 50%, 60%, and 70%. We compare against the following baselines: Sputnik, SparTA, FlashLLM, and SpInfer.
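As a point of reference, "moderately unstructured" sparsity here means individual weights are zeroed with no block or N:M pattern constraint. The host-side sketch below shows one way such a test matrix could be generated at a target sparsity level; the helper `make_sparse_weights` is hypothetical, and the shipped `spmm_bench_random` binary uses its own generator.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical helper (not part of this repo): fill an M x K row-major weight
// matrix in which a `sparsity` fraction of entries is zero, placed uniformly
// at random -- i.e., unstructured sparsity with no block or N:M constraint.
std::vector<float> make_sparse_weights(int M, int K, double sparsity,
                                       unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> dist(0.0f, 0.02f);
    std::vector<float> W(static_cast<size_t>(M) * K);
    for (auto& w : W) w = dist(rng);

    // Pick `sparsity * M * K` positions uniformly at random and zero them.
    std::vector<size_t> idx(W.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::shuffle(idx.begin(), idx.end(), rng);
    const size_t nZero = static_cast<size_t>(sparsity * W.size());
    for (size_t i = 0; i < nZero; ++i) W[idx[i]] = 0.0f;
    return W;
}

int main() {
    // E.g., a 7168 x 7168 OPT-style projection matrix at 70% sparsity.
    auto W = make_sparse_weights(7168, 7168, 0.70);
    size_t zeros = std::count(W.begin(), W.end(), 0.0f);
    printf("zeros: %zu of %zu\n", zeros, W.size());
    return 0;
}
```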

