# Accelerating GPU Inference of Large Language Models with Moderately Unstructured Sparse Weight Matrices
This library provides an efficient SpMM kernel for sparse LLM inference. It contains the source code for the paper *Accelerating GPU Inference of Large Language Models with Moderately Unstructured Sparse Weight Matrices*, submitted to DAC.
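For orientation, the operation being accelerated is sparse-matrix × dense-matrix multiplication (SpMM): C = A × B, where A is a pruned weight matrix and B holds the activations for a batch of tokens. The snippet below is a deliberately naive CSR-based CUDA sketch of that computation, written for illustration only; the kernel name and data layout are assumptions for this README and are not the library's actual API, which uses a much more heavily optimized scheme targeting sm_90a.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Naive CSR SpMM: C[M x N] = A[M x K] (sparse, CSR) * B[K x N] (dense, row-major).
// One thread computes one element of C. Illustration only.
__global__ void csr_spmm_naive(int M, int N,
                               const int* __restrict__ rowPtr,
                               const int* __restrict__ colIdx,
                               const float* __restrict__ vals,
                               const float* __restrict__ B,
                               float* __restrict__ C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    // Accumulate only over the nonzeros of row `row` of A.
    for (int i = rowPtr[row]; i < rowPtr[row + 1]; ++i)
        acc += vals[i] * B[colIdx[i] * N + col];
    C[row * N + col] = acc;
}

int main() {
    // A = [[1 0 2], [0 3 0]] (2 x 3, CSR); B is 3 x 2 dense, row-major.
    std::vector<int>   rowPtr = {0, 2, 3};
    std::vector<int>   colIdx = {0, 2, 1};
    std::vector<float> vals   = {1.f, 2.f, 3.f};
    std::vector<float> B      = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
    const int M = 2, N = 2;

    int *dRow, *dCol; float *dVal, *dB, *dC;
    cudaMalloc(&dRow, rowPtr.size() * sizeof(int));
    cudaMalloc(&dCol, colIdx.size() * sizeof(int));
    cudaMalloc(&dVal, vals.size() * sizeof(float));
    cudaMalloc(&dB,   B.size() * sizeof(float));
    cudaMalloc(&dC,   M * N * sizeof(float));
    cudaMemcpy(dRow, rowPtr.data(), rowPtr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, colIdx.data(), colIdx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, vals.data(), vals.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB,   B.data(),    B.size() * sizeof(float),    cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
    csr_spmm_naive<<<grid, block>>>(M, N, dRow, dCol, dVal, dB, dC);

    std::vector<float> C(M * N);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  // expect: 11 14 / 9 12
    cudaFree(dRow); cudaFree(dCol); cudaFree(dVal); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

This thread-per-element scheme only pins down the input/output semantics; the library replaces it with a tiled kernel designed for moderately unstructured sparsity.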
## Requirements

- NVIDIA sm_90a GPU (H100 or H800)
- CMake 3.20 or higher
- CUDA Toolkit 12.6
- cuSPARSELt 0.7.1 or higher (optional; used to compare performance with SparTA)
## Build

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
```

If you want to enable SparTA:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
    -DCUSPARSELT_LIBRARY=</path/to/cusparselt/library> \
    -DCUSPARSELT_HEADER=</path/to/cusparselt/header>
cmake --build build -j$(nproc)
```

## Run

Run the functional tests:
```bash
./build/test/spmm_test
```

Run the benchmark:

```bash
./build/benchmark/spmm_bench_random
```

## Benchmark Results

Below is a selection of benchmark results measured on NVIDIA H100 SXM5 GPUs. More results can be found in the paper.
We evaluate our SpMM kernel across a range of matrix shapes, primarily drawn from LLMs such as OPT-30B and OPT-66B, with batch sizes of 16 and 32, and measure kernel throughput at sparsity levels of 50%, 60%, and 70%. We compare against the following baselines: Sputnik, SparTA, FlashLLM, and SpInfer.
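As a point of reference, "moderately unstructured" sparsity here means individual weights are zeroed with no block or N:M pattern constraint. The host-side sketch below shows one way such a test matrix could be generated at a target sparsity level; the helper `make_sparse_weights` is hypothetical, and the shipped `spmm_bench_random` binary uses its own generator.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical helper (not part of this repo): fill an M x K row-major weight
// matrix in which a `sparsity` fraction of entries is zero, placed uniformly
// at random -- i.e., unstructured sparsity with no block or N:M constraint.
std::vector<float> make_sparse_weights(int M, int K, double sparsity,
                                       unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> dist(0.0f, 0.02f);
    std::vector<float> W(static_cast<size_t>(M) * K);
    for (auto& w : W) w = dist(rng);

    // Pick `sparsity * M * K` positions uniformly at random and zero them.
    std::vector<size_t> idx(W.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::shuffle(idx.begin(), idx.end(), rng);
    const size_t nZero = static_cast<size_t>(sparsity * W.size());
    for (size_t i = 0; i < nZero; ++i) W[idx[i]] = 0.0f;
    return W;
}

int main() {
    // E.g., a 7168 x 7168 OPT-style projection matrix at 70% sparsity.
    auto W = make_sparse_weights(7168, 7168, 0.70);
    size_t zeros = std::count(W.begin(), W.end(), 0.0f);
    printf("zeros: %zu of %zu\n", zeros, W.size());
    return 0;
}
```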

