Accelerating GPU Inference of Large Language Models with Moderately Unstructured Sparse Weight Matrices

This library provides an efficient SpMM (sparse matrix-matrix multiplication) kernel for sparse LLM inference. It contains the source code of the paper "Accelerating GPU Inference of Large Language Models with Moderately Unstructured Sparse Weight Matrices", submitted to DAC.

Prerequisites

  • NVIDIA GPU with sm_90a architecture (H100 or H800); see the verification sketch after this list
  • CMake 3.20 or higher
  • CUDA Toolkit 12.6
  • cuSPARSELt 0.7.1 or higher (optional; used to compare performance against SparTA)
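
The kernels are built for the Hopper architecture, so the resulting binaries only run on sm_90a-class GPUs. The following is a minimal, hypothetical sketch (not part of this repository) that queries the compute capability of device 0 through the CUDA runtime API and reports whether it matches sm_90:

// check_sm90.cu -- illustration only, not part of this repository.
// Build with: nvcc check_sm90.cu -o check_sm90
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device found.\n");
        return 1;
    }
    std::printf("GPU: %s, compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    // The kernels target sm_90a (Hopper), i.e. compute capability 9.0.
    return (prop.major == 9 && prop.minor == 0) ? 0 : 1;
}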

Build

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

To enable the SparTA baseline (requires cuSPARSELt):

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release          \
    -DCUSPARSELT_LIBRARY=</path/to/cusparselt/library>  \
    -DCUSPARSELT_HEADER=</path/to/cusparselt/header>
cmake --build build -j$(nproc)

Run

Run functional tests:

./build/test/spmm_test

Run benchmark:

./build/benchmark/spmm_bench_random

Results

Here is a selection of benchmark results measured on NVIDIA H100 SXM5 GPUs. More results can be found in the paper.

We evaluate our SpMM kernel across a range of matrix shapes, drawn primarily from LLMs such as OPT-30B and OPT-66B, using batch sizes of 16 and 32. We measure kernel throughput at sparsity levels of 50%, 60%, and 70%. The baselines we compare against are Sputnik, SparTA, FlashLLM, and SpInfer.
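
As a point of reference for interpreting these numbers, one common way to report SpMM throughput is dense-equivalent TFLOP/s: the 2·M·K·N floating-point operations of the corresponding dense GEMM divided by the measured kernel time, independent of sparsity. The short sketch below illustrates only that bookkeeping; the matrix shape, batch size, and timing value are hypothetical placeholders, not results from the paper:

// throughput_sketch.cpp -- illustration of the dense-equivalent metric only;
// not the repository's benchmark code. All numbers below are placeholders.
#include <cstdio>

int main() {
    const double M = 28672, K = 7168;   // hypothetical weight-matrix shape
    const double N = 16;                // batch size (columns of the activation)
    const double sparsity = 0.5;        // fraction of zero weights
    const double elapsed_ms = 0.05;     // measured kernel time (placeholder)

    // Dense-equivalent FLOP count: each output element needs K multiply-adds.
    const double flops = 2.0 * M * K * N;
    const double tflops = flops / (elapsed_ms * 1e-3) / 1e12;

    std::printf("dense-equivalent throughput: %.2f TFLOP/s\n", tflops);
    // Only (1 - sparsity) of the weights carry useful work, which is what a
    // sparse kernel exploits to beat the dense GEMM at the same shape.
    std::printf("nonzero-weight fraction:     %.0f%%\n", (1.0 - sparsity) * 100.0);
    return 0;
}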

Figure: SpMM kernel performance comparison across different matrix shapes at the 50% sparsity level

Figure: SpMM kernel performance comparison across different sparsity levels
