SGEMM Optimization

Progressive CUDA SGEMM tutorial and reference implementation. The repository contains five hand-written kernel variants, cuBLAS-backed verification, a benchmark harness, and OpenSpec-governed repository rules for keeping the project compact and trustworthy.

Why this repository exists

Show the optimization ladder clearly: naive -> tiled -> bank-conflict-free -> double-buffered -> Tensor Core WMMA
Stay readable: each optimization lives in its own kernel file and keeps a consistent launch interface
Stay verifiable: kernels are checked against cuBLAS, with separate tolerances for FP32 and Tensor Core paths
Stay maintainable: the repository uses OpenSpec to keep docs, workflow, and validation rules aligned

Kernel progression

Stage	File	Main idea
Naive	`src/kernels/naive_sgemm.cuh`	Baseline triple-loop mapping
Tiled	`src/kernels/tiled_sgemm.cuh`	Shared-memory blocking
Bank-Free	`src/kernels/bank_conflict_free_sgemm.cuh`	`[TILE_SIZE][TILE_SIZE+1]` padding
Double Buffer	`src/kernels/double_buffer_sgemm.cuh`	Tile staging overlap and latency hiding
Tensor Core	`src/kernels/tensor_core_sgemm.cuh`	WMMA path with safe FP32 fallback
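
The `[TILE_SIZE][TILE_SIZE+1]` padding in the Bank-Free stage can be illustrated on the host with a small bank-index model. This is a minimal sketch, not code from the repository: it assumes 32 shared-memory banks that are each one 4-byte word wide, a tile size of 32, and a warp whose thread `t` reads `tile[t][col]`. The function name and constants are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumption: 32 shared-memory banks, each one float (4 bytes) wide.
constexpr std::size_t kNumBanks = 32;
constexpr std::size_t kWarpSize = 32;

// Counts the distinct banks a warp touches when thread t reads tile[t][col]
// from a row-major float tile that has `row_stride` floats per row.
// A result of 1 means all threads hit the same bank (fully serialized);
// a result of 32 means the access is conflict-free.
std::size_t distinct_banks_for_column_read(std::size_t row_stride, std::size_t col) {
    assert(row_stride > 0);
    std::set<std::size_t> banks;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t word_index = t * row_stride + col;  // offset in floats
        banks.insert(word_index % kNumBanks);
    }
    return banks.size();
}
```

Under these assumptions, a row stride of 32 maps every thread in the warp to the same bank, while a row stride of 33 shifts each row by one bank so the 32 reads land in 32 different banks. The extra column is never read; it only changes the stride.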

Quick start

git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization

# Recommended: CMake
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build

# Quick local alternative
make GPU_ARCH=sm_86
make benchmark
make test

Validation model

Local GPU machine: runtime tests, correctness checks, and benchmarking
GitHub Actions: format/style, CUDA compile validation, OpenSpec/repository checks, and Pages deployment

Standard FP32 kernels use rtol=1e-3, atol=1e-4. The Tensor Core path uses rtol=5e-2, atol=1e-2.
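
How the two tolerance pairs behave can be sketched with a host-side comparison. The README does not state how `rtol` and `atol` are combined, so the sketch below assumes the common form `|actual - expected| <= atol + rtol * |expected|`, with the cuBLAS result as `expected`. The `Tolerance` struct, the constant names, and `all_close` are hypothetical and may differ from the repository's verification helpers.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the tolerance pairs quoted above.
struct Tolerance {
    float rtol;
    float atol;
};

constexpr Tolerance kFp32Tolerance{1e-3f, 1e-4f};        // standard FP32 kernels
constexpr Tolerance kTensorCoreTolerance{5e-2f, 1e-2f};  // Tensor Core path

// Returns true when every element satisfies
//   |actual - expected| <= atol + rtol * |expected|.
// Assumption: this combined form is what the checks use.
bool all_close(const std::vector<float>& actual,
               const std::vector<float>& expected,
               Tolerance tol) {
    if (actual.size() != expected.size()) {
        return false;
    }
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        const float bound = tol.atol + tol.rtol * std::fabs(expected[i]);
        if (!(diff <= bound)) {  // written this way so NaN also fails
            return false;
        }
    }
    return true;
}
```

With a reference value of 100.0, an output of 100.5 fails the FP32 tolerance (the bound is about 0.1) but passes the Tensor Core tolerance (the bound is about 5.0). The looser bound reflects that the WMMA path typically takes half-precision inputs.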

Repository layout

src/
├── kernels/        # Five SGEMM kernel variants
├── utils/          # CUDA RAII, verification, benchmark helpers
└── main.cu         # Benchmark entry point
tests/
└── test_sgemm.cu   # Google Test suite
docs/               # Public learning-oriented documentation
openspec/           # Stable specs, changes, and workflow guidance

Development workflow

Non-trivial repository changes are expected to follow these steps in order:

/opsx:explore
/opsx:propose "description"
/opsx:apply
/review
/opsx:archive

The stable authoritative specs live under openspec/specs/. Active implementation plans live under openspec/changes/<change>/.

License

MIT. See LICENSE.
