CUDA GEMM optimization tutorial and mini inference runtime
Compact C++17/CUDA codebase with a 7-stage kernel path that reaches a conservative ~85% of the cuBLAS reference result on the RTX 3080 1024×1024 benchmark
Mini-Inference Engine keeps the scope narrow:
- Progressive GEMM kernels: naive, tiled, coalesced, double-buffered, register-blocked, fused, and vectorized CUDA implementations.
- Minimal runtime pieces: `Tensor`, `InferenceEngine`, `MemoryPool`, `StreamManager`, `AutoTuner`, and `Profiler`.
- Benchmarks and tests: buildable examples plus a host/GPU test split.
- Bilingual documentation: practical guides for building, profiling, and understanding the code.
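
To make the first two stages of the kernel progression concrete, here is a minimal host-only C++ sketch of the idea behind moving from a naive to a tiled GEMM. It is not the project's CUDA code: the names `Matrix`, `gemm_naive`, and `gemm_tiled` are hypothetical, the matrices are assumed square and row-major, and the loop blocking only mirrors on the CPU what a tiled kernel does with shared memory on the GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major square matrix stored flat: element (r, c) lives at m[r * n + c].
// Hypothetical alias for illustration; not a type from the project.
using Matrix = std::vector<float>;

// Naive GEMM: every output element walks a full row of A and a full column of B.
Matrix gemm_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}

// Tiled GEMM: the same arithmetic, but iterated block by block so that a small
// tile of A and B is reused many times before moving on.
Matrix gemm_tiled(const Matrix& a, const Matrix& b, std::size_t n,
                  std::size_t tile = 16) {
    if (tile == 0) tile = 1;  // guard against a zero step in the loops below
    Matrix c(n * n, 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            for (std::size_t k0 = 0; k0 < n; k0 += tile) {
                // Clamp tile bounds so sizes that are not multiples of `tile` work.
                const std::size_t i1 = std::min(i0 + tile, n);
                const std::size_t j1 = std::min(j0 + tile, n);
                const std::size_t k1 = std::min(k0 + tile, n);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a_ik = a[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Both functions should produce the same result for the same inputs; the tiled version only changes the order in which memory is visited. A naive reference like this is also a common way to validate optimized kernels in tests.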
The stable local CUDA path uses the system GCC 12 / G++ 12 toolchain:

    cmake --preset gcc-cuda
    cmake --build --preset gcc-cuda
    ctest --preset gcc-cuda

    cmake --preset release-gcc-cuda
    cmake --build --preset release-gcc-cuda
    ./build-release-gcc-cuda/benchmark

If your shell is already using a clean system compiler, the `default` and `release` presets remain available. `tests_host` covers utilities that do not need a GPU device. `tests_gpu` covers CUDA runtime and kernel behavior; these tests may skip without an available NVIDIA GPU, but configuration and compilation still require the CUDA Toolkit.

| Area | Purpose |
|---|---|
| `src/` | CUDA kernels and runtime implementation |
| `include/` | Public headers for kernels, runtime, and utilities |
| `benchmarks/` | Benchmark and demo entry points |
| `tests/` | Host tests and GPU-backed behavior tests |
| `docs/` | GitHub Pages source and long-form documentation |
| `CHANGELOG.md` | Single change log for the whole project |

| Topic | English | 中文 |
|---|---|---|
| Getting started | docs/en/guides/getting-started.md | docs/zh/guides/getting-started.md |
| Architecture | docs/en/architecture.md | docs/zh/architecture.md |
| GEMM deep dive | docs/en/deep-dive/gemm-optimization.md | docs/zh/deep-dive/gemm-optimization.md |
| Performance tuning | docs/en/performance-tuning.md | docs/zh/performance-tuning.md |
| API reference | docs/en/api-reference.md | docs/zh/api-reference.md |
| Contributing | docs/en/contributing.md | docs/zh/contributing.md |

- Use `.clang-format`; functions and variables use `snake_case`, classes use `PascalCase`, and constants/template parameters use `UPPER_SNAKE_CASE`.
- Wrap CUDA API calls with `CUDA_CHECK()` and cuBLAS calls with `CUBLAS_CHECK()`.
- Prefer `DeviceMemory` or `PooledMemory` over raw GPU allocation lifetimes.
- Add new source files explicitly to `CMakeLists.txt`; do not rely on recursive globbing.
- Keep GitHub Pages focused on documentation and keep all release history in the root `CHANGELOG.md`.
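
The definition of `CUDA_CHECK()` is not shown in this README, so the following is only a host-only sketch of the general pattern such checking macros tend to follow: evaluate the call, and on a non-success status report the failing expression with its file and line. Everything here is an assumption made for illustration: `Status`, `status_name`, `check_status`, `CHECK_STATUS`, and `fake_alloc` are hypothetical stand-ins, not project or CUDA APIs, and the real macro may abort or log instead of throwing.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a status type returned by an API call.
enum class Status { ok, out_of_memory };

inline const char* status_name(Status s) {
    return s == Status::ok ? "ok" : "out_of_memory";
}

// Throws with the failing expression and source location when the call failed.
inline void check_status(Status s, const char* expr, const char* file, int line) {
    if (s == Status::ok) {
        return;
    }
    std::ostringstream msg;
    msg << file << ':' << line << ": " << expr << " failed with "
        << status_name(s);
    throw std::runtime_error(msg.str());
}

// The macro captures the call text (#call) and the call site automatically.
#define CHECK_STATUS(call) check_status((call), #call, __FILE__, __LINE__)

// Hypothetical API call that fails for large requests, used only to
// exercise the macro without needing a GPU.
inline Status fake_alloc(std::size_t bytes) {
    return bytes > 1024 ? Status::out_of_memory : Status::ok;
}
```

With this sketch, `CHECK_STATUS(fake_alloc(16));` returns normally, while `CHECK_STATUS(fake_alloc(4096));` throws a `std::runtime_error` whose message names the failing call and where it was made. Wrapping every call this way keeps failures close to their source instead of surfacing later as unrelated errors.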