AICL-Lab/mini-inference-engine

Mini-Inference Engine

CUDA GEMM optimization tutorial and mini inference runtime
Compact C++17/CUDA codebase with a 7-stage kernel path and a conservative reference result of ~85% of cuBLAS performance on the RTX 3080 1024×1024 benchmark

Simplified Chinese (简体中文) · Docs · Getting Started

What this repository contains

Mini-Inference Engine keeps the scope narrow:

  • Progressive GEMM kernels: naive, tiled, coalesced, double-buffered, register-blocked, fused, and vectorized CUDA implementations.
  • Minimal runtime pieces: Tensor, InferenceEngine, MemoryPool, StreamManager, AutoTuner, and Profiler.
  • Benchmarks and tests: buildable examples, plus tests split into host and GPU suites.
  • Bilingual documentation: practical guides for building, profiling, and understanding the code.
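The step from the naive kernel to the tiled one in the list above can be illustrated on the host. The following is a minimal C++ sketch of the tiling idea only, not the repository's CUDA code: the function names gemm_naive and gemm_tiled, the TILE value, and the row-major layout are all assumptions made for illustration. On a GPU the tile would typically live in shared memory; here the same blocking simply improves cache reuse.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; the repository's kernels may use different sizes.
constexpr std::size_t TILE = 16;

// Reference GEMM: C = A * B with row-major A (m x k), B (k x n), C (m x n).
void gemm_naive(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t m, std::size_t n,
                std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

// Tiled GEMM: walks the output in TILE x TILE blocks and accumulates one
// TILE-deep slice of the k dimension at a time, so each block of A and B
// is reused while it is still hot. Edge tiles are clamped with std::min.
void gemm_tiled(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t m, std::size_t n,
                std::size_t k) {
    std::fill(c.begin(), c.end(), 0.0f);
    for (std::size_t i0 = 0; i0 < m; i0 += TILE) {
        const std::size_t i1 = std::min(i0 + TILE, m);
        for (std::size_t j0 = 0; j0 < n; j0 += TILE) {
            const std::size_t j1 = std::min(j0 + TILE, n);
            for (std::size_t p0 = 0; p0 < k; p0 += TILE) {
                const std::size_t p1 = std::min(p0 + TILE, k);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t p = p0; p < p1; ++p) {
                        const float a_ip = a[i * k + p];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Both functions expect c to be pre-sized to m * n elements. Comparing gemm_tiled against gemm_naive on shapes that are not multiples of TILE is a quick way to check the edge-tile handling.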

Build and test

The stable local CUDA path uses the system GCC 12 / G++ 12 toolchain:

cmake --preset gcc-cuda
cmake --build --preset gcc-cuda
ctest --preset gcc-cuda

cmake --preset release-gcc-cuda
cmake --build --preset release-gcc-cuda
./build-release-gcc-cuda/benchmark

If your shell already uses a clean system compiler, the default and release presets remain available. tests_host covers utilities that do not need a GPU device. tests_gpu covers CUDA runtime and kernel behavior; its tests may be skipped when no NVIDIA GPU is available, but configuration and compilation still require the CUDA Toolkit.

Repository layout

Area          Purpose
src/          CUDA kernels and runtime implementation
include/      Public headers for kernels, runtime, and utilities
benchmarks/   Benchmark and demo entry points
tests/        Host tests and GPU-backed behavior tests
docs/         GitHub Pages source and long-form documentation
CHANGELOG.md  Single change log for the whole project

Documentation

Topic               English                                 Chinese
Getting started     docs/en/guides/getting-started.md       docs/zh/guides/getting-started.md
Architecture        docs/en/architecture.md                 docs/zh/architecture.md
GEMM deep dive      docs/en/deep-dive/gemm-optimization.md  docs/zh/deep-dive/gemm-optimization.md
Performance tuning  docs/en/performance-tuning.md           docs/zh/performance-tuning.md
API reference       docs/en/api-reference.md                docs/zh/api-reference.md
Contributing        docs/en/contributing.md                 docs/zh/contributing.md

Project rules

  • Use .clang-format; functions and variables use snake_case, classes use PascalCase, and constants/template parameters use UPPER_SNAKE_CASE.
  • Wrap CUDA API calls with CUDA_CHECK() and cuBLAS calls with CUBLAS_CHECK().
  • Prefer DeviceMemory or PooledMemory over raw GPU allocation lifetimes.
  • Add new source files explicitly to CMakeLists.txt; do not rely on recursive globbing.
  • Keep GitHub Pages focused on documentation and keep all release history in the root CHANGELOG.md.
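
The two rules about checked calls and owned allocations follow a common C++ pattern, which can be sketched as below. This is not the repository's CUDA_CHECK() or DeviceMemory: the names SKETCH_CHECK, ScopedBuffer, fake_alloc, and fake_free are hypothetical, and the allocator stand-ins use std::malloc/std::free in place of cudaMalloc/cudaFree so that the example compiles without the CUDA Toolkit. The real macros and classes may differ in error type, message format, and interface.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for cudaMalloc/cudaFree. Like the CUDA runtime API,
// they report success or failure through a status code (0 means success).
inline int fake_alloc(void** ptr, std::size_t bytes) {
    *ptr = std::malloc(bytes);
    return (*ptr != nullptr || bytes == 0) ? 0 : 1;
}
inline int fake_free(void* ptr) {
    std::free(ptr);
    return 0;
}

// Check-macro pattern: evaluate the call exactly once and, on a non-zero
// status, throw with the failing expression and its call site.
#define SKETCH_CHECK(call)                                                \
    do {                                                                  \
        const int status_ = (call);                                       \
        if (status_ != 0) {                                               \
            throw std::runtime_error(std::string(#call) + " failed at " + \
                                     __FILE__ + ":" +                     \
                                     std::to_string(__LINE__));           \
        }                                                                 \
    } while (0)

// RAII owner: the allocation is released when the object goes out of scope,
// so no caller has to pair an allocate call with a free call by hand.
class ScopedBuffer {
public:
    explicit ScopedBuffer(std::size_t bytes) : bytes_(bytes) {
        SKETCH_CHECK(fake_alloc(&ptr_, bytes_));
    }
    ~ScopedBuffer() {
        // Destructors should not throw, so the status is ignored here.
        if (ptr_ != nullptr) fake_free(ptr_);
    }

    // Single ownership: copying is disabled, moving transfers the pointer.
    ScopedBuffer(const ScopedBuffer&) = delete;
    ScopedBuffer& operator=(const ScopedBuffer&) = delete;
    ScopedBuffer(ScopedBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          bytes_(std::exchange(other.bytes_, std::size_t{0})) {}
    ScopedBuffer& operator=(ScopedBuffer&& other) noexcept {
        if (this != &other) {
            if (ptr_ != nullptr) fake_free(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            bytes_ = std::exchange(other.bytes_, std::size_t{0});
        }
        return *this;
    }

    void* get() const noexcept { return ptr_; }
    std::size_t size() const noexcept { return bytes_; }

private:
    void* ptr_ = nullptr;
    std::size_t bytes_ = 0;
};
```

With this shape, a function that throws partway through still releases every buffer it created, which is the main reason to prefer an owning wrapper over raw allocation lifetimes.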