LLM Inference Engine

C++/HIP inference engine for Qwen-family LLMs. Custom GPU kernels, no PyTorch dependency in the production path.

Validated on AMD Radeon RX 6750 XT (RDNA2, ROCm 6.3).

Performance

Qwen3-8B GPTQ-INT4 on RX 6750 XT (12 GB VRAM):

Metric	Value
Throughput	3.1 tok/s
VRAM	5.79 GB
TTFT	~280 ms
TPOT	~320 ms
Weight format	GPTQ INT4 (group_size=128)

The C++ engine eliminates Python-side kernel dispatch overhead — a dequant microbenchmark shows 94% of the PyTorch baseline's per-token latency comes from dequantization kernel launches that the native engine avoids.

Architecture

┌──────────────────────────────────────────────────────────────┐
│                        main.cpp                              │
│  CLI → Config → Weights → Tokenizer → Model → Generate      │
├──────────────────────────────────────────────────────────────┤
│                    engine/src/model.hip                       │
│  Forward pass:                                               │
│    Embed → N × (RMSNorm → QKV → QK-Norm → RoPE → KVCache   │
│              → FlashDecode/Attn → MLP) → Final Norm → LM Head│
├──────────────────────────────────────────────────────────────┤
│  Custom HIP Kernels                                          │
│  rmsnorm  rope  swiglu  flash_decode  dequant_int4           │
├──────────────────────────────────────────────────────────────┤
│  hipBLAS GEMM — FP16 in, FP32 accumulate                    │
├──────────────────────────────────────────────────────────────┤
│  MemoryManager — zero runtime allocation after init          │
└──────────────────────────────────────────────────────────────┘

Two layers exist side by side:

Layer	Purpose	Location
C++ Engine	Production inference — pure C++/HIP, custom kernels	`engine/`, `kernels/`, `main.cpp`
Python Model	Reference impl + correctness validation against HuggingFace	`model/`, `main.py`

Both produce identical outputs on the same weights. The Python side also serves as the PyBind11 test harness for verifying individual kernels against PyTorch baselines.

Kernels

Kernel	What it does
`rmsnorm.hip`	Warp-reduce RMSNorm, block-level shared memory reduction
`rope.hip`	Rotary position embeddings — computes sin/cos on-the-fly via `__sincosf`, no precomputed tables
`swiglu.hip`	Fused SiLU(gate) × up
`flash_decode.hip`	Single-pass FlashDecode with online softmax for seq_len=1 decode steps
`dequant.hip`	GPTQ INT4 → FP16 dequantization with transposed output for GEMM

Design notes

Zero-alloc inference. All scratch buffers allocated once at Model::reset_caches(). The decode loop doesn't touch the allocator.

BF16 → FP16 at load time. RDNA2 has no hardware BF16 matrix units, so weights are converted during loading. For GPTQ models, INT4 weights stay packed as int32 and dequantize on the fly.

Custom safetensors parser. Direct mmap + JSON header parse. No Python or library dependency. Handles sharded models.

GPTQ W4A16 pipeline. Each linear layer: dequant kernel writes FP16 into a scratch buffer → hipBLAS GEMM. Two kernel launches instead of ~6 in Python.

Building

Prerequisites

ROCm 6.x (tested on 6.3.4)
CMake ≥ 3.21
AMD GPU with gfx1030 ISA (RDNA2) — adjust CMAKE_HIP_ARCHITECTURES for other targets

C++ Engine

mkdir build && cd build
cmake .. -DCMAKE_HIP_ARCHITECTURES=gfx1030
make -j$(nproc)
./llm-engine --model-path /path/to/Qwen3-8B-GPTQ-Int4 --prompt "Hello" --max-tokens 50

Python Reference

pip install torch transformers safetensors huggingface_hub
python main.py --model qwen3-8b --prompt "What is gravity?"

PyBind11 Kernel Bindings

cd bindings && python setup.py build_ext --inplace

Exposes rmsnorm_forward, rope_forward, swiglu_forward_inplace, flash_decode_forward for use from PyTorch. Used by the test suite to validate kernel correctness.

CUDA Portability

The hardware-specific surface is small:

HIP	CUDA
`hipMalloc`	`cudaMalloc`
`hipblasGemmEx`	`cublasGemmEx`
`hipLaunchKernelGGL`	`<<<>>>` syntax

Kernel code uses standard intrinsics (__shfl_down, __syncthreads, __sincosf) that are the same across HIP and CUDA. Porting: rename .hip → .cu, swap includes, update CMake.

Project Structure

├── main.cpp                    # C++ entry point
├── main.py                     # Python entry point
├── CMakeLists.txt
├── engine/
│   ├── include/
│   │   ├── model.h             # Forward pass
│   │   ├── weight_loader.h     # Safetensors parser + config
│   │   ├── kv_cache.h          # KV cache
│   │   ├── gemm.h              # hipBLAS wrapper
│   │   ├── memory.h            # GPU allocation tracker
│   │   ├── tensor.h            # Lightweight tensor (ptr, shape, dtype)
│   │   ├── tokenizer.h         # BPE tokenizer
│   │   └── generate.h          # Sampling + generation loop
│   └── src/
│       ├── model.hip
│       ├── kv_cache.hip
│       ├── gemm.hip
│       ├── weight_loader.cpp
│       ├── memory.cpp
│       ├── tokenizer.cpp
│       └── generate.cpp
├── kernels/
│   ├── include/
│   └── src/
│       ├── rmsnorm.hip
│       ├── rope.hip
│       ├── swiglu.hip
│       ├── flash_decode.hip
│       └── dequant.hip
├── model/                      # PyTorch reference + test harness
│   ├── attention.py
│   ├── layers.py
│   ├── transformer.py
│   ├── kv_cache.py
│   ├── generate.py
│   ├── weight_loader.py
│   └── config.py
├── bindings/
│   ├── pybind_module.cpp
│   └── setup.py
├── tests/
│   ├── test_rmsnorm.py
│   ├── test_rope.py
│   ├── test_swiglu.py
│   ├── test_layers.py
│   ├── benchmark.py
│   ├── benchmark_suite.py
│   └── benchmark_dequant.py
└── third_party/
    └── nlohmann/json.hpp

KV Cache

Pre-allocated contiguous FP16 buffers per layer:

k_cache: (num_kv_heads, max_seq_len, head_dim)
v_cache: same layout

Qwen3-8B: 4 KV heads × 1024 tokens × 128 dim × 2 bytes = 1 MB/layer
36 layers at 1024 tokens: ~36 MB total

GQA means we cache 4 KV heads instead of 32, so KV memory is 8× smaller than full MHA.

Profiling

# Kernel trace
rocprof --hip-trace --hsa-trace ./llm-engine --model-path /path/to/model --prompt "Hello"

# VRAM monitoring
rocm-smi --showmeminfo vram -l 1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Inference Engine

Performance

Architecture

Kernels

Design notes

Building

Prerequisites

C++ Engine

Python Reference

PyBind11 Kernel Bindings

CUDA Portability

Project Structure

KV Cache

Profiling

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
bindings		bindings
engine		engine
kernels		kernels
model		model
tests		tests
third_party/nlohmann		third_party/nlohmann
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
main.cpp		main.cpp
main.py		main.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

LLM Inference Engine

Performance

Architecture

Kernels

Design notes

Building

Prerequisites

C++ Engine

Python Reference

PyBind11 Kernel Bindings

CUDA Portability

Project Structure

KV Cache

Profiling

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages