C++/HIP inference engine for Qwen-family LLMs. Custom GPU kernels, no PyTorch dependency in the production path.
Validated on AMD Radeon RX 6750 XT (RDNA2, ROCm 6.3).
Qwen3-8B GPTQ-INT4 on RX 6750 XT (12 GB VRAM):
| Metric | Value |
|---|---|
| Throughput | 3.1 tok/s |
| VRAM | 5.79 GB |
| TTFT | ~280 ms |
| TPOT | ~320 ms |
| Weight format | GPTQ INT4 (group_size=128) |
The C++ engine eliminates Python-side kernel dispatch overhead — a dequant microbenchmark shows 94% of the PyTorch baseline's per-token latency comes from dequantization kernel launches that the native engine avoids.
┌──────────────────────────────────────────────────────────────┐
│ main.cpp │
│ CLI → Config → Weights → Tokenizer → Model → Generate │
├──────────────────────────────────────────────────────────────┤
│ engine/src/model.hip │
│ Forward pass: │
│ Embed → N × (RMSNorm → QKV → QK-Norm → RoPE → KVCache │
│ → FlashDecode/Attn → MLP) → Final Norm → LM Head│
├──────────────────────────────────────────────────────────────┤
│ Custom HIP Kernels │
│ rmsnorm rope swiglu flash_decode dequant_int4 │
├──────────────────────────────────────────────────────────────┤
│ hipBLAS GEMM — FP16 in, FP32 accumulate │
├──────────────────────────────────────────────────────────────┤
│ MemoryManager — zero runtime allocation after init │
└──────────────────────────────────────────────────────────────┘
Two layers exist side by side:
| Layer | Purpose | Location |
|---|---|---|
| C++ Engine | Production inference — pure C++/HIP, custom kernels | engine/, kernels/, main.cpp |
| Python Model | Reference impl + correctness validation against HuggingFace | model/, main.py |
Both produce identical outputs on the same weights. The Python side also serves as the PyBind11 test harness for verifying individual kernels against PyTorch baselines.
| Kernel | What it does |
|---|---|
rmsnorm.hip |
Warp-reduce RMSNorm, block-level shared memory reduction |
rope.hip |
Rotary position embeddings — computes sin/cos on-the-fly via __sincosf, no precomputed tables |
swiglu.hip |
Fused SiLU(gate) × up |
flash_decode.hip |
Single-pass FlashDecode with online softmax for seq_len=1 decode steps |
dequant.hip |
GPTQ INT4 → FP16 dequantization with transposed output for GEMM |
Zero-alloc inference. All scratch buffers allocated once at Model::reset_caches(). The decode loop doesn't touch the allocator.
BF16 → FP16 at load time. RDNA2 has no hardware BF16 matrix units, so weights are converted during loading. For GPTQ models, INT4 weights stay packed as int32 and dequantize on the fly.
Custom safetensors parser. Direct mmap + JSON header parse. No Python or library dependency. Handles sharded models.
GPTQ W4A16 pipeline. Each linear layer: dequant kernel writes FP16 into a scratch buffer → hipBLAS GEMM. Two kernel launches instead of ~6 in Python.
- ROCm 6.x (tested on 6.3.4)
- CMake ≥ 3.21
- AMD GPU with gfx1030 ISA (RDNA2) — adjust
CMAKE_HIP_ARCHITECTURESfor other targets
mkdir build && cd build
cmake .. -DCMAKE_HIP_ARCHITECTURES=gfx1030
make -j$(nproc)
./llm-engine --model-path /path/to/Qwen3-8B-GPTQ-Int4 --prompt "Hello" --max-tokens 50pip install torch transformers safetensors huggingface_hub
python main.py --model qwen3-8b --prompt "What is gravity?"cd bindings && python setup.py build_ext --inplaceExposes rmsnorm_forward, rope_forward, swiglu_forward_inplace, flash_decode_forward for use from PyTorch. Used by the test suite to validate kernel correctness.
The hardware-specific surface is small:
| HIP | CUDA |
|---|---|
hipMalloc |
cudaMalloc |
hipblasGemmEx |
cublasGemmEx |
hipLaunchKernelGGL |
<<<>>> syntax |
Kernel code uses standard intrinsics (__shfl_down, __syncthreads, __sincosf) that are the same across HIP and CUDA. Porting: rename .hip → .cu, swap includes, update CMake.
├── main.cpp # C++ entry point
├── main.py # Python entry point
├── CMakeLists.txt
├── engine/
│ ├── include/
│ │ ├── model.h # Forward pass
│ │ ├── weight_loader.h # Safetensors parser + config
│ │ ├── kv_cache.h # KV cache
│ │ ├── gemm.h # hipBLAS wrapper
│ │ ├── memory.h # GPU allocation tracker
│ │ ├── tensor.h # Lightweight tensor (ptr, shape, dtype)
│ │ ├── tokenizer.h # BPE tokenizer
│ │ └── generate.h # Sampling + generation loop
│ └── src/
│ ├── model.hip
│ ├── kv_cache.hip
│ ├── gemm.hip
│ ├── weight_loader.cpp
│ ├── memory.cpp
│ ├── tokenizer.cpp
│ └── generate.cpp
├── kernels/
│ ├── include/
│ └── src/
│ ├── rmsnorm.hip
│ ├── rope.hip
│ ├── swiglu.hip
│ ├── flash_decode.hip
│ └── dequant.hip
├── model/ # PyTorch reference + test harness
│ ├── attention.py
│ ├── layers.py
│ ├── transformer.py
│ ├── kv_cache.py
│ ├── generate.py
│ ├── weight_loader.py
│ └── config.py
├── bindings/
│ ├── pybind_module.cpp
│ └── setup.py
├── tests/
│ ├── test_rmsnorm.py
│ ├── test_rope.py
│ ├── test_swiglu.py
│ ├── test_layers.py
│ ├── benchmark.py
│ ├── benchmark_suite.py
│ └── benchmark_dequant.py
└── third_party/
└── nlohmann/json.hpp
Pre-allocated contiguous FP16 buffers per layer:
k_cache: (num_kv_heads, max_seq_len, head_dim)
v_cache: same layout
Qwen3-8B: 4 KV heads × 1024 tokens × 128 dim × 2 bytes = 1 MB/layer
36 layers at 1024 tokens: ~36 MB total
GQA means we cache 4 KV heads instead of 32, so KV memory is 8× smaller than full MHA.
# Kernel trace
rocprof --hip-trace --hsa-trace ./llm-engine --model-path /path/to/model --prompt "Hello"
# VRAM monitoring
rocm-smi --showmeminfo vram -l 1