Skip to content

RutanshS/llm-inference-engine

Repository files navigation

LLM Inference Engine

C++/HIP inference engine for Qwen-family LLMs. Custom GPU kernels, no PyTorch dependency in the production path.

Validated on AMD Radeon RX 6750 XT (RDNA2, ROCm 6.3).

Performance

Qwen3-8B GPTQ-INT4 on RX 6750 XT (12 GB VRAM):

Metric Value
Throughput 3.1 tok/s
VRAM 5.79 GB
TTFT ~280 ms
TPOT ~320 ms
Weight format GPTQ INT4 (group_size=128)

The C++ engine eliminates Python-side kernel dispatch overhead — a dequant microbenchmark shows 94% of the PyTorch baseline's per-token latency comes from dequantization kernel launches that the native engine avoids.

Architecture

┌──────────────────────────────────────────────────────────────┐
│                        main.cpp                              │
│  CLI → Config → Weights → Tokenizer → Model → Generate      │
├──────────────────────────────────────────────────────────────┤
│                    engine/src/model.hip                       │
│  Forward pass:                                               │
│    Embed → N × (RMSNorm → QKV → QK-Norm → RoPE → KVCache   │
│              → FlashDecode/Attn → MLP) → Final Norm → LM Head│
├──────────────────────────────────────────────────────────────┤
│  Custom HIP Kernels                                          │
│  rmsnorm  rope  swiglu  flash_decode  dequant_int4           │
├──────────────────────────────────────────────────────────────┤
│  hipBLAS GEMM — FP16 in, FP32 accumulate                    │
├──────────────────────────────────────────────────────────────┤
│  MemoryManager — zero runtime allocation after init          │
└──────────────────────────────────────────────────────────────┘

Two layers exist side by side:

Layer Purpose Location
C++ Engine Production inference — pure C++/HIP, custom kernels engine/, kernels/, main.cpp
Python Model Reference impl + correctness validation against HuggingFace model/, main.py

Both produce identical outputs on the same weights. The Python side also serves as the PyBind11 test harness for verifying individual kernels against PyTorch baselines.

Kernels

Kernel What it does
rmsnorm.hip Warp-reduce RMSNorm, block-level shared memory reduction
rope.hip Rotary position embeddings — computes sin/cos on-the-fly via __sincosf, no precomputed tables
swiglu.hip Fused SiLU(gate) × up
flash_decode.hip Single-pass FlashDecode with online softmax for seq_len=1 decode steps
dequant.hip GPTQ INT4 → FP16 dequantization with transposed output for GEMM

Design notes

Zero-alloc inference. All scratch buffers allocated once at Model::reset_caches(). The decode loop doesn't touch the allocator.

BF16 → FP16 at load time. RDNA2 has no hardware BF16 matrix units, so weights are converted during loading. For GPTQ models, INT4 weights stay packed as int32 and dequantize on the fly.

Custom safetensors parser. Direct mmap + JSON header parse. No Python or library dependency. Handles sharded models.

GPTQ W4A16 pipeline. Each linear layer: dequant kernel writes FP16 into a scratch buffer → hipBLAS GEMM. Two kernel launches instead of ~6 in Python.

Building

Prerequisites

  • ROCm 6.x (tested on 6.3.4)
  • CMake ≥ 3.21
  • AMD GPU with gfx1030 ISA (RDNA2) — adjust CMAKE_HIP_ARCHITECTURES for other targets

C++ Engine

mkdir build && cd build
cmake .. -DCMAKE_HIP_ARCHITECTURES=gfx1030
make -j$(nproc)
./llm-engine --model-path /path/to/Qwen3-8B-GPTQ-Int4 --prompt "Hello" --max-tokens 50

Python Reference

pip install torch transformers safetensors huggingface_hub
python main.py --model qwen3-8b --prompt "What is gravity?"

PyBind11 Kernel Bindings

cd bindings && python setup.py build_ext --inplace

Exposes rmsnorm_forward, rope_forward, swiglu_forward_inplace, flash_decode_forward for use from PyTorch. Used by the test suite to validate kernel correctness.

CUDA Portability

The hardware-specific surface is small:

HIP CUDA
hipMalloc cudaMalloc
hipblasGemmEx cublasGemmEx
hipLaunchKernelGGL <<<>>> syntax

Kernel code uses standard intrinsics (__shfl_down, __syncthreads, __sincosf) that are the same across HIP and CUDA. Porting: rename .hip.cu, swap includes, update CMake.

Project Structure

├── main.cpp                    # C++ entry point
├── main.py                     # Python entry point
├── CMakeLists.txt
├── engine/
│   ├── include/
│   │   ├── model.h             # Forward pass
│   │   ├── weight_loader.h     # Safetensors parser + config
│   │   ├── kv_cache.h          # KV cache
│   │   ├── gemm.h              # hipBLAS wrapper
│   │   ├── memory.h            # GPU allocation tracker
│   │   ├── tensor.h            # Lightweight tensor (ptr, shape, dtype)
│   │   ├── tokenizer.h         # BPE tokenizer
│   │   └── generate.h          # Sampling + generation loop
│   └── src/
│       ├── model.hip
│       ├── kv_cache.hip
│       ├── gemm.hip
│       ├── weight_loader.cpp
│       ├── memory.cpp
│       ├── tokenizer.cpp
│       └── generate.cpp
├── kernels/
│   ├── include/
│   └── src/
│       ├── rmsnorm.hip
│       ├── rope.hip
│       ├── swiglu.hip
│       ├── flash_decode.hip
│       └── dequant.hip
├── model/                      # PyTorch reference + test harness
│   ├── attention.py
│   ├── layers.py
│   ├── transformer.py
│   ├── kv_cache.py
│   ├── generate.py
│   ├── weight_loader.py
│   └── config.py
├── bindings/
│   ├── pybind_module.cpp
│   └── setup.py
├── tests/
│   ├── test_rmsnorm.py
│   ├── test_rope.py
│   ├── test_swiglu.py
│   ├── test_layers.py
│   ├── benchmark.py
│   ├── benchmark_suite.py
│   └── benchmark_dequant.py
└── third_party/
    └── nlohmann/json.hpp

KV Cache

Pre-allocated contiguous FP16 buffers per layer:

k_cache: (num_kv_heads, max_seq_len, head_dim)
v_cache: same layout

Qwen3-8B: 4 KV heads × 1024 tokens × 128 dim × 2 bytes = 1 MB/layer
36 layers at 1024 tokens: ~36 MB total

GQA means we cache 4 KV heads instead of 32, so KV memory is 8× smaller than full MHA.

Profiling

# Kernel trace
rocprof --hip-trace --hsa-trace ./llm-engine --model-path /path/to/model --prompt "Hello"

# VRAM monitoring
rocm-smi --showmeminfo vram -l 1

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors