English | 简体中文
A minimal C++ framework for learning and understanding the LLM inference pipeline. The project keeps the full inference path readable and editable: config loading, weight loading, tokenization, prefill/decode, sampling, and decoding tokens back to text. The default setup targets Qwen2.5-0.5B, so you can run end-to-end inference quickly on a single machine.
This project focuses on correctness and architecture clarity, not peak serving performance. Third-party dependencies are intentionally minimal (spdlog and nlohmann/json, both vendored); the rest is implemented in C++ and kept friendly for coursework, research prototypes, and self-study.
This release now includes a continuous batching service mode (`--serve`) in addition to one-shot CLI inference.

- Regression/invariant tests were expanded, with CTest labels and a dedicated target `easy_llm_regression_gates`.
- Model loading now has stronger architecture/key/shape validation (including `LayerKeyPrefix` dispatch and parameter checks).
- The CUDA path has grown (matmul/MLP/self-attn kernels), while the project remains CPU-first by default.
- A C++17 compiler
- CMake (>= 3.10)
- Optional: OpenMP (`EASY_LLM_ENABLE_OPENMP=ON` by default)
- Optional: CUDA toolkit (only when building the CUDA path)
CPU build:

```sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_COMPILER=g++
cmake --build build --target easy_llm -j8
```

CUDA build (requires a local CUDA environment):

```sh
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DEASY_LLM_ENABLE_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=<your_arch> \
  -DCMAKE_CXX_COMPILER=g++
cmake --build build --target easy_llm -j8
```

`build.sh` is kept for local experiments and may contain machine-specific flags. Prefer the CMake commands above for portable usage.
The default target is Qwen2.5-0.5B. Put the required files under `data/model/` (`data/` is git-ignored):

```text
data/model/
├─ config.json
├─ model.safetensors
├─ tokenizer.json
└─ tokenizer_config.json
```
To customize paths, edit the defaults in `include/config.hpp`.
```sh
./build/easy_llm --help
./build/easy_llm --max-steps 128 --temperature 0.7 --top-p 0.9 --top-k 40 "Hello"
./build/easy_llm -f test/data/test_batch.txt --max-steps 256 --temperature 0.1
./build/easy_llm --serve
```

Key arguments:

- `-f`/`--prompt-file`: read multiple prompts from a file
- `-m`/`--max-steps`: generation length limit per request
- `--temperature`/`--top-p`/`--top-k`: sampling controls
- `--seed`: RNG seed
- `--greedy`: greedy decoding
- `--serve`: run a long-lived continuous batching service
- `--serve-max-active`: max concurrent active requests in service mode
- `--serve-prefill-batch`: max requests admitted per prefill round
- `--serve-idle-ms`: idle sleep interval for the service loop
- `--serve-stats-ms`: periodic stats log interval (0 disables periodic logs)
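The sampling controls compose in a standard way: temperature-scaled softmax, then a top-k cut, then a top-p (nucleus) cut inside the kept prefix. A minimal sketch of that composition (the free-standing `softmax`/`sample` helpers are illustrative, not the project's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Softmax with temperature: lower temperature sharpens the distribution,
// higher temperature flattens it.
std::vector<float> softmax(std::vector<float> logits, float temperature) {
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float& l : logits) {
        l = std::exp((l - max_logit) / temperature);
        sum += l;
    }
    for (float& l : logits) l /= sum;
    return logits;
}

// Top-k keeps only the k most probable tokens; top-p keeps the smallest
// prefix (in probability order) whose cumulative mass reaches p.
int sample(const std::vector<float>& logits, float temperature,
           int top_k, float top_p, std::mt19937& rng) {
    std::vector<float> probs = softmax(logits, temperature);
    std::vector<int> idx(probs.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    size_t keep = std::min<size_t>(static_cast<size_t>(top_k), idx.size());
    float mass = 0.0f;
    size_t nucleus = 0;
    while (nucleus < keep && mass < top_p) mass += probs[idx[nucleus++]];
    if (nucleus == 0) nucleus = 1;  // always keep at least the argmax

    std::vector<float> weights(nucleus);
    for (size_t i = 0; i < nucleus; ++i) weights[i] = probs[idx[i]];
    std::discrete_distribution<size_t> pick(weights.begin(), weights.end());
    return idx[pick(rng)];
}
```

Note that `--greedy` is equivalent to `top_k = 1` in this sketch: the kept set collapses to the single most probable token.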
When running `./build/easy_llm --serve`:

- Submit one prompt per input line on stdin.
- Type `/quit` (or `:quit`) to stop accepting new input and drain existing requests.
- Accepted requests print `[accepted <id>]`; completed requests print `[request <id>] <decoded_text>`.
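The admit/decode/drain cycle behind this interface can be sketched as a single scheduler round. `Request` and `run_round` below are hypothetical stand-ins for the real server types in `src/continuous_batch_server.cpp`:

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical request record; the real server tracks tokens and KV state too.
struct Request {
    int id;
    bool done = false;
    int steps = 0;
    int max_steps = 0;
};

// One scheduler round: admit up to `prefill_batch` pending requests without
// exceeding `max_active` concurrent requests (cf. --serve-prefill-batch and
// --serve-max-active), run one decode step per active request, and retire
// finished requests so new ones can be admitted next round.
void run_round(std::deque<Request>& pending, std::vector<Request>& active,
               int max_active, int prefill_batch) {
    int admitted = 0;
    while (!pending.empty() &&
           static_cast<int>(active.size()) < max_active &&
           admitted < prefill_batch) {
        active.push_back(pending.front());  // prefill would happen here
        pending.pop_front();
        ++admitted;
    }
    for (Request& r : active) {             // one decode step per request
        if (++r.steps >= r.max_steps) r.done = true;
    }
    active.erase(std::remove_if(active.begin(), active.end(),
                                [](const Request& r) { return r.done; }),
                 active.end());
}
```

The key property of continuous batching is visible here: requests join and leave the active batch independently, so a long request never blocks the admission of short ones.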
```text
include/                          # Public headers
include/models/                   # GPT component interfaces
include/continuous_batch_server.hpp
src/                              # Core implementations
src/models/                       # Model component implementations
src/continuous_batch_server.cpp
src/cuda/                         # CUDA runtime and CUDA operators
test/                             # Unit/invariant tests and test data
scripts/run_regression_gates.sh
data/                             # Model assets (git-ignored)
```
`src/main.cpp` orchestrates two runtime modes:

- Parse CLI arguments (prompt/sampling/service options)
- Load config + model weights + tokenizer
- Apply the chat template to user prompts
- One-shot mode: `GptEngine::run` with `DataManager`
- Service mode: `ContinuousBatchServer::run` with prefill/decode rounds over active requests
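For the chat-template step: Qwen2-family chat models expect a ChatML-style prompt wrapper. A minimal sketch of what applying the template produces (the helper name and system message are illustrative; the actual template is defined by the model's `tokenizer_config.json`):

```cpp
#include <cassert>
#include <string>

// Wrap a raw user prompt in ChatML markers as used by the Qwen2 family.
// The system message below is a placeholder, not the project's exact text.
std::string apply_chat_template(const std::string& user_prompt) {
    return "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
           "<|im_start|>user\n" + user_prompt + "<|im_end|>\n"
           "<|im_start|>assistant\n";  // generation continues from here
}
```

The trailing open `assistant` turn is what makes the model generate a reply rather than continue the user's text.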
- `DataManager` tokenizes and applies left padding, tracking `seq_len` and `pad_len`.
- `GptModel::forward` (or the continuous sampling APIs) uses prefill -> decode staging with per-layer KV caches.
- Prefill runs the full prompt forward once; decode feeds one token per step with KV-cache append/reuse.
- EOS filtering and active-sample bookkeeping keep sample IDs, generated tokens, and position lengths aligned.
- End-to-end inference path from config/weights/tokenizer to sampled output
- Clear GPT decomposition: Embedding -> Blocks (Self-Attn + MLP) x N -> Norm -> output projection
- Greedy / Top-K / Top-P sampling with explicit CLI controls
- Continuous batching service mode for multi-request decode scheduling
- Regression/invariant test gates for critical generation/cache/data-manager behaviors
- CPU-first baseline with optional CUDA acceleration path
- Model/tokenizer paths: `include/config.hpp` (defaults under `data/model/`)
- Precision macro: `USE_BF16` by default; can be adjusted at compile time
- OpenMP: controlled by `EASY_LLM_ENABLE_OPENMP`
- Model adaptation: `create_layer_key_prefix` dispatches by `architecture`/`model_type` (currently the Qwen2 family)
- Model param validation: pre-load key/shape checks reduce the risk of silent mismatches
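The dispatch idea can be sketched as below. The helper and the exact key layout are illustrative assumptions (Hugging Face checkpoints for Qwen2 use `model.layers.<n>.`-style keys); the real `create_layer_key_prefix` also feeds the key/shape validation mentioned above:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Map a model_type (from config.json) and layer index to the weight-key
// prefix used in the safetensors file. Unknown architectures fail loudly
// instead of silently loading mismatched tensors.
std::string layer_key_prefix(const std::string& model_type, int layer) {
    if (model_type == "qwen2")
        return "model.layers." + std::to_string(layer) + ".";
    throw std::runtime_error("unsupported model_type: " + model_type);
}
```

Adding a new architecture then means adding one branch here plus the matching shape checks, rather than touching the loader itself.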
Build and run the regression gates:

```sh
cmake --build build --target easy_llm_regression_gates -j8
ctest --test-dir build --output-on-failure -L "^invariant_gate$"
```

CUDA-specific invariant tests (when the CUDA build is enabled):

```sh
ctest --test-dir build --output-on-failure -L "^invariant_gate_cuda$"
```

Helper script:

```sh
bash scripts/run_regression_gates.sh
bash scripts/run_regression_gates.sh --with-cuda
```

Q: Where is the output saved when using `-f`/`--prompt-file`?
A: Output is written to a sibling file whose name appends `_output` before the same extension (for example, `test_batch.txt` -> `test_batch_output.txt`).
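That naming rule can be expressed with `std::filesystem`; this is a sketch of the convention described above, not the project's actual helper:

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Derive the sibling output path: keep directory and extension,
// append "_output" to the file stem.
std::string output_path_for(const std::string& prompt_file) {
    std::filesystem::path p(prompt_file);
    std::filesystem::path out = p.parent_path() /
        (p.stem().string() + "_output" + p.extension().string());
    return out.string();
}
```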
Q: Why isn't this heavily optimized?
A: It is a learning-oriented implementation prioritizing readable inference logic and debuggability.
- spdlog: logging (vendored in `include/third_party/spdlog` and `src/third_party/spdlog`)
- nlohmann/json: JSON parsing (vendored in `include/third_party/json.hpp`)
Everything else is implemented in C++.