C++ inference engine for large language models, built on MLX.
Run LLMs locally on Apple M-series, AMD GPUs (Linux/Windows), and CPU -- no Python required.
- 50+ LLM architectures -- Llama, Qwen, Gemma, Phi, DeepSeek, Mistral, Granite, GLM, Falcon, and more
- 12 VLM architectures -- Qwen-VL, PaliGemma, Pixtral, Gemma3, SmolVLM, and more
- Embedders -- BERT, Nomic-BERT, Qwen3-Embed
- Quantized inference -- 4-bit/8-bit via `quantized_matmul`
- HuggingFace integration -- auto-downloads models, tokenizers, and chat templates
- OpenAI-compatible API server -- drop-in replacement for local inference
- Streaming generation -- async token pipeline with KV caching
- Multi-model management -- LRU eviction, explicit load/unload
- Chat templates -- Jinja2-compatible (minja), auto-loaded from model config
- CMake 3.20+
- C++17 compiler
- libcurl
- Rust toolchain (for tokenizers-cpp)
- ROCm (for AMD GPU builds)
```sh
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
```

For AMD GPU (ROCm):

```sh
cmake .. -DCMAKE_BUILD_TYPE=Release -DMLX_BUILD_ROCM=ON
make -j
```

```sh
./chat <model_id_or_path> [options]
```
Models are specified as HuggingFace repo IDs (auto-downloaded on first use) or local directory paths.
```sh
# Basic usage -- downloads the model on first run
./chat mlx-community/Qwen3-1.7B-4bit

# With system prompt and tuned sampling
./chat mlx-community/Qwen3-4B-4bit \
  --system-prompt "You are a helpful coding assistant" \
  --temperature 0.5 --max-tokens 4096

# Use a local model directory
./chat /path/to/my-model

# Disable thinking/reasoning for Qwen3 models
./chat mlx-community/Qwen3-8B-4bit --no-think

# KV cache quantization to save memory
./chat mlx-community/Qwen3-8B-4bit --kv-bits 4

# Raw mode (skip chat template)
./chat mlx-community/starcoder2-3b-4bit --raw
```

Type your message at the `>` prompt. Type `quit` or `exit` to leave.
| Flag | Default | Description |
|---|---|---|
| `--system-prompt "..."` | (none) | System instructions for the session |
| `--max-tokens N` | 2048 | Maximum tokens to generate per response |
| `--temperature T` | 0.7 | Sampling temperature (lower = more deterministic) |
| `--top-p P` | 0.9 | Nucleus sampling threshold |
| `--repetition-penalty F` | 0.0 (off) | Penalize token repetition |
| `--memory-limit MB` | 0 (unlimited) | GPU wired memory limit in MB |
| `--no-think` | false | Disable thinking/reasoning (Qwen3 models) |
| `--raw` | false | Skip chat template, use raw token encoding |
| `--kv-bits N` | 0 (off) | KV cache quantization bits (4 or 8) |
| `--kv-group-size N` | 64 | KV cache quantization group size |
| `--ctx-size N` | 0 (auto) | Pre-allocate KV cache for N tokens |
```sh
./server [model_id_or_path] [options]
```
The server exposes an OpenAI-compatible HTTP API. It works in two modes:
- Pre-load mode -- load a specific model at startup
- Auto-load mode -- start empty, load models on demand from API requests
```sh
# Pre-load a model
./server mlx-community/Qwen3-4B-4bit

# Auto-load mode (no model pre-loaded)
./server

# Custom host/port with multiple model slots
./server --host 0.0.0.0 --port 9090 --max-loaded 3

# Offline mode (no HuggingFace downloads)
./server --no-download
```

| Flag | Default | Description |
|---|---|---|
| `--host HOST` | 127.0.0.1 | Bind address |
| `--port PORT` | 8080 | Listen port |
| `--max-tokens N` | 2048 | Default max tokens for generation |
| `--temperature T` | 0.7 | Default sampling temperature |
| `--top-p P` | 0.9 | Default nucleus sampling threshold |
| `--repetition-penalty F` | 0.0 (off) | Default repetition penalty |
| `--memory-limit MB` | 0 (unlimited) | GPU wired memory limit |
| `--no-think` | false | Disable thinking/reasoning globally |
| `--no-download` | false | Don't auto-download models from HuggingFace |
| `--max-loaded N` | 1 | Max models in memory (LRU eviction) |
| `--kv-bits N` | 0 (off) | KV cache quantization bits (4 or 8) |
| `--kv-group-size N` | 64 | KV cache quantization group size |
| `--ctx-size N` | 0 (auto) | Pre-allocate KV cache for N tokens |
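Once started, the server can be polled for readiness via its health endpoint before sending requests. A minimal illustrative helper (not part of the project; the `fetch` parameter exists only so the polling logic is testable without a running server):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8080/health",
                     retries=30, delay=1.0, fetch=None):
    """Poll the health endpoint until it answers 200, or give up."""
    if fetch is None:
        def fetch(u):
            # Real HTTP check; swap in a stub for testing.
            with urllib.request.urlopen(u, timeout=2) as resp:
                return resp.status
    for _ in range(retries):
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(delay)
    return False
```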
Health check.
```sh
curl http://localhost:8080/health
# {"status":"ok"}
```

List available models. Returns all MLX models found in the HuggingFace cache, with loaded status.
```sh
curl http://localhost:8080/v1/models
```

```json
{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen3-4B-4bit",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local (loaded)"
    }
  ]
}
```

Chat completion (OpenAI-compatible). The model is auto-loaded if not already in memory.
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-4bit",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ]
  }'
```

Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | Model ID (auto-loaded if needed) |
| `messages` | array | required | Chat history (role + content) |
| `temperature` | float | 0.7 | Sampling temperature |
| `top_p` | float | 0.9 | Nucleus sampling |
| `max_tokens` | int | 2048 | Max tokens to generate |
| `repetition_penalty` | float | 0.0 | Repetition penalty |
| `stream` | bool | false | Stream response via SSE |
| `stop` | array | [] | Stop sequences |
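With `stream` set to true, the response arrives as Server-Sent Events, one `data: {...}` line per token chunk. A minimal client-side sketch for extracting the text deltas, assuming the standard OpenAI `chat.completion.chunk` payload shape (not verified against this server's exact output):

```python
import json

def sse_deltas(lines):
    """Yield text deltas from OpenAI-style 'data: {...}' event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # stream terminator
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text
```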
Streaming example:
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Text completion (non-chat). Same fields as chat completions but uses `prompt` instead of `messages`.
```sh
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-4bit",
    "prompt": "The answer to life is",
    "max_tokens": 100
  }'
```

Explicitly load a model into memory.
```sh
curl -X POST http://localhost:8080/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-8B-4bit"}'
```

Unload a model from memory.

```sh
curl -X POST http://localhost:8080/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-8B-4bit"}'
```

The server works with any client that speaks the OpenAI API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="mlx-community/Qwen3-4B-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Models are resolved in this order:
- Local directory -- if the path contains `config.json`, use it directly
- HuggingFace cache -- check for a previously downloaded snapshot
- HuggingFace Hub -- download from the hub (unless `--no-download`)
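The three-step lookup can be sketched as follows. This is illustrative only, not the engine's actual code; the `models--<org>--<name>/snapshots/<rev>` layout is the standard HuggingFace cache structure, assumed here:

```python
import os

def resolve_model(spec, cache_dir, allow_download=True):
    """Illustrative lookup: local directory, then HF cache, then hub."""
    # 1. A local directory containing config.json is used directly.
    if os.path.isfile(os.path.join(spec, "config.json")):
        return ("local", spec)
    # 2. Otherwise look for a previously downloaded snapshot in the cache.
    snap_root = os.path.join(cache_dir,
                             "models--" + spec.replace("/", "--"),
                             "snapshots")
    if os.path.isdir(snap_root):
        for rev in sorted(os.listdir(snap_root)):
            snap = os.path.join(snap_root, rev)
            if os.path.isfile(os.path.join(snap, "config.json")):
                return ("cache", snap)
    # 3. Fall back to downloading from the hub, unless --no-download.
    if allow_download:
        return ("download", spec)
    raise FileNotFoundError(spec)
```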
| Variable | Description |
|---|---|
| `HF_HUB_CACHE` | Override HuggingFace cache directory (highest priority) |
| `HF_HOME` | HuggingFace home directory (uses `$HF_HOME/hub` as cache) |
| `HF_TOKEN` | API token for accessing private/gated models |
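The cache-directory precedence can be expressed as a small function. A sketch, assuming the final fallback is the conventional HuggingFace default location (`~/.cache/huggingface/hub`):

```python
import os

def hf_cache_dir(env):
    """Resolve the cache directory from an environment mapping."""
    if env.get("HF_HUB_CACHE"):
        # Highest priority: explicit cache directory.
        return env["HF_HUB_CACHE"]
    if env.get("HF_HOME"):
        # Next: the hub/ subdirectory of the HuggingFace home.
        return os.path.join(env["HF_HOME"], "hub")
    # Fallback: the conventional HuggingFace default (assumed here).
    return os.path.join(os.path.expanduser("~"),
                        ".cache", "huggingface", "hub")
```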
Use the `/v1/models` endpoint to see all MLX-compatible models in your HuggingFace cache. A model needs `config.json` and `.safetensors` files to be detected.
When running the server in auto-load mode, cached models are also printed to stderr at startup.
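The detection rule (`config.json` plus at least one `.safetensors` file in a cached snapshot) amounts to a scan of the cache directory. An illustrative sketch, not the server's actual implementation, again assuming the standard `models--<org>--<name>/snapshots/<rev>` cache layout:

```python
import glob
import os

def find_mlx_models(cache_dir):
    """Return repo IDs whose cached snapshots look loadable."""
    found = set()
    pattern = os.path.join(cache_dir, "models--*", "snapshots", "*")
    for snap in glob.glob(pattern):
        has_config = os.path.isfile(os.path.join(snap, "config.json"))
        has_weights = glob.glob(os.path.join(snap, "*.safetensors"))
        if has_config and has_weights:
            # models--org--name  ->  org/name
            repo = os.path.basename(os.path.dirname(os.path.dirname(snap)))
            found.add(repo[len("models--"):].replace("--", "/"))
    return sorted(found)
```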
Pipeline diagnostic tool that validates each inference stage for numerical issues.
```sh
./diagnose mlx-community/Qwen3-4B-4bit
```

Tests GPU operations, quantized matmul, RMS normalization, RoPE, forward pass, dequantization, and end-to-end token generation. Reports NaN/Inf counts and numerical statistics for each stage.
Llama, Mistral, Qwen2, Qwen3, Qwen3-MoE, Qwen3-Next, Qwen3.5-MoE, Gemma, Gemma2, Gemma3, Gemma3n, Phi, Phi3, PhiMoE, DeepSeek-V3, MiMo, Cohere/Command-R, Starcoder2, Mistral3, Granite, GraniteMoE-Hybrid, GLM4, GLM4-MoE, GLM4-MoE-Lite, Ernie4.5, SmolLM3, MiniCPM, OLMo2, OLMo3, OLMoE, NanoChat, Lille-130m, InternLM2, Exaone4, Apertus, OpenELM, BailingMoE, AFMoE, GPT-OSS, LFM2, LFM2-MoE, Baichuan-M1, Falcon-H1, Nemotron-H, Jamba, AceReason
Qwen2-VL, Qwen2.5-VL, Qwen3-VL, PaliGemma, Gemma3, Pixtral, Mistral3, Idefics3, SmolVLM, FastVLM, LLaVA-Qwen2, LFM2-VL
BERT, Nomic-BERT, Qwen3-Embed
| Library | Description |
|---|---|
| `mlx-lm-core` | MLX module wrappers |
| `mlx-lm-common` | Tokenizer, generation, KV cache, hub API |
| `mlx-lm-llm` | LLM model implementations |
| `mlx-lm-vlm` | Vision-language model implementations |
| `mlx-lm-embedders` | Embedding model implementations |
MIT