C++ inference engine for large language models, built on MLX.
Run LLMs locally on Apple M-series, AMD GPUs (Linux/Windows), and CPU -- no Python required.
- 50+ LLM architectures -- Llama, Qwen, Gemma, Phi, DeepSeek, Mistral, Granite, GLM, Falcon, and more
- 12 VLM architectures -- Qwen-VL, PaliGemma, Pixtral, Gemma3, SmolVLM, and more
- Embedders -- BERT, Nomic-BERT, Qwen3-Embed
- Quantized inference -- 4-bit/8-bit via `quantized_matmul`
- HuggingFace integration -- auto-downloads models, tokenizers, and chat templates
- OpenAI-compatible API server -- drop-in replacement for local inference
- Streaming generation -- async token pipeline with KV caching
- Multi-model management -- LRU eviction, explicit load/unload
- Chat templates -- Jinja2-compatible (minja), auto-loaded from model config
- CMake 3.20+
- C++17 compiler
- libcurl
- Rust toolchain (for tokenizers-cpp)
- ROCm (for AMD GPU builds)
```sh
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
```

For AMD GPU (ROCm):

```sh
cmake .. -DCMAKE_BUILD_TYPE=Release -DMLX_BUILD_ROCM=ON
make -j
```

```sh
./chat <model_id_or_path> [options]
```
Models are specified as HuggingFace repo IDs (auto-downloaded on first use) or local directory paths.
```sh
# Basic usage -- downloads the model on first run
./chat mlx-community/Qwen3-1.7B-4bit

# With system prompt and tuned sampling
./chat mlx-community/Qwen3-4B-4bit \
  --system-prompt "You are a helpful coding assistant" \
  --temperature 0.5 --max-tokens 4096

# Use a local model directory
./chat /path/to/my-model

# Disable thinking/reasoning for Qwen3 models
./chat mlx-community/Qwen3-8B-4bit --no-think

# KV cache quantization to save memory
./chat mlx-community/Qwen3-8B-4bit --kv-bits 4

# Raw mode (skip chat template)
./chat mlx-community/starcoder2-3b-4bit --raw
```

Type your message at the `>` prompt. Type `quit` or `exit` to leave.
| Flag | Default | Description |
|---|---|---|
| `--system-prompt "..."` | (none) | System instructions for the session |
| `--max-tokens N` | 2048 | Maximum tokens to generate per response |
| `--temperature T` | 0.7 | Sampling temperature (lower = more deterministic) |
| `--top-p P` | 0.9 | Nucleus sampling threshold |
| `--repetition-penalty F` | 0.0 (off) | Penalize token repetition |
| `--memory-limit MB` | 0 (unlimited) | GPU wired memory limit in MB |
| `--no-think` | false | Disable thinking/reasoning (Qwen3 models) |
| `--raw` | false | Skip chat template, use raw token encoding |
| `--kv-bits N` | 0 (off) | KV cache quantization bits (4 or 8) |
| `--kv-group-size N` | 64 | KV cache quantization group size |
| `--ctx-size N` | 0 (auto) | Pre-allocate KV cache for N tokens |
```sh
./server [model_id_or_path] [options]
```
The server exposes an OpenAI-compatible HTTP API. It works in two modes:
- Pre-load mode -- load a specific model at startup
- Auto-load mode -- start empty, load models on demand from API requests
```sh
# Pre-load a model
./server mlx-community/Qwen3-4B-4bit

# Auto-load mode (no model pre-loaded)
./server

# Custom host/port with multiple model slots
./server --host 0.0.0.0 --port 9090 --max-loaded 3

# Offline mode (no HuggingFace downloads)
./server --no-download
```

| Flag | Default | Description |
|---|---|---|
| `--host HOST` | 127.0.0.1 | Bind address |
| `--port PORT` | 8080 | Listen port |
| `--max-tokens N` | 2048 | Default max tokens for generation |
| `--temperature T` | 0.7 | Default sampling temperature |
| `--top-p P` | 0.9 | Default nucleus sampling threshold |
| `--repetition-penalty F` | 0.0 (off) | Default repetition penalty |
| `--memory-limit MB` | 0 (unlimited) | GPU wired memory limit |
| `--no-think` | false | Disable thinking/reasoning globally |
| `--no-download` | false | Don't auto-download models from HuggingFace |
| `--max-loaded N` | 1 | Max models in memory (LRU eviction) |
| `--kv-bits N` | 0 (off) | KV cache quantization bits (4 or 8) |
| `--kv-group-size N` | 64 | KV cache quantization group size |
| `--ctx-size N` | 0 (auto) | Pre-allocate KV cache for N tokens |
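Once started, the server can be polled for readiness via its health endpoint before sending requests. A minimal illustrative helper (not part of the project; the `fetch` parameter exists only so the polling logic is testable without a running server):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8080/health",
                     retries=30, delay=1.0, fetch=None):
    """Poll the health endpoint until it answers 200, or give up."""
    if fetch is None:
        def fetch(u):
            # Real HTTP check; swap in a stub for testing.
            with urllib.request.urlopen(u, timeout=2) as resp:
                return resp.status
    for _ in range(retries):
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(delay)
    return False
```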
Health check.
```sh
curl http://localhost:8080/health
# {"status":"ok"}
```

List available models. Returns all MLX models found in the HuggingFace cache, with loaded status.
```sh
curl http://localhost:8080/v1/models
```

```json
{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen3-4B-4bit",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local (loaded)"
    }
  ]
}
```

Chat completion (OpenAI-compatible). The model is auto-loaded if not already in memory.
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-4bit",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ]
  }'
```

Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | Model ID (auto-loaded if needed) |
| `messages` | array | required | Chat history (role + content) |
| `temperature` | float | 0.7 | Sampling temperature |
| `top_p` | float | 0.9 | Nucleus sampling |
| `max_tokens` | int | 2048 | Max tokens to generate |
| `repetition_penalty` | float | 0.0 | Repetition penalty |
| `stream` | bool | false | Stream response via SSE |
| `stop` | array | [] | Stop sequences |
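With `stream` set to true, the response arrives as Server-Sent Events, one `data: {...}` line per token chunk. A minimal client-side sketch for extracting the text deltas, assuming the standard OpenAI `chat.completion.chunk` payload shape (not verified against this server's exact output):

```python
import json

def sse_deltas(lines):
    """Yield text deltas from OpenAI-style 'data: {...}' event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # stream terminator
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text
```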
Streaming example:
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Text completion (non-chat). Same fields as chat completions but uses `prompt` instead of `messages`.
```sh
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-4B-4bit",
    "prompt": "The answer to life is",
    "max_tokens": 100
  }'
```

Explicitly load a model into memory.
```sh
curl -X POST http://localhost:8080/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-8B-4bit"}'
```

Unload a model from memory.

```sh
curl -X POST http://localhost:8080/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-8B-4bit"}'
```

The server works with any client that speaks the OpenAI API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="mlx-community/Qwen3-4B-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Models are resolved in this order:
- Local directory -- if the path contains `config.json`, use it directly
- HuggingFace cache -- check for a previously downloaded snapshot
- HuggingFace Hub -- download from the hub (unless `--no-download`)
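The three-step lookup can be sketched as follows. This is illustrative only, not the engine's actual code; the `models--<org>--<name>/snapshots/<rev>` layout is the standard HuggingFace cache structure, assumed here:

```python
import os

def resolve_model(spec, cache_dir, allow_download=True):
    """Illustrative lookup: local directory, then HF cache, then hub."""
    # 1. A local directory containing config.json is used directly.
    if os.path.isfile(os.path.join(spec, "config.json")):
        return ("local", spec)
    # 2. Otherwise look for a previously downloaded snapshot in the cache.
    snap_root = os.path.join(cache_dir,
                             "models--" + spec.replace("/", "--"),
                             "snapshots")
    if os.path.isdir(snap_root):
        for rev in sorted(os.listdir(snap_root)):
            snap = os.path.join(snap_root, rev)
            if os.path.isfile(os.path.join(snap, "config.json")):
                return ("cache", snap)
    # 3. Fall back to downloading from the hub, unless --no-download.
    if allow_download:
        return ("download", spec)
    raise FileNotFoundError(spec)
```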
| Variable | Description |
|---|---|
| `HF_HUB_CACHE` | Override HuggingFace cache directory (highest priority) |
| `HF_HOME` | HuggingFace home directory (uses `$HF_HOME/hub` as cache) |
| `HF_TOKEN` | API token for accessing private/gated models |
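The cache-directory precedence can be expressed as a small function. A sketch, assuming the final fallback is the conventional HuggingFace default location (`~/.cache/huggingface/hub`):

```python
import os

def hf_cache_dir(env):
    """Resolve the cache directory from an environment mapping."""
    if env.get("HF_HUB_CACHE"):
        # Highest priority: explicit cache directory.
        return env["HF_HUB_CACHE"]
    if env.get("HF_HOME"):
        # Next: the hub/ subdirectory of the HuggingFace home.
        return os.path.join(env["HF_HOME"], "hub")
    # Fallback: the conventional HuggingFace default (assumed here).
    return os.path.join(os.path.expanduser("~"),
                        ".cache", "huggingface", "hub")
```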
Use the `/v1/models` endpoint to see all MLX-compatible models in your HuggingFace cache. A model needs `config.json` and `.safetensors` files to be detected.
When running the server in auto-load mode, cached models are also printed to stderr at startup.
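The detection rule (`config.json` plus at least one `.safetensors` file in a cached snapshot) amounts to a scan of the cache directory. An illustrative sketch, not the server's actual implementation, again assuming the standard `models--<org>--<name>/snapshots/<rev>` cache layout:

```python
import glob
import os

def find_mlx_models(cache_dir):
    """Return repo IDs whose cached snapshots look loadable."""
    found = set()
    pattern = os.path.join(cache_dir, "models--*", "snapshots", "*")
    for snap in glob.glob(pattern):
        has_config = os.path.isfile(os.path.join(snap, "config.json"))
        has_weights = glob.glob(os.path.join(snap, "*.safetensors"))
        if has_config and has_weights:
            # models--org--name  ->  org/name
            repo = os.path.basename(os.path.dirname(os.path.dirname(snap)))
            found.add(repo[len("models--"):].replace("--", "/"))
    return sorted(found)
```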
Pipeline diagnostic tool that validates each inference stage for numerical issues.
```sh
./diagnose mlx-community/Qwen3-4B-4bit
```

Tests GPU operations, quantized matmul, RMS normalization, RoPE, forward pass, dequantization, and end-to-end token generation. Reports NaN/Inf counts and numerical statistics for each stage.
Llama, Mistral, Qwen2, Qwen3, Qwen3-MoE, Qwen3-Next, Qwen3.5-MoE, Gemma, Gemma2, Gemma3, Gemma3n, Phi, Phi3, PhiMoE, DeepSeek-V3, MiMo, Cohere/Command-R, Starcoder2, Mistral3, Granite, GraniteMoE-Hybrid, GLM4, GLM4-MoE, GLM4-MoE-Lite, Ernie4.5, SmolLM3, MiniCPM, OLMo2, OLMo3, OLMoE, NanoChat, Lille-130m, InternLM2, Exaone4, Apertus, OpenELM, BailingMoE, AFMoE, GPT-OSS, LFM2, LFM2-MoE, Baichuan-M1, Falcon-H1, Nemotron-H, Jamba, AceReason
Qwen2-VL, Qwen2.5-VL, Qwen3-VL, PaliGemma, Gemma3, Pixtral, Mistral3, Idefics3, SmolVLM, FastVLM, LLaVA-Qwen2, LFM2-VL
BERT, Nomic-BERT, Qwen3-Embed
| Library | Description |
|---|---|
| `mlx-lm-core` | MLX module wrappers |
| `mlx-lm-common` | Tokenizer, generation, KV cache, hub API |
| `mlx-lm-llm` | LLM model implementations |
| `mlx-lm-vlm` | Vision-language model implementations |
| `mlx-lm-embedders` | Embedding model implementations |
MIT