AX Engine is a Mac-first LLM inference runtime, local server, SDK layer, and benchmark toolkit for Apple Silicon. It runs direct-support MLX model families natively, and routes other MLX text models or non-MLX models through explicit mlx-lm and llama.cpp compatibility routes.
AX Engine is for developers who want a local OpenAI-compatible model server on Apple Silicon without hiding which runtime path is doing the work.
- OpenAI-compatible local text endpoints for common chat and completion flows, with SDKs for Python, TypeScript/JavaScript, Go, Ruby, and Mojo.
- Repo-owned MLX runtime paths for direct-support Gemma and Qwen families, with delegated routes kept explicit.
- Announcement-ready benchmark claims where evidence is complete: Gemma 4 12B assistant-MTP is 2.34-2.73x faster than same-artifact direct decode, and Qwen3.6 35B-A3B AX MTP is +59.8% faster than the retained MTPLX reference on the public sidecar-fair matrix.
- Dedicated Qwen3-Coder-Next direct-support path for local coding agents, called out separately from Qwen3.6 because it has no MTP sidecar but carries its own coding-first architecture and benchmark boundary.
- Workload-contract benchmark tooling records route identity, artifacts, prompt suite, sampler, cooldowns, accept rate, and dirty-state provenance.
- Release Highlights
- Quick Start
- Installation
- Getting a Model
- Typical Hardware
- What AX Engine Does
- Public Claim Boundaries
- Supported Models
- Performance
- SDKs
- Server Usage
- Workspace
- Development
- Benchmark Reference Projects
- Limitations
- Contributing
- Community
- License
Install (macOS 26 Tahoe or later, Apple Silicon only — see Typical Hardware):
python3 -m pip install --upgrade pip # pip 23+ is required to find the wheel
python3 -m pip install -U "ax-engine[download]<7" # keep the quotes: zsh treats [ ] as a glob

Download a small model and start the server:
MODEL_DIR="$(ax-engine download mlx-community/Qwen3-4B-4bit --json | python3 -c 'import json,sys; print(json.load(sys.stdin)["dest"])')"
ax-engine serve "$MODEL_DIR" --port 8080

High-memory model shortcuts:
# Choose one:
ax-engine serve qwen36-35b --download --port 8080
ax-engine serve gemma4-12b --download --port 8080

Call it from any OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
model = client.models.list().data[0].id
resp = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "What is AGI?"}],
max_tokens=128,
)
print(resp.choices[0].message.content)

Or use the Python SDK directly:
from ax_engine import download_model, Session
path = download_model("mlx-community/Qwen3-4B-4bit")
with Session(mlx=True, mlx_model_artifacts_dir=str(path)) as s:
    print(s.generate([1, 2, 3], max_output_tokens=8).output_tokens)

Quick Start requires macOS 26 (Tahoe) or later on Apple Silicon M2 Max or newer with 32 GB unified memory or more. Earlier macOS releases are not supported; there is no wheel or binary for them. Larger models such as Qwen3.6 35B-A3B and Gemma 4 12B need the memory tiers listed in Typical Hardware.
The published wheel and Homebrew formula are macOS-arm64-only native builds. Before installing, confirm your machine matches:
- macOS 26 (Tahoe) or later. Earlier macOS versions are not supported — there is no wheel or formula for them.
- Apple Silicon (M2 Max or newer), arm64. Intel Macs are not supported.
- Python 3.10 or later for the pip install.
- pip 23 or later. Older pip cannot read the wheel's platform tag and will report `No matching distribution found`. Always run the upgrade step first.
# Check before installing — should print a version >= 26 and "arm64":
python3 -c "import platform; print(platform.mac_ver()[0], platform.machine())"

python3 -m pip install --upgrade pip
python3 -m pip install -U "ax-engine[download]<7"
ax-engine doctor

Keep the quotes around the spec; zsh otherwise treats `[download]` as a glob.
The wheel bundles the ax-engine orchestration CLI plus the ax-engine-server
and ax-engine-bench binaries, so all three are on your PATH after install.
There is no source distribution and no wheel for other platforms; if pip reports
No matching distribution found, see Troubleshooting.
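
The requirements above can also be checked programmatically before running pip. The following is a minimal sketch, not part of AX Engine: `host_problems` is a hypothetical helper name, and it only covers the macOS version, CPU architecture, and Python version. It cannot tell an M1 from an M2 Max, and it does not check the pip version.

```python
import platform
import sys


def host_problems(mac_version: str, machine: str, python_version: tuple) -> list:
    """Return human-readable reasons the published wheel would not match this host."""
    problems = []
    # platform.mac_ver()[0] is "" on non-macOS hosts, so treat that as major version 0.
    major = int(mac_version.split(".")[0]) if mac_version else 0
    if major < 26:
        problems.append(f"macOS 26 or later is required (found {mac_version or 'non-macOS'})")
    if machine != "arm64":
        problems.append(f"arm64 is required (found {machine})")
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10 or later is required")
    return problems


if __name__ == "__main__":
    found = host_problems(platform.mac_ver()[0], platform.machine(), sys.version_info)
    print("\n".join(found) if found else "host matches the wheel requirements")
```

An empty list means the three checked requirements hold; any entry points at a likely cause of `No matching distribution found`.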
Optional extras:
python3 -m pip install -U "ax-engine[openai]<7" # FastAPI OpenAI shim
python3 -m pip install -U "ax-engine[multimodal]<7" # image/audio helpers

Homebrew is the native binary channel for tagged macOS arm64 releases. The one-liner auto-taps defai-digital/homebrew-ax-engine:
brew install defai-digital/ax-engine/ax-engine
ax-engine doctor

ax-engine-server and ax-engine-bench are installed alongside the CLI. If doctor fails with `Library not loaded: libmlxc.dylib`, the mlx-c dependency is missing or stale; reinstall it:
brew install mlx-c && brew reinstall defai-digital/ax-engine/ax-engine

- `No matching distribution found for ax-engine`: your machine is not macOS 26+ Apple Silicon, or your pip is too old. Run `python3 -m pip install --upgrade pip`, then re-check with the Requirements command above. There is no wheel for Intel, Linux, Windows, or macOS earlier than 26.
- `zsh: no matches found: ax-engine[download]`: quote the spec, as in `pip install "ax-engine[download]<7"`.
- An old version installs: make sure you used `-U`, then confirm the channel is current with `python3 -m pip index versions ax-engine` or `brew info defai-digital/ax-engine/ax-engine`.
- Anything still off: build from Source, which works on any supported macOS and rebuilds the native binaries locally.
brew install mlx mlx-c protobuf
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip maturin
cargo build --release -p ax-engine-server -p ax-engine-bench
maturin develop --release
export PATH="$PWD/target/release:$PATH"
ax-engine doctor

AX Engine requires pre-sanitized MLX weights. The recommended source is mlx-community; models there are already converted and validated.
ax-engine download, download_model(), and scripts/download_model.py download weights and auto-generate the required model-manifest.json in one step:
# List supported download targets
ax-engine download --list
# Download by alias
ax-engine download qwen36-35b --json
ax-engine download qwen36-27b --json
ax-engine download gemma4-e2b --json
ax-engine download gemma4-12b --json
ax-engine download gemma4-31b --json
# Download and serve in one command
ax-engine serve qwen36-35b --download --port 8080
# Raw mlx-community repo IDs are also accepted
ax-engine download mlx-community/Qwen3.6-35B-A3B-4bit --json
ax-engine download mlx-community/Qwen3-Coder-Next-4bit --json
ax-engine download mlx-community/gemma-4-e2b-it-4bit --json
# Optional: copy snapshot to an explicit directory
ax-engine download qwen36-35b --dest /Volumes/Models/qwen36-35b
# Python SDK
from ax_engine import download_model
path = download_model("mlx-community/Qwen3.6-35B-A3B-4bit")

Built-in download aliases:

| Alias | Repo |
|---|---|
| qwen36-35b | mlx-community/Qwen3.6-35B-A3B-4bit |
| qwen36-27b, qwen36-27b-5bit, qwen36-27b-6bit, qwen36-27b-8bit | mlx-community/Qwen3.6-27B-{4,5,6,8}bit |
| gemma4-e2b, gemma4-e2b-5bit, gemma4-e2b-6bit, gemma4-e2b-8bit | mlx-community/gemma-4-e2b-it-{4,5,6,8}bit |
| gemma4-12b, gemma4-12b-6bit | mlx-community/gemma-4-12B-it-{4,6}bit |
| gemma4-26b | mlx-community/gemma-4-26b-a4b-it-4bit |
| gemma4-31b | mlx-community/gemma-4-31b-it-4bit |

Leave downloads in the Hugging Face Hub cache by default — it's shared with mlx_lm and other HF-aware tools, avoiding duplicate copies of large weights. Use --dest only when you want an explicit copy outside the shared cache.
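
To see what is already in the shared cache, it helps to know how the Hub cache names its folders: a repo ID `org/name` becomes `models--org--name`, with one directory per revision under `snapshots/`. The sketch below lists those snapshot directories. `hub_repo_dir` and `cached_snapshots` are hypothetical helper names, and the code assumes the default cache root `~/.cache/huggingface/hub`; if your environment relocates the cache, pass that root instead.

```python
from pathlib import Path


def hub_repo_dir(repo_id: str, cache_root: Path) -> Path:
    """Map "org/name" to the Hub cache folder "models--org--name"."""
    return cache_root / ("models--" + repo_id.replace("/", "--"))


def cached_snapshots(repo_id: str, cache_root: Path) -> list:
    """List snapshot directories already present for repo_id (empty if none)."""
    snapshots = hub_repo_dir(repo_id, cache_root) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Assumed default cache root; adjust if your cache lives elsewhere.
    root = Path.home() / ".cache" / "huggingface" / "hub"
    for snap in cached_snapshots("mlx-community/Qwen3-4B-4bit", root):
        print(snap)
```

Each printed path is a candidate model directory; it still needs a model-manifest.json before AX Engine can serve it.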
If you already have mlx_lm installed, its downloads land in the same cache and AX Engine can auto-discover them:
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-4bit --prompt "x" --max-tokens 1
ax-engine-bench generate-manifest ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/<hash>
ax-engine serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/<hash> --port 8080

Raw checkpoints need sanitization before AX Engine can load them:
pip install mlx-lm
mlx_lm.convert --hf-path <org/model> --mlx-path /path/to/dest -q --q-bits 4
ax-engine-bench generate-manifest /path/to/dest
ax-engine serve /path/to/dest --port 8080

Both paths above require model-manifest.json. Download helpers generate it automatically. To run it directly:
ax-engine-bench generate-manifest /path/to/model # pip, Homebrew, or built binary
cargo run -p ax-engine-core --bin generate-manifest -- /path/to/model # source

For local agent and chatbot workloads, AX Engine is a better fit for a small model portfolio than for one model serving every workflow. See the FAQ model-stack guidance for the full recommendation.
| Hardware | Recommended memory | Best fit |
|---|---|---|
| Mac mini M4 Pro | 64 GB RAM | Compact always-on local chatbot and agent server |
| MacBook Pro M5 Max | 128 GB RAM | Portable high-throughput chatbot, agent, and coding stack |
| Mac Studio M3 Ultra | 256 GB RAM | Larger local model portfolio, longer contexts, and heavier parallel workloads |
| Role | Recommended model | Setup | App | Why |
|---|---|---|---|---|
| Default chatbot | Gemma 4 26B-A4B / 31B | 4-bit or 6-bit, 16K-32K | ax-studio | General assistant path for reasoning, chat, JSON/function calling, and on-device agent workflows |
| General agentic model | Qwen3.6-35B-A3B / Qwen3.6-27B | 35B A3B 4-bit; 27B 4/5/6/8-bit, 16K-32K | AX server / SDK | Strong general agent and coding balance; sparse MoE keeps active compute low |
| Coding specialist | Qwen3-Coder-Next | 6-bit + 16K default; 4-bit/5-bit + 32K when needed | ax-code | Dedicated local coding-agent path for repo editing, tool use, and long coding sessions |
AX Engine gives local inference work a stable runtime contract:
- Repo-owned MLX execution tracks direct-support model families separately from delegated routes — delegated results are not AX-owned throughput claims.
- Dual-family speculative decoding supports both Qwen3.6's fused MTP sidecar and Gemma 4's separate assistant-drafter contract in the same repo-owned runtime and benchmark tooling.
- N-gram acceleration reaches up to 3.1× mlx_lm decode throughput on high-hit benchmark rows with no second draft model.
- Long-session prefix reuse restores physical MLX KV snapshots on validated cache layouts, so long-running chat and agent loops avoid repeatedly pre-filling accumulated context. See `docs/LONG-CONTEXT.md`.
- Workload-contract tooling (`ax-engine-bench`) validates correctness, determinism, route identity, and regression across checked-in manifests.
- Delegated routes (`mlx_lm_delegated`, `llama_cpp`) cover explicit compatibility cases without polluting AX-owned performance claims.
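
To illustrate why n-gram acceleration needs no second draft model, here is a generic prompt-lookup sketch. It is not AX Engine's implementation: `ngram_draft` and its parameters are hypothetical, and the real runtime's matching and gating rules are not described in this README. The idea is only that draft tokens can be copied from an earlier occurrence of the current trailing n-gram, which is why "high-hit" rows are those with repetitive context.

```python
def ngram_draft(tokens: list, n: int = 3, max_draft: int = 4) -> list:
    """Propose draft tokens by matching the trailing n-gram earlier in the context."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan right to left so the most recent earlier match wins. The start index
    # stops short of the tail itself and leaves at least one token to copy.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + max_draft]
    return []


if __name__ == "__main__":
    # The tail [1, 2, 3] occurred at the start, so the tokens after it are proposed.
    print(ngram_draft([1, 2, 3, 4, 5, 1, 2, 3]))  # [4, 5, 1, 2]
    print(ngram_draft([1, 2, 3, 4]))              # [] (no earlier match)
```

Any proposed tokens would still have to be verified by the target model before being emitted; a miss simply costs nothing extra.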
mlx_lm is the canonical MLX reference. AX Engine compares against mlx_lm.benchmark and keeps mlx_lm.server as the explicit delegated compatibility route when AX does not yet have a repo-owned graph. See the FAQ for the boundary between MLX kernels and AX-owned runtime behavior.
Design details: Scheduler · KV Cache · Long Context · Benchmark Design.
| Path | Use it for | Current scope |
|---|---|---|
| Repo-owned MLX runtime | Direct-support MLX model families and repo-owned performance claims backed by benchmark artifacts | Local Apple Silicon inference, token-based server/SDK requests, direct and n-gram acceleration modes |
| mlx_lm_delegated | MLX text models that upstream mlx-lm supports before AX has a repo-owned graph | Blocking and SSE text generation through a user-provided mlx_lm.server; not AX-owned token/KV performance |
| llama_cpp | GGUF and non-MLX local inference | Delegated llama.cpp server/CLI compatibility; route-contract evidence, not repo-owned MLX throughput |
The runtime report exposes selected_backend, support_tier, and resolution_policy so callers and benchmark artifacts can distinguish these paths. For the exact OpenAI-shaped endpoint contract see docs/API-COMPATIBILITY.md.
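
A caller that aggregates benchmark numbers might use the report to keep delegated results out of AX-owned throughput figures. The sketch below assumes the report is available as a dictionary and that `selected_backend` carries the route names from the table above as plain strings; both are assumptions, and `is_ax_owned` is a hypothetical helper. The value `"mlx"` in the usage lines is a placeholder, not a documented backend name.

```python
# Route names taken from the path table; treated here as the delegated set.
DELEGATED_ROUTES = {"mlx_lm_delegated", "llama_cpp"}


def is_ax_owned(report: dict) -> bool:
    """True when the report's selected_backend is not a delegated route."""
    backend = report.get("selected_backend")
    if backend is None:
        raise KeyError("runtime report has no selected_backend field")
    return backend not in DELEGATED_ROUTES


if __name__ == "__main__":
    # "mlx" is a hypothetical value used only for illustration.
    print(is_ax_owned({"selected_backend": "mlx"}))        # True
    print(is_ax_owned({"selected_backend": "llama_cpp"}))  # False
```

Check docs/API-COMPATIBILITY.md for the actual field values before relying on a filter like this.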
AX Engine's public performance claims are scoped to benchmark artifacts that preserve route identity, model artifacts, prompt suite, sampler settings, and repository provenance.
| Area | Public claim | Status |
|---|---|---|
| Gemma 4 12B assistant-MTP | 2.34-2.73x faster than same-artifact AX direct decode on the 12B MTP prompt suites | Announcement-ready |
| Gemma 4 26B/31B assistant-MTP | 97.3%-99.2% accept rate; MTP+n-gram is workload-dependent (+5.2% for 26B, -0.7% for 31B) in the current matrix | Scoped; no public direct-speedup claim yet |
| Qwen3.6 35B-A3B MTP | AX MTP is +59.8% vs the retained MTPLX reference, and AX MTP+n-gram is +59.9% vs MTPLX on the sidecar-fair aggregate | Announcement-ready |
| Qwen3.6 27B MTP | AX MTP is +7.8% vs the retained MTPLX reference; MTP+n-gram is +8.7% vs MTPLX and +0.8% vs pure AX MTP | Opt-in / workload-dependent |
| Qwen3-Coder-Next direct | AX direct decode is +3.3%-6.6% vs mlx_lm and +17.1%-23.6% vs shape-compatible llama.cpp Metal (b9700, flash-attn) at 128/512/2048 tokens with the opt-in fused expert block enabled | Scoped; direct-only |
| N-gram acceleration | Up to 3.1x mlx_lm decode throughput on high-hit benchmark rows without a second draft model | Workload-dependent |
Direct support means AX has a repo-owned ax-engine-mlx graph for the model family and loads MLX safetensors through the AX manifest path. Other MLX text models can still use the explicit mlx_lm_delegated compatibility route.
| Family | Direct model IDs | Current scope | Architecture notes |
|---|---|---|---|
| Gemma 4 | gemma-4-e2b-it, gemma-4-e4b-it, gemma-4-12b-it, gemma-4-26b-a4b-it, gemma-4-31b-it | Repo-owned MLX runtime; MLX affine 4/5/6/8-bit weights; assistant-MTP benchmark path | Dense unified 12B, per-layer embedding, and MoE variants; sliding-window + full attention, logit softcapping |
| Qwen 3 | Qwen3-4B-4bit and manifest-backed dense checkpoints | Repo-owned MLX runtime | SwiGLU dense FFN; per-head QK norm |
| Qwen 3.5 | Qwen3.5-9B-MLX-4bit | Repo-owned MLX runtime | Linear attention + MoE FFN; attn_output_gate per-head interleaving |
| Qwen 3.6 | Qwen3.6-35B-A3B 4-bit, Qwen3.6-27B 4/5/6/8-bit | Repo-owned MLX runtime; fused sidecar-MTP benchmark path | qwen3_next: GatedDelta linear attention, full attention with per-head sigmoid gate, sparse top-k MoE |
| Qwen3-Coder-Next | Qwen3-Coder-Next-4bit | Repo-owned MLX runtime; direct coding-agent path | qwen3_next coding-specialist checkpoint; hybrid linear/full attention, sparse top-10-of-512 MoE, shared expert, 8-bit router/shared-expert gates |
| GLM 4.7 Flash | glm4_moe_lite / glm4.7-flash-4bit | Repo-owned MLX runtime; MLX affine 4-bit weights | Flash MLA attention, sigmoid-routed MoE with dense+MoE layer split, shared expert; post-attention RMS norm |
Adding a new architecture means implementing the model graph in ax-engine-mlx, not wiring up a generic loader. Architecture code alone is not a direct-support claim — a model requires a repo-owned graph, manifest, smoke coverage, and benchmark evidence before promotion here. LLaMA, Mistral, Mixtral, DeepSeek, and unlisted Gemma/Qwen variants should use the explicit delegated route.
Before promoting another architecture or checkpoint, run scripts/probe_mlx_model_support.py --model-dir <model-dir>; a model should report repo_owned_runtime_ready only when its manifest, local reference files, and runtime path are all present.
Full list: docs/SUPPORTED-MODELS.md.
Full result tables and interpretation live in docs/PERFORMANCE.md. Benchmark methodology, test setup, and reproduction details live in docs/BENCHMARKS.md.
Gemma 4 12B (model_type: gemma4_unified) is reported separately from the per-layer-embedding E2B/E4B and MoE 26B/31B checkpoints because it has a distinct graph, multimodal tensor contract, and benchmark boundary. Upstream mlx_lm 0.31.3 cannot load it (ValueError: Model type gemma4_unified not supported), so the direct peer here is llama.cpp Metal on a shape-compatible GGUF.
Note
AX Engine's repo-owned native MLX route supports Gemma 4 12B text plus inline base64 image/audio/video chat. Delegated compatibility routes remain text-first; /v1/generate accepts the processed multimodal_inputs.gemma4_unified tensor contract.
At a glance:
- Direct decode: AX native MLX reaches 61.7-66.0 tok/s on the bit-comparable 4-bit-FFN artifact versus llama.cpp Metal's 56.9-59.2 tok/s depth-matched range.
- Context depth: AX's direct margin is +11% / +11% / +8% versus llama.cpp matched-depth decode at 128 / 512 / 2,048 prompt tokens.
- Assistant-MTP: depth-2 assistant-MTP reaches 82.9-96.8 tok/s on code-like prompt suites, a 2.34-2.73x same-artifact speedup over AX direct decode.
- Why the earlier result flipped: the upstream MLX snapshot keeps FFN weights at 8-bit, so it reads about 1.65x the bytes of the re-quantized 4-bit-FFN artifact. Decode is bandwidth-bound; matching quantization closes the gap.
Direct Decode
AX direct rows use the 4-bit-FFN MLX artifact and random-token prompts. mlx_lm is absent because it has no gemma4_unified graph. The llama.cpp rows are shape-compatible external GGUF references, not prompt-hash-parity MLX rows.
| Prompt tokens | AX decode | llama.cpp decode (depth 0) | llama.cpp decode (matched depth) | AX prefill | llama.cpp prefill | AX TTFT (ms) | llama.cpp TTFT (ms) |
|---|---|---|---|---|---|---|---|
| 128 | 66.0 | 59.8 | 59.2 | 1,171 | 1,252 | 109 | 102 |
| 512 | 65.6 | 59.6 | 58.9 | 1,839 | 1,745 | 278 | 293 |
| 2048 | 61.7 | 59.7 | 56.9 | 2,004 | 1,690 | 1,022 | 1,212 |
Read the two llama.cpp decode columns carefully:
- `depth 0` is plain `llama-bench tg`, decoding from an empty context and representing llama.cpp's best case.
- `matched depth` uses `-d {prompt} -n 128`, so decode happens after the same prompt depth AX has already prefilled.
- AX wins the matched-depth comparison at every prompt size, and prefill also leads at 512 and 2,048 tokens.
The table uses the bit-comparable 4-bit-FFN AX artifact (scripts/requantize_gemma4_12b_ffn_4bit.py), about 4.5 bpw versus the Q4_K_M GGUF's about 4.8 bpw. The upstream mlx-community/gemma-4-12B-it-4bit snapshot keeps the FFN at 8-bit (~10.98 GB) and trails llama.cpp at about 46 tok/s. That is a bytes-read handicap, not an AX runtime result.
Memory bandwidth share:
Decode is memory-bandwidth-bound on Apple Silicon: each token reads the model weights once, so decode tok/s is set by bytes-read and how close the engine gets to the memory ceiling. Measured M5 Max GPU peak read bandwidth ≈ 577 GB/s (MLX reduction over a 6 GB array).
| Engine / quantization | Weights/token | Decode tok/s | Effective BW | % of 577 GB/s peak |
|---|---|---|---|---|
| AX — 8-bit FFN (upstream 4bit snapshot) | 10.98 GB | 45.0 | 494 GB/s | 86% |
| AX — 4-bit FFN (re-quantized) | 6.74 GB | 64.4 | 434 GB/s | 75% |
| llama.cpp Q4_K_M — decode @ depth 512 | 7.38 GB | 58.9 | 435 GB/s | 75% |
| llama.cpp Q4_K_M — decode @ depth 0 (tg) | 7.38 GB | 59.8 | 441 GB/s | 76% |
The bandwidth view is the key explanation: AX is not under-utilizing memory. The re-quantized AX row sustains 434 GB/s, in the same band as llama.cpp's 435 GB/s at matched depth. The remaining direct-decode difference is bytes read per token: uniform 4-bit group-64 reduces AX to 6.74 GB/token, while Q4_K_M reads 7.38 GB/token. The 8-bit-FFN upstream snapshot has higher bus utilization (86%) but worse speed because it reads far more data.
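
The Effective BW column follows directly from the other two: weights read per token multiplied by decode tokens per second. The short sketch below reproduces the table's figures from its own inputs; `effective_bandwidth` is an illustrative helper, not a function from the repository.

```python
PEAK_GBPS = 577.0  # measured M5 Max GPU peak read bandwidth quoted above


def effective_bandwidth(weights_gb: float, tok_per_s: float) -> float:
    """Weights read per token (GB) times tokens per second gives GB/s."""
    return weights_gb * tok_per_s


# (label, weights per token in GB, decode tok/s), copied from the table.
rows = [
    ("AX 8-bit FFN", 10.98, 45.0),
    ("AX 4-bit FFN", 6.74, 64.4),
    ("llama.cpp Q4_K_M @ depth 512", 7.38, 58.9),
    ("llama.cpp Q4_K_M @ depth 0", 7.38, 59.8),
]

for name, gb, tps in rows:
    bw = effective_bandwidth(gb, tps)
    print(f"{name}: {bw:.0f} GB/s, {100 * bw / PEAK_GBPS:.0f}% of peak")
```

Running it gives 494, 434, 435, and 441 GB/s, matching the table, which makes the trade-off concrete: the 8-bit-FFN row moves the most bytes per second yet decodes slowest because each token costs 10.98 GB of reads.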
Assistant-MTP speculative decode (depth 2):
The assistant-MTP path runs on the assistant bundle and adds a second speculative lever that neither mlx_lm nor llama.cpp has for this model. The published rows use depth-2 draft, first-token confidence gate 0.90, deep-token gate 0.999, and GPU-exact confidence.
Pure assistant-MTP is the default. MTP+n-gram stacking remains opt-in because it is workload-dependent and did not beat pure MTP on every suite.
| Suite | Depth | AX direct tok/s | AX MTP tok/s | AX MTP accept | AX MTP+ngram tok/s | AX MTP+ngram accept | n-gram status |
|---|---|---|---|---|---|---|---|
| flappy | 2 | 35.5 | 96.8 | 98.7% | 95.0 | 98.7% | no observed draft path |
| long_code | 2 | 35.8 | 92.3 | 99.1% | 95.2 | 99.1% | no observed draft path |
| python_modules_long | 2 | 35.4 | 82.9 | 97.5% | 82.5 | 97.5% | no observed draft path |
No runnable peer benchmark covers Gemma 4 12B assistant-MTP in this matrix: mlx_lm cannot load gemma4_unified, llama.cpp does not expose a Gemma assistant-MTP path, and available MTP peer tools target different sidecar contracts. The AX direct column is retained as a same-prompt baseline from the MTP harness prompts, artifact, and sampler. It is a same-artifact AX improvement view, not a peer-engine MTP comparison.
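
The 2.34-2.73x range is the per-suite ratio of AX MTP tok/s to AX direct tok/s from the table above. A small sketch of that arithmetic, using only the published rows (`speedup` is an illustrative helper):

```python
# (AX direct tok/s, AX MTP tok/s) per suite, copied from the table.
rows = {
    "flappy": (35.5, 96.8),
    "long_code": (35.8, 92.3),
    "python_modules_long": (35.4, 82.9),
}


def speedup(direct_tps: float, mtp_tps: float) -> float:
    """Same-artifact speedup of assistant-MTP over direct decode."""
    return mtp_tps / direct_tps


for suite, (direct, mtp) in rows.items():
    print(f"{suite}: {speedup(direct, mtp):.2f}x")
```

This yields 2.73x, 2.58x, and 2.34x, so the quoted range spans the best and worst suite rather than an average.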
MTP prefill and TTFT — same run:
| Suite | AX MTP prefill | AX MTP+ngram prefill | AX MTP ttft ms | AX MTP+ngram ttft ms |
|---|---|---|---|---|
| flappy | 1,928 | 1,952 | 187 | 187 |
| long_code | 2,040 | 2,024 | 390 | 394 |
| python_modules_long | 1,831 | 1,812 | 195 | 198 |
Methodology and artifacts:
Direct rows use the 4-bit-FFN artifact, greedy-equivalent sampler, 128 generated tokens, 5 repetitions, 15 s cooldown, and random-token prompts following the mlx_lm.benchmark contract. llama.cpp decode is shown both at depth 0 (tg) and at matched context depth (-d {prompt}). MTP rows use the same 4-bit-FFN assistant-MTP artifact, depth-2 draft, temperature=0.6, top_p=0.95, top_k=20, 1,000 generated tokens, 5 repetitions, 30 s cooldown, and 10 s inter-case cooldown. Host/runtime for the latest direct llama.cpp peer rerun: Apple M5 Max · llama.cpp b9700 / ggml 0.15.2 (Metal, flash-attn) · mlx_lm 0.31.3 has no gemma4_unified support.
Full artifacts: 2026-06-20-gemma-4-12b-it-4bit-direct (AX direct rerun; chart artifact with retained llama.cpp reference rows in gemma-4-12b-it-4bit-with-llama-reference.json; llama.cpp GGUF provenance in llama_cpp_gguf_provenance.json) · 2026-06-20-gemma4-assistant-mtp-ax-mtp-only (AX-only assistant-MTP refresh).
Gemma 4 12B multimodal timing is reported separately from the text benchmark above because media inputs expand into validated Gemma4 unified soft-token spans before the MLX graph runs. The publication-grade timing artifact covers all 17 AX Engine image/audio/video cases through both the native /v1/generate/stream prefill path and the OpenAI-compatible /v1/chat/completions path. The llama.cpp Metal peer rows are cold OpenAI chat endpoint rows for the supported image/audio cases, with prompt cache, slot prompt reuse, and context checkpoints disabled and raw llama.cpp timing/cache metadata recorded.
| Coverage | AX cases measured | Expanded input | Median runner prefill TTFT | Median prefill | Median AX chat E2E | llama.cpp peer endpoint |
|---|---|---|---|---|---|---|
| Image | 5 | 275-535 tokens | 189.4-316.2 ms | 1,447.8-1,692.1 tok/s | 1,440.8-1,704.8 ms | 5 measured, 401.6-518.7 ms cold chat endpoint |
| Audio | 4 | 32-771 tokens | 75.8-419.4 ms | 422.1-1,838.4 tok/s | 1,466.5-1,819.2 ms | 3 measured, 338.0-464.5 ms cold chat endpoint; 1 skipped: llama.cpp audio cap unstable |
| Video | 4 | 92-2,355 tokens | 106.1-2,973.5 ms | 792.0-1,681.0 tok/s | 1,500.2-4,441.7 ms | 4 skipped: llama.cpp video path unsupported |
| Combined | 4 | 181-442 tokens | 133.2-256.7 ms | 1,359.1-1,721.6 tok/s | 1,532.4-1,771.6 ms | 1 measured, 507.9 ms cold chat endpoint; 3 skipped: video unsupported |
Rows use /v1/generate/stream with processed multimodal_inputs.gemma4_unified for runner-time prefill and /v1/chat/completions with inline media for client-wall E2E latency. This run used max_output_tokens=8, 1 warmup, 3 measured repetitions, --max-batch-tokens 4096, a release server binary, 128 GB unified memory, and a clean tracked worktree at 67ce2675a469cf5eecba687f348c649e663011b8.
The llama.cpp peer rows use reference llama.cpp 19bba67c1 with Metal, gemma-4-12B-it-Q4_K_M.gguf, and mmproj-gemma-4-12B-it-Q8_0.gguf. They are OpenAI chat endpoint-latency rows for supported image/audio inputs, not native prefill rows and not a throughput comparison. The fair-peer launch contract is --cache-ram 0 --no-cache-idle-slots --slot-prompt-similarity 0 --ctx-checkpoints 0 plus --llama-cache-policy prompt_cache_disabled; the artifact records raw llama.cpp timings, prompt_tokens_details.cached_tokens, server prompt token counts, and cache counts. Published peer rows require zero reported cached prompt tokens and server prompt-eval token counts at least as large as the cold request's reported prompt tokens. Video-containing peer rows are explicit skips because the local llama.cpp Gemma 4 path does not expose a like-for-like video contract, and audio_cap is skipped because this llama.cpp Gemma 4 audio path fails the warmup-plus-three-repetition contract on the largest audio fixture. The peer chart excludes one measured image case whose AX and llama.cpp output token counts differ, so chart bars compare matched-output rows only. For this Gemma 4 llama.cpp build, most peer text appears in reasoning_content rather than message.content, so the benchmark validates positive response_chars.
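
The publication rule for peer rows can be summarized as three checks. The sketch below is an illustration of that rule, not the logic of scripts/check_gemma4_multimodal_benchmark_artifact.py: the `PeerRow` fields and `rejection_reason` are hypothetical names, and the real artifact schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerRow:
    cache_policy: str               # e.g. "prompt_cache_disabled"
    cached_prompt_tokens: int       # from prompt_tokens_details.cached_tokens
    server_prompt_eval_tokens: int  # prompt tokens the peer server evaluated
    reported_prompt_tokens: int     # prompt tokens reported for the cold request


def rejection_reason(row: PeerRow) -> Optional[str]:
    """Return why a peer row is not publishable, or None if it passes."""
    if row.cache_policy != "prompt_cache_disabled":
        return "cache policy is unknown or not disabled"
    if row.cached_prompt_tokens != 0:
        return "cached prompt tokens were reported"
    if row.server_prompt_eval_tokens < row.reported_prompt_tokens:
        return "server prompt-eval count is too low for a cold prompt"
    return None


if __name__ == "__main__":
    cold = PeerRow("prompt_cache_disabled", 0, 300, 300)
    warm = PeerRow("prompt_cache_disabled", 128, 172, 300)
    print(rejection_reason(cold))  # None
    print(rejection_reason(warm))  # cached prompt tokens were reported
```

The point of the third check is that a server which evaluated fewer prompt tokens than the request contained must have reused state, even if it reported no cached tokens.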
Full artifact: 2026-06-09-gemma4-12b-multimodal-cold-peer-matrix. Render charts with:
python3 scripts/render_gemma4_multimodal_charts.py \
--artifact benchmarks/results/gemma4-multimodal/2026-06-09-gemma4-12b-multimodal-cold-peer-matrix.json \
--assets-dir docs/assets

To reproduce the supported-case image/audio/video timing matrix from a Gemma 4 12B AX Engine server, use the matrix runner and validate the resulting artifact before publishing charts:
python3 scripts/bench_gemma4_multimodal.py \
--url http://127.0.0.1:18080 \
--model gemma-4-12B-it \
--model-dir /path/to/gemma-4-12B-it-4bit \
--cases all \
--layers native_runtime_prefill,openai_chat_e2e \
--warmup 1 \
--repetitions 3 \
--cooldown 1 \
--max-output-tokens 8 \
--server-command "target/release/ax-engine-server --model-id gemma-4-12B-it --mlx --mlx-model-artifacts-dir /path/to/gemma-4-12B-it-4bit --max-batch-tokens 4096 --port 18080" \
--llama-url http://127.0.0.1:<peer-port> \
--llama-binary /path/to/llama-server \
--llama-gguf <path-to-gemma-4-12B-it-Q4_K_M.gguf> \
--llama-mmproj <path-to-mmproj-gemma-4-12B-it-Q8_0.gguf> \
--llama-cache-policy prompt_cache_disabled \
--output benchmarks/results/gemma4-multimodal/gemma4-12b-multimodal-cold-peer-matrix.json
python3 scripts/check_gemma4_multimodal_benchmark_artifact.py \
benchmarks/results/gemma4-multimodal/gemma4-12b-multimodal-cold-peer-matrix.json \
--min-repetitions 3 \
--require-modalities image,audio,video \
--require-build-provenance \
--readme-ready

For a fair llama.cpp peer rerun, launch llama-server with prompt cache, slot prompt reuse, and context checkpoints disabled for the peer server, for example --cache-ram 0 --no-cache-idle-slots --slot-prompt-similarity 0 --ctx-checkpoints 0, then validate with --readme-ready. Peer rows with unknown cache policy, reported cached prompt tokens, or server prompt-eval token counts that are too low for a cold prompt are rejected by the artifact checker. Without a matching Gemma 4 12B GGUF and multimodal projector, peer rows are explicit skips. Video rows remain explicit skips until the peer server exposes a like-for-like video path for Gemma 4 12B.
Prepare Gemma 4 12B assistant-MTP artifacts
Gemma 4 12B MLX target and assistant repos are already converted to MLX safetensors — they do not go through ax-engine convert-mtplx or scripts/prepare_mtp_sidecar.py. Download the target and matching assistant, then package them with the Gemma-specific helper:
hf download mlx-community/gemma-4-12B-it-4bit
hf download mlx-community/gemma-4-12B-it-assistant-4bit
python3 scripts/prepare_gemma4_assistant_mtp.py \
--target mlx-community/gemma-4-12B-it-4bit \
--assistant mlx-community/gemma-4-12B-it-assistant-4bit
hf download mlx-community/gemma-4-12B-it-6bit
hf download mlx-community/gemma-4-12B-it-assistant-6bit
python3 scripts/prepare_gemma4_assistant_mtp.py \
--target mlx-community/gemma-4-12B-it-6bit \
--assistant mlx-community/gemma-4-12B-it-assistant-6bit

The default outputs are quant-specific synthetic HF cache snapshots: models--ax-local--gemma-4-12b-it-4bit-assistant-mtp/snapshots/v1/ and models--ax-local--gemma-4-12b-it-6bit-assistant-mtp/snapshots/v1/. Each package contains the target files, an assistant/ subtree, and ax_gemma4_assistant_mtp.json. Generate or validate the AX manifest before serving:
ax-engine-bench generate-manifest \
~/.cache/huggingface/hub/models--ax-local--gemma-4-12b-it-4bit-assistant-mtp/snapshots/v1 \
--validate
ax-engine-bench generate-manifest \
~/.cache/huggingface/hub/models--ax-local--gemma-4-12b-it-6bit-assistant-mtp/snapshots/v1 \
--validate

AX Engine's key Mac advantage is dual-family speculative decoding: one repo-owned runtime supports both Gemma 4's separate assistant-drafter contract and Qwen3.6's fused sidecar contract. A single benchmark surface records route identity, sampler, prompt suite, cooldown, accept behavior, and artifact provenance so the two MTP families are comparable without pretending they use the same architecture.
Unlike Qwen's fused mtp.* sidecar, Gemma 4's multi-token prediction uses a small assistant drafter that shares the target's tokenizer and embedding table, drafts tokens from the target's last-layer hidden state, and attends to the target's own KV cache. Draft depth is configurable: 26B/31B benchmarks use depth 1 (one draft token per step); 12B uses depth 2 (two draft tokens per step, with the second conditioned on the first). AX runs it assistant-MTP-only (mtp, default) and with n-gram stacked on top (mtp-ngram, opt-in).
A draft confidence gate (AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE, default 0.90 for the first draft token; deep draft default 0.999) only proposes a draft when the drafter's top-token probability clears the threshold, keeping accept high while remaining correctness-preserving. Lower the gate toward 0 for more speculation on predictable content; raise it for flatter sampled chat.
The gate is a speed knob, not a quality knob -- lowering it does not corrupt output (e.g. code). Every drafted token is verified by the target model before it is emitted (rejection sampling when draft log-probs exist, greedy argmax-match otherwise), so a mismatched draft is discarded and replaced by the target's own token. Relaxing the gate only lets the drafter propose more speculative tokens; it lowers the accept rate and shifts throughput, but the emitted sequence is still the verified target sequence. Output-altering approximations are separate, explicit opt-ins such as top-k target softmax, never the confidence gate.
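
The claim that the gate changes speed but not output can be shown with a toy model. The sketch below is a deliberately simplified illustration, not AX Engine's implementation: `target_next` and `drafter_next` are made-up stand-ins for the target and the assistant drafter, only one draft token is considered per step, and only the greedy argmax-match path is modeled (rejection sampling is not shown).

```python
def target_next(context: list) -> int:
    """Toy target model: a deterministic next token (stand-in for greedy argmax)."""
    return (sum(context) * 31 + len(context)) % 50


def drafter_next(context: list) -> tuple:
    """Toy drafter: right with high confidence, wrong with low confidence every 4th step."""
    token = target_next(context)
    if len(context) % 4 == 0:
        return (token + 1) % 50, 0.60  # deliberately wrong, low-confidence draft
    return token, 0.95


def generate(prompt: list, steps: int, gate: float) -> tuple:
    """Return (emitted tokens, drafts proposed, drafts accepted) for a given gate."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < steps:
        draft, confidence = drafter_next(out)
        verified = target_next(out)  # the target always has the final say
        if confidence >= gate:       # the gate only decides whether to propose
            proposed += 1
            accepted += int(draft == verified)
        out.append(verified)         # a mismatched draft is replaced by the target token
    return out[len(prompt):], proposed, accepted


if __name__ == "__main__":
    loose = generate([1, 2, 3], 16, gate=0.0)
    strict = generate([1, 2, 3], 16, gate=0.9)
    print(loose[0] == strict[0])   # True: identical emitted sequence
    print(loose[1:], strict[1:])   # (16, 12) (12, 12): accept rate differs
```

With the gate at 0 every draft is proposed and a quarter are rejected; with the gate at 0.9 only confident drafts are proposed and all are accepted. The emitted tokens are the same in both runs because each one comes from the target's verification step.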
Choosing the gate by workload. Because the output is verified either way, the gate is a throughput dial, not a safety one — pick it by how predictable your content is, and (only for temperature-sampled chat) how much reply diversity you want. Lower gate = more speculation = lower accept rate but more multi-token runs. Starting points:
| Workload | Suggested gate | Expected accept¹ | Why |
|---|---|---|---|
| Coding | ~0.90 (aggressive) | high (~93–96% on 12B code suites) | Sharply peaked output makes the first draft token useful even with a looser gate. Deterministic, so no diversity cost -- tune purely for speed. |
| Agentic (tools / JSON / reasoning) | ~0.90–0.95 | high (~93–96% expected on code-like templates) | Templated and low-temperature like code; output is verified, so no correctness risk. Keep n-gram stacking opt-in unless the workload is measured. |
| Chatbot | ~0.99–0.999 if sampling for variety; lower at low temperature | drops on flat text | Natural language is flatter, so accept falls faster; at temperature > 0 a low gate makes replies follow the greedy token and feel less varied. Here a high gate protects diversity, not correctness. |

¹ Only the code-like benchmark suites (`flappy`, `long_code`, `python_modules_long`) are measured for 12B at the Phase 4 default; they sit at 97.5-99.1% assistant accept and still deliver 2.34-2.73x same-artifact speedup over direct decode. The agentic and chatbot figures are expected ranges, and the suggested gates are starting points, not universal optima. The `assistant_mtp_gate*` ablation profiles lock the exact per-workload sweet spot.
One flag instead of the env vars. Rather than hand-set the gate knobs, the server accepts --speculation-profile {auto,coding,agentic,chatbot} (short -s, alias --spec; or env AX_MLX_SPECULATION_PROFILE), which bundles the MTP and n-gram configuration into one posture. auto (default) is temperature-driven: it keeps the shipped gate at low/zero temperature and raises it for higher-temperature sampled chat to protect reply diversity. coding/agentic keep the shipped gate defaults — the 12B ablation found lowering the Gemma gate does not add code throughput, so the default already is the throughput setting — while chatbot raises the gate and prefers the n-gram utility gate. Any explicit per-knob env var (e.g. AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE) still overrides the profile. The resolved posture is recorded in route metadata as ax_mlx_speculation_profile.
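The precedence described above (explicit env var first, then profile, with auto keyed on temperature) can be sketched as follows. This is a minimal illustration, not the server's resolver: `resolve_gate`, the raised gate value of 0.99, and the 0.7 temperature cutoff are all hypothetical. Only the profile names, the env var name, and the 0.90 shipped default come from this README.

```python
import os
from typing import Mapping, Optional

GATE_ENV = "AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE"
SHIPPED_GATE = 0.90      # shipped default named in this README
RAISED_GATE = 0.99       # hypothetical "raised" gate for sampled chat
HIGH_TEMPERATURE = 0.7   # hypothetical cutoff used by the auto profile


def resolve_gate(profile: str, temperature: float,
                 env: Optional[Mapping[str, str]] = None) -> float:
    """Return the draft confidence gate for a profile (illustrative only)."""
    env = os.environ if env is None else env
    if GATE_ENV in env:
        # An explicit per-knob env var always overrides the profile.
        return float(env[GATE_ENV])
    if profile in ("coding", "agentic"):
        return SHIPPED_GATE
    if profile == "chatbot":
        return RAISED_GATE
    if profile == "auto":
        return RAISED_GATE if temperature >= HIGH_TEMPERATURE else SHIPPED_GATE
    raise ValueError(f"unknown speculation profile: {profile!r}")


print(resolve_gate("auto", 0.0, env={}))
print(resolve_gate("auto", 1.0, env={}))
print(resolve_gate("chatbot", 1.0, env={GATE_ENV: "0.95"}))
```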
No peer engine (MTPLX, Rapid-MLX, lightning-mlx) exposes a runnable Gemma 4 assistant-MTP path, so this benchmark has no peer comparison rows.
Gemma 4 speculative decoding holds draft accept ≥97% on every cell below (97.3–99.2% across 26B / 31B × {MTP, MTP+n-gram} × {flappy, long_code, python_modules_long}).
The 26B/31B public run below is the promotion-grade assistant-MTP matrix only; unpublished retry fragments and failed direct-baseline attempts are excluded from this artifact set. Without a complete same-artifact direct row for these two models, the public verdict is scoped to MTP+n-gram versus pure assistant-MTP. In that scope n-gram is keep-opt-in: +5.2% median decode for 26B and -0.7% for 31B, with workload-specific regressions still present.
[Charts: Gemma 4 26B A4B 4-bit · Gemma 4 31B 4-bit]
| Model | Suite | Depth | AX MTP tok/s | AX MTP accept | AX MTP+ngram tok/s | AX MTP+ngram accept |
|---|---|---|---|---|---|---|
| Gemma 4 26B A4B 4-bit | flappy | 1 | 128.8 | 99.2% | 137.3 | 99.2% |
| Gemma 4 26B A4B 4-bit | long_code | 1 | 136.7 | 99.0% | 136.9 | 99.0% |
| Gemma 4 26B A4B 4-bit | python_modules_long | 1 | 130.1 | 98.7% | 125.3 | 98.7% |
| Gemma 4 31B 4-bit | flappy | 1 | 39.4 | 99.2% | 39.1 | 99.2% |
| Gemma 4 31B 4-bit | long_code | 1 | 40.0 | 99.1% | 40.4 | 99.1% |
| Gemma 4 31B 4-bit | python_modules_long | 1 | 37.4 | 97.3% | 37.1 | 97.3% |
Prefill and TTFT — same run:
| Model | Suite | AX MTP prefill | AX MTP+ngram prefill | AX MTP ttft ms | AX MTP+ngram ttft ms |
|---|---|---|---|---|---|
| Gemma 4 26B A4B 4-bit | flappy | 2,690 | 2,711 | 131 | 130 |
| Gemma 4 26B A4B 4-bit | long_code | 4,026 | 4,034 | 202 | 202 |
| Gemma 4 26B A4B 4-bit | python_modules_long | 2,923 | 2,854 | 130 | 132 |
| Gemma 4 31B 4-bit | flappy | 723 | 750 | 487 | 478 |
| Gemma 4 31B 4-bit | long_code | 807 | 809 | 987 | 980 |
| Gemma 4 31B 4-bit | python_modules_long | 741 | 743 | 472 | 472 |
The gated assistant already captures most of the speculation, so stacking n-gram on top stays opt-in. Sampler: temperature=0.6, top_p=0.95, top_k=20; 1,000 generated tokens, 5 repetitions, 30 s cooldown, 10 s inter-case cooldown. Apple M5 Max · AX Engine v6.5.2.
Full artifacts: 2026-06-20-gemma4-assistant-mtp-ax-mtp-only.
Reproduce this benchmark
python3 scripts/bench_gemma4_assistant_mtp.py \
--models 26b-a4b-4bit,31b-4bit \
--modes mtp,mtp-ngram \
--suites flappy,long_code,python_modules_long \
--max-tokens 1000 --repetitions 5
python3 scripts/render_gemma4_assistant_mtp_charts.py \
--results-dir benchmarks/results/gemma4-assistant-mtp/<run-dir>

Artifacts land under benchmarks/results/gemma4-assistant-mtp/; SVGs render into docs/assets/. Tune the accept/throughput trade-off with AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE (default 0.90; 0 disables the first-position gate) and AX_MLX_GEMMA4_ASSISTANT_MTP_DEEP_DRAFT_MIN_CONFIDENCE (default 0.999). MTP+n-gram stacking is opt-in: use --mlx-mtp-enable-ngram-stacking through the server/SDK path, or set AX_MLX_MTP_DISABLE_NGRAM_STACKING=0 for low-level benchmark runs.
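One way to explore the gate trade-off is to plan the same benchmark command once per gate value. The sketch below only builds and prints the commands (a dry run); `plan_gate_sweep` and the sweep points 0.95 and 0.99 are illustrative choices, while the script name, flags, and env var are the ones quoted above. How each run's artifacts are kept apart is left out.

```python
import os
import shlex
from typing import Dict, List, Sequence, Tuple

GATE_ENV = "AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE"
BASE_CMD = [
    "python3", "scripts/bench_gemma4_assistant_mtp.py",
    "--models", "26b-a4b-4bit,31b-4bit",
    "--modes", "mtp,mtp-ngram",
    "--suites", "flappy,long_code,python_modules_long",
    "--max-tokens", "1000", "--repetitions", "5",
]


def plan_gate_sweep(gates: Sequence[float]) -> List[Tuple[Dict[str, str], List[str]]]:
    """Pair each gate value with an environment and the unchanged benchmark command."""
    plan = []
    for gate in gates:
        env = dict(os.environ)
        env[GATE_ENV] = f"{gate:g}"
        plan.append((env, list(BASE_CMD)))
    return plan


for env, cmd in plan_gate_sweep([0.90, 0.95, 0.99]):
    print(f"{GATE_ENV}={env[GATE_ENV]} {shlex.join(cmd)}")
    # To execute for real: subprocess.run(cmd, env=env, check=True)
```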
Three-engine MTP comparison (MTPLX 0.3.7, AX Engine MTP, AX Engine MTP+n-gram) using standard Qwen/Qwen3.6-* sidecars plus matching mlx-community/*-4bit MLX bases. No Youssofal/*MTPLX* bundles are used. All three engines run on the same prompt suites, token caps, sampler, warmup, repetition count, and cooldown.
AX MTP runs the shipped default draft confidence gate (AX_MLX_MTP_DRAFT_MIN_CONFIDENCE, default 0.90). The accept columns below come from the same default-gate rerun as the throughput rows; use docs/MTP-DRAFT-GATE-THROUGHPUT.md when tuning the accept/throughput trade-off for a specific workload.
The aggregate improvement view below uses sample medians across all three suites. The 35B-A3B sidecar is the clear public win: AX MTP is +59.8% vs the retained MTPLX reference, while AX MTP+n-gram is +59.9% vs MTPLX and +0.1% vs pure AX MTP. The 27B row is workload-dependent but positive in this rerun: pure AX MTP is +7.8% vs MTPLX, and AX MTP+n-gram is +8.7% vs MTPLX and +0.8% vs pure AX MTP. Stacking remains opt-in because the per-suite win is not uniform.
The latency view follows the same boundary. On Qwen3.6 35B-A3B, AX wins every listed MTPLX prefill and TTFT row because the sidecar path stays inside the repo-owned MLX runner and records the target-model prefill separately from speculative verification. On Qwen3.6 27B, prefill and TTFT are intentionally called mixed: AX is close, but the 27B sidecar does not show a clean latency win on every suite. Treat the 35B-A3B rows as the public MTP latency advantage and the 27B rows as workload-dependent.
[Charts: Qwen3.6 27B 4-bit · Qwen3.6 35B-A3B 4-bit]
| Model | Suite | Depth | MTPLX tok/s | MTPLX accept | AX tok/s | AX accept | AX+ngram tok/s | AX+ngram accept |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6 27B 4-bit | flappy | 3 | 56.1 | 100.0% (96.0-100.0) | 61.4 | 99.7% (97.3-100.0) | 61.6 | 99.7% (97.3-100.0) |
| Qwen3.6 27B 4-bit | long_code | 3 | 57.9 | 99.7% (98.4-100.0) | 60.5 | 99.6% (98.9-100.0) | 61.0 | 99.6% (98.9-100.0) |
| Qwen3.6 27B 4-bit | python_modules_long | 3 | 52.7 | 87.6% (81.2-95.0) | 52.0 | 97.8% (97.1-98.4) | 51.6 | 97.8% (97.1-98.4) |
| Qwen3.6 35B-A3B 4-bit | flappy | 1 | 104.3 | 49.5% (42.3-60.6) | 169.0 | 100.0% (99.4-100.0) | 168.8 | 100.0% (99.4-100.0) |
| Qwen3.6 35B-A3B 4-bit | long_code | 1 | 105.6 | 51.4% (43.1-66.7) | 164.7 | 99.9% (99.6-100.0) | 166.8 | 99.9% (99.6-100.0) |
| Qwen3.6 35B-A3B 4-bit | python_modules_long | 1 | 98.2 | 42.6% (37.0-46.1) | 166.7 | 97.9% (97.7-99.3) | 163.3 | 97.9% (97.7-99.3) |
Accept cells show median with (min–max) range across the suite's cases × 5 reps, so the run-to-run spread on the borderline python_modules_long suite is visible rather than hidden behind a single point.
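The aggregate percentages quoted for this matrix can be approximated directly from the decode rows. The README's aggregate view uses sample medians from the artifacts; the sketch below takes the median of the three per-suite table values instead, which is an assumption that happens to land within rounding of the published figures.

```python
from statistics import median

# Decode tok/s per suite (flappy, long_code, python_modules_long), copied from the table.
decode_tok_s = {
    "Qwen3.6 27B 4-bit": {
        "mtplx": [56.1, 57.9, 52.7],
        "ax": [61.4, 60.5, 52.0],
        "ax_ngram": [61.6, 61.0, 51.6],
    },
    "Qwen3.6 35B-A3B 4-bit": {
        "mtplx": [104.3, 105.6, 98.2],
        "ax": [169.0, 164.7, 166.7],
        "ax_ngram": [168.8, 166.8, 163.3],
    },
}


def pct_gain(new: float, base: float) -> float:
    """Relative improvement of new over base, in percent."""
    return (new / base - 1.0) * 100.0


summary = {}
for model, cols in decode_tok_s.items():
    med = {engine: median(values) for engine, values in cols.items()}
    summary[model] = {
        "ax_vs_mtplx": pct_gain(med["ax"], med["mtplx"]),
        "ngram_vs_mtplx": pct_gain(med["ax_ngram"], med["mtplx"]),
        "ngram_vs_ax": pct_gain(med["ax_ngram"], med["ax"]),
    }
    print(model, {name: f"{value:+.1f}%" for name, value in summary[model].items()})
```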
Prefill throughput (tok/s) — same run:
MTPLX prefill is derived from prompt_tokens / prompt_eval_time_s (runner-level). AX prefill is measured at runner level. Both are pure GPU compute measurements.
| Model | Suite | Depth | MTPLX tok/s | AX MTP tok/s | AX MTP+ngram tok/s |
|---|---|---|---|---|---|
| Qwen3.6 27B 4-bit | flappy | 3 | 657 | 678 | 683 |
| Qwen3.6 27B 4-bit | long_code | 3 | 793 | 789 | 790 |
| Qwen3.6 27B 4-bit | python_modules_long | 3 | 680 | 692 | 693 |
| Qwen3.6 35B-A3B 4-bit | flappy | 1 | 1,520 | 1,795 | 1,803 |
| Qwen3.6 35B-A3B 4-bit | long_code | 1 | 2,431 | 2,673 | 2,706 |
| Qwen3.6 35B-A3B 4-bit | python_modules_long | 1 | 1,654 | 1,973 | 1,935 |
Time to first token (ms) — same run:
MTPLX TTFT is derived from prompt_eval_time_s. AX TTFT is a runner-time measurement. Both are pure prefill measurements.
| Model | Suite | Depth | MTPLX ms | AX MTP ms | AX MTP+ngram ms |
|---|---|---|---|---|---|
| Qwen3.6 27B 4-bit | flappy | 3 | 489 | 474 | 470 |
| Qwen3.6 27B 4-bit | long_code | 3 | 905 | 909 | 909 |
| Qwen3.6 27B 4-bit | python_modules_long | 3 | 509 | 506 | 505 |
| Qwen3.6 35B-A3B 4-bit | flappy | 1 | 213 | 179 | 178 |
| Qwen3.6 35B-A3B 4-bit | long_code | 1 | 295 | 269 | 265 |
| Qwen3.6 35B-A3B 4-bit | python_modules_long | 1 | 206 | 174 | 179 |
Sampler: temperature=0.6, top_p=0.95, top_k=20; 1,000 gen tokens, 5 repetitions, 30 s cooldown, 10 s inter-case cooldown. MTPLX 0.3.7 reference rows are retained from the full 2026-06-07 run; AX Engine rows are refreshed on v6.5.2.
Full artifacts: 2026-06-20-qwen36-ax-mtp-only (AX-only rerun) · 2026-06-20-qwen36-merged-ax-refresh (README chart artifact with retained MTPLX reference rows).
Reproduce this benchmark
ax-engine convert-mtplx mlx-community/Qwen3.6-27B-4bit \
--mtp-source Qwen/Qwen3.6-27B \
--fair-base-only
ax-engine convert-mtplx mlx-community/Qwen3.6-35B-A3B-4bit \
--mtp-source Qwen/Qwen3.6-35B-A3B \
--fair-base-only
python3 scripts/bench_qwen36_mtp_fair.py \
--engines mtplx ax \
--modes mtp mtp-ngram \
--models 27b-4bit 35b-a3b-4bit \
--suites flappy long_code python_modules_long \
--max-tokens 1000 \
--repetitions 5 \
--cooldown 30

convert-mtplx wraps the generic sidecar packager, applies model-specific defaults when optional knobs are omitted (Qwen3.6 27B depth 3; 35B-A3B depth 1), and validates ax_mtp_sidecar_manifest.json before reporting success. The generated summary.md, summary.json, and decode-tok-s.svg live under benchmarks/results/mtp-fair/. Full methodology and caveats in docs/PERFORMANCE.md#mtp-mode.
DiffusionGemma is a block-diffusion Gemma4 26B checkpoint, not an ordinary autoregressive decoder. AX runs it with a native MLX graph, but the measurement boundary is different from the direct-decode families below: the first visible output comes from a committed 256-token diffusion block, not from a single next-token step.
Because of that generation shape, the rows below intentionally do not use the
plain decode tok/s or TTFT labels used for autoregressive models. In Qwen,
Gemma 4 text, and other next-token decoders, TTFT means prompt prefill plus the
first single-token decode step, and decode tok/s means the steady
token-by-token autoregressive loop. DiffusionGemma instead runs a bidirectional
denoise pass over a 256-token canvas, then performs a causal commit for that
block. The comparable boundary inside this runtime is therefore time to first
block and first-block decode. Treating these as ordinary TTFT/decode rows
would make the result look directly comparable to autoregressive throughput even
though the work per visible output boundary is different.
The charts keep the same 128 / 512 / 2,048 prompt-token layout as the autoregressive sections for readability, but the values are AX first-block telemetry. Peer bars are intentionally omitted rather than shown as zero: current llama.cpp Metal cannot load the GGUF (unknown model architecture: 'diffusion-gemma'), and mlx_lm 0.31.3 cannot load the MLX snapshot (Model type diffusion_gemma not supported.).
| Prompt tokens | AX first-block decode | Denoise steps | Committed block |
|---|---|---|---|
| 128 | 30.7 tok/s | 48 | 256 tokens |
| 512 | 58.9 tok/s | 25 | 256 tokens |
| 2048 | 32.1 tok/s | 48 | 256 tokens |
Prefill and first-block latency:
| Prompt tokens | AX direct prefill | AX time to first block | llama.cpp Metal 9650 | mlx_lm 0.31.3 |
|---|---|---|---|---|
| 128 | 1,351.8 tok/s | 8,428 ms | load blocked | load blocked |
| 512 | 3,002.1 tok/s | 4,518 ms | load blocked | load blocked |
| 2048 | 4,031.4 tok/s | 8,475 ms | load blocked | load blocked |
time to first block is prefill wall time plus the first 256-token denoise-and-commit block. first-block decode is computed as 256 / ax_mlx_diffusion_block_wall_us. Use these rows to track AX's DiffusionGemma path; do not compare them directly with ordinary autoregressive TTFT or fixed-token decode throughput.
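The two definitions above can be cross-checked against the published rows. This sketch assumes prefill wall time is roughly prompt tokens divided by the reported prefill rate, and recovers the block wall time by inverting the first-block decode figure; the helper name is illustrative, and small differences from the table are expected because the inputs are rounded medians.

```python
CANVAS_TOKENS = 256

# (prompt tokens, AX prefill tok/s, AX first-block decode tok/s, published time to first block in ms)
rows = [
    (128, 1351.8, 30.7, 8428),
    (512, 3002.1, 58.9, 4518),
    (2048, 4031.4, 32.1, 8475),
]


def time_to_first_block_ms(prompt_tokens: int, prefill_tok_s: float,
                           first_block_tok_s: float) -> float:
    prefill_s = prompt_tokens / prefill_tok_s           # assumed prefill wall time
    block_wall_s = CANVAS_TOKENS / first_block_tok_s    # inverse of 256 / block wall
    return (prefill_s + block_wall_s) * 1000.0


for prompt_tokens, prefill, decode, published_ms in rows:
    estimate = time_to_first_block_ms(prompt_tokens, prefill, decode)
    print(f"{prompt_tokens:>5} tokens: estimated {estimate:,.0f} ms, published {published_ms:,} ms")
```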
| Runtime path | Model artifact | Benchmark status |
|---|---|---|
| AX direct MLX | mlx-community/diffusiongemma-26B-A4B-it-4bit | Measured: 1 warmup + 5 measured repetitions, 15 s cooldown, medians reported |
| llama.cpp Metal 9650 | 4-bit GGUF | Blocked at load: unknown model architecture: 'diffusion-gemma' |
| mlx_lm 0.31.3 | 4-bit MLX snapshot | Blocked at load: Model type diffusion_gemma not supported. |
Memory bandwidth share:
The bandwidth chart is an implementation-efficiency view, not a peer comparison. It estimates first-block traffic at block granularity from the measured denoise-step count plus one causal commit over the 16.54 GB MLX safetensors artifact. This rerun used 48 / 25 / 48 denoise steps at 128 / 512 / 2,048 prompt tokens, so the estimated traffic is much larger than a one-step early-exit block. The chart shows estimated bandwidth used versus the M5 Max theoretical ceiling; the table keeps the effective GB/s values.
| Prompt tokens | Estimated effective bandwidth | % of 614.4 GB/s M5 Max theoretical bandwidth |
|---|---|---|
| 128 | 97.3 GB/s | 15.8% |
| 512 | 98.9 GB/s | 16.1% |
| 2,048 | 101.8 GB/s | 16.6% |
At these prompt lengths, the first-block path uses roughly 16% of theoretical M5 Max bandwidth. The current bottleneck is therefore not raw memory bandwidth alone; the next optimization target is denoise graph reuse, dispatch overhead, and convergence behavior under stricter quality gates.
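The estimate can be approximated from numbers already in this section. The README does not spell out the exact formula, so the sketch assumes each denoise step and the single causal commit read the full 16.54 GB artifact once, divided by the block wall time recovered from the first-block decode rate. Under that assumption the results land within about 0.2% of the table.

```python
ARTIFACT_GB = 16.54          # MLX safetensors artifact size from the text
THEORETICAL_GBPS = 614.4     # M5 Max theoretical bandwidth used in the table
CANVAS_TOKENS = 256

# (prompt tokens, denoise steps, first-block decode tok/s, published effective GB/s)
rows = [
    (128, 48, 30.7, 97.3),
    (512, 25, 58.9, 98.9),
    (2048, 48, 32.1, 101.8),
]


def estimated_bandwidth_gbps(denoise_steps: int, first_block_tok_s: float) -> float:
    passes = denoise_steps + 1                        # denoise passes plus one causal commit
    block_wall_s = CANVAS_TOKENS / first_block_tok_s
    return passes * ARTIFACT_GB / block_wall_s


for prompt_tokens, steps, decode, published in rows:
    gbps = estimated_bandwidth_gbps(steps, decode)
    share = 100.0 * gbps / THEORETICAL_GBPS
    print(f"{prompt_tokens:>5} tokens: {gbps:.1f} GB/s ({share:.1f}% of theoretical), published {published} GB/s")
```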
Denoise loop optimization — GPU-native sampling:
crates/ax-engine-mlx/src/diffusion.rs keeps denoise state, entropy-bound acceptance, and self-conditioning on the GPU. Convergence checks materialize only scalar counters and run every convergence_check_interval steps (default 4), reducing per-block GPU/CPU syncs from 48 to about 12. The CPU no longer round-trips 256 token positions on every denoise step; sampling and acceptance stay in lazy MLX graph nodes that can fuse with the forward evaluation.
Adaptive convergence detection:
The denoise loop can stop early when any configured convergence signal fires:
- Strict stability: argmax is unchanged for convergence_steps consecutive checks and mean entropy is below entropy_threshold (default 0.005).
- Low update rate: the accepted-position update rate drops below acceptance_rate_threshold (default 1%), so another denoise pass is unlikely to change the block materially.
- Entropy plateau: mean entropy stops decreasing materially after the early denoise phase, indicating diminishing returns from additional passes.
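The three signals can be sketched as a single check over the history of convergence checks. This is an illustration in Python, not the Rust code in crates/ax-engine-mlx/src/diffusion.rs: the `Check` record, the `convergence_steps` default of 2, and both plateau parameters are hypothetical. Only entropy_threshold (0.005) and acceptance_rate_threshold (1%) come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Check:
    argmax_changed: int   # canvas positions whose argmax changed since the last check
    mean_entropy: float
    update_rate: float    # fraction of positions updated by acceptance


@dataclass
class ConvergenceConfig:
    convergence_steps: int = 2               # hypothetical default
    entropy_threshold: float = 0.005         # default from the text
    acceptance_rate_threshold: float = 0.01  # 1% default from the text
    plateau_min_checks: int = 3              # hypothetical length of the "early phase"
    plateau_min_drop: float = 1e-4           # hypothetical "material" entropy decrease


def convergence_signal(history: List[Check], cfg: ConvergenceConfig) -> Optional[str]:
    """Return the name of the first signal that fires, or None to keep denoising."""
    if not history:
        return None
    last = history[-1]
    recent = history[-cfg.convergence_steps:]
    if (len(recent) == cfg.convergence_steps
            and all(c.argmax_changed == 0 for c in recent)
            and last.mean_entropy < cfg.entropy_threshold):
        return "strict_stability"
    if last.update_rate < cfg.acceptance_rate_threshold:
        return "low_update_rate"
    if len(history) > cfg.plateau_min_checks:
        drop = history[-2].mean_entropy - last.mean_entropy
        if drop < cfg.plateau_min_drop:
            return "entropy_plateau"
    return None


cfg = ConvergenceConfig()
print(convergence_signal([Check(0, 0.002, 0.05), Check(0, 0.001, 0.05)], cfg))
print(convergence_signal([Check(5, 0.5, 0.2)], cfg))
```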
The benchmark rows above report the measured adaptive-convergence run as recorded in the artifact. This rerun did not converge after one denoise step: it used 48 / 25 / 48 denoise steps at 128 / 512 / 2,048 prompt tokens. Time to first block therefore tracks the full measured denoise work for the 128- and 2,048-token rows and a mid-run early exit for the 512-token row.
Experimental denoise optimizations (opt-in):
The default path above uses no optional optimizations. The following environment variables enable experimental fast paths for benchmarking and development. All are off by default and should be considered preview/experimental until they are validated across prompt lengths, multi-block generation, and token-equivalence against the default imperative path.
| Environment variable | What it does | Status |
|---|---|---|
| AX_DIFFUSION_COMPILED_FORWARD=1 | Compiles the bidirectional denoise forward pass into an MlxClosure per block, collapsing ~250 per-step MLX C-API calls into one dispatched graph. | Experimental / benchmarking |
| AX_DIFFUSION_FULL_PIPELINE=1 | Compiles the entire denoise step (forward + softmax + entropy + argmax + sampling + acceptance) into a single MlxClosure. Supersedes AX_DIFFUSION_COMPILED_FORWARD when both are set. | Experimental / benchmarking |
| AX_DIFFUSION_KV_CONCAT_BUFFER=1 | Pre-allocates per-layer KV concatenation buffers on the first denoise step and updates only the canvas slice on subsequent steps, avoiding re-copying the cached prompt prefix. Most beneficial when multiple denoise steps are needed. | Experimental / benchmarking |
| AX_DIFFUSION_EMBEDDING_CACHE=1 | Caches per-layer embedding inputs across denoise steps when token IDs are unchanged, using a GPU-side sum fingerprint to detect changes. | Experimental / benchmarking |
| AX_DIFFUSION_SKIP_COMMIT_ON_CONVERGE=1 | Skips the causal commit forward pass when the denoise loop converges at step 1 with near-perfect acceptance (≥ 99%). | Experimental / benchmarking |
Example usage for a single benchmark run:
AX_DIFFUSION_FULL_PIPELINE=1 \
AX_DIFFUSION_KV_CONCAT_BUFFER=1 \
python3 scripts/bench_diffusion_gemma_direct.py --bench-bin target/release/ax-engine-bench

These flags are read once per process. Do not enable them in production serving without first verifying output token equivalence against the default path on your target prompts.
Artifacts: AX direct rows are 2026-06-20-direct-first-block-rerun/summary.json, with the human summary in summary.md. Peer runtime blockers are recorded as load failures, so there are no llama.cpp or mlx_lm result artifacts for this model family.
Render charts with:
python3 scripts/bench_diffusion_gemma_direct.py --skip-benchmark

Decode acceleration model (no MTP):
DiffusionGemma's acceleration model is the diffusion block itself. It does not stack with MTP or n-gram acceleration because those techniques assume an autoregressive next-token loop:
| | MTP (speculative decoding) | DiffusionGemma (block diffusion) |
|---|---|---|
| Generation | Draft-then-verify, one token at a time | 256-token blocks via bidirectional denoising |
| Forward pass | Causal only | Bidirectional (denoise) + causal (commit) |
| Needs draft model / assistant head | Yes | No |
| AX Engine decode path | ngram_acceleration / mtp_head_only | diffusion (early return, mutually exclusive) |
In the runner's decode_one, the diffusion path returns before the MTP/n-gram branches are reached. DiffusionConfig carries canvas size, denoise steps, entropy thresholds, convergence settings, and temperature schedule only; it has no MTP fields.
Supported features:
- Block-autoregressive discrete diffusion decode (canvas=256, up to 48 denoise steps)
- Entropy-bound position acceptance with argmax-based rejection
- Self-conditioning via GPU matmul (prob × cached embedding table)
- Linear temperature schedule (configurable start/end)
- Adaptive convergence detection (stable argmax, mean entropy, low update rate, and entropy plateau)
- Standard causal prefill (same Gemma4 encoder, 4,073.3 tok/s median at the 2,048-token row)
- Causal commit pass (writes KV cache for subsequent blocks)
- SSE telemetry counters for diffusion block timing, denoise steps, convergence signals, and near-miss entropy/update-rate diagnostics (ax_mlx_diffusion_*)
- diffusion decode-route classification in benchmark harness
Not applicable:
- MTP / assistant-head speculative decoding (architecturally incompatible)
- N-gram acceleration (diffusion replaces the autoregressive decode loop)
- Direct pipeline double-buffering (not autoregressive)
Benchmark contract:
The published rows use first-block telemetry instead of the standard fixed-token autoregressive benchmark contract. max_output_tokens=1 is enough to force prefill plus one diffusion block, and the block counters still report the full 256-token denoise/commit cycle even though the caller receives only the first emitted token.
Telemetry: SSE-emitted ax_mlx_diffusion_* counters cover block count, denoise steps, convergence count, per-criterion convergence signals, near-miss entropy/update-rate diagnostics, denoise wall time, commit wall time, and block wall time, plus diffusion decode-route classification in bench_mlx_inference_stack.py.
Run the full direct benchmark and regenerate the charts:
cargo build -p ax-engine-bench --bin ax-engine-bench
python3 scripts/bench_diffusion_gemma_direct.py

Qwen3-Coder-Next is the coding-specialist qwen3_next checkpoint, so it is reported separately from Qwen 3.6. It uses the same repo-owned AX MLX graph family, but its benchmark boundary is different: it does not ship MTP heads or a Qwen3.6 sidecar, so the public README path is direct decode only.
The direct comparison below uses grouped bar charts at 128/512/2048 prompt tokens. Each engine's version is printed on the charts: AX native MLX (6.5.2) and mlx_lm (0.31.3) use the MLX artifact and prompt-hash parity; llama.cpp Metal (b9700, ggml 0.15.2, flash-attn on) is a shape-compatible external GGUF reference run on one consistent build across all three prompt sizes. The AX rerun uses the default-on Qwen MoE fast paths (AX_MLX_QWEN3_MOE_NARROW_SOFTMAX, AX_MLX_MOE_FUSE_SHARED_EXPERT_ADD, and AX_MLX_MOE_SWIGLU_PACKED_METAL) plus the opt-in fused expert block (AX_MLX_MOE_FUSED_EXPERT_BLOCK=1). AX direct decode is +6.6% / +3.3% / +3.4% versus mlx_lm, and +23.6% / +20.6% / +17.1% versus llama.cpp.
| Prompt tokens | llama.cpp decode | mlx_lm decode | AX direct decode | AX vs mlx_lm | AX vs llama.cpp |
|---|---|---|---|---|---|
| 128 | 85.5 | 99.2 | 105.7 | +6.6% | +23.6% |
| 512 | 86.0 | 100.4 | 103.7 | +3.3% | +20.6% |
| 2048 | 85.5 | 96.9 | 100.2 | +3.4% | +17.1% |
Prefill and TTFT peers — same run:
| Prompt tokens | llama.cpp prefill | mlx_lm prefill | AX direct prefill | llama.cpp TTFT | mlx_lm TTFT | AX direct TTFT |
|---|---|---|---|---|---|---|
| 128 | 1,248.7 | 301.8 | 758.5 | 103 ms | 426 ms | 169 ms |
| 512 | 2,148.3 | 897.2 | 1,703.2 | 238 ms | 574 ms | 301 ms |
| 2048 | 2,555.1 | 2,226.9 | 2,482.6 | 802 ms | 920 ms | 825 ms |
llama.cpp leads prefill/TTFT at every prompt size (flash-attn GGUF prompt ingestion). The v6.5.2 fused-expert-block AX rerun keeps the AX decode advantage at every size, but its prefill/TTFT rows trail the prior default-on AX run; use this artifact as an opt-in fast-path measurement, not a replacement claim that the flag is always faster.
What drives the decode gap (it is not bandwidth saturation). This is a runtime shootout at each engine's standard 4-bit, not a controlled kernel test. Qwen3-Coder-Next is MoE, so each decode token reads only the dense backbone plus the 10-of-512 active experts — and at that footprint none of the three engines is bandwidth-bound (all sit at 34–42% of the 577 GB/s M5 Max peak; see the bandwidth table below). The gap splits cleanly: AX beats llama.cpp on bytes-read — Q4_K_M reads ~1.44× the bytes/token (2.83 vs 1.96 GB) because its dense backbone (linear-attention/SSM, embeddings, output head) stays at higher precision; llama.cpp actually sustains the most bandwidth (~42%) yet is slowest. AX beats mlx_lm on kernel efficiency — identical 1.96 GB/token MLX weights, but AX extracts ~36% of peak vs mlx-lm's ~34% (the MoE gather-GEMV win). The parity-controlled claim is AX vs mlx_lm (identical weights, prompt-hash parity): +3.3%–6.6%; llama-bench consumes its own internal tokens (no prompt-hash parity), so the llama.cpp column is a shape-compatible external reference only.
Memory bandwidth utilization:
Decode speed follows one identity: tok/s = effective bandwidth ÷ bytes read per token. The chart below plots decode throughput (y) against weight bytes read per token (x), with the measured M5 Max peak (≈577 GB/s, MLX reduction probe) drawn as the ceiling curve tok/s = 577 / bytes. It reads in one view: AX and mlx-lm share the same x (identical MLX 4-bit weights), so the vertical gap between them is pure kernel efficiency (+6.6%, AX's MoE gather-GEMV); llama.cpp is pushed right because Q4_K_M reads 1.44× the bytes/token, which is why it decodes slowest even though it sustains the most raw bandwidth; and every point sits far below the ceiling, so decode is gather/dispatch-bound, not bandwidth-bound — the room up to the curve is headroom.
| Engine / quantization | Dense backbone | Active experts | Weights/token | Decode tok/s | Effective BW | % of 577 GB/s peak (used) |
|---|---|---|---|---|---|---|
| AX — MLX 4-bit + fused expert block | 1.21 GB (22%) | 0.76 GB (14%) | 1.96 GB | 105.7 | 208 GB/s | 36% |
| mlx-lm — MLX 4-bit | 1.21 GB (21%) | 0.76 GB (13%) | 1.96 GB | 99.2 | 195 GB/s | 34% |
| llama.cpp — Q4_K_M | 1.91 GB (28%) | 0.91 GB (14%) | 2.83 GB | 85.5 | 242 GB/s | 42% |
Per-segment percentages are that read's share of the 577 GB/s peak (dense + experts = used); the remainder is idle headroom. The dense backbone (read in full every token) is where Q4_K_M's higher precision shows up — 1.91 GB vs MLX's 1.21 GB.
AX and mlx-lm read the same 1.96 GB of active weights per token (identical MLX 4-bit artifact); AX is faster because it extracts more of the available bandwidth — a runtime/kernel win, not a quant difference. llama.cpp reads 1.44× more (2.83 GB) because Q4_K_M keeps the dense backbone — Qwen3-Next's linear-attention/SSM weights, token embeddings, and output head — at higher precision; that bytes-read overhead, not bandwidth starvation, is why its decode trails. Active-byte figures: MLX from the harness bandwidth_accounting (moe_active_estimate), llama.cpp computed from the GGUF tensor table (dense + routed × 10/512, the same formula). Rows are prompt=128; decode tok/s is essentially depth-independent for this model.
The same chart also shows the remaining AX headroom. If AX kept the 1.96 GB/token footprint and merely matched llama.cpp's 42% effective-bandwidth row, decode would land around 124 tok/s (+17%). On dense models on this same M5 Max hardware AX reaches 78–86% of peak, so the ~40-point gap here is specific to batch-1 MoE decode: each token gathers only 10-of-512 experts, and fixed routing, gather setup, dispatch, dequant, and expert weighted-sum overhead dominate. Those costs do not scale with bytes read (the bus idles while dispatch runs). The next lever is therefore kernel/dispatch engineering (fewer and larger fused MoE operations such as batched expert dispatch and deeper gather+GEMV+weighted-sum fusion), not pushing quantization lower; AX already reads the fewest bytes of the three, and going lower would cost model quality. The 124 tok/s figure is an upper bound, not a commitment: single-token MoE decode is latency-bound at its core.
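The identity tok/s = effective bandwidth ÷ bytes read per token is easy to check against the table. The sketch below uses only the published weights/token, decode tok/s, and 577 GB/s peak; the helper names are illustrative, and the effective-bandwidth values differ from the table by under 1% because the inputs are rounded.

```python
PEAK_GBPS = 577.0  # measured M5 Max peak quoted in this section

# engine -> (weights read per token in GB, decode tok/s at prompt=128)
engines = {
    "AX": (1.96, 105.7),
    "mlx-lm": (1.96, 99.2),
    "llama.cpp": (2.83, 85.5),
}


def effective_bandwidth_gbps(gb_per_token: float, tok_s: float) -> float:
    """Rearranged identity: effective bandwidth = tok/s * bytes read per token."""
    return gb_per_token * tok_s


def decode_tok_s(gb_per_token: float, bandwidth_gbps: float) -> float:
    """The identity itself: tok/s = effective bandwidth / bytes read per token."""
    return bandwidth_gbps / gb_per_token


for name, (gb, tok_s) in engines.items():
    bw = effective_bandwidth_gbps(gb, tok_s)
    print(f"{name:>9}: {bw:6.1f} GB/s effective, {100 * bw / PEAK_GBPS:.0f}% of peak")

# Headroom case from the text: AX's 1.96 GB/token footprint at 42% of peak.
ax_at_42_percent = decode_tok_s(1.96, 0.42 * PEAK_GBPS)
print(f"AX at 42% of peak: about {ax_at_42_percent:.0f} tok/s")
```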
Artifacts: AX direct rows are the v6.5.2 opt-in fused-expert-block rerun 2026-06-20-qwen3-coder-next-ax-direct/qwen3-coder-next-4bit-ax-direct.json, with default-on Qwen MoE fast paths plus AX_MLX_MOE_FUSED_EXPERT_BLOCK=1; mlx_lm reference rows are qwen3-coder-next-4bit-p128-p2048-step4096.json and qwen3-coder-next-4bit-p512-step4096.json; llama.cpp is 2026-06-19-qwen3-coder-next-9700-fa/qwen3-coder-next-4bit.json (b9700 / ggml 0.15.2 / flash-attn, one build across 128/512/2048).
Render charts with:
python3 scripts/render_qwen_coder_next_charts.py \
--artifact benchmarks/results/mlx-inference/2026-06-20-qwen3-coder-next-ax-direct/qwen3-coder-next-4bit-ax-direct.json \
--artifact benchmarks/results/mlx-inference/2026-06-19-qwen3-coder-next-ax-only/qwen3-coder-next-4bit-ax-direct.json \
--artifact benchmarks/results/mlx-inference/2026-06-14-qwen3-coder-next-29af647f-ax-direct/qwen3-coder-next-4bit-ax-direct.json \
--artifact benchmarks/results/mlx-inference/2026-06-13-qwen3-coder-next-prefill-probe/qwen3-coder-next-4bit-p128-p2048-step4096.json \
--artifact benchmarks/results/mlx-inference/2026-06-13-qwen3-coder-next-prefill-probe/qwen3-coder-next-4bit-p512-step4096.json \
--llama-artifact benchmarks/results/llama-cpp-metal/2026-06-19-qwen3-coder-next-9700-fa/qwen3-coder-next-4bit.json \
--assets-dir docs/assets
# Memory-bandwidth utilization chart (static data; see script header for provenance)
python3 scripts/render_qwen_coder_next_bandwidth_chart.py --assets-dir docs/assets

Qwen3-Coder-Next uses a sparse top-10-of-512 MoE architecture, so each decode token reads only the dense backbone plus 10 active experts. The optimizations below reduce the per-layer dispatch overhead in the MoE expert forward path. Three Qwen-relevant paths are on by default (with kill-switches); the others are opt-in for benchmarking and development.
| Environment variable | What it does | Default |
|---|---|---|
| AX_MLX_QWEN3_MOE_NARROW_SOFTMAX | Routes MoE expert selection through argpartition on raw logits instead of full softmax_precise over all 512 experts. Equivalent for selection: softmax is monotonic, so the top-k set taken on raw logits matches the top-k set taken on probabilities. | ON |
| AX_MLX_MOE_FUSE_SHARED_EXPERT_ADD | Adds Qwen3 shared-expert output inside the weighted-sum Metal kernel on decode/short-tail chunks, removing one add dispatch per MoE layer when shapes are eligible. | ON |
| AX_MLX_MOE_SWIGLU_PACKED_METAL | Routes packed Qwen3 MoE expert SwiGLU through one Metal kernel instead of split + split + activation/multiply on decode. Long prefill keeps the split path. | ON |
| AX_MLX_MOE_LAYER_COMPILE | Wraps each MoE layer's decode forward path in a compiled MlxClosure (shapeless=true), collapsing ~10 per-layer MLX dispatches into a single compiled graph. Cached per (layer_index, thread_id). Only engages for decode (seq == 1). Falls back to the uncompiled path on failure. | OFF |
| AX_MLX_MOE_PROFILE | Records wall-clock timing for each MoE sub-stage (router, gate-up, activation, down-projection, weighted-sum, total) without eval() barriers. Data surfaces in route metadata and batch summaries. Diagnostic tool, not a performance optimization. | OFF |
| AX_MLX_MOE_FUSED_EXPERT_BLOCK | Routes the activation + squeeze + unsort chain through a single fused Metal kernel for decode (unsorted gather path only). Reduces dispatch count per MoE layer. Falls back to the standard dispatch when ineligible. | OFF |
| AX_MLX_MOE_EXPERT_PARALLEL | Bins expert tokens per-expert for parallel Metal dispatch during prefill. Checks load-balance before engaging (falls back to sequential gather_qmm when max_bin > 2x mean_bin). Infrastructure only: parallel kernel not yet implemented. | OFF |
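The equivalence behind AX_MLX_QWEN3_MOE_NARROW_SOFTMAX is a general property of softmax, and a few lines of plain Python show it. This is a standalone illustration with random logits, not the AX kernel: it uses a full sort where the runtime uses argpartition, and 512 experts with k=10 simply mirror the Qwen3-Coder-Next shape.

```python
import math
import random
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(values: List[float], k: int) -> Set[int]:
    """Indices of the k largest values (order within the set is irrelevant)."""
    order = sorted(range(len(values)), key=values.__getitem__, reverse=True)
    return set(order[:k])


random.seed(0)
router_logits = [random.gauss(0.0, 2.0) for _ in range(512)]

from_logits = top_k_indices(router_logits, 10)
from_probs = top_k_indices(softmax(router_logits), 10)
print("same experts selected:", from_logits == from_probs)
```

Because softmax is strictly increasing in each logit, ranking by logit and ranking by probability agree, so selection can skip the full 512-way softmax.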
To disable a default-on optimization (e.g. for debugging or comparison):
# Disable packed SwiGLU for a single run
AX_MLX_MOE_SWIGLU_PACKED_METAL=0 ax-engine serve qwen3-coder-next --download --port 8080

To enable selected experimental diagnostics/fast paths for benchmarking:
AX_MLX_MOE_LAYER_COMPILE=1 \
AX_MLX_MOE_PROFILE=1 \
AX_MLX_MOE_FUSED_EXPERT_BLOCK=1 \
ax-engine serve qwen3-coder-next --download --port 8080

Note: AX_MLX_MOE_LAYER_COMPILE wraps each MoE layer's decode forward in a compiled MlxClosure. It is opt-in because it may panic in long-running processes due to MLX thread-local stream registry invalidation. If you encounter crashes, disable it with AX_MLX_MOE_LAYER_COMPILE=0. AX_MLX_MOE_EXPERT_PARALLEL is infrastructure-only (parallel kernel not yet implemented).
These flags are read once per process at startup. Do not enable the opt-in flags in production serving without first verifying output token equivalence against the default path on your target prompts.
The family tables below compare direct (non-speculative) decode across llama.cpp Metal, mlx_lm, and ax engine, covering Gemma 4 and Qwen 3.6 at 128/512/2048 prompt tokens. ax direct baseline disables n-gram acceleration, MTP, and assistant drafting to measure the repo-owned direct decode path. Bench prompts are mlx_lm.benchmark seed-0 random tokens, which keeps prompt-hash parity across MLX rows.
The prefill and TTFT advantage is the practical direct-mode story. AX is ahead of mlx_lm in every listed prefill and TTFT cell below, while decode gains are smaller and model-dependent. That means the repo-owned MLX route is especially valuable for interactive requests where prompt ingestion dominates perceived latency: AX keeps prompt prefill, first-token timing, model-specific graph paths, and route metadata in one measured runtime path. These are cold-prefix rows, not prompt-cache, continuous-batching, or speculative-decoding claims.
[Charts: decode rate, prefill rate, and TTFT, one panel each for Gemma 4 and Qwen 3.6]
llama.cpp Metal* column: shape-compatible reference produced by Metal-enabled llama-bench. llama-bench generates its own internal synthetic prompt tokens and does not consume the harness prompt JSON, so these numbers do not have prompt-hash parity with the other columns. No percentage delta is shown. MLX bit-widths are mapped to the nearest standard GGUF K-quant (4→Q4_K_M, 5→Q5_K_M, 6→Q6_K, 8→Q8_0). Source: benchmarks/manifests/llama_cpp_metal/inventory.json, scripts/bench_llama_cpp_metal_sweep.py.
Benchmark provenance and methodology
The mlx_lm reference rows for the 12 Gemma 4 and Qwen 3.6 rows shown below come from benchmarks/results/mlx-inference/2026-05-26-direct-mode-clean-refresh/. The AX direct-mode cells come from the full 12-model AX-only rerun in benchmarks/results/mlx-inference/2026-06-20-ax-direct-readme/ (v6.5.2). Qwen3-Coder-Next is intentionally handled as the opening direct-mode subsection because it has a direct-only benchmark boundary; its MLX/AX and llama.cpp Metal rows now cover 128/512/2048 prompt tokens. The llama.cpp Metal* column is injected from benchmarks/manifests/llama_cpp_metal/inventory.json and the 2026-05-18-llama-cpp-metal-gemma-e2b-4bit-depth-fa/ Gemma 4 E2B 4-bit recheck.
Setup: generation=128, 5 measured repetitions, 15-second cooldown, AX prefix cache disabled for cold prefill and TTFT measurement, production-build binaries, matching prompt SHA checks. Long-greedy AX prefill rows are runner-time measurements of the cache-state prefix plus final prompt-token boundary — not full-logits prompt scoring throughput. Percentages are versus mlx_lm.
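As a reading aid for the "versus mlx_lm" percentages, the following sketch shows one way such a cell could be formatted from two measurements. The helper names `pct_delta` and `format_cell` are hypothetical. The published cells are presumably computed from unrounded measurements, so recomputing from the rounded table values can differ in the last digit.

```python
def pct_delta(ax_value: float, mlx_lm_value: float) -> float:
    """Signed percentage difference of an AX value relative to the mlx_lm value."""
    if mlx_lm_value <= 0:
        raise ValueError("mlx_lm reference must be positive")
    return (ax_value / mlx_lm_value - 1.0) * 100.0


def format_cell(ax_value: float, mlx_lm_value: float) -> str:
    """Render a value in the table's 'value (+delta%)' style."""
    return f"{ax_value:,.1f} ({pct_delta(ax_value, mlx_lm_value):+.1f}%)"


if __name__ == "__main__":
    # Gemma 4 E2B 4-bit, 512 prompt tokens: AX and mlx_lm prefill values from the table.
    print(format_cell(16076.9, 7870.0))  # 16,076.9 (+104.3%)
```

For throughput tables a positive delta favors AX; for TTFT, where lower is better, a negative delta does.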
The 2K llama.cpp Metal* prefill rows are long-context, GGUF-runtime-reference rows. The Gemma 4 E2B 4-bit row was produced with llama.cpp b9110 and rechecked on b9200 with Metal offload, -b/-ub 2048, and flash attention enabled. The b9200 recheck improved 2K prefill only slightly — this is our benchmark boundary, not an upstream llama.cpp official bug statement.
Prefill throughput (tok/s; higher is better):

| Model | MLX quantization | Prompt tok | llama.cpp Metal* | mlx_lm | ax engine |
|---|---|---|---|---|---|
| Gemma 4 E2B | 4-bit | 128 | 3,481.7 | 2,338.1 | 5,720.2 (+144.6%) |
| | | 512 | 6,846.0 | 7,870.0 | 16,076.9 (+104.3%) |
| | | 2048 | 7,643.1 | 18,014.7 | 23,346.2 (+29.6%) |
| Gemma 4 E2B | 5-bit | 128 | 3,398.4 | 2,238.5 | 5,436.4 (+142.9%) |
| | | 512 | 6,860.3 | 7,469.9 | 15,526.9 (+107.9%) |
| | | 2048 | 7,288.1 | 16,664.1 | 22,798.4 (+36.8%) |
| Gemma 4 E2B | 6-bit | 128 | 3,539.7 | 1,823.5 | 5,330.0 (+192.3%) |
| | | 512 | 7,274.0 | 6,046.6 | 14,814.0 (+145.0%) |
| | | 2048 | 7,623.2 | 15,332.1 | 22,280.0 (+45.3%) |
| Gemma 4 E2B | 8-bit | 128 | 3,694.3 | 1,605.0 | 5,338.2 (+232.6%) |
| | | 512 | 7,481.0 | 6,332.9 | 15,259.4 (+141.0%) |
| | | 2048 | 7,990.4 | 15,536.8 | 22,924.7 (+47.6%) |
| Gemma 4 E4B | 4-bit | 128 | 2,194.0 | 1,513.2 | 3,460.6 (+128.7%) |
| | | 512 | 4,454.2 | 4,195.5 | 7,002.4 (+66.9%) |
| | | 2048 | 4,426.6 | 7,325.4 | 8,758.8 (+19.6%) |
| Gemma 4 26B A4B | 4-bit | 128 | 1,911.4 | 496.4 | 1,331.6 (+168.2%) |
| | | 512 | 3,484.5 | 1,621.0 | 3,011.0 (+85.7%) |
| | | 2048 | 3,604.8 | 3,300.1 | 4,550.1 (+37.9%) |
| Gemma 4 31B | 4-bit | 128 | 522.6 | 283.1 | 508.0 (+79.5%) |
| | | 512 | 665.3 | 619.9 | 736.0 (+18.7%) |
| | | 2048 | 560.3 | 733.9 | 750.8 (+2.3%) |
| Qwen 3.6 27B | 4-bit | 128 | 539.6 | 378.8 | 570.4 (+50.6%) |
| | | 512 | 759.7 | 705.7 | 826.6 (+17.1%) |
| | | 2048 | 664.3 | 895.2 | 922.0 (+3.0%) |
| Qwen 3.6 27B | 5-bit | 128 | 520.8 | 278.8 | 520.4 (+86.6%) |
| | | 512 | 733.4 | 599.5 | 760.4 (+26.8%) |
| | | 2048 | 667.0 | 827.5 | 848.1 (+2.5%) |
| Qwen 3.6 27B | 6-bit | 128 | 537.7 | 270.5 | 485.1 (+79.3%) |
| | | 512 | 756.1 | 577.6 | 736.0 (+27.4%) |
| | | 2048 | 689.3 | 798.2 | 841.0 (+5.4%) |
| Qwen 3.6 27B | 8-bit | 128 | 559.4 | 219.3 | 441.7 (+101.4%) |
| | | 512 | 798.2 | 520.2 | 710.1 (+36.5%) |
| | | 2048 | 741.9 | 787.4 | 847.6 (+7.6%) |
| Qwen 3.6 35B A3B | 4-bit | 128 | 1,706.9 | 539.4 | 1,118.8 (+107.4%) |
| | | 512 | 3,146.6 | 1,599.5 | 2,588.3 (+61.8%) |
| | | 2048 | 3,542.3 | 3,513.1 | 3,761.3 (+7.1%) |


Decode throughput (tok/s; higher is better):

| Model | MLX quantization | Prompt tok | llama.cpp Metal* | mlx_lm | ax direct baseline |
|---|---|---|---|---|---|
| Gemma 4 E2B | 4-bit | 128 | 174.6 | 214.0 | 224.1 (+4.7%) |
| | | 512 | 165.2 | 210.3 | 215.1 (+2.3%) |
| | | 2048 | 171.9 | 200.9 | 205.4 (+2.2%) |
| Gemma 4 E2B | 5-bit | 128 | 154.8 | 195.2 | 200.6 (+2.8%) |
| | | 512 | 154.3 | 182.0 | 194.5 (+6.8%) |
| | | 2048 | 154.3 | 181.9 | 185.7 (+2.1%) |
| Gemma 4 E2B | 6-bit | 128 | 152.1 | 172.2 | 178.0 (+3.4%) |
| | | 512 | 152.0 | 166.3 | 171.7 (+3.2%) |
| | | 2048 | 152.2 | 162.5 | 164.7 (+1.4%) |
| Gemma 4 E2B | 8-bit | 128 | 136.1 | 153.0 | 162.0 (+5.8%) |
| | | 512 | 138.3 | 148.8 | 157.8 (+6.1%) |
| | | 2048 | 138.7 | 144.2 | 153.0 (+6.1%) |
| Gemma 4 E4B | 4-bit | 128 | 110.7 | 137.1 | 142.9 (+4.2%) |
| | | 512 | 110.8 | 133.6 | 139.9 (+4.8%) |
| | | 2048 | 110.7 | 130.6 | 137.2 (+5.1%) |
| Gemma 4 26B A4B | 4-bit | 128 | 112.6 | 127.9 | 131.7 (+2.9%) |
| | | 512 | 112.9 | 125.0 | 128.7 (+2.9%) |
| | | 2048 | 112.9 | 119.3 | 123.7 (+3.7%) |
| Gemma 4 31B | 4-bit | 128 | 25.0 | 28.9 | 28.8 (-0.3%) |
| | | 512 | 25.5 | 28.3 | 28.3 (-0.2%) |
| | | 2048 | 25.3 | 27.0 | 26.1 (-3.3%) |
| Qwen 3.6 27B | 4-bit | 128 | 26.0 | 34.0 | 33.9 (-0.3%) |
| | | 512 | 26.0 | 33.9 | 33.6 (-0.8%) |
| | | 2048 | 18.8 | 33.4 | 33.3 (-0.4%) |
| Qwen 3.6 27B | 5-bit | 128 | 23.5 | 21.6 | 27.2 (+26.1%) |
| | | 512 | 23.3 | 28.1 | 26.9 (-4.2%) |
| | | 2048 | 17.8 | 27.8 | 26.1 (-6.3%) |
| Qwen 3.6 27B | 6-bit | 128 | 21.3 | 24.0 | 24.0 (+0.2%) |
| | | 512 | 21.3 | 24.8 | 24.0 (-3.0%) |
| | | 2048 | 15.4 | 24.6 | 23.7 (-3.8%) |
| Qwen 3.6 27B | 8-bit | 128 | 18.3 | 18.7 | 18.3 (-2.2%) |
| | | 512 | 18.2 | 18.6 | 18.0 (-3.2%) |
| | | 2048 | 12.7 | 18.4 | 18.1 (-1.7%) |
| Qwen 3.6 35B A3B | 4-bit | 128 | 108.1 | 140.1 | 153.2 (+9.4%) |
| | | 512 | 108.2 | 136.5 | 151.6 (+11.1%) |
| | | 2048 | 105.7 | 134.5 | 149.8 (+11.4%) |

Qwen 3.6 27B 4-bit at prompt=2,048 originally produced zero decode tokens because 4-bit quantization noise pushed an EOS token to argmax at decode step 0 on the mlx_lm.benchmark random-token contract. The benchmark harness now sends sampling.ignore_eos=true for AX throughput runs, matching how mlx_lm.benchmark measures fixed gen=N throughput. Production requests default to ignore_eos=false. Source: benchmarks/results/mlx-inference/2026-05-20-qwen27-4to5-direct-ngram-directcpp-r2/qwen3_6-27b-4bit.json.
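The effect of ignore_eos described in this note can be illustrated with a toy greedy loop. This is a minimal sketch and not the AX runner: `decode_tokens`, the token stream, and the EOS id are all hypothetical stand-ins.

```python
from typing import Iterable, List


def decode_tokens(argmax_stream: Iterable[int], eos_id: int, max_tokens: int,
                  ignore_eos: bool = False) -> List[int]:
    """Collect up to max_tokens greedy tokens, stopping at EOS unless ignore_eos is set."""
    out: List[int] = []
    for token in argmax_stream:
        if len(out) >= max_tokens:
            break
        if token == eos_id and not ignore_eos:
            break  # default behavior: EOS ends generation
        out.append(token)
    return out


if __name__ == "__main__":
    EOS = 2                    # hypothetical EOS token id
    stream = [EOS, 17, 42, 8]  # EOS happens to win argmax at decode step 0
    print(len(decode_tokens(stream, EOS, max_tokens=4)))                   # 0
    print(len(decode_tokens(stream, EOS, max_tokens=4, ignore_eos=True)))  # 4
```

With the default setting the run yields no decode tokens to time, which is why a fixed-length throughput measurement needs EOS to be ignored.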
Time to first token (TTFT, ms). Lower is better. mlx_lm values are derived from reported prefill throughput. AX values are measured directly from per-step runner timing in the SSE event stream.

| Model | MLX quantization | Prompt tok | llama.cpp Metal* | mlx_lm | ax engine |
|---|---|---|---|---|---|
| Gemma 4 E2B | 4-bit | 128 | 36.8 | 54.7 | 22.4 (-59.1%) |
| | | 512 | 74.8 | 65.1 | 31.8 (-51.0%) |
| | | 2048 | 268.0 | 113.7 | 87.7 (-22.8%) |
| Gemma 4 E2B | 5-bit | 128 | 37.7 | 57.2 | 23.5 (-58.8%) |
| | | 512 | 74.6 | 68.5 | 33.0 (-51.9%) |
| | | 2048 | 281.0 | 122.9 | 89.8 (-26.9%) |
| Gemma 4 E2B | 6-bit | 128 | 36.2 | 70.2 | 24.0 (-65.8%) |
| | | 512 | 70.4 | 84.7 | 34.6 (-59.2%) |
| | | 2048 | 268.7 | 133.6 | 91.9 (-31.2%) |
| Gemma 4 E2B | 8-bit | 128 | 34.6 | 79.7 | 24.0 (-69.9%) |
| | | 512 | 68.4 | 80.8 | 33.6 (-58.5%) |
| | | 2048 | 256.3 | 131.8 | 89.3 (-32.2%) |
| Gemma 4 E4B | 4-bit | 128 | 58.3 | 84.6 | 37.0 (-56.3%) |
| | | 512 | 114.9 | 122.0 | 73.1 (-40.1%) |
| | | 2048 | 462.7 | 279.6 | 233.8 (-16.4%) |
| Gemma 4 26B A4B | 4-bit | 128 | 67.0 | 257.8 | 96.1 (-62.7%) |
| | | 512 | 146.9 | 315.8 | 170.0 (-46.2%) |
| | | 2048 | 568.1 | 620.6 | 450.1 (-27.5%) |
| Gemma 4 31B | 4-bit | 128 | 244.9 | 452.2 | 252.0 (-44.3%) |
| | | 512 | 769.5 | 826.0 | 695.7 (-15.8%) |
| | | 2048 | 3,655.2 | 2,790.6 | 2,727.7 (-2.3%) |
| Qwen 3.6 27B | 4-bit | 128 | 237.2 | 337.9 | 224.4 (-33.6%) |
| | | 512 | 673.9 | 725.6 | 619.4 (-14.6%) |
| | | 2048 | 3,083.1 | 2,287.7 | 2,221.3 (-2.9%) |
| Qwen 3.6 27B | 5-bit | 128 | 245.8 | 459.0 | 246.0 (-46.4%) |
| | | 512 | 698.1 | 854.1 | 673.3 (-21.2%) |
| | | 2048 | 3,070.5 | 2,474.9 | 2,414.7 (-2.4%) |
| Qwen 3.6 27B | 6-bit | 128 | 238.1 | 473.2 | 263.9 (-44.2%) |
| | | 512 | 677.2 | 886.5 | 695.6 (-21.5%) |
| | | 2048 | 2,971.2 | 2,565.6 | 2,435.2 (-5.1%) |
| Qwen 3.6 27B | 8-bit | 128 | 228.8 | 583.6 | 289.8 (-50.3%) |
| | | 512 | 641.5 | 984.2 | 721.0 (-26.7%) |
| | | 2048 | 2,760.6 | 2,601.1 | 2,416.3 (-7.1%) |
| Qwen 3.6 35B A3B | 4-bit | 128 | 75.0 | 237.3 | 114.4 (-51.8%) |
| | | 512 | 162.7 | 320.1 | 197.8 (-38.2%) |
| | | 2048 | 578.2 | 583.0 | 544.5 (-6.6%) |

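
The mlx_lm TTFT values are described as derived from reported prefill throughput. A minimal sketch of that derivation, assuming TTFT is simply prompt length divided by prefill rate (the function name `ttft_ms_from_prefill` is hypothetical), lines up with the Gemma 4 E2B 4-bit rows:

```python
def ttft_ms_from_prefill(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Estimate time to first token in ms as prompt length divided by prefill rate."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill throughput must be positive")
    return prompt_tokens / prefill_tok_per_s * 1000.0


if __name__ == "__main__":
    # mlx_lm prefill rates (tok/s) for Gemma 4 E2B 4-bit, taken from the prefill table.
    for prompt_tokens, rate in ((128, 2338.1), (512, 7870.0), (2048, 18014.7)):
        print(prompt_tokens, round(ttft_ms_from_prefill(prompt_tokens, rate), 1))
    # prints 54.7, 65.1, and 113.7 ms, matching the mlx_lm TTFT cells
```

The AX column is not derived this way; per the note above it is measured from runner timing.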
Embedding benchmarks are kept out of this README summary; see docs/EMBEDDINGS.md.
ax-engine-server exposes OpenAI-compatible HTTP endpoints, and several SDKs wrap those endpoints or the in-process Rust session directly.
| Language | Package / path | LangChain |
|---|---|---|
| Python | python/ax_engine | ax_engine.langchain: AXEngineChatModel, AXEngineLLM |
| TypeScript / JS | javascript/ax-engine (@ax-engine/sdk) | @ax-engine/sdk/langchain: ChatAXEngine, AXEngineLLM |
| Go | sdk/go/axengine | Use langchaingo OpenAI provider; see examples/go/langchain/ |
| Ruby | sdk/ruby (ax-engine-sdk) | ax_engine/langchain: ChatModel, LLM (requires langchain-rb) |
| Mojo | sdk/mojo/ax_engine.mojo | Via Python: use ax_engine.langchain from Mojo's Python interop |

npm install @ax-engine/sdk

import AxEngineClient from "@ax-engine/sdk";

const client = new AxEngineClient({ baseUrl: "http://127.0.0.1:8080" });
const resp = await client.chatCompletion({
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 128,
});
console.log(resp.choices[0].message.content);

// Streaming
for await (const event of client.streamChatCompletion({
  messages: [{ role: "user", content: "Count from 1 to 5." }],
  stream: true,
})) {
  process.stdout.write(event.data.choices[0]?.delta?.content ?? "");
}

LangChain integration (requires @langchain/core):

import { ChatAXEngine } from "@ax-engine/sdk/langchain";
import { HumanMessage } from "@langchain/core/messages";

const chat = new ChatAXEngine({ maxTokens: 128 });
const response = await chat.invoke([new HumanMessage("Hello!")]);

The Go SDK lives at sdk/go/axengine (module github.com/ax-engine/ax-engine-go).

client := axengine.NewClient(nil)
req := axengine.OpenAiChatCompletionRequest{
	Messages:  []axengine.OpenAiChatMessage{{Role: "user", Content: "Hello!"}},
	MaxTokens: axengine.Ptr(128),
}
resp, err := client.ChatCompletion(ctx, req)
if err != nil {
	log.Fatal(err)
}

// Streaming (reuses req from above)
ch, errCh := client.StreamChatCompletion(ctx, req)
for chunk := range ch {
	// Skip chunks that carry no text delta.
	if len(chunk.Choices) > 0 && chunk.Choices[0].Delta.Content != nil {
		fmt.Print(*chunk.Choices[0].Delta.Content)
	}
}

See examples/go/ for runnable examples. For LangChain, point langchaingo's OpenAI provider at http://127.0.0.1:8080/v1; see examples/go/langchain/ and docs/GO.md.

The Ruby SDK lives at sdk/ruby/ (ax-engine-sdk gem). Zero dependencies — stdlib net/http only. Streaming uses a block interface.
require "ax_engine"

client = AxEngine::Client.new

# Blocking chat
resp = client.chat_completion(
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 128
)
puts resp.dig("choices", 0, "message", "content")

# Streaming
client.stream_chat_completion(
  messages: [{ role: "user", content: "Count from 1 to 5." }],
  max_tokens: 64
) do |event|
  print event.dig("data", "choices", 0, "delta", "content").to_s
end

LangChain via langchain-rb:

require "ax_engine/langchain"

chat = AxEngine::Langchain::ChatModel.new(max_tokens: 256)
puts chat.chat(messages: [{ role: "user", content: "Hello!" }]).chat_completion

See examples/ruby/ and docs/RUBY.md for full details.

Python LangChain integration (ax_engine.langchain):

from ax_engine.langchain import AXEngineChatModel
from langchain_core.messages import HumanMessage

chat = AXEngineChatModel(base_url="http://127.0.0.1:8080", max_tokens=256)
response = chat.invoke([HumanMessage(content="Hello!")])
print(response.content)

# Streaming
for chunk in chat.stream([HumanMessage(content="Count from 1 to 5.")]):
    print(chunk.content, end="", flush=True)

Requires pip install langchain-core. See docs/PYTHON.md for full details.

The Mojo SDK (sdk/mojo/ax_engine.mojo) wraps the Python ax_engine package via Mojo's PythonObject interop. Requires the Python extension to be built first (maturin develop).
from sdk.mojo.ax_engine import Session

var session = Session(
    "qwen3_dense",
    mlx=True,
    mlx_model_artifacts_dir="/path/to/artifacts",
)
var result = session.generate("Hello from Mojo!", max_output_tokens=64)
print(result.output_text)
session.close()

The installed PyPI workflow uses ax-engine serve for the common local-serving path. ax-engine-server remains available as the backward-compatible low-level entrypoint when you need explicit runtime flags.

# Download a model and generate its manifest
MODEL_DIR="$(ax-engine download qwen36-35b --json | python3 -c 'import json,sys; print(json.load(sys.stdin)["dest"])')"
# Recommended: resolve and launch ax-engine-server
ax-engine serve "$MODEL_DIR" --port 8080
# Backward-compatible low-level path
./target/release/ax-engine-server \
--mlx \
--mlx-model-artifacts-dir "$MODEL_DIR" \
--port 8080
# Inspect the running server
curl http://127.0.0.1:8080/v1/runtime
# Smoke generation request
curl http://127.0.0.1:8080/v1/generate \
  -H 'content-type: application/json' \
  -d '{
    "model": "qwen3_dense",
    "input_tokens": [1, 2, 3, 4],
    "max_output_tokens": 4,
    "sampling": { "temperature": 0.0, "top_p": 1.0, "top_k": 0, "seed": 1234 }
  }'

Python bindings (after maturin develop):

import ax_engine

path = ax_engine.download_model("mlx-community/Qwen3-4B-4bit")
with ax_engine.Session(mlx=True, mlx_model_artifacts_dir=str(path)) as s:
    result = s.generate([1, 2, 3], max_output_tokens=32)
    print(result.output_tokens)

Delegated route (for unsupported MLX text models that mlx-lm can serve):

mlx_lm.server --model /path/to/local/mlx-model --host 127.0.0.1 --port 8090

./target/release/ax-engine-bench generate \
  --prompt "Hello from mlx-lm" \
  --support-tier mlx_lm_delegated \
  --mlx-lm-server-url http://127.0.0.1:8090

mlx_lm_delegated is a compatibility route, not an AX-owned MLX throughput claim. AX forwards text generation to upstream mlx_lm.server and preserves temperature, top_p, top_k, repetition_penalty, and seed. Streamed chunks are delegated text deltas, not AX-owned token IDs, KV state, or model-kernel throughput evidence.

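For scripted smoke tests, the /v1/generate curl request shown earlier in this section can also be built with Python's standard library. This is a sketch under assumptions: the URL and payload fields are copied from the curl example, the helper names `build_generate_request` and `send` are hypothetical, and the response is assumed to be a JSON body.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # same address as the curl examples


def build_generate_request(input_tokens, max_output_tokens=4, model="qwen3_dense"):
    """Build a POST request mirroring the curl smoke request for /v1/generate."""
    payload = {
        "model": model,
        "input_tokens": list(input_tokens),
        "max_output_tokens": max_output_tokens,
        "sampling": {"temperature": 0.0, "top_p": 1.0, "top_k": 0, "seed": 1234},
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send the request and decode the body, assuming the server answers with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running on port 8080, `send(build_generate_request([1, 2, 3, 4]))` should return the parsed response; the exact response fields are not specified here.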
Check readiness and run benchmarks:
# Readiness check
./target/release/ax-engine-bench doctor --mlx-model-artifacts-dir "$MODEL_DIR"
bash scripts/check-server-preview.sh
bash scripts/check-python-preview.sh
# Primary benchmark: AX vs mlx_lm
python3 scripts/bench_mlx_inference_stack.py \
--model-dir /path/to/local/mlx-model \
--prompt-tokens 128,512,2048 --generation-tokens 128 \
--ax-compare-policies --repetitions 5 \
--output benchmarks/results/mlx-inference/$(date +%F)/gemma-4-e2b-it-4bit.json
# Secondary workload-contract benchmark
./target/release/ax-engine-bench scenario \
--manifest benchmarks/manifests/scenario/chat_gemma4_e2b_short.json \
  --output-root benchmarks/results

crates/ax-engine-core     Engine state machine, scheduler, KV manager, sampler
crates/ax-engine-mlx      MLX model graph, n-gram acceleration, KV cache, runner
crates/mlx-sys            bindgen FFI over mlx-c; safe MlxArray RAII wrappers
crates/ax-engine-sdk      Session API, backend resolution (MLX, mlx-lm delegated, or llama.cpp)
crates/ax-engine-server   Axum HTTP/SSE adapter (OpenAI-compatible routes)
crates/ax-engine-bench    Manifest-driven workload-contract CLI
crates/ax-engine-py       PyO3 extension (ABI3, Python 3.10+)
javascript/ax-engine      TypeScript/JS HTTP SDK + LangChain adapter
sdk/go/axengine           Go HTTP SDK
sdk/ruby/                 Ruby HTTP SDK (ax-engine-sdk gem)
sdk/mojo/                 Mojo SDK (Python-interop)

cargo build --workspace # build all crates
cargo test --quiet # full Rust test suite
cargo clippy --all-targets --all-features -- -D warnings # lint (CI gate)
cargo fmt # format
maturin develop # rebuild Python extension
python -m unittest discover -s python/tests -v # Python tests
bash scripts/check-mlx-telemetry.sh # Gemma/AX MLX telemetry gate

For Gemma/AX MLX telemetry and decode-profile changes, prefer the targeted scripts/check-mlx-telemetry.sh gate. Use scripts/check-mlx-telemetry.sh --full-workspace when the change touches shared Rust contracts; that protected path compiles the workspace with cargo test --workspace --no-run --jobs 1 before running crate-by-crate tests.
Coverage is collected by the report-only GitHub Actions workflow in .github/workflows/coverage.yml. It publishes Rust cargo llvm-cov and Python coverage.py artifacts without enforcing a percentage threshold yet.
Public documentation is in docs/. Canonical benchmark manifests are in benchmarks/manifests/. Key design docs:
SDK / API · Python · JavaScript / TypeScript · Go · Ruby · Mojo · Scheduler · KV Cache · Benchmarking · Serving Benchmarks
AX Engine's benchmark design and compatibility checks are informed by local reference checkouts of related open-source projects. A row is published only when it fits the benchmark contract for the specific workload: comparable model artifacts, prompt and sampling policy, prefill/decode/TTFT definitions, repeatability, host/runtime metadata, and provenance.
| Project | Repository |
|---|---|
| ds4 | antirez/ds4 |
| lightning-mlx | samuelfaj/lightning-mlx |
| llama.cpp | ggml-org/llama.cpp |
| mistral.rs | EricLBuehler/mistral.rs |
| MLX | ml-explore/mlx |
| mlx-c | ml-explore/mlx-c |
| mlx-engine | lmstudio-ai/mlx-engine |
| mlx-lm | ml-explore/mlx-lm |
| mlx-turboquant | rachittshah/mlx-turboquant |
| MTPLX | youssofal/MTPLX |
| Rapid-MLX | raullenchai/Rapid-MLX |
| turboquant-mlx | arozanov/turboquant-mlx |
| vLLM | vllm-project/vllm |
Some reference projects are experimental, version-unstable, focused on a different serving route, or not shaped for the same Apple MLX/Metal measurement strategy, so those results remain implementation guidance or diagnostic evidence rather than public comparison rows.
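The publication criteria listed at the start of this section can be pictured as a checklist that every candidate row has to pass in full. The sketch below is purely illustrative: `RowContract` and its field names are hypothetical and do not describe the repository's manifest schema.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RowContract:
    """Hypothetical checklist mirroring the benchmark-contract criteria."""
    comparable_model_artifacts: bool
    matching_prompt_and_sampling_policy: bool
    shared_metric_definitions: bool  # prefill / decode / TTFT mean the same thing
    repeatable: bool
    host_runtime_metadata_recorded: bool
    provenance_recorded: bool


def missing_requirements(row: RowContract) -> List[str]:
    """Names of the criteria a candidate row fails."""
    return [f.name for f in fields(row) if not getattr(row, f.name)]


def is_publishable(row: RowContract) -> bool:
    """A row is a public comparison only when nothing is missing."""
    return not missing_requirements(row)


if __name__ == "__main__":
    diagnostic = RowContract(True, False, True, True, True, True)
    print(is_publishable(diagnostic), missing_requirements(diagnostic))
```

A row that fails any single check stays implementation guidance or diagnostic evidence, in line with the paragraph above.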
- Qwen3.5 long-prompt prefill: Qwen3.5 prefill can trail upstream MLX references on longer prompts; decode and Qwen3-Next are not affected in the same way.
- Raw Hugging Face weights: use pre-sanitized MLX community weights or convert first with mlx_lm.convert.
- N-gram acceleration rows: effective-throughput measurements, not raw model-kernel speedups.
- TurboQuant KV compression: experimental and off by default.
See the FAQ limitations entry for details.
AX Engine welcomes community input through issue tickets, wishlist requests, reproducible benchmark results, and documentation feedback. We generally do not accept unsolicited code PRs, especially for runtime, model, kernel, scheduler, cache, n-gram, or performance-tuning changes.
Performance tuning is tightly coupled: a local speedup can regress correctness, TTFT, memory pressure, direct-vs-n-gram behavior, long-context behavior, serving stability, or another model family. Please open an issue first with the problem, target workload, and evidence so maintainers can choose the right validation path. See CONTRIBUTING.md for issue, wishlist, and benchmark result guidelines.
- Website: automatosx.com
- Discord: Join us
- Email: enquiry@defai.digital
Apache License, Version 2.0. See LICENSE for details.
Copyright (c) 2026 DEFAI Private Limited