AX Engine

AX Engine is a Mac-first LLM inference runtime, local server, SDK layer, and benchmark toolkit for Apple Silicon. It runs direct-support MLX model families natively and sends other MLX text models and non-MLX models through explicit mlx-lm and llama.cpp compatibility routes.

Release Highlights

AX Engine is for developers who want a local OpenAI-compatible model server on Apple Silicon without hiding which runtime path is doing the work.

OpenAI-compatible local text endpoints for common chat and completion flows, with SDKs for Python, TypeScript/JavaScript, Go, Ruby, and Mojo.
Repo-owned MLX runtime paths for direct-support Gemma and Qwen families, with delegated routes kept explicit.
Announcement-ready benchmark claims where evidence is complete: Gemma 4 12B assistant-MTP is 2.34-2.73x faster than same-artifact direct decode, and Qwen3.6 35B-A3B AX MTP is 59.8% faster than the retained MTPLX reference on the public sidecar-fair matrix.
Dedicated Qwen3-Coder-Next direct-support path for local coding agents, called out separately from Qwen3.6 because it has no MTP sidecar but carries its own coding-first architecture and benchmark boundary.
Workload-contract benchmark tooling records route identity, artifacts, prompt suite, sampler, cooldowns, accept rate, and dirty-state provenance.

Quick Start

Install (macOS 26 Tahoe or later, Apple Silicon only — see Typical Hardware):

python3 -m pip install --upgrade pip               # pip 23+ is required to find the wheel
python3 -m pip install -U "ax-engine[download]<7"  # keep the quotes — zsh treats [ ] as a glob

Download a small model and start the server:

MODEL_DIR="$(ax-engine download mlx-community/Qwen3-4B-4bit --json | python3 -c 'import json,sys; print(json.load(sys.stdin)["dest"])')"
ax-engine serve "$MODEL_DIR" --port 8080

High-memory model shortcuts:

# Choose one:
ax-engine serve qwen36-35b --download --port 8080
ax-engine serve gemma4-12b --download --port 8080

Call it from any OpenAI client:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
model = client.models.list().data[0].id

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is AGI?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
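If the openai package is not installed, the same request can be sketched with the standard library. This is a minimal sketch that assumes the server accepts the standard OpenAI chat-completions body at /v1/chat/completions on the port used above; build_chat_request and the "local-model" ID are hypothetical placeholders, not part of the AX Engine SDK.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer local"},
        method="POST",
    )


# "local-model" is a placeholder; use an ID returned by the models endpoint.
req = build_chat_request("http://127.0.0.1:8080/v1", "local-model", "What is AGI?")

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes it easy to inspect the exact URL and payload before pointing a client at the local server.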

Or use the Python SDK directly:

from ax_engine import download_model, Session

path = download_model("mlx-community/Qwen3-4B-4bit")
with Session(mlx=True, mlx_model_artifacts_dir=str(path)) as s:
    print(s.generate([1, 2, 3], max_output_tokens=8).output_tokens)

Quick Start requires macOS 26 (Tahoe) or later on Apple Silicon M2 Max or newer with 32 GB unified memory or more. Earlier macOS releases are not supported — there is no wheel or binary for them. Larger models such as Qwen3.6 35B-A3B and Gemma 4 12B need the memory tiers listed in Typical Hardware.

Installation

Requirements

The published wheel and Homebrew formula are macOS-arm64-only native builds. Before installing, confirm your machine matches:

macOS 26 (Tahoe) or later. Earlier macOS versions are not supported — there is no wheel or formula for them.
Apple Silicon (M2 Max or newer), arm64. Intel Macs are not supported.
Python 3.10 or later for the pip install.
pip 23 or later. Older pip cannot read the wheel's platform tag and will report No matching distribution found. Always run the upgrade step first.

# Check before installing — should print a version >= 26 and "arm64":
python3 -c "import platform; print(platform.mac_ver()[0], platform.machine())"
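The same pre-install check can be written as a small script that reports every mismatch at once. This is a sketch only: check_requirements is a hypothetical helper, and it covers the macOS, architecture, and Python floors listed above but not the chip generation or the pip version.

```python
import platform
import sys


def check_requirements(mac_version, machine, python_version):
    """Return a list of problems; an empty list means the host passes these checks."""
    problems = []
    # platform.mac_ver()[0] is an empty string on non-macOS hosts.
    major = int(mac_version.split(".")[0]) if mac_version else 0
    if major < 26:
        problems.append(f"macOS 26 or later required, found {mac_version or 'non-macOS'}")
    if machine != "arm64":
        problems.append(f"arm64 required, found {machine}")
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10 or later required")
    return problems


if __name__ == "__main__":
    issues = check_requirements(platform.mac_ver()[0], platform.machine(), sys.version_info)
    print("OK" if not issues else "\n".join(issues))
```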

Python wheel

python3 -m pip install --upgrade pip
python3 -m pip install -U "ax-engine[download]<7"
ax-engine doctor

Keep the quotes around the spec — zsh otherwise treats [download] as a glob. The wheel bundles the ax-engine orchestration CLI plus the ax-engine-server and ax-engine-bench binaries, so all three are on your PATH after install. There is no source distribution and no wheel for other platforms; if pip reports No matching distribution found, see Troubleshooting.

Optional extras:

python3 -m pip install -U "ax-engine[openai]<7"      # FastAPI OpenAI shim
python3 -m pip install -U "ax-engine[multimodal]<7"  # image/audio helpers

Homebrew

Homebrew is the native binary channel for tagged macOS arm64 releases. The one-liner auto-taps defai-digital/homebrew-ax-engine:

brew install defai-digital/ax-engine/ax-engine
ax-engine doctor

ax-engine-server and ax-engine-bench are installed alongside the CLI. If doctor fails with Library not loaded: libmlxc.dylib, the mlx-c dependency is missing or stale — reinstall it:

brew install mlx-c && brew reinstall defai-digital/ax-engine/ax-engine

Troubleshooting

No matching distribution found for ax-engine — your machine is not macOS 26+ Apple Silicon, or your pip is too old. Run python3 -m pip install --upgrade pip, then re-check with the Requirements command above. There is no wheel for Intel, Linux, Windows, or macOS earlier than 26.
zsh: no matches found: ax-engine[download] — quote the spec: pip install "ax-engine[download]<7".
An old version installs — make sure you used -U, then confirm the channel is current with python3 -m pip index versions ax-engine or brew info defai-digital/ax-engine/ax-engine.
Anything still off — build from Source, which works on any supported macOS and rebuilds the native binaries locally.

Source

brew install mlx mlx-c protobuf
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip maturin
cargo build --release -p ax-engine-server -p ax-engine-bench
maturin develop --release
export PATH="$PWD/target/release:$PATH"
ax-engine doctor

Getting a Model

AX Engine requires pre-sanitized MLX weights. The recommended source is mlx-community — models there are already converted and validated.

mlx-community (recommended)

ax-engine download, download_model(), and scripts/download_model.py download weights and auto-generate the required model-manifest.json in one step:

# List supported download targets
ax-engine download --list

# Download by alias
ax-engine download qwen36-35b --json
ax-engine download qwen36-27b --json
ax-engine download gemma4-e2b --json
ax-engine download gemma4-12b --json
ax-engine download gemma4-31b --json

# Download and serve in one command
ax-engine serve qwen36-35b --download --port 8080

# Raw mlx-community repo IDs are also accepted
ax-engine download mlx-community/Qwen3.6-35B-A3B-4bit --json
ax-engine download mlx-community/Qwen3-Coder-Next-4bit --json
ax-engine download mlx-community/gemma-4-e2b-it-4bit --json

# Optional: copy snapshot to an explicit directory
ax-engine download qwen36-35b --dest /Volumes/Models/qwen36-35b

# Python SDK
from ax_engine import download_model
path = download_model("mlx-community/Qwen3.6-35B-A3B-4bit")

Built-in download aliases:

Alias	Repo
`qwen36-35b`	`mlx-community/Qwen3.6-35B-A3B-4bit`
`qwen36-27b`, `qwen36-27b-5bit`, `qwen36-27b-6bit`, `qwen36-27b-8bit`	`mlx-community/Qwen3.6-27B-{4,5,6,8}bit`
`gemma4-e2b`, `gemma4-e2b-5bit`, `gemma4-e2b-6bit`, `gemma4-e2b-8bit`	`mlx-community/gemma-4-e2b-it-{4,5,6,8}bit`
`gemma4-12b`, `gemma4-12b-6bit`	`mlx-community/gemma-4-12B-it-{4,6}bit`
`gemma4-26b`	`mlx-community/gemma-4-26b-a4b-it-4bit`
`gemma4-31b`	`mlx-community/gemma-4-31b-it-4bit`
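The brace patterns in the table expand to one repo per alias. The lookup below is an illustrative sketch that mirrors the table, not the CLI's own resolver; it assumes the bare alias maps to the 4-bit repo, as the ordering of each row suggests.

```python
# Repo pattern and the extra bit widths that have a suffixed alias.
_FAMILIES = {
    "qwen36-35b": ("mlx-community/Qwen3.6-35B-A3B-{bits}bit", ()),
    "qwen36-27b": ("mlx-community/Qwen3.6-27B-{bits}bit", (5, 6, 8)),
    "gemma4-e2b": ("mlx-community/gemma-4-e2b-it-{bits}bit", (5, 6, 8)),
    "gemma4-12b": ("mlx-community/gemma-4-12B-it-{bits}bit", (6,)),
    "gemma4-26b": ("mlx-community/gemma-4-26b-a4b-it-{bits}bit", ()),
    "gemma4-31b": ("mlx-community/gemma-4-31b-it-{bits}bit", ()),
}

ALIASES = {}
for base, (pattern, extra_bits) in _FAMILIES.items():
    ALIASES[base] = pattern.format(bits=4)  # assumption: bare alias is the 4-bit repo
    for bits in extra_bits:
        ALIASES[f"{base}-{bits}bit"] = pattern.format(bits=bits)

print(ALIASES["gemma4-12b-6bit"])  # mlx-community/gemma-4-12B-it-6bit
```

ax-engine download --list remains the authoritative source for the current alias set.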

Leave downloads in the Hugging Face Hub cache by default — it's shared with mlx_lm and other HF-aware tools, avoiding duplicate copies of large weights. Use --dest only when you want an explicit copy outside the shared cache.

If you already have mlx_lm installed, its downloads land in the same cache and AX Engine can auto-discover them:

python -m mlx_lm.generate --model mlx-community/Qwen3-4B-4bit --prompt "x" --max-tokens 1
ax-engine-bench generate-manifest ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/<hash>
ax-engine serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/<hash> --port 8080
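The snapshot paths above follow the Hugging Face Hub cache layout, models--<org>--<name>/snapshots/<hash>. A minimal sketch for locating snapshot directories without typing the hash by hand follows; hub_repo_dir and list_snapshots are hypothetical helpers, and the default root assumes the cache location has not been moved (for example through HF_HOME or HF_HUB_CACHE).

```python
from pathlib import Path

DEFAULT_HUB_CACHE = Path.home() / ".cache" / "huggingface" / "hub"


def hub_repo_dir(repo_id, cache_root=DEFAULT_HUB_CACHE):
    """Return the cache folder for a model repo, e.g. models--org--name."""
    return Path(cache_root) / ("models--" + repo_id.replace("/", "--"))


def list_snapshots(repo_id, cache_root=DEFAULT_HUB_CACHE):
    """Return snapshot directories for a repo, or an empty list if none are cached."""
    snapshots = hub_repo_dir(repo_id, cache_root) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


for snapshot in list_snapshots("mlx-community/Qwen3-4B-4bit"):
    print(snapshot)  # pass one of these to generate-manifest or serve
```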

Raw HuggingFace checkpoint

Raw checkpoints need sanitization before AX Engine can load them:

pip install mlx-lm
mlx_lm.convert --hf-path <org/model> --mlx-path /path/to/dest -q --q-bits 4
ax-engine-bench generate-manifest /path/to/dest
ax-engine serve /path/to/dest --port 8080

Manifest generation

Both paths above require model-manifest.json. Download helpers generate it automatically. To generate it yourself, run one of:

ax-engine-bench generate-manifest /path/to/model      # pip, Homebrew, or built binary
cargo run -p ax-engine-core --bin generate-manifest -- /path/to/model  # source

Typical Hardware

For local agent and chatbot workloads, AX Engine is a better fit for a small model portfolio than for one model serving every workflow. See the FAQ model-stack guidance for the full recommendation.

Hardware	Recommended memory	Best fit
Mac mini M4 Pro	64 GB RAM	Compact always-on local chatbot and agent server
MacBook Pro M5 Max	128 GB RAM	Portable high-throughput chatbot, agent, and coding stack
Mac Studio M3 Ultra	256 GB RAM	Larger local model portfolio, longer contexts, and heavier parallel workloads

Role	Recommended model	Setup	App	Why
Default chatbot	Gemma 4 26B-A4B / 31B	4-bit or 6-bit, 16K-32K	ax-studio	General assistant path for reasoning, chat, JSON/function calling, and on-device agent workflows
General agentic model	Qwen3.6-35B-A3B / Qwen3.6-27B	35B A3B 4-bit; 27B 4/5/6/8-bit, 16K-32K	AX server / SDK	Strong general agent and coding balance; sparse MoE keeps active compute low
Coding specialist	Qwen3-Coder-Next	6-bit + 16K default; 4-bit/5-bit + 32K when needed	ax-code	Dedicated local coding-agent path for repo editing, tool use, and long coding sessions

What AX Engine Does

AX Engine gives local inference work a stable runtime contract:

Repo-owned MLX execution tracks direct-support model families separately from delegated routes — delegated results are not AX-owned throughput claims.
Dual-family speculative decoding supports both Qwen3.6's fused MTP sidecar and Gemma 4's separate assistant-drafter contract in the same repo-owned runtime and benchmark tooling.
N-gram acceleration reaches up to 3.1× mlx_lm decode throughput on high-hit benchmark rows with no second draft model.
Long-session prefix reuse restores physical MLX KV snapshots on validated cache layouts, so long-running chat and agent loops avoid repeatedly pre-filling accumulated context. See docs/LONG-CONTEXT.md.
Workload-contract tooling (ax-engine-bench) validates correctness, determinism, route identity, and regression across checked-in manifests.
Delegated routes (mlx_lm_delegated, llama_cpp) cover explicit compatibility cases without polluting AX-owned performance claims.

mlx_lm is the canonical MLX reference. AX Engine compares against mlx_lm.benchmark and keeps mlx_lm.server as the explicit delegated compatibility route when AX does not yet have a repo-owned graph. See the FAQ for the boundary between MLX kernels and AX-owned runtime behavior.

Design details: Scheduler · KV Cache · Long Context · Benchmark Design.

Runtime Paths

Path	Use it for	Current scope
Repo-owned MLX runtime	Direct-support MLX model families and repo-owned performance claims backed by benchmark artifacts	Local Apple Silicon inference, token-based server/SDK requests, direct and n-gram acceleration modes
`mlx_lm_delegated`	MLX text models that upstream `mlx-lm` supports before AX has a repo-owned graph	Blocking and SSE text generation through a user-provided `mlx_lm.server`; not AX-owned token/KV performance
`llama_cpp`	GGUF and non-MLX local inference	Delegated llama.cpp server/CLI compatibility; route-contract evidence, not repo-owned MLX throughput

The runtime report exposes selected_backend, support_tier, and resolution_policy so callers and benchmark artifacts can distinguish these paths. For the exact OpenAI-shaped endpoint contract see docs/API-COMPATIBILITY.md.
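A consumer of these reports might separate repo-owned rows from delegated rows before quoting any throughput number. The sketch below assumes selected_backend carries the route names from the table above; the "mlx" value and the label field are hypothetical, so check a real runtime report for the exact strings.

```python
# Route names from the Runtime Paths table; that selected_backend uses exactly
# these strings is an assumption.
DELEGATED_BACKENDS = {"mlx_lm_delegated", "llama_cpp"}


def split_by_ownership(reports):
    """Separate repo-owned rows from delegated rows so claims stay scoped."""
    owned, delegated = [], []
    for report in reports:
        bucket = delegated if report["selected_backend"] in DELEGATED_BACKENDS else owned
        bucket.append(report)
    return owned, delegated


# Hypothetical reports; only the selected_backend field name comes from the text.
reports = [
    {"selected_backend": "mlx", "label": "repo-owned run"},
    {"selected_backend": "llama_cpp", "label": "GGUF peer run"},
    {"selected_backend": "mlx_lm_delegated", "label": "delegated MLX run"},
]
owned, delegated = split_by_ownership(reports)
print(len(owned), len(delegated))  # 1 2
```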

Public Claim Boundaries

AX Engine's public performance claims are scoped to benchmark artifacts that preserve route identity, model artifacts, prompt suite, sampler settings, and repository provenance.

Area	Public claim	Status
Gemma 4 12B assistant-MTP	2.34-2.73x faster than same-artifact AX direct decode on the 12B MTP prompt suites	Announcement-ready
Gemma 4 26B/31B assistant-MTP	97.3%-99.2% accept rate; MTP+n-gram is workload-dependent (+5.2% for 26B, -0.7% for 31B) in the current matrix	Scoped; no public direct-speedup claim yet
Qwen3.6 35B-A3B MTP	AX MTP is +59.8% vs the retained MTPLX reference, and AX MTP+n-gram is +59.9% vs MTPLX on the sidecar-fair aggregate	Announcement-ready
Qwen3.6 27B MTP	AX MTP is +7.8% vs the retained MTPLX reference; MTP+n-gram is +8.7% vs MTPLX and +0.8% vs pure AX MTP	Opt-in / workload-dependent
Qwen3-Coder-Next direct	AX direct decode is +3.3% to +6.6% vs `mlx_lm` and +17.1% to +23.6% vs shape-compatible llama.cpp Metal (`b9700`, flash-attn) at 128/512/2048 tokens with the opt-in fused expert block enabled	Scoped; direct-only
N-gram acceleration	Up to 3.1x `mlx_lm` decode throughput on high-hit benchmark rows without a second draft model	Workload-dependent

Supported Models

Direct support means AX has a repo-owned ax-engine-mlx graph for the model family and loads MLX safetensors through the AX manifest path. Other MLX text models can still use the explicit mlx_lm_delegated compatibility route.

Family	Direct model IDs	Current scope	Architecture notes
Gemma 4	`gemma-4-e2b-it`, `gemma-4-e4b-it`, `gemma-4-12b-it`, `gemma-4-26b-a4b-it`, `gemma-4-31b-it`	Repo-owned MLX runtime; MLX affine 4/5/6/8-bit weights; assistant-MTP benchmark path	Dense unified 12B, per-layer embedding, and MoE variants; sliding-window + full attention, logit softcapping
Qwen 3	`Qwen3-4B-4bit` and manifest-backed dense checkpoints	Repo-owned MLX runtime	SwiGLU dense FFN; per-head QK norm
Qwen 3.5	`Qwen3.5-9B-MLX-4bit`	Repo-owned MLX runtime	Linear attention + MoE FFN; `attn_output_gate` per-head interleaving
Qwen 3.6	`Qwen3.6-35B-A3B` 4-bit, `Qwen3.6-27B` 4/5/6/8-bit	Repo-owned MLX runtime; fused sidecar-MTP benchmark path	`qwen3_next`: GatedDelta linear attention, full attention with per-head sigmoid gate, sparse top-k MoE
Qwen3-Coder-Next	`Qwen3-Coder-Next-4bit`	Repo-owned MLX runtime; direct coding-agent path	`qwen3_next` coding-specialist checkpoint; hybrid linear/full attention, sparse top-10-of-512 MoE, shared expert, 8-bit router/shared-expert gates
GLM 4.7 Flash	`glm4_moe_lite` / `glm4.7-flash-4bit`	Repo-owned MLX runtime; MLX affine 4-bit weights	Flash MLA attention, sigmoid-routed MoE with dense+MoE layer split, shared expert; post-attention RMS norm

Adding a new architecture means implementing the model graph in ax-engine-mlx, not wiring up a generic loader. Architecture code alone is not a direct-support claim — a model requires a repo-owned graph, manifest, smoke coverage, and benchmark evidence before promotion here. LLaMA, Mistral, Mixtral, DeepSeek, and unlisted Gemma/Qwen variants should use the explicit delegated route.

Before promoting another architecture or checkpoint, run scripts/probe_mlx_model_support.py --model-dir <model-dir>; a model should report repo_owned_runtime_ready only when its manifest, local reference files, and runtime path are all present.

Full list: docs/SUPPORTED-MODELS.md.

Performance

Full result tables and interpretation live in docs/PERFORMANCE.md. Benchmark methodology, test setup, and reproduction details live in docs/BENCHMARKS.md.

Gemma 4 12B

Gemma 4 12B (model_type: gemma4_unified) is reported separately from the per-layer-embedding E2B/E4B and MoE 26B/31B checkpoints because it has a distinct graph, multimodal tensor contract, and benchmark boundary. Upstream mlx_lm 0.31.3 cannot load it (ValueError: Model type gemma4_unified not supported), so the direct peer here is llama.cpp Metal on a shape-compatible GGUF.

Note: AX Engine's repo-owned native MLX route supports Gemma 4 12B text plus inline base64 image/audio/video chat. Delegated compatibility routes remain text-first; /v1/generate accepts the processed multimodal_inputs.gemma4_unified tensor contract.

At a glance:

Direct decode: AX native MLX reaches 61.7-66.0 tok/s on the bit-comparable 4-bit-FFN artifact versus llama.cpp Metal's 56.9-59.2 tok/s depth-matched range.
Context depth: AX's direct margin is +11% / +11% / +8% versus llama.cpp matched-depth decode at 128 / 512 / 2,048 prompt tokens.
Assistant-MTP: depth-2 assistant-MTP reaches 82.9-96.8 tok/s on code-like prompt suites, a 2.34-2.73x same-artifact speedup over AX direct decode.
Why the earlier result flipped: the upstream MLX snapshot keeps FFN weights at 8-bit, so it reads about 1.63x the bytes of the re-quantized 4-bit-FFN artifact (10.98 GB versus 6.74 GB per token). Decode is bandwidth-bound; matching quantization closes the gap.

Direct Decode

AX direct rows use the 4-bit-FFN MLX artifact and random-token prompts. mlx_lm is absent because it has no gemma4_unified graph. The llama.cpp rows are shape-compatible external GGUF references, not prompt-hash-parity MLX rows.

Figure: grouped bar chart comparing Gemma 4 12B 4-bit median prefill throughput for AX Engine native MLX and llama.cpp Metal at 128/512/2048 prompt tokens.

Figure: grouped bar chart comparing Gemma 4 12B 4-bit median time to first token for AX Engine native MLX and llama.cpp Metal at 128/512/2048 prompt tokens.

Prompt tokens	AX decode	llama.cpp decode (depth 0)	llama.cpp decode (matched depth)	AX prefill	llama.cpp prefill	AX TTFT (ms)	llama.cpp TTFT (ms)
128	66.0	59.8	59.2	1,171	1,252	109	102
512	65.6	59.6	58.9	1,839	1,745	278	293
2048	61.7	59.7	56.9	2,004	1,690	1,022	1,212

Read the two llama.cpp decode columns carefully:

depth 0 is plain llama-bench tg, decoding from an empty context and representing llama.cpp's best case.
matched depth uses -d {prompt} -n 128, so decode happens after the same prompt depth AX has already prefilled.
AX wins the matched-depth decode comparison at every prompt size, and AX prefill also leads at 512 and 2,048 tokens.
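The matched-depth margins quoted in the summary can be recomputed from the decode columns of the table. This is plain arithmetic over the published numbers, not output from the benchmark tooling; margin_pct is a name chosen here for illustration.

```python
# (AX decode tok/s, llama.cpp matched-depth decode tok/s) per prompt length,
# copied from the direct-decode table above.
rows = {128: (66.0, 59.2), 512: (65.6, 58.9), 2048: (61.7, 56.9)}


def margin_pct(ax_tok_s, peer_tok_s):
    """Relative decode advantage of AX over the peer, in percent."""
    return (ax_tok_s / peer_tok_s - 1.0) * 100.0


margins = {tokens: round(margin_pct(ax, peer)) for tokens, (ax, peer) in rows.items()}
print(margins)  # {128: 11, 512: 11, 2048: 8}
```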

The table uses the bit-comparable 4-bit-FFN AX artifact (scripts/requantize_gemma4_12b_ffn_4bit.py), about 4.5 bpw versus the Q4_K_M GGUF's about 4.8 bpw. The upstream mlx-community/gemma-4-12B-it-4bit snapshot keeps the FFN at 8-bit (~10.98 GB), so AX decodes it at about 45 tok/s and trails llama.cpp. That is a bytes-read handicap, not an AX runtime result.

Memory bandwidth share:

Decode is memory-bandwidth-bound on Apple Silicon: each token reads the model weights once, so decode tok/s is set by bytes-read and how close the engine gets to the memory ceiling. Measured M5 Max GPU peak read bandwidth ≈ 577 GB/s (MLX reduction over a 6 GB array).

Engine / quantization	Weights/token	Decode tok/s	Effective BW	% of 577 GB/s peak
AX — 8-bit FFN (upstream 4bit snapshot)	10.98 GB	45.0	494 GB/s	86%
AX — 4-bit FFN (re-quantized)	6.74 GB	64.4	434 GB/s	75%
llama.cpp Q4_K_M — decode @ depth 512	7.38 GB	58.9	435 GB/s	75%
llama.cpp Q4_K_M — decode @ depth 0 (`tg`)	7.38 GB	59.8	441 GB/s	76%

The bandwidth view is the key explanation: AX is not under-utilizing memory. The re-quantized AX row sustains 434 GB/s, in the same band as llama.cpp's 435 GB/s at matched depth. The remaining direct-decode difference is bytes read per token: uniform 4-bit group-64 reduces AX to 6.74 GB/token, while Q4_K_M reads 7.38 GB/token. The 8-bit-FFN upstream snapshot has higher bus utilization (86%) but worse speed because it reads far more data.
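The effective-bandwidth column is the product of bytes read per token and decode rate. The sketch below reproduces the table from its first two numeric columns and the 577 GB/s measured peak; it is a worked check of the arithmetic, not part of ax-engine-bench.

```python
PEAK_GB_S = 577.0  # measured M5 Max GPU peak read bandwidth from the text above

# (label, weights read per token in GB, decode tok/s), copied from the table.
rows = [
    ("AX 8-bit FFN (upstream snapshot)", 10.98, 45.0),
    ("AX 4-bit FFN (re-quantized)", 6.74, 64.4),
    ("llama.cpp Q4_K_M @ depth 512", 7.38, 58.9),
    ("llama.cpp Q4_K_M @ depth 0", 7.38, 59.8),
]


def effective_bandwidth(gb_per_token, tok_per_s):
    """GB/s actually streamed if every token reads the weights once."""
    return gb_per_token * tok_per_s


summary = []
for label, gb, tok_s in rows:
    bw = effective_bandwidth(gb, tok_s)
    summary.append((label, round(bw), round(100.0 * bw / PEAK_GB_S)))

for label, bw, share in summary:
    print(f"{label}: {bw} GB/s, {share}% of peak")
```

Reading it the other way, dividing a sustained bandwidth by the bytes per token gives the decode rate an artifact can reach, which is why the smaller 4-bit-FFN artifact decodes faster at a lower bus share.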

Assistant-MTP speculative decode (depth 2):

The assistant-MTP path runs on the assistant bundle and adds a second speculative lever that neither mlx_lm nor llama.cpp has for this model. The published rows use depth-2 draft, first-token confidence gate 0.90, deep-token gate 0.999, and GPU-exact confidence.

Pure assistant-MTP is the default. MTP+n-gram stacking remains opt-in because it is workload-dependent and did not beat pure MTP on every suite.

Suite	Depth	AX direct tok/s	AX MTP tok/s	AX MTP accept	AX MTP+ngram tok/s	AX MTP+ngram accept	n-gram status
flappy	2	35.5	96.8	98.7%	95.0	98.7%	no observed draft path
long_code	2	35.8	92.3	99.1%	95.2	99.1%	no observed draft path
python_modules_long	2	35.4	82.9	97.5%	82.5	97.5%	no observed draft path

No runnable peer benchmark covers Gemma 4 12B assistant-MTP in this matrix: mlx_lm cannot load gemma4_unified, llama.cpp does not expose a Gemma assistant-MTP path, and available MTP peer tools target different sidecar contracts. The AX direct column is retained as a same-prompt baseline from the MTP harness prompts, artifact, and sampler. It is a same-artifact AX improvement view, not a peer-engine MTP comparison.
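The same-artifact speedup range and the opt-in verdict for n-gram stacking both follow directly from the suite table. A short check over the published rows, with variable names chosen here for illustration:

```python
# (AX direct, AX MTP, AX MTP+ngram) decode tok/s per suite, from the table above.
suites = {
    "flappy": (35.5, 96.8, 95.0),
    "long_code": (35.8, 92.3, 95.2),
    "python_modules_long": (35.4, 82.9, 82.5),
}

# Speedup of pure assistant-MTP over direct decode on the same artifact.
speedups = {name: round(mtp / direct, 2) for name, (direct, mtp, _) in suites.items()}

# Suites where stacking n-gram on top of MTP actually helped.
ngram_wins = [name for name, (_, mtp, ngram) in suites.items() if ngram > mtp]

print(speedups)    # {'flappy': 2.73, 'long_code': 2.58, 'python_modules_long': 2.34}
print(ngram_wins)  # ['long_code']
```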

MTP prefill and TTFT — same run:

Suite	AX MTP prefill	AX MTP+ngram prefill	AX MTP ttft ms	AX MTP+ngram ttft ms
flappy	1,928	1,952	187	187
long_code	2,040	2,024	390	394
python_modules_long	1,831	1,812	195	198

Methodology and artifacts:

Direct rows use the 4-bit-FFN artifact, greedy-equivalent sampler, 128 generated tokens, 5 repetitions, 15 s cooldown, and random-token prompts following the mlx_lm.benchmark contract. llama.cpp decode is shown both at depth 0 (tg) and at matched context depth (-d {prompt}). MTP rows use the same 4-bit-FFN assistant-MTP artifact, depth-2 draft, temperature=0.6, top_p=0.95, top_k=20, 1,000 generated tokens, 5 repetitions, 30 s cooldown, and 10 s inter-case cooldown. Host/runtime for the latest direct llama.cpp peer rerun: Apple M5 Max · llama.cpp b9700 / ggml 0.15.2 (Metal, flash-attn) · mlx_lm 0.31.3 has no gemma4_unified support.

Full artifacts: 2026-06-20-gemma-4-12b-it-4bit-direct (AX direct rerun; chart artifact with retained llama.cpp reference rows in gemma-4-12b-it-4bit-with-llama-reference.json; llama.cpp GGUF provenance in llama_cpp_gguf_provenance.json) · 2026-06-20-gemma4-assistant-mtp-ax-mtp-only (AX-only assistant-MTP refresh).

Gemma 4 12B Multimodal

Gemma 4 12B multimodal timing is reported separately from the text benchmark above because media inputs expand into validated Gemma4 unified soft-token spans before the MLX graph runs. The publication-grade timing artifact covers all 17 AX Engine image/audio/video cases through both the native /v1/generate/stream prefill path and the OpenAI-compatible /v1/chat/completions path. The llama.cpp Metal peer rows are cold OpenAI chat endpoint rows for the supported image/audio cases, with prompt cache, slot prompt reuse, and context checkpoints disabled and raw llama.cpp timing/cache metadata recorded.

Figure: bar chart showing Gemma 4 12B multimodal prefill throughput for AX Engine native MLX.

Figure: grouped bar chart comparing Gemma 4 12B multimodal cold chat endpoint latency for llama.cpp Metal on the left and AX Engine on the right.

Coverage	AX cases measured	Expanded input	Median runner prefill TTFT	Median prefill	Median AX chat E2E	llama.cpp peer endpoint
Image	5	275-535 tokens	189.4-316.2 ms	1,447.8-1,692.1 tok/s	1,440.8-1,704.8 ms	5 measured, 401.6-518.7 ms cold chat endpoint
Audio	4	32-771 tokens	75.8-419.4 ms	422.1-1,838.4 tok/s	1,466.5-1,819.2 ms	3 measured, 338.0-464.5 ms cold chat endpoint; 1 skipped: llama.cpp audio cap unstable
Video	4	92-2,355 tokens	106.1-2,973.5 ms	792.0-1,681.0 tok/s	1,500.2-4,441.7 ms	4 skipped: llama.cpp video path unsupported
Combined	4	181-442 tokens	133.2-256.7 ms	1,359.1-1,721.6 tok/s	1,532.4-1,771.6 ms	1 measured, 507.9 ms cold chat endpoint; 3 skipped: video unsupported

Rows use /v1/generate/stream with processed multimodal_inputs.gemma4_unified for runner-time prefill and /v1/chat/completions with inline media for client-wall E2E latency. This run used max_output_tokens=8, 1 warmup, 3 measured repetitions, --max-batch-tokens 4096, a release server binary, 128 GB unified memory, and a clean tracked worktree at 67ce2675a469cf5eecba687f348c649e663011b8.

The llama.cpp peer rows use reference llama.cpp 19bba67c1 with Metal, gemma-4-12B-it-Q4_K_M.gguf, and mmproj-gemma-4-12B-it-Q8_0.gguf. They are OpenAI chat endpoint-latency rows for supported image/audio inputs, not native prefill rows and not a throughput comparison. The fair-peer launch contract is --cache-ram 0 --no-cache-idle-slots --slot-prompt-similarity 0 --ctx-checkpoints 0 plus --llama-cache-policy prompt_cache_disabled; the artifact records raw llama.cpp timings, prompt_tokens_details.cached_tokens, server prompt token counts, and cache counts. Published peer rows require zero reported cached prompt tokens and server prompt-eval token counts at least as large as the cold request's reported prompt tokens. Video-containing peer rows are explicit skips because the local llama.cpp Gemma 4 path does not expose a like-for-like video contract, and audio_cap is skipped because this llama.cpp Gemma 4 audio path fails the warmup-plus-three-repetition contract on the largest audio fixture. The peer chart excludes one measured image case whose AX and llama.cpp output token counts differ, so chart bars compare matched-output rows only. For this Gemma 4 llama.cpp build, most peer text appears in reasoning_content rather than message.content, so the benchmark validates positive response_chars.
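The cold-row rule described above can be sketched as a predicate. Only prompt_tokens_details.cached_tokens and the prompt_cache_disabled policy value come from the text; the other field names (cache_policy, prompt_tokens, server_prompt_eval_tokens) and the sample rows are hypothetical, and the real checks live in scripts/check_gemma4_multimodal_benchmark_artifact.py.

```python
def peer_row_is_publishable(row):
    """Accept a peer row only if it is provably cold (no prompt-cache reuse)."""
    cached = row["prompt_tokens_details"]["cached_tokens"]
    return (
        row.get("cache_policy") == "prompt_cache_disabled"  # unknown policy is rejected
        and cached == 0                                      # no reported cached prompt tokens
        and row["server_prompt_eval_tokens"] >= row["prompt_tokens"]  # server evaluated the full prompt
    )


# Illustrative rows with made-up token counts.
cold = {
    "cache_policy": "prompt_cache_disabled",
    "prompt_tokens": 300,
    "server_prompt_eval_tokens": 300,
    "prompt_tokens_details": {"cached_tokens": 0},
}
warm = dict(cold, server_prompt_eval_tokens=12, prompt_tokens_details={"cached_tokens": 288})

print(peer_row_is_publishable(cold), peer_row_is_publishable(warm))  # True False
```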

Full artifact: 2026-06-09-gemma4-12b-multimodal-cold-peer-matrix. Render charts with:

python3 scripts/render_gemma4_multimodal_charts.py \
  --artifact benchmarks/results/gemma4-multimodal/2026-06-09-gemma4-12b-multimodal-cold-peer-matrix.json \
  --assets-dir docs/assets

To reproduce the supported-case image/audio/video timing matrix from a Gemma 4 12B AX Engine server, use the matrix runner and validate the resulting artifact before publishing charts:

python3 scripts/bench_gemma4_multimodal.py \
  --url http://127.0.0.1:18080 \
  --model gemma-4-12B-it \
  --model-dir /path/to/gemma-4-12B-it-4bit \
  --cases all \
  --layers native_runtime_prefill,openai_chat_e2e \
  --warmup 1 \
  --repetitions 3 \
  --cooldown 1 \
  --max-output-tokens 8 \
  --server-command "target/release/ax-engine-server --model-id gemma-4-12B-it --mlx --mlx-model-artifacts-dir /path/to/gemma-4-12B-it-4bit --max-batch-tokens 4096 --port 18080" \
  --llama-url http://127.0.0.1:<peer-port> \
  --llama-binary /path/to/llama-server \
  --llama-gguf <path-to-gemma-4-12B-it-Q4_K_M.gguf> \
  --llama-mmproj <path-to-mmproj-gemma-4-12B-it-Q8_0.gguf> \
  --llama-cache-policy prompt_cache_disabled \
  --output benchmarks/results/gemma4-multimodal/gemma4-12b-multimodal-cold-peer-matrix.json

python3 scripts/check_gemma4_multimodal_benchmark_artifact.py \
  benchmarks/results/gemma4-multimodal/gemma4-12b-multimodal-cold-peer-matrix.json \
  --min-repetitions 3 \
  --require-modalities image,audio,video \
  --require-build-provenance \
  --readme-ready

For a fair llama.cpp peer rerun, launch the peer llama-server with prompt cache, slot prompt reuse, and context checkpoints disabled, for example --cache-ram 0 --no-cache-idle-slots --slot-prompt-similarity 0 --ctx-checkpoints 0, then validate with --readme-ready. Peer rows with unknown cache policy, reported cached prompt tokens, or server prompt-eval token counts that are too low for a cold prompt are rejected by the artifact checker. Without a matching Gemma 4 12B GGUF and multimodal projector, peer rows are explicit skips. Video rows remain explicit skips until the peer server exposes a like-for-like video path for Gemma 4 12B.

Prepare Gemma 4 12B assistant-MTP artifacts

Gemma 4 12B MLX target and assistant repos are already converted to MLX safetensors — they do not go through ax-engine convert-mtplx or scripts/prepare_mtp_sidecar.py. Download the target and matching assistant, then package them with the Gemma-specific helper:

hf download mlx-community/gemma-4-12B-it-4bit
hf download mlx-community/gemma-4-12B-it-assistant-4bit
python3 scripts/prepare_gemma4_assistant_mtp.py \
  --target mlx-community/gemma-4-12B-it-4bit \
  --assistant mlx-community/gemma-4-12B-it-assistant-4bit

hf download mlx-community/gemma-4-12B-it-6bit
hf download mlx-community/gemma-4-12B-it-assistant-6bit
python3 scripts/prepare_gemma4_assistant_mtp.py \
  --target mlx-community/gemma-4-12B-it-6bit \
  --assistant mlx-community/gemma-4-12B-it-assistant-6bit

The default outputs are quant-specific synthetic HF cache snapshots: models--ax-local--gemma-4-12b-it-4bit-assistant-mtp/snapshots/v1/ and models--ax-local--gemma-4-12b-it-6bit-assistant-mtp/snapshots/v1/. Each package contains the target files, an assistant/ subtree, and ax_gemma4_assistant_mtp.json. Generate or validate the AX manifest before serving:

ax-engine-bench generate-manifest \
  ~/.cache/huggingface/hub/models--ax-local--gemma-4-12b-it-4bit-assistant-mtp/snapshots/v1 \
  --validate
ax-engine-bench generate-manifest \
  ~/.cache/huggingface/hub/models--ax-local--gemma-4-12b-it-6bit-assistant-mtp/snapshots/v1 \
  --validate

Speculative Decoding (MTP)

AX Engine's key Mac advantage is dual-family speculative decoding — it supports both Gemma 4's separate assistant-drafter contract and Qwen3.6's fused sidecar contract in one repo-owned runtime and benchmark surface. A single benchmark surface records route identity, sampler, prompt suite, cooldown, accept behavior, and artifact provenance so the two MTP families are comparable without pretending they use the same architecture.

Gemma 4

Unlike Qwen's fused mtp.* sidecar, Gemma 4's multi-token prediction uses a small assistant drafter that shares the target's tokenizer and embedding table, drafts tokens from the target's last-layer hidden state, and attends to the target's own KV cache. Draft depth is configurable: 26B/31B benchmarks use depth 1 (one draft token per step); 12B uses depth 2 (two draft tokens per step, with the second conditioned on the first). AX runs it assistant-MTP-only (mtp, default) and with n-gram stacked on top (mtp-ngram, opt-in).

A draft confidence gate (AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE, default 0.90 for the first draft token; deep draft default 0.999) only proposes a draft when the drafter's top-token probability clears the threshold, keeping accept high while remaining correctness-preserving. Lower the gate toward 0 for more speculation on predictable content; raise it for flatter sampled chat.

The gate is a speed knob, not a quality knob -- lowering it does not corrupt output (e.g. code). Every drafted token is verified by the target model before it is emitted (rejection sampling when draft log-probs exist, greedy argmax-match otherwise), so a mismatched draft is discarded and replaced by the target's own token. Relaxing the gate only lets the drafter propose more speculative tokens; it lowers the accept rate and shifts throughput, but the emitted sequence is still the verified target sequence. Output-altering approximations are separate, explicit opt-ins such as top-k target softmax, never the confidence gate.
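The verify-then-emit argument can be illustrated with a toy greedy model. This is a sketch of the general draft-and-verify idea under simplifying assumptions (integer tokens, a deterministic stand-in target, a drafter that is deliberately wrong on some positions, greedy argmax-match only); it is not the AX runtime's implementation, and every name in it is hypothetical.

```python
def target_next(ctx):
    """Toy deterministic stand-in for the target model's greedy next token."""
    return (ctx[-1] * 7 + len(ctx)) % 11


def drafter_next(ctx):
    """Toy drafter: wrong (and less confident) whenever len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    if len(ctx) % 4 == 0:
        return (guess + 1) % 11, 0.60
    return guess, 0.95


def greedy_decode(prompt, n):
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, gate, depth=2):
    """Return (tokens, proposed, accepted) for gated draft-and-verify decoding."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < n:
        # Draft phase: propose up to `depth` tokens while confidence clears the gate.
        drafts, ctx = [], list(out)
        for _ in range(depth):
            tok, conf = drafter_next(ctx)
            if conf < gate:
                break
            drafts.append(tok)
            ctx.append(tok)
        proposed += len(drafts)
        # Verify phase: the target checks every draft before it is emitted.
        for tok in drafts:
            want = target_next(out)
            if tok != want:
                out.append(want)  # mismatch: emit the target's token, drop the rest
                break
            out.append(tok)
            accepted += 1
        else:
            out.append(target_next(out))  # nothing rejected: target emits one more token
    return out[len(prompt):len(prompt) + n], proposed, accepted


reference = greedy_decode([1, 2, 3], 20)
for gate in (0.0, 0.9, 1.1):
    tokens, proposed, accepted = speculative_decode([1, 2, 3], 20, gate)
    print(gate, tokens == reference, proposed, accepted)
```

In this toy the emitted tokens match plain greedy decoding at every gate; only the number of proposed and accepted drafts changes, which is the sense in which the gate is a throughput dial.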

Choosing the gate by workload. Because the output is verified either way, the gate is a throughput dial, not a safety one — pick it by how predictable your content is, and (only for temperature-sampled chat) how much reply diversity you want. Lower gate = more speculation = lower accept rate but more multi-token runs. Starting points:

Workload	Suggested gate	Expected accept¹	Why
Coding	`~0.90` (aggressive)	high (97.5–99.1% measured on 12B code suites)	Sharply peaked output makes the first draft token useful even with a looser gate. Deterministic, so no diversity cost -- tune purely for speed.
Agentic (tools / JSON / reasoning)	`~0.90–0.95`	high (~93–96% expected on code-like templates)	Templated and low-temperature like code; output is verified, so no correctness risk. Keep n-gram stacking opt-in unless the workload is measured.
Chatbot	`~0.99–0.999` if sampling for variety; lower at low temperature	drops on flat text	Natural language is flatter, so accept falls faster; at temperature > 0 a low gate makes replies follow the greedy token and feel less varied. Here a high gate protects diversity, not correctness.

¹ Only the code-like benchmark suites below (flappy, long_code, python_modules_long) are measured for 12B at the Phase 4 default -- they sit at 97.5-99.1% assistant accept and still deliver 2.34-2.73x same-artifact speedup over direct decode. The agentic and chatbot figures are expected ranges, and the suggested gates are starting points, not universal optima. The assistant_mtp_gate* ablation profiles lock the exact per-workload sweet spot.

One flag instead of the env vars. Rather than hand-set the gate knobs, the server accepts --speculation-profile {auto,coding,agentic,chatbot} (short -s, alias --spec; or env AX_MLX_SPECULATION_PROFILE), which bundles the MTP and n-gram configuration into one posture. auto (default) is temperature-driven: it keeps the shipped gate at low/zero temperature and raises it for higher-temperature sampled chat to protect reply diversity. coding/agentic keep the shipped gate defaults — the 12B ablation found lowering the Gemma gate does not add code throughput, so the default already is the throughput setting — while chatbot raises the gate and prefers the n-gram utility gate. Any explicit per-knob env var (e.g. AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE) still overrides the profile. The resolved posture is recorded in route metadata as ax_mlx_speculation_profile.
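The precedence described above (explicit env var first, then the profile) can be sketched like this. Only the env var name, the profile names, and the 0.90 shipped default come from this README. `RAISED_GATE`, `SAMPLED_TEMPERATURE`, and `resolve_gate` are hypothetical placeholders, and the real resolver also configures n-gram behavior, which is omitted here.

```python
import os
from typing import Mapping, Optional, Tuple

SHIPPED_GATE = 0.90          # default quoted in this README
RAISED_GATE = 0.99           # assumption: stand-in for the raised chat gate
SAMPLED_TEMPERATURE = 0.5    # assumption: where "higher-temperature" begins
GATE_ENV = "AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE"


def resolve_gate(
    profile: str,
    temperature: float,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[float, str]:
    """Return (gate, source) for one request."""
    env = os.environ if env is None else env

    # 1. An explicit per-knob env var always wins over the profile.
    override = env.get(GATE_ENV)
    if override is not None:
        return float(override), "env"

    # 2. Otherwise the profile decides.
    if profile == "auto":
        raised = temperature > SAMPLED_TEMPERATURE
        return (RAISED_GATE if raised else SHIPPED_GATE), "auto"
    if profile in ("coding", "agentic"):
        return SHIPPED_GATE, profile
    if profile == "chatbot":
        return RAISED_GATE, profile
    raise ValueError(f"unknown speculation profile: {profile!r}")


print(resolve_gate("auto", temperature=0.0, env={}))
print(resolve_gate("chatbot", temperature=0.7, env={GATE_ENV: "0.95"}))
```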

No peer engine (MTPLX, Rapid-MLX, lightning-mlx) exposes a runnable Gemma 4 assistant-MTP path, so this benchmark has no peer comparison rows.

Gemma 4 speculative decoding holds draft accept ≥97% on every cell below (97.3–99.2% across 26B / 31B × {MTP, MTP+n-gram} × {flappy, long_code, python_modules_long}).

The 26B/31B public run below is the promotion-grade assistant-MTP matrix only; unpublished retry fragments and failed direct-baseline attempts are excluded from this artifact set. Without a complete same-artifact direct row for these two models, the public verdict is scoped to MTP+n-gram versus pure assistant-MTP. In that scope n-gram is keep-opt-in: +5.2% median decode for 26B and -0.7% for 31B, with workload-specific regressions still present.

Charts: Gemma 4 26B A4B 4-bit · Gemma 4 31B 4-bit

Model	Suite	Depth	AX MTP tok/s	AX MTP accept	AX MTP+ngram tok/s	AX MTP+ngram accept
Gemma 4 26B A4B 4-bit	flappy	1	128.8	99.2%	137.3	99.2%
Gemma 4 26B A4B 4-bit	long_code	1	136.7	99.0%	136.9	99.0%
Gemma 4 26B A4B 4-bit	python_modules_long	1	130.1	98.7%	125.3	98.7%
Gemma 4 31B 4-bit	flappy	1	39.4	99.2%	39.1	99.2%
Gemma 4 31B 4-bit	long_code	1	40.0	99.1%	40.4	99.1%
Gemma 4 31B 4-bit	python_modules_long	1	37.4	97.3%	37.1	97.3%

Prefill and TTFT — same run:

Model	Suite	AX MTP prefill	AX MTP+ngram prefill	AX MTP ttft ms	AX MTP+ngram ttft ms
Gemma 4 26B A4B 4-bit	flappy	2,690	2,711	131	130
Gemma 4 26B A4B 4-bit	long_code	4,026	4,034	202	202
Gemma 4 26B A4B 4-bit	python_modules_long	2,923	2,854	130	132
Gemma 4 31B 4-bit	flappy	723	750	487	478
Gemma 4 31B 4-bit	long_code	807	809	987	980
Gemma 4 31B 4-bit	python_modules_long	741	743	472	472

The gated assistant already captures most of the speculation, so stacking n-gram on top stays opt-in. Sampler: temperature=0.6, top_p=0.95, top_k=20; 1,000 generated tokens, 5 repetitions, 30 s cooldown, 10 s inter-case cooldown. Apple M5 Max · AX Engine v6.5.2.

Full artifacts: 2026-06-20-gemma4-assistant-mtp-ax-mtp-only.

Reproduce this benchmark

python3 scripts/bench_gemma4_assistant_mtp.py \
  --models 26b-a4b-4bit,31b-4bit \
  --modes mtp,mtp-ngram \
  --suites flappy,long_code,python_modules_long \
  --max-tokens 1000 --repetitions 5
python3 scripts/render_gemma4_assistant_mtp_charts.py \
  --results-dir benchmarks/results/gemma4-assistant-mtp/<run-dir>

Artifacts land under benchmarks/results/gemma4-assistant-mtp/; SVGs render into docs/assets/. Tune the accept/throughput trade-off with AX_MLX_GEMMA4_ASSISTANT_MTP_DRAFT_MIN_CONFIDENCE (default 0.90; 0 disables the first-position gate) and AX_MLX_GEMMA4_ASSISTANT_MTP_DEEP_DRAFT_MIN_CONFIDENCE (default 0.999). MTP+n-gram stacking is opt-in: use --mlx-mtp-enable-ngram-stacking through the server/SDK path, or set AX_MLX_MTP_DISABLE_NGRAM_STACKING=0 for low-level benchmark runs.

Qwen 3.6

Three-way MTP comparison across two engines (MTPLX 0.3.7, AX Engine MTP, AX Engine MTP+n-gram) using standard Qwen/Qwen3.6-* sidecars plus matching mlx-community/*-4bit MLX bases. No Youssofal/*MTPLX* bundles are used. All three configurations run on the same prompt suites, token caps, sampler, warmup, repetition count, and cooldown.

AX MTP runs the shipped default draft confidence gate (AX_MLX_MTP_DRAFT_MIN_CONFIDENCE, default 0.90). The accept columns below come from the same default-gate rerun as the throughput rows; use docs/MTP-DRAFT-GATE-THROUGHPUT.md when tuning the accept/throughput trade-off for a specific workload.

The aggregate improvement view below uses sample medians across all three suites. The 35B-A3B sidecar is the clear public win: AX MTP is +59.8% vs the retained MTPLX reference, while AX MTP+n-gram is +59.9% vs MTPLX and +0.1% vs pure AX MTP. The 27B row is workload-dependent but positive in this rerun: pure AX MTP is +7.8% vs MTPLX, and AX MTP+n-gram is +8.7% vs MTPLX and +0.8% vs pure AX MTP. Stacking remains opt-in because the per-suite win is not uniform.

The latency view follows the same boundary. On Qwen3.6 35B-A3B, AX wins every listed MTPLX prefill and TTFT row because the sidecar path stays inside the repo-owned MLX runner and records the target-model prefill separately from speculative verification. On Qwen3.6 27B, prefill and TTFT are intentionally called mixed: AX is close, but the 27B sidecar does not show a clean latency win on every suite. Treat the 35B-A3B rows as the public MTP latency advantage and the 27B rows as workload-dependent.

Charts: Qwen3.6 27B 4-bit · Qwen3.6 35B-A3B 4-bit

Model	Suite	Depth	MTPLX tok/s	MTPLX accept	AX tok/s	AX accept	AX+ngram tok/s	AX+ngram accept
Qwen3.6 27B 4-bit	flappy	3	56.1	100.0% (96.0-100.0)	61.4	99.7% (97.3-100.0)	61.6	99.7% (97.3-100.0)
Qwen3.6 27B 4-bit	long_code	3	57.9	99.7% (98.4-100.0)	60.5	99.6% (98.9-100.0)	61.0	99.6% (98.9-100.0)
Qwen3.6 27B 4-bit	python_modules_long	3	52.7	87.6% (81.2-95.0)	52.0	97.8% (97.1-98.4)	51.6	97.8% (97.1-98.4)
Qwen3.6 35B-A3B 4-bit	flappy	1	104.3	49.5% (42.3-60.6)	169.0	100.0% (99.4-100.0)	168.8	100.0% (99.4-100.0)
Qwen3.6 35B-A3B 4-bit	long_code	1	105.6	51.4% (43.1-66.7)	164.7	99.9% (99.6-100.0)	166.8	99.9% (99.6-100.0)
Qwen3.6 35B-A3B 4-bit	python_modules_long	1	98.2	42.6% (37.0-46.1)	166.7	97.9% (97.7-99.3)	163.3	97.9% (97.7-99.3)

Accept cells show median with (min–max) range across the suite's cases × 5 reps, so the run-to-run spread on the borderline python_modules_long suite is visible rather than hidden behind a single point.
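The aggregate percentages quoted before the table can be reproduced from the rounded decode tok/s cells by taking the median across the three suites per configuration. This is one reading of "sample medians" that matches the published figures; the artifact summaries may compute them from unrounded samples.

```python
from statistics import median

# Decode tok/s per suite (flappy, long_code, python_modules_long), from the table above.
DECODE_TOK_S = {
    "Qwen3.6 27B 4-bit": {
        "mtplx": [56.1, 57.9, 52.7],
        "ax": [61.4, 60.5, 52.0],
        "ax_ngram": [61.6, 61.0, 51.6],
    },
    "Qwen3.6 35B-A3B 4-bit": {
        "mtplx": [104.3, 105.6, 98.2],
        "ax": [169.0, 164.7, 166.7],
        "ax_ngram": [168.8, 166.8, 163.3],
    },
}


def pct_gain(new: float, base: float) -> float:
    """Percentage improvement of new over base, rounded to one decimal."""
    return round((new / base - 1.0) * 100.0, 1)


def summarize(rows: dict) -> dict:
    med = {name: median(values) for name, values in rows.items()}
    return {
        "ax_vs_mtplx": pct_gain(med["ax"], med["mtplx"]),
        "ngram_vs_mtplx": pct_gain(med["ax_ngram"], med["mtplx"]),
        "ngram_vs_ax": pct_gain(med["ax_ngram"], med["ax"]),
    }


SUMMARY = {model: summarize(rows) for model, rows in DECODE_TOK_S.items()}
for model, gains in SUMMARY.items():
    print(model, gains)
```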

Prefill throughput (tok/s) — same run:

MTPLX prefill is derived from prompt_tokens / prompt_eval_time_s (runner-level). AX prefill is measured at runner level. Both are pure GPU compute measurements.

Model	Suite	Depth	MTPLX tok/s	AX MTP tok/s	AX MTP+ngram tok/s
Qwen3.6 27B 4-bit	flappy	3	657	678	683
Qwen3.6 27B 4-bit	long_code	3	793	789	790
Qwen3.6 27B 4-bit	python_modules_long	3	680	692	693
Qwen3.6 35B-A3B 4-bit	flappy	1	1,520	1,795	1,803
Qwen3.6 35B-A3B 4-bit	long_code	1	2,431	2,673	2,706
Qwen3.6 35B-A3B 4-bit	python_modules_long	1	1,654	1,973	1,935

Time to first token (ms) — same run:

MTPLX TTFT is derived from prompt_eval_time_s. AX TTFT is a runner-time measurement. Both are pure prefill measurements.

Model	Suite	Depth	MTPLX ms	AX MTP ms	AX MTP+ngram ms
Qwen3.6 27B 4-bit	flappy	3	489	474	470
Qwen3.6 27B 4-bit	long_code	3	905	909	909
Qwen3.6 27B 4-bit	python_modules_long	3	509	506	505
Qwen3.6 35B-A3B 4-bit	flappy	1	213	179	178
Qwen3.6 35B-A3B 4-bit	long_code	1	295	269	265
Qwen3.6 35B-A3B 4-bit	python_modules_long	1	206	174	179

Sampler: temperature=0.6, top_p=0.95, top_k=20; 1,000 gen tokens, 5 repetitions, 30 s cooldown, 10 s inter-case cooldown. MTPLX 0.3.7 reference rows are retained from the full 2026-06-07 run; AX Engine rows are refreshed on v6.5.2.

Full artifacts: 2026-06-20-qwen36-ax-mtp-only (AX-only rerun) · 2026-06-20-qwen36-merged-ax-refresh (README chart artifact with retained MTPLX reference rows).

Reproduce this benchmark

ax-engine convert-mtplx mlx-community/Qwen3.6-27B-4bit \
  --mtp-source Qwen/Qwen3.6-27B \
  --fair-base-only
ax-engine convert-mtplx mlx-community/Qwen3.6-35B-A3B-4bit \
  --mtp-source Qwen/Qwen3.6-35B-A3B \
  --fair-base-only
python3 scripts/bench_qwen36_mtp_fair.py \
  --engines mtplx ax \
  --modes mtp mtp-ngram \
  --models 27b-4bit 35b-a3b-4bit \
  --suites flappy long_code python_modules_long \
  --max-tokens 1000 \
  --repetitions 5 \
  --cooldown 30

convert-mtplx wraps the generic sidecar packager, applies model-specific defaults when optional knobs are omitted (Qwen3.6 27B depth 3; 35B-A3B depth 1), and validates ax_mtp_sidecar_manifest.json before reporting success. The generated summary.md, summary.json, and decode-tok-s.svg live under benchmarks/results/mtp-fair/. Full methodology and caveats in docs/PERFORMANCE.md#mtp-mode.

Direct Decode · Prefill · TTFT

DiffusionGemma

DiffusionGemma is a block-diffusion Gemma4 26B checkpoint, not an ordinary autoregressive decoder. AX runs it with a native MLX graph, but the measurement boundary is different from the direct-decode families below: the first visible output comes from a committed 256-token diffusion block, not from a single next-token step.

Because of that generation shape, the rows below intentionally do not use the plain decode tok/s or TTFT labels used for autoregressive models. In Qwen, Gemma 4 text, and other next-token decoders, TTFT means prompt prefill plus the first single-token decode step, and decode tok/s means the steady token-by-token autoregressive loop. DiffusionGemma instead runs a bidirectional denoise pass over a 256-token canvas, then performs a causal commit for that block. The comparable boundary inside this runtime is therefore time to first block and first-block decode. Treating these as ordinary TTFT/decode rows would make the result look directly comparable to autoregressive throughput even though the work per visible output boundary is different.

The charts keep the same 128 / 512 / 2,048 prompt-token layout as the autoregressive sections for readability, but the values are AX first-block telemetry. Peer bars are intentionally omitted rather than shown as zero: current llama.cpp Metal cannot load the GGUF (unknown model architecture: 'diffusion-gemma'), and mlx_lm 0.31.3 cannot load the MLX snapshot (Model type diffusion_gemma not supported.).

Bar chart showing measured AX direct DiffusionGemma first-block decode throughput at 128, 512, and 2048 prompt tokens

Bar chart showing measured AX direct DiffusionGemma prefill throughput at 128, 512, and 2048 prompt tokens

Bar chart showing measured AX direct DiffusionGemma time to first committed block at 128, 512, and 2048 prompt tokens

Prompt tokens	AX first-block decode	Denoise steps	Committed block
128	30.7 tok/s	48	256 tokens
512	58.9 tok/s	25	256 tokens
2048	32.1 tok/s	48	256 tokens

Prefill and first-block latency:

Prompt tokens	AX direct prefill	AX time to first block	llama.cpp Metal 9650	`mlx_lm` 0.31.3
128	1,351.8 tok/s	8,428 ms	load blocked	load blocked
512	3,002.1 tok/s	4,518 ms	load blocked	load blocked
2048	4,031.4 tok/s	8,475 ms	load blocked	load blocked

time to first block is prefill wall time plus the first 256-token denoise-and-commit block. first-block decode is 256 tokens divided by the block wall time (ax_mlx_diffusion_block_wall_us, converted from microseconds to seconds). Use these rows to track AX's DiffusionGemma path; do not compare them directly with ordinary autoregressive TTFT or fixed-token decode throughput.
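The two definitions above reduce to a few lines of arithmetic. The sketch below assumes `ax_mlx_diffusion_block_wall_us` is in microseconds (as its suffix suggests) and cross-checks the published rows by rebuilding time to first block from the rounded prefill and first-block decode cells. Small differences against the table are expected because the inputs are rounded.

```python
BLOCK_TOKENS = 256  # committed block size used throughout this section


def first_block_decode_tok_s(block_wall_us: float) -> float:
    """256 tokens divided by the block wall time, converted to seconds."""
    return BLOCK_TOKENS / (block_wall_us / 1_000_000)


def time_to_first_block_ms(prompt_tokens: int, prefill_tok_s: float,
                           first_block_tok_s: float) -> float:
    """Prefill wall time plus one denoise-and-commit block, in milliseconds."""
    prefill_s = prompt_tokens / prefill_tok_s
    block_s = BLOCK_TOKENS / first_block_tok_s
    return (prefill_s + block_s) * 1000.0


# (prompt tokens, prefill tok/s, first-block decode tok/s, published ms)
ROWS = [
    (128, 1351.8, 30.7, 8428),
    (512, 3002.1, 58.9, 4518),
    (2048, 4031.4, 32.1, 8475),
]

for prompt, prefill, decode, published in ROWS:
    rebuilt = time_to_first_block_ms(prompt, prefill, decode)
    print(f"{prompt:>5} tokens: rebuilt {rebuilt:,.0f} ms, published {published:,} ms")
```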

Runtime path	Model artifact	Benchmark status
AX direct MLX	`mlx-community/diffusiongemma-26B-A4B-it-4bit`	Measured: 1 warmup + 5 measured repetitions, 15 s cooldown, medians reported
llama.cpp Metal 9650	4-bit GGUF	Blocked at load: `unknown model architecture: 'diffusion-gemma'`
`mlx_lm` 0.31.3	4-bit MLX snapshot	Blocked at load: `Model type diffusion_gemma not supported.`

Memory bandwidth share:

The bandwidth chart is an implementation-efficiency view, not a peer comparison. It estimates first-block traffic at block granularity from the measured denoise-step count plus one causal commit over the 16.54 GB MLX safetensors artifact. This rerun used 48 / 25 / 48 denoise steps at 128 / 512 / 2,048 prompt tokens, so the estimated traffic is much larger than a one-step early-exit block. The chart shows estimated bandwidth used versus the M5 Max theoretical ceiling; the table keeps the effective GB/s values.

Prompt tokens	Estimated effective bandwidth	% of 614.4 GB/s M5 Max theoretical bandwidth
128	97.3 GB/s	15.8%
512	98.9 GB/s	16.1%
2,048	101.8 GB/s	16.6%

At these prompt lengths, the first-block path uses roughly 16% of theoretical M5 Max bandwidth. The current bottleneck is therefore not raw memory bandwidth alone; the next optimization target is denoise graph reuse, dispatch overhead, and convergence behavior under stricter quality gates.
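As a rough cross-check, the bandwidth rows can be approximated from the other published numbers. The sketch assumes each denoise step and the single causal commit read the full 16.54 GB artifact once, and that block wall time is 256 divided by the first-block decode rate. Both are readings of the description above rather than the harness's exact accounting, so the results land within a few tenths of a GB/s of the table instead of matching it exactly.

```python
ARTIFACT_GB = 16.54               # MLX safetensors artifact size quoted above
THEORETICAL_PEAK_GB_S = 614.4     # M5 Max theoretical bandwidth quoted above
BLOCK_TOKENS = 256


def estimated_bandwidth_gb_s(denoise_steps: int, first_block_tok_s: float) -> float:
    """(denoise steps + one commit) full-weight passes over the block wall time."""
    passes = denoise_steps + 1
    block_seconds = BLOCK_TOKENS / first_block_tok_s
    return passes * ARTIFACT_GB / block_seconds


def percent_of_theoretical(gb_s: float) -> float:
    return gb_s / THEORETICAL_PEAK_GB_S * 100.0


# (prompt tokens, denoise steps, first-block decode tok/s, published GB/s)
ROWS = [(128, 48, 30.7, 97.3), (512, 25, 58.9, 98.9), (2048, 48, 32.1, 101.8)]

for prompt, steps, tok_s, published in ROWS:
    estimate = estimated_bandwidth_gb_s(steps, tok_s)
    print(f"{prompt:>5}: ~{estimate:.1f} GB/s (published {published} GB/s, "
          f"{percent_of_theoretical(published):.1f}% of theoretical)")
```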

Denoise loop optimization — GPU-native sampling:

crates/ax-engine-mlx/src/diffusion.rs keeps denoise state, entropy-bound acceptance, and self-conditioning on the GPU. Convergence checks materialize only scalar counters and run every convergence_check_interval steps (default 4), reducing per-block GPU/CPU syncs from 48 to about 12. The CPU no longer round-trips 256 token positions on every denoise step; sampling and acceptance stay in lazy MLX graph nodes that can fuse with the forward evaluation.

Adaptive convergence detection:

The denoise loop can stop early when any configured convergence signal fires:

Strict stability: argmax is unchanged for convergence_steps consecutive checks and mean entropy is below entropy_threshold (default 0.005).
Low update rate: the accepted-position update rate drops below acceptance_rate_threshold (default 1%), so another denoise pass is unlikely to change the block materially.
Entropy plateau: mean entropy stops decreasing materially after the early denoise phase, indicating diminishing returns from additional passes.
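The three signals can be combined in a single early-stop check, sketched below. The field names `convergence_check_interval`, `convergence_steps`, `entropy_threshold`, and `acceptance_rate_threshold` and their quoted defaults come from this section. The `convergence_steps` default, `plateau_warmup_steps`, and `plateau_min_drop` are hypothetical, and the real loop in diffusion.rs keeps this state on the GPU rather than in Python scalars.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConvergenceConfig:
    convergence_check_interval: int = 4      # default quoted above
    convergence_steps: int = 2               # assumption: default not stated
    entropy_threshold: float = 0.005         # default quoted above
    acceptance_rate_threshold: float = 0.01  # 1%, quoted above
    plateau_warmup_steps: int = 8            # hypothetical "early denoise phase"
    plateau_min_drop: float = 1e-4           # hypothetical "material" decrease


def stop_signal(cfg: ConvergenceConfig, step: int, stable_checks: int,
                mean_entropy: float, prev_mean_entropy: float,
                update_rate: float) -> Optional[str]:
    """Return the name of the first signal that fires, or None to keep denoising."""
    if step % cfg.convergence_check_interval != 0:
        return None  # only materialize scalar counters on check steps
    if stable_checks >= cfg.convergence_steps and mean_entropy < cfg.entropy_threshold:
        return "strict_stability"
    if update_rate < cfg.acceptance_rate_threshold:
        return "low_update_rate"
    past_warmup = step >= cfg.plateau_warmup_steps
    if past_warmup and prev_mean_entropy - mean_entropy < cfg.plateau_min_drop:
        return "entropy_plateau"
    return None


def checks_per_block(max_steps: int, interval: int) -> int:
    """How many times a full-length block would sync for a convergence check."""
    return max_steps // interval


cfg = ConvergenceConfig()
print(checks_per_block(48, cfg.convergence_check_interval))
print(stop_signal(cfg, step=4, stable_checks=2, mean_entropy=0.001,
                  prev_mean_entropy=0.002, update_rate=0.5))
```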

The benchmark rows above report the measured adaptive-convergence run as recorded in the artifact. This rerun did not converge after one denoise step: it used 48 / 25 / 48 denoise steps at 128 / 512 / 2,048 prompt tokens. Time to first block therefore tracks the full measured denoise work for the 128- and 2,048-token rows and a mid-run early exit for the 512-token row.

Experimental denoise optimizations (opt-in):

The default path above uses no optional optimizations. The following environment variables enable experimental fast paths for benchmarking and development. All are off by default and should be considered preview/experimental until they are validated across prompt lengths, multi-block generation, and token-equivalence against the default imperative path.

Environment variable	What it does	Status
`AX_DIFFUSION_COMPILED_FORWARD=1`	Compiles the bidirectional denoise forward pass into an `MlxClosure` per block, collapsing ~250 per-step MLX C-API calls into one dispatched graph.	Experimental / benchmarking
`AX_DIFFUSION_FULL_PIPELINE=1`	Compiles the entire denoise step (forward + softmax + entropy + argmax + sampling + acceptance) into a single `MlxClosure`. Supersedes `AX_DIFFUSION_COMPILED_FORWARD` when both are set.	Experimental / benchmarking
`AX_DIFFUSION_KV_CONCAT_BUFFER=1`	Pre-allocates per-layer KV concatenation buffers on the first denoise step and updates only the canvas slice on subsequent steps, avoiding re-copying the cached prompt prefix. Most beneficial when multiple denoise steps are needed.	Experimental / benchmarking
`AX_DIFFUSION_EMBEDDING_CACHE=1`	Caches per-layer embedding inputs across denoise steps when token IDs are unchanged, using a GPU-side sum fingerprint to detect changes.	Experimental / benchmarking
`AX_DIFFUSION_SKIP_COMMIT_ON_CONVERGE=1`	Skips the causal commit forward pass when the denoise loop converges at step 1 with near-perfect acceptance (≥ 99%).	Experimental / benchmarking

Example usage for a single benchmark run:

AX_DIFFUSION_FULL_PIPELINE=1 \
AX_DIFFUSION_KV_CONCAT_BUFFER=1 \
python3 scripts/bench_diffusion_gemma_direct.py --bench-bin target/release/ax-engine-bench

These flags are read once per process. Do not enable them in production serving without first verifying output token equivalence against the default path on your target prompts.
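One way to run the token-equivalence check is to capture the emitted token IDs from a default-path run and from a run with the experimental flags, then compare them position by position. The helper below is a hypothetical sketch: how the token IDs are collected is left to your harness, and the comparison is only meaningful when both runs use a deterministic sampler configuration.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the first index where the two token streams differ, or None if equal."""
    for index, (expected, actual) in enumerate(zip(baseline, candidate)):
        if expected != actual:
            return index
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))  # one stream ended early
    return None


def assert_token_equivalent(baseline: Sequence[int], candidate: Sequence[int]) -> None:
    index = first_divergence(baseline, candidate)
    if index is not None:
        raise AssertionError(f"token streams diverge at position {index}")


# Placeholder token IDs standing in for two captured runs.
default_run = [101, 7, 42, 9]
flagged_run = [101, 7, 42, 9]
assert_token_equivalent(default_run, flagged_run)
```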

Artifacts: AX direct rows are 2026-06-20-direct-first-block-rerun/summary.json, with the human summary in summary.md. Peer runtime blockers are recorded as load failures, so there are no llama.cpp or mlx_lm result artifacts for this model family.

Render charts with:

python3 scripts/bench_diffusion_gemma_direct.py --skip-benchmark

Decode acceleration model — no MTP:

DiffusionGemma's acceleration model is the diffusion block itself. It does not stack with MTP or n-gram acceleration because those techniques assume an autoregressive next-token loop:

	MTP (speculative decoding)	DiffusionGemma (block diffusion)
Generation	Draft-then-verify, one token at a time	256-token blocks via bidirectional denoising
Forward pass	Causal only	Bidirectional (denoise) + causal (commit)
Needs draft model / assistant head	Yes	No
AX Engine decode path	`ngram_acceleration` / `mtp_head_only`	`diffusion` (early return, mutually exclusive)

In the runner's decode_one, the diffusion path returns before the MTP/n-gram branches are reached. DiffusionConfig carries canvas size, denoise steps, entropy thresholds, convergence settings, and temperature schedule only; it has no MTP fields.

Supported features:

Block-autoregressive discrete diffusion decode (canvas=256, up to 48 denoise steps)
Entropy-bound position acceptance with argmax-based rejection
Self-conditioning via GPU matmul (prob × cached embedding table)
Linear temperature schedule (configurable start/end)
Adaptive convergence detection (stable argmax, mean entropy, low update rate, and entropy plateau)
Standard causal prefill (same Gemma4 encoder, 4,031.4 tok/s median at the 2,048-token row)
Causal commit pass (writes KV cache for subsequent blocks)
SSE telemetry counters for diffusion block timing, denoise steps, convergence signals, and near-miss entropy/update-rate diagnostics (ax_mlx_diffusion_*)
diffusion decode-route classification in benchmark harness

Not applicable:

MTP / assistant-head speculative decoding (architecturally incompatible)
N-gram acceleration (diffusion replaces the autoregressive decode loop)
Direct pipeline double-buffering (not autoregressive)

Benchmark contract:

The published rows use first-block telemetry instead of the standard fixed-token autoregressive benchmark contract. max_output_tokens=1 is enough to force prefill plus one diffusion block, and the block counters still report the full 256-token denoise/commit cycle even though the caller receives only the first emitted token.

Telemetry: SSE-emitted ax_mlx_diffusion_* counters cover block count, denoise steps, convergence count, per-criterion convergence signals, near-miss entropy/update-rate diagnostics, denoise wall time, commit wall time, and block wall time, plus diffusion decode-route classification in bench_mlx_inference_stack.py.

Run the full direct benchmark and regenerate the charts:

cargo build --release -p ax-engine-bench --bin ax-engine-bench
python3 scripts/bench_diffusion_gemma_direct.py

Qwen3-Coder-Next

Qwen3-Coder-Next is the coding-specialist qwen3_next checkpoint, so it is reported separately from Qwen 3.6. It uses the same repo-owned AX MLX graph family, but its benchmark boundary is different: it does not ship MTP heads or a Qwen3.6 sidecar, so the public README path is direct decode only.

The direct comparison below uses grouped bar charts at 128/512/2048 prompt tokens. Each engine's version is printed on the charts: AX native MLX (6.5.2) and mlx_lm (0.31.3) use the MLX artifact and prompt-hash parity; llama.cpp Metal (b9700, ggml 0.15.2, flash-attn on) is a shape-compatible external GGUF reference run on one consistent build across all three prompt sizes. The AX rerun uses the default-on Qwen MoE fast paths (AX_MLX_QWEN3_MOE_NARROW_SOFTMAX, AX_MLX_MOE_FUSE_SHARED_EXPERT_ADD, and AX_MLX_MOE_SWIGLU_PACKED_METAL) plus the opt-in fused expert block (AX_MLX_MOE_FUSED_EXPERT_BLOCK=1). AX direct decode is +6.6% / +3.3% / +3.4% versus mlx_lm, and +23.6% / +20.6% / +17.1% versus llama.cpp.

Grouped bar chart comparing Qwen3-Coder-Next 4-bit median direct decode throughput for llama.cpp Metal, mlx_lm, and AX Engine native MLX at 128/512/2048 prompt tokens

Grouped bar chart comparing Qwen3-Coder-Next 4-bit median prefill throughput for llama.cpp Metal, mlx_lm, and AX Engine native MLX at 128/512/2048 prompt tokens

Grouped bar chart comparing Qwen3-Coder-Next 4-bit median time to first token for llama.cpp Metal, mlx_lm, and AX Engine native MLX at 128/512/2048 prompt tokens

Prompt tokens	llama.cpp decode	mlx_lm decode	AX direct decode	AX vs mlx_lm	AX vs llama.cpp
128	85.5	99.2	105.7	+6.6%	+23.6%
512	86.0	100.4	103.7	+3.3%	+20.6%
2048	85.5	96.9	100.2	+3.4%	+17.1%

Prefill and TTFT peers — same run:

Prompt tokens	llama.cpp prefill	mlx_lm prefill	AX direct prefill	llama.cpp TTFT	mlx_lm TTFT	AX direct TTFT
128	1,248.7	301.8	758.5	103 ms	426 ms	169 ms
512	2,148.3	897.2	1,703.2	238 ms	574 ms	301 ms
2048	2,555.1	2,226.9	2,482.6	802 ms	920 ms	825 ms

llama.cpp leads prefill/TTFT at every prompt size (flash-attn GGUF prompt ingestion). The v6.5.2 fused-expert-block AX rerun keeps the AX decode advantage at every size, but its prefill/TTFT rows trail the prior default-on AX run; use this artifact as an opt-in fast-path measurement, not a replacement claim that the flag is always faster.

What drives the decode gap (it is not bandwidth saturation). This is a runtime shootout at each engine's standard 4-bit, not a controlled kernel test. Qwen3-Coder-Next is MoE, so each decode token reads only the dense backbone plus the 10-of-512 active experts — and at that footprint none of the three engines is bandwidth-bound (all sit at 34–42% of the 577 GB/s M5 Max peak; see the bandwidth table below). The gap splits cleanly: AX beats llama.cpp on bytes-read — Q4_K_M reads ~1.44× the bytes/token (2.83 vs 1.96 GB) because its dense backbone (linear-attention/SSM, embeddings, output head) stays at higher precision; llama.cpp actually sustains the most bandwidth (~42%) yet is slowest. AX beats mlx_lm on kernel efficiency — identical 1.96 GB/token MLX weights, but AX extracts ~36% of peak vs mlx-lm's ~34% (the MoE gather-GEMV win). The parity-controlled claim is AX vs mlx_lm (identical weights, prompt-hash parity): +3.3%–6.6%; llama-bench consumes its own internal tokens (no prompt-hash parity), so the llama.cpp column is a shape-compatible external reference only.

Memory bandwidth utilization:

Decode speed follows one identity: tok/s = effective bandwidth ÷ bytes read per token. The chart below plots decode throughput (y) against weight bytes read per token (x), with the measured M5 Max peak (≈577 GB/s, MLX reduction probe) drawn as the ceiling curve tok/s = 577 / bytes. Three things are visible in one view. AX and mlx-lm share the same x (identical MLX 4-bit weights), so the vertical gap between them is pure kernel efficiency (+6.6%, AX's MoE gather-GEMV). llama.cpp is pushed right because Q4_K_M reads 1.44× the bytes/token, which is why it decodes slowest even though it sustains the most raw bandwidth. Every point sits far below the ceiling, so decode is gather/dispatch-bound, not bandwidth-bound; the room up to the curve is headroom.

Engine / quantization	Dense backbone	Active experts	Weights/token	Decode tok/s	Effective BW	% of 577 GB/s peak (used)
AX — MLX 4-bit + fused expert block	1.21 GB (22%)	0.76 GB (14%)	1.96 GB	105.7	208 GB/s	36%
mlx-lm — MLX 4-bit	1.21 GB (21%)	0.76 GB (13%)	1.96 GB	99.2	195 GB/s	34%
llama.cpp — Q4_K_M	1.91 GB (28%)	0.91 GB (14%)	2.83 GB	85.5	242 GB/s	42%

Per-segment percentages are that read's share of the 577 GB/s peak (dense + experts = used); the remainder is idle headroom. The dense backbone (read in full every token) is where Q4_K_M's higher precision shows up — 1.91 GB vs MLX's 1.21 GB.

AX and mlx-lm read the same 1.96 GB of active weights per token (identical MLX 4-bit artifact); AX is faster because it extracts more of the available bandwidth — a runtime/kernel win, not a quant difference. llama.cpp reads 1.44× more (2.83 GB) because Q4_K_M keeps the dense backbone — Qwen3-Next's linear-attention/SSM weights, token embeddings, and output head — at higher precision; that bytes-read overhead, not bandwidth starvation, is why its decode trails. Active-byte figures: MLX from the harness bandwidth_accounting (moe_active_estimate), llama.cpp computed from the GGUF tensor table (dense + routed × 10/512, the same formula). Rows are prompt=128; decode tok/s is essentially depth-independent for this model.
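The identity tok/s = effective bandwidth ÷ bytes read per token can be applied directly to the table. The sketch below recomputes effective bandwidth and percent of the 577 GB/s peak from the rounded decode and weights-per-token cells. The GB/s values can differ from the table by about 1 GB/s because the inputs are rounded, while the rounded percentages match.

```python
PEAK_GB_S = 577.0  # measured M5 Max peak quoted above

# engine -> (decode tok/s at prompt=128, weights read per token in GB)
ROWS = {
    "AX (MLX 4-bit + fused expert block)": (105.7, 1.96),
    "mlx-lm (MLX 4-bit)": (99.2, 1.96),
    "llama.cpp (Q4_K_M)": (85.5, 2.83),
}


def effective_bandwidth_gb_s(tok_s: float, gb_per_token: float) -> float:
    """Rearranged identity: effective bandwidth = tok/s x bytes read per token."""
    return tok_s * gb_per_token


def percent_of_peak(tok_s: float, gb_per_token: float) -> float:
    return effective_bandwidth_gb_s(tok_s, gb_per_token) / PEAK_GB_S * 100.0


for engine, (tok_s, gb) in ROWS.items():
    bw = effective_bandwidth_gb_s(tok_s, gb)
    print(f"{engine}: ~{bw:.0f} GB/s, ~{percent_of_peak(tok_s, gb):.0f}% of peak")
```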

The same chart also shows the remaining AX headroom. If AX kept the 1.96 GB/token footprint and merely matched llama.cpp's 42% effective-bandwidth row, decode would land around 124 tok/s (+17%). On dense models on this same M5 Max hardware AX reaches 78–86% of peak, so the ~40-point gap here is specific to batch-1 MoE decode: each token gathers only 10-of-512 experts, and fixed routing, gather setup, dispatch, dequant, and expert weighted-sum overhead dominate. Those costs do not scale with bytes read (the bus idles while dispatch runs). The next lever is therefore kernel/dispatch engineering (fewer and larger fused MoE operations such as batched expert dispatch and deeper gather+GEMV+weighted-sum fusion), not pushing quantization lower: AX already reads the fewest bytes of the three, and going lower would cost model quality. The headroom figure is an upper bound, not a commitment: single-token MoE decode is latency-bound at its core.
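The headroom estimate is the same identity run in the other direction: hold AX's bytes per token fixed and substitute llama.cpp's effective bandwidth. The sketch uses the rounded table cells, so it lands near 124 tok/s rather than exactly on it.

```python
AX_GB_PER_TOKEN = 1.96       # AX weights read per token
AX_DECODE_TOK_S = 105.7      # AX decode at prompt=128
LLAMA_CPP_BW_GB_S = 242.0    # llama.cpp effective bandwidth (about 42% of 577 GB/s)


def projected_tok_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """tok/s = effective bandwidth / bytes read per token."""
    return bandwidth_gb_s / gb_per_token


projected = projected_tok_s(LLAMA_CPP_BW_GB_S, AX_GB_PER_TOKEN)
gain_pct = (projected / AX_DECODE_TOK_S - 1.0) * 100.0
print(f"projected decode: ~{projected:.1f} tok/s ({gain_pct:+.0f}% vs current AX)")
```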

Artifacts: AX direct rows are the v6.5.2 opt-in fused-expert-block rerun 2026-06-20-qwen3-coder-next-ax-direct/qwen3-coder-next-4bit-ax-direct.json, with default-on Qwen MoE fast paths plus AX_MLX_MOE_FUSED_EXPERT_BLOCK=1; mlx_lm reference rows are qwen3-coder-next-4bit-p128-p2048-step4096.json and qwen3-coder-next-4bit-p512-step4096.json; llama.cpp is 2026-06-19-qwen3-coder-next-9700-fa/qwen3-coder-next-4bit.json (b9700 / ggml 0.15.2 / flash-attn, one build across 128/512/2048).

Render charts with:

python3 scripts/render_qwen_coder_next_charts.py \
  --artifact benchmarks/results/mlx-inference/2026-06-20-qwen3-coder-next-ax-direct/qwen3-coder-next-4bit-ax-direct.json \
  --artifact benchmarks/results/mlx-inference/2026-06-19-qwen3-coder-next-ax-only/qwen3-coder-next-4bit-ax-direct.json \
  --artifact benchmarks/results/mlx-inference/2026-06-14-qwen3-coder-next-29af647f-ax-direct/qwen3-coder-next-4bit-ax-direct.json \
  --artifact benchmarks/results/mlx-inference/2026-06-13-qwen3-coder-next-prefill-probe/qwen3-coder-next-4bit-p128-p2048-step4096.json \
  --artifact benchmarks/results/mlx-inference/2026-06-13-qwen3-coder-next-prefill-probe/qwen3-coder-next-4bit-p512-step4096.json \
  --llama-artifact benchmarks/results/llama-cpp-metal/2026-06-19-qwen3-coder-next-9700-fa/qwen3-coder-next-4bit.json \
  --assets-dir docs/assets

# Memory-bandwidth utilization chart (static data; see script header for provenance)
python3 scripts/render_qwen_coder_next_bandwidth_chart.py --assets-dir docs/assets

MoE decode optimizations:

Qwen3-Coder-Next uses a sparse top-10-of-512 MoE architecture, so each decode token reads only the dense backbone plus 10 active experts. The optimizations below reduce the per-layer dispatch overhead in the MoE expert forward path. Three Qwen-relevant paths are on by default (with kill-switches); the others are opt-in for benchmarking and development.

Environment variable	What it does	Default
`AX_MLX_QWEN3_MOE_NARROW_SOFTMAX`	Routes MoE expert selection through `argpartition` on raw logits instead of full `softmax_precise` over all 512 experts. The selection is mathematically equivalent: softmax is monotonic, so the top-k set on raw logits is the same as the top-k set on probabilities.	ON
`AX_MLX_MOE_FUSE_SHARED_EXPERT_ADD`	Adds Qwen3 shared-expert output inside the weighted-sum Metal kernel on decode/short-tail chunks, removing one add dispatch per MoE layer when shapes are eligible.	ON
`AX_MLX_MOE_SWIGLU_PACKED_METAL`	Routes packed Qwen3 MoE expert SwiGLU through one Metal kernel instead of split + split + activation/multiply on decode. Long prefill keeps the split path.	ON
`AX_MLX_MOE_LAYER_COMPILE`	Wraps each MoE layer's decode forward path in a compiled `MlxClosure` (`shapeless=true`), collapsing ~10 per-layer MLX dispatches into a single compiled graph. Cached per `(layer_index, thread_id)`. Only engages for decode (`seq == 1`). Falls back to the uncompiled path on failure.	OFF
`AX_MLX_MOE_PROFILE`	Records wall-clock timing for each MoE sub-stage (router, gate-up, activation, down-projection, weighted-sum, total) without `eval()` barriers. Data surfaces in route metadata and batch summaries. Diagnostic tool, not a performance optimization.	OFF
`AX_MLX_MOE_FUSED_EXPERT_BLOCK`	Routes the activation + squeeze + unsort chain through a single fused Metal kernel for decode (unsorted gather path only). Reduces dispatch count per MoE layer. Falls back to the standard dispatch when ineligible.	OFF
`AX_MLX_MOE_EXPERT_PARALLEL`	Bins expert tokens per-expert for parallel Metal dispatch during prefill. Checks load-balance before engaging (falls back to sequential `gather_qmm` when `max_bin > 2x mean_bin`). Infrastructure only — parallel kernel not yet implemented.	OFF
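The equivalence behind `AX_MLX_QWEN3_MOE_NARROW_SOFTMAX` can be checked in a few lines: softmax is strictly increasing in each logit, so the 10 largest logits and the 10 largest probabilities pick the same experts. The sketch uses `heapq.nlargest` as a stand-in for `argpartition` and random logits as hypothetical router output. It demonstrates the selection argument only, not the Metal kernel.

```python
import heapq
import math
import random
from typing import List, Set

NUM_EXPERTS = 512
TOP_K = 10


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over all experts."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_experts(scores: List[float], k: int) -> Set[int]:
    """Indices of the k largest scores (order within the set is irrelevant)."""
    return set(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

from_logits = top_k_experts(router_logits, TOP_K)
from_probs = top_k_experts(softmax(router_logits), TOP_K)
print(from_logits == from_probs)
```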

To disable a default-on optimization (e.g. for debugging or comparison):

# Disable packed SwiGLU for a single run
AX_MLX_MOE_SWIGLU_PACKED_METAL=0 ax-engine serve qwen3-coder-next --download --port 8080

To enable selected experimental diagnostics/fast paths for benchmarking:

AX_MLX_MOE_LAYER_COMPILE=1 \
AX_MLX_MOE_PROFILE=1 \
AX_MLX_MOE_FUSED_EXPERT_BLOCK=1 \
ax-engine serve qwen3-coder-next --download --port 8080

Note: AX_MLX_MOE_LAYER_COMPILE wraps each MoE layer's decode forward in a compiled MlxClosure. It is opt-in because it may panic in long-running processes due to MLX thread-local stream registry invalidation. If you encounter crashes, disable it with AX_MLX_MOE_LAYER_COMPILE=0. AX_MLX_MOE_EXPERT_PARALLEL is infrastructure-only (parallel kernel not yet implemented).

These flags are read once per process at startup. Do not enable the opt-in flags in production serving without first verifying output token equivalence against the default path on your target prompts.
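The default-on and opt-in semantics in the table, together with the read-once behavior, can be modeled as below. This is a hypothetical Python sketch rather than the runtime's parser: it assumes `1` enables a flag, `0` disables it, and any other value falls back to the flag's default.

```python
import os
from functools import lru_cache

# Flags listed as ON by default in the table above; all others default to OFF.
DEFAULT_ON = frozenset({
    "AX_MLX_QWEN3_MOE_NARROW_SOFTMAX",
    "AX_MLX_MOE_FUSE_SHARED_EXPERT_ADD",
    "AX_MLX_MOE_SWIGLU_PACKED_METAL",
})


@lru_cache(maxsize=None)
def flag_enabled(name: str) -> bool:
    """Resolve a flag once; later changes to the environment are ignored."""
    raw = os.environ.get(name)
    if raw == "1":
        return True
    if raw == "0":
        return False   # kill-switch for default-on paths
    return name in DEFAULT_ON  # assumption: unset or unrecognized keeps the default


os.environ["AX_MLX_MOE_SWIGLU_PACKED_METAL"] = "0"
print(flag_enabled("AX_MLX_MOE_SWIGLU_PACKED_METAL"))

# Changing the variable after the first read has no effect in this process.
os.environ["AX_MLX_MOE_SWIGLU_PACKED_METAL"] = "1"
print(flag_enabled("AX_MLX_MOE_SWIGLU_PACKED_METAL"))
```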

The family tables below compare direct (non-speculative) decode across llama.cpp Metal, mlx_lm, and AX Engine, covering Gemma 4 and Qwen 3.6 at 128/512/2048 prompt tokens. The AX direct baseline disables n-gram acceleration, MTP, and assistant drafting to measure the repo-owned direct decode path. Bench prompts are mlx_lm.benchmark seed-0 random tokens, which keeps prompt-hash parity across MLX rows.

The prefill and TTFT advantage is the practical direct-mode story. AX is ahead of mlx_lm in every listed prefill and TTFT cell below, while decode gains are smaller and model-dependent. That means the repo-owned MLX route is especially valuable for interactive requests where prompt ingestion dominates perceived latency: AX keeps prompt prefill, first-token timing, model-specific graph paths, and route metadata in one measured runtime path. These are cold-prefix rows, not prompt-cache, continuous-batching, or speculative-decoding claims.

Charts (Gemma 4 and Qwen 3.6): decode rate, prefill rate, and TTFT.

llama.cpp Metal* column: shape-compatible reference produced by Metal-enabled llama-bench. llama-bench generates its own internal synthetic prompt tokens and does not consume the harness prompt JSON, so these numbers do not have prompt-hash parity with the other columns, and no percentage delta is shown. MLX bit-widths are mapped to the nearest standard GGUF quantization (4→Q4_K_M, 5→Q5_K_M, 6→Q6_K, 8→Q8_0; Q8_0 is not a K-quant). Source: benchmarks/manifests/llama_cpp_metal/inventory.json, scripts/bench_llama_cpp_metal_sweep.py.

Benchmark provenance and methodology

The mlx_lm reference rows for the 12 Gemma 4 and Qwen 3.6 models shown below come from benchmarks/results/mlx-inference/2026-05-26-direct-mode-clean-refresh/. The AX direct-mode cells come from the full 12-model AX-only rerun in benchmarks/results/mlx-inference/2026-06-20-ax-direct-readme/ (v6.5.2). Qwen3-Coder-Next is intentionally handled in its own direct-mode subsection above because it has a direct-only benchmark boundary; its MLX/AX and llama.cpp Metal rows now cover 128/512/2048 prompt tokens. The llama.cpp Metal* column is injected from benchmarks/manifests/llama_cpp_metal/inventory.json and the 2026-05-18-llama-cpp-metal-gemma-e2b-4bit-depth-fa/ Gemma 4 E2B 4-bit recheck.

Setup: generation=128, 5 measured repetitions, 15-second cooldown, AX prefix cache disabled for cold prefill and TTFT measurement, production-build binaries, matching prompt SHA checks. Long-greedy AX prefill rows are runner-time measurements of the cache-state prefix plus final prompt-token boundary — not full-logits prompt scoring throughput. Percentages are versus mlx_lm.
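
As a rough illustration of the repetition and cooldown settings above, the following sketch runs a measurement callable five times with a pause between runs. It is not the repository harness: the function name measure, the choice to cool down between runs rather than before the first one, and the use of a median to summarize are all assumptions, and the sample values are synthetic placeholders.

```python
import statistics
import time
from typing import Callable, List


def measure(
    run_once: Callable[[], float],
    repetitions: int = 5,
    cooldown_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call run_once `repetitions` times, sleeping `cooldown_s` between runs."""
    samples: List[float] = []
    for i in range(repetitions):
        if i > 0:
            # Assumption: the cooldown separates measured runs; none before the first.
            sleep(cooldown_s)
        samples.append(run_once())
    return samples


# Synthetic placeholder values, not benchmark results.
fake_runs = iter([10.0, 12.0, 11.0, 13.0, 9.0])
samples = measure(lambda: next(fake_runs), cooldown_s=0.0)
print(samples, statistics.median(samples))
```

Passing the sleep function as a parameter lets the loop be exercised without actually waiting 15 seconds per repetition.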

The 2K llama.cpp Metal* prefill rows are long-context reference rows for the GGUF runtime. The Gemma 4 E2B 4-bit row was produced with llama.cpp b9110 and rechecked on b9200 with Metal offload, -b/-ub 2048, and flash attention enabled. The b9200 recheck improved 2K prefill only slightly. Treat this as a boundary of our benchmark, not as an official upstream llama.cpp bug report.

Prefill throughput (tok/s) — percentages vs mlx_lm

Model	MLX quantization	Prompt tok	llama.cpp Metal*	mlx_lm	ax engine
Gemma 4 E2B	4-bit	128	3,481.7	2,338.1	5,720.2 (+144.6%)
		512	6,846.0	7,870.0	16,076.9 (+104.3%)
		2048	7,643.1	18,014.7	23,346.2 (+29.6%)
Gemma 4 E2B	5-bit	128	3,398.4	2,238.5	5,436.4 (+142.9%)
		512	6,860.3	7,469.9	15,526.9 (+107.9%)
		2048	7,288.1	16,664.1	22,798.4 (+36.8%)
Gemma 4 E2B	6-bit	128	3,539.7	1,823.5	5,330.0 (+192.3%)
		512	7,274.0	6,046.6	14,814.0 (+145.0%)
		2048	7,623.2	15,332.1	22,280.0 (+45.3%)
Gemma 4 E2B	8-bit	128	3,694.3	1,605.0	5,338.2 (+232.6%)
		512	7,481.0	6,332.9	15,259.4 (+141.0%)
		2048	7,990.4	15,536.8	22,924.7 (+47.6%)
Gemma 4 E4B	4-bit	128	2,194.0	1,513.2	3,460.6 (+128.7%)
		512	4,454.2	4,195.5	7,002.4 (+66.9%)
		2048	4,426.6	7,325.4	8,758.8 (+19.6%)
Gemma 4 26B A4B	4-bit	128	1,911.4	496.4	1,331.6 (+168.2%)
		512	3,484.5	1,621.0	3,011.0 (+85.7%)
		2048	3,604.8	3,300.1	4,550.1 (+37.9%)
Gemma 4 31B	4-bit	128	522.6	283.1	508.0 (+79.5%)
		512	665.3	619.9	736.0 (+18.7%)
		2048	560.3	733.9	750.8 (+2.3%)
Qwen 3.6 27B	4-bit	128	539.6	378.8	570.4 (+50.6%)
		512	759.7	705.7	826.6 (+17.1%)
		2048	664.3	895.2	922.0 (+3.0%)
Qwen 3.6 27B	5-bit	128	520.8	278.8	520.4 (+86.6%)
		512	733.4	599.5	760.4 (+26.8%)
		2048	667.0	827.5	848.1 (+2.5%)
Qwen 3.6 27B	6-bit	128	537.7	270.5	485.1 (+79.3%)
		512	756.1	577.6	736.0 (+27.4%)
		2048	689.3	798.2	841.0 (+5.4%)
Qwen 3.6 27B	8-bit	128	559.4	219.3	441.7 (+101.4%)
		512	798.2	520.2	710.1 (+36.5%)
		2048	741.9	787.4	847.6 (+7.6%)
Qwen 3.6 35B A3B	4-bit	128	1,706.9	539.4	1,118.8 (+107.4%)
		512	3,146.6	1,599.5	2,588.3 (+61.8%)
		2048	3,542.3	3,513.1	3,761.3 (+7.1%)
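
The percentages in the table above are relative differences against the mlx_lm cell in the same row. A minimal sketch of that calculation follows; the function name is hypothetical. Recomputing from the rounded cells shown here can differ from the published figure by 0.1 in the last digit, which suggests the published values were derived from unrounded measurements (an inference, not something the README states).

```python
def pct_vs_reference(value: float, reference: float) -> float:
    """Relative difference of value against reference, in percent."""
    return (value / reference - 1.0) * 100.0


# Gemma 4 E2B 4-bit at 512 prompt tokens, taken from the prefill table above:
# mlx_lm = 7,870.0 tok/s, ax engine = 16,076.9 tok/s.
print(f"{pct_vs_reference(16076.9, 7870.0):+.1f}%")
```

For throughput tables a positive value means AX is faster; for the TTFT table the same formula yields negative values when AX is faster, because lower latency is better.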

Decode throughput (tok/s) — generation=128 tokens, temp=0

Model	MLX quantization	Prompt tok	llama.cpp Metal*	mlx_lm	ax direct baseline
Gemma 4 E2B	4-bit	128	174.6	214.0	224.1 (+4.7%)
		512	165.2	210.3	215.1 (+2.3%)
		2048	171.9	200.9	205.4 (+2.2%)
Gemma 4 E2B	5-bit	128	154.8	195.2	200.6 (+2.8%)
		512	154.3	182.0	194.5 (+6.8%)
		2048	154.3	181.9	185.7 (+2.1%)
Gemma 4 E2B	6-bit	128	152.1	172.2	178.0 (+3.4%)
		512	152.0	166.3	171.7 (+3.2%)
		2048	152.2	162.5	164.7 (+1.4%)
Gemma 4 E2B	8-bit	128	136.1	153.0	162.0 (+5.8%)
		512	138.3	148.8	157.8 (+6.1%)
		2048	138.7	144.2	153.0 (+6.1%)
Gemma 4 E4B	4-bit	128	110.7	137.1	142.9 (+4.2%)
		512	110.8	133.6	139.9 (+4.8%)
		2048	110.7	130.6	137.2 (+5.1%)
Gemma 4 26B A4B	4-bit	128	112.6	127.9	131.7 (+2.9%)
		512	112.9	125.0	128.7 (+2.9%)
		2048	112.9	119.3	123.7 (+3.7%)
Gemma 4 31B	4-bit	128	25.0	28.9	28.8 (-0.3%)
		512	25.5	28.3	28.3 (-0.2%)
		2048	25.3	27.0	26.1 (-3.3%)
Qwen 3.6 27B	4-bit	128	26.0	34.0	33.9 (-0.3%)
		512	26.0	33.9	33.6 (-0.8%)
		2048	18.8	33.4	33.3 (-0.4%)
Qwen 3.6 27B	5-bit	128	23.5	21.6	27.2 (+26.1%)
		512	23.3	28.1	26.9 (-4.2%)
		2048	17.8	27.8	26.1 (-6.3%)
Qwen 3.6 27B	6-bit	128	21.3	24.0	24.0 (+0.2%)
		512	21.3	24.8	24.0 (-3.0%)
		2048	15.4	24.6	23.7 (-3.8%)
Qwen 3.6 27B	8-bit	128	18.3	18.7	18.3 (-2.2%)
		512	18.2	18.6	18.0 (-3.2%)
		2048	12.7	18.4	18.1 (-1.7%)
Qwen 3.6 35B A3B	4-bit	128	108.1	140.1	153.2 (+9.4%)
		512	108.2	136.5	151.6 (+11.1%)
		2048	105.7	134.5	149.8 (+11.4%)

Qwen 3.6 27B 4-bit at prompt=2,048 originally produced zero decode tokens because 4-bit quantization noise pushed an EOS token to argmax at decode step 0 on the mlx_lm.benchmark random-token contract. The benchmark harness now sends sampling.ignore_eos=true for AX throughput runs, matching how mlx_lm.benchmark measures fixed gen=N throughput. Production requests default to ignore_eos=false. Source: benchmarks/results/mlx-inference/2026-05-20-qwen27-4to5-direct-ngram-directcpp-r2/qwen3_6-27b-4bit.json.
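
The effect described above can be reproduced with a toy greedy loop. This is a sketch only: the EOS id, the stand-in model, and the function name are hypothetical, and it does not reflect how AX or mlx_lm implement decoding. It shows why an EOS at decode step 0 yields zero tokens unless EOS is ignored.

```python
from typing import Callable, List

EOS = 0  # hypothetical EOS token id for this toy example


def greedy_decode(
    next_token: Callable[[int], int], max_new_tokens: int, ignore_eos: bool
) -> List[int]:
    """Collect up to max_new_tokens tokens, stopping at EOS unless ignore_eos is set."""
    out: List[int] = []
    for step in range(max_new_tokens):
        tok = next_token(step)
        if tok == EOS and not ignore_eos:
            break  # production-style behaviour: stop as soon as EOS wins argmax
        out.append(tok)
    return out


def toy_model(step: int) -> int:
    """Stand-in model whose argmax is EOS at step 0 and token 7 afterwards."""
    return EOS if step == 0 else 7


print(len(greedy_decode(toy_model, 128, ignore_eos=False)))
print(len(greedy_decode(toy_model, 128, ignore_eos=True)))
```

With ignore_eos=False the loop ends before emitting anything, so a fixed gen=128 throughput measurement has nothing to time; with ignore_eos=True all 128 steps run.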

Time to first token (ms) — generation=128 tokens, temp=0

Lower is better. mlx_lm values are derived from reported prefill throughput. AX values are measured directly from per-step runner timing in the SSE event stream.
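
The mlx_lm cells below are consistent with dividing the prompt length by the reported prefill throughput. The exact derivation used by the harness is not shown in this README, so treat the following as a sketch under that assumption; the function name is hypothetical.

```python
def ttft_ms_from_prefill(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Estimate time to first token (ms) as prompt length over prefill rate."""
    return prompt_tokens / prefill_tok_per_s * 1000.0


# Gemma 4 E2B 4-bit, mlx_lm prefill rates from the prefill table:
# 2,338.1 tok/s at 128 prompt tokens and 7,870.0 tok/s at 512.
print(f"{ttft_ms_from_prefill(128, 2338.1):.1f} ms")
print(f"{ttft_ms_from_prefill(512, 7870.0):.1f} ms")
```

These two results match the 54.7 ms and 65.1 ms mlx_lm cells for the same rows. The AX column is different in kind: it is measured from runner timing, not derived.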

Model	MLX quantization	Prompt tok	llama.cpp Metal*	mlx_lm	ax engine
Gemma 4 E2B	4-bit	128	36.8	54.7	22.4 (-59.1%)
		512	74.8	65.1	31.8 (-51.0%)
		2048	268.0	113.7	87.7 (-22.8%)
Gemma 4 E2B	5-bit	128	37.7	57.2	23.5 (-58.8%)
		512	74.6	68.5	33.0 (-51.9%)
		2048	281.0	122.9	89.8 (-26.9%)
Gemma 4 E2B	6-bit	128	36.2	70.2	24.0 (-65.8%)
		512	70.4	84.7	34.6 (-59.2%)
		2048	268.7	133.6	91.9 (-31.2%)
Gemma 4 E2B	8-bit	128	34.6	79.7	24.0 (-69.9%)
		512	68.4	80.8	33.6 (-58.5%)
		2048	256.3	131.8	89.3 (-32.2%)
Gemma 4 E4B	4-bit	128	58.3	84.6	37.0 (-56.3%)
		512	114.9	122.0	73.1 (-40.1%)
		2048	462.7	279.6	233.8 (-16.4%)
Gemma 4 26B A4B	4-bit	128	67.0	257.8	96.1 (-62.7%)
		512	146.9	315.8	170.0 (-46.2%)
		2048	568.1	620.6	450.1 (-27.5%)
Gemma 4 31B	4-bit	128	244.9	452.2	252.0 (-44.3%)
		512	769.5	826.0	695.7 (-15.8%)
		2048	3,655.2	2,790.6	2,727.7 (-2.3%)
Qwen 3.6 27B	4-bit	128	237.2	337.9	224.4 (-33.6%)
		512	673.9	725.6	619.4 (-14.6%)
		2048	3,083.1	2,287.7	2,221.3 (-2.9%)
Qwen 3.6 27B	5-bit	128	245.8	459.0	246.0 (-46.4%)
		512	698.1	854.1	673.3 (-21.2%)
		2048	3,070.5	2,474.9	2,414.7 (-2.4%)
Qwen 3.6 27B	6-bit	128	238.1	473.2	263.9 (-44.2%)
		512	677.2	886.5	695.6 (-21.5%)
		2048	2,971.2	2,565.6	2,435.2 (-5.1%)
Qwen 3.6 27B	8-bit	128	228.8	583.6	289.8 (-50.3%)
		512	641.5	984.2	721.0 (-26.7%)
		2048	2,760.6	2,601.1	2,416.3 (-7.1%)
Qwen 3.6 35B A3B	4-bit	128	75.0	237.3	114.4 (-51.8%)
		512	162.7	320.1	197.8 (-38.2%)
		2048	578.2	583.0	544.5 (-6.6%)

Embedding benchmarks are kept out of this README summary; see `docs/EMBEDDINGS.md`.

SDKs

ax-engine-server exposes OpenAI-compatible HTTP endpoints, and several SDKs wrap those endpoints or the in-process Rust session directly.

Language	Package / path	LangChain
Python	`python/ax_engine`	`ax_engine.langchain` — `AXEngineChatModel`, `AXEngineLLM`
TypeScript / JS	`javascript/ax-engine` (`@ax-engine/sdk`)	`@ax-engine/sdk/langchain` — `ChatAXEngine`, `AXEngineLLM`
Go	`sdk/go/axengine`	Use langchaingo OpenAI provider — see `examples/go/langchain/`
Ruby	`sdk/ruby` (`ax-engine-sdk`)	`ax_engine/langchain` — `ChatModel`, `LLM` (requires langchain-rb)
Mojo	`sdk/mojo/ax_engine.mojo`	Via Python — use `ax_engine.langchain` from Mojo's Python interop

TypeScript / JavaScript

npm install @ax-engine/sdk

import AxEngineClient from "@ax-engine/sdk";

const client = new AxEngineClient({ baseUrl: "http://127.0.0.1:8080" });
const resp = await client.chatCompletion({
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 128,
});
console.log(resp.choices[0].message.content);

// Streaming
for await (const event of client.streamChatCompletion({
  messages: [{ role: "user", content: "Count from 1 to 5." }],
  stream: true,
})) {
  process.stdout.write(event.data.choices[0]?.delta?.content ?? "");
}

LangChain integration (requires @langchain/core):

import { ChatAXEngine } from "@ax-engine/sdk/langchain";
import { HumanMessage } from "@langchain/core/messages";

const chat = new ChatAXEngine({ maxTokens: 128 });
const response = await chat.invoke([new HumanMessage("Hello!")]);

Go

The Go SDK lives at sdk/go/axengine (module github.com/ax-engine/ax-engine-go).

client := axengine.NewClient(nil)

resp, err := client.ChatCompletion(ctx, axengine.OpenAiChatCompletionRequest{
    Messages:  []axengine.OpenAiChatMessage{{Role: "user", Content: "Hello!"}},
    MaxTokens: axengine.Ptr(128),
})

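// Streaming; req is a request value built like the one above
ch, errCh := client.StreamChatCompletion(ctx, req)
for chunk := range ch {
    if len(chunk.Choices) > 0 && chunk.Choices[0].Delta.Content != nil { // skip empty deltas
        fmt.Print(*chunk.Choices[0].Delta.Content)
    }
}
if err := <-errCh; err != nil { // assumption: errCh reports the terminal stream error
    log.Fatal(err)
}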

See examples/go/ for runnable examples. For LangChain, point langchaingo's OpenAI provider at http://127.0.0.1:8080/v1 — see examples/go/langchain/ and docs/GO.md.

Ruby

The Ruby SDK lives at sdk/ruby/ (ax-engine-sdk gem). Zero dependencies — stdlib net/http only. Streaming uses a block interface.

require "ax_engine"

client = AxEngine::Client.new

# Blocking chat
resp = client.chat_completion(
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 128
)
puts resp.dig("choices", 0, "message", "content")

# Streaming
client.stream_chat_completion(
  messages: [{ role: "user", content: "Count from 1 to 5." }],
  max_tokens: 64
) do |event|
  print event.dig("data", "choices", 0, "delta", "content").to_s
end

LangChain via langchain-rb:

require "ax_engine/langchain"

chat = AxEngine::Langchain::ChatModel.new(max_tokens: 256)
puts chat.chat(messages: [{ role: "user", content: "Hello!" }]).chat_completion

See examples/ruby/ and docs/RUBY.md for full details.

Python — LangChain

from ax_engine.langchain import AXEngineChatModel
from langchain_core.messages import HumanMessage

chat = AXEngineChatModel(base_url="http://127.0.0.1:8080", max_tokens=256)
response = chat.invoke([HumanMessage(content="Hello!")])
print(response.content)

# Streaming
for chunk in chat.stream([HumanMessage(content="Count from 1 to 5.")]):
    print(chunk.content, end="", flush=True)

Requires pip install langchain-core. See docs/PYTHON.md for full details.

Mojo

The Mojo SDK (sdk/mojo/ax_engine.mojo) wraps the Python ax_engine package via Mojo's PythonObject interop. Requires the Python extension to be built first (maturin develop).

from sdk.mojo.ax_engine import Session

var session = Session(
    "qwen3_dense",
    mlx=True,
    mlx_model_artifacts_dir="/path/to/artifacts",
)
var result = session.generate("Hello from Mojo!", max_output_tokens=64)
print(result.output_text)
session.close()

Server Usage

The installed PyPI workflow uses ax-engine serve for the common local-serving path. ax-engine-server remains available as the backward-compatible low-level entrypoint when you need explicit runtime flags.

# Download a model and generate its manifest
MODEL_DIR="$(ax-engine download qwen36-35b --json | python3 -c 'import json,sys; print(json.load(sys.stdin)["dest"])')"

# Recommended: resolve and launch ax-engine-server
ax-engine serve "$MODEL_DIR" --port 8080

# Backward-compatible low-level path
./target/release/ax-engine-server \
  --mlx \
  --mlx-model-artifacts-dir "$MODEL_DIR" \
  --port 8080

# Inspect the running server
curl http://127.0.0.1:8080/v1/runtime

# Smoke generation request
curl http://127.0.0.1:8080/v1/generate \
  -H 'content-type: application/json' \
  -d '{
    "model": "qwen3_dense",
    "input_tokens": [1, 2, 3, 4],
    "max_output_tokens": 4,
    "sampling": { "temperature": 0.0, "top_p": 1.0, "top_k": 0, "seed": 1234 }
  }'
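
The same smoke request can be built from Python with only the standard library. This is a sketch that mirrors the curl payload above; the helper name build_generate_request is hypothetical, and the request is constructed but not sent so the snippet runs without a server.

```python
import json
import urllib.request


def build_generate_request(base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build a POST request for /v1/generate using the payload from the curl example."""
    payload = {
        "model": "qwen3_dense",
        "input_tokens": [1, 2, 3, 4],
        "max_output_tokens": 4,
        "sampling": {"temperature": 0.0, "top_p": 1.0, "top_k": 0, "seed": 1234},
    }
    return urllib.request.Request(
        f"{base_url}/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


req = build_generate_request()
print(req.get_method(), req.full_url)

# To send it against a running server, uncomment:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(json.load(resp))
```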

Python bindings (after maturin develop):

import ax_engine

path = ax_engine.download_model("mlx-community/Qwen3-4B-4bit")
with ax_engine.Session(mlx=True, mlx_model_artifacts_dir=str(path)) as s:
    result = s.generate([1, 2, 3], max_output_tokens=32)
    print(result.output_tokens)

Delegated route (for unsupported MLX text models that mlx-lm can serve):

mlx_lm.server --model /path/to/local/mlx-model --host 127.0.0.1 --port 8090

./target/release/ax-engine-bench generate \
  --prompt "Hello from mlx-lm" \
  --support-tier mlx_lm_delegated \
  --mlx-lm-server-url http://127.0.0.1:8090

mlx_lm_delegated is a compatibility route, not an AX-owned MLX throughput claim. AX forwards text generation to upstream mlx_lm.server and preserves temperature, top_p, top_k, repetition_penalty, and seed. Streamed chunks are delegated text deltas — not AX-owned token IDs, KV state, or model-kernel throughput evidence.

Check readiness and run benchmarks:

# Readiness check
./target/release/ax-engine-bench doctor --mlx-model-artifacts-dir "$MODEL_DIR"
bash scripts/check-server-preview.sh
bash scripts/check-python-preview.sh

# Primary benchmark: AX vs mlx_lm
python3 scripts/bench_mlx_inference_stack.py \
  --model-dir /path/to/local/mlx-model \
  --prompt-tokens 128,512,2048 --generation-tokens 128 \
  --ax-compare-policies --repetitions 5 \
  --output benchmarks/results/mlx-inference/$(date +%F)/gemma-4-e2b-it-4bit.json

# Secondary workload-contract benchmark
./target/release/ax-engine-bench scenario \
  --manifest benchmarks/manifests/scenario/chat_gemma4_e2b_short.json \
  --output-root benchmarks/results

Workspace

crates/ax-engine-core    Engine state machine, scheduler, KV manager, sampler
crates/ax-engine-mlx     MLX model graph, n-gram acceleration, KV cache, runner
crates/mlx-sys           bindgen FFI over mlx-c; safe MlxArray RAII wrappers
crates/ax-engine-sdk     Session API, backend resolution (MLX, mlx-lm delegated, or llama.cpp)
crates/ax-engine-server  Axum HTTP/SSE adapter (OpenAI-compatible routes)
crates/ax-engine-bench   Manifest-driven workload-contract CLI
crates/ax-engine-py      PyO3 extension (ABI3, Python 3.10+)
javascript/ax-engine     TypeScript/JS HTTP SDK + LangChain adapter
sdk/go/axengine          Go HTTP SDK
sdk/ruby/                Ruby HTTP SDK (ax-engine-sdk gem)
sdk/mojo/                Mojo SDK (Python-interop)

Development

cargo build --workspace                                           # build all crates
cargo test --quiet                                                # full Rust test suite
cargo clippy --all-targets --all-features -- -D warnings         # lint (CI gate)
cargo fmt                                                         # format
maturin develop                                                   # rebuild Python extension
python -m unittest discover -s python/tests -v                   # Python tests
bash scripts/check-mlx-telemetry.sh                              # Gemma/AX MLX telemetry gate

For Gemma/AX MLX telemetry and decode-profile changes, prefer the targeted scripts/check-mlx-telemetry.sh gate. Use scripts/check-mlx-telemetry.sh --full-workspace when the change touches shared Rust contracts; that protected path compiles the workspace with cargo test --workspace --no-run --jobs 1 before running crate-by-crate tests.

Coverage is collected by the report-only GitHub Actions workflow in .github/workflows/coverage.yml. It publishes Rust cargo llvm-cov and Python coverage.py artifacts without enforcing a percentage threshold yet.

Public documentation is in docs/. Canonical benchmark manifests are in benchmarks/manifests/. Key design docs: SDK / API · Python · JavaScript / TypeScript · Go · Ruby · Mojo · Scheduler · KV Cache · Benchmarking · Serving Benchmarks

Benchmark Reference Projects

AX Engine's benchmark design and compatibility checks are informed by local reference checkouts of related open-source projects. A row is published only when it fits the benchmark contract for the specific workload: comparable model artifacts, prompt and sampling policy, prefill/decode/TTFT definitions, repeatability, host/runtime metadata, and provenance.

Project	Repository
ds4	antirez/ds4
lightning-mlx	samuelfaj/lightning-mlx
llama.cpp	ggml-org/llama.cpp
mistral.rs	EricLBuehler/mistral.rs
MLX	ml-explore/mlx
mlx-c	ml-explore/mlx-c
mlx-engine	lmstudio-ai/mlx-engine
mlx-lm	ml-explore/mlx-lm
mlx-turboquant	rachittshah/mlx-turboquant
MTPLX	youssofal/MTPLX
Rapid-MLX	raullenchai/Rapid-MLX
turboquant-mlx	arozanov/turboquant-mlx
vLLM	vllm-project/vllm

Some reference projects are experimental, version-unstable, focused on a different serving route, or not shaped for the same Apple MLX/Metal measurement strategy, so those results remain implementation guidance or diagnostic evidence rather than public comparison rows.

Limitations

Qwen3.5 long-prompt prefill: Qwen3.5 prefill can trail upstream MLX references on longer prompts; decode and Qwen3-Next are not affected in the same way.
Raw HuggingFace weights: use pre-sanitized MLX community weights or convert first with mlx_lm.convert.
N-gram acceleration rows: effective-throughput measurements, not raw model-kernel speedups.
TurboQuant KV compression: experimental and off by default.

See the FAQ limitations entry for details.

Contributing

AX Engine welcomes community input through issue tickets, wishlist requests, reproducible benchmark results, and documentation feedback. We generally do not accept unsolicited code PRs, especially for runtime, model, kernel, scheduler, cache, n-gram, or performance-tuning changes.

Performance tuning is tightly coupled: a local speedup can regress correctness, TTFT, memory pressure, direct-vs-n-gram behavior, long-context behavior, serving stability, or another model family. Please open an issue first with the problem, target workload, and evidence so maintainers can choose the right validation path. See CONTRIBUTING.md for issue, wishlist, and benchmark result guidelines.

Community

Website: automatosx.com
Discord: Join us
Email: enquiry@defai.digital

License

Apache License, Version 2.0. See LICENSE for details.
