Run open-weight MLX models locally on Apple Silicon, route requests across local and remote providers, and expose everything through one endpoint.
Higgs is a single static Rust binary that serves local models, proxies to providers like OpenAI, Anthropic, and Ollama, and translates between OpenAI and Anthropic-style APIs so your existing tools and apps do not need a new integration.
Why care
- Run open-weight models locally on your Mac, including supported Qwen, Llama, Mistral, Gemma, Phi, DeepSeek, and vision-capable MLX families.
- Send requests to local models or remote providers through one endpoint.
- Plug tools into Higgs with `higgs shellenv` or `higgs exec` instead of reconfiguring each client separately.
Use Higgs if
- you want local open-weight model serving on Apple Silicon
- you switch between local and hosted models
- you want one API surface for apps, agents, and terminal tools
- `higgs serve` remains the ad hoc foreground entrypoint for `--model`, `--port`, `--batch`, and related flags.
- `higgs start` is now config/profile-only. Use `higgs init`, then `higgs start`.
- `higgs attach` is a daemon metrics dashboard. It now requires a live daemon, a passing `/health` probe, and metrics logging.
- Exact local model names now beat regex routes.
- `/metrics` is a real endpoint, and `server.max_body_size` is enforced on API requests.
- `higgs shellenv` and `higgs exec` now fail fast on bad config or an unreachable server.
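A minimal sketch of that split (the model ID is only an example; the commands and flags are the ones listed above):

```bash
# Foreground, ad hoc: model and server flags go on `serve`
higgs serve --model mlx-community/Llama-3.2-1B-Instruct-4bit --port 8000

# Background daemon: write the config first, then start from it
higgs init     # creates ~/.config/higgs/config.toml
higgs start
higgs attach   # metrics dashboard; needs a live daemon, a passing /health probe, and metrics logging
higgs stop
```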
Install:

```bash
brew install panbanda/brews/higgs
```

The Homebrew release currently targets Apple Silicon (aarch64-apple-darwin).

Or build from source (Rust 1.88.0+, Xcode CLI Tools):

```bash
cargo build --release
```

Run a local open-weight model:

```bash
higgs serve --model mlx-community/Qwen3.6-35B-A3B-4bit
```

Send a request to the local endpoint:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Write one sentence about Cape Town."}]
  }'
```

Point an existing tool at Higgs:

```bash
higgs exec -- claude
```

Requests can also target routed remote models through the same endpoint. For example, an OpenAI-format request can be translated and proxied to Anthropic based on your route configuration:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

- Serve MLX models from Hugging Face IDs or local paths.
- Support current model families including Qwen 3.6, Qwen 3.x, Llama, Mistral, Gemma 2, Phi-3, Starcoder2, DeepSeek-V2, and LLaVA-Qwen2.
- Expose local serving through OpenAI and Anthropic-compatible endpoints.
- Serve local MLX models and proxy remote providers from the same server.
- Keep client integrations stable while you switch between local and hosted backends.
- Route unmatched requests to a configured default target.
- Resolve requests by direct local model selection, regex pattern routing, model alias rewriting, or the optional auto-router.
- Translate OpenAI-format requests to Anthropic providers and Anthropic-format requests back to OpenAI-style clients, including streaming where supported (see the streaming sketch after this list).
- Proxy to OpenAI, Anthropic, Ollama, and other OpenAI-compatible APIs.
- Use `higgs shellenv` to export `ANTHROPIC_BASE_URL` and `OPENAI_BASE_URL`.
- Use `higgs exec -- <cmd>` to launch a command with those variables set.
- Point tools such as Claude Code, Aider, and other OpenAI/Anthropic-compatible clients at a single local endpoint.
- Run in the foreground with `higgs serve` or as a background daemon with config-driven `higgs start`.
- Open the daemon metrics dashboard with `higgs attach` for routing, latency, throughput, and error visibility.
- Validate config and model/provider setup with `higgs doctor`.
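Streaming follows the same routed, translated path. A minimal sketch, reusing the routed Anthropic model from the example above and the standard OpenAI `stream` field:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -d '{
    "model": "claude-sonnet-4-6",
    "stream": true,
    "messages": [{"role": "user", "content": "Stream one sentence about Cape Town."}]
  }'
```

Where the upstream provider supports it, the response streams back in the format the client asked for.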
- Release artifacts bundle `mlx.metallib`.
- Source builds also require `mlx.metallib` next to the executable. Higgs now restores it automatically from Cargo build output when possible, then fails loudly if it still cannot be found.
- `[local].raise_wired_limit` defaults to `false`. Enable it only when you explicitly want MLX to raise the process wired-memory limit.
- `batch=true` is only supported for transformer families with true batched decode support.
Benchmarks below were run on M4 Max 128GB. Methodology, harness details, and benchmark-driven defaults are documented in docs/benchmarking.md.
Single request, 500 generated tokens, median of 3 runs.
| Model | higgs | mlx_lm | vllm-mlx | llama.cpp | Ollama |
|---|---|---|---|---|---|
| Llama-3.2-1B-4bit | 448 | 421 | 433 | 314 | 305 |
| Mistral-7B-v0.3-4bit | 103 | 103 | -- | 87 | 85 |
| Qwen3-1.7B-4bit | 305 | 293 | 300 | 216 | 183 |
| Qwen3-30B-A3B-8bit | 75 | 86 | 87 | 83 | 73 |
| Gemma-2-2B-4bit | 163 | 185 | 91 | -- | -- |
| Phi-3-mini-4bit | 171 | 170 | 95 | -- | -- |
| Starcoder2-3B-4bit | 107 | 176 | 165 | -- | -- |
| DeepSeek-V2-Lite-4bit | 140 | 174 | 99 | -- | -- |
MLX models use 4-bit, or 8-bit for MoE. llama.cpp and Ollama use Q4_K_M, or Q8_0 for MoE.
Measured on DeepSeek-V2-Lite-4bit with global batch sorting before `gather_qmm`.
| Prompt tokens | Before | After | Speedup |
|---|---|---|---|
| 59 | 472ms | 227ms | 2.1x |
| 481 | 3,734ms | 863ms | 4.3x |
| 1,831 | 14,390ms | 3,123ms | 4.6x |
| 4,532 | 37,489ms | 8,860ms | 4.2x |
| Concurrent requests | higgs tok/s | vllm-mlx tok/s |
|---|---|---|
| 1 | 280 | 250 |
| 2 | 585 | 459 |
| 4 | 698 | 510 |
| 8 | 755 | 646 |
| Model | higgs | mlx_lm | vllm-mlx |
|---|---|---|---|
| Llama-3.2-1B-4bit | 974 | 1,356 | 1,380 |
| Mistral-7B-v0.3-4bit | 3,965 | 4,384 | -- |
| Qwen3-1.7B-4bit | 1,127 | 1,609 | 1,641 |
| Qwen3-30B-A3B-8bit | 31,139 | 31,640 | 31,658 |
| Gemma-2-2B-4bit | 1,645 | 2,329 | 2,350 |
| Phi-3-mini-4bit | 2,126 | 2,548 | 2,573 |
| DeepSeek-V2-Lite-4bit | 8,528 | 8,972 | 8,998 |
| | higgs | vllm-mlx |
|---|---|---|
| Structured output (10 prompts, JSON schema) | 100% | 0% |
| Reasoning extraction (5 questions, Qwen3) | 5/5 | 4/5 |
| All architectures produce coherent output | Yes | Yes |
API endpoints
- OpenAI: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`
- Anthropic: `/v1/messages`, `/v1/messages/count_tokens`
- Metrics: `/metrics`
- Health: `/health`
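For a quick local check of those endpoints (a sketch, assuming the default `localhost:8000` address used in the examples above):

```bash
curl -s http://localhost:8000/health
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/metrics
```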
Core commands
- `higgs serve`: start in the foreground
- `higgs start`: start a background daemon from config or profile
- `higgs stop`: stop a running daemon, or use `higgs stop --force`
- `higgs attach`: open the daemon metrics dashboard
- `higgs init`: create `~/.config/higgs/config.toml`
- `higgs doctor`: validate config, model paths, and providers
- `higgs shellenv`: print `ANTHROPIC_BASE_URL` and `OPENAI_BASE_URL` after verifying the server is reachable
- `higgs exec -- <cmd>`: run a tool with those variables set after the same reachability check
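A short sketch of the last two commands in practice (this assumes `higgs shellenv` prints eval-able export statements, as shell-env helpers conventionally do):

```bash
# Export the base URLs into the current shell
eval "$(higgs shellenv)"
echo "$OPENAI_BASE_URL" "$ANTHROPIC_BASE_URL"

# Or scope the variables to a single tool invocation
higgs exec -- claude
```

Both commands check that the server is reachable first, so start Higgs before wiring up clients.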
- Replace old `higgs start --model ...` usage with `higgs serve --model ...`.
- If you previously treated `higgs attach` as a best-effort log viewer, expect it to fail fast when the daemon is down or metrics logging is disabled.
- If you relied on regex routes to override a local model with the same exact name, rename the local model or route; local exact matches now win.
- If you run local source builds, make sure `cargo build` completes before first serve so Higgs can restore `mlx.metallib` from Cargo output when needed.
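A minimal sketch of that source-build ordering (assuming the binary lands at Cargo's default `target/release/higgs` path):

```bash
# Let the release build finish first so mlx.metallib exists in Cargo output
cargo build --release

# The first serve can then restore mlx.metallib next to the binary if it is missing
./target/release/higgs serve --model mlx-community/Llama-3.2-1B-Instruct-4bit
```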
- Run `scripts/release_smoke_cached_models.sh` to validate the cached MLX models already present on the machine without downloading anything.
- Set `HIGGS_SMOKE_INCLUDE_OPTIONAL_MODELS=1` to include optional large/private cached models like `mlx-community/Qwen3.6-35B-A3B-4bit`.
- The harness covers single-model serve, streaming and non-streaming requests, multi-model startup, routing precedence, daemon start/attach/stop, and the batch-support guardrails.
- The current smoke matrix exercised these cached models on this machine: `mlx-community/Llama-3.2-1B-Instruct-4bit`, `mlx-community/Qwen2.5-3B-Instruct-4bit`, `mlx-community/Qwen3-1.7B-4bit`, `mlx-community/Qwen3-Coder-Next-4bit`, and `mlx-community/Qwen3.6-35B-A3B-4bit`.
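For example (both invocations use only models already in the local cache):

```bash
# Default run: cached models only, no downloads
scripts/release_smoke_cached_models.sh

# Also exercise optional large/private cached models
HIGGS_SMOKE_INCLUDE_OPTIONAL_MODELS=1 scripts/release_smoke_cached_models.sh
```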
For full configuration reference, routing options, supported model families, and benchmark details, see the docs/ directory.
Before sending changes, run the test and lint suite:

```bash
cargo test -- --test-threads=1
cargo clippy
cargo fmt --check
```

Contributor workflow, project structure, and doc update expectations live in CONTRIBUTING.md.
MIT