Stop configuring clients for every GPU box. Workers connect out; requests route in.
You have GPU boxes running llama-server (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.
ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect out to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.
```
Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
        │
        │  POST /v1/chat/completions
        │  POST /v1/messages
        ▼
┌──────────────────────┐
│  modelrelay-server   │◄─── workers connect out (WebSocket)
│  (one stable         │     no inbound ports needed on GPU boxes
│   endpoint)          │
└──────────────────────┘
        │  routes request to best available worker
        ▼
┌────────┐  ┌────────┐  ┌────────┐
│worker-1│  │worker-2│  │worker-3│
│ llama  │  │ ollama │  │  vllm  │   ← your GPU boxes,
│ server │  │        │  │        │     anywhere on any network
└────────┘  └────────┘  └────────┘
```
ModelRelay Desktop is a native tray application that wraps the worker daemon in a lightweight GUI. It stays in your system tray and manages the connection to your relay server — no terminal required.
Features:
- System tray icon showing connection status (connected / disconnected / relaying)
- Settings UI for backend URL, relay server, worker secret, model selection, and poll interval
- Auto-reconnect on connection loss with status notifications
- Auto-start on login
- Live model list that refreshes as your backend models change
Dashboard with live connection status and model list. Onboarding wizard and full settings pane shown below.
Download: Grab the latest installer for your platform from the Desktop Releases page.
| Platform | Installer |
|---|---|
| Windows | .msi or .exe |
| macOS | .dmg |
| Linux | .AppImage or .deb |
Getting started:
- Download and install the app for your platform
- Launch ModelRelay Desktop — it appears in your system tray
- Right-click the tray icon and open Settings
- Enter your backend URL (e.g. `http://127.0.0.1:8000`), relay server URL, and worker secret
- Click Connect — the tray icon updates to show your connection status
The desktop app uses the same modelrelay-worker library under the hood, so it supports all the same backends (llama-server, Ollama, vLLM, LM Studio, etc.).
Auto-updates: The app checks for new releases on launch and from the tray's Check for Updates… menu, then installs signed updates in place — no manual reinstall needed. See docs/auto-updates.md for how it works and how to cut a release.
- Home GPU users running local models who want a single API endpoint across multiple machines
- Teams with on-prem hardware that need to pool GPU capacity without a service mesh
- Researchers juggling models across heterogeneous boxes who are tired of updating client configs
| Alternative | What's missing |
|---|---|
| Pointing clients directly at llama-server | No HA, no queue, clients must know about every box, no cancellation |
| nginx / HAProxy | Doesn't understand LLM streaming semantics, no queueing, no worker auth, no cancellation propagation |
| LiteLLM / OpenRouter | Cloud-first routing — not designed for your own private hardware calling home |
Don't want to run the infrastructure yourself? A fully-managed hosted version is available at modelrelay.io — no server setup, no infrastructure to manage. Just get an API key, point your workers at it, and start routing requests. Same open protocol, zero ops burden.
Pre-built binaries are the fastest way to get started. Download the latest release for your platform from the Releases page:
| Platform | modelrelay-server | modelrelay-worker |
|---|---|---|
| Linux x86_64 | `modelrelay-server-linux-amd64` | `modelrelay-worker-linux-amd64` |
| Linux arm64 | `modelrelay-server-linux-arm64` | `modelrelay-worker-linux-arm64` |
| macOS Intel | `modelrelay-server-darwin-amd64` | `modelrelay-worker-darwin-amd64` |
| macOS Apple Silicon | `modelrelay-server-darwin-arm64` | `modelrelay-worker-darwin-arm64` |
| Windows x86_64 | `modelrelay-server-windows-amd64.exe` | `modelrelay-worker-windows-amd64.exe` |
| Windows arm64 | `modelrelay-server-windows-arm64.exe` | `modelrelay-worker-windows-arm64.exe` |
Start the proxy:
```bash
./modelrelay-server \
  --listen 0.0.0.0:8080 \
  --worker-secret mysecret
```

Start a worker (on a GPU box with llama-server, Ollama, vLLM, or any OpenAI-compatible backend):

```bash
./modelrelay-worker \
  --proxy-url http://<proxy-host>:8080 \
  --worker-secret mysecret \
  --backend-url http://127.0.0.1:8000 \
  --models llama3.2:3b,llama3.2:1b
```

Pre-built images are published to GitHub Container Registry on every release and main push.
```bash
# Pull the latest images
docker pull ghcr.io/ericflo/modelrelay/modelrelay-server:latest
docker pull ghcr.io/ericflo/modelrelay/modelrelay-worker:latest

# Run the proxy
docker run -p 8080:8080 \
  -e WORKER_SECRET=mysecret \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  ghcr.io/ericflo/modelrelay/modelrelay-server:latest

# Run a worker (on a GPU box)
docker run \
  -e PROXY_URL=http://<proxy-host>:8080 \
  -e WORKER_SECRET=mysecret \
  -e BACKEND_URL=http://host.docker.internal:8000 \
  -e MODELS=llama3.2:3b \
  ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
```

For pinned versions, replace `:latest` with a release tag (e.g. `:0.2.1`).
```bash
git clone https://github.com/ericflo/modelrelay.git
cd modelrelay

# Start the proxy + one worker (assumes llama-server on host port 8081)
docker compose up
```

The proxy is now listening on `http://localhost:8080`. The worker connects to it automatically and forwards requests to your backend.
Note: The crates are not yet published to crates.io, so `cargo install` will not work yet — use pre-built binaries or Docker in the meantime. See CONTRIBUTING.md for how to configure the `CRATES_IO_TOKEN` secret for publishing.

```bash
cargo install modelrelay-server modelrelay-worker
```

To build from source:

```bash
cargo build --release
# Binaries: target/release/modelrelay-server and target/release/modelrelay-worker
```

Try it:

```bash
# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming (SSE chunks pass through from the backend)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Once the proxy is running, point your existing tools at it — no special client needed.
curl — see Try it above.

Claude Code / Claude Desktop — set the base URL to your proxy:

```bash
export ANTHROPIC_BASE_URL=http://localhost:8080
claude  # requests now route through ModelRelay
```

LiteLLM — add a model entry in your config.yaml:

```yaml
model_list:
  - model_name: llama3.2:3b
    litellm_params:
      model: openai/llama3.2:3b
      api_base: http://localhost:8080/v1
```

Open WebUI — point the OpenAI-compatible backend at the proxy:

```bash
export OPENAI_API_BASE_URL=http://localhost:8080/v1
```

Any tool that speaks OpenAI or Anthropic API formats works — just change the base URL.
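As a concrete example, here is a minimal Python client using only the standard library. This is a sketch against the quickstart defaults (proxy at `localhost:8080`, model `llama3.2:3b`); the `send` step needs a running proxy and worker.

```python
import json
import urllib.request

PROXY = "http://localhost:8080"  # your modelrelay-server endpoint

def build_chat_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }).encode()
    return urllib.request.Request(
        f"{PROXY}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(req: urllib.request.Request) -> str:
    """Send the request and return the reply text (requires a live proxy + worker)."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If you run the server with API key auth enabled, also attach an `Authorization: Bearer <key>` header.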
Once your worker is running, set it up as a system service so it starts automatically on boot:
- Linux (systemd): Use the template unit in `extras/modelrelay-worker@.service` — supports multiple workers per machine (`modelrelay-worker@gpu0`, `@gpu1`, etc.). See Systemd below for full instructions.
- macOS (launchd): Create a Launch Daemon plist pointing at the binary and your `config.toml`. The worker starts on boot and restarts on crash.
- Windows (Service): Register with `sc.exe create` and set env vars with `[Environment]::SetEnvironmentVariable`. See Windows Service below for full instructions.
The setup wizard at /setup in the web UI walks through this interactively with copy-paste commands.
The extras/modelrelay-llamafile script is a self-contained CLI for downloading, running, and relaying llamafile models through ModelRelay. No dependencies beyond bash and curl.
```bash
# See what fits your hardware
./extras/modelrelay-llamafile recommend

# Browse models by category
./extras/modelrelay-llamafile list --tag reasoning

# Save your relay config once
./extras/modelrelay-llamafile config set proxy-url https://relay.example.com
./extras/modelrelay-llamafile config set worker-secret mysecret

# Now just serve — no flags needed
./extras/modelrelay-llamafile serve qwen3.5-4b

# Verify it works end-to-end
./extras/modelrelay-llamafile test qwen3.5-4b

# Manage running models
./extras/modelrelay-llamafile status
./extras/modelrelay-llamafile logs qwen3.5-4b -f
./extras/modelrelay-llamafile stop all

# Import your own llamafiles
./extras/modelrelay-llamafile import ./my-model.llamafile --slug my-model

# Refresh catalog when Mozilla publishes new models
./extras/modelrelay-llamafile update-catalog
```

Run `./extras/modelrelay-llamafile help` for full usage, or `./extras/modelrelay-llamafile doctor` to check system readiness.
- Cross-platform — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)
- OpenAI + Anthropic compatible — `POST /v1/chat/completions`, `POST /v1/responses`, `POST /v1/messages`, `GET /v1/models`
- No inbound ports on GPU boxes — workers connect out to the proxy over WebSocket
- Request queueing — configurable depth and timeout when all workers are busy
- Streaming pass-through — SSE chunks forwarded with preserved ordering and termination
- End-to-end cancellation — client disconnect propagates through the proxy to the worker to the backend
- Automatic requeue — if a worker dies mid-request, the request is requeued to another worker
- Heartbeat and load tracking — stale workers are cleaned up; workers report current load
- Graceful drain — workers can shut down while replacement workers pick up queued work
- Model catalog refresh — workers can update their model list without reconnecting
- Auth cooldown recovery — workers recover gracefully from authentication failures
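The streaming and cancellation items above operate on standard SSE framing, so a client just reads `data:` lines off the response. A minimal sketch of extracting text deltas from OpenAI-style streaming chunks (assumes the conventional `data: {...}` / `data: [DONE]` framing that OpenAI-compatible backends emit):

```python
import json
from typing import Iterable, Iterator

def iter_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style chat completion SSE lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # stream terminator
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]

# Example over a captured stream:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # → Hello
```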
| Flag | Env var | Default | Description |
|---|---|---|---|
| `--listen` | `LISTEN_ADDR` | `127.0.0.1:8080` | Address to listen on |
| `--worker-secret` | `WORKER_SECRET` | (required) | Secret workers must present to authenticate |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name used for worker routing and request dispatch |
| `--max-queue-len` | `MAX_QUEUE_LEN` | `100` | Maximum number of queued requests (0 = unlimited) |
| `--queue-timeout` | `QUEUE_TIMEOUT_SECS` | `30` | Seconds before a queued request times out (0 = no timeout) |
| `--request-timeout` | `REQUEST_TIMEOUT_SECS` | `300` | Seconds before an in-flight HTTP request times out (0 = no timeout) |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_server=debug`). Overridden by `RUST_LOG` if set. |
| `--admin-token` | `MODELRELAY_ADMIN_TOKEN` | (none) | Bearer token for `/admin/*` endpoints. If unset, admin endpoints return 403. |
| `--require-api-keys` | `MODELRELAY_REQUIRE_API_KEYS` | `false` | When true, client inference requests must include a valid API key as Bearer token. |
| Flag | Env var | Default | Description |
|---|---|---|---|
| `--proxy-url` | `PROXY_URL` | `http://127.0.0.1:8080` | Base URL of the proxy server |
| `--worker-secret` | `WORKER_SECRET` | (required) | Secret used to authenticate with the proxy |
| `--backend-url` | `BACKEND_URL` | `http://127.0.0.1:8000` | Base URL of the local model backend |
| `--models` | `MODELS` | `default` | Comma-separated list of model names this worker supports |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name to register with on the proxy |
| `--worker-name` | `WORKER_NAME` | `worker` | Human-readable name for this worker instance |
| `--max-concurrency` | `MAX_CONCURRENCY` | `1` | Maximum number of concurrent requests this worker will handle |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_worker=debug`). Overridden by `RUST_LOG` if set. |
All flags can be passed as CLI arguments or set via the corresponding environment variable.
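The resolution order sketched below assumes the common clap convention (an explicit CLI flag wins over the environment variable, which wins over the default); that precedence is an assumption here, so verify against the binaries if it matters for your deployment:

```python
import os

def resolve(flag_value, env_var, default):
    """Resolve a setting: CLI flag > environment variable > default.
    (Precedence assumed from common clap conventions, not verified.)"""
    if flag_value is not None:
        return flag_value
    return os.environ.get(env_var, default)

os.environ["MAX_CONCURRENCY"] = "4"
print(resolve(None, "MAX_CONCURRENCY", "1"))  # env var used → 4
print(resolve("8", "MAX_CONCURRENCY", "1"))   # flag wins → 8
```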
ModelRelay includes built-in admin endpoints for monitoring and an embedded web dashboard for managing your deployment.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/health` | None | Basic health check — returns version, worker count, queue depth, and uptime |
| GET | `/admin/workers` | Admin token | List connected workers with models, load, and capabilities |
| GET | `/admin/stats` | Admin token | Request counts, queue depth per provider |
| GET | `/admin/keys` | Admin token | List client API key metadata (no secrets) |
| POST | `/admin/keys` | Admin token | Create a new client API key — returns the secret once |
| DELETE | `/admin/keys/{id}` | Admin token | Revoke a client API key |
All `/admin/*` endpoints require a Bearer token matching `MODELRELAY_ADMIN_TOKEN`:

```bash
# Set the admin token when starting the server
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret

# Query admin endpoints
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/workers
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/stats
```

If `MODELRELAY_ADMIN_TOKEN` is not set, all admin endpoints return 403 Forbidden.
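Since `/health` needs no token, it works well for cron-style monitoring. The endpoint is documented to return version, worker count, queue depth, and uptime, but the exact JSON key names below are illustrative placeholders, so adjust them to the real payload:

```python
import json

def summarize_health(raw: str) -> str:
    """One-line summary of a /health response (field names assumed, not guaranteed)."""
    h = json.loads(raw)
    return (f"v{h.get('version', '?')} | "
            f"workers={h.get('workers', '?')} | "
            f"queued={h.get('queue_depth', '?')}")

# With a hypothetical payload:
print(summarize_health('{"version":"0.2.1","workers":3,"queue_depth":0}'))
# → v0.2.1 | workers=3 | queued=0
```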
When MODELRELAY_REQUIRE_API_KEYS is set to true, clients must include a valid API key as a Bearer token on inference requests (/v1/chat/completions, /v1/messages, etc.). Without a valid key, requests are rejected with 401 Unauthorized.
```bash
# Start the server with API key auth enabled
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret --require-api-keys true

# Create a client API key (the secret is returned only once)
curl -X POST -H "Authorization: Bearer my-admin-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}' \
  http://localhost:8080/admin/keys

# Use the key for inference
curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}' \
  http://localhost:8080/v1/chat/completions

# Revoke a key
curl -X DELETE -H "Authorization: Bearer my-admin-secret" \
  http://localhost:8080/admin/keys/{key-id}
```

When `MODELRELAY_REQUIRE_API_KEYS` is false (the default), inference endpoints accept requests without any authentication.
The modelrelay-web crate provides an embedded web UI served by the proxy:
- Dashboard at `/dashboard` — real-time view of connected workers, request metrics, and queue depth
- Setup Wizard at `/setup` — step-by-step guide for connecting new workers (platform detection, backend configuration, worker binary download, and live connection verification)
The setup wizard is always accessible — not just on first run. Use it to add additional GPU boxes to your fleet at any time.
The included docker-compose.yml runs the proxy with two workers, health checks, restart policies, memory limits, and log rotation:
```bash
cp .env.example .env   # edit WORKER_SECRET and backend URLs
docker compose up -d
```

Add more workers by duplicating a worker service block and adjusting `MODELS`, `BACKEND_URL`, and `WORKER_NAME`.
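A duplicated worker block might look roughly like this. It is a sketch only: the service name, proxy hostname, and model values are illustrative, not copied from the bundled docker-compose.yml.

```yaml
# Fragment to append under the existing `services:` key (names are illustrative)
  worker-3:
    image: ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
    environment:
      PROXY_URL: http://server:8080        # hypothetical proxy service name
      WORKER_SECRET: ${WORKER_SECRET}
      BACKEND_URL: http://host.docker.internal:8000
      MODELS: llama3.2:1b
      WORKER_NAME: worker-3
    restart: unless-stopped
```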
Service files live in extras/:
```bash
# Install binaries (from a release archive or cargo build --release)
sudo install -m 755 modelrelay-server modelrelay-worker /usr/local/bin/

# Create a service user
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay

# Proxy
sudo cp extras/modelrelay-server.service /etc/systemd/system/
sudo cp extras/proxy.env.example /etc/modelrelay/proxy.env
sudo vim /etc/modelrelay/proxy.env   # set WORKER_SECRET
sudo systemctl enable --now modelrelay-server

# Workers — the template unit lets you run multiple instances:
sudo cp extras/modelrelay-worker@.service /etc/systemd/system/
sudo cp extras/worker.env.example /etc/modelrelay/worker-gpu0.env
sudo vim /etc/modelrelay/worker-gpu0.env   # set PROXY_URL, BACKEND_URL, MODELS
sudo systemctl enable --now modelrelay-worker@gpu0
```

See `extras/` for the full service files and annotated env examples.
ModelRelay ships Windows binaries that can run as native Windows Services using sc.exe. No third-party service wrappers required.
```powershell
# Install the server as a service (run as Administrator)
sc.exe create ModelRelayServer binPath= "C:\ModelRelay\modelrelay-server.exe" start= auto

# Set environment variables for the service (system-wide, persists across reboots)
[Environment]::SetEnvironmentVariable("WORKER_SECRET", "your-secret-here", "Machine")
[Environment]::SetEnvironmentVariable("LISTEN_ADDR", "0.0.0.0:8080", "Machine")

# Start the service
Start-Service ModelRelayServer

# Install a worker service
sc.exe create ModelRelayWorker binPath= '"C:\ModelRelay\modelrelay-worker.exe" --models llama3-8b' start= auto
[Environment]::SetEnvironmentVariable("PROXY_URL", "http://your-proxy:8080", "Machine")
[Environment]::SetEnvironmentVariable("BACKEND_URL", "http://localhost:8000", "Machine")
Start-Service ModelRelayWorker
```

For fully annotated install scripts with error handling and uninstall support, see `extras/install-windows-service.ps1` and `extras/install-windows-service-worker.ps1`. The service runs as LocalSystem by default; to use a dedicated account, set the service log-on via `services.msc` or pass `obj=` and `password=` to `sc.exe create`.
The proxy and workers communicate over plain HTTP/WebSocket by default. For production, terminate TLS at a reverse proxy like nginx. An annotated configuration is provided at examples/tls-nginx.conf — it handles HTTPS for client requests and wss:// WebSocket upgrades for workers, with streaming-friendly settings (buffering disabled, long timeouts).
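The shape of that configuration is roughly the following. This is a hand-written sketch, not an excerpt from examples/tls-nginx.conf: hostnames and certificate directives are placeholders, and the real file is the annotated, complete version.

```nginx
# Illustrative sketch — see examples/tls-nginx.conf for the real configuration
server {
    listen 443 ssl;
    server_name relay.example.com;          # placeholder hostname
    # ssl_certificate / ssl_certificate_key omitted in this sketch

    location / {
        proxy_pass http://127.0.0.1:8080;   # modelrelay-server

        # WebSocket upgrade for worker connections (wss://)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Streaming-friendly: don't buffer SSE chunks, allow long-lived streams
        proxy_buffering off;
        proxy_read_timeout 3600s;
    }
}
```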
A ready-made load test script lives at `extras/load-test.sh`. It uses `hey` if installed, falls back to `wrk`, and finally to parallel `curl` loops:

```bash
./extras/load-test.sh -n 200 -c 20 -m llama3-8b
```

Both `modelrelay-server` and `modelrelay-worker` can generate shell completion scripts via the hidden `--completions` flag:
```bash
# Bash
modelrelay-server --completions bash > ~/.local/share/bash-completion/completions/modelrelay-server
modelrelay-worker --completions bash > ~/.local/share/bash-completion/completions/modelrelay-worker

# Zsh (add the target directory to $fpath)
modelrelay-server --completions zsh > ~/.zfunc/_modelrelay-server
modelrelay-worker --completions zsh > ~/.zfunc/_modelrelay-worker

# Fish
modelrelay-server --completions fish > ~/.config/fish/completions/modelrelay-server.fish
modelrelay-worker --completions fish > ~/.config/fish/completions/modelrelay-worker.fish
```

Supported shells: bash, zsh, fish, powershell, elvish.
Full documentation: ericflo.github.io/modelrelay
- Behavior contract — the full specification of proxy, queue, streaming, and cancellation semantics
- Architecture sketch — how the pieces fit together internally
- Protocol walkthrough — ASCII wire traces for every message flow
- Operational runbook — health checks, draining, scaling, troubleshooting
The behavior matrix is exercised at three layers: black-box contract harnesses in modelrelay-contract-tests, live HTTP integration tests in modelrelay-server, and end-to-end live backend tests in modelrelay-worker.
```bash
cargo fmt --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
```

Bug reports, feature requests, and PRs are welcome. See CONTRIBUTING.md for code style, test expectations, branch naming, and CI secrets.
To report a security vulnerability, follow the process in SECURITY.md.
MIT


