Stop configuring clients for every GPU box. Workers connect out; requests route in.
You have GPU boxes running llama-server (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.
ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect out to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.
```
Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
        │
        │  POST /v1/chat/completions
        │  POST /v1/messages
        ▼
┌──────────────────────┐
│  modelrelay-server   │◄─── workers connect out (WebSocket)
│  (one stable         │     no inbound ports needed on GPU boxes
│   endpoint)          │
└──────────────────────┘
        │  routes request to best available worker
        ▼
┌────────┐  ┌────────┐  ┌────────┐
│worker-1│  │worker-2│  │worker-3│
│ llama  │  │ ollama │  │  vllm  │   ← your GPU boxes,
│ server │  │        │  │        │     anywhere on any network
└────────┘  └────────┘  └────────┘
```
ModelRelay Desktop is a native tray application that wraps the worker daemon in a lightweight GUI. It stays in your system tray and manages the connection to your relay server — no terminal required.
Features:
- System tray icon showing connection status (connected / disconnected / relaying)
- Settings UI for backend URL, relay server, worker secret, model selection, and poll interval
- Auto-reconnect on connection loss with status notifications
- Auto-start on login
- Live model list that refreshes as your backend models change
Dashboard with live connection status and model list. Onboarding wizard and full settings pane shown below.
Download: Grab the latest installer for your platform from the Desktop Releases page.
| Platform | Installer |
|---|---|
| Windows | .msi or .exe |
| macOS | .dmg |
| Linux | .AppImage or .deb |
Getting started:
- Download and install the app for your platform
- Launch ModelRelay Desktop — it appears in your system tray
- Right-click the tray icon and open Settings
- Enter your backend URL (e.g. `http://127.0.0.1:8000`), relay server URL, and worker secret
- Click Connect — the tray icon updates to show your connection status
The desktop app uses the same modelrelay-worker library under the hood, so it supports all the same backends (llama-server, Ollama, vLLM, LM Studio, etc.).
Auto-updates: The app checks for new releases on launch and from the tray's Check for Updates… menu, then installs signed updates in place — no manual reinstall needed. See docs/auto-updates.md for how it works and how to cut a release.
- Home GPU users running local models who want a single API endpoint across multiple machines
- Teams with on-prem hardware that need to pool GPU capacity without a service mesh
- Researchers juggling models across heterogeneous boxes who are tired of updating client configs
| Alternative | What's missing |
|---|---|
| Pointing clients directly at llama-server | No HA, no queue, clients must know about every box, no cancellation |
| nginx / HAProxy | Doesn't understand LLM streaming semantics, no queueing, no worker auth, no cancellation propagation |
| LiteLLM / OpenRouter | Cloud-first routing — not designed for your own private hardware calling home |
Don't want to run the infrastructure yourself? A fully-managed hosted version is available at modelrelay.io — no server setup, no infrastructure to manage. Just get an API key, point your workers at it, and start routing requests. Same open protocol, zero ops burden.
Pre-built binaries are the fastest way to get started. Download the latest release for your platform from the Releases page:
| Platform | modelrelay-server | modelrelay-worker |
|---|---|---|
| Linux x86_64 | `modelrelay-server-linux-amd64` | `modelrelay-worker-linux-amd64` |
| Linux arm64 | `modelrelay-server-linux-arm64` | `modelrelay-worker-linux-arm64` |
| macOS Intel | `modelrelay-server-darwin-amd64` | `modelrelay-worker-darwin-amd64` |
| macOS Apple Silicon | `modelrelay-server-darwin-arm64` | `modelrelay-worker-darwin-arm64` |
| Windows x86_64 | `modelrelay-server-windows-amd64.exe` | `modelrelay-worker-windows-amd64.exe` |
| Windows arm64 | `modelrelay-server-windows-arm64.exe` | `modelrelay-worker-windows-arm64.exe` |
Start the proxy:
```bash
./modelrelay-server \
  --listen 0.0.0.0:8080 \
  --worker-secret mysecret
```

Start a worker (on a GPU box with llama-server, Ollama, vLLM, or any OpenAI-compatible backend):

```bash
./modelrelay-worker \
  --proxy-url http://<proxy-host>:8080 \
  --worker-secret mysecret \
  --backend-url http://127.0.0.1:8000 \
  --models llama3.2:3b,llama3.2:1b
```

Pre-built images are published to GitHub Container Registry on every release and main push.
```bash
# Pull the latest images
docker pull ghcr.io/ericflo/modelrelay/modelrelay-server:latest
docker pull ghcr.io/ericflo/modelrelay/modelrelay-worker:latest

# Run the proxy
docker run -p 8080:8080 \
  -e WORKER_SECRET=mysecret \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  ghcr.io/ericflo/modelrelay/modelrelay-server:latest

# Run a worker (on a GPU box)
docker run \
  -e PROXY_URL=http://<proxy-host>:8080 \
  -e WORKER_SECRET=mysecret \
  -e BACKEND_URL=http://host.docker.internal:8000 \
  -e MODELS=llama3.2:3b \
  ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
```

For pinned versions, replace `:latest` with a release tag (e.g. `:0.2.1`).
```bash
git clone https://github.com/ericflo/modelrelay.git
cd modelrelay

# Start the proxy + one worker (assumes llama-server on host port 8081)
docker compose up
```

The proxy is now listening on `http://localhost:8080`. The worker connects to it automatically and forwards requests to your backend.
Note: The crates are not yet published to crates.io, so `cargo install` will not work yet — use pre-built binaries or Docker in the meantime. See CONTRIBUTING.md for how to configure the `CRATES_IO_TOKEN` secret for publishing.

```bash
cargo install modelrelay-server modelrelay-worker
```

To build from source:

```bash
cargo build --release
# Binaries: target/release/modelrelay-server and target/release/modelrelay-worker
```

Try it:

```bash
# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming (SSE chunks pass through from the backend)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Once the proxy is running, point your existing tools at it — no special client needed.
curl — see Try it above.

Claude Code / Claude Desktop — set the base URL to your proxy:

```bash
export ANTHROPIC_BASE_URL=http://localhost:8080
claude  # requests now route through ModelRelay
```

LiteLLM — add a model entry in your config.yaml:

```yaml
model_list:
  - model_name: llama3.2:3b
    litellm_params:
      model: openai/llama3.2:3b
      api_base: http://localhost:8080/v1
```

Open WebUI — point the OpenAI-compatible backend at the proxy:

```bash
export OPENAI_API_BASE_URL=http://localhost:8080/v1
```

Any tool that speaks OpenAI or Anthropic API formats works — just change the base URL.
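As a concrete example, here is a minimal Python client using only the standard library. This is a sketch against the quickstart defaults (proxy at `localhost:8080`, model `llama3.2:3b`); the `send` step needs a running proxy and worker.

```python
import json
import urllib.request

PROXY = "http://localhost:8080"  # your modelrelay-server endpoint

def build_chat_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }).encode()
    return urllib.request.Request(
        f"{PROXY}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(req: urllib.request.Request) -> str:
    """Send the request and return the reply text (requires a live proxy + worker)."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If you run the server with API key auth enabled, also attach an `Authorization: Bearer <key>` header.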
Once your worker is running, set it up as a system service so it starts automatically on boot:
- Linux (systemd): Use the template unit in `extras/modelrelay-worker@.service` — supports multiple workers per machine (`modelrelay-worker@gpu0`, `@gpu1`, etc.). See Systemd below for full instructions.
- macOS (launchd): Create a Launch Daemon plist pointing at the binary and your `config.toml`. The worker starts on boot and restarts on crash.
- Windows (Service): Register with `sc.exe create` and set env vars with `[Environment]::SetEnvironmentVariable`. See Windows Service below for full instructions.
The setup wizard at /setup in the web UI walks through this interactively with copy-paste commands.
The extras/modelrelay-llamafile script is a self-contained CLI for downloading, running, and relaying llamafile models through ModelRelay. No dependencies beyond bash and curl.
```bash
# See what fits your hardware
./extras/modelrelay-llamafile recommend

# Browse models by category
./extras/modelrelay-llamafile list --tag reasoning

# Save your relay config once
./extras/modelrelay-llamafile config set proxy-url https://relay.example.com
./extras/modelrelay-llamafile config set worker-secret mysecret

# Now just serve — no flags needed
./extras/modelrelay-llamafile serve qwen3.5-4b

# Verify it works end-to-end
./extras/modelrelay-llamafile test qwen3.5-4b

# Manage running models
./extras/modelrelay-llamafile status
./extras/modelrelay-llamafile logs qwen3.5-4b -f
./extras/modelrelay-llamafile stop all

# Import your own llamafiles
./extras/modelrelay-llamafile import ./my-model.llamafile --slug my-model

# Refresh catalog when Mozilla publishes new models
./extras/modelrelay-llamafile update-catalog
```

Run `./extras/modelrelay-llamafile help` for full usage, or `./extras/modelrelay-llamafile doctor` to check system readiness.
- Cross-platform — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)
- OpenAI + Anthropic compatible — `POST /v1/chat/completions`, `POST /v1/responses`, `POST /v1/messages`, `GET /v1/models`
- No inbound ports on GPU boxes — workers connect out to the proxy over WebSocket
- Request queueing — configurable depth and timeout when all workers are busy
- Streaming pass-through — SSE chunks forwarded with preserved ordering and termination
- End-to-end cancellation — client disconnect propagates through the proxy to the worker to the backend
- Automatic requeue — if a worker dies mid-request, the request is requeued to another worker
- Heartbeat and load tracking — stale workers are cleaned up; workers report current load
- Graceful drain — workers can shut down while replacement workers pick up queued work
- Model catalog refresh — workers can update their model list without reconnecting
- Auth cooldown recovery — workers recover gracefully from authentication failures
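The streaming and cancellation items above operate on standard SSE framing, so a client just reads `data:` lines off the response. A minimal sketch of extracting text deltas from OpenAI-style streaming chunks (assumes the conventional `data: {...}` / `data: [DONE]` framing that OpenAI-compatible backends emit):

```python
import json
from typing import Iterable, Iterator

def iter_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style chat completion SSE lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # stream terminator
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]

# Example over a captured stream:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # → Hello
```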
| Flag | Env var | Default | Description |
|---|---|---|---|
| `--listen` | `LISTEN_ADDR` | `127.0.0.1:8080` | Address to listen on |
| `--worker-secret` | `WORKER_SECRET` | (required) | Secret workers must present to authenticate |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name used for worker routing and request dispatch |
| `--max-queue-len` | `MAX_QUEUE_LEN` | `100` | Maximum number of queued requests (0 = unlimited) |
| `--queue-timeout` | `QUEUE_TIMEOUT_SECS` | `30` | Seconds before a queued request times out (0 = no timeout) |
| `--request-timeout` | `REQUEST_TIMEOUT_SECS` | `300` | Seconds before an in-flight HTTP request times out (0 = no timeout) |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_server=debug`). Overridden by `RUST_LOG` if set. |
| `--admin-token` | `MODELRELAY_ADMIN_TOKEN` | (none) | Bearer token for `/admin/*` endpoints. If unset, admin endpoints return 403. |
| `--require-api-keys` | `MODELRELAY_REQUIRE_API_KEYS` | `false` | When true, client inference requests must include a valid API key as Bearer token. |
| Flag | Env var | Default | Description |
|---|---|---|---|
| `--proxy-url` | `PROXY_URL` | `http://127.0.0.1:8080` | Base URL of the proxy server |
| `--worker-secret` | `WORKER_SECRET` | (required) | Secret used to authenticate with the proxy |
| `--backend-url` | `BACKEND_URL` | `http://127.0.0.1:8000` | Base URL of the local model backend |
| `--models` | `MODELS` | `default` | Comma-separated list of model names this worker supports |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name to register with on the proxy |
| `--worker-name` | `WORKER_NAME` | `worker` | Human-readable name for this worker instance |
| `--max-concurrency` | `MAX_CONCURRENCY` | `1` | Maximum number of concurrent requests this worker will handle |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_worker=debug`). Overridden by `RUST_LOG` if set. |
All flags can be passed as CLI arguments or set via the corresponding environment variable.
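The resolution order sketched below assumes the common clap convention (an explicit CLI flag wins over the environment variable, which wins over the default); that precedence is an assumption here, so verify against the binaries if it matters for your deployment:

```python
import os

def resolve(flag_value, env_var, default):
    """Resolve a setting: CLI flag > environment variable > default.
    (Precedence assumed from common clap conventions, not verified.)"""
    if flag_value is not None:
        return flag_value
    return os.environ.get(env_var, default)

os.environ["MAX_CONCURRENCY"] = "4"
print(resolve(None, "MAX_CONCURRENCY", "1"))  # env var used → 4
print(resolve("8", "MAX_CONCURRENCY", "1"))   # flag wins → 8
```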
ModelRelay includes built-in admin endpoints for monitoring and an embedded web dashboard for managing your deployment.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/health` | None | Basic health check — returns version, worker count, queue depth, and uptime |
| GET | `/admin/workers` | Admin token | List connected workers with models, load, and capabilities |
| GET | `/admin/stats` | Admin token | Request counts, queue depth per provider |
| GET | `/admin/keys` | Admin token | List client API key metadata (no secrets) |
| POST | `/admin/keys` | Admin token | Create a new client API key — returns the secret once |
| DELETE | `/admin/keys/{id}` | Admin token | Revoke a client API key |
All `/admin/*` endpoints require a Bearer token matching `MODELRELAY_ADMIN_TOKEN`:

```bash
# Set the admin token when starting the server
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret

# Query admin endpoints
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/workers
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/stats
```

If `MODELRELAY_ADMIN_TOKEN` is not set, all admin endpoints return 403 Forbidden.
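Since `/health` needs no token, it works well for cron-style monitoring. The endpoint is documented to return version, worker count, queue depth, and uptime, but the exact JSON key names below are illustrative placeholders, so adjust them to the real payload:

```python
import json

def summarize_health(raw: str) -> str:
    """One-line summary of a /health response (field names assumed, not guaranteed)."""
    h = json.loads(raw)
    return (f"v{h.get('version', '?')} | "
            f"workers={h.get('workers', '?')} | "
            f"queued={h.get('queue_depth', '?')}")

# With a hypothetical payload:
print(summarize_health('{"version":"0.2.1","workers":3,"queue_depth":0}'))
# → v0.2.1 | workers=3 | queued=0
```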
When MODELRELAY_REQUIRE_API_KEYS is set to true, clients must include a valid API key as a Bearer token on inference requests (/v1/chat/completions, /v1/messages, etc.). Without a valid key, requests are rejected with 401 Unauthorized.
```bash
# Start the server with API key auth enabled
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret --require-api-keys true

# Create a client API key (the secret is returned only once)
curl -X POST -H "Authorization: Bearer my-admin-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}' \
  http://localhost:8080/admin/keys

# Use the key for inference
curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}' \
  http://localhost:8080/v1/chat/completions

# Revoke a key
curl -X DELETE -H "Authorization: Bearer my-admin-secret" \
  http://localhost:8080/admin/keys/{key-id}
```

When `MODELRELAY_REQUIRE_API_KEYS` is false (the default), inference endpoints accept requests without any authentication.
The modelrelay-web crate provides an embedded web UI served by the proxy:
- Dashboard at `/dashboard` — real-time view of connected workers, request metrics, and queue depth
- Setup Wizard at `/setup` — step-by-step guide for connecting new workers (platform detection, backend configuration, worker binary download, and live connection verification)
The setup wizard is always accessible — not just on first run. Use it to add additional GPU boxes to your fleet at any time.
The included docker-compose.yml runs the proxy with two workers, health checks, restart policies, memory limits, and log rotation:
```bash
cp .env.example .env   # edit WORKER_SECRET and backend URLs
docker compose up -d
```

Add more workers by duplicating a worker service block and adjusting `MODELS`, `BACKEND_URL`, and `WORKER_NAME`.
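A duplicated worker block might look roughly like this. It is a sketch only: the service name, proxy hostname, and model values are illustrative, not copied from the bundled docker-compose.yml.

```yaml
# Fragment to append under the existing `services:` key (names are illustrative)
  worker-3:
    image: ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
    environment:
      PROXY_URL: http://server:8080        # hypothetical proxy service name
      WORKER_SECRET: ${WORKER_SECRET}
      BACKEND_URL: http://host.docker.internal:8000
      MODELS: llama3.2:1b
      WORKER_NAME: worker-3
    restart: unless-stopped
```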
Service files live in extras/:
```bash
# Install binaries (from a release archive or cargo build --release)
sudo install -m 755 modelrelay-server modelrelay-worker /usr/local/bin/

# Create a service user
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay

# Proxy
sudo cp extras/modelrelay-server.service /etc/systemd/system/
sudo cp extras/proxy.env.example /etc/modelrelay/proxy.env
sudo vim /etc/modelrelay/proxy.env   # set WORKER_SECRET
sudo systemctl enable --now modelrelay-server

# Workers — the template unit lets you run multiple instances:
sudo cp extras/modelrelay-worker@.service /etc/systemd/system/
sudo cp extras/worker.env.example /etc/modelrelay/worker-gpu0.env
sudo vim /etc/modelrelay/worker-gpu0.env   # set PROXY_URL, BACKEND_URL, MODELS
sudo systemctl enable --now modelrelay-worker@gpu0
```

See `extras/` for the full service files and annotated env examples.
ModelRelay ships Windows binaries that can run as native Windows Services using sc.exe. No third-party service wrappers required.
```powershell
# Install the server as a service (run as Administrator)
sc.exe create ModelRelayServer binPath= "C:\ModelRelay\modelrelay-server.exe" start= auto

# Set environment variables for the service (system-wide, persists across reboots)
[Environment]::SetEnvironmentVariable("WORKER_SECRET", "your-secret-here", "Machine")
[Environment]::SetEnvironmentVariable("LISTEN_ADDR", "0.0.0.0:8080", "Machine")

# Start the service
Start-Service ModelRelayServer

# Install a worker service
sc.exe create ModelRelayWorker binPath= '"C:\ModelRelay\modelrelay-worker.exe" --models llama3-8b' start= auto
[Environment]::SetEnvironmentVariable("PROXY_URL", "http://your-proxy:8080", "Machine")
[Environment]::SetEnvironmentVariable("BACKEND_URL", "http://localhost:8000", "Machine")
Start-Service ModelRelayWorker
```

For fully annotated install scripts with error handling and uninstall support, see `extras/install-windows-service.ps1` and `extras/install-windows-service-worker.ps1`. The service runs as LocalSystem by default; to use a dedicated account, set the service log-on via `services.msc` or pass `obj=` and `password=` to `sc.exe create`.
The proxy and workers communicate over plain HTTP/WebSocket by default. For production, terminate TLS at a reverse proxy like nginx. An annotated configuration is provided at examples/tls-nginx.conf — it handles HTTPS for client requests and wss:// WebSocket upgrades for workers, with streaming-friendly settings (buffering disabled, long timeouts).
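The shape of that configuration is roughly the following. This is a hand-written sketch, not an excerpt from examples/tls-nginx.conf: hostnames and certificate directives are placeholders, and the real file is the annotated, complete version.

```nginx
# Illustrative sketch — see examples/tls-nginx.conf for the real configuration
server {
    listen 443 ssl;
    server_name relay.example.com;          # placeholder hostname
    # ssl_certificate / ssl_certificate_key omitted in this sketch

    location / {
        proxy_pass http://127.0.0.1:8080;   # modelrelay-server

        # WebSocket upgrade for worker connections (wss://)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Streaming-friendly: don't buffer SSE chunks, allow long-lived streams
        proxy_buffering off;
        proxy_read_timeout 3600s;
    }
}
```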
A ready-made load test script lives at `extras/load-test.sh`. It uses `hey` if installed, falls back to `wrk`, and finally to parallel `curl` loops:

```bash
./extras/load-test.sh -n 200 -c 20 -m llama3-8b
```

Both `modelrelay-server` and `modelrelay-worker` can generate shell completion scripts via the hidden `--completions` flag:
```bash
# Bash
modelrelay-server --completions bash > ~/.local/share/bash-completion/completions/modelrelay-server
modelrelay-worker --completions bash > ~/.local/share/bash-completion/completions/modelrelay-worker

# Zsh (add the target directory to $fpath)
modelrelay-server --completions zsh > ~/.zfunc/_modelrelay-server
modelrelay-worker --completions zsh > ~/.zfunc/_modelrelay-worker

# Fish
modelrelay-server --completions fish > ~/.config/fish/completions/modelrelay-server.fish
modelrelay-worker --completions fish > ~/.config/fish/completions/modelrelay-worker.fish
```

Supported shells: bash, zsh, fish, powershell, elvish.
Full documentation: ericflo.github.io/modelrelay
- Behavior contract — the full specification of proxy, queue, streaming, and cancellation semantics
- Architecture sketch — how the pieces fit together internally
- Protocol walkthrough — ASCII wire traces for every message flow
- Operational runbook — health checks, draining, scaling, troubleshooting
The behavior matrix is exercised at three layers: black-box contract harnesses in modelrelay-contract-tests, live HTTP integration tests in modelrelay-server, and end-to-end live backend tests in modelrelay-worker.
```bash
cargo fmt --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
```

Bug reports, feature requests, and PRs are welcome. See CONTRIBUTING.md for code style, test expectations, branch naming, and CI secrets.
To report a security vulnerability, follow the process in SECURITY.md.
MIT


