
ModelRelay

Stop configuring clients for every GPU box. Workers connect out; requests route in.

You have GPU boxes running llama-server (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.

ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect out to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.

  Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
         │
         │  POST /v1/chat/completions
         │  POST /v1/messages
         ▼
  ┌──────────────────────┐
  │   modelrelay-server  │◄─── workers connect out (WebSocket)
  │   (one stable        │     no inbound ports needed on GPU boxes
  │    endpoint)         │
  └──────────────────────┘
         │  routes request to best available worker
         ▼
  ┌────────┐  ┌────────┐  ┌────────┐
  │worker-1│  │worker-2│  │worker-3│
  │ llama  │  │ ollama │  │ vllm   │  ← your GPU boxes,
  │ server │  │        │  │        │    anywhere on any network
  └────────┘  └────────┘  └────────┘

Desktop App

ModelRelay Desktop is a native tray application that wraps the worker daemon in a lightweight GUI. It stays in your system tray and manages the connection to your relay server — no terminal required.

Features:

  • System tray icon showing connection status (connected / disconnected / relaying)
  • Settings UI for backend URL, relay server, worker secret, model selection, and poll interval
  • Auto-reconnect on connection loss with status notifications
  • Auto-start on login
  • Live model list that refreshes as your backend models change

Screenshots: the dashboard with live connection status, active requests, and model list; the onboarding wizard on the test-connection step showing a successful result; and the settings pane with connection, identity, performance, and behavior sections.

Download: Grab the latest installer for your platform from the Desktop Releases page.

Platform Installer
Windows .msi or .exe
macOS .dmg
Linux .AppImage or .deb

Getting started:

  1. Download and install the app for your platform
  2. Launch ModelRelay Desktop — it appears in your system tray
  3. Right-click the tray icon and open Settings
  4. Enter your backend URL (e.g. http://127.0.0.1:8000), relay server URL, and worker secret
  5. Click Connect — the tray icon updates to show your connection status

The desktop app uses the same modelrelay-worker library under the hood, so it supports all the same backends (llama-server, Ollama, vLLM, LM Studio, etc.).

Auto-updates: The app checks for new releases on launch and from the tray's Check for Updates… menu, then installs signed updates in place — no manual reinstall needed. See docs/auto-updates.md for how it works and how to cut a release.

Who is this for?

  • Home GPU users running local models who want a single API endpoint across multiple machines
  • Teams with on-prem hardware that need to pool GPU capacity without a service mesh
  • Researchers juggling models across heterogeneous boxes who are tired of updating client configs

Why this instead of...

Alternative What's missing
Pointing clients directly at llama-server No HA, no queue, clients must know about every box, no cancellation
nginx / HAProxy Doesn't understand LLM streaming semantics, no queueing, no worker auth, no cancellation propagation
LiteLLM / OpenRouter Cloud-first routing — not designed for your own private hardware calling home

Hosted Version

Don't want to run the infrastructure yourself? A fully-managed hosted version is available at modelrelay.io — no server setup, no infrastructure to manage. Just get an API key, point your workers at it, and start routing requests. Same open protocol, zero ops burden.

Quickstart

Pre-built binaries (recommended)

Pre-built binaries are the fastest way to get started. Download the latest release for your platform from the Releases page:

Platform modelrelay-server modelrelay-worker
Linux x86_64 modelrelay-server-linux-amd64 modelrelay-worker-linux-amd64
Linux arm64 modelrelay-server-linux-arm64 modelrelay-worker-linux-arm64
macOS Intel modelrelay-server-darwin-amd64 modelrelay-worker-darwin-amd64
macOS Apple Silicon modelrelay-server-darwin-arm64 modelrelay-worker-darwin-arm64
Windows x86_64 modelrelay-server-windows-amd64.exe modelrelay-worker-windows-amd64.exe
Windows arm64 modelrelay-server-windows-arm64.exe modelrelay-worker-windows-arm64.exe

Start the proxy:

./modelrelay-server \
  --listen 0.0.0.0:8080 \
  --worker-secret mysecret

Start a worker (on a GPU box with llama-server, Ollama, vLLM, or any OpenAI-compatible backend):

./modelrelay-worker \
  --proxy-url http://<proxy-host>:8080 \
  --worker-secret mysecret \
  --backend-url http://127.0.0.1:8000 \
  --models llama3.2:3b,llama3.2:1b

Docker

Pre-built images are published to GitHub Container Registry on every release and main push.

# Pull the latest images
docker pull ghcr.io/ericflo/modelrelay/modelrelay-server:latest
docker pull ghcr.io/ericflo/modelrelay/modelrelay-worker:latest

# Run the proxy
docker run -p 8080:8080 \
  -e WORKER_SECRET=mysecret \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  ghcr.io/ericflo/modelrelay/modelrelay-server:latest

# Run a worker (on a GPU box)
docker run \
  -e PROXY_URL=http://<proxy-host>:8080 \
  -e WORKER_SECRET=mysecret \
  -e BACKEND_URL=http://host.docker.internal:8000 \
  -e MODELS=llama3.2:3b \
  ghcr.io/ericflo/modelrelay/modelrelay-worker:latest

For pinned versions, replace :latest with a release tag (e.g. :0.2.1).

Docker Compose (easiest for local dev)

git clone https://github.com/ericflo/modelrelay.git
cd modelrelay

# Start the proxy + one worker (assumes llama-server on host port 8081)
docker compose up

The proxy is now listening on http://localhost:8080. The worker connects to it automatically and forwards requests to your backend.
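To confirm the stack is up, you can hit the proxy's unauthenticated health endpoint (documented under Admin API below), which reports version, worker count, queue depth, and uptime:

```shell
# Unauthenticated health check; worker count should be 1 once the
# compose worker has connected
curl -s http://localhost:8080/health
```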

From crates.io

Note: The crates are not yet published to crates.io. Use pre-built binaries or Docker in the meantime. See CONTRIBUTING.md for how to configure the CRATES_IO_TOKEN secret for publishing. Once published, installation will be:

cargo install modelrelay-server modelrelay-worker

Build from source

cargo build --release
# Binaries: target/release/modelrelay-server  target/release/modelrelay-worker

Try it

# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming (SSE chunks pass through from the backend)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Connecting your tools

Once the proxy is running, point your existing tools at it — no special client needed.

curl — see Try it above.

Claude Code / Claude Desktop — set the base URL to your proxy:

export ANTHROPIC_BASE_URL=http://localhost:8080
claude    # requests now route through ModelRelay

LiteLLM — add a model entry in your config.yaml:

model_list:
  - model_name: llama3.2:3b
    litellm_params:
      model: openai/llama3.2:3b
      api_base: http://localhost:8080/v1

Open WebUI — point the OpenAI-compatible backend at the proxy:

export OPENAI_API_BASE_URL=http://localhost:8080/v1

Any tool that speaks OpenAI or Anthropic API formats works — just change the base URL.
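The examples above use the OpenAI format; the Anthropic-format endpoint works the same way. A minimal sketch, where the body follows the standard Anthropic Messages API shape (max_tokens is required there), and the anthropic-version header is included defensively since some clients and backends expect it:

```shell
curl http://localhost:8080/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -d '{
    "model": "llama3.2:3b",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```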

Make it persistent

Once your worker is running, set it up as a system service so it starts automatically on boot:

  • Linux (systemd): Use the template unit in extras/modelrelay-worker@.service — supports multiple workers per machine (modelrelay-worker@gpu0, @gpu1, etc.). See Systemd below for full instructions.
  • macOS (launchd): Create a Launch Daemon plist pointing at the binary and your config.toml. The worker starts on boot and restarts on crash.
  • Windows (Service): Register with sc.exe create and set env vars with [Environment]::SetEnvironmentVariable. See Windows Service below for full instructions.

The setup wizard at /setup in the web UI walks through this interactively with copy-paste commands.

llamafile Integration

The extras/modelrelay-llamafile script is a self-contained CLI for downloading, running, and relaying llamafile models through ModelRelay. No dependencies beyond bash and curl.

# See what fits your hardware
./extras/modelrelay-llamafile recommend

# Browse models by category
./extras/modelrelay-llamafile list --tag reasoning

# Save your relay config once
./extras/modelrelay-llamafile config set proxy-url https://relay.example.com
./extras/modelrelay-llamafile config set worker-secret mysecret

# Now just serve — no flags needed
./extras/modelrelay-llamafile serve qwen3.5-4b

# Verify it works end-to-end
./extras/modelrelay-llamafile test qwen3.5-4b

# Manage running models
./extras/modelrelay-llamafile status
./extras/modelrelay-llamafile logs qwen3.5-4b -f
./extras/modelrelay-llamafile stop all

# Import your own llamafiles
./extras/modelrelay-llamafile import ./my-model.llamafile --slug my-model

# Refresh catalog when Mozilla publishes new models
./extras/modelrelay-llamafile update-catalog

Run ./extras/modelrelay-llamafile help for full usage, or ./extras/modelrelay-llamafile doctor to check system readiness.

Features

  • Cross-platform — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)
  • OpenAI + Anthropic compatible: POST /v1/chat/completions, POST /v1/responses, POST /v1/messages, GET /v1/models
  • No inbound ports on GPU boxes — workers connect out to the proxy over WebSocket
  • Request queueing — configurable depth and timeout when all workers are busy
  • Streaming pass-through — SSE chunks forwarded with preserved ordering and termination
  • End-to-end cancellation — client disconnect propagates through the proxy to the worker to the backend
  • Automatic requeue — if a worker dies mid-request, the request is requeued to another worker
  • Heartbeat and load tracking — stale workers are cleaned up; workers report current load
  • Graceful drain — workers can shut down while replacement workers pick up queued work
  • Model catalog refresh — workers can update their model list without reconnecting
  • Auth cooldown recovery — workers recover gracefully from authentication failures
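Several of these behaviors are visible from the client side. For example, the pooled model catalog can be inspected at any time; per the compatibility bullet above, GET /v1/models returns an OpenAI-style model list covering every connected worker, so you can watch it change as workers join, drain, or refresh their catalogs:

```shell
# List every model currently served by any connected worker
curl -s http://localhost:8080/v1/models
```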

Configuration

modelrelay-server

Flag Env var Default Description
--listen LISTEN_ADDR 127.0.0.1:8080 Address to listen on
--worker-secret WORKER_SECRET (required) Secret workers must present to authenticate
--provider PROVIDER_NAME local Provider name used for worker routing and request dispatch
--max-queue-len MAX_QUEUE_LEN 100 Maximum number of queued requests (0 = unlimited)
--queue-timeout QUEUE_TIMEOUT_SECS 30 Seconds before a queued request times out (0 = no timeout)
--request-timeout REQUEST_TIMEOUT_SECS 300 Seconds before an in-flight HTTP request times out (0 = no timeout)
--log-level LOG_LEVEL info Log level filter (e.g. info, debug, or modelrelay_server=debug). Overridden by RUST_LOG if set.
--admin-token MODELRELAY_ADMIN_TOKEN (none) Bearer token for /admin/* endpoints. If unset, admin endpoints return 403.
--require-api-keys MODELRELAY_REQUIRE_API_KEYS false When true, client inference requests must include a valid API key as Bearer token.
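As a sketch of how these flags combine, here is a server tuned for a busier deployment; the values are illustrative, not recommendations:

```shell
./modelrelay-server \
  --listen 0.0.0.0:8080 \
  --worker-secret mysecret \
  --max-queue-len 500 \
  --queue-timeout 60 \
  --request-timeout 600 \
  --log-level modelrelay_server=debug
```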

modelrelay-worker

Flag Env var Default Description
--proxy-url PROXY_URL http://127.0.0.1:8080 Base URL of the proxy server
--worker-secret WORKER_SECRET (required) Secret used to authenticate with the proxy
--backend-url BACKEND_URL http://127.0.0.1:8000 Base URL of the local model backend
--models MODELS default Comma-separated list of model names this worker supports
--provider PROVIDER_NAME local Provider name to register with on the proxy
--worker-name WORKER_NAME worker Human-readable name for this worker instance
--max-concurrency MAX_CONCURRENCY 1 Maximum number of concurrent requests this worker will handle
--log-level LOG_LEVEL info Log level filter (e.g. info, debug, or modelrelay_worker=debug). Overridden by RUST_LOG if set.

All flags can be passed as CLI arguments or set via the corresponding environment variable.
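For example, the worker invocation from the Quickstart can be expressed entirely through environment variables, which is convenient for systemd env files and Docker:

```shell
# Hostname is a placeholder; substitute your proxy's address
export PROXY_URL="http://proxy.example.com:8080"
export WORKER_SECRET=mysecret
export BACKEND_URL="http://127.0.0.1:8000"
export MODELS=llama3.2:3b,llama3.2:1b
./modelrelay-worker
```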

Admin API & Web Dashboard

ModelRelay includes built-in admin endpoints for monitoring and an embedded web dashboard for managing your deployment.

Admin API Endpoints

Method Path Auth Description
GET /health None Basic health check — returns version, worker count, queue depth, and uptime
GET /admin/workers Admin token List connected workers with models, load, and capabilities
GET /admin/stats Admin token Request counts, queue depth per provider
GET /admin/keys Admin token List client API key metadata (no secrets)
POST /admin/keys Admin token Create a new client API key — returns the secret once
DELETE /admin/keys/{id} Admin token Revoke a client API key

Admin Authentication

All /admin/* endpoints require a Bearer token matching MODELRELAY_ADMIN_TOKEN:

# Set the admin token when starting the server
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret

# Query admin endpoints
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/workers
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/stats

If MODELRELAY_ADMIN_TOKEN is not set, all admin endpoints return 403 Forbidden.

Client API Key Authentication

When MODELRELAY_REQUIRE_API_KEYS is set to true, clients must include a valid API key as a Bearer token on inference requests (/v1/chat/completions, /v1/messages, etc.). Without a valid key, requests are rejected with 401 Unauthorized.

# Start the server with API key auth enabled
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret --require-api-keys true

# Create a client API key (the secret is returned only once)
curl -X POST -H "Authorization: Bearer my-admin-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}' \
  http://localhost:8080/admin/keys

# Use the key for inference
curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}' \
  http://localhost:8080/v1/chat/completions

# Revoke a key
curl -X DELETE -H "Authorization: Bearer my-admin-secret" \
  http://localhost:8080/admin/keys/{key-id}

When MODELRELAY_REQUIRE_API_KEYS is false (the default), inference endpoints accept requests without any authentication.

Web Dashboard & Setup Wizard

The modelrelay-web crate provides an embedded web UI served by the proxy:

  • Dashboard at /dashboard — real-time view of connected workers, request metrics, and queue depth
  • Setup Wizard at /setup — step-by-step guide for connecting new workers (platform detection, backend configuration, worker binary download, and live connection verification)

The setup wizard is always accessible — not just on first run. Use it to add additional GPU boxes to your fleet at any time.

Production deployment

Docker Compose (multi-worker)

The included docker-compose.yml runs the proxy with two workers, health checks, restart policies, memory limits, and log rotation:

cp .env.example .env   # edit WORKER_SECRET and backend URLs
docker compose up -d

Add more workers by duplicating a worker service block and adjusting MODELS, BACKEND_URL, and WORKER_NAME.
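A third worker might look roughly like the following block. This is sketched from the options described above, not copied from the shipped file; the proxy service name (`server`) and host port are assumptions, so mirror the names actually used in docker-compose.yml:

```yaml
services:
  worker-3:
    image: ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
    restart: unless-stopped
    environment:
      PROXY_URL: http://server:8080        # assumes the proxy service is named `server`
      WORKER_SECRET: ${WORKER_SECRET}
      BACKEND_URL: http://host.docker.internal:8082
      MODELS: qwen2.5:7b
      WORKER_NAME: worker-3
```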

Systemd (bare metal / VM)

Service files live in extras/:

# Install binaries (from a release archive or cargo build --release)
sudo install -m 755 modelrelay-server modelrelay-worker /usr/local/bin/

# Create a service user
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay

# Proxy
sudo cp extras/modelrelay-server.service /etc/systemd/system/
sudo cp extras/proxy.env.example /etc/modelrelay/proxy.env
sudo vim /etc/modelrelay/proxy.env   # set WORKER_SECRET
sudo systemctl enable --now modelrelay-server

# Workers — the template unit lets you run multiple instances:
sudo cp extras/modelrelay-worker@.service /etc/systemd/system/
sudo cp extras/worker.env.example /etc/modelrelay/worker-gpu0.env
sudo vim /etc/modelrelay/worker-gpu0.env   # set PROXY_URL, BACKEND_URL, MODELS
sudo systemctl enable --now modelrelay-worker@gpu0

See extras/ for the full service files and annotated env examples.

Windows Service

ModelRelay ships Windows binaries that can run as native Windows Services using sc.exe. No third-party service wrappers required.

# Install the server as a service (run as Administrator)
sc.exe create ModelRelayServer binPath= "C:\ModelRelay\modelrelay-server.exe" start= auto

# Set environment variables for the service (system-wide, persists across reboots)
[Environment]::SetEnvironmentVariable("WORKER_SECRET", "your-secret-here", "Machine")
[Environment]::SetEnvironmentVariable("LISTEN_ADDR", "0.0.0.0:8080", "Machine")

# Start the service
Start-Service ModelRelayServer

# Install a worker service
sc.exe create ModelRelayWorker binPath= '"C:\ModelRelay\modelrelay-worker.exe" --models llama3-8b' start= auto
[Environment]::SetEnvironmentVariable("PROXY_URL", "http://your-proxy:8080", "Machine")
[Environment]::SetEnvironmentVariable("BACKEND_URL", "http://localhost:8000", "Machine")
Start-Service ModelRelayWorker

For fully annotated install scripts with error handling and uninstall support, see extras/install-windows-service.ps1 and extras/install-windows-service-worker.ps1. The service runs as LocalSystem by default; to use a dedicated account, set the service log-on via services.msc or pass obj= and password= to sc.exe create.

TLS

The proxy and workers communicate over plain HTTP/WebSocket by default. For production, terminate TLS at a reverse proxy like nginx. An annotated configuration is provided at examples/tls-nginx.conf — it handles HTTPS for client requests and wss:// WebSocket upgrades for workers, with streaming-friendly settings (buffering disabled, long timeouts).
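The shipped examples/tls-nginx.conf is the authoritative reference. As a rough sketch of the shape such a config takes (server name and certificate paths are placeholders), the essential pieces are the WebSocket upgrade headers for worker connections and the streaming-friendly proxy settings:

```nginx
server {
    listen 443 ssl;
    server_name relay.example.com;
    ssl_certificate     /etc/letsencrypt/live/relay.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/relay.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        # WebSocket upgrade so workers can connect over wss://
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Streaming-friendly: no buffering, long timeouts for SSE
        proxy_buffering off;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }
}
```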

Load Testing

A ready-made load test script lives at extras/load-test.sh. It uses hey if installed, falling back to wrk and finally to parallel curl loops:

./extras/load-test.sh -n 200 -c 20 -m llama3-8b

Shell Completions

Both modelrelay-server and modelrelay-worker can generate shell completion scripts via the hidden --completions flag:

# Bash
modelrelay-server --completions bash > ~/.local/share/bash-completion/completions/modelrelay-server
modelrelay-worker --completions bash > ~/.local/share/bash-completion/completions/modelrelay-worker

# Zsh (add the target directory to $fpath)
modelrelay-server --completions zsh > ~/.zfunc/_modelrelay-server
modelrelay-worker --completions zsh > ~/.zfunc/_modelrelay-worker

# Fish
modelrelay-server --completions fish > ~/.config/fish/completions/modelrelay-server.fish
modelrelay-worker --completions fish > ~/.config/fish/completions/modelrelay-worker.fish

Supported shells: bash, zsh, fish, powershell, elvish.

Documentation

Full documentation: ericflo.github.io/modelrelay

Validation

The behavior matrix is exercised at three layers: black-box contract harnesses in modelrelay-contract-tests, live HTTP integration tests in modelrelay-server, and end-to-end live backend tests in modelrelay-worker.

cargo fmt --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace

Contributing

Bug reports, feature requests, and PRs are welcome. See CONTRIBUTING.md for code style, test expectations, branch naming, and CI secrets.

To report a security vulnerability, follow the process in SECURITY.md.

License

MIT
