Tuned Docker images for running open LLMs on consumer GPUs. One command, maximum tok/s.
Foundry compiles llama.cpp from source for native Blackwell (sm_120a) and Ada (sm_89) GPU architectures, bundles per-GPU hardware profiles, and auto-detects your GPU at startup. No manual tuning required.
- Quick Start
- Models
- How It Works
- Configuration
- Multi-Agent Inference
- Running
- Monitoring
- Host Kernel Tuning
- Project Structure
- License
docker run --gpus all -p 8080:8080 \
-v ~/.cache/foundry:/models \
ghcr.io/infernet-org/foundry/qwen3.5-9b:latestThe first run downloads the model (~6 GB). Subsequent starts are instant.
Then use it like any OpenAI-compatible API:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-9b",
"messages": [{"role": "user", "content": "Hello!"}]
}'Works with any OpenAI-compatible client: Cursor, Continue, OpenCode, Open WebUI, CrewAI, AutoGen, etc. See AGENTS.md for detailed integration guides.
Hybrid Gated DeltaNet + Dense FFN. 9B total parameters, all active per token. Qwen3.5 generation with vision-language capability.
- 32 layers: 24 Gated DeltaNet (recurrent) + 8 full attention (GQA 16:4)
- All 9B parameters active per token (dense, compute-bound)
- Thinking mode by default (reasoning_content field)
- Quantization: UD-Q4_K_XL via Unsloth (Dynamic 2.0)
- Disk: ~5.66 GB | Min VRAM: 8 GB | Max context: 262K native (1M with YaRN)
| GPU | VRAM | Context | Decode | 4-concurrent | VRAM used |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | 262K/slot | ~177 tok/s | ~423 tok/s | 29.5 GB |
| Other NVIDIA (8 GB+) | 8+ GB | 32K/slot | varies | varies | varies |
RTX 5090 detailed benchmark
SINGLE-STREAM DECODE: ~177 tok/s (compute-bound, 94% SM utilization)
4-CONCURRENT AGGREGATE: ~423 tok/s
4-CONCURRENT PER-SLOT: ~106 tok/s each
PROMPT PROCESSING: ~1,688 tok/s
GPU UTILIZATION: 94% SM / 65% mem (single) | 100% SM / 63% mem (4-concurrent)
POWER DRAW: 312W single, 445W 4-concurrent
TEMPERATURE: 52-60C (under sustained load)
VRAM USAGE: 29.5 GB / 32.6 GB (2.6 GB headroom)
CONTEXT: 262K per slot (4 slots, 1M total)
Benchmarked 2026-03-02 with native sm_120a (Blackwell) compilation and BLACKWELL_NATIVE_FP4=1 enabled.
Why this replaces Qwen3.5-35B-A3B: Newer Qwen3.5 generation model that outperforms the 35B-A3B on every benchmark -- agent tasks (+37 TAU2-Bench), math (+20 HMMT), reasoning (+8 GPQA), instruction following (+13 IFBench). At 5.66 GB it uses 1/4 the VRAM, enabling full 262K context per slot (vs 48K for the 35B). Internal --parallel 4 batching provides 2.6x more throughput than running multiple instances (tested with eBPF telemetry: dense model is compute-bound at 94% SM utilization, not memory-bandwidth-bound).
Dense transformer. 36B total parameters, all active per token. ByteDance Seed-OSS-36B architecture.
- 64 transformer layers, standard attention (GQA 80:8)
- Quantization: Q4_K_M via NousResearch
- Disk: ~21.8 GB | Min VRAM: 24 GB | Max context: 512K native
| GPU | VRAM | Context | Decode | 4-concurrent | VRAM used |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | 32K | ~64 tok/s | ~132 tok/s | 27.8 GB |
| Other NVIDIA (24 GB+) | 24+ GB | 8K | varies | varies | varies |
Dense models activate all parameters per token, making them compute-bound rather than memory-bandwidth-bound. Expect ~3x slower decode than equivalently-sized MoE models on the same hardware.
Standard Mixture-of-Experts optimized for code generation. 30B total parameters, only 3B active per token.
- 48 transformer layers, standard attention (GQA 32:4)
- 128 experts per MoE layer, top-8 active per token
- Quantization: UD-Q4_K_XL via Unsloth (Dynamic 2.0)
- Disk: ~17.7 GB | Min VRAM: 16 GB (with expert offloading) | Max context: 262K native
- Built-in tool calling support via
--jinjachat template
| GPU | VRAM | Context | Decode | 3-concurrent | VRAM used |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | 64K/slot | ~275 tok/s | ~497 tok/s | 28.9 GB |
| Other NVIDIA (16 GB+) | 16+ GB | 16K/slot | varies | varies | varies |
RTX 5090 detailed benchmark
SINGLE-STREAM DECODE: ~275 tok/s
3-CONCURRENT AGGREGATE: ~497 tok/s (+81% via MoE expert batching)
3-CONCURRENT PER-SLOT: ~168 tok/s each
PROMPT PROCESSING: ~345-1,038 tok/s (varies with batch position)
VRAM USAGE: 28.9 GB / 32.6 GB (3.7 GB headroom)
CONTEXT: 64K per slot (3 slots, auto-fitted from 192K request)
Benchmarked 2026-03-02 with native sm_120a (Blackwell) compilation and BLACKWELL_NATIVE_FP4=1 enabled.
Why 3 slots (not 4)? With 3 slots, --fit on allocates 64K context per slot instead of 48K. Aggregate throughput is identical (497 vs 495 tok/s), but per-agent speed under load is 35% faster (168 vs 124 tok/s). The 4th slot rarely matters for a single-GPU workstation. Override with FOUNDRY_EXTRA_ARGS="--parallel 4" if needed.
vs Qwen3.5-9B: 52% faster single-stream, 18% faster aggregate. The standard MoE architecture (no DeltaNet recurrent layers) batches more efficiently on Blackwell. Trades the 262K context of Qwen3.5 for raw speed.
Why llama.cpp and not SGLang or vLLM? For consumer GPUs, llama.cpp's MoE expert offloading (--fit on) is the only engine that can run a 30B-parameter MoE model on a single 16-24 GB card at full speed. SGLang and vLLM require the entire model to fit in VRAM.
Qwen3-Coder-30B-A3B keeps attention layers on GPU while spilling inactive experts to CPU, which is why a 30B MoE runs faster than a 9B dense model on the same hardware (275 vs 177 tok/s).
On startup, Foundry:
- Detects your GPU via
nvidia-smi - Loads a tuned hardware profile with optimal settings
- Downloads the GGUF model if not already cached
- Launches
llama-serverwith the right arguments
Each profile tunes: context length, KV cache quantization, thread count, batch size, flash attention, thread priority, CPU affinity, and Prometheus metrics.
# Override auto-detection with a specific profile
docker run --gpus all -p 8080:8080 \
-v ~/.cache/foundry:/models \
-e FOUNDRY_PROFILE=rtx5090 \
ghcr.io/infernet-org/foundry/qwen3.5-9b:latestAvailable profiles (per model): rtx5090, default
The entrypoint automatically applies architecture-specific flags based on the FOUNDRY_ARCH environment variable baked into each image:
| Architecture | Flag | Reason |
|---|---|---|
MoE (moe) |
--fit on |
Spill inactive experts to CPU when VRAM is tight |
Dense (dense) |
(none) | No experts to offload |
Model-specific quirks (e.g. --swa-full for Qwen's hybrid attention, --cache-ram 0 for recurrent state) are set in profile EXTRA_ARGS, not in the architecture tier.
All settings can be overridden via environment variables:
| Variable | Default | Description |
|---|---|---|
FOUNDRY_PROFILE |
auto |
GPU profile (auto, rtx5090, default) |
FOUNDRY_PORT |
8080 |
Server port |
FOUNDRY_CTX_LENGTH |
Profile default | Context window size |
FOUNDRY_THREADS |
Profile default | CPU thread count |
FOUNDRY_EXTRA_ARGS |
(empty) | Additional llama-server arguments (highest priority) |
HF_TOKEN |
(empty) | Hugging Face token for authenticated downloads |
The RTX 5090 profiles are configured with multiple concurrent inference slots: --parallel 4 for Qwen3.5-9B and Hermes, --parallel 3 for Qwen3-Coder. This makes Foundry well-suited for multi-agent workflows where several AI agents share a single GPU.
Qwen3-Coder-30B-A3B uses a 128-expert MoE architecture with only 8 experts active per token. During single-stream decode, the GPU's tensor cores are largely idle -- the bottleneck is memory bandwidth, not compute. When multiple agents send concurrent requests, llama.cpp batches token generation across all active slots. Different tokens may route to different experts, and CUDA graphs capture the entire batched MoE operation, significantly improving GPU utilization.
Measured on RTX 5090:
| Active agents | Qwen3.5-9B (4 slots, dense) | Qwen3-Coder-30B-A3B (3 slots, MoE) |
|---|---|---|
| 1 | 177 tok/s | 275 tok/s |
| 2 | — | 405 tok/s (204 each) |
| 3 | — | 497 tok/s (168 each) |
| 4 | 423 tok/s (106 each) | — |
Single-agent speed is unaffected. Concurrent slots only activate when there are simultaneous requests.
With 2x RTX 5090, run two independent instances for double the concurrent slots and aggregate throughput:
# GPU 0
docker run --gpus '"device=0"' -p 8080:8080 -v ~/.cache/foundry:/models \
ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latest
# GPU 1
docker run --gpus '"device=1"' -p 8081:8080 -v ~/.cache/foundry:/models \
ghcr.io/infernet-org/foundry/qwen3-coder-30b-a3b:latestAny OpenAI-compatible agent framework works out of the box -- point it at http://localhost:8080/v1. See AGENTS.md for setup examples.
# Default: Qwen3.5-9B
docker compose up
# Choose a different model
FOUNDRY_MODEL=hermes-4.3-36b docker compose up
FOUNDRY_MODEL=qwen3-coder-30b-a3b docker compose up
# With explicit profile
FOUNDRY_PROFILE=rtx5090 docker compose up
# With monitoring stack (Prometheus + Grafana + GPU + eBPF metrics)
docker compose --profile monitoring upCreate a .env file for secrets and optional config:
HF_TOKEN=hf_your_token_here
GF_ADMIN_USER=admin
GF_ADMIN_PASSWORD=admin
make build # Build the default model image (qwen3.5-9b)
make build MODEL=hermes-4.3-36b # Build a different model
make build MODEL=qwen3-coder-30b-a3b # Build the coding-optimized model
make run # Run with auto-detected GPU
make test # Smoke test: start, wait for health, send one request
make benchmark # Run benchmark against a running server
make download # Download the GGUF model file to ~/.cache/foundrypython3 scripts/benchmark.py --url http://localhost:8080 --mode allModes: all, generation (single-stream decode), prompt (prompt processing), throughput (4-concurrent).
Foundry includes an optional observability stack activated via Docker Compose profiles.
docker compose --profile monitoring up| Component | Port | Source | Metrics |
|---|---|---|---|
| llama-server | 8080 | Built-in /metrics |
Decode tok/s, prompt tok/s, active slots, deferred requests, KV cache usage |
| Prometheus | 9090 | Scrapes all targets | Time-series storage, 30-day retention |
| Grafana | 3000 | Dashboards | Visualization (default: admin / admin) |
| nvidia-gpu-exporter | 9835 | nvidia-smi |
VRAM, GPU utilization, temperature, power, clocks, fan speed |
| node-exporter | 9100 | /proc, /sys |
CPU, RAM, disk, network, load average |
| cAdvisor | 8081 | Docker API | Per-container CPU, memory, network I/O |
| ebpf-exporter | 9435 | eBPF / kernel tracepoints | Block I/O latency histograms, scheduling latency, kernel-level metrics |
The eBPF exporter runs with privileged: true and pid: host to attach kernel tracepoints. It ships with Cloudflare's biolatency config by default, providing block I/O latency distributions useful for diagnosing model loading stalls and NVMe performance.
All dashboards are auto-provisioned on first start -- no manual import needed.
| Dashboard | Description |
|---|---|
| Foundry Inference | Custom: inference throughput gauges, slot utilization, GPU telemetry, host resources |
| Node Exporter Full | Host metrics (community dashboard #1860) |
| NVIDIA GPU | GPU monitoring (community dashboard #14574) |
| cAdvisor | Container resources (community dashboard #14282) |
┌─────────────────┐ ┌────────────────┐ ┌─────────┐
│ llama-server │────▶│ │ │ │
│ :8080/metrics │ │ │ │ │
├─────────────────┤ │ │ │ │
│ nvidia-gpu-exp │────▶│ Prometheus │────▶│ Grafana │
│ :9835 │ │ :9090 │ │ :3000 │
├─────────────────┤ │ │ │ │
│ node-exporter │────▶│ scrapes 15s │ │ │
│ :9100 │ │ 30d retention │ │ │
├─────────────────┤ │ │ │ │
│ cAdvisor │────▶│ │ │ │
│ :8081 │ │ │ │ │
├─────────────────┤ │ │ └─────────┘
│ ebpf-exporter │────▶│ │
│ :9435 │ └────────────────┘
└─────────────────┘
For maximum performance, run the host tuning script once on the Docker host:
sudo ./scripts/host-setup.shChanges are not persistent across reboots. The script prints instructions for making them permanent via /etc/sysctl.d/ and GRUB.
| Category | Parameter | Value | Purpose |
|---|---|---|---|
| Memory | vm.swappiness |
0 | Keep model weights strictly in RAM |
vm.overcommit_memory |
1 | Ensure mlock() succeeds for large models |
|
vm.nr_hugepages |
1280 | ~2.5 GB hugepages for reduced TLB misses | |
kernel.numa_balancing |
0 | Disable page migration jitter | |
| THP defrag | defer+madvise |
Prevent allocation stalls | |
vm.dirty_ratio |
80 | Reduce I/O contention during model load | |
| Network | TCP congestion | BBR | Smoother token streaming over WAN |
net.core.somaxconn |
4096 | Handle connection bursts | |
net.core.busy_read/poll |
50 us | Reduce NIC-to-CPU interrupt latency | |
| TCP fast open | Enabled | Faster connection setup | |
| Buffer sizes | 16 MB | Adequate for streaming responses | |
| I/O | NVMe scheduler | none |
NVMe handles its own queues |
| NVMe read-ahead | 4 MB | Fast sequential model loading | |
| PCIe | ASPM | Disabled | Prevent link sleep latency for MoE routing |
| CPU | Governor | performance |
Maximum clock speed, no frequency scaling |
| EPB | 0 | Maximum performance energy bias | |
| GPU | Persistence mode | Enabled | Avoid ~100-500 ms cold start latency |
For tighter tail latency, pin GPU interrupts to dedicated cores away from inference threads:
# Pin NVIDIA IRQs to cores 28-31 (adjust for your topology)
for irq in $(grep nvidia /proc/interrupts | awk '{print $1}' | tr -d ':'); do
echo 28-31 > /proc/irq/$irq/smp_affinity_list
doneThis reduced p99 latency jitter from ~5.8 tok/s spread to ~2.2 tok/s spread in our RTX 5090 testing. Average throughput is unchanged -- the benefit is consistency.
foundry/
├── models/
│ ├── qwen3.5-9b/
│ │ ├── Dockerfile # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
│ │ ├── entrypoint.sh # Copied from scripts/entrypoint.sh at build time
│ │ └── profiles/
│ │ ├── rtx5090.sh # 1M ctx, 4 slots, ~423 tok/s aggregate, 262K/slot
│ │ └── default.sh # 32K ctx, 8 GB minimum
│ ├── qwen3.5-35b-a3b/ # Legacy: still available, superseded by qwen3.5-9b
│ │ ├── Dockerfile
│ │ ├── entrypoint.sh
│ │ └── profiles/
│ ├── hermes-4.3-36b/
│ │ ├── Dockerfile # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
│ │ ├── entrypoint.sh # Copied from scripts/entrypoint.sh at build time
│ │ └── profiles/
│ │ ├── rtx5090.sh # 32K ctx, 4 slots, ~132 tok/s aggregate
│ │ └── default.sh # 8K ctx, 24 GB minimum
│ └── qwen3-coder-30b-a3b/
│ ├── Dockerfile # Multi-stage: compiles llama.cpp for sm_89 + sm_120a
│ ├── entrypoint.sh # Copied from scripts/entrypoint.sh at build time
│ └── profiles/
│ ├── rtx5090.sh # 192K ctx, 3 slots, ~497 tok/s aggregate
│ └── default.sh # 32K ctx, conservative settings
├── scripts/
│ ├── entrypoint.sh # Shared entrypoint (GPU detect, profile load, model download)
│ ├── benchmark.py # Generation speed, prompt processing, throughput
│ ├── optimize_5090.py # Multi-config A/B testing harness
│ ├── download-model.sh # Download GGUF outside Docker
│ └── host-setup.sh # Linux kernel tuning for inference
├── monitoring/
│ ├── prometheus/prometheus.yml # Scrape config (llama-server, GPU, node, cAdvisor, eBPF)
│ └── grafana/
│ ├── dashboards/ # 4 pre-provisioned dashboards (JSON)
│ └── provisioning/ # Datasource and dashboard auto-provisioning
├── docker-compose.yml # Inference + monitoring stack (with eBPF exporter)
├── Makefile # build, run, test, benchmark, download
├── AGENTS.md # AI agent integration guide
└── .github/workflows/
├── build.yml # CI: build and push Docker images to GHCR
└── lint.yml # CI: ruff (Python) + shellcheck (Bash)
Apache-2.0