Self-host your AI coding assistant. Like ChatGPT or Claude — but the AI runs on a server you control, your prompts never get used to train anyone's model, and it costs cents per session instead of $20/month.
New to this? Jump to 📚 If you've never used a terminal for a step-by-step walkthrough. Most readers below are developers — that section is for everyone else.
Most "self-host your AI" guides assume you're a developer. This section is for everyone else. No coding experience required. You'll spend ~30 minutes setting it up once, then never think about it again.
- A chat window in your browser where you talk to an AI coding assistant (like ChatGPT, but private)
- Your code and conversations never leave your computer or the AI provider — no training data harvest, no logs
- Pennies per session instead of $20/month — Groq charges per-use, with $0 idle
- Works on Mac, Windows, or Linux
| What | Where | Cost |
|---|---|---|
| Docker Desktop (the engine that runs everything) | https://docs.docker.com/desktop/ — pick your OS | Free |
| A Groq account (provides the AI) | https://console.groq.com/ | Free signup, pennies per use |
| About 30 minutes | — | Yours |
You will not need to:
- Know what an API, container, or proxy is
- Write or edit any code
- Run commands manually (after the one-time setup)
Go to the link in the table above, download the installer for your OS, run it. After installing, open Docker Desktop and wait until the menu-bar/tray icon shows "Docker Desktop is running." (You only do this once.)
- Go to https://console.groq.com/ — sign up with Google or email.
- Click API Keys in the left sidebar → Create API Key → copy the key (it starts with
gsk_...). Save it somewhere safe — you'll need it in a moment.
- Go to https://console.groq.com/settings/data-controls
- Toggle Zero Data Retention to ON.
- This stops Groq from keeping any record of your prompts. You can verify it stuck by reloading the page.
- At the top of this page, click the green
<> Codebutton → Download ZIP. - Unzip the file. You'll have a folder called
zdr-coder-main. Move it somewhere you'll find it again (Documents, Desktop, anywhere). - Open the folder. Find the file called
.env.example— make a copy and rename the copy to.env(yes, just.env— no.example). - Open
.envin TextEdit (Mac) or Notepad (Windows). Find the line that saysGROQ_API_KEY=and paste your Groq key from step 2 right after the=. Save and close.
- Mac: Double-click
start.commandin the project folder. (If macOS warns about an "unidentified developer," right-click → Open → Open.) - Windows: Double-click
start.batin the project folder.
A Terminal/Command Prompt window will appear and show progress. After a minute or two it will say "Stack is running" and your browser will open to http://localhost:3000 — that's the AI chat interface (OpenHands).
- In OpenHands, you'll see a chat box. Drop your project folder into the workspace pane (or just chat without a folder for general questions).
- Type what you want the AI to do — "build me a simple todo app", "explain this code", "fix the bug where X happens", etc.
- The AI will plan, edit files, and run commands inside its own sandbox. You watch and approve.
Double-click stop.command (Mac) or stop.bat (Windows). This stops the AI and the proxy. Your API key and settings are preserved for next time.
To start again later: just double-click start.command again.
| Symptom | Fix |
|---|---|
start.command says "Docker Desktop did not start" |
Open Docker Desktop manually from Applications, wait for the icon to say "running," then try start.command again. |
start.command says "Missing Groq API key" |
It will open the .env file for you. Paste your key after GROQ_API_KEY= and save. |
| Browser shows "This site can't be reached" | Wait another 30 seconds — first launch is slow. Refresh the page. If still broken, double-click stop.command then start.command again. |
| Bills look higher than expected | You may have left a self-hosted GPU pod running. Double-click stop.command to stop everything. Groq API mode by itself costs $0 when idle. |
| Anything else | Take a screenshot and ask the friend who pointed you here. |
Using the default setup (Groq API mode), your costs are roughly:
| What you're doing | Approximate cost |
|---|---|
| Stack sitting idle (you're not chatting) | $0 |
| 1 hour of active AI coding work | ~$0.20 |
| 8 hours/day, every weekday, for a month | ~$30/month |
| Compare: ChatGPT Plus | $20/month |
| Compare: Claude Pro | $20/month |
So at most workloads, this is cheaper than ChatGPT Plus or Claude Pro while giving you actual zero-data-retention.
Who it's for: anyone who wants AI coding assistance with comparable performance to Claude (Anthropic) but with real ZDR + data privacy you can verify — including teams under HIPAA, SOC 2, or IP-sensitive workloads where "the model provider promises to be nice" isn't sufficient.
The thesis: open-weights models (GPT-OSS 120B, DeepSeek V4 Flash, Kimi K2.6) are now close enough to Claude Haiku / Sonnet / Opus that for most work you don't need Claude.AI. What you do need: a way to run them privately with contractual or technical ZDR. That's this repo.
What it gives you: one local proxy on http://localhost:4000, two privacy-preserving inference paths behind it (API + self-hosted), and tier aliases (haiku-api, sonnet-api, sonnet-vast, opus-vast) so your client config doesn't change when you swap underlying models. Drop-in for Aider, Cline, Roo Code, OpenHands.
Why now: ChatGPT Plus, Claude Pro, and Gemini Advanced all train on your input by default (or by tiny opt-in toggle), and none of the three are HIPAA-BAA-eligible. Your $20/mo buys faster models, not contractual privacy. This repo gives you contractual ZDR (Groq Cloud, with a self-serve toggle) or physical ZDR (your own pod on a Tier 3-4 datacenter) for less than $1/hr active and $0 idle. Total time to first request: 5 minutes for API mode, 15 minutes for a fresh pod.
Honest about what it doesn't do: no cryptographic E2E (provider still sees plaintext during inference — that's a Level 5 problem requiring TEE attestation), no FedRAMP / HITRUST of the rental platform itself (their datacenter partners have it transitively). COMPLIANCE.md documents every gap with verbatim citations.
There are six levels of "how private is my AI." This repo gives you ⭐ levels 3 and 6 — the rest are listed so you can see where you'd otherwise land. Every claim below is sourced from the provider's own legal docs (links + verbatim quotes in COMPLIANCE.md).
| Level | What it is | Cost | Provider sees your prompts? | Trained on by default? | HIPAA BAA? | Compliance certs | Good for |
|---|---|---|---|---|---|---|---|
| 1. Lowest | Free consumer chat — chatgpt.com, claude.ai free, gemini.google.com | $0 | Yes, plaintext, sampled humans may read | ChatGPT & Gemini: yes. Claude: opt-in only. | ❌ never | ❌ free tier excluded | Throwaway questions |
| 2. Moderate | $20/mo consumer subs — ChatGPT Plus, Claude Pro, Gemini Advanced | ~$20/mo | Yes, plaintext, sampled humans may read | ChatGPT Plus & Gemini Advanced: yes. Claude Pro: opt-in (toggle in settings). | ❌ Plus / Pro / Advanced explicitly ineligible | ❌ consumer tier excluded | Personal coding, nothing sensitive |
3. High ⭐ this repo: *-api routes |
Developer APIs with ZDR option — Groq, OpenAI API, Anthropic API, DeepInfra | $0.13–$4.50/hr active, $0 idle | Yes, plaintext, no human review under contract | ❌ contractually no | ✅ on request | SOC 2 Type 2, ISO 27001 (Groq) | Most use. Sensible default. |
| 4. Very high | Enterprise cloud LLM APIs — AWS Bedrock (Anthropic Claude via Bedrock), Azure OpenAI Foundry, GCP Vertex AI | $3–$15 / million tokens (consumption-only, no contract minimum) | Yes, plaintext, cloud-vendor enforced no-access | ❌ contractually no | ✅ standard (Bedrock + Azure + GCP all have BAA in their default agreements) | SOC 2 Type 2, ISO 27001/27017/27018, FedRAMP Moderate (commercial) / High (GovCloud / Assured Workloads), HITRUST CSF | Regulated industries with audit obligations — HIPAA + FedRAMP/HITRUST required |
| 5. Maximum | TEE-attested confidential inference — Tinfoil, GCP H100 CC mode | $5–$50/hr active or per-token | ❌ cryptographically blind — hardware-attested | ❌ enforced by hardware | ✅ via Tinfoil | Tinfoil: SOC 2 only. GCP: full stack | National security, ultra-paranoid PHI |
6. Own everything ⭐ this repo: *-vast, *-serverless routes |
Self-host open-weights on rented GPU. Three sub-tiers by compliance ceiling: (a) Vast Secure Cloud / RunPod Secure Cloud — cheapest BAA-eligible; (b) AWS EC2 p5e/p5en (8× H200) — FedRAMP/HITRUST inheritance, much pricier; (c) Azure/GCP equivalents — similar to AWS | (a) $0.40–$15/hr Vast/RunPod pods, $0 idle serverless; (b) $39.80–$52.02/hr AWS p5e/p5en | Only the datacenter host operator's root user; contractually prohibited from introspecting (RunPod explicit, Vast implicit; AWS strongest) | ❌ you control the model weights | ✅ via datacenter operator BAA (all three) | (a) SOC 2 Type 2 platform, ISO 27001 via DC partners; (b) full AWS stack — SOC 2, ISO 27001, FedRAMP, HITRUST inherited | Long sessions, full audit trail, no managed-model provider in the path. AWS sub-tier when FedRAMP/HITRUST required. |
A few non-obvious things from the research:
- Paid consumer ≠ private. ChatGPT Plus and Gemini Advanced default to using your chats for training. Claude Pro defaults to opt-in (same as free Claude). None of Plus / Pro / Advanced is HIPAA-BAA-eligible. The $20 buys you faster models and higher rate limits, not contractual privacy.
- Free Claude is more private than free ChatGPT. Claude requires opt-in for training; ChatGPT and Gemini opt you in by default. None are BAA-eligible.
- AWS Nitro Enclaves can't run sonnet-class models. Nitro Enclaves have no GPU. The "confidential AI on AWS" marketing requires GovCloud Provisioned Throughput, not Enclaves.
- Vast/RunPod compliance is split: the rental platform holds SOC 2 Type 2, but ISO 27001 belongs to their datacenter partners, not the platform itself.
- Groq's ZDR is the strongest "Level 3" story because the toggle is self-serve in every account — most competitors gate ZDR behind enterprise contracts.
For each cell with verbatim provider quotes + URLs: see COMPLIANCE.md.
| If you want… | Run this | Cost shape |
|---|---|---|
| Level 3 (API + ZDR), fastest setup | ./scripts/api-up.sh |
$0.13–$0.56/hr active, $0 idle |
| Level 6 always-on pod, cheapest | ./scripts/deploy-vast.sh haiku (or sonnet / opus) |
$0.40–$15/hr while up |
| Level 6 scale-to-zero, private | ./scripts/deploy-serverless.sh haiku (or sonnet) |
$0 idle, ~$0.50–$6/hr active |
| Stop everything | ./scripts/destroy.sh all |
$0 |
All three use the same local endpoint (http://localhost:4000). Switch between them in Cline by changing the Model ID field — no restart.
# 1. Install prereqs once (Docker, jq, openssl, VSCodium, Cline extension)
./scripts/install-prereqs.sh # macOS / Ubuntu / Debian
# or .\scripts\install-prereqs.ps1 # Windows (PowerShell as admin)
# 2. Set up your API keys
cp .env.example .env
$EDITOR .env # set GROQ_API_KEY and/or VAST_API_KEY
# 3. Pick a path and run it
./scripts/api-up.sh # Level 3 — easiest, ZDR via Groq
# OR
./scripts/deploy-vast.sh haiku # Level 6 — your own GPU podLiteLLM is now serving an OpenAI-compatible endpoint on http://localhost:4000/v1. Point any agentic coding tool at it:
OpenHands (browser-based — best for non-developers, or anyone who wants a visual workspace):
./scripts/openhands-up.sh # starts at http://localhost:3000
./scripts/openhands-down.sh # stopBrowser UI with file workspace, sandboxed shell, plan/act loops. Backed by the same LiteLLM proxy. The non-developer start.command / start.bat launchers wrap this path end-to-end (Docker → LiteLLM → OpenHands → browser).
Aider (terminal, recommended for developers — no extension fragility):
./scripts/aider-up.sh # one-time install (pipx install aider-chat)
./scripts/aider.sh # launch — defaults to sonnet-apiSlash commands inside Aider: /add <file>, /run <cmd>, /commit, /undo, /help. Works flawlessly over SSH/tmux. To switch tier mid-session: exit and ZDR_MODEL=haiku-api ./scripts/aider.sh.
Cline / Roo Code (VSCode/VSCodium extension):
VSCodium → Cline (or Roo) → gear icon:
- API Provider: OpenAI Compatible
- Base URL:
http://localhost:4000/v1 - API Key: contents of
.litellm-key(auto-generated;cat .litellm-key) - Model ID:
sonnet-api(orsonnet-vast/haiku/ etc — see below)
Cline is more autonomous; Aider is more controllable. Pick whichever fits your workflow. For SSH-Remote dev specifically, Aider avoids a class of extension-host networking bugs — see Using Cline from a remote SSH host below.
Done. Start coding.
| Model ID | What runs | Where | Tier mapping | Notes |
|---|---|---|---|---|
haiku-api |
GPT-OSS 20B (OpenAI open-weight, coding-tuned) | Groq Cloud | Level 3, ~Haiku-class | $0.10/hr active, $0 idle |
sonnet-api |
GPT-OSS 120B (OpenAI open-weight reasoning) | Groq Cloud | Level 3, ~Sonnet-ish | $0.20/hr active, $0 idle |
haiku-vast |
Qwen2.5-Coder-32B-AWQ (18GB INT4) | Vast Secure Cloud pod | Level 6, ~Haiku-class | $0.40–0.67/hr while up |
sonnet-vast |
DeepSeek V4 Flash (158B FP8, 149GB weights) | Vast Secure Cloud pod | Level 6, ~Sonnet-class | $5.87/hr while up |
opus-vast |
Kimi K2.6 (1T params, 554GB weights) — Opus-class on most benchmarks | Vast Secure Cloud pod | Level 6, ~Opus-class | $7.74–$15/hr while up |
haiku-serverless |
Qwen2.5-Coder-32B-AWQ | RunPod Serverless | Level 6, scale-to-zero | $0 idle, ~$0.50/hr active |
sonnet-serverless |
DeepSeek V4 Flash | RunPod Serverless | Level 6, scale-to-zero | $0 idle, ~$6/hr active (H200, capacity-dependent) |
haiku / sonnet / opus |
Same as -vast |
RunPod always-on pod | Level 6, alt provider | Usually more expensive than Vast |
Switching is instant — just change the field in Aider (ZDR_MODEL=...) or Cline (Model ID).
The aliases (haiku-api, sonnet-api, opus-vast) are aspirational labels mapped to the best ZDR-eligible open-weights option for that compute shape. As of May 2026:
| Want | ZDR-eligible route in this repo | Real cost | Honest performance vs Claude |
|---|---|---|---|
| Haiku-class, cheapest | haiku-api (GPT-OSS 20B on Groq) |
$0.10/hr active, $0 idle | Comparable to Haiku for simple edits; coding-tuned |
| Sonnet-ish, scale-to-zero | sonnet-api (GPT-OSS 120B on Groq) |
$0.20/hr active, $0 idle | Between Sonnet 3.5 and Sonnet 4 for agentic loops; weaker on long-context multi-file work |
| Sonnet-class, self-hosted | sonnet-vast (DeepSeek V4 Flash on Vast 4× H100) |
$5.87/hr while up | Solid Sonnet-class; FP8 weights, 65K context |
| Opus-class (best open option) | opus-vast (Kimi K2.6 on Vast 8× H100 or 4× H200) |
$7.74–$11.74/hr while up | 88.7% SWE-Bench vs Opus 4.7's 87.6% — Kimi K2.6 wins. Trade Opus's edge on GPQA / long-context / tool orchestration for the strongest SWE-Bench open-weights score. |
There is no Groq-API equivalent for Opus-tier — Groq's production catalog tops out at GPT-OSS 120B, which sits between Sonnet 3.5 and Sonnet 4 on capability. Opus-class under ZDR means self-hosting (opus-vast → Kimi K2.6). We deliberately keep the API path to a single provider key (Groq) to minimize setup friction — adding a second API provider would mean another account and another key for marginal upside, since opus-vast already gets you Opus-class self-hosted.
ZDR-first framing: this repo prioritizes the contractual + technical privacy posture over hitting Claude's exact quality on every workload. The model choices are the best ZDR-eligible options, not necessarily the absolute best model. Specific caveats:
- Long-horizon tool-use consistency: Claude Opus 4.7 still leads on 20+ tool-call agentic loops. Open-weights models including Kimi K2.6 drift more in long sessions. Mitigation: shorter task scope per
aidersession, explicit/clearbetween unrelated tasks. - Aider edit-block format compliance: older Qwen models drop the diff format ~5–10% of the time on multi-file refactors. GPT-OSS 120B and Kimi K2.6 are noticeably more reliable. If a model misbehaves, try
--edit-format wholeor--edit-format udiff. - GPQA / scientific reasoning: Opus 4.7 leads (~94.2%). No open-weights model matches it yet on this specifically.
- MCP-Atlas / structured tool orchestration: Opus 4.7 leads. Cline/Roo's tool-call layer may produce more retries on open-weights models — not a model defect per se, just edge cases.
- Adversarial-prompt robustness: Closed frontier models have stronger safety/jailbreak resistance. Not relevant for solo coding work, important if you're exposing the proxy to untrusted inputs.
- Context length in practice: Groq's GPT-OSS 120B is 131K context, K2.6 self-hosted is 256K, but quality degrades meaningfully past ~50K input on all open models. Keep contexts tight.
- Cost vs quality crossover: Below ~2 hrs/day of opus-tier use, Anthropic Opus API at $30/hr is cheaper than running
opus-vastat $7.74/hr × 24. Self-hosted opus wins on continuous use + ZDR mandate, not just "I want a frontier model occasionally."
| Alias | Groq model | Tok/s | $/M in | $/M out |
|---|---|---|---|---|
haiku-api |
GPT-OSS 20B | 1000 | $0.075 | $0.30 |
sonnet-api |
GPT-OSS 120B | 500 | $0.15 | $0.60 |
Other models exist in Groq's production catalog (Llama 3.1 8B, Llama 3.3 70B, etc.) but the GPT-OSS line is coding-tuned and meaningfully cheaper, so we standardize on it. DeepSeek V4 and Kimi K2.6 are not on Groq production — for Opus-tier under ZDR, self-host via Level 6 (opus-vast).
This repo's Level 3 + Level 6 paths together cover:
- ✅ Zero data retention (Groq self-serve toggle; Vast/RunPod by container ownership)
- ✅ No training on your data (contractual on Groq; physical on self-hosted)
- ✅ HIPAA BAA available (all three providers — see COMPLIANCE.md for request process)
- ✅ SOC 2 Type 2 (Groq Inc, Vast Inc, RunPod Inc as of Oct 2025)
- ✅ Encryption in transit (TLS to provider edge)
- ✅ US data residency (default on all three)
- ✅ No third-party model provider in the inference path (Level 6)
What this repo does not give you out of the box:
- ❌ Cryptographic end-to-end (provider still sees plaintext during inference — Level 5 only)
- ❌ FedRAMP / HITRUST (Level 4 cloud APIs; or self-certify on Level 6 self-hosted)
- ❌ EU data residency (US-default; pick
*-vastwithgeolocation=EUto override) - ❌ Side-channel resistance on multi-tenant GPUs
COMPLIANCE.md has the full mapping with verbatim quotes from each provider's binding legal docs, plus the 7-step checklist for maintaining max-ZDR posture on Groq.
flowchart LR
subgraph laptop["Your laptop"]
Cline["VSCodium + Cline"]
LiteLLM["LiteLLM proxy<br/>Docker, :4000"]
Cline -->|"localhost:4000"| LiteLLM
end
LiteLLM -.->|"TLS + bearer token"| Edge
subgraph providers["Inference (pick one or many)"]
Edge["Provider edge"]
Edge --> API["Level 3 — Groq Cloud<br/>scale-to-zero, ZDR toggled"]
Edge --> Vast["Level 6 — Vast.ai Secure Cloud<br/>always-on pod"]
Edge --> RP["Level 6 — RunPod Secure Cloud<br/>pod or serverless"]
end
- LiteLLM is the local OpenAI-compatible proxy. Routes per-model-ID aliases, holds the master API key, injects per-route bearer tokens. Bound to
127.0.0.1only — never exposed. - Groq path is direct API. ZDR toggle in console gates retention.
- Vast / RunPod paths spin up a pod running
gpu-node/Dockerfile(vLLM + your chosen model). LiteLLM connects via TLS-terminated proxy URL + bearer token. - Cline = the agentic coding extension in VSCodium. Talks to LiteLLM on localhost.
No mesh VPN — provider-managed transport (TLS) + bearer tokens is the same E2E envelope, simpler to operate.
| API (Level 3, Groq) | Pod (Level 6, always-on) | Serverless (Level 6, scale-to-zero) | |
|---|---|---|---|
| Idle cost | $0 | $0.40–$15/hr | $0 |
| Active cost | per-token (~$0.13–$0.56/hr equivalent) | included in hourly | per-second of worker uptime |
| First request | 100ms | instant (host warm) | ~3–5 min cold-start (sometimes longer for sonnet) |
| Capacity risk | Groq has plenty | thin on 80GB+ for sonnet/opus | thin on H200 for sonnet |
| Privacy | contractual ZDR | physical (your container) | physical (your container) |
| Best for | most use — bursty or continuous | 4+ hrs/day on one tier | bursty but private |
Rule of thumb: <2 hrs/day → Level 3 API. >4 hrs/day → Level 6 pod. In between → Level 6 serverless.
Always-on pod pricing for each tier (from current available on-demand offers):
| Tier | RunPod Secure $/hr | Vast Secure Cloud $/hr | AWS EC2 $/hr | Notes |
|---|---|---|---|---|
| haiku (1× RTX 4090 24GB) | $0.69 | $0.40–0.67 | n/a — AWS doesn't rent 4090s | Vast cheapest when Iceland host rentable |
| sonnet (4× A100 / 4× H100 80GB) | $5.96 (often sold out) | $4.27 (A100) or $5.87 (H100 SXM) | $32–40 (p4de / p5) | Vast supply thin, 1–2 hosts at a time; AWS p5 even thinner |
| opus (8× H100 SXM 80GB) | $23.92 (often sold out) | $11.74 | $50–98 (p5.48xlarge) | France datacenter when listed |
| opus (alt) 4× H200 140GB | — | $7.74 | — | 560 GiB > Kimi K2.6's 554 GiB weights |
| opus (frontier — DeepSeek V4 Pro) | — | — | $39.80–$52.02 (p5e.48xlarge / p5en.48xlarge 8× H200) | See "DeepSeek V4 Pro" section below |
Versus going Anthropic-direct (no self-hosting): ~$30/hr for Opus-class agentic-coding workload. Crossover for opus is ~1.5 hrs/day before self-hosted wins on cost.
For Opus-class self-hosted: Kimi K2.6 (Moonshot, April-May 2026) is the strongest open-weights option and is what opus-vast runs. Honest benchmark picture — different sources report different numbers:
| Model | SWE-Bench Verified (range) | Intelligence Index (AA) | Context |
|---|---|---|---|
| Claude Opus 4.7 | 87.6% | 57 | 200K |
| Claude Opus 4.6 | ~85% | ~55 | 200K |
| Kimi K2.6 (this repo) | 80.2%–88.7% (source-dependent) | 54 | 256K |
| DeepSeek V4 Pro | 80.6%–83.7% | 52 | 1M |
The honest framing: Kimi K2.6 is comparable to Opus 4.6/4.7 — not strictly better, not meaningfully worse for typical work. The Intelligence Index gap is 3 points (57 vs 54) which translates to: occasionally noticeable on multi-step nuanced reasoning, invisible on most everyday tasks.
Where Opus still pulls ahead:
- Multi-step nuanced reasoning where each step builds on the last
- Long-horizon agentic loops (20+ tool calls without drift)
- GPQA Diamond / scientific reasoning (~94% vs ~82%)
- Adversarial prompt robustness (less relevant for solo coding)
Where Kimi K2.6 matches or wins:
- General chat, Q&A, code review, single-task agentic work
- Multilingual coding
- 256K context (vs Opus 200K)
- Cost: 5–6× cheaper than Anthropic Opus API when self-hosted at typical workloads
- You can audit it: open weights, your container, no third-party model provider seeing prompts
The point of this repo isn't that Kimi K2.6 beats Opus 4.7. The point is that for ZDR + privacy + audit, you get comparable Opus-class performance from a model you fully control. Claude.AI users move to this not because the open-weights model is better, but because they need contractual ZDR + their own data sovereignty.
| Model | Vendor | License | Notes for this repo |
|---|---|---|---|
| Kimi K2.6 | Moonshot | open weights | Opus-class — wired as opus-vast |
| DeepSeek V4 Pro | DeepSeek | open weights | Opus-class on benchmarks, similar to Kimi K2.6. Not wired — requires 8× H200 minimum (864 GB weights). |
| DeepSeek V4 Flash | DeepSeek | open weights | Already wired as sonnet-vast — sonnet-class FP8 |
| GLM-5.1 | Z.ai | open weights | Newer, similar tier to DeepSeek V4 Flash |
| Qwen 3.6 | Alibaba | Apache 2.0 | Strong on broad benchmarks; not yet wired |
| MiMo-V2.5-Pro | Xiaomi | open weights | Strong reasoning |
| MiniMax M2.7 | MiniMax | open weights | Recent open-source |
| Gemma 4 | open weights | Smaller — haiku-tier | |
| Ring-2.6-1T | Ant Group (inclusionAI) | open weights | Large MoE, 1T params |
DeepSeek V4 Pro is one viable open-weights opus-class model — 1.6T params MoE, ~49B active per token, 80.6% on SWE-bench (lower than Kimi K2.6's 88.7%, but ahead of most). Weights ~864 GB. Doesn't fit on 8× H100 80GB (640 GB total < 864 GB); minimum viable host is 8× H200 141GB = 1,128 GB single node, or 16× H100 across 2 nodes with NVLink+InfiniBand.
AWS EC2 instances that fit it (verified May 2026):
| Instance | Spec | US East (Ohio) $/hr | US West (N. California) $/hr | Availability |
|---|---|---|---|---|
| p5e.48xlarge | 8× H200, Sapphire Rapids CPU, 1,128 GiB HBM3e | $39.80 (was $34.61 pre-Jan 2026) | $49.75 | Tight — AWS hiked 15% in Jan due to GPU demand |
| p5en.48xlarge | 8× H200 + Gen5 PCIe (faster CPU↔GPU) | ~$42 estimated Ohio | $52.02 | Tighter than p5e |
Honest realities for this path:
- ~$40/hr in Ohio is the floor for DeepSeek V4 Pro on AWS — and US East has the best supply.
- Capacity is generally not on-demand — you typically use EC2 Capacity Blocks for ML (pre-book 1–6 month windows in cluster sizes 1–64 instances). On-demand p5e/p5en availability is genuinely scarce in May 2026.
- Compliance inheritance is the reason to pick AWS over Vast — AWS Bedrock-tier BAA + SOC 1/2/3 + ISO 27001/27017/27018 + FedRAMP Moderate (commercial) / High (GovCloud) + HITRUST CSF all inherit transitively to the EC2 instance you run vLLM on. Vast/RunPod can't match that audit story.
- Cost vs Anthropic Opus API: Anthropic Opus 4.7 ≈ $30/hr typical agentic load. AWS DeepSeek V4 Pro ≈ $40/hr. The self-hosting math does not work for opus-tier on AWS at current prices — Vast at $7.74/hr (4× H200) is 5× cheaper if your compliance bar is BAA-only rather than FedRAMP/HITRUST.
- Capacity Block reservations lock you into 1–6 months at a fixed rate. If your usage is <40 hrs/month, on-demand Vast wins on flexibility even at higher hourly.
When AWS Level 6 actually makes sense:
- You need FedRAMP / HITRUST inheritance on the inference path itself (Vast/RunPod can't give you this).
- You have predictable continuous workload (>4 hrs/day, 5 days/week) to amortize a Capacity Block reservation.
- Your data-governance team requires AWS-tier vendor risk management — not just BAA paper.
For most users, opus-vast on Vast.ai 4× H200 at $7.74/hr remains the right call. AWS H200 is the answer only when the compliance ceiling demands it.
DeepSeek V4 Pro vs Claude Opus 4.7: Opus is still better at long-context coherence and tool-use consistency. V4 Pro is closer on raw reasoning benchmarks. For agentic coding specifically, Opus still edges it — but the gap is small enough that self-hosting V4 Pro is a real choice if compliance forces it.
You only need to set up the providers you'll actually use.
Vast.ai — cheapest Level 6 path
- Sign up at https://cloud.vast.ai/
- Account → Create API Key → Advanced tab
- Permissions: Instances = Read+Write, everything else minimal, 2FA off (programmatic key)
- Copy →
.envasVAST_API_KEY=...
RunPod — only provider with serverless wired today
- https://console.runpod.io/user/settings → API Keys → Create
- Permissions: All scope (Restricted returns 403 on serverless
/openai/v1) - Add credit, copy →
.envasRUNPOD_API_KEY=...
Groq Cloud — Level 3
- Sign up at https://console.groq.com/
- Enable ZDR before first request: https://console.groq.com/settings/data-controls
- Create API key →
.envasGROQ_API_KEY=... - (HIPAA) email security@groq.com requesting a counter-signed BAA — see COMPLIANCE.md
# Level 3 — API mode
./scripts/api-up.sh # bring up LiteLLM with -api routes
./scripts/destroy.sh api # stop LiteLLM, keep keys
# Level 6 — Vast pods (recommended)
./scripts/deploy-vast.sh haiku # 1× RTX 4090, ~$0.40–0.67/hr
./scripts/deploy-vast.sh sonnet # 4× H100 80GB, ~$5.87/hr
./scripts/deploy-vast.sh opus # 8× H100 80GB, ~$11.74/hr
# Level 6 — RunPod alternatives
./scripts/deploy.sh haiku # always-on pod
./scripts/deploy-serverless.sh haiku # scale-to-zero
./scripts/deploy-serverless.sh sonnet # scale-to-zero (H200, capacity-dependent)
# Teardown
./scripts/destroy.sh haiku-vast # one tier
./scripts/destroy.sh all # everything across all providersPod termination stops billing within ~1 min. Serverless idle is already $0 (workersMin=0); teardown removes the endpoint + template.
Parallel cold-start, ~15-20 min wall time vs serial:
./scripts/deploy-vast.sh haiku & ./scripts/deploy.sh sonnet & waitEach deploy is independent — separate bearer token, separate model alias in LiteLLM. All share http://localhost:4000. Switch in Cline by changing the Model ID.
If your VSCodium runs on a Mac but you're connected via Remote-SSH to a Linux box, Cline runs in the remote extension host — so its localhost:4000 means the remote machine, not your Mac. LiteLLM stays on the Mac (keeps your provider API keys local); we tunnel port 4000 back over the SSH session you're already opening:
./scripts/tunnel.sh init <ssh-host> # adds RemoteForward 4000 to ~/.ssh/config
./scripts/tunnel.sh deinit <ssh-host> # removes it
./scripts/tunnel.sh status # shows configured hostsAfter init, reconnect any open Remote-SSH window (close → reopen). Cline's Base URL stays http://localhost:4000/v1 — it's now forwarded back to your Mac. No tailnet ACL changes, no extra listeners exposed on your Mac, encrypted by the same SSH transport you're already using.
If your tailnet ACL does allow remote → Mac (uncommon for tagged-devices → user setups), there's also an opt-in docker-compose.tailscale.yml that adds a Tailscale-interface binding — see comments in that file.
Avoid re-downloading the 554 GiB Kimi K2.6 weights every day:
./scripts/vol-up.sh opus # one-time ~$6 + ~1-2 hr download
./scripts/deploy-vast.sh opus # subsequent: 3-5 min cold start
./scripts/destroy.sh opus-vast # stops compute, keeps volume
./scripts/vol-down.sh opus # delete volume (end of project)Monthly cost: ~$986 for 80 hrs/mo of opus use (4 hrs/day × 20 days) — about 60% cheaper than Anthropic Opus API at typical agentic-coding token mix.
Caveat: Vast volumes are pinned to a specific host. If that machine disappears, the volume is unavailable until it comes back. RunPod network volumes (host-independent) aren't wired in this repo yet.
Field-tested gotchas baked into the scripts as comments and filters:
- Vast
verified≠ datacenter.verified: {eq: true}means "host passes basic reliability checks" (marketplace tier, Docker-only isolation). The actual ZDR/HIPAA filter isdatacenter: {eq: true}(ISO 27001, Tier 3/4, BAA-eligible).deploy-vast.shhardcodes the latter. - Vast rents whole hosts. Search must use
num_gpus: {eq: N}notgte: N— otherwise picking an 8-GPU host for a 4-GPU TP config double-bills. - CUDA forward-compat doesn't work on consumer Ada. RTX 4090 hosts with driver < 580 (
cuda_max_good < 13.0) fail withcudaInit error 804. Filter forces≥ 13.0. runpod/worker-v1-vllmhas no:stableor:latesttag — only versioned tags.:stablesilently stalls forever.deploy-serverless.shpins to a known-good version.- RunPod
RestrictedAPI-key scope returns 403 on/v2/<id>/openai/v1. Use All scope for serverless inference. - Plain HTTP on Vast. Vast direct-port-forwarding is
http://<host>:<port>, not HTTPS. The bearer token is the only thing keeping the endpoint private. Adequate for personal use given the bearer; run a Caddy/Cloudflared sidecar for full TLS. - Some multi-GPU Vast hosts have broken CDI runtime. A subset fail container creation with "unresolvable CDI devices." Tear down and pick a different operator — per-host bug, not provider-wide.
- Vast Serverless isn't wired here. Their model is Python SDK +
@app.remote()handlers, not a flag on top of pods. Tracked as a follow-up PR. - RunPod serverless workers go "unhealthy" on FP8 cold start with sonnet. Diagnosed but not yet root-caused — likely worker-v1-vllm + DeepSeek V4 incompat. Use the Vast pod path for sonnet today.
| Project | Closeness | Differs |
|---|---|---|
Leafcloud tf-leafcloud-opencode |
~70% | OpenCode TUI (not Cline), CIDR allowlist, Leafcloud-only, no BAA |
| OpenClaw + vLLM on Vast.ai / Salad | ~65% | OpenClaw runtime, no LiteLLM Anthropic shim |
| Netclode | ~55% | Mobile/iOS client, Ollama not vLLM, k3s + microVM-per-session |
| ZeroClaw + LiteLLM + vLLM in Docker | ~50% | DGX Spark focus, ZeroClaw not Cline |
| BentoVLLM / OpenLLM | ~50% | Just the "model → OpenAI endpoint" piece |
Differentiator: nobody else ships VSCodium + Cline + LiteLLM + rented-GPU vLLM + serverless mode + HIPAA-eligible host + verified Groq API ZDR posture as a single one-line-deploy template.
- BAA is a separate process on every provider — RunPod, Vast, Groq all gate it behind sales/email. None are self-serve clickwrap with a counter-signed PDF on file. Plan ~1-5 business days.
- Cold start is slow. Pods: ~10-20 min for haiku/sonnet, ~20-30 min for opus. Serverless: 3-10 min on first request after scale-to-zero. Run profiles in parallel to overlap warmups.
- 80GB datacenter supply is thin. Sonnet (4× A100/H100 80GB) and opus (8× H100 80GB) Secure-Cloud inventory rotates hourly. Have
GPU_NAME="H200"as a fallback. - No persistent vLLM cache by default (except via
vol-up.sh). Weights re-download each fresh pod. - Hugging Face anonymous works for most models. Qwen2.5-Coder-32B-AWQ and DeepSeek V4 Flash are open-weight; Kimi K2.6 too. Gated models need
HF_TOKENin.env. - Parallel mode billing. All three tiers running = ~$18-30/hr. Stop tiers you aren't testing with
./scripts/destroy.sh <profile>.
.
├── README.md # this file
├── COMPLIANCE.md # full Level-by-Level compliance mapping
├── LICENSE # MIT
├── start.command / start.bat # double-click launchers (Mac / Windows)
├── stop.command / stop.bat # double-click teardown
├── docker-compose.yml # LiteLLM container
├── litellm/config.yaml # model-ID routes
├── gpu-node/
│ ├── Dockerfile # vLLM image
│ └── start.sh # container entrypoint
├── scripts/
│ ├── install-prereqs.sh # macOS/Linux installer
│ ├── install-prereqs.ps1 # Windows installer
│ ├── api-up.sh # Level 3 — Groq API mode
│ ├── aider-up.sh # one-time install of Aider (terminal client)
│ ├── aider.sh # launch Aider pointed at the local proxy
│ ├── openhands-up.sh # browser-based agent UI for non-developers
│ ├── openhands-down.sh # stop OpenHands
│ ├── deploy.sh # Level 6 — RunPod always-on pod
│ ├── deploy-vast.sh # Level 6 — Vast.ai pod (recommended)
│ ├── deploy-serverless.sh # Level 6 — RunPod serverless
│ ├── vol-up.sh / vol-down.sh # Vast persistent volume management
│ ├── destroy.sh # teardown (any profile, any provider)
│ ├── preflight.sh # validate prereqs + .env
│ └── smoketest.sh # end-to-end path test
├── .env.example # API key template
└── .gitignore
smoketest.sh returns FAIL — read its output; it names the broken hop.
403 Forbidden from RunPod serverless — your RUNPOD_API_KEY is Restricted scope. Recreate with All scope.
Serverless worker stuck "initializing" or "unhealthy" — check the RunPod dashboard for that worker's logs. Common causes: template image tag doesn't exist, GPU pool capacity, or vLLM init failure for FP8 models on non-Hopper hardware.
vLLM "out of memory" — shrink MAX_LEN or lower GPU_UTIL. Haiku at 8K already exhausts KV cache on 24GB after CUDA-graph capture; default is 4K.
Cold-start request hits Cloudflare 524 — the sync /openai/v1 path has a 120s edge timeout. Worker is fine; subsequent requests succeed once warmed.
Vast vLLM crashes with cudaInit error 804 — driver too old for our container's CUDA libs. Filter forces cuda_max_good ≥ 13.0.
Vast "Pulling fs layer" stalls — host can't reach GHCR (typical of CN-located hosts). Filter inet_down ≥ 500 Mbps.
Vast picks an 8-GPU host when you want 4 — Vast rents whole hosts. Script uses num_gpus: {eq: N} to avoid this.
Open a private security advisory on this repository's GitHub Security tab. No bounty program; aim to respond within 5 business days.
MIT — see LICENSE.