zdr-coder

Self-host your AI coding assistant. Like ChatGPT or Claude — but the AI runs on a server you control, your prompts never get used to train anyone's model, and it costs cents per session instead of $20/month.

New to this? Jump to 📚 If you've never used a terminal for a step-by-step walkthrough. Most readers below are developers — that section is for everyone else.

📚 If you've never used a terminal, read this first

Most "self-host your AI" guides assume you're a developer. This section is for everyone else. No coding experience required. You'll spend ~30 minutes setting it up once, then never think about it again.

What you'll get

A chat window in your browser where you talk to an AI coding assistant (like ChatGPT, but private)
Your code and conversations never leave your computer or the AI provider — no training data harvest, no logs
Pennies per session instead of $20/month — Groq charges per-use, with $0 idle
Works on Mac, Windows, or Linux

What you'll need

What	Where	Cost
Docker Desktop (the engine that runs everything)	https://docs.docker.com/desktop/ — pick your OS	Free
A Groq account (provides the AI)	https://console.groq.com/	Free signup, pennies per use
About 30 minutes	—	Yours

You will not need to:

Know what an API, container, or proxy is
Write or edit any code
Run commands manually (after the one-time setup)

Step-by-step setup

1. Install Docker Desktop (10 min)

Go to the link in the table above, download the installer for your OS, run it. After installing, open Docker Desktop and wait until the menu-bar/tray icon shows "Docker Desktop is running." (You only do this once.)

2. Get your free Groq API key (3 min)

Go to https://console.groq.com/ — sign up with Google or email.
Click API Keys in the left sidebar → Create API Key → copy the key (it starts with gsk_...). Save it somewhere safe — you'll need it in a moment.

3. Turn on Zero Data Retention (1 min — important for privacy)

Go to https://console.groq.com/settings/data-controls
Toggle Zero Data Retention to ON.
This stops Groq from keeping any record of your prompts. You can verify it stuck by reloading the page.

4. Download this project (2 min)

At the top of this page, click the green <> Code button → Download ZIP.
Unzip the file. You'll have a folder called zdr-coder-main. Move it somewhere you'll find it again (Documents, Desktop, anywhere).
Open the folder. Find the file called .env.example — make a copy and rename the copy to .env (yes, just .env — no .example).
Open .env in TextEdit (Mac) or Notepad (Windows). Find the line that says GROQ_API_KEY= and paste your Groq key from step 2 right after the =. Save and close.

5. Start everything (one double-click)

Mac: Double-click start.command in the project folder. (If macOS warns about an "unidentified developer," right-click → Open → Open.)
Windows: Double-click start.bat in the project folder.

A Terminal/Command Prompt window will appear and show progress. After a minute or two it will say "Stack is running" and your browser will open to http://localhost:3000 — that's the AI chat interface (OpenHands).

6. Use it

In OpenHands, you'll see a chat box. Drop your project folder into the workspace pane (or just chat without a folder for general questions).
Type what you want the AI to do — "build me a simple todo app", "explain this code", "fix the bug where X happens", etc.
The AI will plan, edit files, and run commands inside its own sandbox. You watch and approve.

7. Stop when done

Double-click stop.command (Mac) or stop.bat (Windows). This stops the AI and the proxy. Your API key and settings are preserved for next time.

To start again later: just double-click start.command again.

What to do if something goes wrong

Symptom	Fix
`start.command` says "Docker Desktop did not start"	Open Docker Desktop manually from Applications, wait for the icon to say "running," then try `start.command` again.
`start.command` says "Missing Groq API key"	It will open the `.env` file for you. Paste your key after `GROQ_API_KEY=` and save.
Browser shows "This site can't be reached"	Wait another 30 seconds — first launch is slow. Refresh the page. If still broken, double-click `stop.command` then `start.command` again.
Bills look higher than expected	You may have left a self-hosted GPU pod running. Double-click `stop.command` to stop everything. Groq API mode by itself costs $0 when idle.
Anything else	Take a screenshot and ask the friend who pointed you here.

Costs in plain language

Using the default setup (Groq API mode), your costs are roughly:

What you're doing	Approximate cost
Stack sitting idle (you're not chatting)	$0
1 hour of active AI coding work	~$0.20
8 hours/day, every weekday, for a month	~$30/month
Compare: ChatGPT Plus	$20/month
Compare: Claude Pro	$20/month

So at most workloads, this is cheaper than ChatGPT Plus or Claude Pro while giving you actual zero-data-retention.

Why this exists

Who it's for: anyone who wants AI coding assistance with comparable performance to Claude (Anthropic) but with real ZDR + data privacy you can verify — including teams under HIPAA, SOC 2, or IP-sensitive workloads where "the model provider promises to be nice" isn't sufficient.

The thesis: open-weights models (GPT-OSS 120B, DeepSeek V4 Flash, Kimi K2.6) are now close enough to Claude Haiku / Sonnet / Opus that for most work you don't need Claude.AI. What you do need: a way to run them privately with contractual or technical ZDR. That's this repo.

What it gives you: one local proxy on http://localhost:4000, two privacy-preserving inference paths behind it (API + self-hosted), and tier aliases (haiku-api, sonnet-api, sonnet-vast, opus-vast) so your client config doesn't change when you swap underlying models. Drop-in for Aider, Cline, Roo Code, OpenHands.

Why now: ChatGPT Plus, Claude Pro, and Gemini Advanced all train on your input by default (or by tiny opt-in toggle), and none of the three are HIPAA-BAA-eligible. Your $20/mo buys faster models, not contractual privacy. This repo gives you contractual ZDR (Groq Cloud, with a self-serve toggle) or physical ZDR (your own pod on a Tier 3-4 datacenter) for less than $1/hr active and $0 idle. Total time to first request: 5 minutes for API mode, 15 minutes for a fresh pod.

Honest about what it doesn't do: no cryptographic E2E (provider still sees plaintext during inference — that's a Level 5 problem requiring TEE attestation), no FedRAMP / HITRUST of the rental platform itself (their datacenter partners have it transitively). COMPLIANCE.md documents every gap with verbatim citations.

Pick your privacy level

There are six levels of "how private is my AI." This repo gives you ⭐ levels 3 and 6 — the rest are listed so you can see where you'd otherwise land. Every claim below is sourced from the provider's own legal docs (links + verbatim quotes in COMPLIANCE.md).

Level	What it is	Cost	Provider sees your prompts?	Trained on by default?	HIPAA BAA?	Compliance certs	Good for
1. Lowest	Free consumer chat — chatgpt.com, claude.ai free, gemini.google.com	$0	Yes, plaintext, sampled humans may read	ChatGPT & Gemini: yes. Claude: opt-in only.	❌ never	❌ free tier excluded	Throwaway questions
2. Moderate	$20/mo consumer subs — ChatGPT Plus, Claude Pro, Gemini Advanced	~$20/mo	Yes, plaintext, sampled humans may read	ChatGPT Plus & Gemini Advanced: yes. Claude Pro: opt-in (toggle in settings).	❌ Plus / Pro / Advanced explicitly ineligible	❌ consumer tier excluded	Personal coding, nothing sensitive
3. High ⭐ this repo: `-api` routes*	Developer APIs with ZDR option — Groq, OpenAI API, Anthropic API, DeepInfra	$0.13–$4.50/hr active, $0 idle	Yes, plaintext, no human review under contract	❌ contractually no	✅ on request	SOC 2 Type 2, ISO 27001 (Groq)	Most use. Sensible default.
4. Very high	Enterprise cloud LLM APIs — AWS Bedrock (Anthropic Claude via Bedrock), Azure OpenAI Foundry, GCP Vertex AI	$3–$15 / million tokens (consumption-only, no contract minimum)	Yes, plaintext, cloud-vendor enforced no-access	❌ contractually no	✅ standard (Bedrock + Azure + GCP all have BAA in their default agreements)	SOC 2 Type 2, ISO 27001/27017/27018, FedRAMP Moderate (commercial) / High (GovCloud / Assured Workloads), HITRUST CSF	Regulated industries with audit obligations — HIPAA + FedRAMP/HITRUST required
5. Maximum	TEE-attested confidential inference — Tinfoil, GCP H100 CC mode	$5–$50/hr active or per-token	❌ cryptographically blind — hardware-attested	❌ enforced by hardware	✅ via Tinfoil	Tinfoil: SOC 2 only. GCP: full stack	National security, ultra-paranoid PHI
6. Own everything ⭐ this repo: `-vast`, `-serverless` routes	Self-host open-weights on rented GPU. Three sub-tiers by compliance ceiling: (a) Vast Secure Cloud / RunPod Secure Cloud — cheapest BAA-eligible; (b) AWS EC2 p5e/p5en (8× H200) — FedRAMP/HITRUST inheritance, much pricier; (c) Azure/GCP equivalents — similar to AWS	(a) $0.40–$15/hr Vast/RunPod pods, $0 idle serverless; (b) $39.80–$52.02/hr AWS p5e/p5en	Only the datacenter host operator's root user; contractually prohibited from introspecting (RunPod explicit, Vast implicit; AWS strongest)	❌ you control the model weights	✅ via datacenter operator BAA (all three)	(a) SOC 2 Type 2 platform, ISO 27001 via DC partners; (b) full AWS stack — SOC 2, ISO 27001, FedRAMP, HITRUST inherited	Long sessions, full audit trail, no managed-model provider in the path. AWS sub-tier when FedRAMP/HITRUST required.

A few non-obvious things from the research:

Paid consumer ≠ private. ChatGPT Plus and Gemini Advanced default to using your chats for training. Claude Pro defaults to opt-in (same as free Claude). None of Plus / Pro / Advanced is HIPAA-BAA-eligible. The $20 buys you faster models and higher rate limits, not contractual privacy.
Free Claude is more private than free ChatGPT. Claude requires opt-in for training; ChatGPT and Gemini opt you in by default. None are BAA-eligible.
AWS Nitro Enclaves can't run sonnet-class models. Nitro Enclaves have no GPU. The "confidential AI on AWS" marketing requires GovCloud Provisioned Throughput, not Enclaves.
Vast/RunPod compliance is split: the rental platform holds SOC 2 Type 2, but ISO 27001 belongs to their datacenter partners, not the platform itself.
Groq's ZDR is the strongest "Level 3" story because the toggle is self-serve in every account — most competitors gate ZDR behind enterprise contracts.

For each cell with verbatim provider quotes + URLs: see COMPLIANCE.md.

What you get from this repo

If you want…	Run this	Cost shape
Level 3 (API + ZDR), fastest setup	`./scripts/api-up.sh`	$0.13–$0.56/hr active, $0 idle
Level 6 always-on pod, cheapest	`./scripts/deploy-vast.sh haiku` (or sonnet / opus)	$0.40–$15/hr while up
Level 6 scale-to-zero, private	`./scripts/deploy-serverless.sh haiku` (or sonnet)	$0 idle, ~$0.50–$6/hr active
Stop everything	`./scripts/destroy.sh all`	$0

All three use the same local endpoint (http://localhost:4000). Switch between them in Cline by changing the Model ID field — no restart.

Five-minute setup

# 1. Install prereqs once (Docker, jq, openssl, VSCodium, Cline extension)
./scripts/install-prereqs.sh                     # macOS / Ubuntu / Debian
# or .\scripts\install-prereqs.ps1               # Windows (PowerShell as admin)

# 2. Set up your API keys
cp .env.example .env
$EDITOR .env                                     # set GROQ_API_KEY and/or VAST_API_KEY

# 3. Pick a path and run it
./scripts/api-up.sh                              # Level 3 — easiest, ZDR via Groq
# OR
./scripts/deploy-vast.sh haiku                   # Level 6 — your own GPU pod

Pick a client

LiteLLM is now serving an OpenAI-compatible endpoint on http://localhost:4000/v1. Point any agentic coding tool at it:

OpenHands (browser-based — best for non-developers, or anyone who wants a visual workspace):

./scripts/openhands-up.sh           # starts at http://localhost:3000
./scripts/openhands-down.sh         # stop

Browser UI with file workspace, sandboxed shell, plan/act loops. Backed by the same LiteLLM proxy. The non-developer start.command / start.bat launchers wrap this path end-to-end (Docker → LiteLLM → OpenHands → browser).

Aider (terminal, recommended for developers — no extension fragility):

./scripts/aider-up.sh   # one-time install (pipx install aider-chat)
./scripts/aider.sh      # launch — defaults to sonnet-api

Slash commands inside Aider: /add <file>, /run <cmd>, /commit, /undo, /help. Works flawlessly over SSH/tmux. To switch tier mid-session: exit and ZDR_MODEL=haiku-api ./scripts/aider.sh.

Cline / Roo Code (VSCode/VSCodium extension):

VSCodium → Cline (or Roo) → gear icon:

API Provider: OpenAI Compatible
Base URL: http://localhost:4000/v1
API Key: contents of .litellm-key (auto-generated; cat .litellm-key)
Model ID: sonnet-api (or sonnet-vast / haiku / etc — see below)

Cline is more autonomous; Aider is more controllable. Pick whichever fits your workflow. For SSH-Remote dev specifically, Aider avoids a class of extension-host networking bugs — see Using Cline from a remote SSH host below.

Done. Start coding.

Available model IDs

Model ID	What runs	Where	Tier mapping	Notes
`haiku-api`	GPT-OSS 20B (OpenAI open-weight, coding-tuned)	Groq Cloud	Level 3, ~Haiku-class	$0.10/hr active, $0 idle
`sonnet-api`	GPT-OSS 120B (OpenAI open-weight reasoning)	Groq Cloud	Level 3, ~Sonnet-ish	$0.20/hr active, $0 idle
`haiku-vast`	Qwen2.5-Coder-32B-AWQ (18GB INT4)	Vast Secure Cloud pod	Level 6, ~Haiku-class	$0.40–0.67/hr while up
`sonnet-vast`	DeepSeek V4 Flash (158B FP8, 149GB weights)	Vast Secure Cloud pod	Level 6, ~Sonnet-class	$5.87/hr while up
`opus-vast`	Kimi K2.6 (1T params, 554GB weights) — Opus-class on most benchmarks	Vast Secure Cloud pod	Level 6, ~Opus-class	$7.74–$15/hr while up
`haiku-serverless`	Qwen2.5-Coder-32B-AWQ	RunPod Serverless	Level 6, scale-to-zero	$0 idle, ~$0.50/hr active
`sonnet-serverless`	DeepSeek V4 Flash	RunPod Serverless	Level 6, scale-to-zero	$0 idle, ~$6/hr active (H200, capacity-dependent)
`haiku` / `sonnet` / `opus`	Same as `-vast`	RunPod always-on pod	Level 6, alt provider	Usually more expensive than Vast

Switching is instant — just change the field in Aider (ZDR_MODEL=...) or Cline (Model ID).

Honest tier mapping — what actually matches each Claude tier?

The aliases (haiku-api, sonnet-api, opus-vast) are aspirational labels mapped to the best ZDR-eligible open-weights option for that compute shape. As of May 2026:

Want	ZDR-eligible route in this repo	Real cost	Honest performance vs Claude
Haiku-class, cheapest	`haiku-api` (GPT-OSS 20B on Groq)	$0.10/hr active, $0 idle	Comparable to Haiku for simple edits; coding-tuned
Sonnet-ish, scale-to-zero	`sonnet-api` (GPT-OSS 120B on Groq)	$0.20/hr active, $0 idle	Between Sonnet 3.5 and Sonnet 4 for agentic loops; weaker on long-context multi-file work
Sonnet-class, self-hosted	`sonnet-vast` (DeepSeek V4 Flash on Vast 4× H100)	$5.87/hr while up	Solid Sonnet-class; FP8 weights, 65K context
Opus-class (best open option)	`opus-vast` (Kimi K2.6 on Vast 8× H100 or 4× H200)	$7.74–$11.74/hr while up	88.7% SWE-Bench vs Opus 4.7's 87.6% — Kimi K2.6 wins. Trade Opus's edge on GPQA / long-context / tool orchestration for the strongest SWE-Bench open-weights score.

There is no Groq-API equivalent for Opus-tier — Groq's production catalog tops out at GPT-OSS 120B, which sits between Sonnet 3.5 and Sonnet 4 on capability. Opus-class under ZDR means self-hosting (opus-vast → Kimi K2.6). We deliberately keep the API path to a single provider key (Groq) to minimize setup friction — adding a second API provider would mean another account and another key for marginal upside, since opus-vast already gets you Opus-class self-hosted.

Performance caveats — read this before betting on these aliases

ZDR-first framing: this repo prioritizes the contractual + technical privacy posture over hitting Claude's exact quality on every workload. The model choices are the best ZDR-eligible options, not necessarily the absolute best model. Specific caveats:

Long-horizon tool-use consistency: Claude Opus 4.7 still leads on 20+ tool-call agentic loops. Open-weights models including Kimi K2.6 drift more in long sessions. Mitigation: shorter task scope per aider session, explicit /clear between unrelated tasks.
Aider edit-block format compliance: older Qwen models drop the diff format ~5–10% of the time on multi-file refactors. GPT-OSS 120B and Kimi K2.6 are noticeably more reliable. If a model misbehaves, try --edit-format whole or --edit-format udiff.
GPQA / scientific reasoning: Opus 4.7 leads (~94.2%). No open-weights model matches it yet on this specifically.
MCP-Atlas / structured tool orchestration: Opus 4.7 leads. Cline/Roo's tool-call layer may produce more retries on open-weights models — not a model defect per se, just edge cases.
Adversarial-prompt robustness: Closed frontier models have stronger safety/jailbreak resistance. Not relevant for solo coding work, important if you're exposing the proxy to untrusted inputs.
Context length in practice: Groq's GPT-OSS 120B is 131K context, K2.6 self-hosted is 256K, but quality degrades meaningfully past ~50K input on all open models. Keep contexts tight.
Cost vs quality crossover: Below ~2 hrs/day of opus-tier use, Anthropic Opus API at $30/hr is cheaper than running opus-vast at $7.74/hr × 24. Self-hosted opus wins on continuous use + ZDR mandate, not just "I want a frontier model occasionally."

What's actually wired in API mode

Alias	Groq model	Tok/s	$/M in	$/M out
`haiku-api`	GPT-OSS 20B	1000	$0.075	$0.30
`sonnet-api`	GPT-OSS 120B	500	$0.15	$0.60

Other models exist in Groq's production catalog (Llama 3.1 8B, Llama 3.3 70B, etc.) but the GPT-OSS line is coding-tuned and meaningfully cheaper, so we standardize on it. DeepSeek V4 and Kimi K2.6 are not on Groq production — for Opus-tier under ZDR, self-host via Level 6 (opus-vast).

Compliance posture

This repo's Level 3 + Level 6 paths together cover:

✅ Zero data retention (Groq self-serve toggle; Vast/RunPod by container ownership)
✅ No training on your data (contractual on Groq; physical on self-hosted)
✅ HIPAA BAA available (all three providers — see COMPLIANCE.md for request process)
✅ SOC 2 Type 2 (Groq Inc, Vast Inc, RunPod Inc as of Oct 2025)
✅ Encryption in transit (TLS to provider edge)
✅ US data residency (default on all three)
✅ No third-party model provider in the inference path (Level 6)

What this repo does not give you out of the box:

❌ Cryptographic end-to-end (provider still sees plaintext during inference — Level 5 only)
❌ FedRAMP / HITRUST (Level 4 cloud APIs; or self-certify on Level 6 self-hosted)
❌ EU data residency (US-default; pick *-vast with geolocation=EU to override)
❌ Side-channel resistance on multi-tenant GPUs

COMPLIANCE.md has the full mapping with verbatim quotes from each provider's binding legal docs, plus the 7-step checklist for maintaining max-ZDR posture on Groq.

Architecture

flowchart LR
    subgraph laptop["Your laptop"]
        Cline["VSCodium + Cline"]
        LiteLLM["LiteLLM proxy<br/>Docker, :4000"]
        Cline -->|"localhost:4000"| LiteLLM
    end
    LiteLLM -.->|"TLS + bearer token"| Edge
    subgraph providers["Inference (pick one or many)"]
        Edge["Provider edge"]
        Edge --> API["Level 3 — Groq Cloud<br/>scale-to-zero, ZDR toggled"]
        Edge --> Vast["Level 6 — Vast.ai Secure Cloud<br/>always-on pod"]
        Edge --> RP["Level 6 — RunPod Secure Cloud<br/>pod or serverless"]
    end

LiteLLM is the local OpenAI-compatible proxy. Routes per-model-ID aliases, holds the master API key, injects per-route bearer tokens. Bound to 127.0.0.1 only — never exposed.
Groq path is direct API. ZDR toggle in console gates retention.
Vast / RunPod paths spin up a pod running gpu-node/Dockerfile (vLLM + your chosen model). LiteLLM connects via TLS-terminated proxy URL + bearer token.
Cline = the agentic coding extension in VSCodium. Talks to LiteLLM on localhost.

No mesh VPN — provider-managed transport (TLS) + bearer tokens is the same E2E envelope, simpler to operate.

Pod vs serverless vs API — when each wins

	API (Level 3, Groq)	Pod (Level 6, always-on)	Serverless (Level 6, scale-to-zero)
Idle cost	$0	$0.40–$15/hr	$0
Active cost	per-token (~$0.13–$0.56/hr equivalent)	included in hourly	per-second of worker uptime
First request	100ms	instant (host warm)	~3–5 min cold-start (sometimes longer for sonnet)
Capacity risk	Groq has plenty	thin on 80GB+ for sonnet/opus	thin on H200 for sonnet
Privacy	contractual ZDR	physical (your container)	physical (your container)
Best for	most use — bursty or continuous	4+ hrs/day on one tier	bursty but private

Rule of thumb: <2 hrs/day → Level 3 API. >4 hrs/day → Level 6 pod. In between → Level 6 serverless.

Cost comparison — live snapshot, May 2026

Always-on pod pricing for each tier (from current available on-demand offers):

Tier	RunPod Secure $/hr	Vast Secure Cloud $/hr	AWS EC2 $/hr	Notes
haiku (1× RTX 4090 24GB)	$0.69	$0.40–0.67	n/a — AWS doesn't rent 4090s	Vast cheapest when Iceland host rentable
sonnet (4× A100 / 4× H100 80GB)	$5.96 (often sold out)	$4.27 (A100) or $5.87 (H100 SXM)	$32–40 (p4de / p5)	Vast supply thin, 1–2 hosts at a time; AWS p5 even thinner
opus (8× H100 SXM 80GB)	$23.92 (often sold out)	$11.74	$50–98 (p5.48xlarge)	France datacenter when listed
opus (alt) 4× H200 140GB	—	$7.74	—	560 GiB > Kimi K2.6's 554 GiB weights
opus (frontier — DeepSeek V4 Pro)	—	—	$39.80–$52.02 (p5e.48xlarge / p5en.48xlarge 8× H200)	See "DeepSeek V4 Pro" section below

Versus going Anthropic-direct (no self-hosting): ~$30/hr for Opus-class agentic-coding workload. Crossover for opus is ~1.5 hrs/day before self-hosted wins on cost.

Opus-tier on open weights — Kimi K2.6 (already wired as `opus-vast`)

For Opus-class self-hosted: Kimi K2.6 (Moonshot, April-May 2026) is the strongest open-weights option and is what opus-vast runs. Honest benchmark picture — different sources report different numbers:

Model	SWE-Bench Verified (range)	Intelligence Index (AA)	Context
Claude Opus 4.7	87.6%	57	200K
Claude Opus 4.6	~85%	~55	200K
Kimi K2.6 (this repo)	80.2%–88.7% (source-dependent)	54	256K
DeepSeek V4 Pro	80.6%–83.7%	52	1M

The honest framing: Kimi K2.6 is comparable to Opus 4.6/4.7 — not strictly better, not meaningfully worse for typical work. The Intelligence Index gap is 3 points (57 vs 54) which translates to: occasionally noticeable on multi-step nuanced reasoning, invisible on most everyday tasks.

Where Opus still pulls ahead:

Multi-step nuanced reasoning where each step builds on the last
Long-horizon agentic loops (20+ tool calls without drift)
GPQA Diamond / scientific reasoning (~94% vs ~82%)
Adversarial prompt robustness (less relevant for solo coding)

Where Kimi K2.6 matches or wins:

General chat, Q&A, code review, single-task agentic work
Multilingual coding
256K context (vs Opus 200K)
Cost: 5–6× cheaper than Anthropic Opus API when self-hosted at typical workloads
You can audit it: open weights, your container, no third-party model provider seeing prompts

The point of this repo isn't that Kimi K2.6 beats Opus 4.7. The point is that for ZDR + privacy + audit, you get comparable Opus-class performance from a model you fully control. Claude.AI users move to this not because the open-weights model is better, but because they need contractual ZDR + their own data sovereignty.

Other recent OSS frontier releases (April–May 2026) worth knowing

Model	Vendor	License	Notes for this repo
Kimi K2.6	Moonshot	open weights	Opus-class — wired as `opus-vast`
DeepSeek V4 Pro	DeepSeek	open weights	Opus-class on benchmarks, similar to Kimi K2.6. Not wired — requires 8× H200 minimum (864 GB weights).
DeepSeek V4 Flash	DeepSeek	open weights	Already wired as `sonnet-vast` — sonnet-class FP8
GLM-5.1	Z.ai	open weights	Newer, similar tier to DeepSeek V4 Flash
Qwen 3.6	Alibaba	Apache 2.0	Strong on broad benchmarks; not yet wired
MiMo-V2.5-Pro	Xiaomi	open weights	Strong reasoning
MiniMax M2.7	MiniMax	open weights	Recent open-source
Gemma 4	Google	open weights	Smaller — haiku-tier
Ring-2.6-1T	Ant Group (inclusionAI)	open weights	Large MoE, 1T params

DeepSeek V4 Pro on AWS (Level 6 + L4-tier compliance) — when this makes sense

DeepSeek V4 Pro is one viable open-weights opus-class model — 1.6T params MoE, ~49B active per token, 80.6% on SWE-bench (lower than Kimi K2.6's 88.7%, but ahead of most). Weights ~864 GB. Doesn't fit on 8× H100 80GB (640 GB total < 864 GB); minimum viable host is 8× H200 141GB = 1,128 GB single node, or 16× H100 across 2 nodes with NVLink+InfiniBand.

AWS EC2 instances that fit it (verified May 2026):

Instance	Spec	US East (Ohio) $/hr	US West (N. California) $/hr	Availability
p5e.48xlarge	8× H200, Sapphire Rapids CPU, 1,128 GiB HBM3e	$39.80 (was $34.61 pre-Jan 2026)	$49.75	Tight — AWS hiked 15% in Jan due to GPU demand
p5en.48xlarge	8× H200 + Gen5 PCIe (faster CPU↔GPU)	~$42 estimated Ohio	$52.02	Tighter than p5e

Honest realities for this path:

~$40/hr in Ohio is the floor for DeepSeek V4 Pro on AWS — and US East has the best supply.
Capacity is generally not on-demand — you typically use EC2 Capacity Blocks for ML (pre-book 1–6 month windows in cluster sizes 1–64 instances). On-demand p5e/p5en availability is genuinely scarce in May 2026.
Compliance inheritance is the reason to pick AWS over Vast — AWS Bedrock-tier BAA + SOC 1/2/3 + ISO 27001/27017/27018 + FedRAMP Moderate (commercial) / High (GovCloud) + HITRUST CSF all inherit transitively to the EC2 instance you run vLLM on. Vast/RunPod can't match that audit story.
Cost vs Anthropic Opus API: Anthropic Opus 4.7 ≈ $30/hr typical agentic load. AWS DeepSeek V4 Pro ≈ $40/hr. The self-hosting math does not work for opus-tier on AWS at current prices — Vast at $7.74/hr (4× H200) is 5× cheaper if your compliance bar is BAA-only rather than FedRAMP/HITRUST.
Capacity Block reservations lock you into 1–6 months at a fixed rate. If your usage is <40 hrs/month, on-demand Vast wins on flexibility even at higher hourly.

When AWS Level 6 actually makes sense:

You need FedRAMP / HITRUST inheritance on the inference path itself (Vast/RunPod can't give you this).
You have predictable continuous workload (>4 hrs/day, 5 days/week) to amortize a Capacity Block reservation.
Your data-governance team requires AWS-tier vendor risk management — not just BAA paper.

For most users, opus-vast on Vast.ai 4× H200 at $7.74/hr remains the right call. AWS H200 is the answer only when the compliance ceiling demands it.

DeepSeek V4 Pro vs Claude Opus 4.7: Opus is still better at long-context coherence and tool-use consistency. V4 Pro is closer on raw reasoning benchmarks. For agentic coding specifically, Opus still edges it — but the gap is small enough that self-hosting V4 Pro is a real choice if compliance forces it.

Setup detail

Pick your provider(s)

You only need to set up the providers you'll actually use.

Vast.ai — cheapest Level 6 path

Sign up at https://cloud.vast.ai/
Account → Create API Key → Advanced tab
Permissions: Instances = Read+Write, everything else minimal, 2FA off (programmatic key)
Copy → .env as VAST_API_KEY=...

RunPod — only provider with serverless wired today

https://console.runpod.io/user/settings → API Keys → Create
Permissions: All scope (Restricted returns 403 on serverless /openai/v1)
Add credit, copy → .env as RUNPOD_API_KEY=...

Groq Cloud — Level 3

Sign up at https://console.groq.com/
Enable ZDR before first request: https://console.groq.com/settings/data-controls
Create API key → .env as GROQ_API_KEY=...
(HIPAA) email security@groq.com requesting a counter-signed BAA — see COMPLIANCE.md

Deploy and tear down

# Level 3 — API mode
./scripts/api-up.sh                              # bring up LiteLLM with -api routes
./scripts/destroy.sh api                         # stop LiteLLM, keep keys

# Level 6 — Vast pods (recommended)
./scripts/deploy-vast.sh haiku                   # 1× RTX 4090, ~$0.40–0.67/hr
./scripts/deploy-vast.sh sonnet                  # 4× H100 80GB, ~$5.87/hr
./scripts/deploy-vast.sh opus                    # 8× H100 80GB, ~$11.74/hr

# Level 6 — RunPod alternatives
./scripts/deploy.sh haiku                        # always-on pod
./scripts/deploy-serverless.sh haiku             # scale-to-zero
./scripts/deploy-serverless.sh sonnet            # scale-to-zero (H200, capacity-dependent)

# Teardown
./scripts/destroy.sh haiku-vast                  # one tier
./scripts/destroy.sh all                         # everything across all providers

Pod termination stops billing within ~1 min. Serverless idle is already $0 (workersMin=0); teardown removes the endpoint + template.

Running multiple tiers in parallel

Parallel cold-start, ~15-20 min wall time vs serial:

./scripts/deploy-vast.sh haiku &  ./scripts/deploy.sh sonnet &  wait

Each deploy is independent — separate bearer token, separate model alias in LiteLLM. All share http://localhost:4000. Switch in Cline by changing the Model ID.

Using Cline from a remote SSH host (VSCodium Remote-SSH, Tailscale SSH, etc.)

If your VSCodium runs on a Mac but you're connected via Remote-SSH to a Linux box, Cline runs in the remote extension host — so its localhost:4000 means the remote machine, not your Mac. LiteLLM stays on the Mac (keeps your provider API keys local); we tunnel port 4000 back over the SSH session you're already opening:

./scripts/tunnel.sh init <ssh-host>     # adds RemoteForward 4000 to ~/.ssh/config
./scripts/tunnel.sh deinit <ssh-host>   # removes it
./scripts/tunnel.sh status              # shows configured hosts

After init, reconnect any open Remote-SSH window (close → reopen). Cline's Base URL stays http://localhost:4000/v1 — it's now forwarded back to your Mac. No tailnet ACL changes, no extra listeners exposed on your Mac, encrypted by the same SSH transport you're already using.

If your tailnet ACL does allow remote → Mac (uncommon for tagged-devices → user setups), there's also an opt-in docker-compose.tailscale.yml that adds a Tailscale-interface binding — see comments in that file.

Persistent model cache (opus economics)

Avoid re-downloading the 554 GiB Kimi K2.6 weights every day:

./scripts/vol-up.sh opus            # one-time ~$6 + ~1-2 hr download
./scripts/deploy-vast.sh opus       # subsequent: 3-5 min cold start
./scripts/destroy.sh opus-vast      # stops compute, keeps volume
./scripts/vol-down.sh opus          # delete volume (end of project)

Monthly cost: ~$986 for 80 hrs/mo of opus use (4 hrs/day × 20 days) — about 60% cheaper than Anthropic Opus API at typical agentic-coding token mix.

Caveat: Vast volumes are pinned to a specific host. If that machine disappears, the volume is unavailable until it comes back. RunPod network volumes (host-independent) aren't wired in this repo yet.

Things we learned the hard way

Field-tested gotchas baked into the scripts as comments and filters:

Vast verified ≠ datacenter. verified: {eq: true} means "host passes basic reliability checks" (marketplace tier, Docker-only isolation). The actual ZDR/HIPAA filter is datacenter: {eq: true} (ISO 27001, Tier 3/4, BAA-eligible). deploy-vast.sh hardcodes the latter.
Vast rents whole hosts. Search must use num_gpus: {eq: N} not gte: N — otherwise picking an 8-GPU host for a 4-GPU TP config double-bills.
CUDA forward-compat doesn't work on consumer Ada. RTX 4090 hosts with driver < 580 (cuda_max_good < 13.0) fail with cudaInit error 804. Filter forces ≥ 13.0.
runpod/worker-v1-vllm has no :stable or :latest tag — only versioned tags. :stable silently stalls forever. deploy-serverless.sh pins to a known-good version.
RunPod Restricted API-key scope returns 403 on /v2/<id>/openai/v1. Use All scope for serverless inference.
Plain HTTP on Vast. Vast direct-port-forwarding is http://<host>:<port>, not HTTPS. The bearer token is the only thing keeping the endpoint private. Adequate for personal use given the bearer; run a Caddy/Cloudflared sidecar for full TLS.
Some multi-GPU Vast hosts have broken CDI runtime. A subset fail container creation with "unresolvable CDI devices." Tear down and pick a different operator — per-host bug, not provider-wide.
Vast Serverless isn't wired here. Their model is Python SDK + @app.remote() handlers, not a flag on top of pods. Tracked as a follow-up PR.
RunPod serverless workers go "unhealthy" on FP8 cold start with sonnet. Diagnosed but not yet root-caused — likely worker-v1-vllm + DeepSeek V4 incompat. Use the Vast pod path for sonnet today.

How `zdr-coder` compares to similar projects

Project	Closeness	Differs
Leafcloud `tf-leafcloud-opencode`	~70%	OpenCode TUI (not Cline), CIDR allowlist, Leafcloud-only, no BAA
OpenClaw + vLLM on Vast.ai / Salad	~65%	OpenClaw runtime, no LiteLLM Anthropic shim
Netclode	~55%	Mobile/iOS client, Ollama not vLLM, k3s + microVM-per-session
ZeroClaw + LiteLLM + vLLM in Docker	~50%	DGX Spark focus, ZeroClaw not Cline
BentoVLLM / OpenLLM	~50%	Just the "model → OpenAI endpoint" piece

Differentiator: nobody else ships VSCodium + Cline + LiteLLM + rented-GPU vLLM + serverless mode + HIPAA-eligible host + verified Groq API ZDR posture as a single one-line-deploy template.

Caveats

BAA is a separate process on every provider — RunPod, Vast, Groq all gate it behind sales/email. None are self-serve clickwrap with a counter-signed PDF on file. Plan ~1-5 business days.
Cold start is slow. Pods: ~10-20 min for haiku/sonnet, ~20-30 min for opus. Serverless: 3-10 min on first request after scale-to-zero. Run profiles in parallel to overlap warmups.
80GB datacenter supply is thin. Sonnet (4× A100/H100 80GB) and opus (8× H100 80GB) Secure-Cloud inventory rotates hourly. Have GPU_NAME="H200" as a fallback.
No persistent vLLM cache by default (except via vol-up.sh). Weights re-download each fresh pod.
Hugging Face anonymous works for most models. Qwen2.5-Coder-32B-AWQ and DeepSeek V4 Flash are open-weight; Kimi K2.6 too. Gated models need HF_TOKEN in .env.
Parallel mode billing. All three tiers running = ~$18-30/hr. Stop tiers you aren't testing with ./scripts/destroy.sh <profile>.

Files

.
├── README.md                       # this file
├── COMPLIANCE.md                   # full Level-by-Level compliance mapping
├── LICENSE                         # MIT
├── start.command / start.bat       # double-click launchers (Mac / Windows)
├── stop.command / stop.bat         # double-click teardown
├── docker-compose.yml              # LiteLLM container
├── litellm/config.yaml             # model-ID routes
├── gpu-node/
│   ├── Dockerfile                  # vLLM image
│   └── start.sh                    # container entrypoint
├── scripts/
│   ├── install-prereqs.sh          # macOS/Linux installer
│   ├── install-prereqs.ps1         # Windows installer
│   ├── api-up.sh                   # Level 3 — Groq API mode
│   ├── aider-up.sh                 # one-time install of Aider (terminal client)
│   ├── aider.sh                    # launch Aider pointed at the local proxy
│   ├── openhands-up.sh             # browser-based agent UI for non-developers
│   ├── openhands-down.sh           # stop OpenHands
│   ├── deploy.sh                   # Level 6 — RunPod always-on pod
│   ├── deploy-vast.sh              # Level 6 — Vast.ai pod (recommended)
│   ├── deploy-serverless.sh        # Level 6 — RunPod serverless
│   ├── vol-up.sh / vol-down.sh     # Vast persistent volume management
│   ├── destroy.sh                  # teardown (any profile, any provider)
│   ├── preflight.sh                # validate prereqs + .env
│   └── smoketest.sh                # end-to-end path test
├── .env.example                    # API key template
└── .gitignore

Troubleshooting

smoketest.sh returns FAIL — read its output; it names the broken hop.

403 Forbidden from RunPod serverless — your RUNPOD_API_KEY is Restricted scope. Recreate with All scope.

Serverless worker stuck "initializing" or "unhealthy" — check the RunPod dashboard for that worker's logs. Common causes: template image tag doesn't exist, GPU pool capacity, or vLLM init failure for FP8 models on non-Hopper hardware.

vLLM "out of memory" — shrink MAX_LEN or lower GPU_UTIL. Haiku at 8K already exhausts KV cache on 24GB after CUDA-graph capture; default is 4K.

Cold-start request hits Cloudflare 524 — the sync /openai/v1 path has a 120s edge timeout. Worker is fine; subsequent requests succeed once warmed.

Vast vLLM crashes with cudaInit error 804 — driver too old for our container's CUDA libs. Filter forces cuda_max_good ≥ 13.0.

Vast "Pulling fs layer" stalls — host can't reach GHCR (typical of CN-located hosts). Filter inet_down ≥ 500 Mbps.

Vast picks an 8-GPU host when you want 4 — Vast rents whole hosts. Script uses num_gpus: {eq: N} to avoid this.

Reporting vulnerabilities

Open a private security advisory on this repository's GitHub Security tab. No bounty program; aim to respond within 5 business days.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
.github/workflows		.github/workflows
gpu-node		gpu-node
litellm		litellm
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
COMPLIANCE.md		COMPLIANCE.md
LICENSE		LICENSE
README.md		README.md
docker-compose.tailscale.yml		docker-compose.tailscale.yml
docker-compose.yml		docker-compose.yml
start.bat		start.bat
start.command		start.command
stop.bat		stop.bat
stop.command		stop.command

Folders and files

Latest commit

History

Repository files navigation

zdr-coder

📚 If you've never used a terminal, read this first

What you'll get

What you'll need

Step-by-step setup

1. Install Docker Desktop (10 min)

2. Get your free Groq API key (3 min)

3. Turn on Zero Data Retention (1 min — important for privacy)

4. Download this project (2 min)

5. Start everything (one double-click)

6. Use it

7. Stop when done

What to do if something goes wrong

Costs in plain language

Why this exists

Pick your privacy level

What you get from this repo

Five-minute setup

Pick a client

Available model IDs

Honest tier mapping — what actually matches each Claude tier?

Performance caveats — read this before betting on these aliases

What's actually wired in API mode

Compliance posture

Architecture

Pod vs serverless vs API — when each wins

Cost comparison — live snapshot, May 2026

Opus-tier on open weights — Kimi K2.6 (already wired as opus-vast)

Other recent OSS frontier releases (April–May 2026) worth knowing

DeepSeek V4 Pro on AWS (Level 6 + L4-tier compliance) — when this makes sense

Setup detail

Pick your provider(s)

Deploy and tear down

Running multiple tiers in parallel

Using Cline from a remote SSH host (VSCodium Remote-SSH, Tailscale SSH, etc.)

Persistent model cache (opus economics)

Things we learned the hard way

How zdr-coder compares to similar projects

Caveats

Files

Troubleshooting

Reporting vulnerabilities

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Opus-tier on open weights — Kimi K2.6 (already wired as `opus-vast`)

How `zdr-coder` compares to similar projects

Packages