Developer AI Lab

A reproducible hardware-sizing harness for self-hosting coding-assistant models. It answers one question: given a model family and N developers, which self-hosted hardware serves them within SLO, and at what $/dev? See docs/methodology.md for the full framing, and docs/results.md for how results are produced and visualized.

▶ Live demo: developer-ai-lab.streamlit.app — the dashboard over this repo's committed results, no setup needed.

⚠️ Benchmark runs provision real, paid GPU pods on RunPod, billed per second (a typical validated run costs $0.10–$1). Everything else — tests, dashboard, configuration — is free and offline. Details below.

What it answers

For a given model and team size N, which hardware serves N concurrent developers within SLO?
What is the $/dev-month for self-hosting on each hardware option?
How does $/dev change with team size, and where does each GPU stop holding the SLO?

Principle: the GPU pod runs inference only

The expensive resource is the GPU, and it is needed only for model inference. Everything else — the agent, builds, tests, load generation, scoring — runs off the GPU. Models are driven through the agent we actually use, kept model-independent:

Claude Code (headless)  ->  LiteLLM (Anthropic API)  ->  destream proxy  ->  vLLM (OpenAI API)  ->  model

The destream proxy (scripts/destream_proxy.py) sits between LiteLLM and vLLM because vLLM's qwen3 streaming tool-call parser is buggy (tool calls leak into message content); the proxy forces the upstream call to be non-streaming and re-emits the reply as SSE chunks.
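As an illustration of the "non-streaming in, SSE out" idea, here is a minimal sketch. It is not the code in scripts/destream_proxy.py; the function names `reply_to_sse` and `_sse` are hypothetical, and the chunk layout assumes the OpenAI chat-completions streaming format (`chat.completion.chunk` objects followed by `data: [DONE]`).

```python
import json
from typing import Iterator


def _sse(payload: dict) -> str:
    """Serialize one payload as a single server-sent event."""
    return f"data: {json.dumps(payload)}\n\n"


def reply_to_sse(completion: dict) -> Iterator[str]:
    """Replay a finished (non-streaming) chat completion as SSE stream chunks."""
    base = {
        "id": completion["id"],
        "object": "chat.completion.chunk",
        "created": completion["created"],
        "model": completion["model"],
    }
    for choice in completion["choices"]:
        message = choice["message"]
        delta = {"role": message.get("role", "assistant")}
        if message.get("content"):
            delta["content"] = message["content"]
        if message.get("tool_calls"):
            # Streamed tool calls carry an index; each call is sent whole in one chunk.
            delta["tool_calls"] = [
                {"index": i, **call} for i, call in enumerate(message["tool_calls"])
            ]
        # First chunk: the whole reply at once.
        yield _sse({**base, "choices": [
            {"index": choice["index"], "delta": delta, "finish_reason": None}]})
        # Second chunk: empty delta carrying the finish reason.
        yield _sse({**base, "choices": [
            {"index": choice["index"], "delta": {},
             "finish_reason": choice.get("finish_reason")}]})
    yield "data: [DONE]\n\n"
```

Because the tool call is parsed from a complete upstream reply, nothing depends on the streaming parser; the client still sees a stream.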

Swapping models is a config change (configs/models/<family>-<quant>.yaml + configs/hardware/<hw>.yaml; the LiteLLM upstream via LITELLM_UPSTREAM_MODEL / VLLM_BASE) — never new code. Works for Qwen, GLM, MiniMax, etc.

Benchmarks

The lab runs a benchmark with a chosen model on a hardware config. A benchmark is a self-contained folder under benchmarks/<name>/ that defines the team sizes to test (devs), the task and its prompts, and the validity gates — the model family and the hardware are the run axes you vary:

make run BENCHMARK=<benchmark> MODEL=<family> HARDWARE=<hardware>

Two benchmarks ship built-in:

dev-load — the capacity benchmark. N agents concurrently build the same small TypeScript module; each N in the devs list is run directly (holds SLO? + $/dev at that N). Team sizes: [4, 8] (edit scenario.yaml to add more). Pick the model with MODEL=<family>.
smoke — stack pre-flight (make validate). A single-agent minimal task to confirm the serving + agent + gate stack works before a full capacity run.

How a benchmark run works

The run chooses a model family (MODEL=<family>, e.g. qwen3-coder). The variant (quant) is resolved per hardware: the hardware config declares supported_quants (e.g. [fp8, awq, bf16]), and compose.py picks the model file that matches the first supported quant (configs/models/<family>-<quant>.yaml). This means the chosen model targets the right quantization automatically on each GPU without manual overrides.
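The resolution rule can be sketched as follows. This is an illustration only, not compose.py itself: `pick_model` is a hypothetical name, and the model configs are assumed to be already loaded into dicts carrying the `family` and `quant` fields.

```python
def pick_model(models: list[dict], family: str, supported_quants: list[str]) -> dict:
    """Return the model config for `family` matching the first supported quant."""
    for quant in supported_quants:          # hardware preference order
        for model in models:
            if model["family"] == family and model["quant"] == quant:
                return model
    raise LookupError(
        f"no model file for family {family!r} with any of {supported_quants}"
    )


models = [
    {"family": "qwen3-coder", "quant": "fp8"},
    {"family": "qwen3-coder", "quant": "awq"},
]
# A GPU that prefers fp8 gets the fp8 variant; one that lists only awq gets awq.
print(pick_model(models, "qwen3-coder", ["fp8", "awq", "bf16"])["quant"])
print(pick_model(models, "qwen3-coder", ["awq"])["quant"])
```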

Each N in the benchmark's devs list is tested directly — N agents run in parallel, metrics are collected from /metrics, and gates are scored. The output per N is: holds_slo (bool), cost_per_dev_month, median_tps, median_ttft, valid, tokens_per_dev.

flowchart LR
    subgraph LOCAL["your machine"]
        O["run_benchmark"]
        subgraph CON["Docker toolchain container<br/>base (Node · Docker · claude) + scenario overlay"]
            CC["N × Claude Code agents"] --> LL["LiteLLM → destream proxy (local)"]
        end
        GATES["gates scored per N<br/>→ validity filter + report"]
    end
    subgraph POD["GPU pod"]
        V["vLLM<br/>inference only"]
    end
    O -->|"RunPod API · create / terminate"| POD
    LL -->|"ssh -L 8000 tunnel"| V
    O -->|"per devs level"| CON
    CON -.->|"metrics + gate results"| GATES

Gates are a minimum-validity filter, not a quality score. They confirm each agent did real work (so the performance and token numbers are trustworthy) but do not rank model quality — that is out of scope for this hardware-sizing question.
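A gate, as described here, is a shell command plus an expected exit code. The sketch below shows one way such a filter could be evaluated; it is not the repo's gates module. The names `run_gate` and `agent_is_valid`, the default `expect_exit` of 0, and the timeout are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def run_gate(gate: dict, workspace: Path, timeout_s: float = 120.0) -> bool:
    """Run the gate's shell command at the workspace root and compare exit codes."""
    try:
        proc = subprocess.run(
            gate["cmd"], shell=True, cwd=workspace,
            capture_output=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung gate counts as a failure
    return proc.returncode == gate.get("expect_exit", 0)  # default 0 is assumed


def agent_is_valid(gates: list[dict], workspace: Path) -> bool:
    """An agent's run counts as valid only if every gate passes."""
    return all(run_gate(gate, workspace) for gate in gates)
```

The result is a pass/fail flag per agent. It says the agent produced real work, not how good that work is.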

Running

Local requirements: macOS or Linux, Python 3.11+, Docker (daemon running), ssh and rsync. Plus a RunPod account with credit — every make run creates a real, per-second-billed GPU pod (see RunPod & costs).

One-time local setup:

cp .env.example .env            # RUNPOD_API_KEY + SSH_KEY_PATH (passphrase-less); optional HF_TOKEN
pip install -r requirements-orchestrator.txt

Build the base toolchain image (needed once):

docker build -t dail-toolchain -f infra/toolchain/Dockerfile .

Run the capacity benchmark with a model on a specific hardware:

make run BENCHMARK=dev-load MODEL=qwen3-coder HARDWARE=l40s
make run BENCHMARK=dev-load MODEL=glm-5.2 HARDWARE=h200 GPUS=8   # GLM-5.2 on 8× H200

Validate the stack first (cheap, ~1 short agent):

make validate HARDWARE=l40s

Both create the pod, run, pull results to results/ (gitignored, config-keyed — see below), and terminate the pod so per-second billing stops.

Reading results

Runs are laid out config-keyed, so the tree mirrors model → hardware → benchmark:

results/<family>-<quant>/<hardware>-<gpus>gpu/<benchmark>/<run_id>/
  report.json               # flat by_devs summary (kept)
  artifacts/n<N>/agent<i>/  # per-agent workspace + transcripts (heavy)

report.json has a flat by_devs structure: one entry per N with holds_slo, cost_per_dev_month, median_tps, median_ttft, valid, tokens_per_dev, plus a timeline and pod_cost_usd (measured pod wall-clock × the hardware's list $/GPU-hr). The latest run per config+benchmark is the lexically-max <run_id>.
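The "lexically-max run_id" rule is easy to reproduce when scripting against the results tree. A minimal sketch, assuming only the directory layout shown above; `latest_report` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path
from typing import Optional


def latest_report(benchmark_dir: Path) -> Optional[Path]:
    """Return report.json of the lexically greatest run_id under one
    results/<family>-<quant>/<hardware>-<gpus>gpu/<benchmark>/ directory."""
    run_dirs = [p for p in benchmark_dir.iterdir() if (p / "report.json").is_file()]
    if not run_dirs:
        return None
    return max(run_dirs, key=lambda p: p.name) / "report.json"
```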

make dashboard          # benchmark → N → model×hardware viewer (Streamlit + Plotly)
make prune-artifacts    # drop artifacts/ of all but the latest run per config (keeps every report.json)

The dashboard lists benchmarks, then for a chosen team size compares every model×hardware combo ranked by $/dev (cheapest that holds SLO wins), with per-run detail (timeline, token totals, gate outcomes). The smoke pre-flight is hidden. The small report.json files are committed — they feed the hosted demo at developer-ai-lab.streamlit.app (.gitignore whitelists them); the heavy artifacts/ stay gitignored. The dashboard reads results/**/report.json.
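The ranking rule ("cheapest that holds SLO wins") can be sketched as a sort. This is an illustration, not dashboard.py: the row shape is hypothetical, and placing combos that miss the SLO after those that hold it is an assumption about presentation.

```python
def rank_for_team_size(rows: list[dict]) -> list[dict]:
    """Order model x hardware rows for one N: SLO holders first, then by $/dev."""
    return sorted(rows, key=lambda r: (not r["holds_slo"], r["cost_per_dev_month"]))


# Hypothetical rows for a single team size; the numbers are made up.
rows = [
    {"combo": "A", "holds_slo": False, "cost_per_dev_month": 10.0},
    {"combo": "B", "holds_slo": True, "cost_per_dev_month": 40.0},
    {"combo": "C", "holds_slo": True, "cost_per_dev_month": 25.0},
]
ranked = rank_for_team_size(rows)
print([r["combo"] for r in ranked])
```

The first row is the winner only if its `holds_slo` is true. A cheaper combo that misses the SLO never outranks one that holds it.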

BENCHMARK accepts a bare name as a shortcut: make run BENCHMARK=dev-load is the same as BENCHMARK=benchmarks/dev-load/scenario.yaml. MODEL defaults to qwen3-coder; HARDWARE defaults to l40s; GPUS defaults to 1; KEEP=1 leaves the pod running for debugging. QUANT=<quant> forces the model variant instead of the hardware's preference order — the knob that makes same-hardware quantization pairs measurable (e.g. QUANT=awq on an fp8-first GPU); an unservable combination refuses to launch.
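The bare-name shortcut amounts to a small path rule. A sketch under the assumption that anything containing a directory separator or a YAML suffix is treated as an explicit path; `resolve_benchmark` is a hypothetical name, not the Makefile's actual logic.

```python
from pathlib import Path


def resolve_benchmark(arg: str) -> Path:
    """Expand a bare benchmark name to benchmarks/<name>/scenario.yaml."""
    path = Path(arg)
    if len(path.parts) > 1 or path.suffix in {".yaml", ".yml"}:
        return path  # already an explicit path
    return Path("benchmarks") / arg / "scenario.yaml"
```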

Configuration

A run is composed from three independent, reusable axes — benchmark(task+N) × model × hardware — that scripts/lib/compose.py merges into the flat config the pod consumes (written to configs/_composed.yaml, rsync'd to the pod). The model (MODEL=) and hardware (HARDWARE=) are chosen at run time; the benchmark carries the task + devs. Precedence for any shared key is constants < model < hardware < benchmark.serving. Adding a model, a GPU, or a benchmark is a new YAML file — never new code.
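The precedence chain can be pictured as a layered dictionary merge in which later layers win. The sketch below is a simplification of what scripts/lib/compose.py is described as doing, not its code; the function name `merge_layers` and the sample values are hypothetical.

```python
def merge_layers(constants: dict, model: dict, hardware: dict,
                 benchmark_serving: dict) -> dict:
    """Flat merge with precedence constants < model < hardware < benchmark.serving."""
    merged: dict = {}
    for layer in (constants, model, hardware, benchmark_serving):
        merged.update(layer)  # later layers override earlier ones
    return merged


config = merge_layers(
    constants={"max_model_len": 8192, "gpu_memory_utilization": 0.85},
    model={"kv_cache_dtype": "fp8", "max_model_len": 32768},
    hardware={"gpu_memory_utilization": 0.92},
    benchmark_serving={"max_model_len": 65536},
)
print(config)
```

Here the benchmark's `max_model_len` wins over both the model's and the constant, while the hardware's `gpu_memory_utilization` overrides the constant.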

Model — `configs/models/*.yaml` (one file per model variant)

The portable serving recipe for one model variant. Resolution is by the family and quant fields inside the file (matched against the hardware's supported_quants), not by the filename — so any *.yaml works; naming it <family>-<quant>.yaml is just a convention (the shipped qwen3-coder-fp8.yaml / qwen3-coder-awq.yaml declare family: qwen3-coder with quant: fp8 / awq). Required: family, quant, model, served_model_name, vllm_version, image, container_disk_gb.

| Field | Required | What it is |
| --- | --- | --- |
| `family` | yes | Model family name (selected at run time via `--model` / Make `MODEL=`; matched against `configs/models/*.yaml`) |
| `quant` | yes | Quantization variant (e.g. `fp8`, `awq`, `bf16`); matched against hardware `supported_quants` |
| `model` | yes | Hugging Face repo id served by vLLM |
| `served_model_name` | yes | Short name vLLM/LiteLLM expose (and the pod label) |
| `vllm_version` | yes | Pinned vLLM version (must match the `image`'s torch/CUDA) |
| `image` | yes | RunPod base image; its torch/CUDA must satisfy `vllm_version` |
| `container_disk_gb` | yes | Pod container disk (large enough for the weights) |
| `pip_constraints` | no | Extra pip constraints (e.g. `["transformers<5"]`) |
| `tool_call_parser` | no | vLLM tool-call parser (e.g. `qwen3_xml`, `glm47`) |
| `reasoning_parser` | no | vLLM reasoning parser (e.g. `glm45`) |
| `enable_auto_tool_choice` | no | Enable vLLM auto tool choice |
| `kv_cache_dtype` | no | e.g. `fp8` |
| `extra_args` | no | Raw extra vLLM CLI args (list) |
| `env` | no | Env vars exported before vLLM starts |

tensor_parallel_size is not here — it is derived from the GPU count (--gpus).

Hardware — `configs/hardware/<hw>.yaml`

One GPU type, priced per GPU. The GPU count is the run's --gpus N (Make GPUS=N), not a field here.

| Field | What it is |
| --- | --- |
| `gpu_type_ids` | One-element list with the exact RunPod GPU id (e.g. `["NVIDIA L40S"]`) |
| `price_usd_per_gpu_hour` | Per-GPU on-demand price (total = price × `--gpus`) |
| `gpu_memory_utilization` | vLLM `--gpu-memory-utilization` for this GPU |
| `supported_quants` | Ordered list of quant variants this GPU supports (e.g. `[fp8, awq, bf16]`); `compose.py` picks the first that matches a model file |
| `provider` / `instance` | Informational annotation (exported to the pod env as `HW_PROVIDER` / `HW_INSTANCE`) |

Benchmark — `benchmarks/<name>/scenario.yaml`

A self-contained benchmark package — the task, not the model (the model is MODEL= at run time). serving: overrides the workload knobs for that run; max_num_seqs is derived by compose.py from max(devs) — do not set it manually.

| Field | What it is |
| --- | --- |
| `name` / `description` | Benchmark identity |
| `devs` | List of team sizes to test (e.g. `[4, 8]`); each N is run directly |
| `slo.max_ttft` | Max acceptable median time-to-first-token per stream (s) |
| `slo.min_tps` | Min acceptable median per-stream decode throughput (tok/s) |
| `serving.max_model_len` | vLLM context window for this run |
| `task.phases` | Ordered list of headless Claude Code sessions: `id`, `title`, `prompt`, `allowed_tools`, `max_turns` |
| `task.gates` | Validity filter: `id`, `description`, `cmd` (shell at workspace root), `expect_exit` |
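The two `slo` fields imply a simple pass/fail check per team size. A minimal sketch, assuming per-stream samples are available as plain lists and that both bounds are inclusive; the real scoring lives in bench_report.py and may differ in detail. The function name mirrors the `holds_slo` output field but is otherwise hypothetical.

```python
from statistics import median


def holds_slo(ttft_s: list[float], tps: list[float],
              max_ttft: float, min_tps: float) -> bool:
    """True if median TTFT is within slo.max_ttft and median decode
    throughput is at least slo.min_tps (both lists must be non-empty)."""
    return median(ttft_s) <= max_ttft and median(tps) >= min_tps


# Made-up per-stream samples for one team size.
print(holds_slo([0.4, 0.6, 2.0], [30.0, 35.0, 40.0], max_ttft=1.0, min_tps=25.0))
```

Using the median means one slow stream does not fail the level on its own, which matches the "median" wording of both fields.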

RunPod & costs

⚠️ Every benchmark run provisions real, paid GPU hardware. Each make run creates an on-demand GPU pod on RunPod, runs, and terminates it. You are billed per second for as long as the pod exists.

Prerequisites

A RunPod account with credit loaded — pods won't start otherwise.
A RunPod API key (console → Settings → API Keys).
A passphrase-less SSH key — its public half is injected into the pod at boot and the orchestrator reaches the pod over SSH.

Put both in .env (gitignored):

cp .env.example .env
# RUNPOD_API_KEY=...
# SSH_KEY_PATH=~/.ssh/runpod_key      # the PRIVATE key; <path>.pub must exist
# HF_TOKEN=hf_...                     # optional: free HF read token; see note below

A free HF_TOKEN (huggingface.co/settings/tokens) is optional but recommended: the pod downloads model weights from the Hugging Face Hub, and anonymous downloads share a low per-IP rate limit (3k resolver requests / 5 min) that 429-stalls partway through a large model. A token raises it to 5k/5 min per-account. start_vllm.sh picks it up from .env.

Which GPU type is used is declared in configs/hardware/<hw>.yaml (gpu_type_ids); the GPU count is the run's --gpus N (Make: GPUS=N, default 1). Each hardware config is a single type — if it has no capacity the run fails, so pick another hardware config.

What you pay for, and the safety net

You pay for the GPU only while the pod exists, and the orchestrators keep that window as short as possible. Three layers guard against runaway spend:

The pod is terminated on every exit path — success, error, Ctrl-C, even SIGTERM.
A watchdog kills the pod if vLLM stalls on startup, generation hangs, or a wall-clock ceiling is hit, and reaps pods orphaned by a crashed run on the next start.
make reap is a panic button that terminates all your pods via the API, any time:

make reap

If you ever interrupt a run in a way that skips cleanup, run make reap and/or check the RunPod console — a forgotten pod bills until it is terminated.
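The first layer, termination on every exit path, is a standard try/finally pattern. The sketch below shows the idea and is not the repo's orchestrator code: `pod_lifetime`, `create`, and `terminate` are hypothetical stand-ins for the RunPod API calls.

```python
import signal
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def pod_lifetime(create: Callable[[], str],
                 terminate: Callable[[str], None]) -> Iterator[str]:
    """Create a pod and guarantee terminate() runs on success, error,
    Ctrl-C (KeyboardInterrupt), and SIGTERM."""
    def _raise_on_sigterm(signum, frame):
        # Turn SIGTERM into an exception so the finally block below runs.
        raise SystemExit(128 + signum)

    previous = None
    in_main_thread = threading.current_thread() is threading.main_thread()
    if in_main_thread:  # signal handlers can only be installed from the main thread
        previous = signal.signal(signal.SIGTERM, _raise_on_sigterm)
    pod_id: Optional[str] = None
    try:
        pod_id = create()
        yield pod_id
    finally:
        if pod_id is not None:
            terminate(pod_id)
        if in_main_thread and previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

A hard kill (SIGKILL) or a power loss still skips the `finally` block, which is why the watchdog and make reap exist as further layers.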

Observed costs

Real costs from our own runs. A run = create → serve/generate → terminate; cost = the GPU's list $/hr times how long the pod is up (model download + load + the actual benchmark). Each run computes this for you from the measured wall-clock — pod_cost_usd in report.json, with the per-step timeline showing where it went (see docs/results.md and make dashboard). We'll extend this table as we test more:

| Model | Hardware | Benchmark | Approx. cost |
| --- | --- | --- | --- |
| Qwen3-Coder-30B-A3B-FP8 | 1× NVIDIA L40S | dev-load (N=4) | ~$0.40 |
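The cost arithmetic is wall-clock time multiplied by the list price. A small sketch of that formula, assuming the per-GPU price is multiplied by the GPU count as the hardware config describes; the function is named after the `pod_cost_usd` report field, and the example numbers are illustrative, not measured.

```python
def pod_cost_usd(wall_clock_s: float, price_usd_per_gpu_hour: float,
                 gpus: int = 1) -> float:
    """Pod cost = hours the pod existed x per-GPU list price x GPU count."""
    return wall_clock_s / 3600.0 * price_usd_per_gpu_hour * gpus


# Illustrative only: a pod listed at $29/h in total across 8 GPUs
# ($3.625 per GPU-hour), kept alive for 30 minutes.
print(round(pod_cost_usd(30 * 60, 3.625, gpus=8), 2))
```

The same formula explains why download time matters: every minute the pod spends fetching weights is billed at the full GPU rate.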

Persistent weight cache (huge models)

For very large models the weight download dominates the run — GLM-5.2's ~744GB takes tens of minutes on an 8×H200 pod at ~$29/h, and by default it re-downloads every run. A RunPod Network Volume caches the weights across pods so they download once:

In the RunPod console, create a Network Volume (~1TB) in a data center that has your GPUs — the volume pins every run to that DC, so pick one with 8×H200 capacity.

Put its id and DC in .env:

DAIL_NETWORK_VOLUME_ID=<volume id>
DAIL_DATA_CENTER_ID=<e.g. EU-RO-1>

The volume mounts at /workspace, so /workspace/huggingface (the HF cache) survives pod teardown. The first run downloads the weights (slow, paid once); later runs mount the same volume and load from it (~minutes). A network volume has a monthly storage cost (~$0.05–0.07/GB), so delete it when you're done iterating on that model.

Layout

configs/models/, configs/hardware/ — the model and hardware config axes (fields above).
benchmarks/<name>/ — the benchmark packages (scenario.yaml + phases/*.md + optional Dockerfile); see Configuration above to add one.
scripts/ — run entry (run_benchmark.py → bench_sdd.run) and the result viewer (dashboard.py); scripts/lib/ is tested pure logic (compose, gates, vllm_metrics, bench_report, report_terminal, timeline, command builders, …). SLO scoring lives in bench_report.py.
infra/litellm/ — the Anthropic↔OpenAI proxy. infra/toolchain/ — the local toolchain image.
docs/ — user docs (goals, methodology, results).

make test runs the unit suite (offline, free); make help lists all targets. The three requirements files map to install targets: requirements-orchestrator.txt (your machine), requirements-report.txt (the dashboard), requirements.txt (the pod / toolchain image). See docs/ for goals, methodology, and validated results.

Contributing & license

Contributions welcome — see CONTRIBUTING.md (short version: adding a model/GPU/benchmark is a YAML file, never code; never launch a paid pod to verify a change). Licensed under the MIT License.
