srclight · tofutim · Mar 23, 2026
diff --git a/.claude/skills/benchmarking/SKILL.md b/.claude/skills/benchmarking/SKILL.md
@@ -0,0 +1,93 @@
+# Benchmarking
+
+Run quality benchmarks across models, interpret results, and update quality scores.
+
+## When to Use
+
+When you need to evaluate model quality, for example after adding new models, when comparing providers, or when auditing tier assignments.
+
+## Steps
+
+### 1. Select models to benchmark
+
+```
+get_fastest(min_tier="A", count=10, verified=True)
+```
+
+Or target specific models:
+```
+scan(provider="nvidia", verify=True)
+```
+
+### 2. Run the benchmark
+
+Via MCP:
+```
+benchmark(min_tier="A", provider="nvidia")
+```
+
+Or for a specific model:
+```
+benchmark(model_id="nvidia/llama-3.1-nemotron-ultra-253b-v1")
+```
+
+The benchmark runs 5 coding challenges and scores each one as pass or fail. Results are stored in the quality database (`~/.model-radar/quality.json`) and affect future `get_fastest()` rankings.
+
+### 3. Interpret results
+
+Quality scores are 0-5 (number of challenges passed):
+- **5/5**: Excellent -- reliable for coding tasks
+- **4/5**: Good -- minor issues, generally usable
+- **3/5**: Acceptable -- may struggle with complex tasks
+- **2/5 or below**: Avoid for coding -- consider downgrading tier
+
+### 4. Cross-validate with judge evaluation
+
+For deeper quality assessment, use LLM-as-judge:
+
+```
+judge(
+  prompt="Write a Python function that finds the longest common subsequence of two strings",
+  rubric=["correctness", "efficiency", "code_quality"],
+  scale="1-5",
+  count=3
+)
+```
+
+See `docs/playbook-llm-as-judge.md` for full evaluation patterns.
+
+### 5. Update tier assignments if needed
+
+If benchmark results consistently disagree with the assigned tier:
+1. Check SWE-bench Verified for updated scores
+2. Update the `tier` field in `src/model_radar/providers.py`
+3. Run tests to ensure no tier validation failures
+
+### 6. Batch benchmark for comprehensive audit
+
+To benchmark all models from a provider:
+```
+scan(provider="provider_key", verify=True)
+benchmark(provider="provider_key")
+```
+
+To benchmark across all configured providers:
+```
+benchmark(min_tier="B")
+```
+
+## Interpreting Benchmark vs Tier Disagreements
+
+| Benchmark | Tier | Action |
+|-----------|------|--------|
+| 5/5 | B or lower | Check SWE-bench, consider upgrade |
+| 0-2/5 | A or higher | The model may be flaky, so re-run. If the low score is consistent, downgrade |
+| 3-4/5 | matches tier | No action needed |
+
+## Checklist
+
+- [ ] Models selected (verified alive first)
+- [ ] Benchmark run completed
+- [ ] Results interpreted (scores + tier alignment)
+- [ ] Tier adjustments made if needed (in providers.py)
+- [ ] Tests pass after any tier changes
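
The score bands in step 3 and the disagreement table of the benchmarking skill above can be sketched as a small lookup. This is only an illustration: `score_label` and `tier_action` are hypothetical helpers, not part of model-radar, and the tier ordering is taken from the tier list in the provider-setup skill. It assumes "B or lower" means B and C, "A or higher" means A through S+, and that any combination the table does not list needs no action.

```python
# Hypothetical helpers illustrating the benchmarking skill's rules.
# Tier order (highest first) is assumed from the provider-setup skill.
TIER_ORDER = ["S+", "S", "A+", "A", "A-", "B+", "B", "C"]


def score_label(passed: int) -> str:
    """Map the number of challenges passed (0-5) to the step 3 label."""
    if not 0 <= passed <= 5:
        raise ValueError("score must be between 0 and 5")
    if passed == 5:
        return "Excellent"
    if passed == 4:
        return "Good"
    if passed == 3:
        return "Acceptable"
    return "Avoid for coding"


def tier_action(passed: int, tier: str) -> str:
    """Suggest an action when a score disagrees with the assigned tier."""
    rank = TIER_ORDER.index(tier)  # lower index means higher tier
    if passed == 5 and rank >= TIER_ORDER.index("B"):
        return "Check SWE-bench, consider upgrade"
    if passed <= 2 and rank <= TIER_ORDER.index("A"):
        return "Re-run; downgrade if consistent"
    # Assumption: combinations the table does not list need no action.
    return "No action needed"
```

For example, `tier_action(5, "C")` suggests checking SWE-bench for a possible upgrade, while `tier_action(4, "A")` reports that no action is needed.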
diff --git a/.claude/skills/provider-setup/SKILL.md b/.claude/skills/provider-setup/SKILL.md
@@ -0,0 +1,84 @@
+# Provider Setup
+
+Add a new LLM provider to model-radar end-to-end.
+
+## When to Use
+
+When adding a new provider (e.g., a new free LLM API) to the model-radar catalog.
+
+## Steps
+
+### 1. Research the provider
+
+Gather:
+- Provider name and API base URL
+- Authentication method (Bearer token, API key in query param, no auth)
+- Environment variable convention (e.g., PROVIDER_API_KEY)
+- Available models with their IDs
+- Free tier availability and rate limits
+- Model quality tiers (check SWE-bench Verified if available)
+
+### 2. Add provider definition in `src/model_radar/providers.py`
+
+Add a new `Provider` entry to the `PROVIDERS` dict:
+
+```python
+"provider_key": Provider(
+    name="Provider Name",
+    base_url="https://api.provider.com/v1",
+    env_vars=["PROVIDER_API_KEY"],
+    models=[
+        Model("org/model-name", "Model Display Name", tier="A", ctx=32768),
+    ],
+),
+```
+
+Key fields:
+- `base_url`: Base URL of the provider's OpenAI-compatible chat/completions API
+- `env_vars`: List of environment variable names for the API key
+- `tier`: SWE-bench tier (S+, S, A+, A, A-, B+, B, C)
+- `ctx`: Context window size in tokens
+
+### 3. Handle auth quirks in `src/model_radar/runner.py`
+
+Most providers use `Authorization: Bearer <key>`. If the new provider differs:
+- Query param auth: add to the `if provider == "..."` block around line 55
+- Token auth: add to the Token block around line 62
+- No auth: add to `_NO_AUTH_PROVIDERS` in `config.py`
+
+### 4. Add provider sync if the provider has a models API
+
+If the provider has a `/models` endpoint for dynamic model discovery, add a fetch function in `src/model_radar/provider_sync.py`.
+
+### 5. Add setup guide
+
+Add signup instructions in `src/model_radar/guides.py` so `setup_guide("provider_key")` returns useful onboarding steps.
+
+### 6. Write tests
+
+Add test coverage in `tests/test_providers.py`:
+- Provider key exists in PROVIDERS
+- Models have valid tiers
+- Base URL is well-formed
+
+### 7. Verify
+
+```sh
+python -m pytest tests/ -v
+model-radar providers                    # should list new provider
+model-radar scan --provider provider_key # should ping successfully
+```
+
+Via MCP:
+```
+list_providers()          # new provider shows up
+scan(provider="provider_key", verify=True)  # models respond
+```
+
+## Checklist
+
+- [ ] Provider added to PROVIDERS dict in providers.py
+- [ ] Auth handling in runner.py (if non-standard)
+- [ ] Setup guide in guides.py
+- [ ] Tests added and passing
+- [ ] Verified via CLI scan and MCP tools
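
The three checks listed in step 6 of the provider-setup skill above (key exists, tiers are valid, base URL is well-formed) might look roughly like the sketch below. The `Provider` and `Model` dataclasses here are simplified stand-ins so the example runs on its own; the real classes in `src/model_radar/providers.py` may have different fields. `check_provider` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Tier names as listed in the skill's "Key fields" section.
VALID_TIERS = {"S+", "S", "A+", "A", "A-", "B+", "B", "C"}


@dataclass
class Model:  # simplified stand-in, not the real model-radar class
    model_id: str
    label: str
    tier: str
    ctx: int


@dataclass
class Provider:  # simplified stand-in, not the real model-radar class
    name: str
    base_url: str
    env_vars: list[str]
    models: list[Model] = field(default_factory=list)


def check_provider(key: str, providers: dict[str, Provider]) -> list[str]:
    """Return a list of problems found for one provider entry."""
    provider = providers.get(key)
    if provider is None:
        return [f"{key!r} missing from PROVIDERS"]
    problems = []
    url = urlparse(provider.base_url)
    if url.scheme not in ("http", "https") or not url.netloc:
        problems.append(f"malformed base_url: {provider.base_url!r}")
    for model in provider.models:
        if model.tier not in VALID_TIERS:
            problems.append(f"{model.model_id}: invalid tier {model.tier!r}")
    return problems


PROVIDERS = {
    "provider_key": Provider(
        name="Provider Name",
        base_url="https://api.provider.com/v1",
        env_vars=["PROVIDER_API_KEY"],
        models=[Model("org/model-name", "Model Display Name", tier="A", ctx=32768)],
    ),
}
```

In a pytest file, each check could become an assertion such as `assert check_provider("provider_key", PROVIDERS) == []`.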
diff --git a/.claude/skills/release/SKILL.md b/.claude/skills/release/SKILL.md
@@ -0,0 +1,84 @@
+# Release
+
+Version bump, PyPI publish, and MCP registry update for model-radar.
+
+## When to Use
+
+When you need to cut a new release of model-radar.
+
+## Steps
+
+### 1. Pre-flight checks
+
+```sh
+python -m pytest tests/ -v
+ruff check src/ tests/
+```
+
+All tests must pass and lint must be clean before proceeding.
+
+### 2. Decide the version
+
+Follow semver: MAJOR.MINOR.PATCH
+- PATCH: bug fixes, provider data updates
+- MINOR: new MCP tools, new providers, new features
+- MAJOR: breaking API changes (rare)
+
+Check current version:
+```sh
+grep '^version' pyproject.toml
+```
+
+### 3. Bump version in BOTH files
+
+These must match exactly:
+
+1. `pyproject.toml` -> `version = "X.Y.Z"`
+2. `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`
+
+### 4. Update server.json if needed
+
+`server.json` is the MCP registry manifest. If the PyPI package version it records is stale, update its `version` fields to match the new release.
+
+### 5. Commit the version bump
+
+```sh
+git add pyproject.toml src/model_radar/__init__.py server.json
+git commit -m "bump: vX.Y.Z -- <one-line summary of changes>"
+```
+
+### 6. Merge to master and tag
+
+```sh
+git checkout develop && git merge --no-ff feature/xxx   # if on feature branch
+git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
+git tag -a vX.Y.Z -m "vX.Y.Z -- <summary>"
+git checkout develop
+git push origin master develop --tags
+```
+
+### 7. Create GitHub release
+
+```sh
+gh release create vX.Y.Z --title "vX.Y.Z" --notes "<release notes>"
+```
+
+This triggers the publish workflow which:
+- Builds and publishes to PyPI via trusted publisher (OIDC)
+- Publishes to MCP Registry via mcp-publisher (OIDC)
+
+### 8. Verify
+
+- Check PyPI: `pip install model-radar-mcp==X.Y.Z` works
+- Check MCP Registry listing is updated
+
+## Checklist
+
+- [ ] Tests pass
+- [ ] Lint clean
+- [ ] Version bumped in pyproject.toml AND __init__.py
+- [ ] server.json version updated if stale
+- [ ] Commit message includes version and summary
+- [ ] Merged develop -> master with --no-ff
+- [ ] Tag created and pushed
+- [ ] GitHub release created (triggers CI publish)
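
Step 3 of the release skill above requires the versions in `pyproject.toml` and `__init__.py` to match exactly. A pre-commit check for that could be sketched as follows. The helper names are hypothetical, and the regular expressions assume each file holds a line of the form shown in the skill (`version = "X.Y.Z"` and `__version__ = "X.Y.Z"`), mirroring the `grep '^version'` command in step 2.

```python
import re
from pathlib import Path

# Patterns assume the exact line shapes shown in the release skill.
PYPROJECT_VERSION = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)
INIT_VERSION = re.compile(r'^__version__\s*=\s*"([^"]+)"', re.MULTILINE)


def extract_version(pattern: re.Pattern, text: str, source: str) -> str:
    """Return the first version string matched by pattern in text."""
    match = pattern.search(text)
    if match is None:
        raise ValueError(f"no version found in {source}")
    return match.group(1)


def versions_match(pyproject_text: str, init_text: str) -> bool:
    """True when both files declare the same version string."""
    pyproject = extract_version(PYPROJECT_VERSION, pyproject_text, "pyproject.toml")
    package = extract_version(INIT_VERSION, init_text, "__init__.py")
    return pyproject == package


def check_repo(root: Path) -> bool:
    """Read both files under root (paths taken from the release skill)."""
    pyproject_text = (root / "pyproject.toml").read_text()
    init_text = (root / "src" / "model_radar" / "__init__.py").read_text()
    return versions_match(pyproject_text, init_text)
```

Calling `check_repo(Path("."))` from the repository root before step 5 would flag a one-file bump before it is committed.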
diff --git a/.gitignore b/.gitignore
@@ -10,8 +10,9 @@ build/
 # API keys — never commit the real config
 config.json
 
-# CLAUDE.md is private — keep canonical copy in Vault only
-CLAUDE.md
+# Environment files
+.env
+.env.*
 
 # Cursor rules are private — golden copy in Vault, symlink in repo
 .cursor/rules

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,98 @@
+# CLAUDE.md -- Model Radar
+
+## What Is This
+
+MCP server that pings 219+ free coding LLMs across 21 providers, ranks them by real-time latency, and helps AI agents pick the fastest model. Python 3.11+, built on FastMCP + httpx + click.
+
+## Commands
+
+```sh
+pip install -e .                                    # dev install
+pip install -e ".[dev]"                             # with test/lint deps
+python -m pytest tests/ -v                          # run tests (167 tests)
+ruff check src/ tests/                              # lint
+model-radar serve                                   # MCP server (stdio)
+model-radar serve --transport sse --port 8765       # SSE + Streamable HTTP
+model-radar serve --transport sse --port 8765 --web # SSE + web dashboard
+model-radar scan --min-tier S --limit 5             # CLI scan
+model-radar providers                               # list providers
+```
+
+## Key Modules
+
+| Module | Purpose |
+|--------|---------|
+| `server.py` | FastMCP server, all 21 MCP tool definitions |
+| `providers.py` | Provider/model catalog, tier system (S+ through C) |
+| `scanner.py` | Async ping engine, parallel scanning, adaptive rate limiting |
+| `runner.py` | Prompt execution, automatic fallback, batch execution |
+| `judge.py` | LLM-as-judge: rate, compare, batch evaluate |
+| `config.py` | Config management (~/.model-radar/config.json) |
+| `db.py` | SQLite persistence for model catalog and ping results |
+
+Full architecture: `docs/architecture.md`
+
+## MCP Tools (21 total)
+
+**Read-only (no side effects):**
+list_providers, list_models, scan, get_fastest, get_workers, provider_status, server_stats
+
+**Execution (runs prompts on external LLMs):**
+run, ask, batch_run, judge, compare, batch_judge, backtranslate_eval, benchmark
+
+**Write (modifies local config/state):**
+configure_key, refresh_models, setup_workflow, restart_server
+
+**Informational (returns text guidance):**
+setup_guide, host_swap_instructions
+
+## Version Bumps
+
+Update BOTH files together:
+- `pyproject.toml` -> `version = "X.Y.Z"`
+- `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`
+
+## Release Process
+
+```sh
+# All work on develop or feature branches
+git checkout develop && git checkout -b feature/xxx
+# ... work, commit ...
+git checkout develop && git merge --no-ff feature/xxx
+
+# Release: develop -> master, tag
+git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
+git tag -a vX.Y.Z -m "vX.Y.Z description"
+git checkout develop
+git push origin master develop --tags
+```
+
+Publishing is automated: GitHub Actions runs on `release: [published]` to push to PyPI (OIDC) and MCP Registry.
+
+## Do
+
+- Keep the dependency footprint minimal (httpx + mcp + click)
+- Use `docs/` for detailed playbooks; keep this file concise
+- Test with `python -m pytest tests/ -v` before committing
+- Use provider diversity in judge/worker selection
+
+## Don't
+
+- Commit API keys or config.json (keys live in ~/.model-radar/config.json with file mode 0o600)
+- Add heavy dependencies without discussion
+- Remove provider definitions without checking if they're still active
+- Skip the two-file version bump (pyproject.toml + __init__.py)
+- Commit directly to master -- always work on develop or feature branches
+
+## Docs
+
+- `docs/architecture.md` -- module map, data flow, transport, rate limiting
+- `docs/mcp-transport.md` -- transport options, stateless HTTP, client config
+- `docs/playbook-translation-pipeline.md` -- batch translation patterns
+- `docs/playbook-llm-as-judge.md` -- evaluation patterns and judge selection
+
+## Skills
+
+- `.claude/skills/release/` -- Version bump + PyPI publish + MCP registry workflow
+- `.claude/skills/provider-setup/` -- Add a new provider end-to-end
+- `.claude/skills/benchmarking/` -- Run quality benchmarks and interpret results