From 7aed68cae9a1784f48b664a55315b8f4c57afea2 Mon Sep 17 00:00:00 2001
From: Tim Uy <tim@gig8.com>
Date: Mon, 23 Mar 2026 06:52:15 -0700
Subject: [PATCH] =?UTF-8?q?feat:=20AI-firstify=20audit=20=E2=80=94=20add?=
 =?UTF-8?q?=20.claude/skills/,=20public=20CLAUDE.md,=20expand=20MCP=20inst?=
 =?UTF-8?q?ructions?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Create .claude/skills/ with 3 prescriptive skills: release, provider-setup, benchmarking
- Add public CLAUDE.md with build/test/release procedures, do/don't section, tool classification
- Un-gitignore CLAUDE.md (was private-only); add .env/.env.* to .gitignore for safety
- Expand MCP server instructions: separate read-only vs write tools, add setup workflow happy path
- Safety audit: no credentials in committed code, .gitignore covers config.json + .env

Closes #9
---
 .claude/skills/benchmarking/SKILL.md   | 93 ++++++++++++++++++++++++
 .claude/skills/provider-setup/SKILL.md | 84 ++++++++++++++++++++++
 .claude/skills/release/SKILL.md        | 84 ++++++++++++++++++++++
 .gitignore                             |  5 +-
 CLAUDE.md                              | 98 ++++++++++++++++++++++++++
 src/model_radar/server.py              | 24 +++++--
 6 files changed, 380 insertions(+), 8 deletions(-)
 create mode 100644 .claude/skills/benchmarking/SKILL.md
 create mode 100644 .claude/skills/provider-setup/SKILL.md
 create mode 100644 .claude/skills/release/SKILL.md
 create mode 100644 CLAUDE.md

diff --git a/.claude/skills/benchmarking/SKILL.md b/.claude/skills/benchmarking/SKILL.md
new file mode 100644
index 0000000..aa2b370
--- /dev/null
+++ b/.claude/skills/benchmarking/SKILL.md
@@ -0,0 +1,93 @@
+# Benchmarking
+
+Run quality benchmarks across models, interpret results, and update quality scores.
+
+## When to Use
+
+When you need to evaluate model quality (e.g., after adding new models, comparing providers, or auditing the tier assignments).
+
+## Steps
+
+### 1. Select models to benchmark
+
+```
+get_fastest(min_tier="A", count=10, verified=True)
+```
+
+Or target specific models:
+```
+scan(provider="nvidia", verify=True)
+```
+
+### 2. Run the benchmark
+
+Via MCP:
+```
+benchmark(min_tier="A", provider="nvidia")
+```
+
+Or for a specific model:
+```
+benchmark(model_id="nvidia/llama-3.1-nemotron-ultra-253b-v1")
+```
+
+The benchmark runs 5 coding challenges and scores each one pass/fail. Results are stored in the quality database (`~/.model-radar/quality.json`) and affect future `get_fastest()` rankings.
+
+### 3. Interpret results
+
+Quality scores range from 0 to 5 (the number of challenges passed):
+- **5/5**: Excellent -- reliable for coding tasks
+- **4/5**: Good -- minor issues, generally usable
+- **3/5**: Acceptable -- may struggle with complex tasks
+- **2/5 or below**: Avoid for coding -- consider downgrading tier
+
+### 4. Cross-validate with judge evaluation
+
+For deeper quality assessment, use LLM-as-judge:
+
+```
+judge(
+  prompt="Write a Python function that finds the longest common subsequence of two strings",
+  rubric=["correctness", "efficiency", "code_quality"],
+  scale="1-5",
+  count=3
+)
+```
+
+See `docs/playbook-llm-as-judge.md` for full evaluation patterns.
+
+### 5. Update tier assignments if needed
+
+If benchmark results consistently disagree with the assigned tier:
+1. Check SWE-bench Verified for updated scores
+2. Update the `tier` field in `src/model_radar/providers.py`
+3. Run tests to ensure no tier validation failures
+
+### 6. Batch benchmark for comprehensive audit
+
+To benchmark all models from a provider:
+```
+scan(provider="provider_key", verify=True)
+benchmark(provider="provider_key")
+```
+
+To benchmark across all configured providers:
+```
+benchmark(min_tier="B")
+```
+
+## Interpreting Benchmark vs Tier Disagreements
+
+| Benchmark | Tier | Action |
+|-----------|------|--------|
+| 5/5 | B or lower | Check SWE-bench, consider upgrade |
+| 0-2/5 | A or higher | Possibly a flaky model; re-run. If the low score is consistent, downgrade |
+| 3-4/5 | matches tier | No action needed |
+
+## Checklist
+
+- [ ] Models selected (verified alive first)
+- [ ] Benchmark run completed
+- [ ] Results interpreted (scores + tier alignment)
+- [ ] Tier adjustments made if needed (in providers.py)
+- [ ] Tests pass after any tier changes
diff --git a/.claude/skills/provider-setup/SKILL.md b/.claude/skills/provider-setup/SKILL.md
new file mode 100644
index 0000000..b067086
--- /dev/null
+++ b/.claude/skills/provider-setup/SKILL.md
@@ -0,0 +1,84 @@
+# Provider Setup
+
+Add a new LLM provider to model-radar end-to-end.
+
+## When to Use
+
+When adding a new provider (e.g., a new free LLM API) to the model-radar catalog.
+
+## Steps
+
+### 1. Research the provider
+
+Gather:
+- Provider name and API base URL
+- Authentication method (Bearer token, API key in query param, no auth)
+- Environment variable convention (e.g., PROVIDER_API_KEY)
+- Available models with their IDs
+- Free tier availability and rate limits
+- Model quality tiers (check SWE-bench Verified if available)
+
+### 2. Add provider definition in `src/model_radar/providers.py`
+
+Add a new `Provider` entry to the `PROVIDERS` dict:
+
+```python
+"provider_key": Provider(
+    name="Provider Name",
+    base_url="https://api.provider.com/v1",
+    env_vars=["PROVIDER_API_KEY"],
+    models=[
+        Model("org/model-name", "Model Display Name", tier="A", ctx=32768),
+    ],
+),
+```
+
+Key fields:
+- `base_url`: Base URL of the provider's OpenAI-compatible chat/completions endpoint
+- `env_vars`: List of environment variable names for the API key
+- `tier`: SWE-bench tier (S+, S, A+, A, A-, B+, B, C)
+- `ctx`: Context window size
+
+### 3. Handle auth quirks in `src/model_radar/runner.py`
+
+Most providers use `Authorization: Bearer <key>`. If the new provider differs:
+- Query param auth: add to the `if provider == "..."` block around line 55
+- Token auth: add to the Token block around line 62
+- No auth: add to `_NO_AUTH_PROVIDERS` in `config.py`
+
+### 4. Add provider sync if the provider has a models API
+
+If the provider has a `/models` endpoint for dynamic model discovery, add a fetch function in `src/model_radar/provider_sync.py`.
+
+### 5. Add setup guide
+
+Add signup instructions in `src/model_radar/guides.py` so `setup_guide("provider_key")` returns useful onboarding steps.
+
+### 6. Write tests
+
+Add test coverage in `tests/test_providers.py`:
+- Provider key exists in PROVIDERS
+- Models have valid tiers
+- Base URL is well-formed
+
+### 7. Verify
+
+```sh
+python -m pytest tests/ -v
+model-radar providers                    # should list new provider
+model-radar scan --provider provider_key # should ping successfully
+```
+
+Via MCP:
+```
+list_providers()          # new provider shows up
+scan(provider="provider_key", verify=True)  # models respond
+```
+
+## Checklist
+
+- [ ] Provider added to PROVIDERS dict in providers.py
+- [ ] Auth handling in runner.py (if non-standard)
+- [ ] Setup guide in guides.py
+- [ ] Tests added and passing
+- [ ] Verified via CLI scan and MCP tools
diff --git a/.claude/skills/release/SKILL.md b/.claude/skills/release/SKILL.md
new file mode 100644
index 0000000..006b181
--- /dev/null
+++ b/.claude/skills/release/SKILL.md
@@ -0,0 +1,84 @@
+# Release
+
+Version bump, PyPI publish, and MCP registry update for model-radar.
+
+## When to Use
+
+When you need to cut a new release of model-radar.
+
+## Steps
+
+### 1. Pre-flight checks
+
+```sh
+python -m pytest tests/ -v
+ruff check src/ tests/
+```
+
+All tests must pass and lint must be clean before proceeding.
+
+### 2. Decide the version
+
+Follow semver: MAJOR.MINOR.PATCH
+- PATCH: bug fixes, provider data updates
+- MINOR: new MCP tools, new providers, new features
+- MAJOR: breaking API changes (rare)
+
+Check current version:
+```sh
+grep '^version' pyproject.toml
+```
+
+### 3. Bump version in BOTH files
+
+These must match exactly:
+
+1. `pyproject.toml` -> `version = "X.Y.Z"`
+2. `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`
+
+### 4. Update server.json if needed
+
+If the PyPI package version in `server.json` is stale, update its `version` fields to match the new X.Y.Z. This file is the MCP registry manifest.
+
+### 5. Commit the version bump
+
+```sh
+git add pyproject.toml src/model_radar/__init__.py server.json
+git commit -m "bump: vX.Y.Z -- <one-line summary of changes>"
+```
+
+### 6. Merge to master and tag
+
+```sh
+git checkout develop && git merge --no-ff feature/xxx   # if on feature branch
+git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
+git tag -a vX.Y.Z -m "vX.Y.Z -- <summary>"
+git checkout develop
+git push origin master develop --tags
+```
+
+### 7. Create GitHub release
+
+```sh
+gh release create vX.Y.Z --title "vX.Y.Z" --notes "<release notes>"
+```
+
+This triggers the publish workflow, which:
+- Builds and publishes to PyPI via trusted publisher (OIDC)
+- Publishes to MCP Registry via mcp-publisher (OIDC)
+
+### 8. Verify
+
+- Check PyPI: confirm that `pip install model-radar-mcp==X.Y.Z` succeeds
+- Check MCP Registry listing is updated
+
+## Checklist
+
+- [ ] Tests pass
+- [ ] Lint clean
+- [ ] Version bumped in pyproject.toml AND __init__.py
+- [ ] server.json version updated if stale
+- [ ] Commit message includes version and summary
+- [ ] Merged develop -> master with --no-ff
+- [ ] Tag created and pushed
+- [ ] GitHub release created (triggers CI publish)
diff --git a/.gitignore b/.gitignore
index 5047128..3352828 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,8 +10,9 @@ build/
 # API keys — never commit the real config
 config.json
 
-# CLAUDE.md is private — keep canonical copy in Vault only
-CLAUDE.md
+# Environment files
+.env
+.env.*
 
 # Cursor rules are private — golden copy in Vault, symlink in repo
 .cursor/rules
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..343cef9
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,98 @@
+# CLAUDE.md -- Model Radar
+
+## What Is This
+
+MCP server that pings 219+ free coding LLMs across 21 providers, ranks them by real-time latency, and helps AI agents pick the fastest model. Python 3.11+, built on FastMCP + httpx + click.
+
+## Commands
+
+```sh
+pip install -e .                                    # dev install
+pip install -e ".[dev]"                             # with test/lint deps
+python -m pytest tests/ -v                          # run tests (167 tests)
+ruff check src/ tests/                              # lint
+model-radar serve                                   # MCP server (stdio)
+model-radar serve --transport sse --port 8765       # SSE + Streamable HTTP
+model-radar serve --transport sse --port 8765 --web # SSE + web dashboard
+model-radar scan --min-tier S --limit 5             # CLI scan
+model-radar providers                               # list providers
+```
+
+## Key Modules
+
+| Module | Purpose |
+|--------|---------|
+| `server.py` | FastMCP server, all 19 MCP tool definitions |
+| `providers.py` | Provider/model catalog, tier system (S+ through C) |
+| `scanner.py` | Async ping engine, parallel scanning, adaptive rate limiting |
+| `runner.py` | Prompt execution, automatic fallback, batch execution |
+| `judge.py` | LLM-as-judge: rate, compare, batch evaluate |
+| `config.py` | Config management (~/.model-radar/config.json) |
+| `db.py` | SQLite persistence for model catalog and ping results |
+
+Full architecture: `docs/architecture.md`
+
+## MCP Tools (19 total)
+
+**Read-only (no side effects):**
+list_providers, list_models, scan, get_fastest, get_workers, provider_status, server_stats
+
+**Execution (runs prompts on external LLMs):**
+run, ask, batch_run, judge, compare, batch_judge, backtranslate_eval, benchmark
+
+**Write (modifies local config/state):**
+configure_key, refresh_models, setup_workflow, restart_server
+
+**Informational (returns text guidance):**
+setup_guide, host_swap_instructions
+
+## Version Bumps
+
+Update BOTH files together:
+- `pyproject.toml` -> `version = "X.Y.Z"`
+- `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`
+
+## Release Process
+
+```sh
+# All work on develop or feature branches
+git checkout develop && git checkout -b feature/xxx
+# ... work, commit ...
+git checkout develop && git merge --no-ff feature/xxx
+
+# Release: develop -> master, tag
+git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
+git tag -a vX.Y.Z -m "vX.Y.Z description"
+git checkout develop
+git push origin master develop --tags
+```
+
+Publishing is automated: GitHub Actions runs on `release: [published]` to push to PyPI (OIDC) and MCP Registry.
+
+## Do
+
+- Keep the dependency footprint minimal (httpx + mcp + click)
+- Use `docs/` for detailed playbooks; keep this file concise
+- Test with `python -m pytest tests/ -v` before committing
+- Use provider diversity in judge/worker selection
+
+## Don't
+
+- Commit API keys or config.json (keys live in ~/.model-radar/config.json with 0o600 permissions)
+- Add heavy dependencies without discussion
+- Remove provider definitions without checking if they're still active
+- Skip the two-file version bump (pyproject.toml + __init__.py)
+- Commit directly to master -- always work on develop or feature branches
+
+## Docs
+
+- `docs/architecture.md` -- module map, data flow, transport, rate limiting
+- `docs/mcp-transport.md` -- transport options, stateless HTTP, client config
+- `docs/playbook-translation-pipeline.md` -- batch translation patterns
+- `docs/playbook-llm-as-judge.md` -- evaluation patterns and judge selection
+
+## Skills
+
+- `.claude/skills/release/` -- Version bump + PyPI publish + MCP registry workflow
+- `.claude/skills/provider-setup/` -- Add a new provider end-to-end
+- `.claude/skills/benchmarking/` -- Run quality benchmarks and interpret results
diff --git a/src/model_radar/server.py b/src/model_radar/server.py
index 51d0649..fdc6cbb 100644
--- a/src/model_radar/server.py
+++ b/src/model_radar/server.py
@@ -88,17 +88,29 @@
   Translate back to source language using a different model, compute gloss overlap. \
   The most powerful non-circular quality metric for translation: translate→back-translate→overlap.
 
-## Tool guide — Quality & Setup
-- refresh_models(provider?, run_ping?, ping_limit?) — Fetch latest model lists from APIs; \
-  use periodically so free/paid and model list stay current.
+## Tool guide — Quality & Setup (read-only)
 - benchmark(...) — Quality-test models (runs prompts on external LLMs and stores quality scores, so not strictly read-only); results show in later scan/get_fastest.
 - setup_guide(provider?) — Signup instructions for unconfigured providers.
-- configure_key(provider, api_key) — Save an API key.
-- setup_workflow(step, provider_selection?) — Step-by-step setup (Playwright, providers, keys).
 - host_swap_instructions(model_id?, provider?, min_tier?) — Where to set base_url + model_id on the host.
-- restart_server() — (SSE only) Exit so process manager can restart. Allowed by default; set MODEL_RADAR_ALLOW_RESTART=0 to disable.
 - server_stats() — Server start time and uptime.
 
+## Tool guide — Configuration (writes to local config/state)
+- configure_key(provider, api_key) — Save an API key to ~/.model-radar/config.json.
+- refresh_models(provider?, run_ping?, ping_limit?) — Fetch latest model lists from APIs; \
+  use periodically so free/paid and model list stay current.
+- setup_workflow(step, provider_selection?) — Step-by-step setup (Playwright, providers, keys).
+- restart_server() — (SSE only) Exit so process manager can restart. Allowed by default; set MODEL_RADAR_ALLOW_RESTART=0 to disable.
+
+## Setup workflow — agent-driven happy path
+New user? Walk them through setup in this order:
+1. `list_providers()` — see which providers already have keys
+2. `setup_guide()` — show signup instructions for all unconfigured providers
+3. For each provider the user wants: `configure_key(provider, api_key)` — save the key
+4. `refresh_models()` — fetch latest model lists from provider APIs
+5. `scan(verify=True)` — verify which models are actually alive and responding
+6. `get_fastest(min_tier="A", count=5)` — recommend the best models to start with
+7. `host_swap_instructions()` — show how to configure Cursor/IDE with the fastest model
+
 ## Tier scale (SWE-bench Verified)
 Better → worse: S+ (70%+) > S (60-70%) > A+ (50-60%) > A (40-50%) > A- (35-40%) > B+ (30-35%) > B (20-30%) > C (<20%). \
 min_tier="A" means "A or better" (includes A+, S, S+).