From 7aed68cae9a1784f48b664a55315b8f4c57afea2 Mon Sep 17 00:00:00 2001
From: Tim Uy
Date: Mon, 23 Mar 2026 06:52:15 -0700
Subject: [PATCH] =?UTF-8?q?feat:=20AI-firstify=20audit=20=E2=80=94=20add?=
 =?UTF-8?q?=20.claude/skills/,=20public=20CLAUDE.md,=20expand=20MCP=20inst?=
 =?UTF-8?q?ructions?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Create .claude/skills/ with 3 prescriptive skills: release, provider-setup, benchmarking
- Add public CLAUDE.md with build/test/release procedures, do/don't section, tool classification
- Un-gitignore CLAUDE.md (was private-only); add .env/.env.* to .gitignore for safety
- Expand MCP server instructions: separate read-only vs write tools, add setup workflow happy path
- Safety audit: no credentials in committed code, .gitignore covers config.json + .env

Closes #9
---
 .claude/skills/benchmarking/SKILL.md   | 93 ++++++++++++++++++++++++
 .claude/skills/provider-setup/SKILL.md | 84 ++++++++++++++++++++++
 .claude/skills/release/SKILL.md        | 84 ++++++++++++++++++++++
 .gitignore                             |  5 +-
 CLAUDE.md                              | 98 ++++++++++++++++++++++++++
 src/model_radar/server.py              | 24 +++++--
 6 files changed, 380 insertions(+), 8 deletions(-)
 create mode 100644 .claude/skills/benchmarking/SKILL.md
 create mode 100644 .claude/skills/provider-setup/SKILL.md
 create mode 100644 .claude/skills/release/SKILL.md
 create mode 100644 CLAUDE.md

diff --git a/.claude/skills/benchmarking/SKILL.md b/.claude/skills/benchmarking/SKILL.md
new file mode 100644
index 0000000..aa2b370
--- /dev/null
+++ b/.claude/skills/benchmarking/SKILL.md
@@ -0,0 +1,93 @@
+# Benchmarking
+
+Run quality benchmarks across models, interpret results, and update quality scores.
+
+## When to Use
+
+When you need to evaluate model quality (e.g., after adding new models, comparing providers, or auditing the tier assignments).
+
+## Steps
+
+### 1. Select models to benchmark
+
+```
+get_fastest(min_tier="A", count=10, verified=True)
+```
+
+Or target specific models:
+```
+scan(provider="nvidia", verify=True)
+```
+
+### 2. Run the benchmark
+
+Via MCP:
+```
+benchmark(min_tier="A", provider="nvidia")
+```
+
+Or for a specific model:
+```
+benchmark(model_id="nvidia/llama-3.1-nemotron-ultra-253b-v1")
+```
+
+The benchmark runs 5 coding challenges and scores pass/fail. Results are stored in the quality database (~/.model-radar/quality.json) and affect future `get_fastest()` rankings.
+
+### 3. Interpret results
+
+Quality scores are 0-5 (number of challenges passed):
+- **5/5**: Excellent -- reliable for coding tasks
+- **4/5**: Good -- minor issues, generally usable
+- **3/5**: Acceptable -- may struggle with complex tasks
+- **2/5 or below**: Avoid for coding -- consider downgrading tier
+
+### 4. Cross-validate with judge evaluation
+
+For deeper quality assessment, use LLM-as-judge:
+
+```
+judge(
+    prompt="Write a Python function that finds the longest common subsequence of two strings",
+    rubric=["correctness", "efficiency", "code_quality"],
+    scale="1-5",
+    count=3
+)
+```
+
+See `docs/playbook-llm-as-judge.md` for full evaluation patterns.
+
+### 5. Update tier assignments if needed
+
+If benchmark results consistently disagree with the assigned tier:
+1. Check SWE-bench Verified for updated scores
+2. Update the `tier` field in `src/model_radar/providers.py`
+3. Run tests to ensure no tier validation failures
+
+### 6. Batch benchmark for comprehensive audit
+
+To benchmark all models from a provider:
+```
+scan(provider="provider_key", verify=True)
+benchmark(provider="provider_key")
+```
+
+To benchmark across all configured providers:
+```
+benchmark(min_tier="B")
+```
+
+## Interpreting Benchmark vs Tier Disagreements
+
+| Benchmark | Tier | Action |
+|-----------|------|--------|
+| 5/5 | B or lower | Check SWE-bench, consider upgrade |
+| 0-2/5 | A or higher | May be a flaky model, re-run. If consistent, downgrade |
+| 3-4/5 | matches tier | No action needed |
+
+## Checklist
+
+- [ ] Models selected (verified alive first)
+- [ ] Benchmark run completed
+- [ ] Results interpreted (scores + tier alignment)
+- [ ] Tier adjustments made if needed (in providers.py)
+- [ ] Tests pass after any tier changes
diff --git a/.claude/skills/provider-setup/SKILL.md b/.claude/skills/provider-setup/SKILL.md
new file mode 100644
index 0000000..b067086
--- /dev/null
+++ b/.claude/skills/provider-setup/SKILL.md
@@ -0,0 +1,84 @@
+# Provider Setup
+
+Add a new LLM provider to model-radar end-to-end.
+
+## When to Use
+
+When adding a new provider (e.g., a new free LLM API) to the model-radar catalog.
+
+## Steps
+
+### 1. Research the provider
+
+Gather:
+- Provider name and API base URL
+- Authentication method (Bearer token, API key in query param, no auth)
+- Environment variable convention (e.g., PROVIDER_API_KEY)
+- Available models with their IDs
+- Free tier availability and rate limits
+- Model quality tiers (check SWE-bench Verified if available)
+
+### 2. Add provider definition in `src/model_radar/providers.py`
+
+Add a new `Provider` entry to the `PROVIDERS` dict:
+
+```python
+"provider_key": Provider(
+    name="Provider Name",
+    base_url="https://api.provider.com/v1",
+    env_vars=["PROVIDER_API_KEY"],
+    models=[
+        Model("org/model-name", "Model Display Name", tier="A", ctx=32768),
+    ],
+),
+```
+
+Key fields:
+- `base_url`: The OpenAI-compatible chat/completions endpoint base
+- `env_vars`: List of environment variable names for the API key
+- `tier`: SWE-bench tier (S+, S, A+, A, A-, B+, B, C)
+- `ctx`: Context window size
+
+### 3. Handle auth quirks in `src/model_radar/runner.py`
+
+Most providers use `Authorization: Bearer `. If the new provider differs:
+- Query param auth: add to the `if provider == "..."` block around line 55
+- Token auth: add to the Token block around line 62
+- No auth: add to `_NO_AUTH_PROVIDERS` in `config.py`
+
+### 4. Add provider sync if they have a models API
+
+If the provider has a `/models` endpoint for dynamic model discovery, add a fetch function in `src/model_radar/provider_sync.py`.
+
+### 5. Add setup guide
+
+Add signup instructions in `src/model_radar/guides.py` so `setup_guide("provider_key")` returns useful onboarding steps.
+
+### 6. Write tests
+
+Add test coverage in `tests/test_providers.py`:
+- Provider key exists in PROVIDERS
+- Models have valid tiers
+- Base URL is well-formed
+
+### 7. Verify
+
+```sh
+python -m pytest tests/ -v
+model-radar providers  # should list new provider
+model-radar scan --provider provider_key  # should ping successfully
+```
+
+Via MCP:
+```
+list_providers()  # new provider shows up
+scan(provider="provider_key", verify=True)  # models respond
+```
+
+## Checklist
+
+- [ ] Provider added to PROVIDERS dict in providers.py
+- [ ] Auth handling in runner.py (if non-standard)
+- [ ] Setup guide in guides.py
+- [ ] Tests added and passing
+- [ ] Verified via CLI scan and MCP tools
diff --git a/.claude/skills/release/SKILL.md b/.claude/skills/release/SKILL.md
new file mode 100644
index 0000000..006b181
--- /dev/null
+++ b/.claude/skills/release/SKILL.md
@@ -0,0 +1,84 @@
+# Release
+
+Version bump, PyPI publish, and MCP registry update for model-radar.
+
+## When to Use
+
+When you need to cut a new release of model-radar.
+
+## Steps
+
+### 1. Pre-flight checks
+
+```sh
+python -m pytest tests/ -v
+ruff check src/ tests/
+```
+
+All tests must pass and lint must be clean before proceeding.
+
+### 2. Decide the version
+
+Follow semver: MAJOR.MINOR.PATCH
+- PATCH: bug fixes, provider data updates
+- MINOR: new MCP tools, new providers, new features
+- MAJOR: breaking API changes (rare)
+
+Check current version:
+```sh
+grep '^version' pyproject.toml
+```
+
+### 3. Bump version in BOTH files
+
+These must match exactly:
+
+1. `pyproject.toml` -> `version = "X.Y.Z"`
+2. `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`
+
+### 4. Update server.json if needed
+
+If the PyPI package version in `server.json` is stale, update the `version` fields to match. This is the MCP registry manifest.
+
+### 5. Commit the version bump
+
+```sh
+git add pyproject.toml src/model_radar/__init__.py server.json
+git commit -m "bump: vX.Y.Z -- "
+```
+
+### 6. Merge to master and tag
+
+```sh
+git checkout develop && git merge --no-ff feature/xxx  # if on feature branch
+git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
+git tag -a vX.Y.Z -m "vX.Y.Z -- "
+git checkout develop
+git push origin master develop --tags
+```
+
+### 7. Create GitHub release
+
+```sh
+gh release create vX.Y.Z --title "vX.Y.Z" --notes ""
+```
+
+This triggers the publish workflow which:
+- Builds and publishes to PyPI via trusted publisher (OIDC)
+- Publishes to MCP Registry via mcp-publisher (OIDC)
+
+### 8. Verify
+
+- Check PyPI: `pip install model-radar-mcp==X.Y.Z` works
+- Check MCP Registry listing is updated
+
+## Checklist
+
+- [ ] Tests pass
+- [ ] Lint clean
+- [ ] Version bumped in pyproject.toml AND __init__.py
+- [ ] server.json version updated if stale
+- [ ] Commit message includes version and summary
+- [ ] Merged develop -> master with --no-ff
+- [ ] Tag created and pushed
+- [ ] GitHub release created (triggers CI publish)
diff --git a/.gitignore b/.gitignore
index 5047128..3352828 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,8 +10,9 @@ build/
 # API keys — never commit the real config
 config.json
 
-# CLAUDE.md is private — keep canonical copy in Vault only
-CLAUDE.md
+# Environment files
+.env
+.env.*
 
 # Cursor rules are private — golden copy in Vault, symlink in repo
 .cursor/rules
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..343cef9
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,98 @@
+# CLAUDE.md -- Model Radar
+
+## What Is This
+
+MCP server that pings 219+ free coding LLM models across 21 providers, ranks by real-time latency, and helps AI agents pick the fastest model. Python 3.11+, built on FastMCP + httpx + click.
+
+## Commands
+
+```sh
+pip install -e .  # dev install
+pip install -e ".[dev]"  # with test/lint deps
+python -m pytest tests/ -v  # run tests (167 tests)
+ruff check src/ tests/  # lint
+model-radar serve  # MCP server (stdio)
+model-radar serve --transport sse --port 8765  # SSE + Streamable HTTP
+model-radar serve --transport sse --port 8765 --web  # SSE + web dashboard
+model-radar scan --min-tier S --limit 5  # CLI scan
+model-radar providers  # list providers
+```
+
+## Key Modules
+
+| Module | Purpose |
+|--------|---------|
+| `server.py` | FastMCP server, all 19 MCP tool definitions |
+| `providers.py` | Provider/model catalog, tier system (S+ through C) |
+| `scanner.py` | Async ping engine, parallel scanning, adaptive rate limiting |
+| `runner.py` | Prompt execution, automatic fallback, batch execution |
+| `judge.py` | LLM-as-judge: rate, compare, batch evaluate |
+| `config.py` | Config management (~/.model-radar/config.json) |
+| `db.py` | SQLite persistence for model catalog and ping results |
+
+Full architecture: `docs/architecture.md`
+
+## MCP Tools (19 total)
+
+**Read-only (no side effects):**
+list_providers, list_models, scan, get_fastest, get_workers, provider_status, server_stats
+
+**Execution (runs prompts on external LLMs):**
+run, ask, batch_run, judge, compare, batch_judge, backtranslate_eval, benchmark
+
+**Write (modifies local config/state):**
+configure_key, refresh_models, setup_workflow, restart_server
+
+**Informational (returns text guidance):**
+setup_guide, host_swap_instructions
+
+## Version Bumps
+
+Update BOTH files together:
+- `pyproject.toml` -> `version = "X.Y.Z"`
+- `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`
+
+## Release Process
+
+```sh
+# All work on develop or feature branches
+git checkout develop && git checkout -b feature/xxx
+# ... work, commit ...
+git checkout develop && git merge --no-ff feature/xxx
+
+# Release: develop -> master, tag
+git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
+git tag -a vX.Y.Z -m "vX.Y.Z description"
+git checkout develop
+git push origin master develop --tags
+```
+
+Publishing is automated: GitHub Actions runs on `release: [published]` to push to PyPI (OIDC) and MCP Registry.
+
+## Do
+
+- Keep the dependency footprint minimal (httpx + mcp + click)
+- Use `docs/` for detailed playbooks; keep this file concise
+- Test with `python -m pytest tests/ -v` before committing
+- Use provider diversity in judge/worker selection
+
+## Don't
+
+- Commit API keys or config.json (keys live in ~/.model-radar/config.json with 0o600)
+- Add heavy dependencies without discussion
+- Remove provider definitions without checking if they're still active
+- Skip the two-file version bump (pyproject.toml + __init__.py)
+- Commit directly to master -- always work on develop or feature branches
+
+## Docs
+
+- `docs/architecture.md` -- module map, data flow, transport, rate limiting
+- `docs/mcp-transport.md` -- transport options, stateless HTTP, client config
+- `docs/playbook-translation-pipeline.md` -- batch translation patterns
+- `docs/playbook-llm-as-judge.md` -- evaluation patterns and judge selection
+
+## Skills
+
+- `.claude/skills/release/` -- Version bump + PyPI publish + MCP registry workflow
+- `.claude/skills/provider-setup/` -- Add a new provider end-to-end
+- `.claude/skills/benchmarking/` -- Run quality benchmarks and interpret results
diff --git a/src/model_radar/server.py b/src/model_radar/server.py
index 51d0649..fdc6cbb 100644
--- a/src/model_radar/server.py
+++ b/src/model_radar/server.py
@@ -88,17 +88,29 @@
 Translate back to source language using a different model, compute gloss overlap. \
 The most powerful non-circular quality metric for translation: translate→back-translate→overlap.
 
-## Tool guide — Quality & Setup
-- refresh_models(provider?, run_ping?, ping_limit?) — Fetch latest model lists from APIs; \
-  use periodically so free/paid and model list stay current.
+## Tool guide — Quality & Setup (read-only)
 - benchmark(...) — Quality-test models; results show in later scan/get_fastest.
 - setup_guide(provider?) — Signup instructions for unconfigured providers.
-- configure_key(provider, api_key) — Save an API key.
-- setup_workflow(step, provider_selection?) — Step-by-step setup (Playwright, providers, keys).
 - host_swap_instructions(model_id?, provider?, min_tier?) — Where to set base_url + model_id on the host.
-- restart_server() — (SSE only) Exit so process manager can restart. Allowed by default; set MODEL_RADAR_ALLOW_RESTART=0 to disable.
 - server_stats() — Server start time and uptime.
 
+## Tool guide — Configuration (writes to local config/state)
+- configure_key(provider, api_key) — Save an API key to ~/.model-radar/config.json.
+- refresh_models(provider?, run_ping?, ping_limit?) — Fetch latest model lists from APIs; \
+  use periodically so free/paid and model list stay current.
+- setup_workflow(step, provider_selection?) — Step-by-step setup (Playwright, providers, keys).
+- restart_server() — (SSE only) Exit so process manager can restart. Allowed by default; set MODEL_RADAR_ALLOW_RESTART=0 to disable.
+
+## Setup workflow — agent-driven happy path
+New user? Walk them through setup in this order:
+1. `list_providers()` — see which providers already have keys
+2. `setup_guide()` — show signup instructions for all unconfigured providers
+3. For each provider the user wants: `configure_key(provider, api_key)` — save the key
+4. `refresh_models()` — fetch latest model lists from provider APIs
+5. `scan(verify=True)` — verify which models are actually alive and responding
+6. `get_fastest(min_tier="A", count=5)` — recommend the best models to start with
+7. `host_swap_instructions()` — show how to configure Cursor/IDE with the fastest model
+
 ## Tier scale (SWE-bench Verified)
 Better → worse: S+ (70%+) > S (60-70%) > A+ (50-60%) > A (40-50%) > A- (35-40%) > B+ (30-35%) > B (20-30%) > C (<20%). \
 min_tier="A" means "A or better" (includes A+, S, S+).
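
Illustration only, not part of the patch: the tier scale in the last hunk says that `min_tier="A"` means "A or better". A minimal sketch of that rule follows, assuming tiers are compared by their position on the scale. `TIER_ORDER` and `meets_min_tier` are hypothetical names introduced here, and the actual filtering in model-radar may be implemented differently.

```python
# Hypothetical helper illustrating the tier scale; not taken from model-radar.
# Tiers are listed from best to worst, as in the server instructions.
TIER_ORDER = ["S+", "S", "A+", "A", "A-", "B+", "B", "C"]


def meets_min_tier(tier: str, min_tier: str) -> bool:
    """Return True if `tier` is `min_tier` or better on the scale above."""
    # A lower index means a better tier, so "or better" is index <=.
    return TIER_ORDER.index(tier) <= TIER_ORDER.index(min_tier)


if __name__ == "__main__":
    # Tiers that would pass a min_tier="A" filter under this sketch.
    print([t for t in TIER_ORDER if meets_min_tier(t, "A")])
```

Under this sketch an unknown tier string raises `ValueError`, which seems preferable to silently ranking it.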