93 changes: 93 additions & 0 deletions .claude/skills/benchmarking/SKILL.md
@@ -0,0 +1,93 @@
# Benchmarking

Run quality benchmarks across models, interpret results, and update quality scores.

## When to Use

When you need to evaluate model quality (e.g., after adding new models, comparing providers, or auditing the tier assignments).

## Steps

### 1. Select models to benchmark

```
get_fastest(min_tier="A", count=10, verified=True)
```

Or target specific models:
```
scan(provider="nvidia", verify=True)
```

### 2. Run the benchmark

Via MCP:
```
benchmark(min_tier="A", provider="nvidia")
```

Or for a specific model:
```
benchmark(model_id="nvidia/llama-3.1-nemotron-ultra-253b-v1")
```

The benchmark runs 5 coding challenges and scores each one pass or fail. Results are stored in the quality database (`~/.model-radar/quality.json`) and affect future `get_fastest()` rankings.

### 3. Interpret results

Quality scores are 0-5 (number of challenges passed):
- **5/5**: Excellent -- reliable for coding tasks
- **4/5**: Good -- minor issues, generally usable
- **3/5**: Acceptable -- may struggle with complex tasks
- **2/5 or below**: Avoid for coding -- consider downgrading tier
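
The bands above can be written as a small lookup. This is a minimal sketch: `rating_for_score` is a hypothetical helper for illustration, not a model-radar function.

```python
def rating_for_score(passed: int) -> str:
    """Map the number of challenges passed (0-5) to the rating bands above."""
    if not 0 <= passed <= 5:
        raise ValueError(f"score must be between 0 and 5, got {passed}")
    if passed == 5:
        return "Excellent"
    if passed == 4:
        return "Good"
    if passed == 3:
        return "Acceptable"
    # 2/5 or below
    return "Avoid for coding"
```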

### 4. Cross-validate with judge evaluation

For deeper quality assessment, use LLM-as-judge:

```
judge(
    prompt="Write a Python function that finds the longest common subsequence of two strings",
    rubric=["correctness", "efficiency", "code_quality"],
    scale="1-5",
    count=3
)
```

See `docs/playbook-llm-as-judge.md` for full evaluation patterns.

### 5. Update tier assignments if needed

If benchmark results consistently disagree with the assigned tier:
1. Check SWE-bench Verified for updated scores
2. Update the `tier` field in `src/model_radar/providers.py`
3. Run tests to ensure no tier validation failures

### 6. Batch benchmark for comprehensive audit

To benchmark all models from a provider:
```
scan(provider="provider_key", verify=True)
benchmark(provider="provider_key")
```

To benchmark across all configured providers:
```
benchmark(min_tier="B")
```

## Interpreting Benchmark vs Tier Disagreements

| Benchmark | Tier | Action |
|-----------|------|--------|
| 5/5 | B or lower | Check SWE-bench, consider upgrade |
| 0-2/5 | A or higher | Possibly flaky; re-run. If the result is consistent, downgrade |
| 3-4/5 | matches tier | No action needed |
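
The table can be sketched as a function. It assumes the tier order S+, S, A+, A, A-, B+, B, C (best first); `suggested_action` is a hypothetical helper, and returning "No action needed" for combinations the table does not list is a simplification.

```python
# Assumed tier order, best first.
TIERS = ["S+", "S", "A+", "A", "A-", "B+", "B", "C"]


def suggested_action(passed: int, tier: str) -> str:
    """Return the action from the table for a benchmark score and assigned tier."""
    if not 0 <= passed <= 5:
        raise ValueError(f"score must be between 0 and 5, got {passed}")
    rank = TIERS.index(tier)  # raises ValueError for an unknown tier
    if passed == 5 and rank >= TIERS.index("B"):
        return "Check SWE-bench, consider upgrade"
    if passed <= 2 and rank <= TIERS.index("A"):
        return "Re-run; if consistent, downgrade"
    # Simplification: every combination not listed in the table lands here.
    return "No action needed"
```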

## Checklist

- [ ] Models selected (verified alive first)
- [ ] Benchmark run completed
- [ ] Results interpreted (scores + tier alignment)
- [ ] Tier adjustments made if needed (in providers.py)
- [ ] Tests pass after any tier changes
84 changes: 84 additions & 0 deletions .claude/skills/provider-setup/SKILL.md
@@ -0,0 +1,84 @@
# Provider Setup

Add a new LLM provider to model-radar end-to-end.

## When to Use

When adding a new provider (e.g., a new free LLM API) to the model-radar catalog.

## Steps

### 1. Research the provider

Gather:
- Provider name and API base URL
- Authentication method (Bearer token, API key in query param, no auth)
- Environment variable convention (e.g., PROVIDER_API_KEY)
- Available models with their IDs
- Free tier availability and rate limits
- Model quality tiers (check SWE-bench Verified if available)

### 2. Add provider definition in `src/model_radar/providers.py`

Add a new `Provider` entry to the `PROVIDERS` dict:

```python
"provider_key": Provider(
    name="Provider Name",
    base_url="https://api.provider.com/v1",
    env_vars=["PROVIDER_API_KEY"],
    models=[
        Model("org/model-name", "Model Display Name", tier="A", ctx=32768),
    ],
),
```

Key fields:
- `base_url`: Base URL of the OpenAI-compatible API, under which `chat/completions` is served
- `env_vars`: List of environment variable names for the API key
- `tier`: SWE-bench tier (S+, S, A+, A, A-, B+, B, C)
- `ctx`: Context window size in tokens
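
For orientation, the entry above is consistent with classes shaped roughly like the following. This is only a sketch: the real `Provider` and `Model` definitions in `providers.py` may differ, and the field names `id` and `name` on `Model` are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Model:
    id: str    # assumed field name for the first positional argument
    name: str  # assumed field name for the display name
    tier: str
    ctx: int


@dataclass
class Provider:
    name: str
    base_url: str
    env_vars: list[str]
    models: list[Model] = field(default_factory=list)


# Stand-in for the real PROVIDERS dict, holding the example entry.
PROVIDERS = {
    "provider_key": Provider(
        name="Provider Name",
        base_url="https://api.provider.com/v1",
        env_vars=["PROVIDER_API_KEY"],
        models=[
            Model("org/model-name", "Model Display Name", tier="A", ctx=32768),
        ],
    ),
}
```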

### 3. Handle auth quirks in `src/model_radar/runner.py`

Most providers use `Authorization: Bearer <key>`. If the new provider differs:
- Query param auth: add to the `if provider == "..."` block around line 55
- Token auth: add to the Token block around line 62
- No auth: add to `_NO_AUTH_PROVIDERS` in `config.py`
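
The three cases can be pictured as one dispatch function. This is a sketch, not the code in `runner.py`: the provider keys, the `build_auth` name, the query parameter name, and the reading of "Token auth" as `Authorization: Token <key>` are all assumptions.

```python
# Hypothetical provider keys, for illustration only.
NO_AUTH_PROVIDERS = {"local_example"}
QUERY_PARAM_PROVIDERS = {"query_example": "key"}  # provider -> query parameter name
TOKEN_PROVIDERS = {"token_example"}


def build_auth(provider: str, api_key: str) -> tuple[dict[str, str], dict[str, str]]:
    """Return (headers, query_params) carrying the credentials for a request."""
    if provider in NO_AUTH_PROVIDERS:
        return {}, {}
    if provider in QUERY_PARAM_PROVIDERS:
        return {}, {QUERY_PARAM_PROVIDERS[provider]: api_key}
    if provider in TOKEN_PROVIDERS:
        # Assumes "Token auth" means the `Authorization: Token <key>` scheme.
        return {"Authorization": f"Token {api_key}"}, {}
    # Default: standard Bearer authentication.
    return {"Authorization": f"Bearer {api_key}"}, {}
```

With httpx, the two dicts would be passed as the `headers=` and `params=` arguments of the request.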

### 4. Add provider sync if it has a models API

If the provider has a `/models` endpoint for dynamic model discovery, add a fetch function in `src/model_radar/provider_sync.py`.
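
The parsing half of such a fetch function might look like the sketch below. It assumes the provider follows the OpenAI-style list response (`{"data": [{"id": ...}, ...]}`); `parse_model_ids` is a hypothetical name, and the HTTP call itself is left out.

```python
import json


def parse_model_ids(payload: str) -> list[str]:
    """Extract model IDs from an OpenAI-style `/models` JSON response body."""
    body = json.loads(payload)
    # Skip entries without an "id" rather than failing the whole sync.
    return [entry["id"] for entry in body.get("data", []) if "id" in entry]
```

Providers that return a different shape would need their own parser.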

### 5. Add setup guide

Add signup instructions in `src/model_radar/guides.py` so `setup_guide("provider_key")` returns useful onboarding steps.

### 6. Write tests

Add test coverage in `tests/test_providers.py`:
- Provider key exists in PROVIDERS
- Models have valid tiers
- Base URL is well-formed
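
A sketch of what these three checks could look like in pytest style. To stay self-contained it uses a stand-in `PROVIDERS` dict of plain dicts; a real test would import `PROVIDERS` from `model_radar.providers`, whose exact shape is not shown here.

```python
from urllib.parse import urlparse

VALID_TIERS = {"S+", "S", "A+", "A", "A-", "B+", "B", "C"}

# Stand-in data for illustration; not the real catalog structure.
PROVIDERS = {
    "provider_key": {
        "base_url": "https://api.provider.com/v1",
        "models": [{"id": "org/model-name", "tier": "A"}],
    },
}


def is_well_formed_url(url: str) -> bool:
    """True if the URL has an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def test_provider_key_exists():
    assert "provider_key" in PROVIDERS


def test_models_have_valid_tiers():
    for model in PROVIDERS["provider_key"]["models"]:
        assert model["tier"] in VALID_TIERS


def test_base_url_is_well_formed():
    assert is_well_formed_url(PROVIDERS["provider_key"]["base_url"])
```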

### 7. Verify

```sh
python -m pytest tests/ -v
model-radar providers # should list new provider
model-radar scan --provider provider_key # should ping successfully
```

Via MCP:
```
list_providers() # new provider shows up
scan(provider="provider_key", verify=True) # models respond
```

## Checklist

- [ ] Provider added to PROVIDERS dict in providers.py
- [ ] Auth handling in runner.py (if non-standard)
- [ ] Setup guide in guides.py
- [ ] Tests added and passing
- [ ] Verified via CLI scan and MCP tools
84 changes: 84 additions & 0 deletions .claude/skills/release/SKILL.md
@@ -0,0 +1,84 @@
# Release

Version bump, PyPI publish, and MCP registry update for model-radar.

## When to Use

When you need to cut a new release of model-radar.

## Steps

### 1. Pre-flight checks

```sh
python -m pytest tests/ -v
ruff check src/ tests/
```

All tests must pass and lint must be clean before proceeding.

### 2. Decide the version

Follow semver: MAJOR.MINOR.PATCH
- PATCH: bug fixes, provider data updates
- MINOR: new MCP tools, new providers, new features
- MAJOR: breaking API changes (rare)
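
As a worked example of the rule, a bump resets every lower component to zero. The `bump` helper below is hypothetical and handles only plain `MAJOR.MINOR.PATCH` strings (no pre-release or build suffixes).

```python
def bump(version: str, part: str) -> str:
    """Return the next version after bumping 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

For instance, `bump("1.4.2", "minor")` gives `"1.5.0"`.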

Check current version:
```sh
grep '^version' pyproject.toml
```

### 3. Bump version in BOTH files

These must match exactly:

1. `pyproject.toml` -> `version = "X.Y.Z"`
2. `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`

### 4. Update server.json if needed

If the PyPI package version in `server.json` is stale, update the `version` fields to match. This is the MCP registry manifest.

### 5. Commit the version bump

```sh
git add pyproject.toml src/model_radar/__init__.py server.json
git commit -m "bump: vX.Y.Z -- <one-line summary of changes>"
```

### 6. Merge to master and tag

```sh
git checkout develop && git merge --no-ff feature/xxx # if on feature branch
git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
git tag -a vX.Y.Z -m "vX.Y.Z -- <summary>"
git checkout develop
git push origin master develop --tags
```

### 7. Create GitHub release

```sh
gh release create vX.Y.Z --title "vX.Y.Z" --notes "<release notes>"
```

This triggers the publish workflow, which:
- Builds and publishes to PyPI via trusted publisher (OIDC)
- Publishes to MCP Registry via mcp-publisher (OIDC)

### 8. Verify

- Check PyPI: `pip install model-radar-mcp==X.Y.Z` works
- Check MCP Registry listing is updated

## Checklist

- [ ] Tests pass
- [ ] Lint clean
- [ ] Version bumped in pyproject.toml AND __init__.py
- [ ] server.json version updated if stale
- [ ] Commit message includes version and summary
- [ ] Merged develop -> master with --no-ff
- [ ] Tag created and pushed
- [ ] GitHub release created (triggers CI publish)
5 changes: 3 additions & 2 deletions .gitignore
@@ -10,8 +10,9 @@ build/
# API keys — never commit the real config
config.json

-# CLAUDE.md is private — keep canonical copy in Vault only
-CLAUDE.md
+# Environment files
+.env
+.env.*

# Cursor rules are private — golden copy in Vault, symlink in repo
.cursor/rules
98 changes: 98 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,98 @@
# CLAUDE.md -- Model Radar

## What Is This

MCP server that pings 219+ free coding LLMs across 21 providers, ranks them by real-time latency, and helps AI agents pick the fastest model. Python 3.11+, built on FastMCP + httpx + click.

## Commands

```sh
pip install -e . # dev install
pip install -e ".[dev]" # with test/lint deps
python -m pytest tests/ -v # run tests (167 tests)
ruff check src/ tests/ # lint
model-radar serve # MCP server (stdio)
model-radar serve --transport sse --port 8765 # SSE + Streamable HTTP
model-radar serve --transport sse --port 8765 --web # SSE + web dashboard
model-radar scan --min-tier S --limit 5 # CLI scan
model-radar providers # list providers
```

## Key Modules

| Module | Purpose |
|--------|---------|
| `server.py` | FastMCP server, all 19 MCP tool definitions |
| `providers.py` | Provider/model catalog, tier system (S+ through C) |
| `scanner.py` | Async ping engine, parallel scanning, adaptive rate limiting |
| `runner.py` | Prompt execution, automatic fallback, batch execution |
| `judge.py` | LLM-as-judge: rate, compare, batch evaluate |
| `config.py` | Config management (~/.model-radar/config.json) |
| `db.py` | SQLite persistence for model catalog and ping results |

Full architecture: `docs/architecture.md`

## MCP Tools (19 total)

**Read-only (no side effects):**
list_providers, list_models, scan, get_fastest, get_workers, provider_status, server_stats

**Execution (runs prompts on external LLMs):**
run, ask, batch_run, judge, compare, batch_judge, backtranslate_eval, benchmark

**Write (modifies local config/state):**
configure_key, refresh_models, setup_workflow, restart_server

**Informational (returns text guidance):**
setup_guide, host_swap_instructions

## Version Bumps

Update BOTH files together:
- `pyproject.toml` -> `version = "X.Y.Z"`
- `src/model_radar/__init__.py` -> `__version__ = "X.Y.Z"`

## Release Process

```sh
# All work on develop or feature branches
git checkout develop && git checkout -b feature/xxx
# ... work, commit ...
git checkout develop && git merge --no-ff feature/xxx

# Release: develop -> master, tag
git checkout master && git merge develop --no-ff -m "release: vX.Y.Z"
git tag -a vX.Y.Z -m "vX.Y.Z description"
git checkout develop
git push origin master develop --tags
```

Publishing is automated: GitHub Actions runs on `release: [published]` to push to PyPI (OIDC) and MCP Registry.

## Do

- Keep the dependency footprint minimal (httpx + mcp + click)
- Use `docs/` for detailed playbooks; keep this file concise
- Test with `python -m pytest tests/ -v` before committing
- Use provider diversity in judge/worker selection

## Don't

- Commit API keys or config.json (keys live in ~/.model-radar/config.json with 0o600)
- Add heavy dependencies without discussion
- Remove provider definitions without checking if they're still active
- Skip the two-file version bump (pyproject.toml + __init__.py)
- Commit directly to master -- always work on develop or feature branches

## Docs

- `docs/architecture.md` -- module map, data flow, transport, rate limiting
- `docs/mcp-transport.md` -- transport options, stateless HTTP, client config
- `docs/playbook-translation-pipeline.md` -- batch translation patterns
- `docs/playbook-llm-as-judge.md` -- evaluation patterns and judge selection

## Skills

- `.claude/skills/release/` -- Version bump + PyPI publish + MCP registry workflow
- `.claude/skills/provider-setup/` -- Add a new provider end-to-end
- `.claude/skills/benchmarking/` -- Run quality benchmarks and interpret results