Benchmark AI coding agents against your own codebase.
Mine real tasks from your repo history, run agents against them, and find out which setup actually works best for your code, not someone else's benchmark suite.
Existing benchmarks (SWE-bench, HumanEval) use fixed task sets that AI models may have memorized from training data, and as general public benchmarks they are unlikely to capture what matters most in your own workflows. codeprobe mines tasks from your private repo history, producing benchmarks that are impossible to contaminate. You can also point the tool at any public repo to mine tasks from.
codeprobe orchestrates external AI coding agents — you need at least one installed:
| Agent | Install | Required env var |
|---|---|---|
| Claude Code | claude.ai/download | `ANTHROPIC_API_KEY` |
| GitHub Copilot | `npm install -g @github/copilot-cli` (>= 1.0.4) | GitHub auth via `gh auth login` |
| Codex | Included via `pip install codeprobe[codex]` | `OPENAI_API_KEY` |
You also need:
- Python 3.11+
- Git (for task mining and worktree isolation)
- GitHub CLI (`gh`) — optional, for mining tasks from GitHub PRs with linked issues
The `assess` and `mine --enrich` commands need an LLM for scoring/enrichment. codeprobe auto-detects the best available backend:
| Priority | Backend | Install | Env var |
|---|---|---|---|
| 1 | Anthropic SDK | `pip install codeprobe[anthropic]` | `ANTHROPIC_API_KEY` |
| 2 | OpenAI SDK | `pip install codeprobe[codex]` | `OPENAI_API_KEY` |
| 3 | Claude CLI | claude.ai/download | ANTHROPIC_API_KEY |
Override with `CODEPROBE_LLM_BACKEND=anthropic|openai|claude-cli`. Without any backend, `assess` falls back to heuristic scoring.
```bash
pip install codeprobe

cd /path/to/your/repo
codeprobe assess .      # Score benchmarking potential (optional)
codeprobe mine .        # Extract tasks from repo history
codeprobe run .         # Run agents against tasks
codeprobe interpret .   # Get recommendations
```

Prefer driving codeprobe through a coding agent instead? See docs/workflows/with-agents.md for the skills-based workflow (/experiment, /assess-codebase, /interpret).
| Command | Purpose |
|---|---|
| `codeprobe assess` | Score a codebase's benchmarking potential |
| `codeprobe init` | Interactive wizard — choose what to compare |
| `codeprobe mine` | Mine eval tasks from merged PRs/MRs |
| `codeprobe probe` | Generate fast micro-benchmark probes (30s each) |
| `codeprobe experiment` | Manage comparison experiments (init, add-config) |
| `codeprobe run` | Execute tasks against AI agents |
| `codeprobe interpret` | Analyze results, rank configurations |
| `codeprobe doctor` | Check environment readiness (agents, keys, git) |
| `codeprobe preambles list` | List available preambles at all search levels |
| `codeprobe oracle-check` | Compare agent answer against oracle ground truth |
| `codeprobe scaffold` | Create/validate eval task directories |
| `codeprobe ratings` | Record and analyze agent session quality ratings |
Mine real code-change tasks from your git history. Agents must reproduce known fixes and features.
```bash
codeprobe mine . --count 10 --source github
codeprobe mine . --count 5 --min-files 4   # Harder tasks (more files changed)
codeprobe mine . --enrich                  # LLM-enriched instructions
```

Fast exact-match tasks (30s each) that test code navigation and comprehension — no agent sandbox needed.

```bash
codeprobe probe . -n 10 -l python -s 42 -o ./probes
```

Generates four probe types: `find-function`, `count-callers`, `return-type`, `module-dependency`.
End-to-end flows from a raw repo to ranked agent results. Each workflow covers the full assess → mine → validate → run → interpret pipeline.
| Workflow | When to use | Guide |
|---|---|---|
| Standard | Repo has merged PRs/MRs | docs/workflows/standard.md |
| Cold-start | New repo, squashed history, vendored code | docs/workflows/cold-start.md |
| Cross-repo | Tasks spanning multiple repositories | docs/workflows/cross-repo.md |
Quick start (standard path):
```bash
codeprobe assess /path/to/repo
codeprobe mine /path/to/repo --goal quality --count 10 --no-interactive
codeprobe validate /path/to/repo/.codeprobe/tasks/<task-id>
codeprobe run /path/to/repo --agent claude --max-cost-usd 5.00
codeprobe interpret /path/to/repo
```

For the full MCP comparison setup (preambles, baseline vs with-MCP configs), see the next section.
Compare agent performance with and without MCP tools (Sourcegraph, GitHub, etc.).
When `--mcp-families` mining writes ground truth using a single backend (e.g. Sourcegraph's `sg_find_references`), and the experiment then gives one config the same MCP tool, the with-MCP config can score 1.0 simply because it called the backend that wrote the answer key. The reported delta then measures "did the agent invoke the grading rubric" rather than tool value (tracked as codeprobe-ekhi).
codeprobe ships three structural mitigations that are on by default; do not disable them unless you know what you are giving up:
- Multi-source consensus mining — `--mcp-families` runs every available backend (`sourcegraph`, `ast`, `grep`) and only ships tasks where ≥2 backends agree above `--consensus-threshold` (default `0.8` pairwise file-level F1; sketched below). Tasks below the threshold are quarantined under `tasks_quarantined/` with a `divergence_report.json`. `--consensus-mode intersection` (default) keeps the high-precision intersection; `--consensus-mode union` keeps everything any backend found. `--no-consensus` reverts to legacy single-backend GT and is unsafe for MCP-vs-no-MCP comparisons.
- Tool-independent AST oracle — `--backend ast` (also one of the consensus backends) resolves ground truth via Python `ast` and a Go scanner, with no dependency on Sourcegraph or grep. Use it as a standalone backend or as the independent leg of consensus.
- Aggregate-time bias detection — `codeprobe experiment aggregate` flags `backend_overlap`, `overshipping`, and `no_independent_baseline` patterns before printing the score table. See How to read aggregate output.
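If it helps to see the consensus gate concretely, here is a minimal sketch of pairwise file-level F1 agreement, assuming each backend's ground truth reduces to a set of file paths. This is illustrative only, not codeprobe's internal implementation:

```python
# Illustrative consensus gate: pairwise file-level F1 across per-backend
# ground-truth file sets (a sketch, not codeprobe's internal code).
from itertools import combinations


def file_f1(a: set[str], b: set[str]) -> float:
    """F1 between two sets of relevant file paths."""
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def passes_consensus(backend_files: dict[str, set[str]], threshold: float = 0.8) -> bool:
    """Ship a task only if at least two backends agree above the threshold."""
    return any(
        file_f1(backend_files[x], backend_files[y]) >= threshold
        for x, y in combinations(backend_files, 2)
    )


ground_truth = {
    "sourcegraph": {"api/handlers.py", "api/models.py"},
    "ast":         {"api/handlers.py", "api/models.py"},
    "grep":        {"api/handlers.py", "api/models.py", "README.md"},
}
print(passes_consensus(ground_truth))  # True: sourcegraph and ast agree exactly
```

Tasks that fail a gate like this are what end up quarantined with a divergence report.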
After mining, also run cross-validation to surface remaining divergences across the per-backend ground-truth files:
```bash
codeprobe mine-cross-validate /path/to/repo/.codeprobe/tasks \
  --threshold 0.6   # exit 1 if any pair falls below — useful in CI
```

The full command set above is the supported path for honest MCP-vs-no-MCP measurement; tasks that survive consensus + cross-validation are independent of the agent's tool surface and safe to publish.
```bash
# Set up Sourcegraph credentials (used as one of the consensus backends)
export SOURCEGRAPH_TOKEN="your-token"

# Mine MCP-optimized tasks with default consensus across sourcegraph + ast + grep
codeprobe mine /path/to/repo \
  --org-scale --mcp-families --count 5 \
  --no-interactive --no-llm \
  --sg-repo github.com/sg-evals/your-repo

# Optional: cross-validate the resulting per-backend ground truths
codeprobe mine-cross-validate /path/to/repo/.codeprobe/tasks
```

MCP task families: `symbol-reference-trace`, `type-hierarchy-consumers`, `change-scope-audit`.
The Sourcegraph token enables the SG leg of consensus. With no token, consensus falls back to `ast` + `grep`; you'll see fewer shipped tasks but the comparison stays honest. Pass `--backend ast` to skip Sourcegraph entirely.
```bash
# Create experiment
codeprobe experiment init /path/to/repo --name mcp-comparison
# Copy mined tasks into the experiment
cp -r /path/to/repo/.codeprobe/tasks/* /path/to/repo/mcp-comparison/tasks/
# Baseline config (no MCP, no preamble)
codeprobe experiment add-config /path/to/repo/mcp-comparison \
--label baseline --agent claude --model claude-haiku-4-5-20251001
# Sourcegraph MCP config (preamble + MCP server)
codeprobe experiment add-config /path/to/repo/mcp-comparison \
--label with-sourcegraph --agent claude --model claude-haiku-4-5-20251001 \
--preamble sourcegraph \
--mcp-config '{"mcpServers":{"sourcegraph":{"type":"http","url":"https://sourcegraph.com/.api/mcp/all","headers":{"Authorization":"token ${SOURCEGRAPH_TOKEN}"}}}}'
# Run and interpret
codeprobe run /path/to/repo/mcp-comparison --agent claude --max-cost-usd 5.00
codeprobe interpret /path/to/repo/mcp-comparison
```

For `oracle` / `symbol-reference-trace` / `change-scope-audit` tasks where the agent's answer is text rather than code edits, you can isolate the comparison further by stashing the workspace's source files for the duration of the run. The agent then must use Sourcegraph MCP — there is nothing local to fall back on. Pair the v2 `sourcegraph` preamble (which declares "Local source files are not present") with the `--hide-local-source` flag:
```bash
codeprobe experiment add-config /path/to/repo/mcp-comparison \
--label with-sg-isolated --agent claude --model claude-sonnet-4-6 \
--preamble sourcegraph \
--mcp-config '{"mcpServers":{"sourcegraph":{"type":"http","url":"https://sourcegraph.com/.api/mcp/all","headers":{"Authorization":"token ${SOURCEGRAPH_TOKEN}"}}}}' \
  --hide-local-source
```

`--hide-local-source` mirrors CodeScaleBench's `Dockerfile.sg_only` and EnterpriseBench's `generate_sg_only_dockerfile` pattern. The workspace appears empty during the agent's run and is restored on exit (including on exception). `.git`, `.codeprobe`, and `.codeprobe-worktrees*` are preserved by default.
Not compatible with SDLC tasks — they need source files to edit. For SDLC, use `--preamble sourcegraph` without `--hide-local-source`; the v2 preamble alone (no isolation) is the relevant intervention.
Preambles are composable instruction templates prepended to the agent's prompt for MCP-enabled configs. Built-in preambles: `sourcegraph`, `github`.
Override built-ins by placing a .md file in:

- `<task_dir>/preambles/` (per-task)
- `.codeprobe/preambles/` (project-level)
- `~/.codeprobe/preambles/` (user-level)
Template variables (filled by `task_preamble_context`), illustrated in the sketch after the list:

- `{{sg_repo}}`, `{{repo_name}}`, `{{repo_path}}`, `{{task_id}}` — task identity
- `{{repo_scope}}` — one-line repo-scoping directive (sourcegraph preamble; built from `metadata.sg_repo`)
- `{{workflow_tail}}` — category-specialised continuation of the numbered "Required Workflow" list (sourcegraph preamble; varies by `metadata.category`)
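To make the variables concrete, here is a hypothetical fill. The template text and context values are invented for this example (they are not the built-in sourcegraph preamble), and codeprobe's real rendering happens via `task_preamble_context`:

```python
# Hypothetical preamble fill. Template text and context values are invented
# for illustration; codeprobe's actual rendering is done by
# task_preamble_context and the built-in preambles.
import re

PREAMBLE = """You are working on {{repo_name}} ({{sg_repo}}), checked out at {{repo_path}}.
{{repo_scope}}

Required Workflow for task {{task_id}}:
1. Read the task instructions.
{{workflow_tail}}
"""

context = {
    "sg_repo": "github.com/sg-evals/your-repo",
    "repo_name": "your-repo",
    "repo_path": "/path/to/your/repo",
    "task_id": "symbol-reference-trace-0001",
    "repo_scope": "Scope every search to github.com/sg-evals/your-repo.",
    "workflow_tail": "2. Answer using tool results, citing file paths.",
}

# Replace each {{var}} with its context value; unknown variables are left intact.
rendered = re.sub(r"\{\{(\w+)\}\}", lambda m: context.get(m.group(1), m.group(0)), PREAMBLE)
print(rendered)
```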
```bash
# Running
codeprobe run . --parallel 5 # Run 5 tasks concurrently within each config
codeprobe run . --config-parallel 2 # Run 2 configs concurrently (default 1 = serial).
# Cross-config parallelism multiplies in-flight
# task count and inflates --max-cost-usd
# overshoot proportionally; default 1 keeps the
# cost cap honest. Opt in only when you don't
# need cost containment.
codeprobe run . --max-cost-usd 2.00 # Stop when cost budget is reached
codeprobe run . --dry-run # Estimate resource usage without running
codeprobe run . --model opus-4 # Override experiment.json model
codeprobe run . --timeout 600 # Override default 300s timeout
codeprobe run . --repeats 3 # Run each task 3 times
codeprobe run . --show-prompt # Print resolved prompt without running agent
# Mining
codeprobe mine . --enrich # Use LLM to improve weak task instructions
codeprobe mine . --org-scale # Mine comprehension tasks (not SDLC)
codeprobe mine . --mcp-families # Include MCP-optimized task families
codeprobe mine . --sg-repo REPO # Sourcegraph repo for ground truth enrichment
codeprobe mine . --backend ast # Tool-independent ground truth (Python + Go AST)
codeprobe mine . --mcp-families # Default: consensus across sourcegraph + ast + grep
codeprobe mine . --mcp-families --consensus-threshold 0.9 # Stricter agreement
codeprobe mine . --mcp-families --consensus-backends ast,grep # Drop a backend
codeprobe mine . --mcp-families --no-consensus # UNSAFE: legacy single-backend GT
codeprobe mine . --preset quick # Quick scan: count=3
codeprobe mine . --preset mcp # MCP eval: org-scale + MCP families + enrich
# Cross-validate after mining
codeprobe mine-cross-validate ./.codeprobe/tasks --threshold 0.6
# Mine profiles (save/load custom flag combinations)
codeprobe mine --save-profile my-setup --count 10 --org-scale .
codeprobe mine --profile my-setup . # Load saved flags
codeprobe mine --list-profiles # Show available profiles
# Experiment configs
codeprobe experiment add-config . --preamble sourcegraph # Attach MCP preamble
codeprobe experiment add-config . --mcp-config config.json # Attach MCP server
codeprobe experiment add-config . --hide-local-source # sg-only mode (v0.10.0+):
# stash workspace source for
# the run; restore after.
# Pair with --preamble
# sourcegraph. Incompatible
# with SDLC tasks.
# Diagnostics
codeprobe doctor # Check agents, API keys, git, Python
codeprobe preambles list # Show available preambles at all levels
# Output
codeprobe interpret . --format csv # Export for pivot tables
codeprobe interpret . --format html  # Self-contained HTML report
```

`codeprobe experiment aggregate` prints per-config metrics and pairwise deltas, and emits reports/aggregate.json. It also runs three lightweight bias detectors so silent measurement artifacts don't get reported as real signal.
When a warning fires, it appears above the score table as `[<kind>] <message>` and is mirrored to aggregate.json under `bias_warnings`.
| Warning kind | What it means | What to do |
|---|---|---|
| `backend_overlap` | A config's MCP tool surface includes a backend that produced the ground truth. | Do not report the with-MCP win as tool value — it may be tautological. Use an independent oracle (AST, hand-curated GT) instead. |
| `overshipping` | The losing config scored ≈0 with recall ≈1.0; it found everything but was over-shipping. | Likely measures a tool capability boundary, not tool quality. Tighten the loser's tool surface or expand the GT. |
| `no_independent_baseline` | Every task's GT comes from a single backend reachable by some configs but not all. | Aggregate winner is suppressed (pairwise deltas hidden). Mine GT with a different backend before declaring a winner. |
Pass `--no-warn` to suppress the stdout block and re-enable winner ranking — useful when scripting. The structured `bias_warnings` array is always written to aggregate.json regardless.
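For scripting, a minimal sketch of a CI gate over that array; only `bias_warnings` itself is documented here, so the per-entry `kind` and `message` fields are assumptions to check against your actual aggregate.json:

```python
# Minimal CI gate over reports/aggregate.json. Only "bias_warnings" is
# documented above; the per-entry "kind"/"message" fields are assumptions.
import json
import sys

with open("reports/aggregate.json") as f:
    report = json.load(f)

warnings = report.get("bias_warnings", [])
for warning in warnings:
    print(f"[{warning.get('kind', 'unknown')}] {warning.get('message', '')}", file=sys.stderr)

# Fail the job if any bias warning was emitted.
sys.exit(1 if warnings else 0)
```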
```bash
codeprobe experiment aggregate ./mcp-comparison
codeprobe experiment aggregate ./mcp-comparison --no-warn   # for CI / pivots
```

- Claude Code (`--agent claude`) — headless via `claude -p`
- GitHub Copilot (`--agent copilot`) — via Copilot CLI
- Codex (`--agent codex`) — via OpenAI API
- Custom agents via the `AgentAdapter` protocol (see the sketch below)
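As a rough illustration of what a custom adapter can look like, here is a hypothetical sketch using Python's `typing.Protocol`; the method name and signature are invented, so consult codeprobe's `AgentAdapter` definition for the real contract:

```python
# Hypothetical custom-agent sketch. The real AgentAdapter protocol is defined
# inside codeprobe; the method name and signature here are invented to show
# the general plugin shape, not to document the actual API.
import subprocess
from typing import Protocol


class AgentAdapter(Protocol):
    """Stand-in for codeprobe's adapter protocol (illustrative only)."""

    def run_task(self, prompt: str, workdir: str) -> str: ...


class MyLocalAgent:
    """Adapter that shells out to a hypothetical local CLI agent."""

    def run_task(self, prompt: str, workdir: str) -> str:
        # 'my-agent' is a placeholder for whatever CLI you want to benchmark.
        result = subprocess.run(
            ["my-agent", "--prompt", prompt],
            cwd=workdir,
            capture_output=True,
            text=True,
            check=False,
        )
        return result.stdout
```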
Supported platforms: GitHub, GitLab, Bitbucket, Azure DevOps, Gitea/Forgejo, and local repos.
Configuration lives in `experiment.json` (created by `codeprobe init` or `codeprobe experiment init`). CLI flags override `experiment.json` values — precedence: built-in defaults < `experiment.json` < CLI flags.
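As a quick illustration of that precedence (the values below are examples, not codeprobe's actual defaults):

```python
# Later sources override earlier ones: defaults < experiment.json < CLI flags.
# All values here are examples for illustration only.
defaults = {"timeout": 300, "model": "claude-haiku-4-5-20251001"}
experiment_json = {"timeout": 600}   # value set in experiment.json
cli_flags = {"model": "opus-4"}      # e.g. `codeprobe run . --model opus-4`

effective = {**defaults, **experiment_json, **cli_flags}
print(effective)  # {'timeout': 600, 'model': 'opus-4'}
```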
Run-time observability is on by default: a Rich Live dashboard in a TTY, and JSON event lines with `--log-format json` for CI. Cost budget warnings at the 80% and 100% thresholds are always visible on stderr.
Apache-2.0