Add CodeSheriff results #24
Conversation
---
Small correction: our website is https://thecodesheriff.com (not codesheriff.dev as noted in the email).
---
Hey, thanks for looking at this. I put together an evaluation guide so you can reproduce our results independently. Full doc is here: https://github.com/vishkulkarni2/codesheriff/blob/main/MARTIAN-EVALUATION-GUIDE.md

The short version to get it running:

```bash
git clone https://github.com/vishkulkarni2/codesheriff.git
cd codesheriff
pnpm install && pnpm build
export ANTHROPIC_API_KEY="your-key"

# scan a single PR (point it at a directory with the diff files)
node packages/cli/dist/cli.js scan /path/to/pr-files --json

# or run the full benchmark (all 50 PRs, takes ~30-60 min)
export GITHUB_TOKEN="ghp_..."
python3 scripts/benchmark-runner.py
```

The benchmark runner fetches each PR diff via the GitHub API, runs CodeSheriff, applies filtering and dedup, and outputs benchmark_data.json in your format. From there you can run your standard evaluation pipeline (steps 2-5).

You will need Node.js 20+, pnpm 8+, and an Anthropic API key for the LLM detectors. semgrep and trufflehog are optional (they power the static analysis stages and are not required for the core pipeline).

Our results across the judge models:

- Claude Opus 4.5 judge: F1=64.6% (P=55.3%, R=77.6%)
- Claude Sonnet 4.5 judge: F1=64.2% (P=55.1%, R=76.9%)
- Average F1: 64.4%

The candidates.json and evaluations.json files are in this PR under the results directories. Let me know if anything is unclear or if you run into issues.
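For orientation, here is a rough sketch of the loop that description implies. This is not the actual scripts/benchmark-runner.py: the PR list file (prs.json), the output schema, and the elided filtering/dedup step are all assumptions.

```python
# Hypothetical sketch of the benchmark runner loop described above.
# Run from the repo root so the CLI path resolves; requires `pip install requests`.
import json
import os
import subprocess
import tempfile

import requests

HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    # This Accept header makes the GitHub API return the raw unified diff.
    "Accept": "application/vnd.github.diff",
}

def fetch_pr_diff(owner: str, repo: str, number: int) -> str:
    """Fetch the unified diff for one benchmark PR via the GitHub API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    resp = requests.get(url, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.text

def scan_diff(diff_text: str) -> list:
    """Write the diff into a temp dir and run the CodeSheriff CLI over it."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "pr.diff"), "w") as f:
            f.write(diff_text)
        out = subprocess.run(
            ["node", "packages/cli/dist/cli.js", "scan", tmp, "--json"],
            capture_output=True, text=True, check=True,
        )
    return json.loads(out.stdout)  # assumes --json prints findings to stdout

# prs.json (assumed input) lists the 50 benchmark PRs as owner/repo/number records.
with open("prs.json") as f:
    prs = json.load(f)

results = []
for pr in prs:
    findings = scan_diff(fetch_pr_diff(pr["owner"], pr["repo"], pr["number"]))
    results.append({"pr": pr, "findings": findings})  # output schema is a guess

# The real runner also applies filtering and dedup before writing this file.
with open("benchmark_data.json", "w") as f:
    json.dump(results, f, indent=2)
```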
---
Hey! Just made our repo public so you should be able to access everything now: https://github.com/vishkulkarni2/codesheriff

The evaluation guide is at MARTIAN-EVALUATION-GUIDE.md in the root of the repo. It covers the scoring rubric, how to run the benchmark, and what to look for in the results. Our GitHub bot username is
CodeSheriff is an AI code safety scanner focused on detecting bugs introduced by AI coding assistants, with self-improving detection via an autotune feedback loop. Evaluated on 49/50 benchmark PRs using the official pipeline (steps 2-5).

Results:

- Claude Opus 4.5 judge: F1=64.6% (P=55.3%, R=77.6%)
- Claude Sonnet 4.5 judge: F1=64.2% (P=55.1%, R=76.9%)
- Average F1: 64.4%

Ranked #1 across both judge models on this evaluation (vs Cubic v2 61.8%/61.4%, Augment 53.5%/53.4%).

Website: https://thecodesheriff.com
GitHub App: https://github.com/apps/codesheriff-review

Rebased onto latest main (2026-05-02). All existing tool results preserved; only codesheriff entries added to candidates.json, evaluations.json, and benchmark_data.json.
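A quick sanity check for reviewers: F1 is the harmonic mean of precision and recall, so the reported F1s should follow directly from the reported P/R values, and they do:

```python
# F1 as the harmonic mean of precision (p) and recall (r).
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(f"{f1(0.553, 0.776):.1%}")  # 64.6% (Claude Opus 4.5 judge)
print(f"{f1(0.551, 0.769):.1%}")  # 64.2% (Claude Sonnet 4.5 judge)
```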
Force-pushed from a4b0b09 to 4ea8bfe.
---
Rebased onto latest main (2026-05-02) to resolve merge conflicts; all other tool data preserved, only CodeSheriff entries added.

Quick data integrity summary for reviewers:

- F1 scores compute cleanly from the raw evaluations.json (see the sketch below)
- Reproducibility: MARTIAN-EVALUATION-GUIDE.md
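A minimal sketch of that recomputation, assuming evaluations.json is a flat list of judged findings with a "verdict" field taking values "tp"/"fp"/"fn" (the actual schema is documented in the guide and may differ):

```python
import json
from collections import Counter

# Tally judge verdicts; the field name and values are assumptions about the schema.
with open("evaluations.json") as f:
    counts = Counter(e["verdict"] for e in json.load(f))

tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.1%}  R={recall:.1%}  F1={f1:.1%}")
```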
Add CodeSheriff (AI code safety scanner) to offline benchmark.
CodeSheriff is an AI code review tool focused on detecting bugs in AI-generated code, with self-improving detection via autotune.
Evaluated on 49/50 PRs:

- Claude Opus 4.5 judge: F1=64.6% (P=55.3%, R=77.6%)
- Claude Sonnet 4.5 judge: F1=64.2% (P=55.1%, R=76.9%)
- Average F1: 64.4%

Website: https://thecodesheriff.com