Add CodeSheriff results #24
Conversation
---
Small correction: our website is https://thecodesheriff.com (not codesheriff.dev as noted in the email).
---
Hey, thanks for looking at this. I put together an evaluation guide so you can reproduce our results independently. Full doc is here: https://github.com/vishkulkarni2/codesheriff/blob/main/MARTIAN-EVALUATION-GUIDE.md

The short version to get it running:

```bash
git clone https://github.com/vishkulkarni2/codesheriff.git
cd codesheriff
pnpm install && pnpm build
export ANTHROPIC_API_KEY="your-key"

# scan a single PR (point it at a directory with the diff files)
node packages/cli/dist/cli.js scan /path/to/pr-files --json

# or run the full benchmark (all 50 PRs, takes ~30-60 min)
export GITHUB_TOKEN="ghp_..."
python3 scripts/benchmark-runner.py
```

The benchmark runner fetches each PR diff via the GitHub API, runs CodeSheriff, applies filtering and dedup, and outputs benchmark_data.json in your format. From there you can run your standard evaluation pipeline (steps 2-5).

You will need Node.js 20+, pnpm 8+, and an Anthropic API key for the LLM detectors. semgrep and trufflehog are optional (they power the static analysis stages and are not required for the core pipeline).

Our results across the judge models:

- Claude Opus 4.5 judge: F1=64.6% (P=55.3%, R=77.6%)
- Claude Sonnet 4.5 judge: F1=64.2% (P=55.1%, R=76.9%)
- Average F1: 64.4%

The candidates.json and evaluations.json files are in this PR under the results directories. Let me know if anything is unclear or if you run into issues.
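For orientation, here is a rough sketch of the loop that description implies. This is not the actual scripts/benchmark-runner.py: the PR list file (prs.json), the output schema, and the elided filtering/dedup step are all assumptions.

```python
# Hypothetical sketch of the benchmark runner loop described above.
# Run from the repo root so the CLI path resolves; requires `pip install requests`.
import json
import os
import subprocess
import tempfile

import requests

HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    # This Accept header makes the GitHub API return the raw unified diff.
    "Accept": "application/vnd.github.diff",
}

def fetch_pr_diff(owner: str, repo: str, number: int) -> str:
    """Fetch the unified diff for one benchmark PR via the GitHub API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    resp = requests.get(url, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.text

def scan_diff(diff_text: str) -> list:
    """Write the diff into a temp dir and run the CodeSheriff CLI over it."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "pr.diff"), "w") as f:
            f.write(diff_text)
        out = subprocess.run(
            ["node", "packages/cli/dist/cli.js", "scan", tmp, "--json"],
            capture_output=True, text=True, check=True,
        )
    return json.loads(out.stdout)  # assumes --json prints findings to stdout

# prs.json (assumed input) lists the 50 benchmark PRs as owner/repo/number records.
with open("prs.json") as f:
    prs = json.load(f)

results = []
for pr in prs:
    findings = scan_diff(fetch_pr_diff(pr["owner"], pr["repo"], pr["number"]))
    results.append({"pr": pr, "findings": findings})  # output schema is a guess

# The real runner also applies filtering and dedup before writing this file.
with open("benchmark_data.json", "w") as f:
    json.dump(results, f, indent=2)
```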
---
Hey! Just made our repo public so you should be able to access everything now: https://github.com/vishkulkarni2/codesheriff

The evaluation guide is at MARTIAN-EVALUATION-GUIDE.md in the root of the repo. It covers the scoring rubric, how to run the benchmark, and what to look for in the results. Our GitHub bot username is
CodeSheriff is an AI code safety scanner focused on detecting bugs introduced by AI coding assistants, with self-improving detection via an autotune feedback loop. Evaluated on 49/50 benchmark PRs using the official pipeline (steps 2-5).

Results:

- Claude Opus 4.5 judge: F1=64.6% (P=55.3%, R=77.6%)
- Claude Sonnet 4.5 judge: F1=64.2% (P=55.1%, R=76.9%)
- Average F1: 64.4%

Ranked #1 across both judge models on this evaluation (vs Cubic v2 61.8%/61.4%, Augment 53.5%/53.4%).

Website: https://thecodesheriff.com
GitHub App: https://github.com/apps/codesheriff-review

Rebased onto latest main (2026-05-02). All existing tool results preserved; only codesheriff entries added to candidates.json, evaluations.json, and benchmark_data.json.
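A quick sanity check for reviewers: F1 is the harmonic mean of precision and recall, so the reported F1s should follow directly from the reported P/R values, and they do:

```python
# F1 as the harmonic mean of precision (p) and recall (r).
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(f"{f1(0.553, 0.776):.1%}")  # 64.6% (Claude Opus 4.5 judge)
print(f"{f1(0.551, 0.769):.1%}")  # 64.2% (Claude Sonnet 4.5 judge)
```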
Force-pushed from a4b0b09 to 4ea8bfe.
---
Rebased onto latest main (2026-05-02) to resolve merge conflicts; all other tool data preserved, only CodeSheriff entries added.

Quick data integrity summary for reviewers:

- F1 scores compute cleanly from the raw evaluations.json (see the sketch below)
- Reproducibility: MARTIAN-EVALUATION-GUIDE.md
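A minimal sketch of that recomputation, assuming evaluations.json is a flat list of judged findings with a "verdict" field taking values "tp"/"fp"/"fn" (the actual schema is documented in the guide and may differ):

```python
import json
from collections import Counter

# Tally judge verdicts; the field name and values are assumptions about the schema.
with open("evaluations.json") as f:
    counts = Counter(e["verdict"] for e in json.load(f))

tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.1%}  R={recall:.1%}  F1={f1:.1%}")
```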
Add CodeSheriff (AI code safety scanner) to offline benchmark.
CodeSheriff is an AI code review tool focused on detecting bugs in AI-generated code, with self-improving detection via autotune.
Evaluated on 49/50 PRs:

- Claude Opus 4.5 judge: F1=64.6% (P=55.3%, R=77.6%)
- Claude Sonnet 4.5 judge: F1=64.2% (P=55.1%, R=76.9%)
- Average F1: 64.4%

Website: https://thecodesheriff.com