chore: benchmark corpus, baseline results, and implementation requirements #109

Open
stephenc222 wants to merge 1 commit into main from chore/benchmarks-and-requirements

stephenc222 commented Jun 27, 2026

Contributor

Summary

  • Adds benchmarks/ — a reproducible public benchmark that validates hotspots' bug-prediction accuracy across 7 well-known, deliberately hard repos
  • Establishes the v1.25.3 baseline (mean ρ=+0.350 across 7 repos; curl/curl P@10=1.00)
  • Adds docs/requirements/ with four fully-specified implementation specs (REQ-001 through REQ-004)
  • Adds STATUS.md as the working tracker for open implementation work

Benchmark baseline (v1.25.3, ARS formula, no trained ranker)

| Repo | Language | ρ | P@10 |
|---|---|---|---|
| curl/curl | C | +0.476 | 1.00 |
| redis/redis | C | +0.476 | 0.70 |
| facebook/react | JavaScript | +0.352 | 0.50 |
| git/git | C | +0.340 | 0.50 |
| django/django | Python | +0.293 | 0.70 |
| golang/go | Go | +0.265 | 0.00 |
| microsoft/vscode | TypeScript | +0.251 | 0.40 |
| mean | | +0.350 | 0.54 |
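
For readers unfamiliar with the two columns, here is a minimal sketch of how such metrics are commonly computed. It assumes ρ is the Spearman rank correlation between per-file hotspot scores and per-file bug-fix counts, and that P@10 is the fraction of the ten highest-scored files with at least one bug-fix label. This PR does not state the exact definitions used by benchmarks/score.py, so the function names and both definitions below are assumptions, not that script's implementation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # Correlation is undefined for a constant vector; assumed 0 here.
    return cov / (var_x * var_y) ** 0.5


def precision_at_k(scores, bug_fix_counts, k=10):
    """Fraction of the k highest-scored files with at least one bug-fix label."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if bug_fix_counts[i] > 0) / k
```

Under these assumptions, ρ rewards getting the whole ordering right while P@10 only looks at the head of the list, which is one way a repo can show a positive ρ alongside a P@10 of 0.00.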

Corpus: 7 repos · features pinned to pre-2024 SHAs · labels from 2024 bug-fix commits only · fixed label window prevents score drift over time.
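
The fixed label window can be illustrated with a short sketch. It assumes the window is calendar year 2024 and uses a hypothetical commit-subject regex to decide what counts as a bug fix; the actual rule in benchmarks/label.py is not shown in this PR, so `BUG_FIX_RE`, `in_label_window`, and `count_bug_fixes` are illustrative names only.

```python
import re
from datetime import date

# Assumed label window: calendar year 2024 as a half-open interval.
WINDOW_START = date(2024, 1, 1)
WINDOW_END = date(2025, 1, 1)

# Hypothetical bug-fix heuristic; the real labeling rule is not given here.
BUG_FIX_RE = re.compile(r"\b(fix(e[sd])?|bug)\b", re.IGNORECASE)


def in_label_window(commit_date):
    """True if the commit falls inside the fixed 2024 window."""
    return WINDOW_START <= commit_date < WINDOW_END


def count_bug_fixes(commits):
    """Count bug-fix commits per file path.

    commits: iterable of (commit_date, subject, paths) tuples.
    Returns a dict mapping path to the number of in-window bug-fix commits.
    """
    counts = {}
    for commit_date, subject, paths in commits:
        if not in_label_window(commit_date):
            continue  # Commits outside the window never change the labels.
        if not BUG_FIX_RE.search(subject):
            continue
        for path in paths:
            counts[path] = counts.get(path, 0) + 1
    return counts
```

Because both ends of the window are fixed, re-running the labeling later yields the same counts even as the upstream repos keep receiving commits, which is what keeps scores comparable across hotspots versions.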

Requirements docs ready to implement

| REQ | What | Effort |
|---|---|---|
| REQ-001 | HistoryDepth tier annotation on ranked output | Small |
| REQ-002 | convention_bug_fix_count as 10th ranker feature | Minimal |
| REQ-003 | --explain phrase layer (✦ lines on CRITICAL/HIGH) | Medium |
| REQ-004 | Public benchmark corpus spec (partially implemented here) | |

REQ-002 triggers a benchmark re-run when it ships.

Test plan

  • benchmarks/label.py and benchmarks/score.py run without error on a repo with a bare clone available
  • benchmarks/versions/v1.25.3.json is valid JSON (python3 -m json.tool benchmarks/versions/v1.25.3.json)
  • benchmarks/RESULTS.md renders correctly on GitHub
  • STATUS.md renders correctly on GitHub
  • No changes to any Rust source — this PR is docs and tooling only
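
The JSON validity item in the test plan could also be scripted rather than run by hand. The sketch below is an equivalent check using only the standard library; `is_valid_json` is a hypothetical helper and is not part of this PR.

```python
import json
from pathlib import Path


def is_valid_json(path):
    """Return True if the file at `path` exists and parses as JSON."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError: file missing or unreadable.
        # ValueError: malformed JSON (json.JSONDecodeError) or bad encoding.
        return False
    return True


if __name__ == "__main__":
    # Path taken from the test plan above.
    print(is_valid_json("benchmarks/versions/v1.25.3.json"))
```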

🤖 Generated with Claude Code

REQ-001: history depth tier annotation on ranked output
REQ-002: convention_bug_fix_count as 10th ranker feature
REQ-003: ranker explanation layer (✦ phrases, --explain flag)
REQ-004: public benchmark corpus design with pinned SHAs and versioned results

All four backed by findings from hotspots-research. REQ-001–003 are
ready to implement. REQ-004 benchmark infrastructure is partially complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stephenc222 force-pushed the chore/benchmarks-and-requirements branch from 23e1ced to 5ed4dd5 on June 27, 2026 at 15:50