chore: benchmark corpus, baseline results, and implementation requirements by stephenc222 · Pull Request #109 · Stephen-Collins-tech/hotspots

stephenc222 · 2026-06-27T14:38:44Z

Summary

Adds benchmarks/ — a reproducible public benchmark that validates hotspots' bug-prediction accuracy across 7 well-known, deliberately hard repos
Establishes the v1.25.3 baseline (mean ρ=+0.350 across 7 repos; curl/curl P@10=1.00)
Adds docs/requirements/ with four fully-specified implementation specs (REQ-001 through REQ-004)
Adds STATUS.md as the working tracker for open implementation work

Benchmark baseline (v1.25.3, ARS formula, no trained ranker)

Repo	Language	ρ	P@10
curl/curl	C	+0.476	1.00
redis/redis	C	+0.476	0.70
facebook/react	JavaScript	+0.352	0.50
git/git	C	+0.340	0.50
django/django	Python	+0.293	0.70
golang/go	Go	+0.265	0.00
microsoft/vscode	TypeScript	+0.251	0.40
mean		+0.350	0.54
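
For readers unfamiliar with the two columns, here is a minimal sketch of how they could be computed. It assumes ρ is Spearman's rank correlation between a file's risk score and its later bug-fix count, and that P@10 is the fraction of the ten highest-scored files with at least one bug fix in the label window. Both definitions, and the function names `spearman_rho` and `precision_at_k`, are assumptions for illustration; the actual logic lives in benchmarks/score.py, which is not shown here.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when either input is constant
    return cov / (vx * vy) ** 0.5


def precision_at_k(scores: Sequence[float], bug_fixes: Sequence[int], k: int = 10) -> float:
    """Fraction of the k highest-scored items with at least one bug fix (assumed definition)."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if bug_fixes[i] > 0) / k


# The "mean" row is the unweighted average over the seven repos in the table.
rhos = [0.476, 0.476, 0.352, 0.340, 0.293, 0.265, 0.251]
p_at_10 = [1.00, 0.70, 0.50, 0.50, 0.70, 0.00, 0.40]
mean_rho = round(sum(rhos) / len(rhos), 3)       # 0.35
mean_p10 = round(sum(p_at_10) / len(p_at_10), 2)  # 0.54
```

Under these definitions a P@10 of 0.00 (golang/go) would mean that none of the ten top-ranked files had a labeled bug fix, even though the overall rank correlation is positive.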

Corpus: 7 repos · features pinned to pre-2024 SHAs · labels from 2024 bug-fix commits only · fixed label window prevents score drift over time.
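
The fixed label window can be sketched as a date filter over commit history. The snippet below is illustrative only: the window bounds, the `BUG_FIX_PATTERN` regex, and the `count_bug_fixes` helper are hypothetical stand-ins, not the rules used by benchmarks/label.py.

```python
import re
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple

# Assumed window: calendar year 2024 in UTC (half-open interval).
WINDOW_START = datetime(2024, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2025, 1, 1, tzinfo=timezone.utc)

# Hypothetical bug-fix heuristic; the real labeling convention is not shown in this PR.
BUG_FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug)\b", re.IGNORECASE)

Commit = Tuple[datetime, str, List[str]]  # (timestamp, message, touched files)


def count_bug_fixes(commits: Iterable[Commit]) -> Dict[str, int]:
    """Count bug-fix commits per file, considering only commits inside the window."""
    counts: Dict[str, int] = {}
    for timestamp, message, files in commits:
        if not (WINDOW_START <= timestamp < WINDOW_END):
            continue  # outside the fixed label window, so it never affects labels
        if not BUG_FIX_PATTERN.search(message):
            continue
        for path in files:
            counts[path] = counts.get(path, 0) + 1
    return counts


sample = [
    (datetime(2023, 12, 31, tzinfo=timezone.utc), "fix: off-by-one", ["a.c"]),
    (datetime(2024, 3, 5, tzinfo=timezone.utc), "Fix null deref", ["a.c", "b.c"]),
    (datetime(2024, 6, 1, tzinfo=timezone.utc), "docs: update readme", ["README.md"]),
    (datetime(2025, 1, 2, tzinfo=timezone.utc), "fix crash", ["b.c"]),
]
labels = count_bug_fixes(sample)
```

Because the window is closed on both ends, commits that land after it cannot change the labels, which is why scores for a given version should stay stable when the benchmark is re-run later.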

Requirements docs ready to implement

REQ	What	Effort
REQ-001	`HistoryDepth` tier annotation on ranked output	Small
REQ-002	`convention_bug_fix_count` as 10th ranker feature	Minimal
REQ-003	`--explain` phrase layer (✦ lines on CRITICAL/HIGH)	Medium
REQ-004	Public benchmark corpus spec (partially implemented here)	—

REQ-002 triggers a benchmark re-run when it ships.

Test plan

benchmarks/label.py and benchmarks/score.py run without error on a repo with a bare clone available
benchmarks/versions/v1.25.3.json is valid JSON (python3 -m json.tool benchmarks/versions/v1.25.3.json)
benchmarks/RESULTS.md renders correctly on GitHub
STATUS.md renders correctly on GitHub
No changes to any Rust source — this PR is docs and tooling only
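
The JSON validity item could also be checked from Python rather than the shell. This is a small sketch using only the standard library; the helper names `is_valid_json_text` and `is_valid_json_file` are made up for this example.

```python
import json
from pathlib import Path


def is_valid_json_text(text: str) -> bool:
    """Return True if text parses as JSON, mirroring what json.tool would accept."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def is_valid_json_file(path: str) -> bool:
    """Return True if the file exists, decodes as UTF-8, and parses as JSON."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except (OSError, ValueError):  # missing file or undecodable bytes
        return False
    return is_valid_json_text(text)
```

For example, `is_valid_json_file("benchmarks/versions/v1.25.3.json")` should return True when run from a checkout of this branch.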

🤖 Generated with Claude Code

REQ-001: history depth tier annotation on ranked output
REQ-002: convention_bug_fix_count as 10th ranker feature
REQ-003: ranker explanation layer (✦ phrases, --explain flag)
REQ-004: public benchmark corpus design with pinned SHAs and versioned results

All four backed by findings from hotspots-research. REQ-001–003 are ready to implement. REQ-004 benchmark infrastructure is partially complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stephenc222 force-pushed the chore/benchmarks-and-requirements branch from 23e1ced to 5ed4dd5 on June 27, 2026 15:50

stephenc222 wants to merge 1 commit into main from chore/benchmarks-and-requirements
