
Add creator score metrics for benchmark creation quality #1

Open
itagled wants to merge 5 commits into strangeloopcanon:main from itagled:main

itagled commented May 30, 2026

Summary

Adds two complementary continuous scores for ranking benchmark creators
in BenchBench grids. Both complement (not replace) best_creator_signal_row
from build_6x6_result_artifacts.py, which returns a single winner without
quantifying how much better one creator is than another.

Context

Per our exchange on X — this addresses the formalization you mentioned wasn't
yet resolved. The goal is a continuous, comparable score per creator, so the
output is a ranked leaderboard rather than just a categorical winner.

What's included

scripts/creator_score.py — two ranking functions:

  • creator_score(row): bands-based score in [0, 1]. Faithful translation of
    best_creator_signal_row's philosophy into a continuous score: rewards
    solvers landing in the useful band (1–14/30) and low row mean. Conservative
    starting point.

  • creator_score_difficulty(row, creator_index=None): continuous score
    max(0, 1 - mean_others/30). Excludes the creator's own cell when computing
    mean. Discontinuity at zero (broken benchmark) is intentional — a benchmark
    no solver can crack provides no signal regardless of magnitude.
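
The difficulty formula above can be sketched as follows. This is a minimal illustration, not the code in scripts/creator_score.py: it assumes the row has already been parsed into integer scores out of 30 (the real function works on grid rows and has its own parsing helpers), and the empty-row behaviour is an assumption.

```python
from typing import Optional, Sequence

MAX_SCORE = 30  # each solver cell is scored out of 30


def creator_score_difficulty(row: Sequence[int],
                             creator_index: Optional[int] = None) -> float:
    """Sketch of max(0, 1 - mean_others / 30) with the broken-benchmark guard."""
    # Drop the creator's own cell when an index is given.
    others = [s for i, s in enumerate(row) if i != creator_index]
    if not others:
        return 0.0  # assumption: an empty row carries no signal
    mean_others = sum(others) / len(others)
    if mean_others == 0:
        return 0.0  # every solver scored zero: treated as a broken benchmark
    return max(0.0, 1.0 - mean_others / MAX_SCORE)


# Five non-creator cells with mean 11.8 give 1 - 11.8/30.
print(round(creator_score_difficulty([14, 11, 12, 11, 11]), 3))  # 0.607
# An all-zero row collapses to 0 instead of scoring 1.
print(creator_score_difficulty([0, 0, 0, 0, 0]))  # 0.0
```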

Both functions are self-contained: parsing helpers are duplicated rather than
imported from build_6x6_result_artifacts.py to avoid coupling to a script
module. Happy to consolidate parsing into a shared module in a separate PR
if you prefer.

tests/test_creator_score.py — 28 unit tests covering both functions:
edge cases (empty rows, malformed cells, invalid creator_index), band
boundaries (14/15), broken benchmark handling, continuity of the new score,
and ranking ordering.

scripts/validate_creator_score.py — reproducible side-by-side comparison
against the canonical Round 3 grid. Run with:

python scripts/validate_creator_score.py

Side-by-side results on Canonical Round 3

| Creator | Bands score | Difficulty score |
|---|---|---|
| GPT-5.2 (frozen) | 0.64 (63.8%) | 0.61 (60.7%) |
| Gemini 3.5 Flash | 0.06 (6.4%) | 0.41 (41.3%) |
| Gemini 3.1 Pro | 0.06 (6.4%) | 0.35 (35.3%) |
| GPT-5.5 | 0.00 (0.0%) | 0.20 (20.0%) |
| GPT-5.4 | 0.00 (0.0%) | 0.13 (12.7%) |
| Claude Opus | 0.00 (0.0%) | 0.01 (0.7%) |

All three methods (bands, difficulty, best_creator_signal_row) agree on
the #1 winner: GPT-5.2 with Reimbursement Forensics. The difficulty score
provides more granular discrimination among the rest, where bands collapses
3 of 6 creators into 0%.

Design notes

  • Why two functions, not one? Different operationalizations of "good
    creator". creator_score is conservative and matches your existing
    best_creator_signal_row logic; creator_score_difficulty is a simpler,
    more continuous alternative directly motivated by the README's framing
    ("strong models cannot simply clear"). Up to you which (if any) to
    integrate.

  • On zeros: a single non-zero score "saves" the benchmark.
    best_creator_signal_row penalizes each zero cell individually (via
    -int(stats["zero"]) in the tiebreaker), treating them as a red flag.
    creator_score_difficulty takes a different stance: the benchmark is
    considered broken only if every solver scored zero (i.e., mean_others = 0).
    If at least one solver landed any non-zero score, the benchmark demonstrated
    it can be solved — the rest being zero just means it was too hard for those
    models, not that it's malformed. This matters more as panels mix model
    generations: a benchmark a frontier model cracks while older or smaller
    models score zero shouldn't be penalized for the older models' failure.
    Genuine misspecification still surfaces as all-zero rows and collapses to 0.

  • Why doesn't creator_score_difficulty reach exactly 1? By construction.
    1 - mean_others/30 reaches 1 only when mean_others = 0, but that case
    triggers the broken-benchmark guard (returns 0). The score therefore
    approaches 1 asymptotically as mean_others approaches 0 from above —
    the more solvers there are and the lower they score (without hitting all
    zeros), the closer the score gets. This reflects the metric's logic
    honestly without artificial normalization.

  • What I considered but didn't include: an own_score factor (penalizing
    creators who fail their own benchmark). It's tempting, but conflates
    "harder benchmark" with "creator competence" — and the README frames the
    problem as the former. Left for a possible follow-up if of interest.
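
As a worked instance of the asymptote described above for creator_score_difficulty, assume integer cell scores and let n be the number of non-creator solvers (n is notation introduced here, not a name from the PR). The smallest non-zero mean is then 1/n, which gives the largest attainable score:

```latex
\[
\text{mean\_others}_{\min} = \frac{1}{n},
\qquad
\text{score}_{\max} = 1 - \frac{1}{30\,n},
\qquad
n = 5:\ 1 - \frac{1}{150} \approx 0.9933 .
\]
```

With five non-creator solvers, that maximum is reached by a row with one 1/30 cell and four 0/30 cells.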

Future work (optional, not part of this PR)

Broader backend support: BenchBench currently routes to 4 specific agent
CLIs (Codex, Antigravity, Claude Code, cursor-agent). Adding an OpenCode
or Aider backend would unlock ~75 model providers via a single integration,
significantly broadening which models can participate in sweeps. Happy to
explore as a separate PR if of interest.

How to test

python -m unittest tests.test_creator_score      # 28 tests should pass
python scripts/validate_creator_score.py         # side-by-side ranking

Co-authored-by: Cursor <cursoragent@cursor.com>
@strangeloopcanon

Owner

Thank you for this! The tests and validation script are useful, and I like having a reproducible side-by-side against the canonical grid.

I’d like to hold merge until the score matches BenchBench’s current creator-signal semantics. Right now both proposed scores can reward zero-heavy rows. In creator_score, zeros lower the row mean, so a row with fewer useful 1-14/30 cells but more zeros can outrank a row with more useful cells. In creator_score_difficulty, a row where one non-creator solver gets 1/30 and the rest get 0/30 lands near the top, which makes a near-zero-wall look like strong creator quality.

The repo’s current framing treats many low-nonzero cells as the primary signal and treats zeros as an audit warning. Could you revise the continuous score to preserve that ordering and zero penalty, or keep raw difficulty as a separate diagnostic rather than a creator-quality leaderboard?

itagled commented Jun 5, 2026
Author

Hello Rohit, sorry for the slow reply; this landed in my inbox and I only caught it yesterday.

Thanks for the careful review. Both points land, and they clarified something I hadn't fully resolved myself: whether BenchBench is measuring difficulty or quality (i.e. how well a benchmark concentrates solvers in the 1-14 band). My original creator_score_difficulty leaned toward difficulty, which felt like a natural way to capture model performance, but I now see the repo's framing is closer to quality, and that's where the zero-handling breaks down. A near-zero-wall shouldn't read as strong creator quality, and right now it does.

Two things I'd propose:

1. Keep creator_score_difficulty as a separate diagnostic, not a creator-quality leaderboard.
Pure difficulty still seems useful as a first approximation, and it could serve as a base if a future PR wants to go deeper specifically on benchmark difficulty. I'd reframe/rename it so it's clearly a diagnostic rather than a quality ranking, matching your second suggestion.

2. Replace the quality score with a continuous per-cell curve.
Instead of counting cells in the 1-14 band (which is intrinsically discontinuous; a creator collapses to 0 the moment no solver lands in the band), I'd map each solver cell to a continuous quality_index, then average across non-creator solvers. The curve peaks in the middle of the band, penalizes zeros hardest, and gives low-but-decreasing credit as benchmarks get too easy:

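| score | quality_index | score | quality_index |
|---|---|---|---|
| 0 | -10 | 16 | 20 |
| 1 | 45 | 17 | 18 |
| 2 | 60 | 18 | 17 |
| 3 | 73 | 19 | 15 |
| 4 | 84 | 20 | 14 |
| 5 | 92 | 21 | 13 |
| 6 | 98 | 22 | 11 |
| 7 | 100 | 23 | 10 |
| 8 | 98 | 24 | 9 |
| 9 | 94 | 25 | 8 |
| 10 | 88 | 26 | 7 |
| 11 | 80 | 27 | 6 |
| 12 | 70 | 28 | 5 |
| 13 | 58 | 29 | 4 |
| 14 | 44 | 30 | 3 |
| 15 | 22 | | |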

The design intent:

  • Peak at 7: next to the midpoint (7.5) of the 1-14 band, rewarding balanced difficulty over benchmarks that sit at either edge.
  • Negative value at 0: this is deliberate. It makes zeros penalized more heavily than highs, consistent with your current tiebreaker (-int(stats["zero"]) ordered before -int(stats["high"])). It also works against the failure mode where a row full of zeros plus one mid-band cell outranks a genuinely-too-easy row.
  • Long low tail for 15-30: a too-easy benchmark still scored something, so it stays above a broken one, but well below the useful band.

Averaging a continuous per-cell value (rather than a discrete band count) is what gives the final score its continuity: small changes in solver scores produce small changes in the creator's score, instead of cliff edges at the band boundaries.

Applied to the Canonical Round 3 grid

Excluding each creator's own cell, mapping the rest through the curve and averaging:

| Creator | Non-self cells → quality_index | quality_avg |
|---|---|---|
| GPT-5.2 (frozen) | 14→44, 11→80, 12→70, 11→80, 11→80 | 70.80 |
| Gemini 3.5 Flash | 4→84, 23→10, 15→22, 21→13, 25→8 | 27.40 |
| Gemini 3.1 Pro | 1→45, 26→7, 26→7, 18→17, 26→7 | 16.60 |
| GPT-5.5 | 25→8, 24→9, 23→10, 24→9, 24→9 | 9.00 |
| GPT-5.4 | 27→6, 27→6, 25→8, 27→6, 25→8 | 6.80 |
| Claude Opus | 30→3, 30→3, 30→3, 30→3, 29→4 | 3.20 |
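
The averages above can be reproduced with a plain lookup. This is a sketch of the proposal only: the list holds the quality_index values from the score table earlier in this comment, the non-self cells are the ones listed per creator, and the names QUALITY_INDEX, ROUND3 and quality_avg are illustrative rather than taken from the PR.

```python
# quality_index for scores 0..30, copied from the proposed curve.
QUALITY_INDEX = [
    -10, 45, 60, 73, 84, 92, 98, 100, 98, 94, 88, 80, 70, 58, 44, 22,
    20, 18, 17, 15, 14, 13, 11, 10, 9, 8, 7, 6, 5, 4, 3,
]

# Non-self solver cells per creator on the canonical Round 3 grid.
ROUND3 = {
    "GPT-5.2 (frozen)": [14, 11, 12, 11, 11],
    "Gemini 3.5 Flash": [4, 23, 15, 21, 25],
    "Gemini 3.1 Pro": [1, 26, 26, 18, 26],
    "GPT-5.5": [25, 24, 23, 24, 24],
    "GPT-5.4": [27, 27, 25, 27, 25],
    "Claude Opus": [30, 30, 30, 30, 29],
}


def quality_avg(non_self_cells):
    """Mean quality_index over the non-creator solver cells."""
    return sum(QUALITY_INDEX[s] for s in non_self_cells) / len(non_self_cells)


# Print the leaderboard, best creator first.
for creator in sorted(ROUND3, key=lambda c: quality_avg(ROUND3[c]), reverse=True):
    print(f"{creator:18s} {quality_avg(ROUND3[creator]):6.2f}")
```

One property worth checking with the same helper: quality_avg([7, 0, 0, 0, 0]) is 12.0, so under this example curve a single peak cell plus four zeros would still sit above the 9.00, 6.80 and 3.20 rows.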

GPT-5.2 stays clearly on top, consistent with the existing methods and the README narrative. A couple of things worth highlighting:

  • Gemini 3.1 Pro lands 3rd despite having a 1/30 cell (which scores high at 45). The three 26/30 cells (scoring 7 each) and the 18/30 cell (scoring 17) pull it down, so a single hard cell no longer rescues a benchmark that's too easy for everyone else. This is exactly the near-zero-wall behavior the difficulty score got wrong.
  • The too-easy creators (Claude Opus, GPT-5.4, GPT-5.5) cluster at the bottom without collapsing to a flat 0, so they remain distinguishable from each other.

The exact curve here is an example; the shape (and especially how hard 0 is penalized) is very much up for discussion. I still want to experiment with different distributions against real grids to find the shape that best represents creator quality. But before I implement this and update the PR, does this direction look right to you?

@strangeloopcanon

Owner

Sorry it took me a bit to come back, just looked at this. This direction looks right conceptually: keep raw difficulty as a diagnostic, and make creator quality a separate metric.

I’d be careful about making the per-cell curve do too much policy work, though. The main invariant I want to preserve is still: many low-nonzero cells first, zeros as a strong audit penalty, then continuous shape as a secondary signal. The curve can smooth the ranking inside that structure, but it shouldn’t let one or two peak-band cells plus a wall of zeros beat a row that is broadly in the useful band.

Maybe the right structure is multiple rankings or markings rather than one score carrying everything: a creator-quality ranking, a raw difficulty diagnostic, and explicit zero-wall/audit flags. That would keep the interpretation clearer while still allowing the continuous curve to add texture.
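
One way to read the invariant above is as a lexicographic sort key: count of cells in the useful band first, number of zeros second, and a continuous shape term last. The sketch below is hypothetical and is not BenchBench's best_creator_signal_row; the names invariant_key and placeholder_shape, the example rows, and the shape function itself are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

USEFUL_BAND = range(1, 15)  # scores 1..14 out of 30


def placeholder_shape(score: int) -> float:
    """Hypothetical secondary curve: 0 for a zero cell, highest near 7.5."""
    return 0.0 if score == 0 else 1.0 - abs(score - 7.5) / 22.5


def invariant_key(cells: Sequence[int],
                  shape: Callable[[int], float] = placeholder_shape
                  ) -> Tuple[int, int, float]:
    """Sort key: more useful cells, then fewer zeros, then better shape."""
    useful = sum(1 for s in cells if s in USEFUL_BAND)
    zeros = sum(1 for s in cells if s == 0)
    texture = sum(shape(s) for s in cells) / len(cells)
    # Negate the "higher is better" terms so an ascending sort ranks best first.
    return (-useful, zeros, -texture)


rows = {
    "broad": [14, 13, 12, 13, 14],     # broadly inside the useful band
    "zero_wall": [7, 8, 0, 0, 0],      # two peak cells plus a wall of zeros
    "too_easy": [25, 24, 23, 24, 24],  # nothing in the band, no zeros
}
ranking = sorted(rows, key=lambda name: invariant_key(rows[name]))
print(ranking)  # ['broad', 'zero_wall', 'too_easy']
```

Because the band count is compared before anything else, the two peak cells in the zero-wall row cannot lift it above the broadly useful row, whatever shape function is plugged in; the shape only breaks ties among rows with the same band count and zero count.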

itagled and others added 4 commits June 20, 2026 12:23
Introduce quality_index and creator_score_quality for continuous benchmark
quality scoring, plus a build script and markdown report over canonical grids.

Co-authored-by: Cursor <cursoragent@cursor.com>
… and validation

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
itagled commented Jun 21, 2026
Author

Thanks for the steer; this matches how I ended up thinking about it. I've implemented the three-part structure you suggested. Summary of what's in the update:

1. Creator quality, continuous per-cell curve (primary ranking). creator_score_quality maps each non-creator solver cell through a continuous quality_index and averages. The curve is a rescaled Lorentzian centered at 7.5 (midpoint of the useful band), calibrated so scores 7 and 8 both hit 100, decaying to a floor at 30, with 0 reserved (=0) for broken cells. Two tunable parameters: gamma (bell width) and floor_val (value at 30).

2. Raw difficulty, kept as a separate diagnostic, not part of the quality leaderboard, per your earlier point.

3. Zeros, surfaced as an explicit per-row zero_count marking, decoupled from the score (not folded into the curve).
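
The curve formula itself is not shown in this thread, so the following is only one parameterization consistent with the description in item 1: a Lorentzian centered at 7.5 with width gamma, linearly rescaled so that 7 and 8 map to 100 and 30 maps to floor_val, with 0 mapped to 0. The helper _lorentz and the constants are assumptions. With the defaults quoted in this comment (γ=7, floor=10) it yields about 73.59 for non-self cells 14, 11, 12, 11, 11 and about 10.15 for 30, 30, 30, 30, 29, which agrees with the quality values reported for GPT-5.2 and Claude Opus on the Round 3 grid.

```python
CENTER = 7.5      # midpoint of the 1-14 useful band
PEAK = 100.0      # value assigned to scores 7 and 8
GAMMA = 7.0       # bell width (default quoted in the comment)
FLOOR_VAL = 10.0  # value at 30/30 (default quoted in the comment)


def _lorentz(x, gamma):
    """Unnormalized Lorentzian bump centered at CENTER."""
    return 1.0 / (1.0 + ((x - CENTER) / gamma) ** 2)


def quality_index(score, gamma=GAMMA, floor_val=FLOOR_VAL):
    """Rescaled Lorentzian: 7 and 8 -> PEAK, 30 -> floor_val, 0 -> 0."""
    if score == 0:
        return 0.0  # zero is reserved for broken cells
    hi = _lorentz(7, gamma)   # same value as _lorentz(8, gamma) by symmetry
    lo = _lorentz(30, gamma)
    return floor_val + (PEAK - floor_val) * (_lorentz(score, gamma) - lo) / (hi - lo)


def creator_score_quality(cells, creator_index=None):
    """Average quality_index over the non-creator solver cells."""
    others = [s for i, s in enumerate(cells) if i != creator_index]
    return sum(quality_index(s) for s in others) / len(others)


def zero_count(cells, creator_index=None):
    """Informational marking kept separate from the score."""
    return sum(1 for i, s in enumerate(cells) if i != creator_index and s == 0)


print(round(creator_score_quality([14, 11, 12, 11, 11]), 2))
print(round(creator_score_quality([30, 30, 30, 30, 29]), 2))
```

Under this assumed form, two peak cells plus three zeros average 40, while five cells at the band edge (14/30) average roughly 54.5, which illustrates how quality_index(0)=0 pushes zero-heavy rows down without a separate tier.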

On your invariant ("many low-nonzero cells first, zeros as a strong audit penalty, then continuous shape as secondary; don't let one or two peak cells plus a wall of zeros beat a broadly-useful row"), I took that seriously and tried to pressure-test it rather than just assert it. A few things I want to be transparent about:

On the per-cell curve doing too much policy work. I agree, and I deliberately kept the curve from carrying the non-compensation guarantee. The reason is structural: for any averaging score you can't simultaneously have within-band texture (7 worth more than 14), strict non-compensation of zeros, and continuity; the three are mutually incompatible in the limit. So rather than force the curve to do all three, the curve provides texture, and zeros are handled by quality_index(0)=0 plus the separate zero_count marking.

On whether to make the zero-wall a structural flag (tiering rows) vs. an informational marking. I built a Monte Carlo harness (simulate_creator_score.py, methodology + 50k-grid results committed under scripts/creator_score_analysis/) to check how often the failure mode you flagged actually occurs, across five generative regimes (uniform, realistic mix, clean, weak-solver-correlated-zeros, and adverse-stress). Findings:

  • The inversion you're most worried about, a zero-heavy row outranking a broadly-useful row, stays under 4% in every regime and is ~0.8% in the realistic mix, at the chosen defaults (γ=7, floor=10).
  • A structural zero-wall flag (tiering flagged rows below clean ones) would only correct a further ~0.02% of cases, because quality_index(0)=0 already pushes zero-heavy rows down on its own.

Given that, I left the zero-wall as an explicit informational marking rather than a structural tier. It keeps the interpretation clean (your "multiple markings" idea) without a threshold doing heavy lifting for a ~0.02% gain. But this is a genuine judgment call and easy to flip: if you'd rather the flag hard-partition the ranking, it's a small change.

One real-world case that informed this: in exp 008 (Fable / Rosetta Fieldwork), two solver cells are 0/30, but the run summary marks both as infrastructure failures (a CLI 400 and a timeout), not real zeros. A structural zero-flag would have penalized Fable's benchmark for failures that have nothing to do with its quality. The informational zero_count surfaces "2 zeros, worth a human look" without auto-penalizing, which seems like the right behavior when zeros can be noise.

Validation against the canonical Round 3 grid (validate_creator_score.py now prints quality + zero_count per creator):

| Creator | quality | difficulty | zeros | your verdict |
|---|---|---|---|---|
| GPT-5.2 (frozen) | 73.59 | 0.61 | 0 | target to beat |
| Gemini 3.5 Flash | 36.67 | 0.41 | 0 | separates, too easy at top |
| Gemini 3.1 Pro | 25.47 | 0.35 | 0 | separates, too easy at top |
| GPT-5.5 | 16.43 | 0.20 | 0 | too easy |
| GPT-5.4 | 13.52 | 0.13 | 0 | saturated |
| Claude Opus | 10.15 | 0.01 | 0 | saturated |

The quality ranking reproduces your categorical reads: the incumbent dominates, the too-easy/saturated rows cluster at the bottom without collapsing into a flat tie, and the Gemini challengers sit in between. Calibration rationale (why γ=7, floor=10, including the parameter sweep and the trade-offs) is in creator_score_analysis/creator_score_methodology.md.

Tests: 44 unit tests covering the curve anchors, symmetry, creator-index exclusion, and the diagnostics. Happy to adjust the curve shape or flip the zero-wall to structural if you'd prefer; I wanted to land the structure first and let the calibration be a separate conversation.
