Add creator score metrics for benchmark creation quality #1
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Thank you for this! The tests and validation script are useful, and I like having a reproducible side-by-side against the canonical grid. I’d like to hold merge until the score matches BenchBench’s current creator-signal semantics. Right now both proposed scores can reward zero-heavy rows. The repo’s current framing treats many low-nonzero cells as the primary signal and treats zeros as an audit warning. Could you revise the continuous score to preserve that ordering and zero penalty, or keep raw difficulty as a separate diagnostic rather than a creator-quality leaderboard? |
|
Hello Rohit, sorry for the slow reply, this landed in my inbox and I only caught it yesterday. Thanks for the careful review. Both points land, and they clarified something I hadn't fully resolved myself: whether BenchBench is measuring difficulty or quality (i.e. how well a benchmark concentrates solvers in the 1-14 band). My original Two things I'd propose: 1. Keep 2. Replace the quality score with a continuous per-cell curve.
The design intent:
Averaging a continuous per-cell value (rather than a discrete band count) is what gives the final score its continuity: small changes in solver scores produce small changes in the ranking, instead of cliff edges at the band boundaries.
Applied to the Canonical Round 3 grid
Excluding each creator's own cell, mapping the rest through the curve and averaging:
GPT-5.2 stays clearly on top, consistent with the existing methods and the README narrative. A couple of things worth highlighting:
The exact curve here is an example; the shape (and especially how hard 0 is penalized) is very much up for discussion. I still want to experiment with different distributions against real grids to find the shape that best represents creator quality. But before I implement this and update the PR, does this direction look right to you? |
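To make the averaging idea concrete, here is a minimal sketch under stated assumptions. The tent-shaped `cell_value` curve, its peak at 7, and the name `quality_sketch` are all hypothetical and are not the curve proposed in this thread; rows are assumed to be already parsed into numeric scores out of 30.

```python
from typing import Optional, Sequence

MAX_SCORE = 30  # cells are scored out of 30
PEAK = 7        # hypothetical peak of the illustrative curve


def cell_value(score: float) -> float:
    """Hypothetical tent curve: 0 at score 0, 1 at PEAK, back to 0 at MAX_SCORE."""
    if score <= 0 or score >= MAX_SCORE:
        return 0.0
    if score <= PEAK:
        return score / PEAK
    return (MAX_SCORE - score) / (MAX_SCORE - PEAK)


def quality_sketch(row: Sequence[float], creator_index: Optional[int] = None) -> float:
    """Average the per-cell curve over every cell except the creator's own."""
    others = [s for i, s in enumerate(row) if i != creator_index]
    if not others:
        return 0.0  # assumption: a row with no solver cells carries no signal
    return sum(cell_value(s) for s in others) / len(others)
```

With this shape, moving one solver from 14 to 15 shifts the row average by a small amount instead of dropping a whole cell out of the band, which is the continuity property described above.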
|
Sorry it took me a bit to come back, just looked at this. This direction looks right conceptually: keep raw difficulty as a diagnostic, and make creator quality a separate metric. I’d be careful about making the per-cell curve do too much policy work, though. The main invariant I want to preserve is still: many low-nonzero cells first, zeros as a strong audit penalty, then continuous shape as a secondary signal. The curve can smooth the ranking inside that structure, but it shouldn’t let one or two peak-band cells plus a wall of zeros beat a row that is broadly in the useful band. Maybe the right structure is multiple rankings or markings rather than one score carrying everything: a creator-quality ranking, a raw difficulty diagnostic, and explicit zero-wall/audit flags. That would keep the interpretation clearer while still allowing the continuous curve to add texture. |
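One way to read the invariant above is as a lexicographic sort key, sketched below. This is an illustration only, not the logic in best_creator_signal_row: the name `invariant_key`, the exclusion of the creator's own cell, and the `texture` argument (standing in for whatever continuous score is used) are assumptions.

```python
from typing import Optional, Sequence, Tuple

BAND_LOW, BAND_HIGH = 1, 14  # the useful band, 1-14 out of 30


def invariant_key(
    row: Sequence[int],
    creator_index: Optional[int] = None,
    texture: float = 0.0,
) -> Tuple[int, int, float]:
    """Key for descending sort: band count first, fewer zeros second, texture last."""
    others = [s for i, s in enumerate(row) if i != creator_index]
    low_nonzero = sum(BAND_LOW <= s <= BAND_HIGH for s in others)
    zeros = sum(s == 0 for s in others)
    return (low_nonzero, -zeros, texture)


# Made-up rows: one broadly in the band, one peak cell plus a wall of zeros.
rows = {"broad": [30, 12, 13, 14, 11, 9], "peaky": [30, 7, 0, 0, 0, 0]}
texture = {"broad": 0.3, "peaky": 0.9}  # hypothetical continuous scores
ranked = sorted(
    rows,
    key=lambda name: invariant_key(rows[name], 0, texture[name]),
    reverse=True,
)
```

Because tuples compare element by element, the higher texture value of the peaky row cannot lift it above the broad row; texture only breaks ties among rows with the same band and zero counts.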
Introduce quality_index and creator_score_quality for continuous benchmark quality scoring, plus a build script and markdown report over canonical grids. Co-authored-by: Cursor <cursoragent@cursor.com>
… and validation Co-authored-by: Cursor <cursoragent@cursor.com>
|
Thanks for the steer, this matches how I ended up thinking about it. I've implemented the three-part structure you suggested. Summary of what's in the update: 1. Creator quality, continuous per-cell curve (primary ranking).
2. Raw difficulty, kept as a separate diagnostic, not part of the quality leaderboard, per your earlier point.
3. Zeros, surfaced as an explicit per-row
On your invariant ("many low-nonzero cells first, zeros as a strong audit penalty, then continuous shape as secondary; don't let one or two peak cells plus a wall of zeros beat a broadly-useful row"): I took that seriously and tried to pressure-test it rather than just assert it. A few things I want to be transparent about:
On the per-cell curve doing too much policy work. I agree, and I deliberately kept the curve from carrying the non-compensation guarantee. The reason is structural: an averaging score can't simultaneously have within-band texture (7 worth more than 14), strict non-compensation of zeros, and continuity; the three are mutually incompatible in the limit. So rather than force the curve to do all three, the curve provides texture, and zeros are handled by
On whether to make the zero-wall a structural flag (tiering rows) vs. an informational marking. I built a Monte Carlo harness (
Given that, I left the zero-wall as an explicit informational marking rather than a structural tier. It keeps the interpretation clean (your "multiple markings" idea) without a threshold doing heavy lifting for a ~0.02% gain. But this is a genuine judgment call and easy to flip: if you'd rather the flag hard-partition the ranking, it's a small change.
One real-world case that informed this: in exp 008 (Fable / Rosetta Fieldwork), two solver cells are 0/30, but the run summary marks both as infrastructure failures (a CLI 400 and a timeout), not real zeros. A structural zero-flag would have penalized Fable's benchmark for failures that have nothing to do with its quality. The informational
Validation against the canonical Round 3 grid (
The quality ranking reproduces your categorical reads: the incumbent dominates, the too-easy/saturated rows cluster at the bottom without collapsing into a flat tie, and the Gemini challengers sit in between. Calibration rationale (why γ=7, floor=10, including the parameter sweep and the trade-offs) is in
Tests: 44 unit tests covering the curve anchors, symmetry, creator-index exclusion, and the diagnostics.
Happy to adjust the curve shape or flip the zero-wall to structural if you'd prefer; I wanted to land the structure first and let the calibration be a separate conversation. |
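For readers following the zero-wall discussion, a minimal sketch of an informational marking is shown below. It is not the code in this PR: the function name, the `infra_failures` argument (indices of cells a run summary marks as infrastructure failures), and the `wall_fraction` threshold of 0.5 are assumptions chosen for illustration, and the example row is made up.

```python
from typing import AbstractSet, Dict, Optional, Sequence, Union


def zero_wall_marking(
    row: Sequence[int],
    creator_index: Optional[int] = None,
    infra_failures: AbstractSet[int] = frozenset(),
    wall_fraction: float = 0.5,  # hypothetical threshold
) -> Dict[str, Union[int, bool]]:
    """Report zero counts for a row without changing its rank."""
    # Skip the creator's own cell and any cell known to be an infrastructure failure.
    counted = [
        s for i, s in enumerate(row)
        if i != creator_index and i not in infra_failures
    ]
    zeros = sum(s == 0 for s in counted)
    return {
        "zeros": zeros,
        "cells": len(counted),
        "zero_wall": bool(counted) and zeros / len(counted) >= wall_fraction,
    }


# Made-up row: three 0/30 cells, two of which are known infrastructure failures.
raw = zero_wall_marking([12, 0, 0, 0, 11, 8], creator_index=0)
audited = zero_wall_marking([12, 0, 0, 0, 11, 8], creator_index=0, infra_failures={1, 2})
```

Keeping the marking separate from the score means an audit can reclassify a cell as an infrastructure failure and clear the flag without touching the ranking itself.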
Summary
Adds two complementary continuous scores for ranking benchmark creators
in BenchBench grids. Both complement (not replace)
best_creator_signal_row from build_6x6_result_artifacts.py, which returns a single winner without
quantifying how much better one creator is than another.
Context
Per our exchange on X — this addresses the formalization you mentioned wasn't
yet resolved. The goal is a continuous, comparable score per creator, so the
output is a ranked leaderboard rather than just a categorical winner.
What's included
scripts/creator_score.py: two ranking functions.
creator_score(row): bands-based score in [0, 1]. Faithful translation of best_creator_signal_row's philosophy into a continuous score: rewards
solvers landing in the useful band (1–14/30) and low row mean. Conservative
starting point.
creator_score_difficulty(row, creator_index=None): continuous score max(0, 1 - mean_others/30). Excludes the creator's own cell when computing the
mean. Discontinuity at zero (broken benchmark) is intentional: a benchmark
no solver can crack provides no signal regardless of magnitude.
Both functions are self-contained: parsing helpers are duplicated rather than
imported from build_6x6_result_artifacts.py to avoid coupling to a script
module. Happy to consolidate parsing into a shared module in a separate PR
if you prefer.
tests/test_creator_score.py: 28 unit tests covering both functions, spanning
edge cases (empty rows, malformed cells, invalid creator_index), band
boundaries (14/15), broken benchmark handling, continuity of the new score,
and ranking ordering.
scripts/validate_creator_score.py: reproducible side-by-side comparison against the canonical Round 3 grid. Run with:
Side-by-side results on Canonical Round 3
All three methods (bands, difficulty, best_creator_signal_row) agree on
the #1 winner: GPT-5.2 with Reimbursement Forensics. The difficulty score
provides more granular discrimination among the rest, where bands collapses
3 of 6 creators into 0%.
Design notes
Why two functions, not one? Different operationalizations of "good
creator". creator_score is conservative and matches your existing best_creator_signal_row logic; creator_score_difficulty is a simpler,
more continuous alternative directly motivated by the README's framing
("strong models cannot simply clear"). Up to you which (if any) to
integrate.
On zeros: a single non-zero score "saves" the benchmark. best_creator_signal_row penalizes each zero cell individually (via -int(stats["zero"]) in the tiebreaker), treating them as a red flag. creator_score_difficulty takes a different stance: the benchmark is
considered broken only if every solver scored zero (i.e., mean_others = 0).
If at least one solver landed any non-zero score, the benchmark demonstrated
it can be solved; the rest being zero just means it was too hard for those
models, not that it's malformed. This matters more as panels mix model
generations: a benchmark a frontier model cracks while older or smaller
models score zero shouldn't be penalized for the older models' failure.
Genuine misspecification still surfaces as all-zero rows and collapses to 0.
Why doesn't creator_score_difficulty reach exactly 1? By construction. 1 - mean_others/30 reaches 1 only when mean_others = 0, but that case
triggers the broken-benchmark guard (returns 0). The score therefore
approaches 1 asymptotically as mean_others approaches 0 from above:
the more solvers there are and the lower they score (without hitting all
zeros), the closer the score gets. This reflects the metric's logic
honestly without artificial normalization.
What I considered but didn't include: an own_score factor (penalizing
creators who fail their own benchmark). It's tempting, but conflates
"harder benchmark" with "creator competence", and the README frames the
problem as the former. Left for a possible follow-up if of interest.
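The difficulty formula and its broken-benchmark guard from the notes above can be sketched as follows. This is an illustration, not the code in scripts/creator_score.py: the name `difficulty_sketch` is hypothetical, cells are assumed to be already parsed into integers out of 30, and returning 0 for an empty row is an assumption.

```python
from typing import Optional, Sequence

MAX_SCORE = 30  # cells are scored out of 30


def difficulty_sketch(row: Sequence[int], creator_index: Optional[int] = None) -> float:
    """max(0, 1 - mean_others/30), with an all-zero row treated as broken."""
    others = [s for i, s in enumerate(row) if i != creator_index]
    if not others:
        return 0.0  # assumption: an empty row carries no signal
    mean_others = sum(others) / len(others)
    if mean_others == 0:
        return 0.0  # broken-benchmark guard: no solver scored anything
    return max(0.0, 1.0 - mean_others / MAX_SCORE)


# With five other solvers, the highest reachable value is one solver at 1/30
# and the rest at 0: 1 - (1/5)/30 = 1 - 1/150.
near_top = difficulty_sketch([30, 1, 0, 0, 0, 0], creator_index=0)
broken = difficulty_sketch([30, 0, 0, 0, 0, 0], creator_index=0)
```

Under the integer-score assumption, a row with n other solvers tops out at 1 - 1/(30n), which is why the score gets closer to 1 as the panel grows but never reaches it.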
Future work (optional, not part of this PR)
Broader backend support: BenchBench currently routes to 4 specific agent
CLIs (Codex, Antigravity, Claude Code, cursor-agent). Adding an OpenCode
or Aider backend would unlock ~75 model providers via a single integration,
significantly broadening which models can participate in sweeps. Happy to
explore as a separate PR if of interest.
How to test