Motivation
The leaderboard is scored by a single LLM judge (Llama-4-Maverick-17B, per the README's Leaderboards section) across the six rubric dimensions. No published agreement number between that judge and human expert raters on the rubric appears in the repo or the linked paper. Recent agent-benchmark reviews flag single-judge designs as a top reliability gap [1][2].
A 50-scenario subset, double-labelled by two domain experts on the same rubric, gives the leaderboard a defensible "how much does the judge actually agree with humans?" number per dimension. The subset is one-shot work; only the judge rotates against it. The same artifact doubles as the human anchor for any later judge swap, ensemble proposal, or the kappa-ensemble discussion in #281.
Filing this as methodology, not a contribution offer. I don't have the bandwidth to drive the PR end-to-end. The interesting question is whether the protocol fits the roadmap; who picks it up is downstream of that.
Relation to existing work
The existing evaluation flow already covers the three metric types (`exact_string_match`, `numeric_match`, `llm_judge`) against the six-criterion rubric. A `judge_calibration` HF config would run through them unchanged; the only addition is a small post-step that reads the saved trajectories and computes per-dimension kappa. No new plumbing.
Methodology
- Subset (50 scenarios). Stratified across the four agents (IoT, FMSR, TSFM, WO; the HF `type` field tags FMSR scenarios as `FMSA`) and the three groups (retrospective, predictive, prescriptive). The deterministic / non-deterministic split should mirror the population ratio, not deviate from it.
- Two expert raters. Practical minimum for Cohen's kappa. Ideally an industrial-engineer + data-scientist pair drawn from the same pool FailureSensorIQ used for its five-expert human-ceiling measurement. Three-to-five raters tightens the CIs at proportional cost and switches the metric to Fleiss' / Light's kappa.
- Per-dimension agreement metric. Cohen's kappa for nominal dimensions; quadratic-weighted kappa for ordinal. `task completeness`, `clarity`, and `justification` look ordinal, but the per-dimension classification is best settled against the actual judge prompt before fixing a metric.
- Judge scoring. Re-run the existing Maverick-17B judge against the same 50 scenarios. An optional ensemble panel for sensitivity reporting needs Fleiss' / Light's kappa or pairwise-averaged Cohen's instead of plain Cohen's.
- Publication. Per-dimension agreement table on the leaderboard README (or a sibling docs page if that fits better). The labelled subset ships as the `judge_calibration` HF dataset configuration, so future judges can be re-scored without relabelling. The configuration sits alongside `scenarios` as a companion config, not as a schema extension on it. Scenarios follow the existing `docs/guideline/utterance_design_guideline.md` and `docs/guideline/ground_truth_design_guideline.md` templates.
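The metric choice in the list above can be sketched in a few lines. This is a hypothetical harness fragment, not code from the repo: the `ORDINAL` set encodes exactly the open nominal/ordinal classification question, and the label lists are fabricated toy data, not real rubric scores.

```python
# Per-dimension agreement: Cohen's kappa for nominal dimensions,
# quadratic-weighted kappa for ordinal ones (scikit-learn).
from sklearn.metrics import cohen_kappa_score

# Assumed ordinal/nominal split -- to be settled against the judge prompt.
ORDINAL = {"task completeness", "clarity", "justification"}

def per_dimension_kappa(rater_a: dict, rater_b: dict) -> dict:
    """Kappa per rubric dimension between two raters' label lists."""
    return {
        dim: cohen_kappa_score(
            rater_a[dim], rater_b[dim],
            weights="quadratic" if dim in ORDINAL else None,
        )
        for dim in rater_a
    }

# Toy usage with fabricated labels ("hallucination" is a made-up dimension):
a = {"clarity": [1, 2, 3, 3, 2], "hallucination": [0, 1, 1, 0, 1]}
b = {"clarity": [1, 2, 2, 3, 2], "hallucination": [0, 1, 0, 0, 1]}
scores = per_dimension_kappa(a, b)
```

Swapping `weights="quadratic"` for `"linear"` is the only change needed if linear-weighted kappa is preferred for the ordinal dimensions.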
Suggested defaults (each negotiable)
- Per-dimension metric. Cohen's kappa for nominal, quadratic-weighted kappa for ordinal. Which dimension lands in which bucket is best left to whoever owns the judge prompt.
- Subset provenance. Existing public `scenarios`. A held-out hidden slice is worth considering if contamination becomes a concern.
- Expert panel size. Two raters (kappa minimum). Three-to-five brings it in line with the FailureSensorIQ ceiling.
- Acceptance threshold. None hardcoded. Report raw kappa with Landis-Koch bands (slight / fair / moderate / substantial / almost perfect).
- Confidence intervals. At n=50 the 95% bootstrap CI on kappa is wide (half-width roughly 0.20 to 0.25, depending on the value). Reporting the band straight matters more than narrowing it.
- Publication location. Leaderboard README, paper appendix, or a dedicated `docs/CALIBRATION.md`. No strong preference.
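The bootstrap CI and Landis-Koch defaults above are mechanical enough to sketch. The label arrays are hypothetical placeholders; the band edges follow the standard Landis-Koch cut-points (0.20 / 0.40 / 0.60 / 0.80).

```python
# Percentile-bootstrap 95% CI on Cohen's kappa, plus Landis-Koch banding.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci(a, b, n_boot=2000, seed=0):
    """95% percentile bootstrap CI for Cohen's kappa on paired labels."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(a), len(a))  # resample scenarios with replacement
        stats.append(cohen_kappa_score(a[idx], b[idx]))
    return np.percentile(stats, [2.5, 97.5])

def landis_koch(kappa):
    """Map a kappa value to its Landis-Koch agreement band."""
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical example: 21 paired labels, 18 in agreement.
lo, hi = kappa_ci([0, 1, 2] * 7, [0, 1, 2] * 6 + [1, 2, 0], n_boot=500)
```

Resampling is over scenarios (paired labels), which is what makes the half-width at n=50 as wide as the defaults above warn.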
Scope and effort (rough)
- ~100 expert-hours per rater (50 scenarios at roughly 2 hours of review across 6 dimensions). ~200 hours total across two raters.
- A small Python harness for kappa computation and reporting.
- The 50 scenarios in JSON, expert label files, and the agreement table.
- Fits the <300-line PR guidance in CONTRIBUTING.md.
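The reporting half of that harness could be as small as the following. The input format (dimension → list of labels) and the judge-vs-expert pairing are assumptions; nothing here reflects the repo's actual file layout.

```python
# Render the per-dimension judge-vs-expert agreement table as markdown.
from sklearn.metrics import cohen_kappa_score

def agreement_table(expert_labels: dict, judge_labels: dict) -> str:
    """Markdown table of judge-vs-expert kappa per rubric dimension."""
    rows = ["| Dimension | Kappa |", "|---|---|"]
    for dim, expert in expert_labels.items():
        k = cohen_kappa_score(expert, judge_labels[dim])
        rows.append(f"| {dim} | {k:.2f} |")
    return "\n".join(rows)

# Toy usage with fabricated four-scenario labels:
table = agreement_table(
    {"clarity": [1, 2, 2, 3]},
    {"clarity": [1, 2, 3, 3]},
)
```

A future judge swap then only re-runs the judge side; the expert side of the call stays frozen with the `judge_calibration` subset.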
Note on implementation
If the protocol fits the roadmap, anyone can pick it up: the team, @AstroBoy1 (natural extension of #281), or a future external contributor following the shape of #287 / #292. I won't be driving the PR.
References
1. Wang, Mang, Cheung, Sen, Song (UC Berkeley RDI, April 2026), "How We Broke Top AI Agent Benchmarks: And What Comes Next". https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
2. Mehta, S. (Nov 2025), "Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems". https://arxiv.org/abs/2511.14136
3. Cohen, J. (1960), "A coefficient of agreement for nominal scales", Educational and Psychological Measurement 20(1), 37-46. Canonical kappa reference.