
Proposal: A judge-calibration subset for AssetOpsBench #296

@jasdian

Description


Motivation

The leaderboard is scored by a single LLM judge (Llama-4-Maverick-17B, per the README's Leaderboards section) across the six rubric dimensions. No published agreement number between that judge and human expert raters on the rubric appears in the repo or the linked paper. Recent agent-benchmark reviews flag single-judge designs as a top reliability gap.[1][2]

A 50-scenario subset, double-labelled by two domain experts on the same rubric, gives the leaderboard a defensible "how much does the judge actually agree with humans?" number per dimension. The subset is one-shot work; only the judge rotates against it. The same artifact doubles as the human anchor for any later judge swap, ensemble proposal, or the kappa-ensemble discussion in #281.

Filing this as methodology, not a contribution offer. I don't have the bandwidth to drive the PR end-to-end. The interesting question is whether the protocol fits the roadmap; who picks it up is downstream of that.

Relation to existing work

Methodology

  1. Subset (50 scenarios). Stratified across the four agents (IoT, FMSR, TSFM, WO; the HF type field tags FMSR scenarios as FMSA) and the three groups (retrospective, predictive, prescriptive). The deterministic / non-deterministic split should mirror the population ratio (a sampling sketch follows this list).
  2. Two expert raters. Practical minimum for Cohen's kappa. Ideally an industrial-engineer + data-scientist pair drawn from the same pool FailureSensorIQ used for its five-expert human-ceiling measurement. Three to five raters tighten the CIs at proportional cost and switch the metric to Fleiss' / Light's kappa.
  3. Per-dimension agreement metric. Cohen's kappa for nominal dimensions; quadratic-weighted kappa for ordinal. Task completeness, clarity, and justification look ordinal, but the per-dimension classification is best settled against the actual judge prompt before fixing a metric (a computation sketch appears after the configuration note below).
  4. Judge scoring. Re-run the existing Maverick-17B judge against the same 50 scenarios. An optional ensemble panel for sensitivity reporting needs Fleiss' / Light's kappa or pairwise-averaged Cohen's instead of plain Cohen's.
  5. Publication. Per-dimension agreement table on the leaderboard README (or a sibling docs page if that fits better). The labelled subset ships as the judge_calibration HF dataset configuration, so future judges can be re-scored without relabelling.
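
As a concrete starting point for step 1, the draw could look like the sketch below. The pandas DataFrame input and the column names (agent, group, deterministic) are assumptions for illustration, not the actual HF schema.

```python
# Sketch only: proportional stratified draw for the 50-scenario subset.
# The DataFrame input and column names are hypothetical, not the HF schema.
import pandas as pd

def stratified_subset(scenarios: pd.DataFrame, n: int = 50, seed: int = 0) -> pd.DataFrame:
    """Sample each (agent, group, deterministic) cell in proportion to its share."""
    cells = scenarios.groupby(["agent", "group", "deterministic"])
    sizes = (cells.size() / len(scenarios) * n).round().astype(int)
    parts = [
        grp.sample(n=min(int(sizes[key]), len(grp)), random_state=seed)
        for key, grp in cells
    ]
    # Rounding can overshoot n by a scenario or two; trim deterministically.
    return pd.concat(parts).sort_index().head(n)
```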

The configuration sits alongside scenarios as a companion config, not as a schema extension of it. Scenarios follow the existing docs/guideline/utterance_design_guideline.md and docs/guideline/ground_truth_design_guideline.md templates.
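
A minimal sketch of the per-dimension agreement computation (steps 3 and 4), assuming a long-format label file with one row per (scenario, dimension) and integer rubric scores from expert_1, expert_2, and judge. The file name, column names, and the ordinal/nominal split are placeholders.

```python
# Sketch only: per-dimension judge/expert agreement. The CSV name, column
# names, and the ordinal/nominal split are placeholders, not repo facts.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical long format: one row per (scenario_id, dimension) with integer
# rubric scores from expert_1, expert_2, and judge.
labels = pd.read_csv("judge_calibration_labels.csv")

ORDINAL = {"task_completeness", "clarity", "justification"}  # assumed split

def dimension_kappa(df: pd.DataFrame, rater_a: str, rater_b: str) -> pd.Series:
    """Cohen's kappa per dimension; quadratic weights for ordinal dimensions."""
    out = {}
    for dim, grp in df.groupby("dimension"):
        weights = "quadratic" if dim in ORDINAL else None
        out[dim] = cohen_kappa_score(grp[rater_a], grp[rater_b], weights=weights)
    return pd.Series(out, name=f"{rater_a}_vs_{rater_b}")

# Expert-vs-expert ceiling first, then judge-vs-expert, per dimension.
print(dimension_kappa(labels, "expert_1", "expert_2"))
print(dimension_kappa(labels, "judge", "expert_1"))
```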

Suggested defaults (each negotiable)

  • Per-dimension metric. Cohen's kappa for nominal, quadratic-weighted kappa for ordinal. Which dimension lands in which bucket is best left to whoever owns the judge prompt.
  • Subset provenance. Existing public scenarios. A held-out hidden slice is worth considering if contamination becomes a concern.
  • Expert panel size. Two raters (kappa minimum). Three to five raters bring it in line with the FailureSensorIQ ceiling (a multi-rater sketch follows this list).
  • Acceptance threshold. None hardcoded. Report raw kappa with Landis-Koch bands (slight / fair / moderate / substantial / almost perfect).
  • Confidence intervals. At n=50 the 95% bootstrap CI on kappa is wide (half-width roughly 0.20 to 0.25, depending on the point estimate). Reporting the band straight matters more than narrowing it (see the bootstrap sketch after this list).
  • Publication location. Leaderboard README, paper appendix, or a dedicated docs/CALIBRATION.md. No strong preference.
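
If the panel grows to three or more raters, the pairwise Cohen's computation can be swapped for Fleiss' kappa. One way to do it, using statsmodels; the label matrix is invented purely to show the shape.

```python
# Sketch only: multi-rater agreement on a single dimension with Fleiss' kappa.
# Rows are scenarios, columns are raters; the values below are made up.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

labels = np.array([
    [3, 3, 4],
    [2, 2, 2],
    [4, 3, 4],
    [1, 2, 1],
])
table, _ = aggregate_raters(labels)        # scenarios x categories count table
print(fleiss_kappa(table, method="fleiss"))
```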
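
For the last two defaults, a percentile bootstrap over scenarios plus a Landis-Koch band lookup is enough for the reported table. The function names below are placeholders, and the bands mirror the five listed above.

```python
# Sketch only: percentile bootstrap CI on kappa plus the Landis-Koch band
# used in the reported table. Function names are placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, weights=None, seed=0):
    """Resample scenarios with replacement and take percentile bounds."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    stats = [
        cohen_kappa_score(a[idx], b[idx], weights=weights)
        for idx in (rng.integers(0, len(a), len(a)) for _ in range(n_boot))
    ]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

def landis_koch(kappa: float) -> str:
    """Map a kappa value onto the five bands listed above."""
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (float("inf"), "almost perfect")]
    return next(label for cut, label in bands if kappa <= cut)
```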

Scope and effort (rough)

  • ~100 expert-hours per rater (50 scenarios at roughly 2 hours of review across 6 dimensions). ~200 hours total across two raters.
  • A small Python harness for kappa computation and reporting.
  • The 50 scenarios in JSON, expert label files, and the agreement table.
  • Fits the <300-line PR guidance in CONTRIBUTING.md.

Note on implementation

If the protocol fits the roadmap, anyone can pick it up: the team, @AstroBoy1 (natural extension of #281), or a future external contributor following the shape of #287 / #292. I won't be driving the PR.

References


  1. Wang, Mang, Cheung, Sen, Song (UC Berkeley RDI, April 2026), "How We Broke Top AI Agent Benchmarks: And What Comes Next". https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

  2. Mehta, S. (Nov 2025), "Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems". https://arxiv.org/abs/2511.14136

  3. Cohen, J. (1960), "A coefficient of agreement for nominal scales", Educational and Psychological Measurement 20(1), 37-46. Canonical kappa reference.
