Motivation
The leaderboard is scored by a single LLM judge (Llama-4-Maverick-17B, per the README's Leaderboards section) across the six rubric dimensions. No published agreement number between that judge and human expert raters on the rubric appears in the repo or the linked paper. Recent agent-benchmark reviews flag single-judge designs as a top reliability gap [1][2].
A 50-scenario subset, double-labelled by two domain experts on the same rubric, gives the leaderboard a defensible "how much does the judge actually agree with humans?" number per dimension. The subset is one-shot work; only the judge rotates against it. The same artifact doubles as the human anchor for any later judge swap, ensemble proposal, or the kappa-ensemble discussion in #281.
Filing this as methodology, not a contribution offer. I don't have the bandwidth to drive the PR end-to-end. The interesting question is whether the protocol fits the roadmap; who picks it up is downstream of that.
Relation to existing work
The existing evaluation flow already covers the three metric types (`exact_string_match`, `numeric_match`, `llm_judge`) against the six-criterion rubric. A `judge_calibration` HF config would run through them unchanged; the only addition is a small post-step that reads the saved trajectories and computes per-dimension kappa. No new plumbing.
Methodology
- Subset (50 scenarios). Stratified across the four agents (IoT, FMSR, TSFM, WO; the HF `type` field tags FMSR scenarios as `FMSA`) and the three groups (retrospective, predictive, prescriptive). The deterministic / non-deterministic split should mirror the population ratio, not deviate from it.
- Two expert raters. Practical minimum for Cohen's kappa. Ideally an industrial-engineer + data-scientist pair drawn from the same pool FailureSensorIQ used for its five-expert human-ceiling measurement. Three-to-five raters tightens the CIs at proportional cost and switches the metric to Fleiss' / Light's kappa.
- Per-dimension agreement metric. Cohen's kappa for nominal dimensions; quadratic-weighted kappa for ordinal. `task completeness`, `clarity`, and `justification` look ordinal, but the per-dimension classification is best settled against the actual judge prompt before fixing a metric.
- Judge scoring. Re-run the existing Maverick-17B judge against the same 50 scenarios. An optional ensemble panel for sensitivity reporting needs Fleiss' / Light's kappa or pairwise-averaged Cohen's instead of plain Cohen's.
- Publication. Per-dimension agreement table on the leaderboard README (or a sibling docs page if that fits better). The labelled subset ships as the `judge_calibration` HF dataset configuration, so future judges can be re-scored without relabelling. The configuration sits alongside `scenarios` as a companion config, not as a schema extension on it. Scenarios follow the existing `docs/guideline/utterance_design_guideline.md` and `docs/guideline/ground_truth_design_guideline.md` templates.
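The metric choice in the list above can be sketched in a few lines. This is a hypothetical harness fragment, not code from the repo: the `ORDINAL` set encodes exactly the open nominal/ordinal classification question, and the label lists are fabricated toy data, not real rubric scores.

```python
# Per-dimension agreement: Cohen's kappa for nominal dimensions,
# quadratic-weighted kappa for ordinal ones (scikit-learn).
from sklearn.metrics import cohen_kappa_score

# Assumed ordinal/nominal split -- to be settled against the judge prompt.
ORDINAL = {"task completeness", "clarity", "justification"}

def per_dimension_kappa(rater_a: dict, rater_b: dict) -> dict:
    """Kappa per rubric dimension between two raters' label lists."""
    return {
        dim: cohen_kappa_score(
            rater_a[dim], rater_b[dim],
            weights="quadratic" if dim in ORDINAL else None,
        )
        for dim in rater_a
    }

# Toy usage with fabricated labels ("hallucination" is a made-up dimension):
a = {"clarity": [1, 2, 3, 3, 2], "hallucination": [0, 1, 1, 0, 1]}
b = {"clarity": [1, 2, 2, 3, 2], "hallucination": [0, 1, 0, 0, 1]}
scores = per_dimension_kappa(a, b)
```

Swapping `weights="quadratic"` for `"linear"` is the only change needed if linear-weighted kappa is preferred for the ordinal dimensions.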
Suggested defaults (each negotiable)
- Per-dimension metric. Cohen's kappa for nominal, quadratic-weighted kappa for ordinal. Which dimension lands in which bucket is best left to whoever owns the judge prompt.
- Subset provenance. Existing public `scenarios`. A held-out hidden slice is worth considering if contamination becomes a concern.
- Expert panel size. Two raters (kappa minimum). Three-to-five brings it in line with the FailureSensorIQ ceiling.
- Acceptance threshold. None hardcoded. Report raw kappa with Landis-Koch bands (slight / fair / moderate / substantial / almost perfect).
- Confidence intervals. At n=50 the 95% bootstrap CI on kappa is wide (half-width roughly 0.20 to 0.25, depending on the value). Reporting the band straight matters more than narrowing it.
- Publication location. Leaderboard README, paper appendix, or a dedicated `docs/CALIBRATION.md`. No strong preference.
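The bootstrap CI and Landis-Koch defaults above are mechanical enough to sketch. The label arrays are hypothetical placeholders; the band edges follow the standard Landis-Koch cut-points (0.20 / 0.40 / 0.60 / 0.80).

```python
# Percentile-bootstrap 95% CI on Cohen's kappa, plus Landis-Koch banding.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci(a, b, n_boot=2000, seed=0):
    """95% percentile bootstrap CI for Cohen's kappa on paired labels."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(a), len(a))  # resample scenarios with replacement
        stats.append(cohen_kappa_score(a[idx], b[idx]))
    return np.percentile(stats, [2.5, 97.5])

def landis_koch(kappa):
    """Map a kappa value to its Landis-Koch agreement band."""
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical example: 21 paired labels, 18 in agreement.
lo, hi = kappa_ci([0, 1, 2] * 7, [0, 1, 2] * 6 + [1, 2, 0], n_boot=500)
```

Resampling is over scenarios (paired labels), which is what makes the half-width at n=50 as wide as the defaults above warn.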
Scope and effort (rough)
- ~100 expert-hours per rater (50 scenarios at roughly 2 hours of review across 6 dimensions). ~200 hours total across two raters.
- A small Python harness for kappa computation and reporting.
- The 50 scenarios in JSON, expert label files, and the agreement table.
- Fits the <300-line PR guidance in CONTRIBUTING.md.
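The reporting half of that harness could be as small as the following. The input format (dimension → list of labels) and the judge-vs-expert pairing are assumptions; nothing here reflects the repo's actual file layout.

```python
# Render the per-dimension judge-vs-expert agreement table as markdown.
from sklearn.metrics import cohen_kappa_score

def agreement_table(expert_labels: dict, judge_labels: dict) -> str:
    """Markdown table of judge-vs-expert kappa per rubric dimension."""
    rows = ["| Dimension | Kappa |", "|---|---|"]
    for dim, expert in expert_labels.items():
        k = cohen_kappa_score(expert, judge_labels[dim])
        rows.append(f"| {dim} | {k:.2f} |")
    return "\n".join(rows)

# Toy usage with fabricated four-scenario labels:
table = agreement_table(
    {"clarity": [1, 2, 2, 3]},
    {"clarity": [1, 2, 3, 3]},
)
```

A future judge swap then only re-runs the judge side; the expert side of the call stays frozen with the `judge_calibration` subset.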
Note on implementation
If the protocol fits the roadmap, anyone can pick it up: the team, @AstroBoy1 (natural extension of #281), or a future external contributor following the shape of #287 / #292. I won't be driving the PR.
References
1. Wang, Mang, Cheung, Sen, Song (UC Berkeley RDI, April 2026), "How We Broke Top AI Agent Benchmarks: And What Comes Next". https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
2. Mehta, S. (Nov 2025), "Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems". https://arxiv.org/abs/2511.14136
3. Cohen, J. (1960), "A coefficient of agreement for nominal scales", Educational and Psychological Measurement 20(1), 37-46. Canonical kappa reference.