measurement: confidence calibration score (do confidence scores predict outcomes?) #56

Description

@mataeil

From the loop-engineering measurement canon (medium priority). The Loop Scorecard (v1.4.0) now measures outcome quality, goal progress, gap resolution, and lesson application. The remaining canon metric not yet implemented is confidence calibration: for each domain, correlate confidence-at-decision (from decision_log) with the realized outcome quality_multiplier (from outcomes.json). In a well-calibrated domain, high-confidence cycles should yield high quality; miscalibration suggests Orient is updating confidence based on activity rather than value.

Compute (every N cycles, e.g. in 6-C4): per domain, the Pearson or sign correlation between confidence_at_decision and the cycle's quality_multiplier over the retained history, and surface it as a 'Calibration' line on the scorecard. This was deferred from the v1.4.0 measurement release to keep that PR focused; the required data (decision_log confidence and outcomes quality) already exists.
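
A rough sketch of the per-domain computation, to make the proposal concrete. The record shapes are assumptions: the `domain` and `cycle` keys, the flat list-of-dicts layout, and the `min_cycles` threshold are hypothetical and not taken from the actual decision_log or outcomes.json schemas. Only Pearson correlation is shown; a sign correlation could be swapped in the same place.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Return Pearson's r, or None when it is undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation.
        return None
    return cov / sqrt(var_x * var_y)


def calibration_by_domain(decisions, outcomes, min_cycles=3):
    """Correlate confidence_at_decision with quality_multiplier per domain.

    ASSUMPTION: each record is a dict carrying hypothetical "domain" and
    "cycle" keys that let a decision be joined to its outcome.
    """
    quality = {
        (o["domain"], o["cycle"]): o["quality_multiplier"] for o in outcomes
    }
    pairs = defaultdict(list)
    for d in decisions:
        key = (d["domain"], d["cycle"])
        if key in quality:  # skip cycles with no recorded outcome yet
            pairs[d["domain"]].append(
                (d["confidence_at_decision"], quality[key])
            )

    result = {}
    for domain, points in pairs.items():
        if len(points) < min_cycles:
            result[domain] = None  # too little history to say anything
            continue
        xs, ys = zip(*points)
        result[domain] = pearson(xs, ys)
    return result
```

A scorecard line could then be rendered from the returned mapping, printing something like "n/a" for domains whose value is `None` (too few cycles, or constant confidence or quality) so that missing history is not mistaken for zero correlation.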

Labels: enhancement