refactor(core): remove borderline verdict #857
Merged
Simplify EvaluationVerdict to 'pass' | 'fail' | 'skip'. Scores below 0.8 are now 'fail' (previously 0.6-0.8 was 'borderline'). Remove borderline from EvalSummary, scoreToVerdict, negateScore, and composite evaluator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
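As a rough illustration of the simplified mapping described in this commit, the sketch below shows a score-to-verdict function with a single cutoff at 0.8. The real `scoreToVerdict` signature is not shown in this PR page, so the parameter shape here is an assumption, as is how `skip` is produced (this sketch assumes it is assigned elsewhere, not derived from a score).

```typescript
// Sketch only: the actual implementation in scoring.ts is not shown in this PR page.
type EvaluationVerdict = "pass" | "fail" | "skip";

// Cutoff stated in the PR: scores below 0.8 are 'fail'.
const PASS_THRESHOLD = 0.8;

// Assumed signature: takes a numeric score and returns a verdict.
// 'skip' is never returned here; it is assumed to be set by the caller.
function scoreToVerdict(score: number): EvaluationVerdict {
  return score >= PASS_THRESHOLD ? "pass" : "fail";
}

console.log(scoreToVerdict(0.9)); // pass
console.log(scoreToVerdict(0.7)); // fail (previously 'borderline')
```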
Change borderline expectations to fail (scores 0.6-0.8 are now fail). Remove borderline-specific tests in negation and composite-threshold. Update threshold aggregator tests since only pass verdicts count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
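To make "only pass verdicts count" concrete, here is a minimal sketch of a threshold-style aggregation. The names `ChildResult` and `passesThreshold` are hypothetical and do not come from the codebase. How skipped children are treated is not described in this PR, so this sketch simply counts them in the denominator as an assumption.

```typescript
// Hypothetical sketch; not the project's threshold aggregator.
type EvaluationVerdict = "pass" | "fail" | "skip";

// Hypothetical shape for a child evaluator result.
interface ChildResult {
  name: string;
  verdict: EvaluationVerdict;
}

// Returns true when the fraction of 'pass' verdicts meets the threshold.
// Assumption: 'skip' and 'fail' both count as not passing.
function passesThreshold(children: ChildResult[], threshold: number): boolean {
  if (children.length === 0) return false;
  const passed = children.filter((c) => c.verdict === "pass").length;
  return passed / children.length >= threshold;
}

const results: ChildResult[] = [
  { name: "a", verdict: "pass" },
  { name: "b", verdict: "pass" },
  { name: "c", verdict: "fail" },
];

console.log(passesThreshold(results, 0.5)); // true (2 of 3 pass)
console.log(passesThreshold(results, 0.75)); // false
```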
Update example scripts, documentation, baseline JSONL fixtures, and skill references to reflect binary pass/fail verdict system. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract PASS_THRESHOLD = 0.8 as single source of truth in scoring.ts
- Replace magic 0.8 in evaluate.ts and orchestrator.ts with the constant
- Add file header to scoring.ts explaining the scoring model
- Use data-driven NEGATED_VERDICT map instead of ternary chain
- Remove dead isNonEmptyString import from composite.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
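The data-driven map mentioned above might look roughly like the following. This is a sketch under assumptions: the PR page does not show the real `NEGATED_VERDICT` definition or the `negateScore` signature, so `negateVerdict` below is a hypothetical helper that only covers the verdict swap (pass and fail swap, skip stays skip).

```typescript
// Sketch only: the actual definitions in scoring.ts are not shown in this PR page.
type EvaluationVerdict = "pass" | "fail" | "skip";

// A lookup table replaces a chain like: v === "pass" ? "fail" : v === "fail" ? "pass" : "skip".
// Record<...> makes the compiler require an entry for every verdict.
const NEGATED_VERDICT: Record<EvaluationVerdict, EvaluationVerdict> = {
  pass: "fail",
  fail: "pass",
  skip: "skip",
};

// Hypothetical helper: negates only the verdict, leaving any score handling aside.
function negateVerdict(verdict: EvaluationVerdict): EvaluationVerdict {
  return NEGATED_VERDICT[verdict];
}

console.log(negateVerdict("pass")); // fail
console.log(negateVerdict("skip")); // skip
```

One benefit of the map form is that adding or removing a verdict from the union type produces a compile error until the table is updated.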
c00ace3 to a7e37b2
Summary
- Remove `borderline` from the `EvaluationVerdict` type, simplifying to `pass | fail | skip`
- `scoreToVerdict()`: scores < 0.8 are now `fail` (previously 0.6-0.8 was `borderline`)
- `negateScore()`: only swap pass↔fail, skip stays skip
- Remove the `borderline` field from `EvalSummary` and fix the inconsistent threshold (was 0.5, now matches 0.8)
- Only `pass` counts as passing in the threshold aggregator

Industry research confirms no major eval framework uses a named intermediate verdict. The numeric score already captures nuance, and the borderline verdict added complexity without clear value.
Test plan
- `bun run build`: TypeScript compilation passes
- `bun run test`: all 1713 tests pass (1295 core + 67 eval + 351 cli)
- `bun run lint`: Biome passes
- `grep -r "borderline" packages/ apps/ examples/ plugins/`: no remaining code references

🤖 Generated with Claude Code