refactor(core): remove borderline verdict #857

Merged
christso merged 5 commits into main from refactor/remove-borderline-verdict on Mar 29, 2026

@christso
Collaborator

Summary

  • Remove borderline from EvaluationVerdict type, simplifying to pass | fail | skip
  • Simplify scoreToVerdict(): scores < 0.8 are now fail (previously 0.6-0.8 was borderline)
  • Simplify negateScore(): only swap pass↔fail, skip stays skip
  • Remove borderline field from EvalSummary and fix its inconsistent pass threshold (was 0.5, now matches the 0.8 used by scoreToVerdict)
  • Update composite evaluator: only pass counts as passing in threshold aggregator
  • Update all tests, examples, docs, baseline JSONL fixtures, and skill references
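The verdict mapping described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `EvaluationVerdict`, `scoreToVerdict`, and `PASS_THRESHOLD` are named in the PR, but the signature shown here is an assumption, and `passingFraction` is a hypothetical helper standing in for the threshold aggregator's counting rule.

```typescript
// Sketch only. The function signature and passingFraction are assumptions
// made for illustration; they are not taken from the repository.
type EvaluationVerdict = "pass" | "fail" | "skip";

// Cutoff stated in the PR summary: scores below 0.8 fail.
const PASS_THRESHOLD = 0.8;

// With no intermediate band, a score is either at/above the cutoff or below it.
function scoreToVerdict(score: number): EvaluationVerdict {
  return score >= PASS_THRESHOLD ? "pass" : "fail";
}

// Hypothetical helper: fraction of child verdicts that count as passing.
// Only "pass" counts; "fail" and "skip" do not.
function passingFraction(verdicts: EvaluationVerdict[]): number {
  if (verdicts.length === 0) return 0;
  const passed = verdicts.filter((v) => v === "pass").length;
  return passed / verdicts.length;
}
```

Under this sketch a score of 0.7, which previously fell in the borderline band, maps to fail, and a composite with two passing children out of four has a passing fraction of 0.5.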

Industry research confirms no major eval framework uses a named intermediate verdict. The numeric score already captures nuance, and the borderline verdict added complexity without clear value.

Test plan

  • bun run build — TypeScript compilation passes
  • bun run test — all 1713 tests pass (1295 core + 67 eval + 351 cli)
  • bun run lint — Biome passes
  • grep -r "borderline" packages/ apps/ examples/ plugins/ — no remaining code references
  • Pre-push hooks pass (build, typecheck, lint, test, validate)

🤖 Generated with Claude Code

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 29, 2026

Deploying agentv with Cloudflare Pages

Latest commit: a7e37b2
Status: ⚡️ Build in progress...

christso and others added 5 commits March 29, 2026 22:47
Simplify EvaluationVerdict to 'pass' | 'fail' | 'skip'. Scores below 0.8
are now 'fail' (previously 0.6-0.8 was 'borderline'). Remove borderline
from EvalSummary, scoreToVerdict, negateScore, and composite evaluator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change borderline expectations to fail (scores 0.6-0.8 are now fail).
Remove borderline-specific tests in negation and composite-threshold.
Update threshold aggregator tests since only pass verdicts count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update example scripts, documentation, baseline JSONL fixtures, and
skill references to reflect binary pass/fail verdict system.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Extract PASS_THRESHOLD = 0.8 as single source of truth in scoring.ts
- Replace magic 0.8 in evaluate.ts and orchestrator.ts with the constant
- Add file header to scoring.ts explaining the scoring model
- Use data-driven NEGATED_VERDICT map instead of ternary chain
- Remove dead isNonEmptyString import from composite.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
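
For readers unfamiliar with the pattern, the last commit's two main changes (a shared threshold constant and a lookup map in place of a ternary chain) might look something like the sketch below. `PASS_THRESHOLD` and `NEGATED_VERDICT` are named in the commit message; `isPassing` and `negateVerdict` are hypothetical helpers invented for this illustration, and the actual contents of scoring.ts may differ.

```typescript
// Sketch only. isPassing and negateVerdict are hypothetical names; the real
// scoring.ts is not shown in this PR page.
type EvaluationVerdict = "pass" | "fail" | "skip";

// One constant instead of a literal 0.8 repeated across files.
const PASS_THRESHOLD = 0.8;

// Hypothetical call site that previously might have compared against 0.8 directly.
function isPassing(score: number): boolean {
  return score >= PASS_THRESHOLD;
}

// Data-driven negation: pass and fail swap, skip maps to itself.
// Record<...> makes the compiler require an entry for every verdict.
const NEGATED_VERDICT: Record<EvaluationVerdict, EvaluationVerdict> = {
  pass: "fail",
  fail: "pass",
  skip: "skip",
};

// Equivalent to a ternary chain such as
//   v === "pass" ? "fail" : v === "fail" ? "pass" : v
// but expressed as a single lookup.
function negateVerdict(verdict: EvaluationVerdict): EvaluationVerdict {
  return NEGATED_VERDICT[verdict];
}
```

One plausible benefit of the map form is that adding or removing a verdict becomes a type error until the map is updated, whereas a ternary chain would silently fall through to its last branch.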
christso force-pushed the refactor/remove-borderline-verdict branch from c00ace3 to a7e37b2 on March 29, 2026 22:49
christso merged commit 7ae533e into main on Mar 29, 2026
1 of 2 checks passed
christso deleted the refactor/remove-borderline-verdict branch on March 29, 2026 22:49