fix(studio): use 0.8 pass threshold instead of 1.0 #863

Merged
christso merged 13 commits into main from fix/862-pass-threshold
Mar 30, 2026

Conversation

@christso
Collaborator

Summary

  • Fix Studio dashboard showing results as "failures" when they don't achieve a perfect 1.0 score — now uses the core engine's 0.8 PASS_THRESHOLD
  • Add configurable pass_threshold via config.yaml in the .agentv/results/runs/ directory
  • Fix all 12 hardcoded score >= 1 checks across server (5) and client (7) code

How to override the threshold

Create .agentv/results/runs/config.yaml:

pass_threshold: 0.9

The Studio server reads this on startup and serves it via /api/config. The frontend fetches it and uses it for all pass/fail UI decisions. Default remains 0.8 (matching @agentv/core PASS_THRESHOLD).
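As a rough illustration of the override behaviour described above, the sketch below resolves a threshold from the text of a config file and falls back to 0.8 when the key is missing or unusable. The names `parseStudioConfig`, `StudioConfig`, and `DEFAULT_PASS_THRESHOLD` are hypothetical, the regex stands in for a real YAML parser, and the range check is an assumption rather than something this PR states.

```typescript
// Hypothetical sketch, not the PR's actual loader.
// 0.8 mirrors PASS_THRESHOLD from @agentv/core as described in the summary.
const DEFAULT_PASS_THRESHOLD = 0.8;

interface StudioConfig {
  pass_threshold: number;
}

// Resolve the threshold from the raw text of config.yaml.
// `undefined` models "no config file exists".
// Assumption: only a flat `pass_threshold: <number>` line is recognised;
// a real loader would use a proper YAML parser.
function parseStudioConfig(yamlText: string | undefined): StudioConfig {
  if (yamlText === undefined) {
    return { pass_threshold: DEFAULT_PASS_THRESHOLD };
  }
  const match = /^pass_threshold:\s*([0-9]*\.?[0-9]+)\s*(?:#.*)?$/m.exec(yamlText);
  const value = match ? Number(match[1]) : NaN;
  // Assumption: values outside [0, 1] are treated as invalid and ignored.
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    return { pass_threshold: DEFAULT_PASS_THRESHOLD };
  }
  return { pass_threshold: value };
}
```

With this sketch, a file containing `pass_threshold: 0.9` resolves to 0.9, while a missing file or an out-of-range value resolves to the 0.8 default.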

Files changed

Server:

  • apps/cli/src/commands/results/studio-config.ts (new) — config loader
  • apps/cli/src/commands/results/serve.ts: /api/config endpoint + threshold in 5 endpoints
  • apps/cli/src/commands/trace/utils.ts: listResultFiles pass count

Client:

  • apps/studio/src/lib/api.ts: useStudioConfig hook, isPassing helper
  • apps/studio/src/lib/types.ts: StudioConfigResponse type
  • apps/studio/src/components/RunDetail.tsx — pass/fail counting
  • apps/studio/src/components/EvalDetail.tsx — failure reason display
  • apps/studio/src/components/Sidebar.tsx — sidebar pass/fail indicators
  • apps/studio/src/routes/runs/$runId_.dataset.$dataset.tsx — dataset page
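The client changes listed above all route pass/fail decisions through `isPassing(score, passThreshold)`. A minimal sketch of what such a helper and a pass/fail count could look like follows; the `EvalResult` shape and `countPassFail` are hypothetical, and the `>=` comparison is inferred from the `score >= 1` checks being replaced.

```typescript
// Hypothetical result shape for illustration; the real Studio types may differ.
interface EvalResult {
  testId: string;
  score: number;
}

// A score passes when it meets or exceeds the threshold.
// The 0.8 default here is an assumption that mirrors the documented default.
function isPassing(score: number, passThreshold: number = 0.8): boolean {
  return score >= passThreshold;
}

// Count passes and failures for a run against a given threshold.
function countPassFail(
  results: EvalResult[],
  passThreshold: number,
): { passed: number; failed: number } {
  let passed = 0;
  for (const result of results) {
    if (isPassing(result.score, passThreshold)) {
      passed += 1;
    }
  }
  return { passed, failed: results.length - passed };
}
```

For scores of 1.0, 0.85, 0.8, and 0.79, a threshold of 0.8 yields three passes and one failure, whereas the old hardcoded threshold of 1 would have reported three failures.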

Tests:

  • apps/cli/test/commands/results/studio-config.test.ts (new) — 6 tests

Test plan

  • All 1719 tests pass (1295 core + 67 eval + 357 cli)
  • Typecheck passes
  • Lint passes
  • Pre-push hooks pass (build, typecheck, lint, test, validate examples)
  • Manual UAT: run agentv studio and verify scores between 0.8-1.0 show as passed
  • Manual UAT: verify config.yaml override works

Closes #862

🤖 Generated with Claude Code

christso and others added 8 commits March 30, 2026 16:08
…hold

Replace hardcoded `score >= 1` checks in 5 server endpoints with a
configurable pass_threshold loaded from config.yaml in the runs directory.
Defaults to PASS_THRESHOLD (0.8) from @agentv/core when no config exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d isPassing helper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tils

Replace hardcoded score >= 1.0 with PASS_THRESHOLD (0.8) in listResultFiles
pass count calculation so it aligns with the standard evaluation threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hardcoded score < 1 checks with isPassing(score, passThreshold)
using the studio config's pass_threshold (default 0.8).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hardcoded score >= 1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hardcoded `score >= 1` checks with `isPassing(score, passThreshold)`
using the `useStudioConfig` hook in EvalSidebar, DatasetSidebar, and DatasetPage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 30, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 74e4c0f
Status: ✅  Deploy successful!
Preview URL: https://42293bce.agentv.pages.dev
Branch Preview URL: https://fix-862-pass-threshold.agentv.pages.dev

christso and others added 5 commits March 30, 2026 16:28
Remove redundant Output and Task tabs that showed identical file trees.
Replace with a single Files tab for browsing eval artifacts. Remove
legacy fallback logic for pre-manifest result formats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The loadStudioConfig was receiving the project root (cwd) instead of
the actual runs directory. Now correctly constructs the path to
.agentv/results/runs/config.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The config is a global Studio setting, not per-run data. It belongs
alongside cache.json in the .agentv/ directory root.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-read .agentv/config.yaml on each API request instead of once at
startup so external edits are picked up immediately. Add POST /api/config
endpoint to save config changes. Add /settings route with card-based UI
for editing pass threshold. Add settings link to sidebar.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Read config once per request, not once per result row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
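The "once per request, not once per result row" change can be sketched as hoisting the config read out of the per-row loop. Everything here (`ConfigLoader`, `ResultRow`, `summarizeRun`) is a hypothetical stand-in for the real endpoint code.

```typescript
// Hypothetical loader type; in the real server this would re-read .agentv/config.yaml.
type ConfigLoader = () => { pass_threshold: number };

// Hypothetical row shape for illustration.
interface ResultRow {
  score: number;
}

// Handle one request: load the config a single time, then reuse the
// threshold for every row instead of calling the loader inside the loop.
function summarizeRun(
  rows: ResultRow[],
  loadConfig: ConfigLoader,
): { total: number; passed: number; failed: number } {
  const { pass_threshold } = loadConfig(); // one read per request
  const passed = rows.filter((row) => row.score >= pass_threshold).length;
  return { total: rows.length, passed, failed: rows.length - passed };
}
```

Because the loader is still invoked at the start of each request, external edits to the config are picked up on the next request without a file read per row.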
@christso christso merged commit cbf5f5b into main Mar 30, 2026
4 checks passed
@christso christso deleted the fix/862-pass-threshold branch March 30, 2026 06:36

Development

Successfully merging this pull request may close these issues.

bug: failures should be based on 80% threshold not 100%

1 participant