Draft
130 commits
3e2d4b0
test(junior): Tighten integration test boundaries
dcramer Jun 4, 2026
29e3c80
test(junior): Split Slack turn behavior suites
dcramer Jun 4, 2026
5ca1205
test(junior): Split subscribed Slack behavior tests
dcramer Jun 4, 2026
433de55
test(junior): Split Slack image behavior suites
dcramer Jun 4, 2026
d17dcf2
test(junior): Split heartbeat integration contracts
dcramer Jun 4, 2026
bfa3a65
test(junior): Split conversation work component suites
dcramer Jun 4, 2026
2d3a1e2
test(junior): Split plugin package registry tests
dcramer Jun 4, 2026
433b872
test(junior): Split sandbox egress proxy suites
dcramer Jun 4, 2026
ec5066f
docs(testing): Record testing architecture review
dcramer Jun 4, 2026
06482d9
test(junior): Extract lazy sandbox test contracts
dcramer Jun 4, 2026
829564b
test(junior): Extract sandbox executor fixture
dcramer Jun 4, 2026
c9ec916
test(junior): Split sandbox executor snapshots
dcramer Jun 4, 2026
dcd69b6
test(junior): Split sandbox executor bash tests
dcramer Jun 4, 2026
2d7d5ff
test(junior): Split sandbox executor tool tests
dcramer Jun 5, 2026
3532c96
test(junior): Extract respond runtime fixture
dcramer Jun 5, 2026
a37a402
test(junior): Extract MCP respond harness
dcramer Jun 5, 2026
a6d2ed4
test(junior): Split MCP respond scenarios
dcramer Jun 5, 2026
3a4bd66
test(junior): Split CLI check suites
dcramer Jun 5, 2026
b9a644e
test(junior): Split subscribed routing suites
dcramer Jun 5, 2026
a0f017d
test(junior): Split turn session record suites
dcramer Jun 5, 2026
8ce253e
test(junior): Split Slack schedule tool suites
dcramer Jun 5, 2026
f4b51e5
test(junior): Split MCP OAuth callback suites
dcramer Jun 5, 2026
cc82d86
test(junior): Split MCP auth runtime suites
dcramer Jun 5, 2026
fe7c7c9
test(junior): Split OAuth callback Slack suites
dcramer Jun 5, 2026
71538d0
test(junior): Move timeout resume runner tests
dcramer Jun 5, 2026
667976f
test(junior): Split runtime dependency snapshot suites
dcramer Jun 5, 2026
5029ecc
test(junior): Split Slack turn resume suites
dcramer Jun 5, 2026
c91f557
test(junior): Rework OAuth callback route tests
dcramer Jun 5, 2026
448eadb
test(junior): Rework MCP OAuth callback route tests
dcramer Jun 5, 2026
f72671a
test(junior): Split OAuth resume Slack suites
dcramer Jun 5, 2026
6fd9c78
test(junior): Move respond runtime orchestration tests
dcramer Jun 5, 2026
ef27d3d
test(junior): Move lazy sandbox respond coverage
dcramer Jun 5, 2026
2a1cda2
test(junior): Move respond startup errors
dcramer Jun 5, 2026
8c964ca
test(junior): Remove respond runtime mock fixture
dcramer Jun 5, 2026
8f16778
test(junior): Move MCP respond tests to component ports
dcramer Jun 5, 2026
3cce7bc
test(junior): Move sandbox executor coverage to component
dcramer Jun 5, 2026
48ba071
test(junior): Group Slack resume integration suites
dcramer Jun 5, 2026
6bb42a9
test(junior): Organize Slack tool integration suites
dcramer Jun 5, 2026
6558599
test(junior): Organize OAuth callback integration suites
dcramer Jun 5, 2026
f11395a
test(junior): Split MCP OAuth resume lock coverage
dcramer Jun 5, 2026
b6810ed
docs(testing): Record cleanup completion
dcramer Jun 5, 2026
9e665f8
test(junior): Split Slack message content suites
dcramer Jun 5, 2026
a65baa4
test(junior): Use App Home builder deps
dcramer Jun 5, 2026
15b50d5
test(junior): Use plugin auth orchestration deps
dcramer Jun 5, 2026
96b091c
test(junior): Use MCP auth orchestration deps
dcramer Jun 5, 2026
a58220a
test(junior): Drop turn-session log assertion
dcramer Jun 5, 2026
65afd75
test(junior): Dedupe tool error handler coverage
dcramer Jun 5, 2026
7996a45
test(junior): Use real tool error handling in agent tools
dcramer Jun 5, 2026
14d8d7e
test(junior): Move Slack emoji rules to unit coverage
dcramer Jun 5, 2026
4d5d740
test(junior): Trim duplicate reaction alias coverage
dcramer Jun 5, 2026
d612a06
test(junior): Use snapshot warmup CLI deps
dcramer Jun 5, 2026
5d88603
test(junior): Move snapshot tests to component layer
dcramer Jun 5, 2026
eee71e1
test(junior): Trim duplicate sandbox data path case
dcramer Jun 5, 2026
f6a82a2
test(junior): Use turn session record services
dcramer Jun 5, 2026
872e73c
test(junior): Use capability factory deps
dcramer Jun 5, 2026
5a56064
test(junior): Use real plugin package discovery
dcramer Jun 5, 2026
31af137
test(junior): Use real skill plugin discovery
dcramer Jun 5, 2026
9d96a31
test(junior): Use snapshot resolver services
dcramer Jun 5, 2026
25aa65a
test(junior): Use config defaults services
dcramer Jun 5, 2026
1e60c58
test(junior): Use sandbox executor services
dcramer Jun 5, 2026
1df6e8b
test(junior): Use sandbox egress services
dcramer Jun 5, 2026
4059011
test(junior): Use respond MCP services
dcramer Jun 5, 2026
c5a9aa1
test(junior): Move Slack resume tests to component
dcramer Jun 5, 2026
29164b0
test(junior): Use MCP OAuth services
dcramer Jun 5, 2026
544799a
test(junior): Use web fetch services
dcramer Jun 5, 2026
a835cf2
test(junior): Use image generation deps
dcramer Jun 5, 2026
cd11c21
test(junior): Use tool error services
dcramer Jun 5, 2026
2babfc8
test(junior): Inject OAuth callback handlers
dcramer Jun 5, 2026
f1b7f04
test(junior): Use MCP client factory
dcramer Jun 5, 2026
61441a6
test(junior): Use Slack outbound boundary
dcramer Jun 5, 2026
fe2d00f
test(junior): Organize unit test tree
dcramer Jun 5, 2026
1dd57b2
test(evals): Inject harness runtime factory
dcramer Jun 5, 2026
d53daa2
test(junior): Organize root unit tests
dcramer Jun 5, 2026
0719dc3
test(junior): Move traced stream test under pi
dcramer Jun 5, 2026
1f5c77e
docs(testing): Remove review diary
dcramer Jun 5, 2026
97ad8c2
test(junior): Trim subscribed classifier cases
dcramer Jun 5, 2026
70e5d19
test(junior): Dedupe agent auth tool cases
dcramer Jun 5, 2026
1386cc7
test(junior): Thin duplicated test scaffolding
dcramer Jun 5, 2026
e7710ae
test(junior): Move plugin set checks to owner tests
dcramer Jun 5, 2026
d7e5cdf
test(junior): Trim duplicated status and sandbox cases
dcramer Jun 5, 2026
30b20e3
test(junior): Tighten shared test fixtures
dcramer Jun 5, 2026
afd5d2f
test(junior): Share turn session message fixtures
dcramer Jun 5, 2026
5e644b4
test(junior): Tighten turn result status fixtures
dcramer Jun 5, 2026
fb7bd80
test(junior): Centralize skill test lifecycle fixtures
dcramer Jun 5, 2026
196737c
test(junior): Assert Slack state instead of prompt prose
dcramer Jun 5, 2026
cac45b0
test(junior): Assert Slack state over prompt probes
dcramer Jun 5, 2026
f5be126
test(junior): Thin thinking level router tests
dcramer Jun 5, 2026
fe77588
test(junior): Inject sandbox adapter services
dcramer Jun 5, 2026
fb64533
test(junior-evals): Score thinking level routing
dcramer Jun 5, 2026
e895b29
docs(evals): Capture generation fixture boundaries
dcramer Jun 5, 2026
6106736
fix(junior-evals): Align harness with eval types
dcramer Jun 5, 2026
6bc0289
test(junior): Thin duplicate Slack timing coverage
dcramer Jun 5, 2026
91198f3
test(junior): Dedupe auth pause assertions
dcramer Jun 5, 2026
81af59c
test(junior): Drop MCP auth call counters
dcramer Jun 5, 2026
b4dc981
test(junior): Harden testing boundary seams
dcramer Jun 5, 2026
ea6bf74
test(junior): Add shared Vitest fixtures
dcramer Jun 5, 2026
6468f8b
docs(testing): Tighten mock and telemetry policy
dcramer Jun 5, 2026
9ee8e66
test(junior): Remove feature-level telemetry assertions
dcramer Jun 5, 2026
3d67058
test(junior): Harden test boundary cleanup
dcramer Jun 5, 2026
89f1238
test(junior): Share direct tool test fixtures
dcramer Jun 5, 2026
120b851
test(junior): Add shared test clock helpers
dcramer Jun 5, 2026
73d789d
test(junior): Centralize fake clock setup
dcramer Jun 5, 2026
a0fcf09
test(junior): Freeze schedule tool fixture clock
dcramer Jun 5, 2026
df3ae2d
test(junior): Use deterministic fixture expiries
dcramer Jun 5, 2026
92183db
test(junior): Type plugin auth token store fixture
dcramer Jun 5, 2026
b193195
test(junior): Type agent tool test fixtures
dcramer Jun 5, 2026
c7f4881
test(junior): Use real thread context messages
dcramer Jun 5, 2026
f19de89
test(evals): Cover low thinking routing
dcramer Jun 5, 2026
5c02060
test(junior): Merge load skill tool tests
dcramer Jun 5, 2026
8f10a8f
test(junior): Tighten MCP call tool fixtures
dcramer Jun 5, 2026
94e72af
test(junior): Tighten web search unit fixtures
dcramer Jun 5, 2026
ef8e8bd
test(junior): Type image generation fixtures
dcramer Jun 5, 2026
3336d63
test(junior): Reapply cleanup after rebase
dcramer Jun 6, 2026
a31f809
test(junior): Use renamed boundary check in coverage
dcramer Jun 6, 2026
f5eeb4f
ref(test): Remove trivial DI from testing seams
dcramer Jun 6, 2026
36b5977
test(runtime): Trim brittle component test seams
dcramer Jun 6, 2026
60e7e5c
test(junior): Prune low-signal behavior checks
dcramer Jun 6, 2026
d1a0c05
test(junior): Finish test-suite cleanup pass
dcramer Jun 6, 2026
a464d0b
ref(test): Flatten Slack runtime test adapters
dcramer Jun 6, 2026
1cfa549
ref(test): Remove test-only dependency seams
dcramer Jun 6, 2026
dea6e2c
test(evals): Cover unavailable image analysis
dcramer Jun 6, 2026
f06bb58
fix(evals): Use runtime adapter overrides
dcramer Jun 6, 2026
fe7ffc1
test(junior): Reconcile runtime fixtures after rebase
dcramer Jun 8, 2026
cb66f6d
test(junior): Fix heartbeat coverage run expectations
dcramer Jun 8, 2026
6d3bf4e
test(junior): Move ingress coverage to integration tests
dcramer Jun 8, 2026
6a8b71a
test(junior): Reconcile testing cleanup after rebase
dcramer Jun 12, 2026
0b75c6d
fix(evals): Align chat peer dependencies
dcramer Jun 13, 2026
74c8bd1
ref(test): Tighten test fixture boundaries
dcramer Jun 13, 2026
8f6725b
ci: Restore frozen install and coverage timeouts
dcramer Jun 13, 2026
2c24908
test(junior): Centralize ordinary Vitest timeouts
dcramer Jun 13, 2026
2 changes: 1 addition & 1 deletion package.json
@@ -27,7 +27,7 @@
"test:watch": "pnpm --filter @sentry/junior test:watch",
"evals": "pnpm --filter @sentry/junior-evals evals",
"evals:record": "pnpm --filter @sentry/junior-evals evals:record",
-"typecheck": "pnpm --filter @sentry/junior-plugin-api typecheck && pnpm --filter @sentry/junior-scheduler typecheck && pnpm --filter @sentry/junior typecheck && pnpm --filter @sentry/junior-dashboard typecheck && pnpm --filter @sentry/junior-testing typecheck && pnpm --filter @sentry/junior-example typecheck",
+"typecheck": "pnpm --filter @sentry/junior-plugin-api typecheck && pnpm --filter @sentry/junior-scheduler typecheck && pnpm --filter @sentry/junior typecheck && pnpm --filter @sentry/junior-evals typecheck && pnpm --filter @sentry/junior-dashboard typecheck && pnpm --filter @sentry/junior-testing typecheck && pnpm --filter @sentry/junior-example typecheck",
"skills:check": "pnpm --filter @sentry/junior skills:check",
"test:ci": "pnpm --filter @sentry/junior build && pnpm --filter @sentry/junior-dashboard build && pnpm --filter @sentry/junior test:coverage && pnpm --filter @sentry/junior-dashboard test:coverage"
},
33 changes: 20 additions & 13 deletions packages/junior-evals/README.md
@@ -23,7 +23,7 @@ Quick mapping:
- `evals/*`: Integration-style coverage for conversation-level agent behavior and quality scoring through the runtime harness.
- `tests/unit/*` (or non-integration tests): isolated logic/invariant tests.

-This separation is enforced by `pnpm --filter @sentry/junior run test:slack-boundary`.
+This separation is enforced by `pnpm --filter @sentry/junior run test:boundaries`.

## What Is In Scope

@@ -59,22 +59,28 @@ For each `it()` case inside a `describeEval()` suite:
2. Create a fresh runtime instance for the case via the chat composition root; do not mutate the production singleton runtime.
3. Route message events through real ingress + queue-worker behavior, with only the external queue transport replaced by an in-memory harness shim.
4. Return observed artifacts as JSON for LLM judgment, including structured `assistant_posts` with text plus actual attached-file metadata, and Slack-visible metadata.
+The output also includes compact `turn_diagnostics` so evals can assert user-facing runtime metadata such as selected thinking level without scraping logs.
The helper pretty-prints this JSON so failure output stays readable in local runs and CI.
5. `vitest-evals` scores the output against `criteria` (A–E → 1.0–0.0).
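
The A–E scale in step 5 can be read as a linear mapping onto scores. The following is a minimal sketch, assuming the five grades are evenly spaced between the two endpoints the text gives (A = 1.0, E = 0.0). The intermediate values and the `gradeToScore` helper are illustrative only and are not confirmed as `vitest-evals` behavior.

```typescript
// Hypothetical helper: only the endpoints (A = 1.0, E = 0.0) come from the README.
// Evenly spaced intermediate steps are an assumption.
type Grade = "A" | "B" | "C" | "D" | "E";

const GRADES: readonly Grade[] = ["A", "B", "C", "D", "E"];

function gradeToScore(grade: Grade): number {
  const index = GRADES.indexOf(grade);
  // Linear interpolation from 1.0 (first grade) down to 0.0 (last grade).
  return 1 - index / (GRADES.length - 1);
}

for (const grade of GRADES) {
  console.log(grade, gradeToScore(grade));
}
```

Under this assumption each grade step costs 0.25, so a case judged C would score 0.5.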

Harness override knobs (in `EvalOverrides`):

-- `auto_complete_mcp_oauth`: after our app genuinely starts an MCP OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
-- `auto_complete_oauth`: after our app genuinely starts a generic OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
-- `credential_providers`: seed normal provider credentials for the listed providers. GitHub uses dummy GitHub App env vars plus an intercepted installation-token exchange; Sentry uses the normal OAuth token store.
-- `fail_reply_call`: force a non-retryable reply failure on a specific call.
-- `mock_image_generation`: stub the image-generation HTTP response with a valid image payload while still exercising the real attachment path.
-- `plugin_dirs`: load plugin fixtures from eval-local directories without adding workspace packages.
-- `reply_texts`: override returned reply text per call.
-- `reply_timeout_ms`: lower or set the per-reply harness timeout for a specific scenario. It cannot exceed 30 seconds.
-- `subscribed_decisions`: controls the subscribed-message reply gate in the harness. If you use it, do not claim that reply-selection behavior is being validated by the eval itself.
-
-These knobs work by overriding services on the eval-local runtime instance. They must not reintroduce mutable global runtime behavior seams.
+- `auth.autoCompleteMcpOAuth`: after our app genuinely starts an MCP OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
+- `auth.autoCompleteOAuth`: after our app genuinely starts a generic OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
+- `auth.credentialProviders`: seed normal provider credentials for the listed providers. GitHub uses dummy GitHub App env vars plus an intercepted installation-token exchange; Sentry uses the normal OAuth token store.
+- `plugins.pluginDirs`: load plugin fixtures from eval-local directories without adding workspace packages.
+- `plugins.pluginPackages`: load named workspace plugin packages for plugin-specific behavior evals.
+- `plugins.skillDirs`: load skill fixture directories into the real reply-generation path.
+- `replyGeneration.cannedResults`: return structured reply results for downstream delivery or resilience scenarios.
+- `replyGeneration.cannedTexts`: return reply text per successful call for downstream delivery scenarios.
+- `replyGeneration.failCall`: force a non-retryable reply failure on a specific call.
+- `replyGeneration.mockImageGeneration`: stub the image-generation HTTP response with a valid image payload while still exercising the real attachment path.
+- `replyGeneration.timeoutMs`: lower or set the per-reply harness timeout for a specific scenario. It cannot exceed 30 seconds.
+- `replyGeneration.unsetGatewayCredentials`: remove gateway credentials for the duration of real reply generation when the scenario explicitly covers missing credential behavior.
+- `subscribedReplyDecisions`: controls the subscribed-message reply gate in the harness. If you use it, do not claim that reply-selection behavior is being validated by the eval itself.
+
+These knobs configure role-named scenario adapters on the eval-local runtime instance. They must not reintroduce mutable global runtime behavior seams or nested production service override bags.
+`replyGeneration.cannedTexts` and `replyGeneration.cannedResults` bypass real reply generation, so use them only for downstream delivery behavior, not prompt, model-routing, or thinking-level coverage.
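
To make the grouped knob names concrete, here is a rough sketch of how such an overrides object might be shaped and sanity-checked. The field names follow the list above, but the value types (`string[]`, `number`), the `EvalOverridesSketch` interface, and the `checkOverrides` helper are all assumptions for illustration. The real `EvalOverrides` type in the harness may differ.

```typescript
// Hypothetical sketch: field names come from the README list above; value
// types and this checker are assumptions, not the harness's actual API.
interface EvalOverridesSketch {
  auth?: {
    autoCompleteMcpOAuth?: string[]; // assumed: provider names
    autoCompleteOAuth?: string[];
    credentialProviders?: string[];
  };
  plugins?: {
    pluginDirs?: string[];
    pluginPackages?: string[];
    skillDirs?: string[];
  };
  replyGeneration?: {
    cannedTexts?: string[];
    failCall?: number; // assumed: index of the call to fail
    timeoutMs?: number;
  };
}

// The README states the per-reply timeout cannot exceed 30 seconds.
const MAX_REPLY_TIMEOUT_MS = 30000;

// Returns human-readable notes about rules the README spells out.
function checkOverrides(overrides: EvalOverridesSketch): string[] {
  const notes: string[] = [];
  const reply = overrides.replyGeneration;
  if (reply?.timeoutMs !== undefined && reply.timeoutMs > MAX_REPLY_TIMEOUT_MS) {
    notes.push("replyGeneration.timeoutMs cannot exceed 30 seconds");
  }
  if (reply?.cannedTexts !== undefined) {
    notes.push("replyGeneration.cannedTexts bypasses real reply generation");
  }
  return notes;
}

const overrides: EvalOverridesSketch = {
  auth: { autoCompleteOAuth: ["sentry"], credentialProviders: ["github"] },
  replyGeneration: { timeoutMs: 10000 },
};

console.log(checkOverrides(overrides)); // prints []
```

Grouping the knobs under `auth`, `plugins`, and `replyGeneration` keeps each scenario adapter's inputs together, which matches the stated goal of avoiding a flat bag of service overrides.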

Tool replay:

@@ -106,7 +112,7 @@ Evals require real Vercel Sandbox access. If sandbox bootstrap fails, the eval f

- Add core cases under `evals/core/*.eval.ts` and plugin-specific cases under `evals/<plugin>/` using `describeEval()` with `slackEvals`.
- Use event builders (`mention`, `threadMessage`, `threadStart`) from `evals/helpers.ts`.
-- Use `auto_complete_mcp_oauth` or `auto_complete_oauth` when the harness should instantly complete the fake provider callback after our app has genuinely initiated auth.
+- Use `auth.autoCompleteMcpOAuth` or `auth.autoCompleteOAuth` when the harness should instantly complete the fake provider callback after our app has genuinely initiated auth.
- For multi-turn, pass the same `thread` override so events land in one thread.
- Keep each case focused on one primary behavior.
- Encode all expectations in `criteria`; do not add deterministic inline assertions.
@@ -127,6 +133,7 @@ Do not do these in eval files:

- Do not import `@/chat/slack/*` directly.
- Do not use MSW Slack helpers (`queueSlackApiResponse`, `getCapturedSlackApiCalls`, `queueSlackApiError`, `queueSlackRateLimit`).
+- Do not import raw Slack capture wrappers. Use eval artifact helpers that expose Slack-visible posts, reactions, canvases, or files instead.
- Do not validate raw Slack Web API request payload shapes from evals.
- Do not validate implementation internals (exact tool names, sandbox IDs, or other non-user-visible details) unless the scenario explicitly evaluates those surfaces.
