test(junior): Rework testing architecture #532
Draft
dcramer wants to merge 130 commits into
0d12ef9 to 43a47d4
Replace repeated any-cast Slack message stubs with a small Message fixture. This keeps the unit suite focused on thread context normalization while exercising the real Chat SDK message shape. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Add a focused routing eval for deterministic one-step transforms. The eval asserts turn diagnostics directly so thinking-level routing is checked as behavior rather than incidental rubric prose. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Move the host loadSkill cases into the canonical tool suite and delete the misplaced skills test file. Keep the same coverage while removing result any-casts, cleaning up temporary skill directories, and avoiding real skill discovery in the unknown-skill unit case. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Centralize the fake MCP manager and tool result builders in the callMcpTool unit suite. Keep the invalid payload coverage while containing the unsafe call path in one helper. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Centralize webSearch execution and AI SDK result fixtures so the suite keeps the same Gateway adapter coverage with fewer casts and less repeated setup. Restore the patched AbortController in a finally block for better isolation. Co-Authored-By: GPT-5 Codex <codex@openai.com>
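A minimal sketch of the restore-in-finally pattern this commit describes, assuming the suite swaps `globalThis.AbortController` for a test double. `withPatchedAbortController` and `RecordingAbortController` are hypothetical names for illustration, not helpers from this branch.

```typescript
// Hypothetical helper: run `body` with the global AbortController replaced,
// and always restore the original, even when `body` throws or rejects.
async function withPatchedAbortController<T>(
  replacement: typeof AbortController,
  body: () => Promise<T> | T,
): Promise<T> {
  const original = globalThis.AbortController;
  globalThis.AbortController = replacement;
  try {
    return await body();
  } finally {
    // Runs on success and on failure, so later tests see the real class.
    globalThis.AbortController = original;
  }
}

// Illustrative test double that counts how many controllers were created.
class RecordingAbortController extends AbortController {
  static created = 0;

  constructor() {
    super();
    RecordingAbortController.created += 1;
  }
}
```

Keeping the restore in `finally` means a failing assertion inside `body` cannot leak the patched global into the next test.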
Use typed completeText and fetch fixtures in the imageGenerate unit suite and centralize tool execution. This keeps the same adapter coverage while removing repeated execute casts and broad dependency casts. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Keep rebased resume and reporting tests aligned with the testing policy. Drop stale telemetry footer assertions and preserve the focused runtime test seams. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Keep the coverage test script aligned with the consolidated test boundary policy command. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Use real plugin registry, memory state, env stubs, fake timers, and temp files for tests that previously relied on production dependency wrappers. Keep explicit fakes at real external boundaries such as Vercel Sandbox, Slack delivery, OAuth launch, model completion, and HTTP fetch. Update testing policy docs to reject production dependency parameters for fs, env, time, logging, spans, and local helpers. This keeps behavior paths wired through real adapters by default. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Remove low-signal prompt-shape and persistence-failure cases from runtime component tests. Keep auth, yield, and timeout contracts covered through real state and adapter boundaries, and make the snapshot lock wait test use fake timers. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Remove duplicate sandbox and Slack image tests that asserted private implementation details or call-count-only behavior. Normalize dashboard reporting tests onto the shared Vitest clock helper. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Remove low-signal sandbox assertions, private prompt-wrapper checks, and duplicated Slack test helpers. Keep coverage focused on public behavior while sharing small fixture utilities across Slack integration tests. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Replace nested runtime service override bags with role-named adapter controls. Remove the Slack runtime clock dependency and use the shared fake clock helper in tests. Document when to use module-owned adapter selection versus explicit runtime scenario adapters so the test seam stays narrow. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Remove dependency injection that only existed to steer local helpers in tests. Keep production code on direct filesystem, skill loading, and turn-session state helpers where those are not real adapter boundaries. Update the affected tests to use temp app/plugin files, memory state, and env fixtures so coverage exercises the production paths more honestly. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Add an eval for image attachments when vision is unavailable so the model must acknowledge the image without inventing contents. Remove the remaining webFetch local-helper injection seam and its call-choreography unit test. Keep image generation adapters limited to the external model and fetch boundaries. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Update the eval behavior harness to use the flat Slack runtime adapter API so eval fixtures keep replacing only named scenario boundaries. Remove the broad runtime-factory override from harness unit tests and route those tests through the real Slack runtime with deterministic reply fixtures. Add the eval package typecheck to the normal root typecheck path so harness contract drift is caught before evals run. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Add a default Slack destination in the shared test runtime fixture so behavior tests keep using real runtime wiring after the destination contract from main. Remove stale generic tool-context channel capability overrides and update the subscribed-message retry test to use runtime adapter overrides. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Update heartbeat resume recovery tests to include the runtime destination now required for timeout resume scheduling. Adjust the scheduler heartbeat blocked-run case to exercise invalid credential routing, since scheduler storage now rejects malformed destinations before heartbeat can process them. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Replace prototype-style slash and JuniorChat ingress unit tests with signed Slack slash-command integration coverage. Add deterministic webFetch integration coverage for page extraction, image delivery, and HTTP client failures. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Preserve the mainline conversation-work and reporting changes while keeping the test cleanup branch focused on reliable boundaries. Prune stale split tests, move auth orchestration coverage to component tests, and keep shared fixtures aligned with the runtime contracts. Fix timeout continuation retries to use the timeout-resume reason while accepting legacy continuation errors during the cutover. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Add the ai and zod peer dependencies used by chat so the eval harness resolves the same chat type instance as Junior runtime fixtures. This keeps the rebased eval typecheck green without changing test behavior. Co-Authored-By: GPT-5 Codex <codex@openai.com>
18e46e8 to 0b75c6d
Narrow runtime test adapters to role-named scenario seams and group eval harness overrides by contract area. Move shared fixtures into feature folders, split the broad respond helper module, and update testing policy/enforcement so raw Slack captures and legacy flat eval override keys do not drift back in. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 74c8bd1.
Align the eval package lockfile entry with the root ai override so pnpm frozen install succeeds in CI. Set shared Vitest timeouts for the coverage-heavy Junior suite and reserve explicit timeouts for known long-running build checks. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Remove stale per-test timeout overrides that are now covered by the shared Junior Vitest timeout budget. Keep local overrides reserved for known slow external or build boundaries. Co-Authored-By: GPT-5 Codex <codex@openai.com>

Reworks Junior's testing architecture so behavior coverage lives in the right layer: evals for agent-facing outcomes, integration for real runtime/Slack wiring, component tests for deterministic orchestration ports, and unit tests for local invariants. This branch also thins duplicate brittle tests, removes test-only production singleton mutation patterns, and updates the testing policies so mocks and telemetry assertions stay rare and explicit.
Boundary Enforcement
The old Slack-specific checker is now the broader `test:boundaries` command. It runs from both Junior and eval package scripts, scans eval sources, rejects integration module mocks, and blocks observability mocks/assertions outside rare `tests/unit/logging/**` contract suites.
Harness And Fixtures
Deterministic controls now use named harness ports, shared Vitest helpers, default clock helpers, memory adapters, MSW fixtures, and a shared direct-tool runtime fixture instead of module mocks or ad hoc empty runtime objects. Reply runtime overrides sit under `ReplyRequestContext.harness`, capability catalog injection no longer shares the production global cache across test sources, direct Slack tool contracts use typed state/context fixtures, and evals expose compact turn diagnostics instead of scraping logs, spans, or prompts.
Auth Regression
The cleanup uncovered a pending-auth reuse bug, which this branch fixes: MCP auth reused a direct state import, and both the MCP and plugin reuse checks depended on wall time. Pending auth reuse now flows through injected services and clocks, with focused regression coverage.
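A rough sketch of why an injected clock makes this kind of reuse check testable. Every name here (`Clock`, `PendingAuthStore`, `reuseOrCreatePendingAuth`, the 10-minute window) is an assumption made for illustration; none of it is taken from Junior's code.

```typescript
// Hypothetical types; not Junior's real contracts.
interface Clock {
  now(): number; // epoch milliseconds
}

interface PendingAuth {
  provider: string;
  authorizationUrl: string;
  createdAtMs: number;
}

interface PendingAuthStore {
  get(key: string): PendingAuth | undefined;
  set(key: string, value: PendingAuth): void;
}

// Assumed reuse window, chosen only for the example.
const PENDING_AUTH_TTL_MS = 10 * 60 * 1000;

// Reuse a pending auth request while it is still fresh; otherwise create one.
// Time and state come from `deps`, never from Date.now() or a module import.
function reuseOrCreatePendingAuth(
  deps: { clock: Clock; store: PendingAuthStore },
  key: string,
  create: () => { provider: string; authorizationUrl: string },
): { auth: PendingAuth; reused: boolean } {
  const now = deps.clock.now();
  const existing = deps.store.get(key);
  if (existing && now - existing.createdAtMs < PENDING_AUTH_TTL_MS) {
    return { auth: existing, reused: true };
  }
  const fresh: PendingAuth = { ...create(), createdAtMs: now };
  deps.store.set(key, fresh);
  return { auth: fresh, reused: false };
}

// Deterministic test doubles for the two injected dependencies.
function createFakeClock(startMs = 0) {
  let current = startMs;
  return {
    now: () => current,
    advance(ms: number) {
      current += ms;
    },
  };
}

function createMemoryStore(): PendingAuthStore {
  const entries = new Map<string, PendingAuth>();
  return {
    get: (key) => entries.get(key),
    set: (key, value) => {
      entries.set(key, value);
    },
  };
}
```

With both dependencies passed in, a regression test can advance the fake clock to just before and exactly at the expiry boundary without sleeping or patching globals.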
Local Junior typecheck and the full package test command pass. The eval harness unit tests pass; live eval execution is still blocked in this worktree by missing Vercel/Gateway project configuration.
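To make the Boundary Enforcement rules above more concrete, here is a hedged sketch of what a source-scanning check could look like. The rule names, the regular expressions, and the `tests/integration/` path are assumptions for illustration; the real `test:boundaries` implementation may work quite differently. `vi.mock` and `vi.spyOn` are the standard Vitest calls being scanned for.

```typescript
// Hypothetical rule set, loosely modeled on the PR description.
interface Violation {
  file: string;
  rule: string;
}

// Any Vitest module mock.
const MODULE_MOCK = /\bvi\.mock\s*\(/;

// A mock or spy whose arguments mention logging or tracing (assumed heuristic).
const OBSERVABILITY_MOCK =
  /\bvi\.(mock|spyOn)\s*\([^)]*\b(logging|logger|tracing|spans?)\b/i;

function checkTestBoundaries(file: string, source: string): Violation[] {
  const path = file.replace(/\\/g, "/"); // normalize Windows separators
  const violations: Violation[] = [];

  // Integration tests should exercise real wiring, so module mocks are rejected.
  if (path.includes("tests/integration/") && MODULE_MOCK.test(source)) {
    violations.push({ file, rule: "no-integration-module-mocks" });
  }

  // Observability mocks are only tolerated inside the logging contract suites.
  if (!path.includes("tests/unit/logging/") && OBSERVABILITY_MOCK.test(source)) {
    violations.push({ file, rule: "no-observability-mocks" });
  }

  return violations;
}
```

A script wrapper would walk the test directories, call the check per file, and exit non-zero when any violation is returned.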