fix(guardrails): recover from provider outages by zaxbysauce · Pull Request #875 · zaxbysauce/opencode-swarm

zaxbysauce · 2026-05-16T03:17:50Z

Summary

Add next-turn Architect recovery guidance when a transcript shows a transient provider outage such as provider_unavailable / Network connection lost.
Keep recovery guidance bounded and session-scoped so repeated message transforms do not spam duplicate instructions.
Preserve the existing structured provider-error retry path from latest main and add release notes for the additional Architect recovery behavior.

Invariant audit

1 (plugin init): not touched — no plugin initialization path changes; node scripts/repro-704.mjs was run and timed out on both this branch and clean origin/main.
2 (runtime portability): touched — bun run build passed; node --input-type=module -e "await import('./dist/index.js'); console.log('dist import OK')" printed dist import OK.
3 (subprocesses): not touched — no subprocess call sites changed.
4 (.swarm containment): not touched — no .swarm storage or working-directory containment changes.
5 (plan durability): not touched — no plan ledger, projection, checkpoint, or status-shape changes.
6 (test_runner safety): not touched — validation used shell bun/node commands; no broad OpenCode test_runner validation was used.
7 (test writing): touched — loaded .opencode/skills/writing-tests/SKILL.md; new/updated tests use bun:test and _internals dependency injection instead of mock.module for changed hook collaborators.
8 (session state): touched — AgentSessionState.lastProviderRecoveryFingerprint is session-scoped and covered by same-array plus fresh-array advisory dedupe tests.
9 (guardrails/retry): touched — targeted hook tests cover no-status OpenRouter provider interruption classification, transient retry budget behavior, and fallback-exhausted retry semantics.
10 (chat/system msg): touched — recovery guidance is injected through the existing model-only system advisory surface; advisory-injection.test.ts covers injection, dedupe, and false-positive prose behavior.
11 (tool registration): not touched — no tools, TOOL_NAMES, AGENT_TOOL_MAP, command, or help surface changes.
12 (release/cache): touched — added docs/releases/v7.20.1.md; release-please-owned version files were not edited manually.

Test plan

bun run build
git diff --exit-code -- dist identified tracked dist drift after build; dist artifacts are included in this PR.
node --input-type=module -e "await import('./dist/index.js'); console.log('dist import OK')"
bun run typecheck
bunx biome ci . (passed with existing warnings in src/turbo/lean/runner.ts)
bun --smol test tests/unit/hooks/advisory-injection.test.ts tests/unit/hooks/guardrails-error-classification.test.ts tests/unit/hooks/guardrails-model-fallback.test.ts tests/unit/hooks/guardrails-fallback.test.ts tests/unit/hooks/guardrails-model-fallback.adversarial.test.ts tests/unit/hooks/guardrails.test.ts tests/unit/hooks/telemetry-guardrails-wiring.test.ts --timeout 30000 (228 pass, 1 pre-existing failure also reproduced on origin/main)
git diff --check

Pre-existing failures

node scripts/repro-704.mjs timed out after 304s on this branch and on a freshly built clean origin/main worktree at C:\tmp\opencode-swarm-main-856.
tests/unit/hooks/guardrails-model-fallback.adversarial.test.ts has a stale object-error assertion after current main added structured provider-error extraction; the same failure reproduces on origin/main.
tests/unit/tools/diagnostic-gating.test.ts, doc-scan*.test.ts, multiple Lean Turbo phase-complete* tests, repo-graph symlink tests, registration/conformance tests, sast-and-cochanger-tools.test.ts, test-runner-scope-cap.test.ts, and update-task-status*.test.ts all had failures that reproduced on clean origin/main.
tests/unit/services/skill-generator.test.ts fails on clean origin/main due Windows path separator expectations.
The combined bun --smol test tests/unit/cli tests/unit/commands tests/unit/config --timeout 120000 bucket is also red in this environment with mock/command isolation failures and child bun spawn ENOENT; it was not treated as a pass signal.

zaxbysauce

Swarm PR Review — PR #875 `fix(guardrails): recover from provider outages`

Structured review via 6-phase swarm workflow: 6 parallel explorer lanes → independent reviewer → T2 targeted re-verification → synthesis.
PR is currently in DRAFT state — feedback below is intended to inform the remaining work before promotion.

Verdict: REQUEST_CHANGES

Reason: Two confirmed HIGH gaps (T2: new provider-recovery detection logic is entirely untested; T4: dedup suppression invariant O-002 has no test) plus one confirmed HIGH finding (T1: pre-existing mock.module() debt). Core implementation is sound — five HIGH/MEDIUM security and correctness explorer candidates were disproved after reviewer verification.

Explorer False Positives (DISPROVED — calling these out explicitly)

The following candidates were raised by explorers but are not real issues:

ID	Explorer Claim	Verdict	Reason
C4	HIGH — regex `[_\s-]` char class ambiguity	DISPROVED	`-` at end of char class is always literal in JS regex; pattern is correct
S1	MEDIUM — prompt injection via advisory embedding	DISPROVED	Advisory uses a hardcoded template; zero user/provider-controlled text is embedded
S2	MEDIUM — HTML/Markdown escaping absent	DISPROVED	Output goes to system prompt string, not HTML; escaping is not applicable
C5	MEDIUM — fingerprint set before advisory injected	DISPROVED	Actual code order is correct: advisory pushed at lines 2326–2328, fingerprint set at 2329
D1	MEDIUM — version mismatch 7.20.0 vs v7.20.1	DISPROVED	Both `package.json` and `docs/releases/v7.20.0.md` are `7.20.0`; explorer misread filename

Pre-Existing Issues (not introduced by this PR)

ID	Severity	Finding	Location
T1	HIGH (pre-existing)	`mock.module()` used instead of `_internals` DI — violates invariant 7	`tests/unit/hooks/guardrails-error-classification.test.ts` lines 1138, 1168, 1211, 1255, 1318

T1 pre-dates PR #875 and is not a blocker for this PR, but is tracked tech debt.

Confirmed Findings (blockers and advisories)

🔴 T2 — HIGH | Missing test coverage for core new detection logic

tests/unit/hooks/advisory-injection.test.ts — verified by targeted reviewer agent reading the full file.

The five existing tests in advisory-injection.test.ts cover only the generic advisory queue mechanics (inject, clear, non-architect skip, create-when-absent, multi-join). The new provider-recovery detection path is entirely untested:

Function / Behavior	Test Status
`isTransientProviderFailureText()`	❌ NOT TESTED
`getProviderFailureFingerprint()`	❌ NOT TESTED
`getMostRecentAssistantText()`	❌ NOT TESTED
`lastProviderRecoveryFingerprint` dedup suppression	❌ NOT TESTED
Non-transient error does NOT trigger advisory	❌ NOT TESTED
Second identical fingerprint suppresses advisory (O-002)	❌ NOT TESTED

The invariant audit claims "advisory-injection.test.ts covers injection, dedupe, and false-positive prose behavior" — this is partially inaccurate. The dedup mechanism (O-002) is the central correctness claim of this PR and has no test.

Required tests before merge:

isTransientProviderFailureText() returns true for provider_unavailable / network connection lost, false for auth errors and prose
getProviderFailureFingerprint() returns stable output for same input, different for different input
getMostRecentAssistantText() returns empty string for empty array, correct text for non-empty
Same fingerprint on second call → no advisory injected (dedup)
Non-transient error (e.g. 401 unauthorized) → no advisory injected

🟡 T4 — MEDIUM | No test for fingerprint dedup suppression

Directly related to T2: the core O-002 obligation ("bounded and session-scoped, no spam") has no test verifying that a repeated identical provider outage fingerprint causes the advisory to be skipped on the second detection.

🟡 T5 — MEDIUM | No test for non-transient error non-injection

No test confirms that a non-matching error pattern (auth error, rate limit, validation failure) does not trigger the recovery advisory. This is essential to prevent advisory noise on hard errors that shouldn't trigger recovery guidance.

🟡 A1 — MEDIUM | O(N) transcript scan on every `messagesTransform` call

getMostRecentAssistantText() iterates messages backward unconditionally on every messagesTransform invocation for Architect sessions. For long-running sessions with large transcripts, this adds measurable latency per turn. Consider an early-exit gate and caching the last-seen message index.

ℹ️ C1 — ADVISORY (downgraded from MEDIUM)

The fingerprint is read and written with no await between them. Explorer flagged this as a TOCTOU race; in practice JavaScript's single-threaded execution model makes this a non-issue in the current hook architecture. Noted for awareness only.

Obligation Assessment

ID	Obligation	Status
O-001	Add next-turn Architect recovery guidance	✅ MET
O-002	Session-scoped, no duplicate spam	⚠️ PARTIALLY MET — code present, dedup invariant has no test
O-003	Preserve existing retry path	✅ MET
O-004	Release notes (`docs/releases/v7.20.0.md`)	✅ MET
O-005	`bun run build` passes, dist included	✅ MET
O-006	228 tests pass	⚠️ PARTIALLY MET — core new detection logic untested
O-007	`lastProviderRecoveryFingerprint` is session-scoped	✅ MET
O-008	New tests use `_internals` DI not `mock.module`	⚠️ UNVERIFIABLE — T2 gap means new detection logic has no tests at all

Summary

The implementation is clean and correct — advisory injection uses a hardcoded safe template, fingerprint dedup is synchronous and correct, session scoping is correct, release notes are accurate, existing retry path is preserved. The gap is entirely in test coverage: the new detection helpers and the dedup mechanism that are the heart of this PR's correctness claim are not tested. Adding the five tests listed under T2 closes the blockers.

Generated by Claude Code

zaxbysauce

Swarm Re-validation — PR #875 (updated commit `1a5a5f0`)

Re-ran targeted verification against all blockers from the previous review. Results:

Blocker	Status	Evidence
T2 (HIGH) — `isTransientProviderFailureText()` untested	✅ RESOLVED	Test `"provider recovery helpers classify, fingerprint, and select transcript text directly"` — directly exercises all three helper functions with true/false cases including 401 auth error
T2 (HIGH) — `getProviderFailureFingerprint()` untested	✅ RESOLVED	Same test verifies stability (same input → same output) and differentiation (different inputs → different outputs)
T2 (HIGH) — `getMostRecentAssistantText()` untested	✅ RESOLVED	Same test verifies correct message selection, multi-part filtering, and non-text-part exclusion
T4 (MEDIUM) — No dedup suppression test	✅ RESOLVED	Tests `"does not duplicate provider recovery guidance for the same interrupted transcript"` and `"does not duplicate provider recovery guidance when OpenCode supplies fresh message arrays"`
T5 (MEDIUM) — No non-transient error test	✅ RESOLVED	Test `"does not inject provider recovery guidance for non-transient auth failures"` — 401 unauthorized produces no TRANSIENT PROVIDER RECOVERY advisory

Only open item: A1 (MEDIUM advisory) — the backward scan of all messages in getMostRecentAssistantText() on every messagesTransform call is still O(N) with no early-exit gate. This was flagged as a performance advisory, not a blocker. Leaving it to the author's discretion.

Updated verdict: APPROVE_WITH_NOTES — all blockers resolved. A1 remains advisory-only. PR is still in DRAFT; when ready to promote, the one open item is the O(N) scan optimization.

Generated by Claude Code

…ble output Removes console.warn/console.info/console.error from the non-fatal diagnostic paths in executePhaseComplete and routes them to either the tool result warnings array (operator-visible, structured) or the debug-gated logger (OPENCODE_SWARM_DEBUG=1 only). Fixes issue #875. Changes: - safeWarn: console.warn → logger.warn (debug-gated) - Turbo-mode gate bypass: console.warn → warnings.push - drift_check disabled skip: console.info removed (warnings.push kept) - non-code phase skip: console.info removed (warnings.push kept) - Lock release failure: console.error → logger.warn (non-blocking, post-decision) Regression tests added for turbo-mode and non-code drift-skip paths asserting no console writes occur. ## Invariant audit - 1 (plugin init): not touched — no init path changes; phase-complete runs post-init - 2 (runtime portability): touched — dist/ rebuilt; node --input-type=module -e "await import('./dist/index.js')" → dist import OK - 3 (subprocesses): not touched — grep of changed source files shows no new spawn call sites - 4 (.swarm containment): not touched — no .swarm/ path changes - 5 (plan durability): not touched — no plan schema changes - 6 (test_runner safety): not touched — no test_runner changes - 7 (test writing): touched — two new bun:test regression tests with finally-block console restore; no mock.module usage - 8 (session state): not touched — no session state changes - 9 (guardrails/retry): not touched — no retry logic changes - 10 (chat/system msg): touched — grep src/tools/phase-complete.ts for console.warn/info/error returns zero matches; fix removes the violation - 11 (tool registration): not touched — no tool registration changes - 12 (release/cache): touched — docs/releases/v7.20.2.md created; package.json/CHANGELOG/.release-please-manifest.json untouched ## Pre-existing failures Tiers 2–4 contain pre-existing failures confirmed on clean main checkout: diagnostic-gating (3), phase-complete-lean-turbo-critic (9), phase-complete-lean-turbo-reviewer (7), sast-and-cochanger-tools (18), tool-registration-conformance (1), update-task-status.gate-fix (4), integration/phase-completion-e2e (subset), adversarial (10). None are in files touched by this PR. https://claude.ai/code/session_013N8A2pNYg7vwLtnjvP68cW

…ble output (#876) Removes console.warn/console.info/console.error from the non-fatal diagnostic paths in executePhaseComplete and routes them to either the tool result warnings array (operator-visible, structured) or the debug-gated logger (OPENCODE_SWARM_DEBUG=1 only). Fixes issue #875. Changes: - safeWarn: console.warn → logger.warn (debug-gated) - Turbo-mode gate bypass: console.warn → warnings.push - drift_check disabled skip: console.info removed (warnings.push kept) - non-code phase skip: console.info removed (warnings.push kept) - Lock release failure: console.error → logger.warn (non-blocking, post-decision) Regression tests added for turbo-mode and non-code drift-skip paths asserting no console writes occur. ## Invariant audit - 1 (plugin init): not touched — no init path changes; phase-complete runs post-init - 2 (runtime portability): touched — dist/ rebuilt; node --input-type=module -e "await import('./dist/index.js')" → dist import OK - 3 (subprocesses): not touched — grep of changed source files shows no new spawn call sites - 4 (.swarm containment): not touched — no .swarm/ path changes - 5 (plan durability): not touched — no plan schema changes - 6 (test_runner safety): not touched — no test_runner changes - 7 (test writing): touched — two new bun:test regression tests with finally-block console restore; no mock.module usage - 8 (session state): not touched — no session state changes - 9 (guardrails/retry): not touched — no retry logic changes - 10 (chat/system msg): touched — grep src/tools/phase-complete.ts for console.warn/info/error returns zero matches; fix removes the violation - 11 (tool registration): not touched — no tool registration changes - 12 (release/cache): touched — docs/releases/v7.20.2.md created; package.json/CHANGELOG/.release-please-manifest.json untouched ## Pre-existing failures Tiers 2–4 contain pre-existing failures confirmed on clean main checkout: diagnostic-gating (3), phase-complete-lean-turbo-critic (9), phase-complete-lean-turbo-reviewer (7), sast-and-cochanger-tools (18), tool-registration-conformance (1), update-task-status.gate-fix (4), integration/phase-completion-e2e (subset), adversarial (10). None are in files touched by this PR. https://claude.ai/code/session_013N8A2pNYg7vwLtnjvP68cW Co-authored-by: Claude <noreply@anthropic.com>

zaxbysauce commented May 16, 2026

View reviewed changes

fix(guardrails): recover from provider outages

1a5a5f0

zaxbysauce force-pushed the codex/issue-856-provider-recovery branch from 902e98c to 1a5a5f0 Compare May 16, 2026 16:20

zaxbysauce commented May 16, 2026

View reviewed changes

zaxbysauce marked this pull request as ready for review May 16, 2026 19:18

zaxbysauce merged commit a15e3a2 into main May 16, 2026
12 checks passed

zaxbysauce deleted the codex/issue-856-provider-recovery branch May 16, 2026 19:18

github-actions Bot mentioned this pull request May 16, 2026

chore(main): release 7.20.1 #877

Merged

zaxbysauce mentioned this pull request May 16, 2026

fix(phase-complete): keep internal phase diagnostics out of chat-visible output #876

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(guardrails): recover from provider outages#875

fix(guardrails): recover from provider outages#875
zaxbysauce merged 1 commit into
mainfrom
codex/issue-856-provider-recovery

zaxbysauce commented May 16, 2026

Uh oh!

zaxbysauce left a comment

Uh oh!

zaxbysauce left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zaxbysauce commented May 16, 2026

Summary

Invariant audit

Test plan

Pre-existing failures

Uh oh!

zaxbysauce left a comment

Choose a reason for hiding this comment

Swarm PR Review — PR #875 fix(guardrails): recover from provider outages

Verdict: REQUEST_CHANGES

Explorer False Positives (DISPROVED — calling these out explicitly)

Pre-Existing Issues (not introduced by this PR)

Confirmed Findings (blockers and advisories)

🔴 T2 — HIGH | Missing test coverage for core new detection logic

🟡 T4 — MEDIUM | No test for fingerprint dedup suppression

🟡 T5 — MEDIUM | No test for non-transient error non-injection

🟡 A1 — MEDIUM | O(N) transcript scan on every messagesTransform call

ℹ️ C1 — ADVISORY (downgraded from MEDIUM)

Obligation Assessment

Summary

Uh oh!

zaxbysauce left a comment

Choose a reason for hiding this comment

Swarm Re-validation — PR #875 (updated commit 1a5a5f0)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Swarm PR Review — PR #875 `fix(guardrails): recover from provider outages`

🟡 A1 — MEDIUM | O(N) transcript scan on every `messagesTransform` call

Swarm Re-validation — PR #875 (updated commit `1a5a5f0`)