fix(proxy): stop pool self-failover spam and operator-park stranding#44
Merged
nnemirovsky merged 1 commit intoMay 17, 2026
Merged
Conversation
A pooled member that returned 429/401 while every other member was cooling degraded back to itself: handlePoolFailover emitted a meaningless "X -> X" cred_failover plus one Telegram notice + audit row per agent retry (observed live as six identical "failed over X -> X (429)" notices), and the agent hard-failed because there was no real failover target. Root causes: - ResolveActive's all-cooled degrade picked the soonest-recovering member by time. `sluice pool rotate` parks the displaced member with reason "manual rotate" (300s) -- that member is healthy, just operator-deprioritized -- while a genuine 429 cools for 60s. So a rotate onto an exhausted account degraded back to the exhausted account instead of the parked-but-healthy peer, self-looping. - No dedup: an agent retry storm produced one notice/audit per request. Fixes: - vault.ManualRotateReason constant (shared with cmd/sluice rotate). ResolveActive now prefers an operator-parked-but-healthy member over a genuinely-failed one when every member is cooling, so a rotate onto an exhausted member fails over to the healthy peer. - handlePoolFailover detects to==from (no distinct target) as pool EXHAUSTION: emits the distinct pool_exhausted audit action and an honest "pool exhausted" operator notice instead of a self-referential cred_failover. FailoverEvent.Exhausted carries the distinction. - shouldEmitPoolNotice dedups identical (pool,from,to,tag) signals within a 30s window so a retry storm yields one row, not N. Tests: fail-before/pass-after for the degrade preference, the pool-exhausted self-failover suppression + dedup, and the real failover to an operator-parked-but-healthy peer.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Live operation of the credential pool surfaced two real failover defects (not v0.16.0-shippable as-is). Both are fixed here with fail-before/pass-after tests.
Symptom (observed in production)
Telegram showed six identical
pool openai_pool failed over openai_oauth_2 -> openai_oauth_2 (429)notices and the agent hard-failed (API call failed after 3 retries: HTTP 429) when the active pooled member was exhausted and the only other member was parked.Root causes
sluice pool rotateparks the displaced member with reason"manual rotate"(a 300s cooldown). That member is healthy, just operator-deprioritized. When the rotated-to member then 429s, every member is cooling, andResolveActive's all-cooled degrade picked the soonest-recovering by time — the genuinely-failed member (60s 429 TTL < 300s park) — so it degraded back onto the exhausted account and self-looped.handlePoolFailoveremitted a meaninglessfrom -> fromcred_failoverplus one audit row + one Telegram notice per agent retry.Fixes
vault.ManualRotateReasonconstant (shared withcmd/sluicerotate).ResolveActivenow prefers an operator-parked-but-healthy member over a genuinely-failed one when every member is cooling — apool rotateonto an exhausted member now fails over to the healthy peer instead of self-looping.handlePoolFailoverdetectsto == fromas pool exhaustion: distinctpool_exhaustedaudit action + an honest "pool exhausted" operator notice instead of a self-referentialcred_failover.FailoverEvent.Exhaustedcarries the distinction; the cooldown is still persisted.shouldEmitPoolNoticededups identical(pool,from,to,tag)signals within a 30s window so a retry storm yields one row, not N.Normal position-order selection still skips a manual-rotate-parked member (rotate semantics unchanged); the preference only applies to the all-cooled degrade/failover path.
Tests
vault: degrade prefers manual-rotate-parked-but-healthy over genuinely-failed-soonest; unchanged when no manual-rotate park exists.proxy:to==fromemitspool_exhausted(notcred_failover),Exhausted=true, and the retry storm collapses to exactly one audit row + oneonFailover; a real failover to an operator-parked-but-healthy peer iscred_failoverwithExhausted=false.go build/go vet/go vet -tags=e2e/golangci-lintclean; fullinternal/proxy+internal/vault+cmdsuites (1230 tests) +-racesubset green.Not included / follow-up
The third live-found issue — sluice's pooled token-host phantom expansion corrupting in-container
device_codeOAuth logins (codex--device-auth400s because the grant hits the pool's sharedauth.openai.com/oauth/token) — is a separate subsystem and tracked separately; it does not gate this failover fix.No release is cut by this PR (per request: hold until re-verified live on the deployment).