fix(proxy): stop pool self-failover spam and operator-park stranding by nnemirovsky · Pull Request #44 · nnemirovsky/sluice

nnemirovsky · 2026-05-17T01:51:50Z

What

Live operation of the credential pool surfaced two real failover defects (not v0.16.0-shippable as-is). Both are fixed here with fail-before/pass-after tests.

Symptom (observed in production)

Telegram showed six identical pool openai_pool failed over openai_oauth_2 -> openai_oauth_2 (429) notices and the agent hard-failed (API call failed after 3 retries: HTTP 429) when the active pooled member was exhausted and the only other member was parked.

Root causes

Operator-park stranding. sluice pool rotate parks the displaced member with reason "manual rotate" (a 300s cooldown). That member is healthy, just operator-deprioritized. When the rotated-to member then 429s, every member is cooling, and ResolveActive's all-cooled degrade picked the soonest-recovering by time — the genuinely-failed member (60s 429 TTL < 300s park) — so it degraded back onto the exhausted account and self-looped.
Self-failover spam. With no distinct target, handlePoolFailover emitted a meaningless from -> from cred_failover plus one audit row + one Telegram notice per agent retry.

Fixes

vault.ManualRotateReason constant (shared with cmd/sluice rotate). ResolveActive now prefers an operator-parked-but-healthy member over a genuinely-failed one when every member is cooling — a pool rotate onto an exhausted member now fails over to the healthy peer instead of self-looping.
handlePoolFailover detects to == from as pool exhaustion: distinct pool_exhausted audit action + an honest "pool exhausted" operator notice instead of a self-referential cred_failover. FailoverEvent.Exhausted carries the distinction; the cooldown is still persisted.
shouldEmitPoolNotice dedups identical (pool,from,to,tag) signals within a 30s window so a retry storm yields one row, not N.

Normal position-order selection still skips a manual-rotate-parked member (rotate semantics unchanged); the preference only applies to the all-cooled degrade/failover path.

Tests

vault: degrade prefers manual-rotate-parked-but-healthy over genuinely-failed-soonest; unchanged when no manual-rotate park exists.
proxy: to==from emits pool_exhausted (not cred_failover), Exhausted=true, and the retry storm collapses to exactly one audit row + one onFailover; a real failover to an operator-parked-but-healthy peer is cred_failover with Exhausted=false.

go build/go vet/go vet -tags=e2e/golangci-lint clean; full internal/proxy + internal/vault + cmd suites (1230 tests) + -race subset green.

Not included / follow-up

The third live-found issue — sluice's pooled token-host phantom expansion corrupting in-container device_code OAuth logins (codex --device-auth 400s because the grant hits the pool's shared auth.openai.com/oauth/token) — is a separate subsystem and tracked separately; it does not gate this failover fix.

No release is cut by this PR (per request: hold until re-verified live on the deployment).

A pooled member that returned 429/401 while every other member was cooling degraded back to itself: handlePoolFailover emitted a meaningless "X -> X" cred_failover plus one Telegram notice + audit row per agent retry (observed live as six identical "failed over X -> X (429)" notices), and the agent hard-failed because there was no real failover target. Root causes: - ResolveActive's all-cooled degrade picked the soonest-recovering member by time. `sluice pool rotate` parks the displaced member with reason "manual rotate" (300s) -- that member is healthy, just operator-deprioritized -- while a genuine 429 cools for 60s. So a rotate onto an exhausted account degraded back to the exhausted account instead of the parked-but-healthy peer, self-looping. - No dedup: an agent retry storm produced one notice/audit per request. Fixes: - vault.ManualRotateReason constant (shared with cmd/sluice rotate). ResolveActive now prefers an operator-parked-but-healthy member over a genuinely-failed one when every member is cooling, so a rotate onto an exhausted member fails over to the healthy peer. - handlePoolFailover detects to==from (no distinct target) as pool EXHAUSTION: emits the distinct pool_exhausted audit action and an honest "pool exhausted" operator notice instead of a self-referential cred_failover. FailoverEvent.Exhausted carries the distinction. - shouldEmitPoolNotice dedups identical (pool,from,to,tag) signals within a 30s window so a retry storm yields one row, not N. Tests: fail-before/pass-after for the degrade preference, the pool-exhausted self-failover suppression + dedup, and the real failover to an operator-parked-but-healthy peer.

nnemirovsky merged commit 70fcff3 into main May 17, 2026
6 checks passed

nnemirovsky deleted the fix/pool-failover-stranding-and-self-failover-spam branch May 17, 2026 02:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(proxy): stop pool self-failover spam and operator-park stranding#44

fix(proxy): stop pool self-failover spam and operator-park stranding#44
nnemirovsky merged 1 commit into
mainfrom
fix/pool-failover-stranding-and-self-failover-spam

nnemirovsky commented May 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nnemirovsky commented May 17, 2026

What

Symptom (observed in production)

Root causes

Fixes

Tests

Not included / follow-up

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant