
fix(proxy): harden continuity recovery, safe WS replay, and shutdown/restart bridge lifecycle #415

Open
Kazet111 wants to merge 14 commits into Soju06:main from Kazet111:fix/ws-http-bridge-previous-response-recovery-guard

Conversation

Contributor

@Kazet111 Kazet111 commented Apr 15, 2026

Summary

  • harden previous_response_id recovery for both WebSocket and HTTP Responses flows
  • recover locally when upstream loses previous_response_id continuity, without forcing client-side context blowup
  • narrow recovery so we do not mask unrelated invalid_request_error cases
  • add safe WebSocket replay for pre-created requests that fail before response.created with quota/rate-limit errors
  • make previous-response owner lookup session-aware by persisting session_id in request_logs and preferring turn-state scope over shared session scope
  • stop inferring previous_response_id from session scope for normal downstream requests, which was causing context blowup after restart/rebind
  • align HTTP fallback-without-bridge with the same owner-pinning and fail-closed continuity contract as bridge / WebSocket paths
  • persist non-bridge HTTP continuity anchors so follow-up /v1/responses requests without bridge can recover the original owner from real streamed response IDs
  • fix bridge continuity regressions introduced around restart, rebind, and prompt-cache/session recovery
  • add continuity observability for owner resolution and fail-closed decisions
  • harden shutdown/restart behavior so inflight bridge waiters and usage refresh work are failed or cancelled cleanly instead of hanging clients

Problem

We were seeing several production failure modes around follow-up turns and restarts:

  • previous_response_not_found
  • invalid_request_error with param=previous_response_id
  • rapid context growth / context_length_exceeded after restart or local rebind
  • long OpenCode /v1/responses sessions eventually failing with bridge_kind=session_header ... context_length_exceeded
  • downstream terminals stuck in working / reconnecting
  • shutdown-time leaks where inflight bridge creation or usage refresh work survived longer than the process lifecycle

We were also seeing WebSocket failures before response.created on quota/rate-limit conditions (for example usage_limit_reached), which surfaced as stream termination and forced
manual resend even when other accounts could continue the run.

A separate issue was that continuity and owner recovery could bleed across scopes:

  • shared session identifiers could influence owner lookup across terminals
  • normal downstream requests could inherit synthetic continuity they did not explicitly ask for
  • restart/rebind flows could split soft prompt_cache continuity into unintended hard bridge identities
  • preferred-owner fail-closed paths were not fully aligned across WS / HTTP / bridge selection and reconnect paths

Changes

WebSocket path (previous_response_id recovery)

  • added structured extraction helpers for upstream error payloads (code, param, message)
  • introduced a strict predicate for recoverable previous-response failures:
    • code=previous_response_not_found
    • code=invalid_request_error + param=previous_response_id + message semantics matching not found
  • rewrite only those recoverable events to retryable response.failed(stream_incomplete) and trigger reconnect
  • leave unrelated invalid_request_error responses untouched for downstream visibility
  • sanitize connect-time previous-response failures to the same retryable contract instead of leaking raw 400
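The strict predicate above can be sketched as follows. This is a hypothetical illustration of the recoverability check described in the bullets, not the proxy's actual helper; the function name and payload field names are assumptions.

```python
# Hypothetical sketch of the strict recoverability predicate: only
# genuine "previous response is missing upstream" shapes qualify.
from typing import Optional

RECOVERABLE_CODE = "previous_response_not_found"


def is_recoverable_previous_response_error(
    code: Optional[str],
    param: Optional[str],
    message: Optional[str],
) -> bool:
    """Return True only when the upstream error means the previous
    response is genuinely gone, per the two shapes listed above."""
    if code == RECOVERABLE_CODE:
        return True
    if (
        code == "invalid_request_error"
        and param == "previous_response_id"
        and message is not None
        and "not found" in message.lower()
    ):
        return True
    # Any other invalid_request_error is left untouched so downstream
    # clients still see the real validation failure.
    return False
```

Keeping the predicate this narrow is what prevents unrelated invalid_request_error cases from being masked as retryable.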

WebSocket path (pre-created failure hardening)

  • added one-shot transparent replay when a terminal upstream error arrives before response.created
  • limited recovery to retryable quota/rate-limit failures:
    • rate_limit_exceeded
    • usage_limit_reached
    • insufficient_quota
    • usage_not_included
    • quota_exceeded
  • gated replay to safe cases only:
    • request is still awaiting response.created
    • no assigned response_id
    • no other pending requests
    • replay count < 1
  • preserve request affinity policy (sticky_key, sticky_kind, reallocate_sticky) across reconnect/replay
  • suppress the original failing upstream event for replayed requests so downstream sees only the final outcome
  • if reconnect cannot acquire an account, downstream still receives an explicit terminal error
  • release the per-socket response.create gate correctly on fail-closed connect / terminal-error paths so later requests on the same downstream socket do not get blocked
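The safe-replay gating above can be expressed as a single boolean check. The sketch below is illustrative: the dataclass fields, code set, and function name are assumptions standing in for whatever internal request state the proxy tracks.

```python
# Hedged sketch of the one-shot transparent replay gate for requests
# that fail before response.created; names are illustrative.
from dataclasses import dataclass
from typing import Optional

RETRYABLE_QUOTA_CODES = frozenset({
    "rate_limit_exceeded",
    "usage_limit_reached",
    "insufficient_quota",
    "usage_not_included",
    "quota_exceeded",
})


@dataclass
class PendingRequest:
    awaiting_created: bool = True       # no response.created seen yet
    response_id: Optional[str] = None   # upstream never assigned an id
    replay_count: int = 0               # one-shot: at most a single replay


def should_replay(req: PendingRequest, error_code: str, other_pending: int) -> bool:
    """Replay only when it is provably safe: a retryable quota/rate-limit
    failure, no visible output yet, and no sibling request on the socket."""
    return (
        error_code in RETRYABLE_QUOTA_CODES
        and req.awaiting_created
        and req.response_id is None
        and other_pending == 0
        and req.replay_count < 1
    )
```

Because every condition must hold, a request that has already streamed anything downstream, or that shares the socket with another pending request, is never silently replayed.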

HTTP Responses / fallback path

  • aligned the non-bridge HTTP stream path with the same continuity contract as bridge / WebSocket flows
  • added owner lookup and hard owner pinning for previous_response_id in fallback HTTP streaming
  • fail closed with retryable upstream_unavailable when the previous-response owner is unavailable, instead of silently failing over to another account
  • rewrite fallback HTTP previous_response_not_found to retryable stream_incomplete
  • added websocket-upstream preflight slimming / oversize rejection in the fallback HTTP path to avoid the 1009 message too big family of bridge-adjacent failures
  • persist upstream response.id and session_id for successful non-bridge streamed responses so later previous_response_id follow-ups can recover the original owner from
    request_logs

HTTP bridge path

  • extracted _stream_http_bridge_session_events(...) to unify primary and retry stream handling
  • on recoverable local previous-response failure:
    • evict the stale bridge session from the local map
    • fail existing pending requests with stream_incomplete
    • close stale upstream
    • rebind a fresh local bridge session
    • retry the request on the rebound session
    • reacquire API key reservation before retry to avoid reservation reuse issues on rebind
  • preserve scoped previous-response ownership across bridge owner-forward / local retry flows
  • prefer live local bridge owner resolution before falling back to request-log owner lookup
  • stop bridge recovery from overriding an already-known preferred owner
  • make continuity-loss bridge failures fail closed with retryable continuity errors instead of surfacing raw owner-mismatch / continuity-loss internals
  • drop stale previous-response alias mappings when bridge recovery has to fall back away from them
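The evict-rebind-retry sequence can be sketched as below. This is a simplified, hypothetical outline of the recovery flow in the bullets; the real session map, reservation handling, and retry plumbing in the proxy are more involved.

```python
# Illustrative sketch of recoverable local previous-response recovery:
# evict the stale bridge session, rebind a fresh one, retry once.
from typing import Any, Callable, Dict


def recover_bridge_session(
    sessions: Dict[str, Any],
    stale_key: str,
    fail_pending: Callable[[Any], None],
    open_session: Callable[[], Any],
    retry: Callable[[Any], Any],
) -> Any:
    """On a recoverable previous-response failure, swap the stale
    session for a fresh one and retry the request exactly once."""
    stale = sessions.pop(stale_key, None)          # evict from the local map
    if stale is not None:
        fail_pending(stale)                        # fail pending work (stream_incomplete)
    fresh = open_session()                         # rebind a fresh local session
    sessions[stale_key] = fresh
    return retry(fresh)                            # retry on the rebound session
```

In the real flow the API key reservation is reacquired before `retry(...)` so the rebound session never reuses a stale reservation.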

Session scope and request-log continuity lookup

  • added session_id persistence to request_logs
  • added response lookup indexes for scoped owner recovery:
    • (request_id, status, api_key_id, requested_at desc, id desc)
    • (request_id, status, api_key_id, session_id, requested_at desc, id desc)
  • added session-scoped owner lookup for previous_response_id
  • owner lookup scope now prefers:
    • x-codex-turn-state
    • then x-codex-session-id / x-codex-conversation-id
  • this prevents owner bleed across terminals that share a broader session identity
  • stopped inferring previous_response_id from request-log session scope for normal HTTP / compact / WebSocket downstream requests
  • this removes unsafe synthetic continuity injection that was amplifying context size
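The scope preference can be sketched as a simple ordered lookup. The header names come from this PR; the function itself is a hypothetical stand-in for the proxy's owner-lookup scoping.

```python
# Illustrative sketch of owner-lookup scope preference: narrow per-terminal
# turn-state first, broader shared session identity only as a fallback.
from typing import Mapping, Optional

SCOPE_HEADERS = (
    "x-codex-turn-state",        # narrowest: per-terminal turn state
    "x-codex-session-id",        # broader: shared session identity
    "x-codex-conversation-id",   # broader still: conversation identity
)


def owner_lookup_scope(headers: Mapping[str, str]) -> Optional[str]:
    """Return the narrowest available continuity scope, or None."""
    for header in SCOPE_HEADERS:
        value = headers.get(header)
        if value:
            return value
    # No scope at all: do NOT synthesize continuity for normal requests.
    return None
```

Preferring the turn-state header is what stops owner resolution from bleeding across terminals that happen to share a session identifier.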

Continuity and restart hardening

  • stopped rekeying recovered sessions from canonical prompt_cache back to synthetic turn_state_header
  • added durable fallback lookup by persisted latest_turn_state / latest_response_id when alias continuity is missing after restart
  • limited synthetic recovery anchors and preferred-account pinning to true hard continuity cases only
  • preserved soft prompt_cache semantics instead of promoting it to hard session identity
  • prevented forward-loop owner mismatch from force-taking over active durable sessions
  • made owner/ring metadata lookup for hard continuity fail closed instead of degrading to unpinned continuity
  • aligned preferred-owner selection failures across WS / HTTP / bridge paths to the same retryable owner-unavailable contract

Continuity observability

  • added Prometheus counters for:
    • continuity owner resolution outcomes
    • continuity fail-closed outcomes
  • added structured continuity logs with hashed identifiers for:
    • owner resolution source / outcome
    • fail-closed reason / surface
  • covered selection-time owner-unavailable outcomes in WS and HTTP paths so continuity incidents are observable before any upstream stream attempt
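The hashed-identifier logging can be sketched with the standard library alone. The real proxy uses Prometheus counters; the dict-backed counter and field names below are illustrative stand-ins.

```python
# Stdlib sketch of continuity observability: count outcomes by
# (source, outcome) and emit structured logs with hashed identifiers.
import hashlib
import json
from collections import Counter

continuity_outcomes: Counter = Counter()


def hash_identifier(raw: str) -> str:
    """Hash session/response identifiers so log lines stay correlatable
    without leaking the raw values."""
    return hashlib.sha256(raw.encode()).hexdigest()[:12]


def record_owner_resolution(source: str, outcome: str, session_id: str) -> str:
    """Bump the outcome counter and return a structured log line."""
    continuity_outcomes[(source, outcome)] += 1
    return json.dumps({
        "event": "continuity_owner_resolution",
        "source": source,
        "outcome": outcome,
        "session": hash_identifier(session_id),
    })
```

Counting at selection time, before any upstream stream attempt, is what makes owner-unavailable incidents visible even when no stream is ever opened.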

Shutdown and reconnect lifecycle

  • close_all_http_bridge_sessions() now fails inflight bridge waiters with a terminal error instead of leaving them blocked
  • capacity waiters now propagate shutdown errors instead of swallowing them and creating new sessions during teardown
  • closing an HTTP bridge session now fails pending downstream work with stream_incomplete
  • usage refresh singleflight now consumes terminal task exceptions and is cancelled during scheduler shutdown
  • scheduler shutdown now cancels inflight usage refresh work even if the scheduler loop task itself was never started
  • added lifecycle coverage for shutdown with an active bridge capacity waiter and inflight usage refresh work
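The waiter-failure pattern above can be sketched in asyncio. The class and error names are hypothetical; the point is that shutdown resolves every inflight future with a terminal error rather than leaving it pending.

```python
# Hedged asyncio sketch: at shutdown, fail inflight capacity waiters
# with a terminal error instead of leaving clients blocked.
import asyncio


class BridgeShutdownError(RuntimeError):
    """Terminal error surfaced to waiters when the pool is torn down."""


class BridgeSessionPool:
    def __init__(self) -> None:
        self._waiters: list[asyncio.Future] = []

    async def wait_for_capacity(self) -> None:
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        await fut  # raises BridgeShutdownError if shutdown wins the race

    def close_all(self) -> None:
        """Fail every inflight waiter so downstream clients see a prompt
        terminal error, never a hang that outlives the process."""
        for fut in self._waiters:
            if not fut.done():
                fut.set_exception(BridgeShutdownError("bridge pool shutting down"))
        self._waiters.clear()
```

Propagating the shutdown error (instead of swallowing it) is also what keeps capacity waiters from creating new sessions during teardown.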

Recovery guardrails

  • tightened _http_bridge_should_attempt_local_previous_response_recovery(...) using the same recoverable predicate
  • ensured only true previous-response-missing semantics are auto-recovered
  • avoided mixing full client-side resend payloads with synthetic continuity anchors, which was causing context blowup
  • ensured preferred-owner fail-closed paths never silently reuse the wrong account or leak owner-specific quota / continuity errors as if they were generic client failures

Testing

Added or updated unit coverage for:

  • WebSocket rewrite on recoverable invalid_request_error not-found semantics
  • no rewrite for non-recoverable invalid_request_error messages
  • no rewrite when param != previous_response_id
  • HTTP bridge recovery predicate requiring not-found semantics
  • API key reservation reacquire on local rebind retry
  • failing stale pending requests on local rebind
  • WebSocket transparent replay on pre-created response.failed(usage_limit_reached)
  • WebSocket transparent replay on pre-created error(usage_limit_reached)
  • sticky-thread affinity preservation across replay reconnect
  • session-scoped owner lookup preferring turn-state over shared session scope
  • no synthetic previous_response_id inference from session scope for normal requests
  • durable lookup fallback when continuity alias is missing after restart
  • fail-closed handling for forwarded owner mismatch
  • fail-closed handling when owner lookup metadata is unavailable
  • fail-closed handling when preferred-owner selection returns the wrong account or no account
  • gate release on WebSocket fail-closed connect / terminal error paths
  • stale previous-response alias cleanup in HTTP bridge recovery
  • inflight bridge waiter failure during shutdown
  • lifecycle shutdown coverage for bridge capacity waiters and usage refresh singleflight cancellation

Added or updated integration coverage for:

  • WS masking + reconnect recovery path for invalid_request_error(previous_response_id, ...not found...)
  • HTTP bridge rebind + successful follow-up after equivalent upstream failure
  • WS transparent replay success path for pre-created quota/rate-limit failures
  • WS transparent replay path when reconnect fails with no suitable account
  • HTTP bridge reconnect / reattach behavior across restart-style continuity scenarios
  • fallback /v1/responses continuity behavior without bridge, including owner pinning and retryable fail-closed responses
  • real two-step non-bridge HTTP continuity flow where the first streamed response seeds the anchor and the second request resolves the owner from persisted request-log state
  • oversized websocket-upstream request handling in fallback HTTP Responses flow

Validation

  • uvx ruff format --check .
  • uvx ruff check .
  • uv run ty check
  • openspec validate --specs
  • uv run pytest -q

Result:

  • 1789 passed, 7 skipped, 4 warnings

Result

This keeps previous-response recovery deterministic for the real failure mode while preserving correct error surfacing for unrelated invalid requests. It makes previous-response owner routing session-aware, prevents the restart-time and session-header continuity regressions that were causing context blowup, aligns fail-closed behavior across the WS / HTTP / bridge paths, and makes shutdown/reconnect behavior fail fast and clean instead of hanging clients.

@Kazet111 Kazet111 changed the title fix(proxy): narrow previous_response recovery to not_found semantics and add regression tests fix(proxy): harden previous_response_id recovery and add safe WS replay for pre-created quota/rate-limit failures Apr 15, 2026
@Kazet111 Kazet111 marked this pull request as draft April 15, 2026 16:09
@Daltonganger
Contributor

It looks like this one covers much more than my solution. I think this one could be the better fix if you take a good look! If this one is accepted, close mine: #416

@Kazet111 Kazet111 changed the title fix(proxy): harden previous_response_id recovery and add safe WS replay for pre-created quota/rate-limit failures fix(proxy): harden continuity recovery, safe WS replay, and shutdown/restart bridge lifecycle Apr 16, 2026
@Kazet111 Kazet111 marked this pull request as ready for review April 16, 2026 08:46
@Kazet111 Kazet111 marked this pull request as ready for review April 16, 2026 18:35
@franciscomsilva

Gave this a try for this issue - working well so far.

Thanks @Kazet111



Development

Successfully merging this pull request may close these issues.

OpenCode via /v1/responses can fail with bridge_kind=session_header ... context_length_exceeded Previous response with id '...' not found.

3 participants