
fix(proxy): harden continuity recovery, safe WS replay, and shutdown/restart bridge lifecycle #415

Open
Kazet111 wants to merge 14 commits into Soju06:main from Kazet111:fix/ws-http-bridge-previous-response-recovery-guard

Conversation

Contributor

@Kazet111 Kazet111 commented Apr 15, 2026

Summary

  • harden previous_response_id recovery for both WebSocket and HTTP Responses flows
  • recover locally when upstream loses previous_response_id continuity, without forcing client-side context blowup
  • narrow recovery so we do not mask unrelated invalid_request_error cases
  • add safe WebSocket replay for pre-created requests that fail before response.created with quota/rate-limit errors
  • make previous-response owner lookup session-aware by persisting session_id in request_logs and preferring turn-state scope over shared session scope
  • stop inferring previous_response_id from session scope for normal downstream requests, which was causing context blowup after restart/rebind
  • align HTTP fallback-without-bridge with the same owner-pinning and fail-closed continuity contract as bridge / WebSocket paths
  • persist non-bridge HTTP continuity anchors so follow-up /v1/responses requests without bridge can recover the original owner from real streamed response IDs
  • fix bridge continuity regressions introduced around restart, rebind, and prompt-cache/session recovery
  • add continuity observability for owner resolution and fail-closed decisions
  • harden shutdown/restart behavior so inflight bridge waiters and usage refresh work are failed or cancelled cleanly instead of hanging clients

Problem

We were seeing several production failure modes around follow-up turns and restarts:

  • previous_response_not_found
  • invalid_request_error with param=previous_response_id
  • rapid context growth / context_length_exceeded after restart or local rebind
  • long OpenCode /v1/responses sessions eventually failing with bridge_kind=session_header ... context_length_exceeded
  • downstream terminals stuck in working / reconnecting
  • shutdown-time leaks where inflight bridge creation or usage refresh work survived longer than the process lifecycle

We were also seeing WebSocket failures before response.created on quota/rate-limit conditions (for example usage_limit_reached), which surfaced as stream termination and forced
manual resend even when other accounts could continue the run.

A separate issue was that continuity and owner recovery could bleed across scopes:

  • shared session identifiers could influence owner lookup across terminals
  • normal downstream requests could inherit synthetic continuity they did not explicitly ask for
  • restart/rebind flows could split soft prompt_cache continuity into unintended hard bridge identities
  • preferred-owner fail-closed paths were not fully aligned across WS / HTTP / bridge selection and reconnect paths

Changes

WebSocket path (previous_response_id recovery)

  • added structured extraction helpers for upstream error payloads (code, param, message)
  • introduced a strict predicate for recoverable previous-response failures:
    • code=previous_response_not_found
    • code=invalid_request_error + param=previous_response_id + message semantics matching not found
  • rewrite only those recoverable events to retryable response.failed(stream_incomplete) and trigger reconnect
  • leave unrelated invalid_request_error responses untouched for downstream visibility
  • sanitize connect-time previous-response failures to the same retryable contract instead of leaking raw 400
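The strict predicate above can be sketched as follows. This is a hypothetical illustration of the recoverability check described in the bullets, not the proxy's actual helper; the function name and payload field names are assumptions.

```python
# Hypothetical sketch of the strict recoverability predicate: only
# genuine "previous response is missing upstream" shapes qualify.
from typing import Optional

RECOVERABLE_CODE = "previous_response_not_found"


def is_recoverable_previous_response_error(
    code: Optional[str],
    param: Optional[str],
    message: Optional[str],
) -> bool:
    """Return True only when the upstream error means the previous
    response is genuinely gone, per the two shapes listed above."""
    if code == RECOVERABLE_CODE:
        return True
    if (
        code == "invalid_request_error"
        and param == "previous_response_id"
        and message is not None
        and "not found" in message.lower()
    ):
        return True
    # Any other invalid_request_error is left untouched so downstream
    # clients still see the real validation failure.
    return False
```

Keeping the predicate this narrow is what prevents unrelated invalid_request_error cases from being masked as retryable.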

WebSocket path (pre-created failure hardening)

  • added one-shot transparent replay when a terminal upstream error arrives before response.created
  • limited recovery to retryable quota/rate-limit failures:
    • rate_limit_exceeded
    • usage_limit_reached
    • insufficient_quota
    • usage_not_included
    • quota_exceeded
  • gated replay to safe cases only:
    • request is still awaiting response.created
    • no assigned response_id
    • no other pending requests
    • replay count < 1
  • preserve request affinity policy (sticky_key, sticky_kind, reallocate_sticky) across reconnect/replay
  • suppress the original failing upstream event for replayed requests so downstream sees only the final outcome
  • if reconnect cannot acquire an account, downstream still receives an explicit terminal error
  • release the per-socket response.create gate correctly on fail-closed connect / terminal-error paths so later requests on the same downstream socket do not get blocked
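The safe-replay gating above can be expressed as a single boolean check. The sketch below is illustrative: the dataclass fields, code set, and function name are assumptions standing in for whatever internal request state the proxy tracks.

```python
# Hedged sketch of the one-shot transparent replay gate for requests
# that fail before response.created; names are illustrative.
from dataclasses import dataclass
from typing import Optional

RETRYABLE_QUOTA_CODES = frozenset({
    "rate_limit_exceeded",
    "usage_limit_reached",
    "insufficient_quota",
    "usage_not_included",
    "quota_exceeded",
})


@dataclass
class PendingRequest:
    awaiting_created: bool = True       # no response.created seen yet
    response_id: Optional[str] = None   # upstream never assigned an id
    replay_count: int = 0               # one-shot: at most a single replay


def should_replay(req: PendingRequest, error_code: str, other_pending: int) -> bool:
    """Replay only when it is provably safe: a retryable quota/rate-limit
    failure, no visible output yet, and no sibling request on the socket."""
    return (
        error_code in RETRYABLE_QUOTA_CODES
        and req.awaiting_created
        and req.response_id is None
        and other_pending == 0
        and req.replay_count < 1
    )
```

Because every condition must hold, a request that has already streamed anything downstream, or that shares the socket with another pending request, is never silently replayed.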

HTTP Responses / fallback path

  • aligned the non-bridge HTTP stream path with the same continuity contract as bridge / WebSocket flows
  • added owner lookup and hard owner pinning for previous_response_id in fallback HTTP streaming
  • fail closed with retryable upstream_unavailable when the previous-response owner is unavailable, instead of silently failing over to another account
  • rewrite fallback HTTP previous_response_not_found to retryable stream_incomplete
  • added websocket-upstream preflight slimming / oversize rejection in the fallback HTTP path to avoid the 1009 message too big family of bridge-adjacent failures
  • persist upstream response.id and session_id for successful non-bridge streamed responses so later previous_response_id follow-ups can recover the original owner from
    request_logs

HTTP bridge path

  • extracted _stream_http_bridge_session_events(...) to unify primary and retry stream handling
  • on recoverable local previous-response failure:
    • evict the stale bridge session from the local map
    • fail existing pending requests with stream_incomplete
    • close stale upstream
    • rebind a fresh local bridge session
    • retry the request on the rebound session
    • reacquire API key reservation before retry to avoid reservation reuse issues on rebind
  • preserve scoped previous-response ownership across bridge owner-forward / local retry flows
  • prefer live local bridge owner resolution before falling back to request-log owner lookup
  • stop bridge recovery from overriding an already-known preferred owner
  • make continuity-loss bridge failures fail closed with retryable continuity errors instead of surfacing raw owner-mismatch / continuity-loss internals
  • drop stale previous-response alias mappings when bridge recovery has to fall back away from them
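The evict-rebind-retry sequence can be sketched as below. This is a simplified, hypothetical outline of the recovery flow in the bullets; the real session map, reservation handling, and retry plumbing in the proxy are more involved.

```python
# Illustrative sketch of recoverable local previous-response recovery:
# evict the stale bridge session, rebind a fresh one, retry once.
from typing import Any, Callable, Dict


def recover_bridge_session(
    sessions: Dict[str, Any],
    stale_key: str,
    fail_pending: Callable[[Any], None],
    open_session: Callable[[], Any],
    retry: Callable[[Any], Any],
) -> Any:
    """On a recoverable previous-response failure, swap the stale
    session for a fresh one and retry the request exactly once."""
    stale = sessions.pop(stale_key, None)          # evict from the local map
    if stale is not None:
        fail_pending(stale)                        # fail pending work (stream_incomplete)
    fresh = open_session()                         # rebind a fresh local session
    sessions[stale_key] = fresh
    return retry(fresh)                            # retry on the rebound session
```

In the real flow the API key reservation is reacquired before `retry(...)` so the rebound session never reuses a stale reservation.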

Session scope and request-log continuity lookup

  • added session_id persistence to request_logs
  • added response lookup indexes for scoped owner recovery:
    • (request_id, status, api_key_id, requested_at desc, id desc)
    • (request_id, status, api_key_id, session_id, requested_at desc, id desc)
  • added session-scoped owner lookup for previous_response_id
  • owner lookup scope now prefers:
    • x-codex-turn-state
    • then x-codex-session-id / x-codex-conversation-id
  • this prevents owner bleed across terminals that share a broader session identity
  • stopped inferring previous_response_id from request-log session scope for normal HTTP / compact / WebSocket downstream requests
  • this removes unsafe synthetic continuity injection that was amplifying context size
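The scope preference can be sketched as a simple ordered lookup. The header names come from this PR; the function itself is a hypothetical stand-in for the proxy's owner-lookup scoping.

```python
# Illustrative sketch of owner-lookup scope preference: narrow per-terminal
# turn-state first, broader shared session identity only as a fallback.
from typing import Mapping, Optional

SCOPE_HEADERS = (
    "x-codex-turn-state",        # narrowest: per-terminal turn state
    "x-codex-session-id",        # broader: shared session identity
    "x-codex-conversation-id",   # broader still: conversation identity
)


def owner_lookup_scope(headers: Mapping[str, str]) -> Optional[str]:
    """Return the narrowest available continuity scope, or None."""
    for header in SCOPE_HEADERS:
        value = headers.get(header)
        if value:
            return value
    # No scope at all: do NOT synthesize continuity for normal requests.
    return None
```

Preferring the turn-state header is what stops owner resolution from bleeding across terminals that happen to share a session identifier.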

Continuity and restart hardening

  • stopped rekeying recovered sessions from canonical prompt_cache back to synthetic turn_state_header
  • added durable fallback lookup by persisted latest_turn_state / latest_response_id when alias continuity is missing after restart
  • limited synthetic recovery anchors and preferred-account pinning to true hard continuity cases only
  • preserved soft prompt_cache semantics instead of promoting it to hard session identity
  • prevented forward-loop owner mismatch from force-taking over active durable sessions
  • made owner/ring metadata lookup for hard continuity fail closed instead of degrading to unpinned continuity
  • aligned preferred-owner selection failures across WS / HTTP / bridge paths to the same retryable owner-unavailable contract

Continuity observability

  • added Prometheus counters for:
    • continuity owner resolution outcomes
    • continuity fail-closed outcomes
  • added structured continuity logs with hashed identifiers for:
    • owner resolution source / outcome
    • fail-closed reason / surface
  • covered selection-time owner-unavailable outcomes in WS and HTTP paths so continuity incidents are observable before any upstream stream attempt
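The hashed-identifier logging can be sketched with the standard library alone. The real proxy uses Prometheus counters; the dict-backed counter and field names below are illustrative stand-ins.

```python
# Stdlib sketch of continuity observability: count outcomes by
# (source, outcome) and emit structured logs with hashed identifiers.
import hashlib
import json
from collections import Counter

continuity_outcomes: Counter = Counter()


def hash_identifier(raw: str) -> str:
    """Hash session/response identifiers so log lines stay correlatable
    without leaking the raw values."""
    return hashlib.sha256(raw.encode()).hexdigest()[:12]


def record_owner_resolution(source: str, outcome: str, session_id: str) -> str:
    """Bump the outcome counter and return a structured log line."""
    continuity_outcomes[(source, outcome)] += 1
    return json.dumps({
        "event": "continuity_owner_resolution",
        "source": source,
        "outcome": outcome,
        "session": hash_identifier(session_id),
    })
```

Counting at selection time, before any upstream stream attempt, is what makes owner-unavailable incidents visible even when no stream is ever opened.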

Shutdown and reconnect lifecycle

  • close_all_http_bridge_sessions() now fails inflight bridge waiters with a terminal error instead of leaving them blocked
  • capacity waiters now propagate shutdown errors instead of swallowing them and creating new sessions during teardown
  • closing an HTTP bridge session now fails pending downstream work with stream_incomplete
  • usage refresh singleflight now consumes terminal task exceptions and is cancelled during scheduler shutdown
  • scheduler shutdown now cancels inflight usage refresh work even if the scheduler loop task itself was never started
  • added lifecycle coverage for shutdown with an active bridge capacity waiter and inflight usage refresh work
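The waiter-failure pattern above can be sketched in asyncio. The class and error names are hypothetical; the point is that shutdown resolves every inflight future with a terminal error rather than leaving it pending.

```python
# Hedged asyncio sketch: at shutdown, fail inflight capacity waiters
# with a terminal error instead of leaving clients blocked.
import asyncio


class BridgeShutdownError(RuntimeError):
    """Terminal error surfaced to waiters when the pool is torn down."""


class BridgeSessionPool:
    def __init__(self) -> None:
        self._waiters: list[asyncio.Future] = []

    async def wait_for_capacity(self) -> None:
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        await fut  # raises BridgeShutdownError if shutdown wins the race

    def close_all(self) -> None:
        """Fail every inflight waiter so downstream clients see a prompt
        terminal error, never a hang that outlives the process."""
        for fut in self._waiters:
            if not fut.done():
                fut.set_exception(BridgeShutdownError("bridge pool shutting down"))
        self._waiters.clear()
```

Propagating the shutdown error (instead of swallowing it) is also what keeps capacity waiters from creating new sessions during teardown.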

Recovery guardrails

  • tightened _http_bridge_should_attempt_local_previous_response_recovery(...) using the same recoverable predicate
  • ensured only true previous-response-missing semantics are auto-recovered
  • avoided mixing full client-side resend payloads with synthetic continuity anchors, which was causing context blowup
  • ensured preferred-owner fail-closed paths never silently reuse the wrong account or leak owner-specific quota / continuity errors as if they were generic client failures

Testing

Added or updated unit coverage for:

  • WebSocket rewrite on recoverable invalid_request_error not-found semantics
  • no rewrite for non-recoverable invalid_request_error messages
  • no rewrite when param != previous_response_id
  • HTTP bridge recovery predicate requiring not-found semantics
  • API key reservation reacquire on local rebind retry
  • failing stale pending requests on local rebind
  • WebSocket transparent replay on pre-created response.failed(usage_limit_reached)
  • WebSocket transparent replay on pre-created error(usage_limit_reached)
  • sticky-thread affinity preservation across replay reconnect
  • session-scoped owner lookup preferring turn-state over shared session scope
  • no synthetic previous_response_id inference from session scope for normal requests
  • durable lookup fallback when continuity alias is missing after restart
  • fail-closed handling for forwarded owner mismatch
  • fail-closed handling when owner lookup metadata is unavailable
  • fail-closed handling when preferred-owner selection returns the wrong account or no account
  • gate release on WebSocket fail-closed connect / terminal error paths
  • stale previous-response alias cleanup in HTTP bridge recovery
  • inflight bridge waiter failure during shutdown
  • lifecycle shutdown coverage for bridge capacity waiters and usage refresh singleflight cancellation

Added or updated integration coverage for:

  • WS masking + reconnect recovery path for invalid_request_error(previous_response_id, ...not found...)
  • HTTP bridge rebind + successful follow-up after equivalent upstream failure
  • WS transparent replay success path for pre-created quota/rate-limit failures
  • WS transparent replay path when reconnect fails with no suitable account
  • HTTP bridge reconnect / reattach behavior across restart-style continuity scenarios
  • fallback /v1/responses continuity behavior without bridge, including owner pinning and retryable fail-closed responses
  • real two-step non-bridge HTTP continuity flow where the first streamed response seeds the anchor and the second request resolves the owner from persisted request-log state
  • oversized websocket-upstream request handling in fallback HTTP Responses flow

Validation

  • uvx ruff format --check .
  • uvx ruff check .
  • uv run ty check
  • openspec validate --specs
  • uv run pytest -q

Result:

  • 1789 passed, 7 skipped, 4 warnings

Result

This keeps previous-response recovery deterministic for the real failure mode while preserving correct error surfacing for unrelated invalid requests. It makes previous-response owner routing session-aware, prevents the restart-time and session-header continuity regressions that were causing context blowup, aligns fail-closed behavior across the WS / HTTP / bridge paths, and makes shutdown/reconnect behavior fail fast and clean instead of hanging clients.

@Kazet111 Kazet111 changed the title fix(proxy): narrow previous_response recovery to not_found semantics and add regression tests fix(proxy): harden previous_response_id recovery and add safe WS replay for pre-created quota/rate-limit failures Apr 15, 2026
@Kazet111 Kazet111 marked this pull request as draft April 15, 2026 16:09
@Daltonganger
Contributor

It looks like this one covers much more than my solution. I think this one could be the better fix if you take a good look! If this one is accepted, close mine: #416

@Kazet111 Kazet111 changed the title fix(proxy): harden previous_response_id recovery and add safe WS replay for pre-created quota/rate-limit failures fix(proxy): harden continuity recovery, safe WS replay, and shutdown/restart bridge lifecycle Apr 16, 2026
@Kazet111 Kazet111 marked this pull request as ready for review April 16, 2026 08:46
@Kazet111 Kazet111 marked this pull request as ready for review April 16, 2026 18:35
@franciscomsilva

Gave this a try for this issue - working well so far.

Thanks @Kazet111



Development

Successfully merging this pull request may close these issues.

OpenCode via /v1/responses can fail with bridge_kind=session_header ... context_length_exceeded Previous response with id '...' not found.

3 participants