fix(proxy): harden continuity recovery, safe WS replay, and shutdown/restart bridge lifecycle #415
Open
Kazet111 wants to merge 14 commits into Soju06:main from
It looks like this one is much more than my solution. I think this one could be better, if you have a good look! If this one is accepted, close mine: #416
…and add regression tests
…connect-only behavior
aa32456 to 157ab94
This was referenced Apr 18, 2026
Summary
- `previous_response_id` recovery for both WebSocket and HTTP Responses flows
- `previous_response_id` continuity, without forcing client-side context blowup
- `invalid_request_error` cases
- `response.created` with quota/rate-limit errors
- `session_id` in `request_logs` and preferring turn-state scope over shared session scope
- `previous_response_id` from session scope for normal downstream requests, which was causing context blowup after restart/rebind
- `/v1/responses` requests without bridge can recover the original owner from real streamed response IDs

Problem
We were seeing several production failure modes around follow-up turns and restarts:
- `previous_response_not_found`
- `invalid_request_error` with `param=previous_response_id`
- `context_length_exceeded` after restart or local rebind
- `/v1/responses` sessions eventually failing with `bridge_kind=session_header ... context_length_exceeded`
- `working` / `reconnecting`

We were also seeing WebSocket failures before `response.created` on quota/rate-limit conditions (for example `usage_limit_reached`), which surfaced as stream termination and forced manual resend even when other accounts could continue the run.
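For illustration, the condition described here (a quota or rate-limit error arriving before `response.created`) could be detected roughly as follows. This is a minimal sketch, not the PR's code: the function name and the event dict shape are hypothetical, and the error codes are the ones this description lists under the pre-created failure hardening changes.

```python
# Error codes named in this PR description for pre-created failures.
QUOTA_ERROR_CODES = frozenset({
    "rate_limit_exceeded",
    "usage_limit_reached",
    "insufficient_quota",
    "usage_not_included",
    "quota_exceeded",
})


def is_pre_created_quota_failure(events: list[dict]) -> bool:
    """Return True if a quota/rate-limit error arrives before response.created.

    `events` is a hypothetical list of decoded upstream events. Each event is
    a dict with a "type" key and, for failures, an "error" dict with a "code".
    """
    for event in events:
        event_type = event.get("type")
        if event_type == "response.created":
            # Assumption: once the response has started, it is no longer a
            # pre-created failure, so it is not a candidate for this path.
            return False
        if event_type in ("response.failed", "error"):
            code = (event.get("error") or {}).get("code")
            return code in QUOTA_ERROR_CODES
    return False
```

Under these assumptions, a `True` result marks a request that could be retried on another account instead of terminating the downstream stream.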
A separate issue was that continuity and owner recovery could bleed across scopes:
- `prompt_cache` continuity into unintended hard bridge identities

Changes

WebSocket path (`previous_response_id` recovery)
- `code`, `param`, `message`
- `code=previous_response_not_found`
- `code=invalid_request_error` + `param=previous_response_id` + message semantics matching not found
- `response.failed(stream_incomplete)` and trigger reconnect
- `invalid_request_error` responses untouched for downstream visibility
- `400`

WebSocket path (pre-created failure hardening)
- `response.created`
- `rate_limit_exceeded`
- `usage_limit_reached`
- `insufficient_quota`
- `usage_not_included`
- `quota_exceeded`
- `response.created`
- `response_id`
- `< 1`
- `sticky_key`, `sticky_kind`, `reallocate_sticky` across reconnect/replay
- `response.create` gate correctly on fail-closed connect / terminal-error paths so later requests on the same downstream socket do not get blocked

HTTP Responses / fallback path
- `previous_response_id` in fallback HTTP streaming
- `upstream_unavailable` when the previous-response owner is unavailable, instead of silently failing over to another account
- `previous_response_not_found` to retryable `stream_incomplete`
- `1009 message too big` family of bridge-adjacent failures
- `response.id` and `session_id` for successful non-bridge streamed responses so later `previous_response_id` follow-ups can recover the original owner from `request_logs`

HTTP bridge path
- `_stream_http_bridge_session_events(...)` to unify primary and retry stream handling
- `stream_incomplete`

Session scope and request-log continuity lookup
- `session_id` persistence to `request_logs`
- `(request_id, status, api_key_id, requested_at desc, id desc)`
- `(request_id, status, api_key_id, session_id, requested_at desc, id desc)`
- `previous_response_id`
- `x-codex-turn-state`
- `x-codex-session-id` / `x-codex-conversation-id`
- `previous_response_id` from request-log session scope for normal HTTP / compact / WebSocket downstream requests

Continuity and restart hardening
- `prompt_cache` back to synthetic `turn_state_header`
- `latest_turn_state` / `latest_response_id` when alias continuity is missing after restart
- `prompt_cache` semantics instead of promoting it to hard session identity

Continuity observability
Shutdown and reconnect lifecycle
- `close_all_http_bridge_sessions()` now fails inflight bridge waiters with a terminal error instead of leaving them blocked
- `stream_incomplete`

Recovery guardrails
- `_http_bridge_should_attempt_local_previous_response_recovery(...)` using the same recoverable predicate

Testing
Added or updated unit coverage for:
- `invalid_request_error` not-found semantics
- `invalid_request_error` messages
- `param != previous_response_id`
- `response.failed(usage_limit_reached)`
- `error(usage_limit_reached)`
- `previous_response_id` inference from session scope for normal requests

Added or updated integration coverage for:
- `invalid_request_error(previous_response_id, ...not found...)`
- `/v1/responses` continuity behavior without bridge, including owner pinning and retryable fail-closed responses

Validation
- `uvx ruff format --check .`
- `uvx ruff check .`
- `uv run ty check`
- `openspec validate --specs`
- `uv run pytest -q`

Result:
- `1789 passed, 7 skipped, 4 warnings`

Result
This keeps previous-response recovery deterministic for the real failure mode, preserves correct error surfacing for unrelated invalid requests, makes previous-response owner routing
session-aware, prevents restart-time and session-header continuity regressions that were causing context blowup, aligns fail-closed behavior across WS / HTTP / bridge paths, and
makes shutdown/reconnect behavior fail fast and cleanly instead of hanging clients.
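
To make the first point concrete, the recoverable predicate described under the WebSocket `previous_response_id` recovery changes might look roughly like this. It is a minimal sketch under stated assumptions, not the code in this PR: the function name, the error dict shape, and the case-insensitive "not found" substring match are all hypothetical stand-ins for whatever "message semantics matching not found" means in the actual implementation.

```python
def is_recoverable_previous_response_error(error: dict) -> bool:
    """Decide whether an upstream error means the previous response is gone.

    `error` is a hypothetical dict with optional "code", "param", and
    "message" keys, mirroring the (code, param, message) triple above.
    """
    code = error.get("code")
    if code == "previous_response_not_found":
        return True
    if code == "invalid_request_error" and error.get("param") == "previous_response_id":
        # Assumption: "not found" semantics are detected by a substring match.
        message = (error.get("message") or "").lower()
        return "not found" in message
    # Any other invalid_request_error is left untouched so it stays visible downstream.
    return False
```

With a predicate of this shape, an `invalid_request_error` whose `param` is not `previous_response_id`, or whose message is unrelated to a missing response, falls through and is surfaced to the client unchanged.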
`/v1/responses` can fail with `bridge_kind=session_header ... context_length_exceeded` #423