fix(jsonrpc): isolate malformed frames so one bad message can't cancel all in-flight requests#1605
fix(jsonrpc): isolate malformed frames so one bad message can't cancel all in-flight requests#1605chuwik wants to merge 2 commits into
Conversation
…l all in-flight requests The read loop parsed each Content-Length frame inside read_message and treated a serde_json body error the same as a fatal I/O error: it logged "error reading from CLI", broke the loop, and drained every pending request. One corrupt frame therefore cancelled all concurrent in-flight requests sharing the connection (e.g. list_models + list_global_skills + account.getQuota), leaving the desktop model picker empty on launch (github/app#836). Content-Length framing is honest: the reader assembles a full frame body before parsing and stays byte-aligned to the next frame regardless of any single body's content. So a body-level JSON error is self-contained and must not tear down the shared connection. Separate framing from parsing: read_message becomes read_frame returning the raw body bytes (I/O and protocol/framing errors still propagate as fatal). read_loop parses the body itself; on a parse error it recovers the response id from the frame head and fails only that one awaiter with a -32700 parse error, then continues serving the connection. Frames with no recoverable id are dropped. If framing were ever desynced, the next read_frame hits a framing error and the loop still breaks and reconnects. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the Rust JSON-RPC client read loop so that a single correctly-framed but malformed JSON body (e.g., truncated mid-escape) no longer tears down the shared connection and cancels unrelated in-flight requests—addressing the “empty model picker on launch” failure mode described in github/app#836.
Changes:
- Split framing from parsing (
read_message→read_frame) so JSON parse failures can be handled per-frame without treating them as fatal transport errors. - Add best-effort recovery for malformed response frames: attempt to extract the response
idand fail only that pending request with JSON-RPC-32700(parse error). - Add unit tests for
extract_response_idand an integration regression test ensuring one malformed response does not cancel other concurrent requests.
Show a summary per file
| File | Description |
|---|---|
| rust/src/jsonrpc.rs | Refactors read loop to isolate malformed frames; adds parse-error code, id recovery, and associated unit tests. |
| rust/tests/jsonrpc_test.rs | Adds integration regression test ensuring malformed frames fail only their own request. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 1
|
@chuwik Do we know the root cause in the sense of why there was a "bad frame" in the first place? And what makes this Rust-specific? I'd be reluctant to simply ignore bad data (even if logging a warning). The protocol relies on guaranteed delivery - any missed messages might brick the session, since the session may strictly require a response to a call. If we can explain why the bad frame is unavoidable then maybe it would be a justification for ignoring it, but also maybe not. Our transports are stdio or TCP, both of which provide their own delivery guarantees as long as the connection as a whole isn't dropped. The issue on github/app says:
... so it seems like this may be easy enough to repro. |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sorry for the noise, agent researching decided to send a PR; it's not ready for review, though I appreciate your input! Still trying to figure out the root cause of the bad frame, user cannot repro anymore. Unless I find a smoking gun, I may leave as-is unless it comes back again. |
Problem
The desktop app's model picker is empty on launch (github/app#836). The log shows the JSON-RPC read loop dying on a single bad frame and taking every concurrent request down with it:
Root cause — blast radius, not partial-chunk framing
read_loopparsed each Content-Length frame insideread_messageand treated aserde_jsonbody-parse error identically to a fatal I/O error: it loggederror reading from CLI, broke the loop, and then drained all pending requests. So one corrupt frame cancelled every in-flight request sharing the connection (list_models+list_global_skills+account.getQuota), leaving the picker empty.Content-Length framing is honest:
read_messagereads the header,read_exact(length)assembles the complete body, then parses — so the reader is always byte-aligned to the next frame regardless of any single body's content. A body-level JSON error is therefore self-contained and must not tear down the shared connection. (The popular "a\uXXXXescape split across read chunks causes a partial parse" theory is a red herring — there is no partial-chunk parsing.)Fix
Separate framing from parsing so a bad body fails only the implicated request:
read_message→read_frame, returning the raw body bytes. I/O and Content-Length/protocol/framing errors still propagate asErr(fatal → break → reconnect); clean EOF at a frame boundary still returnsOk(None).read_loopparses the body itself. OnOkit dispatches as before; on a parse error it callsfail_unparseable_frameandcontinues instead of breaking.fail_unparseable_framebest-effort recovers the responseidfrom the frame head (responses serializeidahead of the truncated payload) and delivers a synthetic-32700parse-error response to just that one awaiter, so it wakes immediately instead of hanging. Frames with no recoverable id (notifications, server requests) are dropped. Awarn!records the serde error,frame_len, and recovered id.extract_response_idscans the first ≤256 bytes; returnsNoneif the frame is a notification/request ("method"present), else parses the integer after the first"id"key.error_codes::PARSE_ERROR = -32700.If framing were ever desynced by a dishonest CLI, the next
read_framehits a framing error →Err→ fatal → break → reconnect, so the loop self-heals within ~1 frame. For the honest-framing reality, this is a strict improvement: one bad message is skipped and every other concurrent request succeeds.Tests
extract_response_id(truncated body, whitespace/error frames, notification →None).malformed_frame_fails_only_its_own_request: two concurrent requests share the read loop; the server replies to one with a correctly-framed-but-truncated body (ends mid-\u) and to the other with a valid response. Asserts the healthy request succeeds (not collateral-cancelled) and the malformed one resolves promptly with code-32700(no hang).Resolves github/app#836.
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com