fix(code): bound cloud session reconnect attempts with cumulative cap#2285
fix(code): bound cloud session reconnect attempts with cumulative cap#2285k11kirky wants to merge 4 commits into
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
f24d005 to
8e70577
Compare
7c4e07c to
21d3033
Compare
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
apps/code/src/main/services/cloud-task/service.ts:1077-1079
When the poll confirms the run is making progress (`stateChanged=true`), only `cumulativeReconnectAttempts` is reset while `reconnectAttempts` retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset `reconnectAttempts` here too so both budgets are aligned when real progress is observed.
```suggestion
if (stateChanged) {
// Polled progress proves the run is alive — don't burn cumulative budget.
watcher.cumulativeReconnectAttempts = 0;
watcher.reconnectAttempts = 0;
```
Reviews (1): Last reviewed commit: "chore(code): trim verbose comments on cl..." | Re-trigger Greptile |
| if (stateChanged) { | ||
| // Polled progress proves the run is alive — don't burn cumulative budget. | ||
| watcher.cumulativeReconnectAttempts = 0; |
There was a problem hiding this comment.
When the poll confirms the run is making progress (
stateChanged=true), only cumulativeReconnectAttempts is reset while reconnectAttempts retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset reconnectAttempts here too so both budgets are aligned when real progress is observed.
| if (stateChanged) { | |
| // Polled progress proves the run is alive — don't burn cumulative budget. | |
| watcher.cumulativeReconnectAttempts = 0; | |
| if (stateChanged) { | |
| // Polled progress proves the run is alive — don't burn cumulative budget. | |
| watcher.cumulativeReconnectAttempts = 0; | |
| watcher.reconnectAttempts = 0; |
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/cloud-task/service.ts
Line: 1077-1079
Comment:
When the poll confirms the run is making progress (`stateChanged=true`), only `cumulativeReconnectAttempts` is reset while `reconnectAttempts` retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset `reconnectAttempts` here too so both budgets are aligned when real progress is observed.
```suggestion
if (stateChanged) {
// Polled progress proves the run is alive — don't burn cumulative budget.
watcher.cumulativeReconnectAttempts = 0;
watcher.reconnectAttempts = 0;
```
How can I resolve this? If you propose a fix, please make it concise.21d3033 to
419607b
Compare
8e70577 to
3a6f514
Compare
jonathanlab
left a comment
There was a problem hiding this comment.
Greptile comment seems valid here. Wondering if we should make this some kind of exponential backoff instead?
MAX_SSE_RECONNECT_ATTEMPTS=5 was defeated by two loopholes that let unhealthy cloud runs reconnect forever without surfacing an error: - `scheduleReconnect` with `countAttempt: false` actively reset the counter to 0, so the clean-EOF path in `handleStreamCompletion` could loop without ever hitting the cap. - A successful SSE event also resets attempts, so a "dribble" of one event per reconnect cycle would never accumulate either. Two changes: 1. `countAttempt: false` now means "don't increment" rather than "reset to 0". 2. New `cumulativeReconnectAttempts` counter bumps on every schedule regardless of `countAttempt` and only resets on user-initiated retry. When it exceeds MAX_CUMULATIVE_RECONNECT_ ATTEMPTS=30 the watcher fails with a clear unrecoverable message so the UI can surface "this run is broken" instead of looping silently. Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
The cumulative cap as introduced would also trip on healthy-but-quiet cloud runs whose SSE stream cycles cleanly without events (keepalive- driven close, or the agent simply not emitting between actions). Reset the cumulative counter whenever real progress is observed — either a non-keepalive SSE event in `handleSseEvent`, or a state change detected via the post-stream-completion polling fetch — so only stuck loops with no forward motion get capped. Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
Polled stateChanged means the run is alive; reset both reconnectAttempts and cumulativeReconnectAttempts so a prior error streak doesn't trip MAX_SSE_RECONNECT_ATTEMPTS on the next single error after progress. Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
3a6f514 to
70d06a5
Compare
419607b to
5d2948c
Compare

Problem
When a cloud run stream repeatedly closes with a clean EOF (no error), the per-attempt reconnect counter was being reset to zero on each clean disconnect. This allowed the watcher to loop indefinitely without ever hitting the
MAX_SSE_RECONNECT_ATTEMPTSlimit, leaving users stuck watching a run that can never be reached.Changes
A new
cumulativeReconnectAttemptscounter was added to the watcher state that increments on every reconnect attempt regardless of whether the disconnect was clean or errored. UnlikereconnectAttempts, it is never reset by a clean EOF — only by a successful SSE event or confirmed in-progress poll response. If the cumulative count exceedsMAX_CUMULATIVE_RECONNECT_ATTEMPTS(30), the watcher is failed with a retryable "Cloud run unreachable" error.The previous behavior of resetting
reconnectAttemptsto zero on clean EOF was removed, since the cumulative counter now handles the runaway loop case independently.How did you test this?
A new automated test was added that mocks the stream to always return a clean empty SSE response and verifies that after exhausting the cumulative reconnect budget, the watcher emits an error update with
kind: "error"and the expected"Cloud run unreachable"message.Publish to changelog?
No