Skip to content

fix(cloud-agent): prevent false disconnects on idle cloud stream#2294

Merged
tatoalo merged 1 commit into
mainfrom
feat/cloud-agent/cloud-run-stream-reconnect
May 22, 2026
Merged

fix(cloud-agent): prevent false disconnects on idle cloud stream#2294
tatoalo merged 1 commit into
mainfrom
feat/cloud-agent/cloud-run-stream-reconnect

Conversation

@tatoalo
Copy link
Copy Markdown
Contributor

@tatoalo tatoalo commented May 22, 2026

Problem

Idle runs (only keepalives flowing) were then falsely marked disconnected so keepalives never reset the reconnect counter and a clean transport cut counted as a failed attempt, so after a few cycles the watcher exhausted its retries and surfaced a spurious error. Rebooting would fix but of course not a great fix at that.

Changes

  • split reconnect tracking into transport budget reset by any keepalive or data frame and a backend-error budget
  • pace reconnect backoff on the budget that drives each reconnect rather than the max of both, so a stale backend-error count can't inflate a recovered transport reconnect's delay
  • when a stream ends cleanly but the follow-up run-state fetch keeps failing, charge the transport budget so it surfaces a "state unavailable" error instead of retrying every ~2s forever

@tatoalo tatoalo self-assigned this May 22, 2026
@tatoalo tatoalo requested a review from a team May 22, 2026 09:59
@tatoalo tatoalo force-pushed the feat/cloud-agent/cloud-run-stream-reconnect branch from 315f484 to 7d8db81 Compare May 22, 2026 10:01
@tatoalo tatoalo added the Create Release This will trigger a new release label May 22, 2026
@tatoalo tatoalo enabled auto-merge (squash) May 22, 2026 10:03
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/code/src/main/services/cloud-task/service.test.ts:672-694
**`makeInProgressRun` factory duplicated across six tests**

The `makeInProgressRun` arrow function with an identical `createJsonResponse` body is copy-pasted into at least six of the eight new test cases. This violates the OnceAndOnlyOnce principle the team follows. Hoisting it to a shared helper at the top of the `describe` block would remove ~50 lines of repetition and make future response-shape changes a single-point edit.

### Issue 2 of 2
apps/code/src/main/services/cloud-task/service.test.ts:1338
The budget comment lists six delay values (2 + 4 + 8 + 16 + 30 + 30 = 90 s) but the sixth delay is never scheduled — `scheduleReconnect` hits the `> MAX_SSE_RECONNECT_ATTEMPTS` guard immediately when `reconnectAttempts` reaches 6 and returns without enqueuing a timer. The real total backoff before the watcher is failed is 2 + 4 + 8 + 16 + 30 = 60 s (5 delays for 5 successful reconnects, 6th call fails immediately).

```suggestion
    // Budget is 5 attempts (2s + 4s + 8s + 16s + 30s of backoff; the 6th call to
    // scheduleReconnect fails the watcher immediately without adding another delay).
```

Reviews (1): Last reviewed commit: "fix(cloud-agent): prevent false disconne..." | Re-trigger Greptile

Comment on lines 672 to +694
});
});

it("clears the backend-error budget after a healthy long-lived cut", async () => {
vi.useFakeTimers();

const updates: unknown[] = [];
service.on(CloudTaskEvent.Update, (payload) => updates.push(payload));

const makeInProgressRun = () =>
createJsonResponse({
id: "run-1",
status: "in_progress",
stage: null,
output: null,
error_message: null,
branch: "main",
updated_at: "2026-01-01T00:00:00Z",
});

mockNetFetch
.mockResolvedValueOnce(makeInProgressRun()) // bootstrap: fetchTaskRun
.mockResolvedValueOnce(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 makeInProgressRun factory duplicated across six tests

The makeInProgressRun arrow function with an identical createJsonResponse body is copy-pasted into at least six of the eight new test cases. This violates the OnceAndOnlyOnce principle the team follows. Hoisting it to a shared helper at the top of the describe block would remove ~50 lines of repetition and make future response-shape changes a single-point edit.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/cloud-task/service.test.ts
Line: 672-694

Comment:
**`makeInProgressRun` factory duplicated across six tests**

The `makeInProgressRun` arrow function with an identical `createJsonResponse` body is copy-pasted into at least six of the eight new test cases. This violates the OnceAndOnlyOnce principle the team follows. Hoisting it to a shared helper at the top of the `describe` block would remove ~50 lines of repetition and make future response-shape changes a single-point edit.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

// 500s. The reconnect must charge the budget so it eventually gives up.
firstControllerRef.current?.close();

// Budget is 5 attempts (2s + 4s + 8s + 16s + 30s + 30s of backoff).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The budget comment lists six delay values (2 + 4 + 8 + 16 + 30 + 30 = 90 s) but the sixth delay is never scheduled — scheduleReconnect hits the > MAX_SSE_RECONNECT_ATTEMPTS guard immediately when reconnectAttempts reaches 6 and returns without enqueuing a timer. The real total backoff before the watcher is failed is 2 + 4 + 8 + 16 + 30 = 60 s (5 delays for 5 successful reconnects, 6th call fails immediately).

Suggested change
// Budget is 5 attempts (2s + 4s + 8s + 16s + 30s + 30s of backoff).
// Budget is 5 attempts (2s + 4s + 8s + 16s + 30s of backoff; the 6th call to
// scheduleReconnect fails the watcher immediately without adding another delay).
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/cloud-task/service.test.ts
Line: 1338

Comment:
The budget comment lists six delay values (2 + 4 + 8 + 16 + 30 + 30 = 90 s) but the sixth delay is never scheduled — `scheduleReconnect` hits the `> MAX_SSE_RECONNECT_ATTEMPTS` guard immediately when `reconnectAttempts` reaches 6 and returns without enqueuing a timer. The real total backoff before the watcher is failed is 2 + 4 + 8 + 16 + 30 = 60 s (5 delays for 5 successful reconnects, 6th call fails immediately).

```suggestion
    // Budget is 5 attempts (2s + 4s + 8s + 16s + 30s of backoff; the 6th call to
    // scheduleReconnect fails the watcher immediately without adding another delay).
```

How can I resolve this? If you propose a fix, please make it concise.

@tatoalo tatoalo merged commit 5470b51 into main May 22, 2026
15 checks passed
@tatoalo tatoalo deleted the feat/cloud-agent/cloud-run-stream-reconnect branch May 22, 2026 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Create Release This will trigger a new release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants