fix(code): bound cloud session reconnect attempts with cumulative cap by k11kirky · Pull Request #2285 · PostHog/code

k11kirky · 2026-05-21T12:59:41Z

Problem

When a cloud run stream repeatedly closes with a clean EOF (no error), the per-attempt reconnect counter was being reset to zero on each clean disconnect. This allowed the watcher to loop indefinitely without ever hitting the MAX_SSE_RECONNECT_ATTEMPTS limit, leaving users stuck watching a run that can never be reached.

Changes

A new cumulativeReconnectAttempts counter was added to the watcher state that increments on every reconnect attempt regardless of whether the disconnect was clean or errored. Unlike reconnectAttempts, it is never reset by a clean EOF — only by a successful SSE event or confirmed in-progress poll response. If the cumulative count exceeds MAX_CUMULATIVE_RECONNECT_ATTEMPTS (30), the watcher is failed with a retryable "Cloud run unreachable" error.

The previous behavior of resetting reconnectAttempts to zero on clean EOF was removed, since the cumulative counter now handles the runaway loop case independently.

How did you test this?

A new automated test was added that mocks the stream to always return a clean empty SSE response and verifies that after exhausting the cumulative reconnect budget, the watcher emits an error update with kind: "error" and the expected "Cloud run unreachable" message.

Publish to changelog?

No

k11kirky · 2026-05-21T12:59:59Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

greptile-apps · 2026-05-21T14:26:46Z

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
apps/code/src/main/services/cloud-task/service.ts:1077-1079
When the poll confirms the run is making progress (`stateChanged=true`), only `cumulativeReconnectAttempts` is reset while `reconnectAttempts` retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset `reconnectAttempts` here too so both budgets are aligned when real progress is observed.

```suggestion
      if (stateChanged) {
        // Polled progress proves the run is alive — don't burn cumulative budget.
        watcher.cumulativeReconnectAttempts = 0;
        watcher.reconnectAttempts = 0;
```

_{Reviews (1): Last reviewed commit: "chore(code): trim verbose comments on cl..." | Re-trigger Greptile}

greptile-apps · 2026-05-21T14:26:50Z

      if (stateChanged) {
+        // Polled progress proves the run is alive — don't burn cumulative budget.
+        watcher.cumulativeReconnectAttempts = 0;


When the poll confirms the run is making progress (stateChanged=true), only cumulativeReconnectAttempts is reset while reconnectAttempts retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset reconnectAttempts here too so both budgets are aligned when real progress is observed.

Suggested change

if (stateChanged) {

// Polled progress proves the run is alive — don't burn cumulative budget.

watcher.cumulativeReconnectAttempts = 0;

if (stateChanged) {

// Polled progress proves the run is alive — don't burn cumulative budget.

watcher.cumulativeReconnectAttempts = 0;

watcher.reconnectAttempts = 0;

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/code/src/main/services/cloud-task/service.ts Line: 1077-1079 Comment: When the poll confirms the run is making progress (`stateChanged=true`), only `cumulativeReconnectAttempts` is reset while `reconnectAttempts` retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset `reconnectAttempts` here too so both budgets are aligned when real progress is observed. ```suggestion if (stateChanged) { // Polled progress proves the run is alive — don't burn cumulative budget. watcher.cumulativeReconnectAttempts = 0; watcher.reconnectAttempts = 0; ``` How can I resolve this? If you propose a fix, please make it concise.

jonathanlab

Greptile comment seems valid here. Wondering if we should make this some kind of exponential backoff instead?

MAX_SSE_RECONNECT_ATTEMPTS=5 was defeated by two loopholes that let unhealthy cloud runs reconnect forever without surfacing an error: - `scheduleReconnect` with `countAttempt: false` actively reset the counter to 0, so the clean-EOF path in `handleStreamCompletion` could loop without ever hitting the cap. - A successful SSE event also resets attempts, so a "dribble" of one event per reconnect cycle would never accumulate either. Two changes: 1. `countAttempt: false` now means "don't increment" rather than "reset to 0". 2. New `cumulativeReconnectAttempts` counter bumps on every schedule regardless of `countAttempt` and only resets on user-initiated retry. When it exceeds MAX_CUMULATIVE_RECONNECT_ ATTEMPTS=30 the watcher fails with a clear unrecoverable message so the UI can surface "this run is broken" instead of looping silently. Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7

The cumulative cap as introduced would also trip on healthy-but-quiet cloud runs whose SSE stream cycles cleanly without events (keepalive- driven close, or the agent simply not emitting between actions). Reset the cumulative counter whenever real progress is observed — either a non-keepalive SSE event in `handleSseEvent`, or a state change detected via the post-stream-completion polling fetch — so only stuck loops with no forward motion get capped. Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7

Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7

Polled stateChanged means the run is alive; reset both reconnectAttempts and cumulativeReconnectAttempts so a prior error streak doesn't trip MAX_SSE_RECONNECT_ATTEMPTS on the next single error after progress. Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7

k11kirky mentioned this pull request May 21, 2026

fix(code): coalesce writeLocalLogs per taskRunId to stop main-thread storm #2255

Open

4 tasks

k11kirky mentioned this pull request May 21, 2026

fix(code): break unrecoverable cloud log reconcile loop on stable deficiency #2284

Open

k11kirky force-pushed the posthog-code/bound-reconnect-budget branch from f24d005 to 8e70577 Compare May 21, 2026 13:44

k11kirky force-pushed the posthog-code/bound-cloud-reconcile-loop branch from 7c4e07c to 21d3033 Compare May 21, 2026 13:44

k11kirky marked this pull request as ready for review May 21, 2026 14:19

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

k11kirky force-pushed the posthog-code/bound-cloud-reconcile-loop branch from 21d3033 to 419607b Compare May 21, 2026 14:37

k11kirky force-pushed the posthog-code/bound-reconnect-budget branch from 8e70577 to 3a6f514 Compare May 21, 2026 14:37

jonathanlab approved these changes May 22, 2026

View reviewed changes

k11kirky added 4 commits May 22, 2026 13:59

chore(code): trim verbose comments on cloud reconnect budget

94901a5

Generated-By: PostHog Code Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7

k11kirky force-pushed the posthog-code/bound-reconnect-budget branch from 3a6f514 to 70d06a5 Compare May 22, 2026 13:02

k11kirky force-pushed the posthog-code/bound-cloud-reconcile-loop branch from 419607b to 5d2948c Compare May 22, 2026 13:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(code): bound cloud session reconnect attempts with cumulative cap#2285

fix(code): bound cloud session reconnect attempts with cumulative cap#2285
k11kirky wants to merge 4 commits into
posthog-code/bound-cloud-reconcile-loopfrom
posthog-code/bound-reconnect-budget

k11kirky commented May 21, 2026 •

edited

Loading

Uh oh!

k11kirky commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

jonathanlab left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

k11kirky commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

How did you test this?

Publish to changelog?

Uh oh!

k11kirky commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

jonathanlab left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

k11kirky commented May 21, 2026 •

edited

Loading