Skip to content

fix(code): bound cloud session reconnect attempts with cumulative cap#2285

Open
k11kirky wants to merge 4 commits into
posthog-code/bound-cloud-reconcile-loopfrom
posthog-code/bound-reconnect-budget
Open

fix(code): bound cloud session reconnect attempts with cumulative cap#2285
k11kirky wants to merge 4 commits into
posthog-code/bound-cloud-reconcile-loopfrom
posthog-code/bound-reconnect-budget

Conversation

@k11kirky
Copy link
Copy Markdown
Contributor

@k11kirky k11kirky commented May 21, 2026

Problem

When a cloud run stream repeatedly closes with a clean EOF (no error), the per-attempt reconnect counter was being reset to zero on each clean disconnect. This allowed the watcher to loop indefinitely without ever hitting the MAX_SSE_RECONNECT_ATTEMPTS limit, leaving users stuck watching a run that can never be reached.

Changes

A new cumulativeReconnectAttempts counter was added to the watcher state that increments on every reconnect attempt regardless of whether the disconnect was clean or errored. Unlike reconnectAttempts, it is never reset by a clean EOF — only by a successful SSE event or confirmed in-progress poll response. If the cumulative count exceeds MAX_CUMULATIVE_RECONNECT_ATTEMPTS (30), the watcher is failed with a retryable "Cloud run unreachable" error.

The previous behavior of resetting reconnectAttempts to zero on clean EOF was removed, since the cumulative counter now handles the runaway loop case independently.

How did you test this?

A new automated test was added that mocks the stream to always return a clean empty SSE response and verifies that after exhausting the cumulative reconnect budget, the watcher emits an error update with kind: "error" and the expected "Cloud run unreachable" message.

Publish to changelog?

No

Copy link
Copy Markdown
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@k11kirky k11kirky force-pushed the posthog-code/bound-reconnect-budget branch from f24d005 to 8e70577 Compare May 21, 2026 13:44
@k11kirky k11kirky force-pushed the posthog-code/bound-cloud-reconcile-loop branch from 7c4e07c to 21d3033 Compare May 21, 2026 13:44
@k11kirky k11kirky marked this pull request as ready for review May 21, 2026 14:19
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
apps/code/src/main/services/cloud-task/service.ts:1077-1079
When the poll confirms the run is making progress (`stateChanged=true`), only `cumulativeReconnectAttempts` is reset while `reconnectAttempts` retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset `reconnectAttempts` here too so both budgets are aligned when real progress is observed.

```suggestion
      if (stateChanged) {
        // Polled progress proves the run is alive — don't burn cumulative budget.
        watcher.cumulativeReconnectAttempts = 0;
        watcher.reconnectAttempts = 0;
```

Reviews (1): Last reviewed commit: "chore(code): trim verbose comments on cl..." | Re-trigger Greptile

Comment on lines 1077 to +1079
if (stateChanged) {
// Polled progress proves the run is alive — don't burn cumulative budget.
watcher.cumulativeReconnectAttempts = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 When the poll confirms the run is making progress (stateChanged=true), only cumulativeReconnectAttempts is reset while reconnectAttempts retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset reconnectAttempts here too so both budgets are aligned when real progress is observed.

Suggested change
if (stateChanged) {
// Polled progress proves the run is alive — don't burn cumulative budget.
watcher.cumulativeReconnectAttempts = 0;
if (stateChanged) {
// Polled progress proves the run is alive — don't burn cumulative budget.
watcher.cumulativeReconnectAttempts = 0;
watcher.reconnectAttempts = 0;
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/cloud-task/service.ts
Line: 1077-1079

Comment:
When the poll confirms the run is making progress (`stateChanged=true`), only `cumulativeReconnectAttempts` is reset while `reconnectAttempts` retains its prior error count. A watcher that has accumulated 5 error reconnects, then sees a poll-confirmed state change, then gets just 1 more error will immediately fail with "Cloud stream disconnected" — even though the run was confirmed alive. Reset `reconnectAttempts` here too so both budgets are aligned when real progress is observed.

```suggestion
      if (stateChanged) {
        // Polled progress proves the run is alive — don't burn cumulative budget.
        watcher.cumulativeReconnectAttempts = 0;
        watcher.reconnectAttempts = 0;
```

How can I resolve this? If you propose a fix, please make it concise.

@k11kirky k11kirky force-pushed the posthog-code/bound-cloud-reconcile-loop branch from 21d3033 to 419607b Compare May 21, 2026 14:37
@k11kirky k11kirky force-pushed the posthog-code/bound-reconnect-budget branch from 8e70577 to 3a6f514 Compare May 21, 2026 14:37
Copy link
Copy Markdown
Contributor

@jonathanlab jonathanlab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile comment seems valid here. Wondering if we should make this some kind of exponential backoff instead?

k11kirky added 4 commits May 22, 2026 13:59
MAX_SSE_RECONNECT_ATTEMPTS=5 was defeated by two loopholes that let
unhealthy cloud runs reconnect forever without surfacing an error:
- `scheduleReconnect` with `countAttempt: false` actively reset the
  counter to 0, so the clean-EOF path in `handleStreamCompletion`
  could loop without ever hitting the cap.
- A successful SSE event also resets attempts, so a "dribble" of one
  event per reconnect cycle would never accumulate either.

Two changes:
1. `countAttempt: false` now means "don't increment" rather than
   "reset to 0".
2. New `cumulativeReconnectAttempts` counter bumps on every
   schedule regardless of `countAttempt` and only resets on
   user-initiated retry. When it exceeds MAX_CUMULATIVE_RECONNECT_
   ATTEMPTS=30 the watcher fails with a clear unrecoverable message
   so the UI can surface "this run is broken" instead of looping
   silently.

Generated-By: PostHog Code
Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
The cumulative cap as introduced would also trip on healthy-but-quiet
cloud runs whose SSE stream cycles cleanly without events (keepalive-
driven close, or the agent simply not emitting between actions). Reset
the cumulative counter whenever real progress is observed — either a
non-keepalive SSE event in `handleSseEvent`, or a state change detected
via the post-stream-completion polling fetch — so only stuck loops with
no forward motion get capped.

Generated-By: PostHog Code
Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
Generated-By: PostHog Code
Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
Polled stateChanged means the run is alive; reset both reconnectAttempts
and cumulativeReconnectAttempts so a prior error streak doesn't trip
MAX_SSE_RECONNECT_ATTEMPTS on the next single error after progress.

Generated-By: PostHog Code
Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
@k11kirky k11kirky force-pushed the posthog-code/bound-reconnect-budget branch from 3a6f514 to 70d06a5 Compare May 22, 2026 13:02
@k11kirky k11kirky force-pushed the posthog-code/bound-cloud-reconcile-loop branch from 419607b to 5d2948c Compare May 22, 2026 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants