Skip to content

fix(cloud-task): drop unsafe append in log-gap inconsistency path#2102

Merged
tatoalo merged 2 commits into
mainfrom
posthog-code/fix-cloud-log-gap-renderer-oom
May 8, 2026
Merged

fix(cloud-task): drop unsafe append in log-gap inconsistency path#2102
tatoalo merged 2 commits into
mainfrom
posthog-code/fix-cloud-log-gap-renderer-oom

Conversation

@sampennington
Copy link
Copy Markdown
Contributor

Summary

Fixes a renderer OOM crash loop on long-running cloud tasks. Observed: 16 V8 OOM renderer crashes between 06:12 and 06:34 on 2026-05-08 against task aab06c8f-… (run accumulated 11k+ log entries).

reconcileCloudLogGapOnce had a fallback that appended update.newEntries on top of session.events whenever local + S3 fetches couldn't catch up to expectedCount. But newEntries is the snapshot's tail slice (positions [expectedCount-N, expectedCount]) — it's not contiguous with the renderer's processedLineCount. So each invocation:

  • pushed unaligned events onto session.events (duplicates and gaps)
  • bumped processedLineCount = latestCount + newEntries.length past entries we never actually had

With cloud-task watchers reconnecting every ~2 min on error: 'terminated', this stacked indefinitely until V8 OOM'd the renderer.

The fix: when the fetched logs lag behind expectedCount, log the existing Cloud task log count inconsistency warning and return. The next snapshot or logs update retries once the source has caught up. The success branch (S3/local catches up to expectedCount) is unchanged.

Test plan

  • pnpm --filter code typecheck
  • pnpm exec vitest run src/renderer/features/sessions/service/service.test.ts — all 80 tests pass
  • Updated processes a pending cloud log gap… test to assert the corrected behavior: queue still drains (verified via second readLocalLogs call) but no append happens when fetches stay stale
  • Manually exercise a long-running cloud task with intermittent stream terminated errors and confirm the renderer no longer OOMs

Notes

Two other related leaks remain (out of scope here, deferred):

  1. emittedLogEntries in apps/code/src/main/services/cloud-task/service.ts grows unbounded on single-subscriber runs (only pruned in emitCurrentSnapshot, which only fires for 2nd+ subscribers).
  2. emitCurrentSnapshot ships the full historical log (newEntries: snapshotEntries) to every newly-attached renderer subscriber — for an 11k-entry run that's a multi-MB IPC payload per re-attach.

Both are worth a follow-up. This PR is the conservative hotfix.

When `reconcileCloudLogGapOnce` fetched local + S3 logs and they came
back with fewer entries than `expectedCount`, the fallback appended
`newEntries` (the snapshot's tail slice at positions
[expectedCount-N, expectedCount]) on top of `session.events` and
ratcheted `processedLineCount = latestCount + newEntries.length`.
The appended slice isn't contiguous with `latestCount`, so events ended
up with duplicates and gaps and the count drifted past entries we never
had. With cloud-task watchers reconnecting every ~2 min on
`error: 'terminated'`, this stacked indefinitely and OOM'd the V8
renderer heap (16 renderer crashes observed in a single morning).

Now the inconsistency path logs the warning and returns; the next
snapshot/log update retries once the source has caught up.

Generated-By: PostHog Code
Task-Id: 3294a546-b9be-404e-ba5c-5d762cc95bd8
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
apps/code/src/renderer/features/sessions/service/service.ts:3568-3573
Now that the warning fires specifically when the fetch lagged behind `expectedCount`, adding `fetchedCount: rawEntries.length` would make it immediately clear how far behind the S3/local fetch was — useful for diagnosing future incidents without needing to correlate separate log lines.

```suggestion
    log.warn("Cloud task log count inconsistency", {
      taskRunId,
      currentCount,
      expectedCount,
      fetchedCount: rawEntries.length,
      entriesReceived: newEntries.length,
    });
```

Reviews (1): Last reviewed commit: "fix(cloud-task): drop unsafe append in l..." | Re-trigger Greptile

Comment on lines 3568 to 3573
log.warn("Cloud task log count inconsistency", {
taskRunId,
currentCount,
expectedCount,
entriesReceived: newEntries.length,
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Now that the warning fires specifically when the fetch lagged behind expectedCount, adding fetchedCount: rawEntries.length would make it immediately clear how far behind the S3/local fetch was — useful for diagnosing future incidents without needing to correlate separate log lines.

Suggested change
log.warn("Cloud task log count inconsistency", {
taskRunId,
currentCount,
expectedCount,
entriesReceived: newEntries.length,
});
log.warn("Cloud task log count inconsistency", {
taskRunId,
currentCount,
expectedCount,
fetchedCount: rawEntries.length,
entriesReceived: newEntries.length,
});
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/renderer/features/sessions/service/service.ts
Line: 3568-3573

Comment:
Now that the warning fires specifically when the fetch lagged behind `expectedCount`, adding `fetchedCount: rawEntries.length` would make it immediately clear how far behind the S3/local fetch was — useful for diagnosing future incidents without needing to correlate separate log lines.

```suggestion
    log.warn("Cloud task log count inconsistency", {
      taskRunId,
      currentCount,
      expectedCount,
      fetchedCount: rawEntries.length,
      entriesReceived: newEntries.length,
    });
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — added in 7e676a9.

Make it immediately clear from a single log line how far behind the
S3/local fetch was relative to expectedCount.

Generated-By: PostHog Code
Task-Id: 3294a546-b9be-404e-ba5c-5d762cc95bd8
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Reviews (2): Last reviewed commit: "log fetchedCount in cloud-log inconsiste..." | Re-trigger Greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Copy link
Copy Markdown
Contributor

@tatoalo tatoalo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go

@tatoalo tatoalo added the Create Release This will trigger a new release label May 8, 2026
@tatoalo tatoalo merged commit 12d3323 into main May 8, 2026
15 checks passed
@tatoalo tatoalo deleted the posthog-code/fix-cloud-log-gap-renderer-oom branch May 8, 2026 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Create Release This will trigger a new release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants