fix(webapp): propagate abort signal through realtime proxy fetch #3442
The three high-traffic realtime proxy routes (/realtime/v1/runs,
/realtime/v1/runs/:id, /realtime/v1/batches/:id) all route through
RealtimeClient.streamRun/streamRuns/streamBatch -> #streamRunsWhere ->
#performElectricRequest -> longPollingFetch(url, {signal}). The
#streamRunsWhere caller hardcoded signal=undefined, so the upstream
fetch to Electric had no abort signal. When a downstream client
disconnected mid long-poll, undici kept the upstream socket open and
continued buffering response chunks that would never be read, until
Electric's own poll timeout elapsed (up to ~20s). The buffered bytes
live in native memory below V8's accounting, so the retention shows
up only in RSS — invisible to heap snapshots.
Thread a signal parameter through streamRun/streamRuns/streamBatch
(and the shared #streamRunsWhere) and pass getRequestAbortSignal()
from each of the three route handlers. Also cancel the upstream body
explicitly in longPollingFetch's error path and treat AbortError as
a clean client-close (499) rather than a 500, matching the semantics
of 'downstream went away'.
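The error path described above can be sketched roughly as follows. This is a minimal illustration, not the actual `longPollingFetch` implementation: the names `statusForUpstreamError`, `proxyLongPoll`, and `FetchLike` are hypothetical, and only the overall shape (forward the signal, cancel the upstream body on failure, report 499 for `AbortError`) is taken from the description.

```typescript
// Hypothetical names throughout; a sketch of the described error path only.
type FetchLike = (url: string, init?: { signal?: AbortSignal }) => Promise<Response>;

// Map an upstream failure to the status the proxy reports downstream.
function statusForUpstreamError(error: unknown): number {
  // An AbortError means the downstream client went away: report 499, not 500.
  return error instanceof Error && error.name === "AbortError" ? 499 : 500;
}

async function proxyLongPoll(
  url: string,
  signal: AbortSignal | undefined,
  fetchImpl: FetchLike = fetch
): Promise<Response> {
  // Declared outside the try so the catch block can still reach it.
  let upstream: Response | undefined;
  try {
    // Forwarding the signal lets the upstream socket be torn down on abort.
    upstream = await fetchImpl(url, { signal });
    return new Response(upstream.body, {
      status: upstream.status,
      headers: upstream.headers,
    });
  } catch (error) {
    try {
      // Release any upstream body so its buffers are not retained.
      await upstream?.body?.cancel();
    } catch {
      // The body may already be locked or consumed; nothing left to release.
    }
    throw new Response(null, { status: statusForUpstreamError(error) });
  }
}
```

A route handler would then call something like `proxyLongPoll(url, getRequestAbortSignal())`, so a downstream disconnect aborts the upstream request instead of leaving it open until Electric's poll timeout.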
Verified in an isolated standalone reproducer (fetch-a-slow-upstream
pattern, 5 rounds of 200 parallel fetches, burst-and-discard):
  A: no signal, body never consumed     Δrss=+59.4 MB
  B: signal propagated, abort on close  Δrss=+15.4 MB (plateaus)
  C: no signal, res.body.cancel()       Δrss=-25.4 MB
Sustained 10-round test with B: RSS oscillates in a 49-65 MB band
with no upward trend -> the signal propagation fully releases the
undici buffers; the +15 MB residual in the single-round test was
one-time allocator overhead, not accumulation.
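For readers who want to try a similar measurement, a burst-and-discard round could look something like the sketch below. This is not the reproducer used for the numbers above (that script is not shown here); `formatDeltaMb`, `burstAndAbort`, `measure`, and the `holdMs` parameter are all hypothetical, and the target URL is assumed to be a deliberately slow upstream that you supply.

```typescript
// Hypothetical reproducer sketch; not the script that produced the figures above.

// Format an RSS difference in the same style as the results table.
function formatDeltaMb(beforeBytes: number, afterBytes: number): string {
  const mb = (afterBytes - beforeBytes) / (1024 * 1024);
  return `Δrss=${mb >= 0 ? "+" : ""}${mb.toFixed(1)} MB`;
}

// Fire `parallel` fetches, never read the bodies, then abort them all,
// simulating downstream clients that disconnect mid long-poll (variant B).
async function burstAndAbort(url: string, parallel: number, holdMs: number): Promise<void> {
  const controller = new AbortController();
  // Abort after holdMs in case the upstream is slow to send headers.
  const timer = setTimeout(() => controller.abort(), holdMs);
  await Promise.allSettled(
    Array.from({ length: parallel }, () => fetch(url, { signal: controller.signal }))
  );
  clearTimeout(timer);
  // No-op if the timer already fired; otherwise cancels the unread bodies.
  controller.abort();
}

// Run several rounds and report how much RSS moved across them.
async function measure(url: string, rounds = 5, parallel = 200): Promise<string> {
  const before = process.memoryUsage().rss;
  for (let i = 0; i < rounds; i++) {
    await burstAndAbort(url, parallel, 100);
  }
  return formatDeltaMb(before, process.memoryUsage().rss);
}
```

RSS is read from `process.memoryUsage().rss` rather than from heap statistics because, as noted above, the retained buffers sit outside V8's heap accounting.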
Walkthrough
This pull request addresses an RSS memory leak in the realtime proxy by implementing abort signal propagation through the fetch chain. Three realtime route handlers (…)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
apps/webapp/app/services/realtimeClient.server.ts (1)
354-360: Nit: parameter order inconsistent with public API.
Public methods place `signal` as the last parameter (after `clientVersion`), but `#performElectricRequest` places `signal` before `clientVersion`. Not functionally incorrect, but aligning the ordering avoids future confusion at call sites.
♻️ Optional refactor
```diff
-  async #performElectricRequest(
-    url: URL,
-    environment: RealtimeEnvironment,
-    apiVersion: API_VERSIONS,
-    signal?: AbortSignal,
-    clientVersion?: string
-  ) {
+  async #performElectricRequest(
+    url: URL,
+    environment: RealtimeEnvironment,
+    apiVersion: API_VERSIONS,
+    clientVersion?: string,
+    signal?: AbortSignal
+  ) {
```
(Update the two call sites in `#streamRunsWhere` accordingly.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/webapp/app/services/realtimeClient.server.ts` around lines 354 - 360, The private method `#performElectricRequest` has its parameters out of order (signal before clientVersion) compared to the public API; reorder the parameter list so clientVersion comes before signal (i.e., ...apiVersion, clientVersion?: string, signal?: AbortSignal) and update all call sites in `#streamRunsWhere` to pass arguments in the new order (swap the two last args where they're currently passed as signal, clientVersion). Ensure the function signature and every invocation use the same parameter order to keep the API consistent.
apps/webapp/app/utils/longPollingFetch.ts (1)
50-71: Consider checking the signal's aborted state to handle edge cases in undici error handling.
Node.js fetch (undici) doesn't always throw a `DOMException` named `"AbortError"` when a signal is aborted. In edge cases, such as when the request body is already consumed or certain socket closures occur, undici can throw a `TypeError` instead while the signal is still aborted. Adding a check for `options?.signal?.aborted` as a fallback alongside the `error.name` check ensures the 499 response is returned consistently, even if undici's error shape changes in future versions.
♻️ Optional refactor
```diff
-    // AbortError is the expected path when downstream disconnects with a
-    // propagated signal — treat as a clean client-close, not a server error.
-    if (error instanceof Error && error.name === "AbortError") {
-      throw new Response(null, { status: 499 });
-    }
+    // AbortError is the expected path when downstream disconnects with a
+    // propagated signal — treat as a clean client-close, not a server error.
+    if (
+      options?.signal?.aborted ||
+      (error instanceof Error && error.name === "AbortError")
+    ) {
+      throw new Response(null, { status: 499 });
+    }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/webapp/app/utils/longPollingFetch.ts` around lines 50 - 71, The current catch block treats only Error.name === "AbortError" as a client-close; update the logic to also detect when the request signal was aborted (e.g., options?.signal?.aborted or whichever signal variable is passed into longPollingFetch) and treat that as the same 499 path. Specifically, inside the catch after canceling upstream, check (options?.signal?.aborted || (error instanceof Error && error.name === "AbortError")) and throw new Response(null, { status: 499 }) in that case; keep the existing TypeError and generic Error branches (and continue to log via logger) for other cases, referencing upstream, error, and logger to locate the block to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.server-changes/fix-realtime-fetch-signal-leak.md:
- Line 6: Summary: hyphenate the compound modifier "mid long-poll" to
"mid-long-poll". Find the sentence that reads "when a client disconnected mid
long-poll" in the changelog entry and replace "mid long-poll" with
"mid-long-poll" (or alternatively "mid long-polling request") so the compound
modifier is properly hyphenated.
---
Nitpick comments:
In `@apps/webapp/app/services/realtimeClient.server.ts`:
- Around line 354-360: The private method `#performElectricRequest` has its
parameters out of order (signal before clientVersion) compared to the public
API; reorder the parameter list so clientVersion comes before signal (i.e.,
...apiVersion, clientVersion?: string, signal?: AbortSignal) and update all call
sites in `#streamRunsWhere` to pass arguments in the new order (swap the two last
args where they're currently passed as signal, clientVersion). Ensure the
function signature and every invocation use the same parameter order to keep the
API consistent.
In `@apps/webapp/app/utils/longPollingFetch.ts`:
- Around line 50-71: The current catch block treats only Error.name ===
"AbortError" as a client-close; update the logic to also detect when the request
signal was aborted (e.g., options?.signal?.aborted or whichever signal variable
is passed into longPollingFetch) and treat that as the same 499 path.
Specifically, inside the catch after canceling upstream, check
(options?.signal?.aborted || (error instanceof Error && error.name ===
"AbortError")) and throw new Response(null, { status: 499 }) in that case; keep
the existing TypeError and generic Error branches (and continue to log via
logger) for other cases, referencing upstream, error, and logger to locate the
block to change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 74c97be1-f5e0-4afc-bff5-8faa35615848
📒 Files selected for processing (6)
- .server-changes/fix-realtime-fetch-signal-leak.md
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/services/realtimeClient.server.ts
- apps/webapp/app/utils/longPollingFetch.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: sdk-compat / Bun Runtime
- GitHub Check: sdk-compat / Cloudflare Workers
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
- GitHub Check: sdk-compat / Deno Runtime
- GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Add crumbs as you write code using `// @crumbs` comments or `// #region @crumbs` blocks. These are temporary debug instrumentation and must be stripped using `agentcrumbs strip` before merge.
Files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier before committing
Files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
**/*.ts{,x}
📄 CodeRabbit inference engine (CLAUDE.md)
Always import from `@trigger.dev/sdk` when writing Trigger.dev tasks. Never use `@trigger.dev/sdk/v3` or deprecated `client.defineJob`.
Files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: Access environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env`
Use subpath exports from `@trigger.dev/core` package instead of importing from the root `@trigger.dev/core` path
Use named constants for sentinel/placeholder values (e.g. `const UNSET_VALUE = '__unset__'`) instead of raw string literals scattered across comparisons
Files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
apps/webapp/**/*.server.ts
📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
apps/webapp/**/*.server.ts: Never use `request.signal` for detecting client disconnects. Use `getRequestAbortSignal()` from `app/services/httpAsyncStorage.server.ts` instead, which is wired directly to Express `res.on('close')` and fires reliably
Access environment variables via `env` export from `app/env.server.ts`. Never use `process.env` directly
Always use `findFirst` instead of `findUnique` in Prisma queries. `findUnique` has an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance). `findFirst` is never batched and avoids this entire class of issues
Files:
apps/webapp/app/services/realtimeClient.server.ts
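As a rough illustration of the `getRequestAbortSignal()` guideline above, a per-request abort signal backed by `AsyncLocalStorage` and driven by the response's `close` event might be wired like this. It is a sketch under assumptions: `getRequestAbortSignalSketch` and `runWithRequestSignal` are hypothetical names, and the real `httpAsyncStorage.server.ts` may be structured differently.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { EventEmitter } from "node:events";

// Hypothetical per-request store; the real module may hold more state.
type RequestStore = { abortController: AbortController };

const storage = new AsyncLocalStorage<RequestStore>();

// Hypothetical stand-in for getRequestAbortSignal(): returns the signal of
// the request currently being handled, or undefined outside a request.
function getRequestAbortSignalSketch(): AbortSignal | undefined {
  return storage.getStore()?.abortController.signal;
}

// `res` only needs to emit "close", which an Express response does.
function runWithRequestSignal<T>(res: EventEmitter, handler: () => T): T {
  const abortController = new AbortController();
  // "close" fires when the connection goes away, so the signal aborts then.
  res.once("close", () => abortController.abort());
  // Everything called inside handler can look up this request's signal.
  return storage.run({ abortController }, handler);
}
```

In an Express app this would typically be called from a middleware, e.g. `app.use((req, res, next) => runWithRequestSignal(res, next))`, so that route code deeper in the call stack can obtain the signal without it being passed through every function.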
🧠 Learnings (18)
📓 Common learnings
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3430
File: apps/webapp/app/presenters/v3/RunStreamPresenter.server.ts:0-0
Timestamp: 2026-04-23T13:26:27.529Z
Learning: In `apps/webapp/app/utils/sse.ts` (and callers such as `apps/webapp/app/presenters/v3/RunStreamPresenter.server.ts`), the string-reason type passed to `.abort()` is NOT load-bearing for heap retention in SSE streams on Node 20. Isolation testing (2000 requests × 200 KB payload) confirmed `.abort()` vs `.abort("string")` produces identical heap deltas once `AbortSignal.any` is removed. The actual root cause of the SSE memory leak was `AbortSignal.any`, which registers signals with a FinalizationRegistry that does not reliably release them on abort (see nodejs/node#54614 and `#55351`). Named sentinel constants (`ABORT_REASON_*`, exported from `sse.ts`) are a readability/style win per the CLAUDE.md named-constant guideline, not a correctness fix. Do not flag `.abort()` vs `.abort("reason string")` as a memory-retention issue in future reviews of this codebase.
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3399
File: apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts:282-291
Timestamp: 2026-04-16T14:07:46.808Z
Learning: In `apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts` (`streamResponse`), the pattern `signal.addEventListener("abort", cleanup, { once: true })` does NOT need an explicit `removeEventListener` call in the non-abort cleanup paths (inactivity, cancel). The `AbortController` is per-request, scoped to `httpAsyncStorage` (created in `apps/webapp/server.ts` per-request middleware), so it gets GC'd when the request ends — taking the listener and closure with it. The `isCleanedUp` guard prevents double-execution, and `redis.disconnect()` is called before the request ends. Do not flag this as a listener/closure leak.
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-04-16T14:19:16.330Z
Learning: Applies to apps/webapp/**/*.server.ts : Never use `request.signal` for detecting client disconnects. Use `getRequestAbortSignal()` from `app/services/httpAsyncStorage.server.ts` instead, which is wired directly to Express `res.on('close')` and fires reliably
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3399
File: apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts:26-42
Timestamp: 2026-04-16T13:24:09.546Z
Learning: In `apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts`, `RedisRealtimeStreams` is only ever instantiated once as a process-wide singleton via `singleton("realtimeStreams", initializeRedisRealtimeStreams)` in `apps/webapp/app/services/realtime/v1StreamsGlobal.server.ts` (line 30). Therefore, the instance-level `_sharedRedis` field and `sharedRedis` getter are effectively process-scoped. Do not flag them as a per-request connection leak. The v2 streaming path uses a completely separate class (`S2RealtimeStreams`).
📚 Learning: 2026-04-23T13:26:27.529Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3430
File: apps/webapp/app/presenters/v3/RunStreamPresenter.server.ts:0-0
Timestamp: 2026-04-23T13:26:27.529Z
Learning: In `apps/webapp/app/utils/sse.ts` (and callers such as `apps/webapp/app/presenters/v3/RunStreamPresenter.server.ts`), the string-reason type passed to `.abort()` is NOT load-bearing for heap retention in SSE streams on Node 20. Isolation testing (2000 requests × 200 KB payload) confirmed `.abort()` vs `.abort("string")` produces identical heap deltas once `AbortSignal.any` is removed. The actual root cause of the SSE memory leak was `AbortSignal.any`, which registers signals with a FinalizationRegistry that does not reliably release them on abort (see nodejs/node#54614 and `#55351`). Named sentinel constants (`ABORT_REASON_*`, exported from `sse.ts`) are a readability/style win per the CLAUDE.md named-constant guideline, not a correctness fix. Do not flag `.abort()` vs `.abort("reason string")` as a memory-retention issue in future reviews of this codebase.
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- .server-changes/fix-realtime-fetch-signal-leak.md
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-04-16T14:19:16.330Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-04-16T14:19:16.330Z
Learning: Applies to apps/webapp/**/*.server.ts : Never use `request.signal` for detecting client disconnects. Use `getRequestAbortSignal()` from `app/services/httpAsyncStorage.server.ts` instead, which is wired directly to Express `res.on('close')` and fires reliably
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- .server-changes/fix-realtime-fetch-signal-leak.md
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-04-16T14:07:46.808Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3399
File: apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts:282-291
Timestamp: 2026-04-16T14:07:46.808Z
Learning: In `apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts` (`streamResponse`), the pattern `signal.addEventListener("abort", cleanup, { once: true })` does NOT need an explicit `removeEventListener` call in the non-abort cleanup paths (inactivity, cancel). The `AbortController` is per-request, scoped to `httpAsyncStorage` (created in `apps/webapp/server.ts` per-request middleware), so it gets GC'd when the request ends — taking the listener and closure with it. The `isCleanedUp` guard prevents double-execution, and `redis.disconnect()` is called before the request ends. Do not flag this as a listener/closure leak.
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- .server-changes/fix-realtime-fetch-signal-leak.md
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2025-10-08T11:48:12.327Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2593
File: packages/core/src/v3/workers/warmStartClient.ts:168-170
Timestamp: 2025-10-08T11:48:12.327Z
Learning: The trigger.dev runners execute only in Node 21 and 22 environments, so modern Node.js APIs like AbortSignal.any (introduced in v20.3.0) are supported.
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-03-25T15:29:25.889Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2026-03-25T15:29:25.889Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `metadata.stream()` to stream data in realtime from inside tasks
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-04-16T14:19:16.330Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-04-16T14:19:16.330Z
Learning: Applies to apps/webapp/app/v3/services/{cancelTaskRun,batchTriggerV3}.server.ts : When editing services that branch on `RunEngineVersion` to support both V1 and V2 (e.g., `cancelTaskRun.server.ts`, `batchTriggerV3.server.ts`), only modify V2 code paths
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-04-16T13:24:09.546Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3399
File: apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts:26-42
Timestamp: 2026-04-16T13:24:09.546Z
Learning: In `apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts`, `RedisRealtimeStreams` is only ever instantiated once as a process-wide singleton via `singleton("realtimeStreams", initializeRedisRealtimeStreams)` in `apps/webapp/app/services/realtime/v1StreamsGlobal.server.ts` (line 30). Therefore, the instance-level `_sharedRedis` field and `sharedRedis` getter are effectively process-scoped. Do not flag them as a per-request connection leak. The v2 streaming path uses a completely separate class (`S2RealtimeStreams`).
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- .server-changes/fix-realtime-fetch-signal-leak.md
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2025-11-27T16:26:37.432Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to {packages/core,apps/webapp}/**/*.{ts,tsx} : Use zod for validation in packages/core and apps/webapp
Applied to files:
apps/webapp/app/routes/realtime.v1.runs.ts
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.ts
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
- apps/webapp/app/utils/longPollingFetch.ts
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-03-02T12:43:17.177Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: internal-packages/database/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:17.177Z
Learning: Applies to internal-packages/database/**/{app,src,webapp}/**/*.{ts,tsx,js,jsx} : Use `$replica` from `~/db.server` for read-heavy queries in the webapp instead of the primary database connection
Applied to files:
- apps/webapp/app/routes/realtime.v1.runs.$runId.ts
- apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
📚 Learning: 2026-04-20T15:06:19.815Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/routes/realtime.v1.sessions.$session.$io.ts:37-51
Timestamp: 2026-04-20T15:06:19.815Z
Learning: In `apps/webapp/app/routes/realtime.v1.sessions.$session.$io.ts` (and all session realtime read paths), `$replica` is intentionally used for the `resolveSessionByIdOrExternalId` call — including the `closedAt` guard in the PUT/initialize path. The project convention is to use `$replica` consistently across all session realtime routes. The race window (replica lag allowing a ghost-initialize after close) is accepted as not realistic in practice (clients follow the close API response; they do not race it). If replica lag ever causes issues, the mitigation is to revisit all realtime routes together, not to swap individual routes to `prisma`. Do not flag `$replica` usage in session realtime routes as a stale-read issue.
Applied to files:
apps/webapp/app/routes/realtime.v1.batches.$batchId.ts
📚 Learning: 2026-04-20T15:06:11.054Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/routes/realtime.v1.streams.$runId.$target.$streamId.append.ts:16-26
Timestamp: 2026-04-20T15:06:11.054Z
Learning: In `apps/webapp/app/routes/realtime.v1.streams.$runId.$target.$streamId.append.ts` and `apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts`, the `MAX_APPEND_BODY_BYTES` cap of 512 KiB (1024 * 512) is intentional even though `appendPart` wraps the body in JSON (which could expand quote-heavy payloads beyond S2's 1 MiB per-record limit). The maintainer considers worst-case quote-heavy payloads pathological and not realistic. If S2 rejections occur in practice, an encoded-size guard will be added inside `appendPart` rather than lowering the raw body cap on every caller. Do not flag this as an issue in future reviews.
Applied to files:
- .server-changes/fix-realtime-fetch-signal-leak.md
- apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-04-07T14:12:59.018Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3331
File: apps/webapp/app/runEngine/concerns/batchPayloads.server.ts:112-136
Timestamp: 2026-04-07T14:12:59.018Z
Learning: In `apps/webapp/app/runEngine/concerns/batchPayloads.server.ts`, the `pRetry` call wrapping `uploadPacketToObjectStore` intentionally retries **all** error types (no `shouldRetry` filter / `AbortError` guards). The maintainer explicitly prefers over-retrying to under-retrying because multiple heterogeneous object store backends are supported and it is impractical to enumerate all permanent error signatures. Do not flag this as an issue in future reviews.
Applied to files:
apps/webapp/app/utils/longPollingFetch.ts
📚 Learning: 2026-04-16T14:09:34.540Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3388
File: apps/webapp/app/services/platform.v3.server.ts:542-569
Timestamp: 2026-04-16T14:09:34.540Z
Learning: In `apps/webapp/app/services/platform.v3.server.ts`, the `getEntitlement` SWR loader intentionally returns `undefined` on errors (instead of a failure sentinel) because `unkey/cache` already deduplicates concurrent in-process loader calls via `deduplicateLoadFromOrigin` (a shared promise map keyed by namespace::key). During a billing outage, concurrent requests on the same process share one pending HTTP call rather than fanning out. The fail-open `{ hasAccess: true }` fallback is applied *outside* the SWR call so error results are never committed to cache. The maintainer will revisit if sustained multi-instance outage patterns emerge in practice. Do not re-raise the failure-sentinel suggestion for this function in future reviews.
Applied to files:
apps/webapp/app/utils/longPollingFetch.ts
📚 Learning: 2025-07-21T12:52:44.342Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 2284
File: apps/webapp/app/services/realtimeClient.server.ts:111-127
Timestamp: 2025-07-21T12:52:44.342Z
Learning: Electric (the database service used in the realtimeClient) has built-in SQL injection protection and safely handles whereClause parameters passed via URL parameters, so direct string interpolation of runId values into SQL where clauses is safe when using Electric.
Applied to files:
apps/webapp/app/services/realtimeClient.server.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.
Applied to files:
apps/webapp/app/services/realtimeClient.server.ts
🪛 LanguageTool
.server-changes/fix-realtime-fetch-signal-leak.md
[grammar] ~6-~6: Use a hyphen to join words.
Context: ...ignal, so when a client disconnected mid long-poll, undici kept the upstream sock...
(QB_NEW_EN_HYPHEN)
🔇 Additional comments (5)
apps/webapp/app/utils/longPollingFetch.ts (1)
14-17: LGTM: upstream retention for explicit cancellation.
Hoisting `upstream` out of the `try` so the `catch` can call `upstream?.body?.cancel()` is the correct shape for releasing undici socket/buffer state when the downstream stream is never consumed. Optional chaining + the inner try/catch also correctly handles the "body already transferred/locked" and "fetch rejected before assignment" cases.
apps/webapp/app/routes/realtime.v1.runs.ts (1)
29-37: LGTM.
Signal wiring matches the updated `streamRuns` signature (position 7), and sourcing it from `getRequestAbortSignal()` is the correct pattern per the webapp guideline: `request.signal` would not fire reliably under `@remix-run/express`. As per coding guidelines: "Never use `request.signal` for detecting client disconnects. Use `getRequestAbortSignal()` from `app/services/httpAsyncStorage.server.ts` instead".
apps/webapp/app/routes/realtime.v1.batches.$batchId.ts (1)
31-39: LGTM.
Consistent abort-signal propagation: matches `streamBatch`'s new 7th parameter and uses `getRequestAbortSignal()` per the webapp guideline.
apps/webapp/app/routes/realtime.v1.runs.$runId.ts (1)
44-56: LGTM, nice inline explanation.
The inline comment explaining the RSS/undici buffering mechanism is helpful; signal is correctly placed as the 7th argument to `streamRun` and sourced from `getRequestAbortSignal()`.
apps/webapp/app/services/realtimeClient.server.ts (1)
112-190: LGTM: signal is threaded consistently through the public API.
Adding `signal?: AbortSignal` as the final optional parameter on `streamRun`/`streamBatch`/`streamRuns` is backward compatible, and each call into `#streamRunsWhere` forwards it correctly. Aborted long-polls will throw a `Response(499)` out of `longPollingFetch`, and `#performElectricRequest`'s try/catch still awaits `#decrementConcurrency` before rethrowing, so the concurrency counter is released on client disconnect.
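The error-path behavior discussed in these comments (keep a handle on the upstream response so it can be cancelled, and map an abort to a 499) can be sketched roughly as follows. This is a minimal illustration, not the contents of `longPollingFetch.ts`: `proxyLongPoll`, `statusForError`, `isAbortError`, and the injectable `fetchImpl` parameter are hypothetical names introduced only for this example.

```typescript
// Hypothetical sketch of the pattern; names are illustrative, not from the PR.
type FetchLike = (url: string, init?: { signal?: AbortSignal }) => Promise<Response>;

// An aborted fetch rejects with an error whose name is "AbortError".
function isAbortError(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { name?: unknown }).name === "AbortError"
  );
}

// 499 = client closed the request; anything else is treated as a server error.
function statusForError(error: unknown): number {
  return isAbortError(error) ? 499 : 500;
}

async function proxyLongPoll(
  url: string,
  signal?: AbortSignal,
  fetchImpl: FetchLike = fetch // injectable so the sketch can run without a network
): Promise<Response> {
  // Declared outside the try so the catch block can still reach it.
  let upstream: Response | undefined;
  try {
    // Forwarding the signal lets a downstream disconnect abort the upstream fetch.
    upstream = await fetchImpl(url, { signal });
    // Hand the upstream body straight through to the downstream client.
    return new Response(upstream.body, {
      status: upstream.status,
      headers: upstream.headers,
    });
  } catch (error) {
    try {
      // Release the upstream stream if we got far enough to have one.
      await upstream?.body?.cancel();
    } catch {
      // The body may already be locked or transferred; nothing more to do.
    }
    throw new Response(null, { status: statusForError(error) });
  }
}
```

Throwing a `Response` rather than an `Error` mirrors the `Response(499)` behavior described above; a caller that wraps this in its own try/catch can still run cleanup (such as decrementing a concurrency counter) before rethrowing.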
aa2ae56 to 586315b
Summary
Fixes an RSS-only memory leak in the three realtime proxy routes (`/realtime/v1/runs`, `/realtime/v1/runs/:id`, `/realtime/v1/batches/:id`). Client disconnects during an in-flight long-poll would leave the upstream fetch to Electric running with no way to abort it, so undici kept the socket open and buffered response chunks that would never be consumed.
Root cause
All three routes flow through `RealtimeClient.streamRun/streamRuns/streamBatch` → `#streamRunsWhere` → `#performElectricRequest` → `longPollingFetch(url, { signal })`. The chain was already signal-aware, but `#streamRunsWhere` hardcoded `signal=undefined` when calling `#performElectricRequest`, so no signal ever reached `longPollingFetch`.
When a downstream client aborts a long-poll mid-flight:
- The `longPollingFetch` promise has already resolved (it returns as soon as upstream headers arrive) and handed back `new Response(upstream.body, {...})`.
- `undici` keeps the upstream socket open and continues buffering chunks into the `ReadableStream` that nothing will ever read from.
These buffers live below V8's accounting: no `heapUsed` or `external` growth, no sign in heap snapshots, only RSS. An isolated standalone reproducer (`fetch` against a slow-streaming upstream, discard the `Response` before consuming its body) measures ~44 KB retained per leaked request after GC. That's consistent with the undici socket + receive buffer + HTTP parser state for a long-lived chunked response. The pattern is the shape documented in nodejs/undici#1108 and #2143.
What changed
- `realtimeClient.server.ts`: add optional `signal` parameter to `streamRun`, `streamRuns`, `streamBatch`, and the shared `#streamRunsWhere`; thread it through to `#performElectricRequest` instead of hardcoding `undefined`.
- `realtime.v1.runs.$runId.ts`, `realtime.v1.runs.ts`, `realtime.v1.batches.$batchId.ts`: pass `getRequestAbortSignal()` (from `httpAsyncStorage.server.ts`) at the call site. This is the signal wired to `res.on('close')` and fires reliably on downstream disconnect.
- `longPollingFetch.ts`: as belt-and-suspenders, cancel the upstream body explicitly in the error path, and treat `AbortError` as a clean `499` instead of a `500`. This both releases undici's buffers deterministically on error and avoids spurious 500s in request logs when a client legitimately walks away.
Verification
Standalone reproducer: slow upstream server streams 32 KB chunks every 100 ms for 5 seconds per request. The proxy does `fetch(url)` with varying signal/cancel strategies, creates `new Response(upstream.body, ...)`, and discards it without consuming the body (simulating the leak path).
Results from 1 000 parallel fetches per variant, measured post-GC:
| Variant | Strategy | Δ RSS |
| --- | --- | --- |
| A | no signal, body never consumed | +59.4 MB |
| B | signal propagated, abort on close | +15.4 MB (plateaus) |
| C | no signal, `res.body.cancel()` | -25.4 MB |
10-round sustained test of variant B to distinguish accumulating retention from one-time allocator overhead:
RSS oscillates in a 49-65 MB band with no upward trend, so signal propagation fully releases the buffers.
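The post-GC measurement described above (RSS moving while `heapUsed` and `external` stay flat) could be taken with a helper along these lines. This is a sketch under assumptions, since the PR's actual reproducer is not shown: `diffMemory`, `forceGc`, and `measureWorkload` are hypothetical names, and `workload` stands in for whichever fetch variant is being exercised.

```typescript
// Hypothetical measurement helpers; not the reproducer used in the PR.
type MemorySample = { rss: number; heapUsed: number; external: number };
type MemoryDeltaMB = { rssMB: number; heapUsedMB: number; externalMB: number };

const BYTES_PER_MB = 1024 * 1024;

// Pure difference between two samples, converted to megabytes.
function diffMemory(before: MemorySample, after: MemorySample): MemoryDeltaMB {
  return {
    rssMB: (after.rss - before.rss) / BYTES_PER_MB,
    heapUsedMB: (after.heapUsed - before.heapUsed) / BYTES_PER_MB,
    externalMB: (after.external - before.external) / BYTES_PER_MB,
  };
}

// Runs a full GC if Node was started with --expose-gc; otherwise does nothing.
function forceGc(): void {
  const gc = (globalThis as { gc?: () => void }).gc;
  if (typeof gc === "function") gc();
}

// Samples memory before and after the workload, collecting garbage around both samples.
async function measureWorkload(workload: () => Promise<void>): Promise<MemoryDeltaMB> {
  forceGc();
  const before = process.memoryUsage();
  await workload();
  forceGc();
  return diffMemory(before, process.memoryUsage());
}
```

Run under `node --expose-gc` so that `forceGc` has an effect. Retention of the kind described here would appear as a positive `rssMB` with `heapUsedMB` and `externalMB` close to zero, which is why heap snapshots alone would not reveal it.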
Risk
- `AbortError` now surfaces as `499` rather than `500`. Any dashboard or alert that counts 500s in request logs will see slightly fewer of them; this is the intended behavior.
- The new `signal` parameter is optional on `RealtimeClient.streamRun/streamRuns/streamBatch`, so callers that don't opt in get the previous behavior.
Test plan