fix(core): retry TASK_MIDDLEWARE_ERROR under the task's retry policy by deepshekhardas · Pull Request #11 · deepshekhardas/trigger.dev

deepshekhardas · 2026-05-24T12:17:08Z

Adds TASK_MIDDLEWARE_ERROR to shouldLookupRetrySettings so it is retried under the task's retry policy instead of failing the run on the first attempt.

The error was already classified as retryable by shouldRetryError, but shouldLookupRetrySettings did not include it, so the retry flow fell through to fail_run.

Closes triggerdotdev#3676
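
The interaction between the two predicates described above can be sketched as follows. This is a simplified, hypothetical model, not the actual code in @trigger.dev/core: the placeholder codes `EXAMPLE_RETRYABLE_ERROR` and `EXAMPLE_FATAL_ERROR`, the `decideOutcome` helper, and the set-based layout are assumptions made for illustration. Only `TASK_MIDDLEWARE_ERROR`, the two predicate names, and the `fail_run` outcome come from the description.

```typescript
// Simplified, hypothetical model of the retry decision described above.
// "EXAMPLE_RETRYABLE_ERROR" and "EXAMPLE_FATAL_ERROR" are made-up placeholder codes.
type ErrorCode =
  | "TASK_MIDDLEWARE_ERROR"
  | "EXAMPLE_RETRYABLE_ERROR"
  | "EXAMPLE_FATAL_ERROR";

type RetryOutcome = "retry_with_task_policy" | "fail_run";

// Codes that the error classifier (shouldRetryError) treats as retryable at all.
const retryableCodes: ReadonlySet<ErrorCode> = new Set<ErrorCode>([
  "TASK_MIDDLEWARE_ERROR",
  "EXAMPLE_RETRYABLE_ERROR",
]);

// Codes for which the task's retry settings are looked up (shouldLookupRetrySettings),
// modelled before and after the fix.
const lookupCodesBefore: ReadonlySet<ErrorCode> = new Set<ErrorCode>([
  "EXAMPLE_RETRYABLE_ERROR",
]);
const lookupCodesAfter: ReadonlySet<ErrorCode> = new Set<ErrorCode>([
  "EXAMPLE_RETRYABLE_ERROR",
  "TASK_MIDDLEWARE_ERROR",
]);

function decideOutcome(
  code: ErrorCode,
  lookupCodes: ReadonlySet<ErrorCode>
): RetryOutcome {
  const shouldRetryError = retryableCodes.has(code);
  const shouldLookupRetrySettings = lookupCodes.has(code);
  // Both checks must pass; otherwise the flow falls through to fail_run.
  return shouldRetryError && shouldLookupRetrySettings
    ? "retry_with_task_policy"
    : "fail_run";
}
```

In this model, `decideOutcome("TASK_MIDDLEWARE_ERROR", lookupCodesBefore)` yields `fail_run` even though the code is classified as retryable, which is the mismatch the PR removes.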

Summary by cubic

Retries TASK_MIDDLEWARE_ERROR under the task's retry policy in @trigger.dev/core, preventing runs from failing on the first attempt. Aligns shouldLookupRetrySettings with shouldRetryError so middleware errors follow task backoff.

Bug Fixes
- Included TASK_MIDDLEWARE_ERROR in shouldLookupRetrySettings to use task retry/backoff.
- Added a unit test to verify retry behavior for this error.

Written for commit ab651f8. Summary will update on new commits.

When the underlying logical-replication client errored (e.g. after a Postgres failover), the runs and sessions replication services logged the error and left the stream stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind.

Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own:

- "reconnect" (default) re-subscribes via the existing subscribe(lastLsn) path with exponential backoff (1s -> 60s cap, unlimited attempts), which re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN.
- "exit" calls process.exit after a short flush window so a host's supervisor (Docker restart=always, systemd, k8s, etc.) can replace the process.
- "log" preserves the historical behaviour.

Per-service strategy + exit knobs are env-driven via RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS / _MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
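
The backoff schedule for the "reconnect" strategy could look roughly like the sketch below. The `ReconnectOptions` type, the `reconnectDelayMs` function, the 1-based attempt numbering, and the doubling factor are assumptions for illustration. The commit message only states exponential backoff from 1s to a 60s cap and that 0 means unlimited attempts. It does not say whether jitter is applied, so none is modelled here.

```typescript
// Hypothetical option names; they mirror the three reconnect env vars in spirit only.
type ReconnectOptions = {
  initialDelayMs: number; // delay before the first reconnect attempt
  maxDelayMs: number; // upper bound on any single delay
  maxAttempts: number; // 0 means unlimited
};

const defaultReconnectOptions: ReconnectOptions = {
  initialDelayMs: 1_000,
  maxDelayMs: 60_000,
  maxAttempts: 0,
};

// Returns the delay before the given 1-based attempt,
// or undefined once the attempt budget is exhausted.
function reconnectDelayMs(
  attempt: number,
  options: ReconnectOptions = defaultReconnectOptions
): number | undefined {
  if (options.maxAttempts > 0 && attempt > options.maxAttempts) {
    return undefined;
  }
  // Assumed doubling factor: 1s, 2s, 4s, ... capped at maxDelayMs.
  const uncapped = options.initialDelayMs * Math.pow(2, attempt - 1);
  return Math.min(uncapped, options.maxDelayMs);
}
```

With the defaults above, attempts 1 to 6 wait 1s, 2s, 4s, 8s, 16s and 32s, and every later attempt waits the 60s cap.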

Addresses PR review feedback:

- LogicalReplicationClient.subscribe() can throw before its internal "error" listener is wired up (notably when pg client.connect() fails mid-failover). The reconnect strategy's catch block only logged, so recovery silently stopped. Now also calls scheduleReconnect(err); the pendingReconnect guard makes it idempotent if an error event was also emitted.
- Reject negative values for the new replication-recovery env vars and cap exit codes at 255.
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn" reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy test so a delayed exit timer can't terminate the test worker.
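
The idempotency described in the first item can be illustrated with a small guard. This is a sketch under assumptions, not the PR's implementation: `createReconnectScheduler`, the injected `Scheduler` type, and the fixed `delayMs` are hypothetical. Only the names `scheduleReconnect` and `pendingReconnect` come from the commit message.

```typescript
// Injected timer so the sketch can be exercised without real timeouts.
type Scheduler = (callback: () => void, delayMs: number) => void;

function createReconnectScheduler(
  reconnect: () => void,
  schedule: Scheduler,
  delayMs: number
) {
  let pendingReconnect = false;

  return {
    // Safe to call from both a catch block and an "error" event listener:
    // while one reconnect is pending, further calls are ignored.
    scheduleReconnect(_err: unknown): void {
      if (pendingReconnect) {
        return;
      }
      pendingReconnect = true;
      schedule(() => {
        pendingReconnect = false;
        reconnect();
      }, delayMs);
    },
  };
}
```

In production the `schedule` argument could be a thin wrapper around `setTimeout`. Injecting it keeps the guard easy to test with a manual queue.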

LogicalReplicationClient.subscribe() can resolve without throwing or emitting an "error" event when leader-lock acquisition fails — it just calls this.stop() and returns. The reconnect callback now checks isStopped after subscribe() and throws so the recovery handler can schedule the next attempt instead of silently giving up.

…rough handle()

The previous post-subscribe() isStopped check was always true on the happy path: subscribe() calls stop() up front (setting _isStopped=true) and only resets the flag inside the replicationStart event, which fires asynchronously after subscribe() returns. So the check threw on every successful reconnect, the catch rescheduled, the next attempt tore down the just-built client, and the cycle continued. Replication briefly worked between teardowns, which is why the integration test passed.

Replace it with the correct nudge: subscribe to leaderElection and call the recovery handler on isLeader=false. That's the only subscribe() exit path that doesn't either throw or emit an "error" event (the other silent-return paths emit "error" first via createPublication/createSlot failures).

The previous commit routed leaderElection(false) through handle(), which under the exit strategy schedules process.exit. In a multi-instance deployment that turns lost leader election — a normal operational state — into a restart loop: exit, supervisor restarts, election fails again, exit, and so on. Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery that the reconnect strategy treats as another retry trigger, while exit and log strategies no-op. Wire the wrapper services through the new method.
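
The split between stream errors and lost leader election might be modelled as below. This is an illustrative sketch only: `createRecovery`, the `RecoveryAction` labels, and the return-a-label design are assumptions chosen to make the behaviour easy to inspect. The strategy names, `handle()`, and `notifyLeaderElectionLost()` come from the commit messages. The real methods presumably perform side effects rather than return labels.

```typescript
type RecoveryStrategy = "reconnect" | "exit" | "log";

// Hypothetical labels standing in for the side effects each path would trigger.
type RecoveryAction = "schedule_reconnect" | "schedule_exit" | "log_only" | "noop";

type ReplicationErrorRecoverySketch = {
  handle: (err: unknown) => RecoveryAction;
  notifyLeaderElectionLost: () => RecoveryAction;
};

function createRecovery(strategy: RecoveryStrategy): ReplicationErrorRecoverySketch {
  return {
    // Stream errors follow the configured strategy.
    handle(_err: unknown): RecoveryAction {
      switch (strategy) {
        case "reconnect":
          return "schedule_reconnect";
        case "exit":
          return "schedule_exit";
        default:
          return "log_only";
      }
    },
    // Losing leader election is a normal state in multi-instance deployments,
    // so only the reconnect strategy reacts; exit and log do nothing.
    notifyLeaderElectionLost(): RecoveryAction {
      return strategy === "reconnect" ? "schedule_reconnect" : "noop";
    },
  };
}
```

Keeping the two entry points separate is what prevents the exit strategy from turning a lost election into a restart loop.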

fix(webapp): auto-recover replication services after stream errors

cubic-dev-ai

1 issue found across 3 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/src/v3/errors.ts">

<violation number="1" location="packages/core/src/v3/errors.ts:647">
P2: User-facing error text contains a garbled replacement character (`�`), which degrades message readability.</violation>
</file>

cubic-dev-ai · 2026-05-24T12:19:10Z

        `over the realtime stream's per-record cap of ${maxSize} bytes. ` +
        `For oversized payloads (e.g. large tool outputs), write the value to your own store and ` +
-        `emit only an id/url through the chat stream — see https://trigger.dev/docs/ai-chat/patterns/large-payloads.`
+        `emit only an id/url through the chat stream � see https://trigger.dev/docs/ai-chat/patterns/large-payloads.`


P2: User-facing error text contains a garbled replacement character (�), which degrades message readability.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At packages/core/src/v3/errors.ts, line 647:

<comment>User-facing error text contains a garbled replacement character (`�`), which degrades message readability.</comment>

<file context>
@@ -641,7 +644,7 @@ export class ChatChunkTooLargeError extends Error {
        `over the realtime stream's per-record cap of ${maxSize} bytes. ` +
        `For oversized payloads (e.g. large tool outputs), write the value to your own store and ` +
-        `emit only an id/url through the chat stream — see https://trigger.dev/docs/ai-chat/patterns/large-payloads.`
+        `emit only an id/url through the chat stream � see https://trigger.dev/docs/ai-chat/patterns/large-payloads.`
    );
    this.name = "ChatChunkTooLargeError";
</file context>

Suggested change

-        `emit only an id/url through the chat stream � see https://trigger.dev/docs/ai-chat/patterns/large-payloads.`
+        `emit only an id/url through the chat stream — see https://trigger.dev/docs/ai-chat/patterns/large-payloads.`
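
A check along these lines could catch this kind of encoding damage before it reaches a user-facing message. It is a minimal sketch: `findReplacementCharacters` is a hypothetical helper, not something that exists in the repository, and nothing in the PR says such a check is planned.

```typescript
// U+FFFD is the Unicode replacement character, typically produced when text
// is decoded with the wrong encoding and a byte sequence cannot be mapped.
const REPLACEMENT_CHARACTER = "\uFFFD";

// Returns the index of every replacement character found in the text.
function findReplacementCharacters(text: string): number[] {
  const positions: number[] = [];
  for (let i = 0; i < text.length; i++) {
    if (text.charAt(i) === REPLACEMENT_CHARACTER) {
      positions.push(i);
    }
  }
  return positions;
}
```

An empty result means the string is clean. A non-empty result points at the offsets to repair, for example by restoring the original dash in the message above.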

Deploy Bot and others added 20 commits February 2, 2026 16:16

fix(cli-v3): allow disabling source-map-support to prevent OOM with S…

ed41f0a

…entry (triggerdotdev#2920)

fix(cli-v3): ignore engine checks during deployment install to preven…

023c3fd

…t build server failures (triggerdotdev#2913)

fix(core): delegate to original console in ConsoleInterceptor to pres…

93aa053

…erve log chain (triggerdotdev#2900)

fix(cli-v3): authenticate to Docker Hub to prevent rate limits (trigg…

8b684e1

…erdotdev#2911)

fix(cli-v3): ensure worker cleanup on SIGINT/SIGTERM (triggerdotdev#2909

737ad56

)

verify: add reproduction scripts and PR details for all major fixes

c97cbcc

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

verify: add reproduction scripts and PR details for all major fixes

aa90db9

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

docs: add consolidated PR body description

f5ce2bc

chore: remove reproduction scripts and temporary files

8c986db

Merge remote-tracking branch 'remotes/origin/fix/sentry-oom-2920'

82f198f

Merge branch 'fix/issue-2909-orphaned-workers'

9a3e8d0

chore: remove reproduction scripts after verification

e101f8e

fix: resolve typecheck errors after merge

d01d438

Merge pull request #10 from deepshekhardas/pr/3613-replication-fix

d35bf04

fix(webapp): auto-recover replication services after stream errors

fix(core): retry TASK_MIDDLEWARE_ERROR under the task's retry policy

ab651f8

cubic-dev-ai Bot reviewed May 24, 2026

deepshekhardas force-pushed the main branch from d35bf04 to 4e919e7 Compare June 17, 2026 09:51

deepshekhardas wants to merge 20 commits into main from fix/3676-retry-middleware-errors
