Commit aed4402
fix(tables): surface real error causes on cell-execution failures (diagnostics) (#4868)
* fix(tables): retry transient DB/Redis failures in cell execution and surface error causes
Workflow-group-cell runs intermittently failed on trivial DB reads/writes
under heavy fan-out, stranding cells in `running`. Investigation showed the
PlanetScale and ElastiCache backends were healthy at the time — the failures
are transient connection-level faults that the cell (maxAttempts: 1) had no
tolerance for, and the real cause was never logged (Drizzle wraps it as
"Failed query: ..." and the driver cause lives in error.cause).
Resilience:
- Add retryTransient (lib/table/retry-transient.ts): retries only transient
infra errors (reuses isRetryableInfrastructureError; adds an ioredis
command-timeout match) with jittered backoff, then rethrows. Fail-fast for
everything else.
- Wrap the cell's getTableById/getRowById reads, the terminal write
(cell-write updateRow — idempotent via the executionId guard), and the
Redis cascade-lock acquire.
Diagnostics:
- Add describeError (lib/core/errors/retryable-infrastructure.ts): walks the
.cause chain and always returns the underlying driver cause (code/errno/
syscall + causeChain), including for unclassified errors like AbortError.
- Log `cause` + a `retryable` flag (and aborted/timedOut in the cell's main
catch) across the cell + finalization error paths, mirroring the existing
schedule-execution pattern. Logging-only; no behavior change. This lets the
next recurrence reveal the real cause and whether the retry applies.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(tables): address review feedback on cell retry resilience
- retryTransient: re-check the abort signal after the backoff sleep so a
cancellation during sleep stops the next attempt (don't run/return work for
an already-cancelled task).
- isRetryableRedisError: walk the .cause chain (mirroring the infra
classifier) so wrapped Redis timeouts are recognized; drop "Connection is in
subscriber mode" — that's a connection-state programming error, not a
transient drop, and would just fail identically every retry.
- cascade-lock: stop wrapping acquireLock in retryTransient. acquireLock is a
non-idempotent SET NX, so retrying after a timed-out-but-applied first SET
returns false (key already ours) and yields a false `contended` that skips
the cascade. A transient Redis blip here just fails the run before pickup
(no stranded cell); the dispatcher re-drives it.
- Tests: cause-chain Redis match, subscriber-mode exclusion, abort-during-sleep.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(tables): drop out-of-scope abort/timeout fields from cell catch
The main catch logged `aborted`/`timedOut` from `abortSignal`/`timeoutController`,
but those are declared inside the outer try block (the inner try around
executeWorkflow is try/finally, so this catch belongs to the outer try) and are
not in scope in the catch — `next build`'s type-check failed with "Cannot find
name 'abortSignal'". Local incremental `tsc --noEmit` had skipped the file and
falsely passed; the Cursor/Greptile reviewers flagged this correctly.
Removed the two fields. Abort/timeout is still surfaced via `cause:
describeError(err)` (an aborted run shows `name: 'AbortError'` / the timeout
message), so no diagnostic signal is lost.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(tables): drop in-process retry, keep cause diagnostics only
In-process retry is the wrong layer for this path: the cell task is
maxAttempts:1 by design, retrying on a possibly-degraded worker may not help,
and it masks the very transient-failure signal we're trying to capture before
we understand the root cause. Removed retryTransient entirely (file + all
wrapping in cell-write, the cascade reads, and the lock acquire) and kept only
the diagnostic logging.
- Deleted lib/table/retry-transient.ts (+ test); cell-write and the cascade
reads call getTableById/getRowById/updateRow directly again, fail-fast.
- Kept describeError + `cause`/`retryable` fields across the cell + finalization
catch blocks; the cell-path `retryable` flag now sources from
isRetryableInfrastructureError (the canonical classifier) for consistency.
Diagnostics-first: surface the real driver cause on the next recurrence, then
decide the actual fix (e.g. task-level maxAttempts, or addressing the worker-
side cause) from evidence rather than a speculative in-process retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(schedules): log error cause on scheduled-execution failure paths
The scheduled-job failure paths logged the raw error (.message/stack only) —
its `.cause` (the real driver error behind a Drizzle "Failed query: ..."
wrapper) was never recorded, and the classified-only
`describeRetryableInfrastructureError` returns undefined for unrecognized
errors. A real failed run (same incident window as the cell failures) failed in
`applyScheduleUpdate` with exactly this unrecorded cause.
Added `cause: describeError(error)` (always-on, walks the cause chain) to the
applyScheduleUpdate catch, the early-failure catch, and the unhandled-error
catch — passed as a second arg so the existing message+stack still emit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(errors): move describeError to @sim/utils/errors
`describeError` is a general-purpose error/cause-chain helper — it didn't
belong in `lib/core/errors/retryable-infrastructure.ts` (that module is
specifically about classifying retryable infra errors, and the name read wrong
for a generic diagnostic). Moved it to `@sim/utils/errors` alongside `toError`/
`getErrorMessage`/`getPostgresErrorCode`, with its own cycle-safe cause walk.
- Added describeError + DescribedError + tests to packages/utils/src/errors.ts.
- Reverted the describeError addition from retryable-infrastructure.ts (it keeps
only isRetryableInfrastructureError / describeRetryableInfrastructureError,
which are accurately named and still used by the schedule retry path).
- Re-pointed all consumers (cell, logging-session, pause-persistence, schedule)
to import describeError from @sim/utils/errors. The `retryable` classification
flag still sources from isRetryableInfrastructureError where used.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>1 parent 5efb47e commit aed4402
6 files changed
Lines changed: 152 additions & 13 deletions
File tree
- apps/sim
- background
- lib
- logs/execution
- workflows/executor
- packages/utils/src
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
10 | | - | |
| 10 | + | |
11 | 11 | | |
12 | 12 | | |
13 | 13 | | |
| |||
156 | 156 | | |
157 | 157 | | |
158 | 158 | | |
159 | | - | |
| 159 | + | |
160 | 160 | | |
161 | 161 | | |
162 | 162 | | |
| |||
530 | 530 | | |
531 | 531 | | |
532 | 532 | | |
533 | | - | |
| 533 | + | |
| 534 | + | |
| 535 | + | |
| 536 | + | |
| 537 | + | |
| 538 | + | |
| 539 | + | |
534 | 540 | | |
535 | 541 | | |
536 | 542 | | |
| |||
950 | 956 | | |
951 | 957 | | |
952 | 958 | | |
953 | | - | |
| 959 | + | |
| 960 | + | |
| 961 | + | |
954 | 962 | | |
955 | 963 | | |
956 | 964 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
3 | 3 | | |
4 | | - | |
| 4 | + | |
5 | 5 | | |
6 | 6 | | |
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| 10 | + | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
| |||
597 | 598 | | |
598 | 599 | | |
599 | 600 | | |
600 | | - | |
601 | | - | |
| 601 | + | |
| 602 | + | |
602 | 603 | | |
603 | 604 | | |
604 | 605 | | |
| |||
720 | 721 | | |
721 | 722 | | |
722 | 723 | | |
723 | | - | |
| 724 | + | |
| 725 | + | |
| 726 | + | |
| 727 | + | |
| 728 | + | |
| 729 | + | |
724 | 730 | | |
725 | 731 | | |
726 | 732 | | |
| |||
735 | 741 | | |
736 | 742 | | |
737 | 743 | | |
738 | | - | |
| 744 | + | |
| 745 | + | |
| 746 | + | |
| 747 | + | |
| 748 | + | |
739 | 749 | | |
740 | 750 | | |
741 | 751 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
3 | 3 | | |
4 | | - | |
| 4 | + | |
5 | 5 | | |
| 6 | + | |
6 | 7 | | |
7 | 8 | | |
8 | 9 | | |
| |||
177 | 178 | | |
178 | 179 | | |
179 | 180 | | |
| 181 | + | |
| 182 | + | |
180 | 183 | | |
181 | 184 | | |
182 | 185 | | |
| |||
193 | 196 | | |
194 | 197 | | |
195 | 198 | | |
| 199 | + | |
| 200 | + | |
196 | 201 | | |
197 | 202 | | |
198 | 203 | | |
| |||
411 | 416 | | |
412 | 417 | | |
413 | 418 | | |
| 419 | + | |
| 420 | + | |
414 | 421 | | |
415 | 422 | | |
416 | 423 | | |
| |||
1057 | 1064 | | |
1058 | 1065 | | |
1059 | 1066 | | |
1060 | | - | |
| 1067 | + | |
| 1068 | + | |
| 1069 | + | |
| 1070 | + | |
| 1071 | + | |
1061 | 1072 | | |
1062 | 1073 | | |
1063 | 1074 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | | - | |
| 2 | + | |
| 3 | + | |
3 | 4 | | |
4 | 5 | | |
5 | 6 | | |
| |||
46 | 47 | | |
47 | 48 | | |
48 | 49 | | |
| 50 | + | |
| 51 | + | |
49 | 52 | | |
50 | 53 | | |
51 | 54 | | |
| |||
59 | 62 | | |
60 | 63 | | |
61 | 64 | | |
| 65 | + | |
| 66 | + | |
62 | 67 | | |
63 | 68 | | |
64 | 69 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
5 | | - | |
| 5 | + | |
6 | 6 | | |
7 | 7 | | |
8 | 8 | | |
| |||
76 | 76 | | |
77 | 77 | | |
78 | 78 | | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
39 | 39 | | |
40 | 40 | | |
41 | 41 | | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
42 | 96 | | |
43 | 97 | | |
44 | 98 | | |
| |||
0 commit comments