Commit aed4402

fix(tables): surface real error causes on cell-execution failures (diagnostics) (#4868)
* fix(tables): retry transient DB/Redis failures in cell execution and surface error causes

  Workflow-group-cell runs intermittently failed on trivial DB reads/writes under heavy fan-out, stranding cells in `running`. Investigation showed the PlanetScale and ElastiCache backends were healthy at the time; the failures are transient connection-level faults that the cell (maxAttempts: 1) had no tolerance for, and the real cause was never logged (Drizzle wraps it as "Failed query: ..." and the driver cause lives in error.cause).

  Resilience:
  - Add retryTransient (lib/table/retry-transient.ts): retries only transient infra errors (reuses isRetryableInfrastructureError; adds an ioredis command-timeout match) with jittered backoff, then rethrows. Fail-fast for everything else.
  - Wrap the cell's getTableById/getRowById reads, the terminal write (cell-write updateRow, idempotent via the executionId guard), and the Redis cascade-lock acquire.

  Diagnostics:
  - Add describeError (lib/core/errors/retryable-infrastructure.ts): walks the .cause chain and always returns the underlying driver cause (code/errno/syscall + causeChain), including for unclassified errors like AbortError.
  - Log `cause` + a `retryable` flag (and aborted/timedOut in the cell's main catch) across the cell + finalization error paths, mirroring the existing schedule-execution pattern. Logging-only; no behavior change.

  This lets the next recurrence reveal the real cause and whether the retry applies.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tables): address review feedback on cell retry resilience

  - retryTransient: re-check the abort signal after the backoff sleep so a cancellation during sleep stops the next attempt (don't run/return work for an already-cancelled task).
  - isRetryableRedisError: walk the .cause chain (mirroring the infra classifier) so wrapped Redis timeouts are recognized; drop "Connection is in subscriber mode", which is a connection-state programming error, not a transient drop, and would just fail identically every retry.
  - cascade-lock: stop wrapping acquireLock in retryTransient. acquireLock is a non-idempotent SET NX, so retrying after a timed-out-but-applied first SET returns false (key already ours) and yields a false `contended` that skips the cascade. A transient Redis blip here just fails the run before pickup (no stranded cell); the dispatcher re-drives it.
  - Tests: cause-chain Redis match, subscriber-mode exclusion, abort-during-sleep.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tables): drop out-of-scope abort/timeout fields from cell catch

  The main catch logged `aborted`/`timedOut` from `abortSignal`/`timeoutController`, but those are declared inside the outer try block (the inner try around executeWorkflow is try/finally, so this catch belongs to the outer try) and are not in scope in the catch, so `next build`'s type-check failed with "Cannot find name 'abortSignal'". Local incremental `tsc --noEmit` had skipped the file and falsely passed; the Cursor/Greptile reviewers flagged this correctly.

  Removed the two fields. Abort/timeout is still surfaced via `cause: describeError(err)` (an aborted run shows `name: 'AbortError'` / the timeout message), so no diagnostic signal is lost.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(tables): drop in-process retry, keep cause diagnostics only

  In-process retry is the wrong layer for this path: the cell task is maxAttempts:1 by design, retrying on a possibly-degraded worker may not help, and it masks the very transient-failure signal we're trying to capture before we understand the root cause. Removed retryTransient entirely (file + all wrapping in cell-write, the cascade reads, and the lock acquire) and kept only the diagnostic logging.

  - Deleted lib/table/retry-transient.ts (+ test); cell-write and the cascade reads call getTableById/getRowById/updateRow directly again, fail-fast.
  - Kept describeError + `cause`/`retryable` fields across the cell + finalization catch blocks; the cell-path `retryable` flag now sources from isRetryableInfrastructureError (the canonical classifier) for consistency.

  Diagnostics-first: surface the real driver cause on the next recurrence, then decide the actual fix (e.g. task-level maxAttempts, or addressing the worker-side cause) from evidence rather than a speculative in-process retry.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(schedules): log error cause on scheduled-execution failure paths

  The scheduled-job failure paths logged the raw error (.message/stack only); its `.cause` (the real driver error behind a Drizzle "Failed query: ..." wrapper) was never recorded, and the classified-only `describeRetryableInfrastructureError` returns undefined for unrecognized errors. A real failed run (same incident window as the cell failures) failed in `applyScheduleUpdate` with exactly this unrecorded cause.

  Added `cause: describeError(error)` (always-on, walks the cause chain) to the applyScheduleUpdate catch, the early-failure catch, and the unhandled-error catch, passed as a second arg so the existing message+stack still emit.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(errors): move describeError to @sim/utils/errors

  `describeError` is a general-purpose error/cause-chain helper; it didn't belong in `lib/core/errors/retryable-infrastructure.ts` (that module is specifically about classifying retryable infra errors, and the name read wrong for a generic diagnostic). Moved it to `@sim/utils/errors` alongside `toError`/`getErrorMessage`/`getPostgresErrorCode`, with its own cycle-safe cause walk.

  - Added describeError + DescribedError + tests to packages/utils/src/errors.ts.
  - Reverted the describeError addition from retryable-infrastructure.ts (it keeps only isRetryableInfrastructureError / describeRetryableInfrastructureError, which are accurately named and still used by the schedule retry path).
  - Re-pointed all consumers (cell, logging-session, pause-persistence, schedule) to import describeError from @sim/utils/errors.

  The `retryable` classification flag still sources from isRetryableInfrastructureError where used.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 5efb47e commit aed4402
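
The review-feedback entry in the commit message explains why `acquireLock` was not left inside a retry wrapper: a `SET NX` whose reply times out may already have been applied, so a second attempt reports contention on a lock the caller actually holds. A minimal in-memory sketch of that failure mode follows. `FakeRedis`, `setNx`, and `acquireWithRetry` are hypothetical stand-ins for illustration; they are neither the repository's lock code nor the ioredis API.

```typescript
// Hypothetical in-memory stand-in for a Redis server; not the ioredis API.
class FakeRedis {
  private store = new Map<string, string>()
  /** When true, the next setNx applies the write but then "times out". */
  dropNextReply = false

  // Mimics SET key value NX: writes only if the key is absent.
  setNx(key: string, value: string): boolean {
    const applied = !this.store.has(key)
    if (applied) this.store.set(key, value)
    if (this.dropNextReply) {
      this.dropNextReply = false
      throw new Error('Command timed out') // reply lost after the write landed
    }
    return applied
  }

  get(key: string): string | undefined {
    return this.store.get(key)
  }
}

// Naive retry around a non-idempotent acquire (the pattern the commit removed).
function acquireWithRetry(redis: FakeRedis, key: string, owner: string): boolean {
  try {
    return redis.setNx(key, owner)
  } catch {
    return redis.setNx(key, owner) // the second SET NX sees our own key
  }
}

const redis = new FakeRedis()
redis.dropNextReply = true
const acquired = acquireWithRetry(redis, 'cascade:row-1', 'exec-A')
// `acquired` is false even though the key is stored with owner 'exec-A':
// the caller would report contention on a lock it actually holds.
```

Making such a retry safe would need an ownership check (compare the stored value with the caller's id) rather than a blind second `SET NX`; the commit instead lets the run fail and relies on the dispatcher to re-drive it.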

6 files changed

Lines changed: 152 additions & 13 deletions

apps/sim/background/schedule-execution.ts

Lines changed: 12 additions & 4 deletions
@@ -7,7 +7,7 @@ import {
   workflowSchedule,
 } from '@sim/db'
 import { createLogger, runWithRequestContext } from '@sim/logger'
-import { toError } from '@sim/utils/errors'
+import { describeError, toError } from '@sim/utils/errors'
 import { generateId } from '@sim/utils/id'
 import { backoffWithJitter } from '@sim/utils/retry'
 import { task } from '@trigger.dev/sdk'
@@ -156,7 +156,7 @@ async function applyScheduleUpdate(
 
     return updatedRows.length > 0
   } catch (error) {
-    logger.error(`[${requestId}] ${context}`, error)
+    logger.error(`[${requestId}] ${context}`, error, { cause: describeError(error) })
     throw error
   }
 }
@@ -530,7 +530,13 @@ async function runWorkflowExecution({
       }
     }
 
-    logger.error(`[${requestId}] Early failure in scheduled workflow ${payload.workflowId}`, error)
+    logger.error(
+      `[${requestId}] Early failure in scheduled workflow ${payload.workflowId}`,
+      error,
+      {
+        cause: describeError(error),
+      }
+    )
 
     if (wasExecutionFinalizedByCore(error, executionId)) {
       throw error
@@ -950,7 +956,9 @@ export async function executeScheduleJob(payload: ScheduleExecutionPayload) {
       return
     }
 
-    logger.error(`[${requestId}] Error processing schedule ${payload.scheduleId}`, error)
+    logger.error(`[${requestId}] Error processing schedule ${payload.scheduleId}`, error, {
+      cause: describeError(error),
+    })
     await releaseClaim(
       now,
       `Failed to release schedule ${payload.scheduleId} after unhandled error`
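
The extra `{ cause: describeError(error) }` argument matters because `cause` is a non-enumerable property, so structured serialization of the raw error skips it. The sketch below illustrates that behaviour under the assumption that a logger serializes its payload with `JSON.stringify`; the internals of `@sim/logger` are not shown in this diff, and `withCause` is a hypothetical helper.

```typescript
// Hypothetical helper: attaches `cause` as a non-enumerable own property,
// mirroring how `new Error(message, { cause })` installs it.
function withCause(message: string, cause: unknown): Error {
  const err = new Error(message)
  Object.defineProperty(err, 'cause', {
    value: cause,
    enumerable: false,
    writable: true,
    configurable: true,
  })
  return err
}

const driverError = new Error('read ECONNRESET')
const wrappedError = withCause('Failed query: select ...', driverError)

// Serializing the raw error keeps none of its non-enumerable fields,
// so the cause is lost along with the message.
const rawPayload = JSON.stringify({ error: wrappedError })

// Copying the cause into a plain field keeps it in the payload.
const inner = (wrappedError as { cause?: unknown }).cause
const explicitPayload = JSON.stringify({
  error: wrappedError.message,
  cause: inner instanceof Error ? inner.message : String(inner),
})
```

Here `rawPayload` contains only an empty object for `error`, while `explicitPayload` carries the `ECONNRESET` text, which is the gap the added `cause` field closes.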

apps/sim/background/workflow-column-execution.ts

Lines changed: 15 additions & 5 deletions
@@ -1,12 +1,13 @@
 import { db } from '@sim/db'
 import { workflow as workflowTable } from '@sim/db/schema'
 import { createLogger, runWithRequestContext } from '@sim/logger'
-import { toError } from '@sim/utils/errors'
+import { describeError, toError } from '@sim/utils/errors'
 import { sleep } from '@sim/utils/helpers'
 import { generateId } from '@sim/utils/id'
 import { backoffWithJitter } from '@sim/utils/retry'
 import { task } from '@trigger.dev/sdk'
 import { eq } from 'drizzle-orm'
+import { isRetryableInfrastructureError } from '@/lib/core/errors/retryable-infrastructure'
 import { createTimeoutAbortController } from '@/lib/core/execution-limits'
 import { RateLimiter } from '@/lib/core/rate-limiter/rate-limiter'
 import { preprocessExecution } from '@/lib/execution/preprocessing'
@@ -597,8 +598,8 @@ async function runWorkflowAndWriteTerminal(
         })
         .catch((err) => {
           logger.warn(
-            `Per-block partial write failed (table=${tableId} row=${rowId} group=${groupId}):`,
-            err
+            `Per-block partial write failed (table=${tableId} row=${rowId} group=${groupId})`,
+            { cause: describeError(err), retryable: isRetryableInfrastructureError(err) }
           )
         })
       }
@@ -720,7 +721,12 @@ async function runWorkflowAndWriteTerminal(
     const message = toError(err).message
     logger.error(
       `Workflow group cell execution failed (table=${tableId} row=${rowId} group=${groupId})`,
-      { error: message, executionId }
+      {
+        error: message,
+        executionId,
+        cause: describeError(err),
+        retryable: isRetryableInfrastructureError(err),
+      }
     )
     terminalWritten = true
     await writeChain.catch(() => {})
@@ -735,7 +741,11 @@ async function runWorkflowAndWriteTerminal(
         blockErrors,
       })
     } catch (writeErr) {
-      logger.error('Also failed to write error state', { error: toError(writeErr).message })
+      logger.error('Also failed to write error state', {
+        error: toError(writeErr).message,
+        cause: describeError(writeErr),
+        retryable: isRetryableInfrastructureError(writeErr),
+      })
     }
     return 'error'
   }
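
The commit message notes that `abortSignal` and `timeoutController` were declared inside the `try` block and were therefore not visible in its `catch`, which is why the cell catch above logs only `cause` and `retryable`. A minimal sketch of that scoping rule follows, using a hypothetical `runCell` function and a simplified signal object. Hoisting the declaration is shown only as one possible alternative; the commit itself chose to drop the fields.

```typescript
// Hypothetical, simplified stand-in for an abort signal.
interface SignalLike {
  aborted: boolean
}

function runCell(shouldFail: boolean): { status: string; aborted?: boolean } {
  // Declared outside `try`, so both the `try` and the `catch` block can read it.
  let signal: SignalLike | undefined
  try {
    signal = { aborted: shouldFail }
    // A `const inner = ...` declared here would be block-scoped to `try`;
    // referencing it in `catch` fails type-checking with "Cannot find name".
    if (shouldFail) throw new Error('The operation was aborted')
    return { status: 'completed' }
  } catch {
    return { status: 'error', aborted: signal?.aborted }
  }
}

const okResult = runCell(false)
const failedResult = runCell(true)
```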

apps/sim/lib/logs/execution/logging-session.ts

Lines changed: 13 additions & 2 deletions
@@ -1,8 +1,9 @@
 import { db } from '@sim/db'
 import { workflowExecutionLogs } from '@sim/db/schema'
 import { createLogger } from '@sim/logger'
-import { toError } from '@sim/utils/errors'
+import { describeError, toError } from '@sim/utils/errors'
 import { and, eq, sql } from 'drizzle-orm'
+import { isRetryableInfrastructureError } from '@/lib/core/errors/retryable-infrastructure'
 import { executionLogger } from '@/lib/logs/execution/logger'
 import {
   calculateCostSummary,
@@ -177,6 +178,8 @@ export class LoggingSession {
     } catch (error) {
       logger.error(`Failed to persist last started block for execution ${this.executionId}:`, {
         error: toError(error).message,
+        cause: describeError(error),
+        retryable: isRetryableInfrastructureError(error),
       })
     }
   }
@@ -193,6 +196,8 @@ export class LoggingSession {
     } catch (error) {
       logger.error(`Failed to persist last completed block for execution ${this.executionId}:`, {
         error: toError(error).message,
+        cause: describeError(error),
+        retryable: isRetryableInfrastructureError(error),
       })
     }
   }
@@ -411,6 +416,8 @@ export class LoggingSession {
         executionId: this.executionId,
         error: toError(error).message,
         stack: error instanceof Error ? error.stack : undefined,
+        cause: describeError(error),
+        retryable: isRetryableInfrastructureError(error),
       })
       throw error
     }
@@ -1057,7 +1064,11 @@ export class LoggingSession {
         this.completionAttemptFailed = true
         logger.error(
           `[${this.requestId || 'unknown'}] Cost-only fallback also failed for execution ${this.executionId}:`,
-          { error: toError(fallbackError).message }
+          {
+            error: toError(fallbackError).message,
+            cause: describeError(fallbackError),
+            retryable: isRetryableInfrastructureError(fallbackError),
+          }
         )
       }
     }

apps/sim/lib/workflows/executor/pause-persistence.ts

Lines changed: 6 additions & 1 deletion
@@ -1,5 +1,6 @@
 import { createLogger } from '@sim/logger'
-import { toError } from '@sim/utils/errors'
+import { describeError, toError } from '@sim/utils/errors'
+import { isRetryableInfrastructureError } from '@/lib/core/errors/retryable-infrastructure'
 import type { LoggingSession } from '@/lib/logs/execution/logging-session'
 import { PauseResumeManager } from '@/lib/workflows/executor/human-in-the-loop-manager'
 import type { ExecutionResult } from '@/executor/types'
@@ -46,6 +47,8 @@ export async function handlePostExecutionPauseState({
       logger.error('Failed to persist pause result', {
         executionId,
         error: toError(pauseError).message,
+        cause: describeError(pauseError),
+        retryable: isRetryableInfrastructureError(pauseError),
       })
       await loggingSession.markAsFailed(
         `Failed to persist pause state: ${toError(pauseError).message}`
@@ -59,6 +62,8 @@ export async function handlePostExecutionPauseState({
     logger.error('Failed to process queued resumes', {
       executionId,
       error: toError(resumeError).message,
+      cause: describeError(resumeError),
+      retryable: isRetryableInfrastructureError(resumeError),
     })
   }
 }

packages/utils/src/errors.test.ts

Lines changed: 52 additions & 1 deletion
@@ -2,7 +2,7 @@
  * @vitest-environment node
  */
 import { describe, expect, it } from 'vitest'
-import { getPostgresErrorCode, toError } from './errors.js'
+import { describeError, getPostgresErrorCode, toError } from './errors.js'
 
 describe('toError', () => {
   it('returns the same Error when given an Error', () => {
@@ -76,3 +76,54 @@ describe('getPostgresErrorCode', () => {
     expect(getPostgresErrorCode(err1)).toBeUndefined()
   })
 })
+
+describe('describeError', () => {
+  it('reports name and message for a plain error, omitting causeChain', () => {
+    const described = describeError(new Error('boom'))
+    expect(described).toEqual({ name: 'Error', message: 'boom' })
+    expect(described.causeChain).toBeUndefined()
+  })
+
+  it('surfaces the deepest cause for a wrapped driver error', () => {
+    const driver = Object.assign(new Error('read ECONNRESET'), {
+      code: 'ECONNRESET',
+      errno: 'ECONNRESET',
+      syscall: 'read',
+    })
+    const wrapped = new Error('Failed query: select ...', { cause: driver })
+    const described = describeError(wrapped)
+    expect(described.message).toBe('read ECONNRESET')
+    expect(described.code).toBe('ECONNRESET')
+    expect(described.errno).toBe('ECONNRESET')
+    expect(described.syscall).toBe('read')
+    expect(described.causeChain).toEqual([
+      'Error: Failed query: select ...',
+      'Error: read ECONNRESET',
+    ])
+  })
+
+  it('always returns the cause for unclassified errors (AbortError)', () => {
+    const aborted = Object.assign(new Error('The operation was aborted'), { name: 'AbortError' })
+    expect(describeError(aborted)).toEqual({
+      name: 'AbortError',
+      message: 'The operation was aborted',
+    })
+  })
+
+  it('falls back to a populated description for non-Error input without throwing', () => {
+    expect(describeError('just a string')).toEqual({ name: 'Error', message: 'just a string' })
+    expect(() => describeError({ weird: true })).not.toThrow()
+  })
+
+  it('stops at depth 10 and does not loop on a cyclic cause', () => {
+    const a = new Error('a')
+    const b = new Error('b')
+    ;(a as { cause?: unknown }).cause = b
+    ;(b as { cause?: unknown }).cause = a
+    let described: ReturnType<typeof describeError> | undefined
+    expect(() => {
+      described = describeError(a)
+    }).not.toThrow()
+    expect(described?.causeChain?.length).toBeLessThanOrEqual(10)
+  })
+})

packages/utils/src/errors.ts

Lines changed: 54 additions & 0 deletions
@@ -39,6 +39,60 @@ export function getPostgresConstraintName(error: unknown): string | undefined {
   return readPgErrorField(error, 'constraint_name') ?? readPgErrorField(error, 'constraint')
 }
 
+export interface DescribedError {
+  name: string
+  message: string
+  code?: string
+  errno?: string
+  syscall?: string
+  /** `"Name: message"` per link in the `.cause` chain, outermost first. Present only when the chain has more than one link. */
+  causeChain?: string[]
+}
+
+/**
+ * Always-on diagnostic view of an error and its `.cause` chain.
+ *
+ * Reports the fields of the DEEPEST `.cause` link, because a wrapped driver
+ * error (e.g. Drizzle's `"Failed query: ..."` wrapping an `ECONNRESET`) carries
+ * the real reason there, not on the outer wrapper. Always returns a populated
+ * object — including for non-`Error` throws and unclassified errors like
+ * `AbortError`. Cycle-safe and depth-bounded.
+ *
+ * Loggers do not serialize the non-enumerable `Error.prototype.cause`, so pass
+ * the result as an explicit structured field rather than the raw error.
+ */
+export function describeError(error: unknown): DescribedError {
+  const chain: Error[] = []
+  const seen = new Set<unknown>()
+  let current: unknown = error
+  while (current instanceof Error && !seen.has(current) && chain.length < 10) {
+    seen.add(current)
+    chain.push(current)
+    current = current.cause
+  }
+
+  if (chain.length === 0) {
+    const normalized = toError(error)
+    return { name: normalized.name, message: normalized.message }
+  }
+
+  const deepest = chain[chain.length - 1] as Error & Record<string, unknown>
+  const asString = (value: unknown): string | undefined =>
+    typeof value === 'string' ? value : undefined
+  const code = asString(deepest.code)
+  const errno = asString(deepest.errno)
+  const syscall = asString(deepest.syscall)
+
+  return {
+    name: deepest.name,
+    message: deepest.message,
+    ...(code ? { code } : {}),
+    ...(errno ? { errno } : {}),
+    ...(syscall ? { syscall } : {}),
+    ...(chain.length > 1 ? { causeChain: chain.map((e) => `${e.name}: ${e.message}`) } : {}),
+  }
+}
+
 function readPgErrorField(error: unknown, field: string): string | undefined {
   const seen = new Set<unknown>()
   let current: unknown = error
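
To show what the new helper reports for a Drizzle-style wrapper, here is a condensed, self-contained re-statement of the cause walk from the diff above, applied to a sample error. `describe`, `wrap`, and `Described` are hypothetical names used only in this sketch. The real export is `describeError` in `@sim/utils/errors`, which additionally reports `errno`/`syscall` and normalizes non-`Error` input via `toError`.

```typescript
interface Described {
  name: string
  message: string
  code?: string
  causeChain?: string[]
}

// Hypothetical helper that attaches `cause` without relying on ES2022 typings.
function wrap(message: string, cause: unknown): Error {
  return Object.assign(new Error(message), { cause })
}

// Condensed cause walk: follow `.cause`, stopping on a cycle or at depth 10.
function describe(error: Error): Described {
  const chain: Error[] = []
  let current: unknown = error
  while (current instanceof Error && chain.indexOf(current) === -1 && chain.length < 10) {
    chain.push(current)
    current = (current as { cause?: unknown }).cause
  }
  // The deepest link carries the driver-level details.
  const deepest = chain[chain.length - 1] as Error & { code?: unknown }
  const code = typeof deepest.code === 'string' ? deepest.code : undefined
  return {
    name: deepest.name,
    message: deepest.message,
    ...(code ? { code } : {}),
    ...(chain.length > 1 ? { causeChain: chain.map((e) => `${e.name}: ${e.message}`) } : {}),
  }
}

const driver = Object.assign(new Error('read ECONNRESET'), { code: 'ECONNRESET' })
const described = describe(wrap('Failed query: select ...', driver))
// described.message is 'read ECONNRESET', described.code is 'ECONNRESET', and
// described.causeChain lists the wrapper first, then the driver error.

// A cyclic chain (a -> b -> a) terminates after visiting each error once.
const a = new Error('a')
const b = wrap('b', a)
;(a as { cause?: unknown }).cause = b
const cyclic = describe(a)
```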
