You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix(tables): retry transient DB/Redis failures in cell execution and surface error causes
Workflow-group-cell runs intermittently failed on trivial DB reads/writes
under heavy fan-out, stranding cells in `running`. Investigation showed the
PlanetScale and ElastiCache backends were healthy at the time — the failures
are transient connection-level faults that the cell (maxAttempts: 1) had no
tolerance for, and the real cause was never logged (Drizzle wraps it as
"Failed query: ..." and the driver cause lives in error.cause).
Resilience:
- Add retryTransient (lib/table/retry-transient.ts): retries only transient
infra errors (reuses isRetryableInfrastructureError; adds an ioredis
command-timeout match) with jittered backoff, then rethrows. Fail-fast for
everything else.
- Wrap the cell's getTableById/getRowById reads, the terminal write
(cell-write updateRow — idempotent via the executionId guard), and the
Redis cascade-lock acquire.
Diagnostics:
- Add describeError (lib/core/errors/retryable-infrastructure.ts): walks the
.cause chain and always returns the underlying driver cause (code/errno/
syscall + causeChain), including for unclassified errors like AbortError.
- Log `cause` + a `retryable` flag (and aborted/timedOut in the cell's main
catch) across the cell + finalization error paths, mirroring the existing
schedule-execution pattern. Logging-only; no behavior change. This lets the
next recurrence reveal the real cause and whether the retry applies.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0 commit comments