feat(think): recover via DO chat recovery instead of full retry on timeout#1782
feat(think): recover via DO chat recovery instead of full retry on timeout#1782thomasgauvin wants to merge 12 commits into
Conversation
step.prompt() now accepts an optional retries option with maxAttempts, baseDelayMs, and maxDelayMs. Retryable errors (e.g. Workers AI 3040 capacity errors) trigger fresh prompt attempts with jittered exponential backoff, while timeouts, validation errors, aborted, and skipped prompts are surfaced terminally. Each attempt uses unique workflow step names and idempotency keys so retries are durable across workflow hibernation and replays.
- Derive retry jitter from two raw SHA-256 digest bytes so the backoff fraction spans the full [0, 1) range instead of collapsing to ~0ms (base64url char codes / 0xffff was near-zero, defeating jitter). - Keep the original :submit/:wait step names on the first attempt so in-flight workflows replay without re-executing completed steps; only retries get suffixed names. - Persist and read back submitMessages maxRetries through the durable submission queue so modelMaxRetries actually reaches the inference loop (also fixes the maxRetries typecheck error). - Strengthen the backoff test to assert jitter is non-zero and add a regression test for stable first-attempt step names.
…ep-prompt-retries # Conflicts: # packages/think/src/think.ts
Addresses the chatRecovery race in step.prompt() retries without disabling chatRecovery (which usefully preserves in-flight turn state across DO restarts/stalls). - Before each retry, cancel the abandoned attempt's submission via a durable `:cancel-N` step so its turn and any chatRecovery continuation cannot keep running and race the fresh attempt on the same session (the cause of duplicate/interleaved output). No-op once terminal. - Add `retries.retryOnTimeout` (default true) to fail fast on wait timeouts instead of retrying a likely-to-repeat timeout. - Log each retry via console.warn with step/attempt/delay/error. - Tests: assert abandoned attempts are cancelled before retry and the successful final attempt is not; assert retryOnTimeout=false fails fast without backoff.
The test file's local _promptStep options type omitted retryOnTimeout, failing the tests tsconfig typecheck in CI (the main package tsconfig does not compile the test directory, so it passed locally).
…meout When step.prompt() times out (e.g. DO died during deploy), check inspectSubmission before cancelling. If the DO's built-in recovery is handling the submission (pending/running/completed), re-wait for the original completion event instead of wasting the in-flight turn. - Single eventType across all retries so recovered submissions' events reach any retry's waitForEvent - cancelOnTimeout: false when retries enabled (retry loop owns cancel) - New _tryRecoverSubmission and _processPromptEvent helpers - Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged - 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip
🦋 Changeset detectedLatest commit: 8698964 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
| const event = (await step.waitForEvent( | ||
| `${stepName}:recovery-wait-${attempt}`, | ||
| { | ||
| type: eventType, | ||
| timeout: options.timeout | ||
| } | ||
| )) as WorkflowStepEvent<ThinkPromptEventPayload>; | ||
|
|
||
| return this._processPromptEvent(event, options); |
There was a problem hiding this comment.
🔴 Unhandled throw from _tryRecoverSubmission breaks the retry loop
_tryRecoverSubmission (line 448) calls step.waitForEvent without a try/catch. If this recovery wait times out (or if _processPromptEvent at line 456 throws for a non-completed event status or validation failure), the error propagates out of the catch block at packages/think/src/workflows.ts:183 and escapes the for loop entirely. This means instead of falling through to the cancel-and-retry path (lines 234-264), the function throws an unhandled error — likely a raw Workflow runtime error rather than a ThinkPromptTimeoutError. The changeset explicitly states "if recovery fails, fall back to full retry," but any failure inside _tryRecoverSubmission will bypass the retry mechanism entirely.
Error propagation trace
The catch(err) block at line 183 calls _tryRecoverSubmission at line 215. Inside that method, step.waitForEvent (line 448) can throw on timeout, and _processPromptEvent (line 456) can throw for non-completed statuses. Either throw propagates out of the catch block since there's no inner try/catch, breaking the for-loop and causing _promptStep to throw the raw error instead of continuing retry attempts.
Prompt for agents
The _tryRecoverSubmission method at packages/think/src/workflows.ts:415-457 calls step.waitForEvent (line 448) and _processPromptEvent (line 456) without wrapping them in a try-catch. When either throws (e.g. the recovery wait times out, or the event has a non-completed status), the error propagates out of the catch block in the retry loop at line 183, bypassing the cancel-and-retry path entirely.
The fix should wrap lines 448-456 in a try-catch that returns undefined on any error, allowing the caller in _promptStep to fall through to the cancel-and-retry logic at lines 234-264. Something like:
try {
const event = await step.waitForEvent(...);
return this._processPromptEvent(event, options);
} catch {
return undefined;
}
This ensures that if recovery fails for ANY reason (timeout, skipped status, validation error, etc.), the retry loop continues with a fresh submission attempt rather than breaking.
Was this helpful? React with 👍 or 👎 to provide feedback.
| // Only cancel when _waitForPromptEvent skipped it (retries | ||
| // enabled). When cancelOnTimeout is true the wait wrapper | ||
| // already cancelled on timeout. | ||
| if (attemptRef.submissionId && cancelOnTimeout === false) { |
There was a problem hiding this comment.
🟡 cancelOnTimeout: false user opt-out is overridden by retry loop cleanup
When a user passes cancelOnTimeout: false with maxAttempts: 1 (no retries, the default), the old behavior was to NOT cancel on timeout. In the new code, the condition at line 193 (cancelOnTimeout === false) was intended to detect the retry-mechanism's forced false (set at line 160 when maxAttempts > 1), but it cannot distinguish from the user's explicit cancelOnTimeout: false. This means a user who opts out of cancellation (e.g., to let the DO's chatRecovery handle it externally) will now see the submission cancelled anyway via the step.do cancel step, which is a behavioral regression from the pre-PR behavior.
| if (attemptRef.submissionId && cancelOnTimeout === false) { | |
| if (attemptRef.submissionId && maxAttempts > 1) { | |
Was this helpful? React with 👍 or 👎 to provide feedback.
| // Single event type shared across all retry attempts. When the DO's | ||
| // built-in chat recovery restarts an interrupted submission after a | ||
| // deploy, it emits the completion event with this same type, so a | ||
| // subsequent waitForEvent in a retry attempt will receive it. | ||
| const eventType = await this._eventTypeForPrompt(stepName, options.key); |
There was a problem hiding this comment.
🚩 Shared eventType across all retry attempts may cause stale event delivery
All retry attempts share the same eventType (computed once at line 149 from stepName and options.key). This is intentional per the comments (line 145-148): the DO's chatRecovery emits completion events with this type, so a recovery wait can receive it. However, this means a stale completion event from attempt N could be delivered to the waitForEvent of attempt N+1 (or a recovery wait). Whether this causes issues depends on the Cloudflare Workflows runtime's event buffering semantics — specifically whether each waitForEvent consumes the event (removing it from the buffer) or if events can be delivered to multiple waiters. The code's correctness depends on the cancel step (line 236) preventing the old submission from ever completing and emitting its event. If cancellation races with completion (cancel is best-effort), a stale event could be consumed by the wrong attempt.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
When
step.prompt()times out (e.g. the Think Durable Object died during a deploy), the retry loop now checks whether the DO's built-in chat recovery has picked up the interrupted submission before cancelling and re-submitting. If the submission is stillpendingorrunning(recovery in progress) or alreadycompleted, the workflow re-waits for the original completion event instead of wasting the in-flight turn.This builds on top of #1777 (retry loop for
step.prompt()).Changes
Single
eventTypeacross all retries — computed once fromstepName+options.key(not per-attempt). The DO stores the eventType in submission metadata and uses it when emitting the completion event. A recovered submission emits the original eventType, so all retrywaitForEventcalls must listen for the same type.cancelOnTimeout: falsewhen retries enabled — the retry loop owns cancellation so it can attempt recovery first. Without retries, the old behavior is preserved (_waitForPromptEventcancels on timeout).New
_tryRecoverSubmission()— callsinspectSubmission()(existing RPC on Think DO). If status isrunning/pending/completed, re-waits for the event. Returnsundefinedif recovery isn't possible (status iserror/aborted/nullor RPC failed).New
_processPromptEvent()— extracted helper for event payload processing, used by both_promptStepAttemptand_tryRecoverSubmission.Fixed double-cancel — the
isLastAttempt/stopOnTimeoutpath only cancels whencancelOnTimeout === false(retries enabled). WhencancelOnTimeoutis true,_waitForPromptEventalready handled it.Design decisions
inspectSubmissionalready exists on ThinkThinkPromptTimeoutError— non-timeout errors (provider errors, validation failures) go straight to cancel + retry as beforeTest plan