feat(think): recover via DO chat recovery instead of full retry on timeout#1783
feat(think): recover via DO chat recovery instead of full retry on timeout#1783thomasgauvin wants to merge 1 commit into
Conversation
…meout When step.prompt() times out (e.g. DO died during deploy), check inspectSubmission before cancelling. If the DO's built-in recovery is handling the submission (pending/running/completed), re-wait for the original completion event instead of wasting the in-flight turn. - Single eventType across all retries so recovered submissions' events reach any retry's waitForEvent - cancelOnTimeout: false when retries enabled (retry loop owns cancel) - New _tryRecoverSubmission and _processPromptEvent helpers - Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged - 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip
🦋 Changeset detectedLatest commit: 8698964 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
| const event = (await step.waitForEvent( | ||
| `${stepName}:recovery-wait-${attempt}`, | ||
| { | ||
| type: eventType, | ||
| timeout: options.timeout | ||
| } | ||
| )) as WorkflowStepEvent<ThinkPromptEventPayload>; | ||
|
|
||
| return this._processPromptEvent(event, options); |
There was a problem hiding this comment.
🔴 Unhandled exceptions in _tryRecoverSubmission bypass the retry loop entirely
_tryRecoverSubmission is called inside the catch block of the retry loop at packages/think/src/workflows.ts:215. If step.waitForEvent at line 448 throws (e.g., recovery wait also times out) or _processPromptEvent at line 456 throws (e.g., the recovered event has a non-completed status like "skipped", or validation fails), the exception propagates out of the catch block and the for loop entirely. This skips all remaining retry attempts, never cancels the in-flight submission (lines 234-242 are never reached), and throws an unexpected error type (raw Workflow runtime error rather than ThinkPromptTimeoutError). The recovery waitForEvent uses the same options.timeout as the original wait, so a genuinely stuck submission will time out again, triggering this bug.
Call path showing the unprotected throw
In _promptStep (line 215):
catch (err) {
...
// This call is inside the catch block — if it throws,
// the exception escapes the for loop
const recovered = await this._tryRecoverSubmission(...);
...
}
In _tryRecoverSubmission (line 448):
// No try-catch — timeout or process errors propagate up
const event = await step.waitForEvent(...);
return this._processPromptEvent(event, options);
Prompt for agents
The _tryRecoverSubmission method (packages/think/src/workflows.ts, lines 415-457) needs error handling around both the step.waitForEvent call (line 448) and the _processPromptEvent call (line 456). Currently, exceptions from either propagate out of the catch block in _promptStep (line 183), bypassing the retry loop.
The fix should wrap lines 448-456 in a try-catch. If the recovery waitForEvent times out or _processPromptEvent throws (validation error, non-completed status), the method should return undefined instead of throwing, so the caller falls through to the normal cancel + retry path in the retry loop.
Something like:
try {
const event = await step.waitForEvent(...);
return this._processPromptEvent(event, options);
} catch {
return undefined;
}
This ensures recovery is purely opportunistic: if it fails for any reason, the retry loop continues normally.
Was this helpful? React with 👍 or 👎 to provide feedback.
| if (attemptRef.submissionId && cancelOnTimeout === false) { | ||
| await step.do(`${stepName}:cancel-${attempt}`, async () => { | ||
| await this.agent.cancelSubmission( | ||
| attemptRef.submissionId!, | ||
| "Workflow prompt wait timed out" | ||
| ); | ||
| }); | ||
| } |
There was a problem hiding this comment.
🔴 cancelOnTimeout: false is no longer respected when maxAttempts is 1
The new cancellation logic at line 193 checks cancelOnTimeout === false and cancels the submission. When maxAttempts === 1, cancelOnTimeout is set to options.cancelOnTimeout (line 160). If a user explicitly passes cancelOnTimeout: false (opting out of cancellation on timeout), this condition is true and the code cancels the submission anyway. Previously (packages/think/src/workflows.ts before this PR), the isLastAttempt || stopOnTimeout block simply threw err with no cancel logic — cancellation was handled exclusively by _waitForPromptEvent, which correctly respected cancelOnTimeout: false. This is a behavioral regression for callers using cancelOnTimeout: false without retries.
| if (attemptRef.submissionId && cancelOnTimeout === false) { | |
| await step.do(`${stepName}:cancel-${attempt}`, async () => { | |
| await this.agent.cancelSubmission( | |
| attemptRef.submissionId!, | |
| "Workflow prompt wait timed out" | |
| ); | |
| }); | |
| } | |
| if (attemptRef.submissionId && maxAttempts > 1) { | |
| await step.do(`${stepName}:cancel-${attempt}`, async () => { | |
| await this.agent.cancelSubmission( | |
| attemptRef.submissionId!, | |
| "Workflow prompt wait timed out" | |
| ); | |
| }); | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
| if ( | ||
| !inspection || | ||
| inspection.status === "error" || | ||
| inspection.status === "aborted" | ||
| ) { | ||
| return undefined; | ||
| } |
There was a problem hiding this comment.
🚩 Recovery does not filter out "skipped" submission status
In _tryRecoverSubmission (lines 440-446), the status filter bails out on "error" and "aborted" but allows "skipped" through to the recovery waitForEvent. The comment at line 423-427 only mentions pending, running, and completed as the expected recoverable states. A "skipped" submission is terminal — recovery shouldn't wait for it. While _processPromptEvent would handle it by throwing ThinkPromptSkippedError, that throw currently escapes the retry loop (BUG-0001). Even after BUG-0001 is fixed (catching the error), it would be cleaner to filter "skipped" early alongside "error" and "aborted" to avoid a wasted waitForEvent step.
Was this helpful? React with 👍 or 👎 to provide feedback.
| // Single event type shared across all retry attempts. When the DO's | ||
| // built-in chat recovery restarts an interrupted submission after a | ||
| // deploy, it emits the completion event with this same type, so a | ||
| // subsequent waitForEvent in a retry attempt will receive it. | ||
| const eventType = await this._eventTypeForPrompt(stepName, options.key); |
There was a problem hiding this comment.
🚩 Shared event type across retry attempts changes Workflow event routing semantics
Previously (_promptStepAttempt computed eventType from attemptKey), each retry attempt used a unique event type. Now all attempts share a single eventType (line 149, computed from options.key). This is intentional for recovery (the DO emits with this type and a later wait can pick it up), but it changes event routing: if attempt 0's cancelled submission still completes and emits an event, and attempt 1 also submits and emits one, both events share the same type. The Workflow runtime would deliver one per waitForEvent call, so the first completion event consumed might be from the wrong attempt. In practice, cancelSubmission should prevent a cancelled submission from emitting, but if cancellation is a no-op (submission already completed), the stale event could be consumed by a later retry's waitForEvent. This is likely benign (the result is still valid), but the semantic change warrants awareness.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
When
step.prompt()times out (e.g. the Think Durable Object died during a deploy), the retry loop now checks whether the DO's built-in chat recovery has picked up the interrupted submission before cancelling and re-submitting. If the submission is stillpendingorrunning(recovery in progress) or alreadycompleted, the workflow re-waits for the original completion event instead of wasting the in-flight turn.Builds on top of #1777.
Changes
Single
eventTypeacross all retries — computed once fromstepName+options.key. A recovered submission emits the original eventType, so all retrywaitForEventcalls listen for the same type.cancelOnTimeout: falsewhen retries enabled — the retry loop owns cancellation so it can attempt recovery first.New
_tryRecoverSubmission()— callsinspectSubmission()(existing RPC). If status isrunning/pending/completed, re-waits for the event. Returnsundefinedif recovery isn't possible.New
_processPromptEvent()— extracted helper, used by both_promptStepAttemptand_tryRecoverSubmission.Fixed double-cancel —
isLastAttempt/stopOnTimeoutonly cancels whencancelOnTimeout === false.Design decisions
inspectSubmissionalready exists on ThinkThinkPromptTimeoutError— non-timeout errors unchangedTest plan