Skip to content

feat(think): recover via DO chat recovery instead of full retry on timeout#1783

Closed
thomasgauvin wants to merge 1 commit into
thomasgauvin-think-step-prompt-retriesfrom
think-prompt-recovery
Closed

feat(think): recover via DO chat recovery instead of full retry on timeout#1783
thomasgauvin wants to merge 1 commit into
thomasgauvin-think-step-prompt-retriesfrom
think-prompt-recovery

Conversation

@thomasgauvin

@thomasgauvin thomasgauvin commented Jun 19, 2026

Copy link
Copy Markdown

Summary

When step.prompt() times out (e.g. the Think Durable Object died during a deploy), the retry loop now checks whether the DO's built-in chat recovery has picked up the interrupted submission before cancelling and re-submitting. If the submission is still pending or running (recovery in progress) or already completed, the workflow re-waits for the original completion event instead of wasting the in-flight turn.

Builds on top of #1777.

Changes

  • Single eventType across all retries — computed once from stepName + options.key. A recovered submission emits the original eventType, so all retry waitForEvent calls listen for the same type.

  • cancelOnTimeout: false when retries enabled — the retry loop owns cancellation so it can attempt recovery first.

  • New _tryRecoverSubmission() — calls inspectSubmission() (existing RPC). If status is running/pending/completed, re-waits for the event. Returns undefined if recovery isn't possible.

  • New _processPromptEvent() — extracted helper, used by both _promptStepAttempt and _tryRecoverSubmission.

  • Fixed double-cancelisLastAttempt/stopOnTimeout only cancels when cancelOnTimeout === false.

Design decisions

  • No new RPC needed — inspectSubmission already exists on Think
  • No changes to think.ts — the DO's recovery mechanism is unchanged
  • One recovery attempt per loop iteration — falls through to cancel + full retry if recovery re-wait also times out
  • Recovery only for ThinkPromptTimeoutError — non-timeout errors unchanged

Test plan

  • 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip
  • All 13 tests pass
  • Lint + format + typecheck clean

Open in Devin Review

…meout

When step.prompt() times out (e.g. DO died during deploy), check
inspectSubmission before cancelling. If the DO's built-in recovery
is handling the submission (pending/running/completed), re-wait for
the original completion event instead of wasting the in-flight turn.

- Single eventType across all retries so recovered submissions' events
  reach any retry's waitForEvent
- cancelOnTimeout: false when retries enabled (retry loop owns cancel)
- New _tryRecoverSubmission and _processPromptEvent helpers
- Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged
- 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip
@changeset-bot

changeset-bot Bot commented Jun 19, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 8698964

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/think Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 4 potential issues.

Open in Devin Review

Comment on lines +448 to +456
const event = (await step.waitForEvent(
`${stepName}:recovery-wait-${attempt}`,
{
type: eventType,
timeout: options.timeout
}
)) as WorkflowStepEvent<ThinkPromptEventPayload>;

return this._processPromptEvent(event, options);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Unhandled exceptions in _tryRecoverSubmission bypass the retry loop entirely

_tryRecoverSubmission is called inside the catch block of the retry loop at packages/think/src/workflows.ts:215. If step.waitForEvent at line 448 throws (e.g., recovery wait also times out) or _processPromptEvent at line 456 throws (e.g., the recovered event has a non-completed status like "skipped", or validation fails), the exception propagates out of the catch block and the for loop entirely. This skips all remaining retry attempts, never cancels the in-flight submission (lines 234-242 are never reached), and throws an unexpected error type (raw Workflow runtime error rather than ThinkPromptTimeoutError). The recovery waitForEvent uses the same options.timeout as the original wait, so a genuinely stuck submission will time out again, triggering this bug.

Call path showing the unprotected throw

In _promptStep (line 215):

catch (err) {
  ...
  // This call is inside the catch block — if it throws,
  // the exception escapes the for loop
  const recovered = await this._tryRecoverSubmission(...);
  ...
}

In _tryRecoverSubmission (line 448):

// No try-catch — timeout or process errors propagate up
const event = await step.waitForEvent(...);
return this._processPromptEvent(event, options);
Prompt for agents
The _tryRecoverSubmission method (packages/think/src/workflows.ts, lines 415-457) needs error handling around both the step.waitForEvent call (line 448) and the _processPromptEvent call (line 456). Currently, exceptions from either propagate out of the catch block in _promptStep (line 183), bypassing the retry loop.

The fix should wrap lines 448-456 in a try-catch. If the recovery waitForEvent times out or _processPromptEvent throws (validation error, non-completed status), the method should return undefined instead of throwing, so the caller falls through to the normal cancel + retry path in the retry loop.

Something like:
try {
  const event = await step.waitForEvent(...);
  return this._processPromptEvent(event, options);
} catch {
  return undefined;
}

This ensures recovery is purely opportunistic: if it fails for any reason, the retry loop continues normally.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +193 to +200
if (attemptRef.submissionId && cancelOnTimeout === false) {
await step.do(`${stepName}:cancel-${attempt}`, async () => {
await this.agent.cancelSubmission(
attemptRef.submissionId!,
"Workflow prompt wait timed out"
);
});
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 cancelOnTimeout: false is no longer respected when maxAttempts is 1

The new cancellation logic at line 193 checks cancelOnTimeout === false and cancels the submission. When maxAttempts === 1, cancelOnTimeout is set to options.cancelOnTimeout (line 160). If a user explicitly passes cancelOnTimeout: false (opting out of cancellation on timeout), this condition is true and the code cancels the submission anyway. Previously (packages/think/src/workflows.ts before this PR), the isLastAttempt || stopOnTimeout block simply threw err with no cancel logic — cancellation was handled exclusively by _waitForPromptEvent, which correctly respected cancelOnTimeout: false. This is a behavioral regression for callers using cancelOnTimeout: false without retries.

Suggested change
if (attemptRef.submissionId && cancelOnTimeout === false) {
await step.do(`${stepName}:cancel-${attempt}`, async () => {
await this.agent.cancelSubmission(
attemptRef.submissionId!,
"Workflow prompt wait timed out"
);
});
}
if (attemptRef.submissionId && maxAttempts > 1) {
await step.do(`${stepName}:cancel-${attempt}`, async () => {
await this.agent.cancelSubmission(
attemptRef.submissionId!,
"Workflow prompt wait timed out"
);
});
}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +440 to +446
if (
!inspection ||
inspection.status === "error" ||
inspection.status === "aborted"
) {
return undefined;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Recovery does not filter out "skipped" submission status

In _tryRecoverSubmission (lines 440-446), the status filter bails out on "error" and "aborted" but allows "skipped" through to the recovery waitForEvent. The comment at line 423-427 only mentions pending, running, and completed as the expected recoverable states. A "skipped" submission is terminal — recovery shouldn't wait for it. While _processPromptEvent would handle it by throwing ThinkPromptSkippedError, that throw currently escapes the retry loop (BUG-0001). Even after BUG-0001 is fixed (catching the error), it would be cleaner to filter "skipped" early alongside "error" and "aborted" to avoid a wasted waitForEvent step.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +145 to +149
// Single event type shared across all retry attempts. When the DO's
// built-in chat recovery restarts an interrupted submission after a
// deploy, it emits the completion event with this same type, so a
// subsequent waitForEvent in a retry attempt will receive it.
const eventType = await this._eventTypeForPrompt(stepName, options.key);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Shared event type across retry attempts changes Workflow event routing semantics

Previously (_promptStepAttempt computed eventType from attemptKey), each retry attempt used a unique event type. Now all attempts share a single eventType (line 149, computed from options.key). This is intentional for recovery (the DO emits with this type and a later wait can pick it up), but it changes event routing: if attempt 0's cancelled submission still completes and emits an event, and attempt 1 also submits and emits one, both events share the same type. The Workflow runtime would deliver one per waitForEvent call, so the first completion event consumed might be from the wrong attempt. In practice, cancelSubmission should prevent a cancelled submission from emitting, but if cancellation is a no-op (submission already completed), the stale event could be consumed by a later retry's waitForEvent. This is likely benign (the result is still valid), but the semantic change warrants awareness.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant