Skip to content

feat(think): recover via DO chat recovery instead of full retry on timeout#1782

Closed
thomasgauvin wants to merge 12 commits into
mainfrom
think-prompt-recovery
Closed

feat(think): recover via DO chat recovery instead of full retry on timeout#1782
thomasgauvin wants to merge 12 commits into
mainfrom
think-prompt-recovery

Conversation

@thomasgauvin

@thomasgauvin thomasgauvin commented Jun 19, 2026

Copy link
Copy Markdown

Summary

When step.prompt() times out (e.g. the Think Durable Object died during a deploy), the retry loop now checks whether the DO's built-in chat recovery has picked up the interrupted submission before cancelling and re-submitting. If the submission is still pending or running (recovery in progress) or already completed, the workflow re-waits for the original completion event instead of wasting the in-flight turn.

This builds on top of #1777 (retry loop for step.prompt()).

Changes

  • Single eventType across all retries — computed once from stepName + options.key (not per-attempt). The DO stores the eventType in submission metadata and uses it when emitting the completion event. A recovered submission emits the original eventType, so all retry waitForEvent calls must listen for the same type.

  • cancelOnTimeout: false when retries enabled — the retry loop owns cancellation so it can attempt recovery first. Without retries, the old behavior is preserved (_waitForPromptEvent cancels on timeout).

  • New _tryRecoverSubmission() — calls inspectSubmission() (existing RPC on Think DO). If status is running/pending/completed, re-waits for the event. Returns undefined if recovery isn't possible (status is error/aborted/null or RPC failed).

  • New _processPromptEvent() — extracted helper for event payload processing, used by both _promptStepAttempt and _tryRecoverSubmission.

  • Fixed double-cancel — the isLastAttempt/stopOnTimeout path only cancels when cancelOnTimeout === false (retries enabled). When cancelOnTimeout is true, _waitForPromptEvent already handled it.

Design decisions

  • No new RPC neededinspectSubmission already exists on Think
  • No changes to think.ts — the DO's recovery mechanism is unchanged
  • One recovery attempt per loop iteration — if recovery re-wait also times out, falls through to cancel + full retry (no infinite recovery loop)
  • Recovery only for ThinkPromptTimeoutError — non-timeout errors (provider errors, validation failures) go straight to cancel + retry as before

Test plan

  • 4 new tests: recovery succeeds, recovery fails (dead submission), recovery fails (DO still down), recovery not attempted for non-timeout errors
  • All 13 existing + new tests pass
  • Lint (oxlint) clean
  • Format (oxfmt) clean
  • Typecheck clean (no new errors)

Open in Devin Review

thomasgauvin and others added 12 commits June 18, 2026 12:36
step.prompt() now accepts an optional retries option with maxAttempts,
baseDelayMs, and maxDelayMs. Retryable errors (e.g. Workers AI 3040
capacity errors) trigger fresh prompt attempts with jittered exponential
backoff, while timeouts, validation errors, aborted, and skipped prompts
are surfaced terminally.

Each attempt uses unique workflow step names and idempotency keys so
retries are durable across workflow hibernation and replays.
- Derive retry jitter from two raw SHA-256 digest bytes so the backoff
  fraction spans the full [0, 1) range instead of collapsing to ~0ms
  (base64url char codes / 0xffff was near-zero, defeating jitter).
- Keep the original :submit/:wait step names on the first attempt so
  in-flight workflows replay without re-executing completed steps; only
  retries get suffixed names.
- Persist and read back submitMessages maxRetries through the durable
  submission queue so modelMaxRetries actually reaches the inference loop
  (also fixes the maxRetries typecheck error).
- Strengthen the backoff test to assert jitter is non-zero and add a
  regression test for stable first-attempt step names.
…ep-prompt-retries

# Conflicts:
#	packages/think/src/think.ts
Addresses the chatRecovery race in step.prompt() retries without
disabling chatRecovery (which usefully preserves in-flight turn state
across DO restarts/stalls).

- Before each retry, cancel the abandoned attempt's submission via a
  durable `:cancel-N` step so its turn and any chatRecovery continuation
  cannot keep running and race the fresh attempt on the same session
  (the cause of duplicate/interleaved output). No-op once terminal.
- Add `retries.retryOnTimeout` (default true) to fail fast on wait
  timeouts instead of retrying a likely-to-repeat timeout.
- Log each retry via console.warn with step/attempt/delay/error.
- Tests: assert abandoned attempts are cancelled before retry and the
  successful final attempt is not; assert retryOnTimeout=false fails
  fast without backoff.
The test file's local _promptStep options type omitted retryOnTimeout,
failing the tests tsconfig typecheck in CI (the main package tsconfig
does not compile the test directory, so it passed locally).
…meout

When step.prompt() times out (e.g. DO died during deploy), check
inspectSubmission before cancelling. If the DO's built-in recovery
is handling the submission (pending/running/completed), re-wait for
the original completion event instead of wasting the in-flight turn.

- Single eventType across all retries so recovered submissions' events
  reach any retry's waitForEvent
- cancelOnTimeout: false when retries enabled (retry loop owns cancel)
- New _tryRecoverSubmission and _processPromptEvent helpers
- Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged
- 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip
@changeset-bot

changeset-bot Bot commented Jun 19, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 8698964

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/think Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

Open in Devin Review

Comment on lines +448 to +456
const event = (await step.waitForEvent(
`${stepName}:recovery-wait-${attempt}`,
{
type: eventType,
timeout: options.timeout
}
)) as WorkflowStepEvent<ThinkPromptEventPayload>;

return this._processPromptEvent(event, options);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Unhandled throw from _tryRecoverSubmission breaks the retry loop

_tryRecoverSubmission (line 448) calls step.waitForEvent without a try/catch. If this recovery wait times out (or if _processPromptEvent at line 456 throws for a non-completed event status or validation failure), the error propagates out of the catch block at packages/think/src/workflows.ts:183 and escapes the for loop entirely. This means instead of falling through to the cancel-and-retry path (lines 234-264), the function throws an unhandled error — likely a raw Workflow runtime error rather than a ThinkPromptTimeoutError. The changeset explicitly states "if recovery fails, fall back to full retry," but any failure inside _tryRecoverSubmission will bypass the retry mechanism entirely.

Error propagation trace

The catch(err) block at line 183 calls _tryRecoverSubmission at line 215. Inside that method, step.waitForEvent (line 448) can throw on timeout, and _processPromptEvent (line 456) can throw for non-completed statuses. Either throw propagates out of the catch block since there's no inner try/catch, breaking the for-loop and causing _promptStep to throw the raw error instead of continuing retry attempts.

Prompt for agents
The _tryRecoverSubmission method at packages/think/src/workflows.ts:415-457 calls step.waitForEvent (line 448) and _processPromptEvent (line 456) without wrapping them in a try-catch. When either throws (e.g. the recovery wait times out, or the event has a non-completed status), the error propagates out of the catch block in the retry loop at line 183, bypassing the cancel-and-retry path entirely.

The fix should wrap lines 448-456 in a try-catch that returns undefined on any error, allowing the caller in _promptStep to fall through to the cancel-and-retry logic at lines 234-264. Something like:

  try {
    const event = await step.waitForEvent(...);
    return this._processPromptEvent(event, options);
  } catch {
    return undefined;
  }

This ensures that if recovery fails for ANY reason (timeout, skipped status, validation error, etc.), the retry loop continues with a fresh submission attempt rather than breaking.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

// Only cancel when _waitForPromptEvent skipped it (retries
// enabled). When cancelOnTimeout is true the wait wrapper
// already cancelled on timeout.
if (attemptRef.submissionId && cancelOnTimeout === false) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 cancelOnTimeout: false user opt-out is overridden by retry loop cleanup

When a user passes cancelOnTimeout: false with maxAttempts: 1 (no retries, the default), the old behavior was to NOT cancel on timeout. In the new code, the condition at line 193 (cancelOnTimeout === false) was intended to detect the retry-mechanism's forced false (set at line 160 when maxAttempts > 1), but it cannot distinguish from the user's explicit cancelOnTimeout: false. This means a user who opts out of cancellation (e.g., to let the DO's chatRecovery handle it externally) will now see the submission cancelled anyway via the step.do cancel step, which is a behavioral regression from the pre-PR behavior.

Suggested change
if (attemptRef.submissionId && cancelOnTimeout === false) {
if (attemptRef.submissionId && maxAttempts > 1) {
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +145 to +149
// Single event type shared across all retry attempts. When the DO's
// built-in chat recovery restarts an interrupted submission after a
// deploy, it emits the completion event with this same type, so a
// subsequent waitForEvent in a retry attempt will receive it.
const eventType = await this._eventTypeForPrompt(stepName, options.key);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Shared eventType across all retry attempts may cause stale event delivery

All retry attempts share the same eventType (computed once at line 149 from stepName and options.key). This is intentional per the comments (line 145-148): the DO's chatRecovery emits completion events with this type, so a recovery wait can receive it. However, this means a stale completion event from attempt N could be delivered to the waitForEvent of attempt N+1 (or a recovery wait). Whether this causes issues depends on the Cloudflare Workflows runtime's event buffering semantics — specifically whether each waitForEvent consumes the event (removing it from the buffer) or if events can be delivered to multiple waiters. The code's correctness depends on the cancel step (line 236) preventing the old submission from ever completing and emitting its event. If cancellation races with completion (cancel is best-effort), a stale event could be consumed by the wrong attempt.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant