feat(think): recover via DO chat recovery instead of full retry on timeout by thomasgauvin · Pull Request #1782 · cloudflare/agents

thomasgauvin · 2026-06-19T17:43:04Z

Summary

When step.prompt() times out (e.g. the Think Durable Object died during a deploy), the retry loop now checks whether the DO's built-in chat recovery has picked up the interrupted submission before cancelling and re-submitting. If the submission is still pending or running (recovery in progress) or already completed, the workflow re-waits for the original completion event instead of wasting the in-flight turn.

This builds on top of #1777 (retry loop for step.prompt()).

Changes

Single eventType across all retries — computed once from stepName + options.key (not per-attempt). The DO stores the eventType in submission metadata and uses it when emitting the completion event. A recovered submission emits the original eventType, so all retry waitForEvent calls must listen for the same type.
cancelOnTimeout: false when retries enabled — the retry loop owns cancellation so it can attempt recovery first. Without retries, the old behavior is preserved (_waitForPromptEvent cancels on timeout).
New _tryRecoverSubmission() — calls inspectSubmission() (existing RPC on Think DO). If status is running/pending/completed, re-waits for the event. Returns undefined if recovery isn't possible (status is error/aborted/null or RPC failed).
New _processPromptEvent() — extracted helper for event payload processing, used by both _promptStepAttempt and _tryRecoverSubmission.
Fixed double-cancel — the isLastAttempt/stopOnTimeout path only cancels when cancelOnTimeout === false (retries enabled). When cancelOnTimeout is true, _waitForPromptEvent already handled it.

Design decisions

No new RPC needed — inspectSubmission already exists on Think
No changes to think.ts — the DO's recovery mechanism is unchanged
One recovery attempt per loop iteration — if recovery re-wait also times out, falls through to cancel + full retry (no infinite recovery loop)
Recovery only for ThinkPromptTimeoutError — non-timeout errors (provider errors, validation failures) go straight to cancel + retry as before

Test plan

4 new tests: recovery succeeds, recovery fails (dead submission), recovery fails (DO still down), recovery not attempted for non-timeout errors
All 13 existing + new tests pass
Lint (oxlint) clean
Format (oxfmt) clean
Typecheck clean (no new errors)

step.prompt() now accepts an optional retries option with maxAttempts, baseDelayMs, and maxDelayMs. Retryable errors (e.g. Workers AI 3040 capacity errors) trigger fresh prompt attempts with jittered exponential backoff, while timeouts, validation errors, aborted, and skipped prompts are surfaced terminally. Each attempt uses unique workflow step names and idempotency keys so retries are durable across workflow hibernation and replays.

…es/streamText

- Derive retry jitter from two raw SHA-256 digest bytes so the backoff fraction spans the full [0, 1) range instead of collapsing to ~0ms (base64url char codes / 0xffff was near-zero, defeating jitter). - Keep the original :submit/:wait step names on the first attempt so in-flight workflows replay without re-executing completed steps; only retries get suffixed names. - Persist and read back submitMessages maxRetries through the durable submission queue so modelMaxRetries actually reaches the inference loop (also fixes the maxRetries typecheck error). - Strengthen the backoff test to assert jitter is non-zero and add a regression test for stable first-attempt step names.

…ep-prompt-retries # Conflicts: # packages/think/src/think.ts

…retries

…antics

Addresses the chatRecovery race in step.prompt() retries without disabling chatRecovery (which usefully preserves in-flight turn state across DO restarts/stalls). - Before each retry, cancel the abandoned attempt's submission via a durable `:cancel-N` step so its turn and any chatRecovery continuation cannot keep running and race the fresh attempt on the same session (the cause of duplicate/interleaved output). No-op once terminal. - Add `retries.retryOnTimeout` (default true) to fail fast on wait timeouts instead of retrying a likely-to-repeat timeout. - Log each retry via console.warn with step/attempt/delay/error. - Tests: assert abandoned attempts are cancelled before retry and the successful final attempt is not; assert retryOnTimeout=false fails fast without backoff.

The test file's local _promptStep options type omitted retryOnTimeout, failing the tests tsconfig typecheck in CI (the main package tsconfig does not compile the test directory, so it passed locally).

…meout When step.prompt() times out (e.g. DO died during deploy), check inspectSubmission before cancelling. If the DO's built-in recovery is handling the submission (pending/running/completed), re-wait for the original completion event instead of wasting the in-flight turn. - Single eventType across all retries so recovered submissions' events reach any retry's waitForEvent - cancelOnTimeout: false when retries enabled (retry loop owns cancel) - New _tryRecoverSubmission and _processPromptEvent helpers - Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged - 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip

changeset-bot · 2026-06-19T17:43:08Z

🦋 Changeset detected

Latest commit: 8698964

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@cloudflare/think	Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

devin-ai-integration

Devin Review found 3 potential issues.

devin-ai-integration · 2026-06-19T17:47:24Z

+    const event = (await step.waitForEvent(
+      `${stepName}:recovery-wait-${attempt}`,
+      {
+        type: eventType,
+        timeout: options.timeout
+      }
+    )) as WorkflowStepEvent<ThinkPromptEventPayload>;
+
+    return this._processPromptEvent(event, options);


🔴 Unhandled throw from _tryRecoverSubmission breaks the retry loop

_tryRecoverSubmission (line 448) calls step.waitForEvent without a try/catch. If this recovery wait times out (or if _processPromptEvent at line 456 throws for a non-completed event status or validation failure), the error propagates out of the catch block at packages/think/src/workflows.ts:183 and escapes the for loop entirely. This means instead of falling through to the cancel-and-retry path (lines 234-264), the function throws an unhandled error — likely a raw Workflow runtime error rather than a ThinkPromptTimeoutError. The changeset explicitly states "if recovery fails, fall back to full retry," but any failure inside _tryRecoverSubmission will bypass the retry mechanism entirely.

Error propagation trace

The catch(err) block at line 183 calls _tryRecoverSubmission at line 215. Inside that method, step.waitForEvent (line 448) can throw on timeout, and _processPromptEvent (line 456) can throw for non-completed statuses. Either throw propagates out of the catch block since there's no inner try/catch, breaking the for-loop and causing _promptStep to throw the raw error instead of continuing retry attempts.

Prompt for agents

The _tryRecoverSubmission method at packages/think/src/workflows.ts:415-457 calls step.waitForEvent (line 448) and _processPromptEvent (line 456) without wrapping them in a try-catch. When either throws (e.g. the recovery wait times out, or the event has a non-completed status), the error propagates out of the catch block in the retry loop at line 183, bypassing the cancel-and-retry path entirely. The fix should wrap lines 448-456 in a try-catch that returns undefined on any error, allowing the caller in _promptStep to fall through to the cancel-and-retry logic at lines 234-264. Something like: try { const event = await step.waitForEvent(...); return this._processPromptEvent(event, options); } catch { return undefined; } This ensures that if recovery fails for ANY reason (timeout, skipped status, validation error, etc.), the retry loop continues with a fresh submission attempt rather than breaking.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-06-19T17:47:26Z

+          // Only cancel when _waitForPromptEvent skipped it (retries
+          // enabled).  When cancelOnTimeout is true the wait wrapper
+          // already cancelled on timeout.
+          if (attemptRef.submissionId && cancelOnTimeout === false) {


🟡 cancelOnTimeout: false user opt-out is overridden by retry loop cleanup

When a user passes cancelOnTimeout: false with maxAttempts: 1 (no retries, the default), the old behavior was to NOT cancel on timeout. In the new code, the condition at line 193 (cancelOnTimeout === false) was intended to detect the retry-mechanism's forced false (set at line 160 when maxAttempts > 1), but it cannot distinguish from the user's explicit cancelOnTimeout: false. This means a user who opts out of cancellation (e.g., to let the DO's chatRecovery handle it externally) will now see the submission cancelled anyway via the step.do cancel step, which is a behavioral regression from the pre-PR behavior.

Suggested change

if (attemptRef.submissionId && cancelOnTimeout === false) {

if (attemptRef.submissionId && maxAttempts > 1) {

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-06-19T17:47:27Z

+    // Single event type shared across all retry attempts.  When the DO's
+    // built-in chat recovery restarts an interrupted submission after a
+    // deploy, it emits the completion event with this same type, so a
+    // subsequent waitForEvent in a retry attempt will receive it.
+    const eventType = await this._eventTypeForPrompt(stepName, options.key);


🚩 Shared eventType across all retry attempts may cause stale event delivery

All retry attempts share the same eventType (computed once at line 149 from stepName and options.key). This is intentional per the comments (line 145-148): the DO's chatRecovery emits completion events with this type, so a recovery wait can receive it. However, this means a stale completion event from attempt N could be delivered to the waitForEvent of attempt N+1 (or a recovery wait). Whether this causes issues depends on the Cloudflare Workflows runtime's event buffering semantics — specifically whether each waitForEvent consumes the event (removing it from the buffer) or if events can be delivered to multiple waiters. The code's correctness depends on the cancel step (line 236) preventing the old submission from ever completing and emitting its event. If cancellation races with completion (cancel is best-effort), a stale event could be consumed by the wrong attempt.

Was this helpful? React with 👍 or 👎 to provide feedback.

thomasgauvin and others added 12 commits June 18, 2026 12:36

fix(think): use deterministic jitter for step.prompt retry backoff

ad24313

feat(think): forward modelMaxRetries from step.prompt to submitMessag…

0ed9aa3

…es/streamText

Merge remote-tracking branch 'origin/main' into thomasgauvin-think-st…

ceb4346

…ep-prompt-retries # Conflicts: # packages/think/src/think.ts

Revert modelMaxRetries plumbing to keep PR focused on workflow-level …

cac7aac

…retries

fix(think): clamp maxAttempts and add exhausted-retry test

cae7220

fix(think): validate retries options matching agents RetryOptions sem…

cb53891

…antics

style(think): format workflows.test.ts with oxfmt

dd343a7

fix(think): add retryOnTimeout to test PromptStepRunner type

a622d14

The test file's local _promptStep options type omitted retryOnTimeout, failing the tests tsconfig typecheck in CI (the main package tsconfig does not compile the test directory, so it passed locally).

thomasgauvin closed this Jun 19, 2026

devin-ai-integration Bot reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(think): recover via DO chat recovery instead of full retry on timeout#1782

feat(think): recover via DO chat recovery instead of full retry on timeout#1782
thomasgauvin wants to merge 12 commits into
mainfrom
think-prompt-recovery

thomasgauvin commented Jun 19, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

changeset-bot Bot commented Jun 19, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Jun 19, 2026

Uh oh!

devin-ai-integration Bot Jun 19, 2026

Uh oh!

devin-ai-integration Bot Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if (attemptRef.submissionId && cancelOnTimeout === false) {
	if (attemptRef.submissionId && maxAttempts > 1) {

Conversation

thomasgauvin commented Jun 19, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Design decisions

Test plan

Uh oh!

changeset-bot Bot commented Jun 19, 2026

🦋 Changeset detected

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thomasgauvin commented Jun 19, 2026 •

edited by devin-ai-integration Bot

Loading