feat(think): Add retries to Think step.prompt() by thomasgauvin · Pull Request #1777 · cloudflare/agents

thomasgauvin · 2026-06-18T17:37:43Z

When I run step.prompt() as part of an agentic workflow, I want to have an option for the step.prompt() to be retried (in case the agent fails due to capacity constraints, etc).

I have kept this 'dumb', which will retry the whole step.prompt(). I think the Think harness itself should have retries or 'continues' to ensure the task is pushed to completion before stopping.

step.prompt() now accepts an optional retries option with maxAttempts, baseDelayMs, and maxDelayMs. Retryable errors (e.g. Workers AI 3040 capacity errors) trigger fresh prompt attempts with jittered exponential backoff, while timeouts, validation errors, aborted, and skipped prompts are surfaced terminally. Each attempt uses unique workflow step names and idempotency keys so retries are durable across workflow hibernation and replays.

changeset-bot · 2026-06-18T17:37:47Z

🦋 Changeset detected

Latest commit: c37ba2a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@cloudflare/think	Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

devin-ai-integration

Devin Review found 3 potential issues.

devin-ai-integration · 2026-06-18T17:40:54Z

+      expect(sleepCalls[0].duration).toBeGreaterThanOrEqual(0);
+      expect(sleepCalls[0].duration).toBeLessThanOrEqual(100);
+      expect(sleepCalls[1].duration).toBeGreaterThanOrEqual(0);
+      expect(sleepCalls[1].duration).toBeLessThanOrEqual(200);


🚩 Test backoff bounds may be vacuously satisfied due to the jitter bug

The retry test at packages/think/src/tests/workflows.test.ts:297-300 asserts sleepCalls[0].duration <= 100 and sleepCalls[1].duration <= 200. Because of the near-zero jitter fraction bug (BUG-0001), these assertions pass trivially (durations are always ~0). After the jitter bug is fixed, these bounds should still hold mathematically (jitter is in [0, upperBound]), but the test should also verify that durations are not always zero — e.g., asserting that at least one sleep duration is > 0 — to serve as a meaningful regression test for the backoff mechanism.

Was this helpful? React with 👍 or 👎 to provide feedback.

pkg-pr-new · 2026-06-18T17:43:15Z

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1777

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1777

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1777

create-think

npm i https://pkg.pr.new/create-think@1777

hono-agents

npm i https://pkg.pr.new/hono-agents@1777

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1777

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1777

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1777

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1777

commit: c37ba2a

…es/streamText

- Derive retry jitter from two raw SHA-256 digest bytes so the backoff fraction spans the full [0, 1) range instead of collapsing to ~0ms (base64url char codes / 0xffff was near-zero, defeating jitter). - Keep the original :submit/:wait step names on the first attempt so in-flight workflows replay without re-executing completed steps; only retries get suffixed names. - Persist and read back submitMessages maxRetries through the durable submission queue so modelMaxRetries actually reaches the inference loop (also fixes the maxRetries typecheck error). - Strengthen the backoff test to assert jitter is non-zero and add a regression test for stable first-attempt step names.

…ep-prompt-retries # Conflicts: # packages/think/src/think.ts

thomasgauvin · 2026-06-18T19:21:20Z

Remove from maxModelRetries from Think.submitMessages() because you can do it in beforeTurn()/getModel() (this sets the retry count for the whole agent)

…retries

…antics

Addresses the chatRecovery race in step.prompt() retries without disabling chatRecovery (which usefully preserves in-flight turn state across DO restarts/stalls). - Before each retry, cancel the abandoned attempt's submission via a durable `:cancel-N` step so its turn and any chatRecovery continuation cannot keep running and race the fresh attempt on the same session (the cause of duplicate/interleaved output). No-op once terminal. - Add `retries.retryOnTimeout` (default true) to fail fast on wait timeouts instead of retrying a likely-to-repeat timeout. - Log each retry via console.warn with step/attempt/delay/error. - Tests: assert abandoned attempts are cancelled before retry and the successful final attempt is not; assert retryOnTimeout=false fails fast without backoff.

The test file's local _promptStep options type omitted retryOnTimeout, failing the tests tsconfig typecheck in CI (the main package tsconfig does not compile the test directory, so it passed locally).

thomasgauvin · 2026-06-19T16:11:27Z

This cancel's Think built-in recovery, and we should make it so step.prompt() if it needs to recover, should resume Think and let it do its own chat recovery

…meout When step.prompt() times out (e.g. DO died during deploy), check inspectSubmission before cancelling. If the DO's built-in recovery is handling the submission (pending/running/completed), re-wait for the original completion event instead of wasting the in-flight turn. - Single eventType across all retries so recovered submissions' events reach any retry's waitForEvent - cancelOnTimeout: false when retries enabled (retry loop owns cancel) - New _tryRecoverSubmission and _processPromptEvent helpers - Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged - 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip

Reworks the DO chat-recovery path added for timeout handling so it is correct and resilient to slow DO restarts. - Per-attempt event types: revert the single shared event type. Each attempt's submission has a distinct type, and Think emits exactly one terminal event per submission keyed by it, so a delivered event maps 1:1 to its submission. This removes the cross-attempt misattribution risk where a stale (e.g. cancellation) event from a prior attempt could be consumed by a later attempt's wait. - DO-downtime tolerant recovery: _tryRecoverSubmission is now a bounded inspect/re-wait loop. An unreachable DO (inspectSubmission RPC failing because it is still restarting after a deploy) is treated as 'still recovering', so the loop backs off and re-checks instead of discarding the durable in-flight submission and resubmitting. - Never throws out of step.prompt(): the recovery wait + parse is wrapped and returns a tagged result. A recovery-wait timeout, a terminal failure of the recovered turn, or invalid recovered output all fall through to the cancel + fresh-resubmit path (matching the changeset's stated contract, which the previous code violated). - cancelOnTimeout: documented and honored. With retries the loop owns cancellation; a final timed-out attempt respects cancelOnTimeout, and abandoned attempts are always cancelled before resubmit to avoid the racing-turn problem. - Tests: DO-comes-back-after-downtime recovery, recovery-wait-timeout re-inspect loop, recovered-turn-fails fallback, and recovery-budget exhaustion fallback.

thomasgauvin · 2026-06-20T03:36:18Z

@threepointone would love to get your thoughts on this one. I've been working with workflows and step.prompt() would regularly die on deployments due to the underlying DOs dying.

I've been testing this PR in my own setups for the past few days iterating to make it more reliable. It hooks into the Agent's built in chat recovery to avoid replaying previous turns (you can see that in previous iterations, I kept the retries 'dumb' but that was very expensive and my workflows would take forever to complete)

thomasgauvin added 2 commits June 18, 2026 12:36

fix(think): use deterministic jitter for step.prompt retry backoff

ad24313

devin-ai-integration Bot reviewed Jun 18, 2026

View reviewed changes

thomasgauvin and others added 3 commits June 18, 2026 13:44

feat(think): forward modelMaxRetries from step.prompt to submitMessag…

0ed9aa3

…es/streamText

Merge remote-tracking branch 'origin/main' into thomasgauvin-think-st…

ceb4346

…ep-prompt-retries # Conflicts: # packages/think/src/think.ts

thomasgauvin marked this pull request as draft June 18, 2026 19:21

thomasgauvin and others added 6 commits June 18, 2026 15:40

Revert modelMaxRetries plumbing to keep PR focused on workflow-level …

cac7aac

…retries

fix(think): clamp maxAttempts and add exhausted-retry test

cae7220

fix(think): validate retries options matching agents RetryOptions sem…

cb53891

…antics

style(think): format workflows.test.ts with oxfmt

dd343a7

fix(think): add retryOnTimeout to test PromptStepRunner type

a622d14

The test file's local _promptStep options type omitted retryOnTimeout, failing the tests tsconfig typecheck in CI (the main package tsconfig does not compile the test directory, so it passed locally).

This was referenced Jun 19, 2026

feat(think): recover via DO chat recovery instead of full retry on timeout #1782

Closed

feat(think): recover via DO chat recovery instead of full retry on timeout #1783

Closed

thomasgauvin changed the title ~~Add retries to Think step.prompt()~~ feat(think): Add retries to Think step.prompt() Jun 19, 2026

thomasgauvin added 5 commits June 19, 2026 17:33

style(think): trim step.prompt retry comments

39a0bea

fix(think): address ai-review feedback for prompt retries

b0030f0

fix(think): account for scheduled chat recovery retries

6cba3be

fix(think): narrow recovery error handling

c37ba2a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(think): Add retries to Think step.prompt()#1777

feat(think): Add retries to Think step.prompt()#1777
thomasgauvin wants to merge 17 commits into
mainfrom
thomasgauvin-think-step-prompt-retries

thomasgauvin commented Jun 18, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

changeset-bot Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

Uh oh!

Uh oh!

devin-ai-integration Bot Jun 18, 2026

Uh oh!

pkg-pr-new Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

thomasgauvin commented Jun 18, 2026 •

edited

Loading

Uh oh!

thomasgauvin commented Jun 19, 2026 •

edited

Loading

Uh oh!

thomasgauvin commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thomasgauvin commented Jun 18, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

changeset-bot Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

devin-ai-integration Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

pkg-pr-new Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasgauvin commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasgauvin commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasgauvin commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thomasgauvin commented Jun 18, 2026 •

edited by devin-ai-integration Bot

Loading

changeset-bot Bot commented Jun 18, 2026 •

edited

Loading

pkg-pr-new Bot commented Jun 18, 2026 •

edited

Loading

thomasgauvin commented Jun 18, 2026 •

edited

Loading

thomasgauvin commented Jun 19, 2026 •

edited

Loading