Skip to content

feat(think): Add retries to Think step.prompt()#1777

Draft
thomasgauvin wants to merge 17 commits into
mainfrom
thomasgauvin-think-step-prompt-retries
Draft

feat(think): Add retries to Think step.prompt()#1777
thomasgauvin wants to merge 17 commits into
mainfrom
thomasgauvin-think-step-prompt-retries

Conversation

@thomasgauvin

@thomasgauvin thomasgauvin commented Jun 18, 2026

Copy link
Copy Markdown

When I run step.prompt() as part of an agentic workflow, I want to have an option for the step.prompt() to be retried (in case the agent fails due to capacity constraints, etc).

I have kept this 'dumb', which will retry the whole step.prompt(). I think the Think harness itself should have retries or 'continues' to ensure the task is pushed to completion before stopping.


Open in Devin Review

step.prompt() now accepts an optional retries option with maxAttempts,
baseDelayMs, and maxDelayMs. Retryable errors (e.g. Workers AI 3040
capacity errors) trigger fresh prompt attempts with jittered exponential
backoff, while timeouts, validation errors, aborted, and skipped prompts
are surfaced terminally.

Each attempt uses unique workflow step names and idempotency keys so
retries are durable across workflow hibernation and replays.
@changeset-bot

changeset-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: c37ba2a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/think Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

Open in Devin Review

Comment thread packages/think/src/workflows.ts Outdated
Comment thread packages/think/src/workflows.ts Outdated
Comment on lines +297 to +300
expect(sleepCalls[0].duration).toBeGreaterThanOrEqual(0);
expect(sleepCalls[0].duration).toBeLessThanOrEqual(100);
expect(sleepCalls[1].duration).toBeGreaterThanOrEqual(0);
expect(sleepCalls[1].duration).toBeLessThanOrEqual(200);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Test backoff bounds may be vacuously satisfied due to the jitter bug

The retry test at packages/think/src/tests/workflows.test.ts:297-300 asserts sleepCalls[0].duration <= 100 and sleepCalls[1].duration <= 200. Because of the near-zero jitter fraction bug (BUG-0001), these assertions pass trivially (durations are always ~0). After the jitter bug is fixed, these bounds should still hold mathematically (jitter is in [0, upperBound]), but the test should also verify that durations are not always zero — e.g., asserting that at least one sleep duration is > 0 — to serve as a meaningful regression test for the backoff mechanism.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@pkg-pr-new

pkg-pr-new Bot commented Jun 18, 2026

Copy link
Copy Markdown

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1777

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1777

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1777

create-think

npm i https://pkg.pr.new/create-think@1777

hono-agents

npm i https://pkg.pr.new/hono-agents@1777

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1777

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1777

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1777

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1777

commit: c37ba2a

thomasgauvin and others added 3 commits June 18, 2026 13:44
- Derive retry jitter from two raw SHA-256 digest bytes so the backoff
  fraction spans the full [0, 1) range instead of collapsing to ~0ms
  (base64url char codes / 0xffff was near-zero, defeating jitter).
- Keep the original :submit/:wait step names on the first attempt so
  in-flight workflows replay without re-executing completed steps; only
  retries get suffixed names.
- Persist and read back submitMessages maxRetries through the durable
  submission queue so modelMaxRetries actually reaches the inference loop
  (also fixes the maxRetries typecheck error).
- Strengthen the backoff test to assert jitter is non-zero and add a
  regression test for stable first-attempt step names.
…ep-prompt-retries

# Conflicts:
#	packages/think/src/think.ts
@thomasgauvin thomasgauvin marked this pull request as draft June 18, 2026 19:21
@thomasgauvin

thomasgauvin commented Jun 18, 2026

Copy link
Copy Markdown
Author

Remove from maxModelRetries from Think.submitMessages() because you can do it in beforeTurn()/getModel() (this sets the retry count for the whole agent)

thomasgauvin and others added 6 commits June 18, 2026 15:40
Addresses the chatRecovery race in step.prompt() retries without
disabling chatRecovery (which usefully preserves in-flight turn state
across DO restarts/stalls).

- Before each retry, cancel the abandoned attempt's submission via a
  durable `:cancel-N` step so its turn and any chatRecovery continuation
  cannot keep running and race the fresh attempt on the same session
  (the cause of duplicate/interleaved output). No-op once terminal.
- Add `retries.retryOnTimeout` (default true) to fail fast on wait
  timeouts instead of retrying a likely-to-repeat timeout.
- Log each retry via console.warn with step/attempt/delay/error.
- Tests: assert abandoned attempts are cancelled before retry and the
  successful final attempt is not; assert retryOnTimeout=false fails
  fast without backoff.
The test file's local _promptStep options type omitted retryOnTimeout,
failing the tests tsconfig typecheck in CI (the main package tsconfig
does not compile the test directory, so it passed locally).
@thomasgauvin

thomasgauvin commented Jun 19, 2026

Copy link
Copy Markdown
Author

This cancel's Think built-in recovery, and we should make it so step.prompt() if it needs to recover, should resume Think and let it do its own chat recovery

…meout

When step.prompt() times out (e.g. DO died during deploy), check
inspectSubmission before cancelling. If the DO's built-in recovery
is handling the submission (pending/running/completed), re-wait for
the original completion event instead of wasting the in-flight turn.

- Single eventType across all retries so recovered submissions' events
  reach any retry's waitForEvent
- cancelOnTimeout: false when retries enabled (retry loop owns cancel)
- New _tryRecoverSubmission and _processPromptEvent helpers
- Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged
- 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip
@thomasgauvin thomasgauvin changed the title Add retries to Think step.prompt() feat(think): Add retries to Think step.prompt() Jun 19, 2026
Reworks the DO chat-recovery path added for timeout handling so it is
correct and resilient to slow DO restarts.

- Per-attempt event types: revert the single shared event type. Each
  attempt's submission has a distinct type, and Think emits exactly one
  terminal event per submission keyed by it, so a delivered event maps
  1:1 to its submission. This removes the cross-attempt misattribution
  risk where a stale (e.g. cancellation) event from a prior attempt
  could be consumed by a later attempt's wait.

- DO-downtime tolerant recovery: _tryRecoverSubmission is now a bounded
  inspect/re-wait loop. An unreachable DO (inspectSubmission RPC failing
  because it is still restarting after a deploy) is treated as 'still
  recovering', so the loop backs off and re-checks instead of discarding
  the durable in-flight submission and resubmitting.

- Never throws out of step.prompt(): the recovery wait + parse is wrapped
  and returns a tagged result. A recovery-wait timeout, a terminal
  failure of the recovered turn, or invalid recovered output all fall
  through to the cancel + fresh-resubmit path (matching the changeset's
  stated contract, which the previous code violated).

- cancelOnTimeout: documented and honored. With retries the loop owns
  cancellation; a final timed-out attempt respects cancelOnTimeout, and
  abandoned attempts are always cancelled before resubmit to avoid the
  racing-turn problem.

- Tests: DO-comes-back-after-downtime recovery, recovery-wait-timeout
  re-inspect loop, recovered-turn-fails fallback, and recovery-budget
  exhaustion fallback.
@thomasgauvin

Copy link
Copy Markdown
Author

@threepointone would love to get your thoughts on this one. I've been working with workflows and step.prompt() would regularly die on deployments due to the underlying DOs dying.

I've been testing this PR in my own setups for the past few days iterating to make it more reliable. It hooks into the Agent's built in chat recovery to avoid replaying previous turns (you can see that in previous iterations, I kept the retries 'dumb' but that was very expensive and my workflows would take forever to complete)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant