Draft
17 commits
1969872
feat(think): add retries to step.prompt() for transient provider errors
thomasgauvin Jun 18, 2026
ad24313
fix(think): use deterministic jitter for step.prompt retry backoff
thomasgauvin Jun 18, 2026
0ed9aa3
feat(think): forward modelMaxRetries from step.prompt to submitMessag…
thomasgauvin Jun 18, 2026
a0919a9
fix(think): address review feedback on step.prompt retries
thomasgauvin Jun 18, 2026
ceb4346
Merge remote-tracking branch 'origin/main' into thomasgauvin-think-st…
thomasgauvin Jun 18, 2026
cac7aac
Revert modelMaxRetries plumbing to keep PR focused on workflow-level …
thomasgauvin Jun 18, 2026
cae7220
fix(think): clamp maxAttempts and add exhausted-retry test
thomasgauvin Jun 18, 2026
cb53891
fix(think): validate retries options matching agents RetryOptions sem…
thomasgauvin Jun 18, 2026
dd343a7
style(think): format workflows.test.ts with oxfmt
thomasgauvin Jun 18, 2026
17710a3
fix(think): cancel abandoned prompt attempt before workflow retry
thomasgauvin Jun 18, 2026
a622d14
fix(think): add retryOnTimeout to test PromptStepRunner type
thomasgauvin Jun 18, 2026
8698964
feat(think): recover via DO chat recovery instead of full retry on ti…
thomasgauvin Jun 19, 2026
8e74546
fix(think): harden step.prompt timeout recovery for production
thomasgauvin Jun 19, 2026
39a0bea
style(think): trim step.prompt retry comments
thomasgauvin Jun 19, 2026
b0030f0
fix(think): address ai-review feedback for prompt retries
thomasgauvin Jun 19, 2026
6cba3be
fix(think): account for scheduled chat recovery retries
thomasgauvin Jun 20, 2026
c37ba2a
fix(think): narrow recovery error handling
thomasgauvin Jun 20, 2026
13 changes: 13 additions & 0 deletions .changeset/think-prompt-recovery.md
@@ -0,0 +1,13 @@
---
"@cloudflare/think": minor
---

Add DO chat recovery to the `step.prompt()` retry loop.

When a prompt wait times out (e.g. the Think Durable Object was restarting during a deploy), the retry loop first tries to recover the in-flight submission via the DO's built-in chat recovery before discarding it and resubmitting. It inspects the submission and, if it is still `pending`/`running` (recovery in progress) or already `completed`, re-waits for the original completion event — reusing the in-flight turn instead of wasting it.

Recovery is resilient to the DO being temporarily unreachable: while `inspectSubmission` fails (the DO is still coming back up after a deploy), the submission is treated as "still recovering" rather than dead, so the loop backs off and re-checks rather than abandoning the durable submission. Recovery runs for a bounded number of rounds; if it can't recover within that budget it falls through to the cancel + fresh-resubmit path. It never throws out of `step.prompt()` — a recovery-wait timeout, a terminal failure of the recovered turn, or invalid recovered output all fall through to a fresh retry.

Each retry attempt uses a distinct event type derived from its key, so a delivered workflow event maps 1:1 to the submission that produced it and no event can be misattributed across attempts. The DO re-emits an interrupted submission's completion event with that same type, which the recovery wait listens on.

Recovery is only attempted for `ThinkPromptTimeoutError` with `retryOnTimeout` enabled. Non-timeout errors (provider errors, validation failures) still go through the cancel + full-retry path. This leverages Think's existing submission recovery and fiber mechanisms — no new RPC is needed (`inspectSubmission` already exists).
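
The recovery rules described in this changeset can be read as a small decision function. The sketch below is illustrative only and is not the code in this PR: the type names, the action labels, and the `maxRounds` parameter are hypothetical, and any status other than `pending`, `running`, or `completed` is assumed to mean the submission cannot be recovered.

```typescript
// Hypothetical sketch of the recovery decision; not the @cloudflare/think implementation.
type Inspection =
  | { kind: "status"; status: string } // inspectSubmission answered
  | { kind: "unreachable" }; // inspectSubmission threw (DO still coming back up)

type RecoveryAction = "re-wait" | "back-off-and-recheck" | "cancel-and-resubmit";

// Recovery is only considered for timeouts, and only when retryOnTimeout is enabled.
function shouldAttemptRecovery(errorName: string, retryOnTimeout: boolean): boolean {
  return retryOnTimeout && errorName === "ThinkPromptTimeoutError";
}

function nextRecoveryAction(
  inspection: Inspection,
  round: number, // zero-based recovery round
  maxRounds: number // assumed bound on recovery rounds
): RecoveryAction {
  // Budget spent: fall through to the cancel + fresh-resubmit path.
  if (round >= maxRounds) return "cancel-and-resubmit";
  // An unreachable DO is treated as "still recovering", not dead.
  if (inspection.kind === "unreachable") return "back-off-and-recheck";
  switch (inspection.status) {
    case "pending":
    case "running":
    case "completed":
      // Reuse the in-flight turn by waiting on the original completion event.
      return "re-wait";
    default:
      // Assumption: any other status means the turn cannot be recovered.
      return "cancel-and-resubmit";
  }
}
```

Keeping the decision pure makes the "never throws out of `step.prompt()`" property easier to reason about: every input maps to one of three actions, and the caller decides how to sleep or resubmit.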
11 changes: 11 additions & 0 deletions .changeset/think-step-prompt-retries.md
@@ -0,0 +1,11 @@
---
"@cloudflare/think": patch
---

Add optional retries to `ThinkWorkflow.step.prompt()`.

`step.prompt()` now accepts a `retries` option with `{ maxAttempts?, baseDelayMs?, maxDelayMs?, retryOnTimeout? }`. When a prompt fails for any reason, the workflow waits with jittered exponential backoff and submits a fresh prompt attempt, mirroring the default behavior of `step.do()` retries. All prompt failures are retried up to `maxAttempts` (including the first attempt). Set `retryOnTimeout: false` to fail fast on a wait timeout instead of retrying (timeouts often repeat).
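
As a rough illustration of jittered exponential backoff with deterministic jitter, the sketch below derives the jitter from a hash of the step name and attempt number instead of `Math.random()`, so recomputing the delay for the same attempt always gives the same value. The function names, the FNV-1a hash, and the "equal jitter" formula are assumptions made for this example; the PR may compute its delays differently.

```typescript
// Hypothetical sketch; not the @cloudflare/think implementation.

// FNV-1a 32-bit hash of the seed, mapped to a number in [0, 1).
function hashToUnit(seed: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < seed.length; i++) {
    h ^= seed.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000;
}

// Delay to wait after the given failed attempt (1-based).
function retryDelayMs(
  stepName: string,
  attempt: number,
  baseDelayMs = 500,
  maxDelayMs = 5000
): number {
  // Exponential growth, capped at maxDelayMs.
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
  // "Equal jitter": half the ceiling is fixed, the other half is jittered.
  const jitter = hashToUnit(`${stepName}:${attempt}`) * (ceiling / 2);
  return Math.floor(ceiling / 2 + jitter);
}
```

With the defaults, the delay after the first failed attempt falls in the range 250 to 499 ms, and later attempts are capped so the delay never exceeds `maxDelayMs`.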

Retry state is durable: each retry uses unique workflow step names and idempotency keys, so retries survive workflow hibernation and replays. The first attempt keeps the original (`:submit`/`:wait`) step names so in-flight workflows from earlier versions continue to replay without re-executing completed steps.
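
One way to picture the naming scheme is a helper that keeps the legacy names for the first attempt and adds an attempt-specific segment for later ones. Only the `:submit`/`:wait` suffixes come from the changeset; the `attempt-N` segment, the helper name, and the returned shape are hypothetical and chosen for illustration.

```typescript
// Hypothetical sketch; the real step-name and key formats may differ.
interface AttemptNames {
  submitStep: string;
  waitStep: string;
  idempotencyKey: string;
}

function attemptNames(stepName: string, attempt: number): AttemptNames {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  // Attempt 1 keeps the original names so older in-flight workflows replay cleanly.
  // Later attempts get a unique prefix (assumed format).
  const key = attempt === 1 ? stepName : `${stepName}:attempt-${attempt}`;
  return {
    submitStep: `${key}:submit`,
    waitStep: `${key}:wait`,
    idempotencyKey: key
  };
}
```

Because every attempt maps to distinct step names and a distinct key, a replayed workflow can tell which attempts already ran without re-executing them.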

Before retrying, the workflow cancels the abandoned attempt's submission. Think keeps its own `chatRecovery` running for the submission (which preserves in-flight turn state across DO restarts/stalls), so without this a lingering turn or recovery continuation for the old attempt could keep running and race the fresh attempt on the same session — producing duplicate or interleaved output. Each retry is also logged via `console.warn` with the step name, attempt, backoff delay, and error.
37 changes: 37 additions & 0 deletions docs/think/workflows.md
@@ -178,6 +178,43 @@ and throws `ThinkPromptTimeoutError`.
Set `cancelOnTimeout: false` when you intentionally want the Think submission to
continue after the Workflow stops waiting.

## Retries

Pass `retries` to retry a failed `step.prompt()` attempt:

```typescript
await step.prompt("summarize-file", {
  prompt: "Summarize the file",
  output: summarySchema,
  timeout: "5 minutes",
  retries: {
    maxAttempts: 3,
    baseDelayMs: 500,
    maxDelayMs: 5000,
    retryOnTimeout: true
  }
});
```

| Option | Default | Description |
| ---------------- | ------- | ------------------------------------------------ |
| `maxAttempts` | `1` | Total attempts, including the first attempt. |
| `baseDelayMs` | `500` | Base delay for deterministic exponential jitter. |
| `maxDelayMs` | `5000` | Maximum retry delay. |
| `retryOnTimeout` | `true` | Whether timeout errors should be retried. |

When retries are enabled and a wait times out, `step.prompt()` first gives the
Think Durable Object a chance to recover the in-flight submission. This is useful
when the Durable Object is restarting after a deploy: the Workflow waits for the
original completion event instead of immediately discarding the turn and
submitting a duplicate prompt. If recovery cannot complete within its bounded
retry window, the Workflow cancels the abandoned submission and submits a fresh
attempt.

Set `retryOnTimeout: false` to fail fast on `ThinkPromptTimeoutError`. With
multiple attempts enabled, `cancelOnTimeout` only applies to the final timed-out
attempt; abandoned attempts are cancelled before a fresh retry starts.

## Boundary With Other Primitives

Use `getScheduledTasks()` for recurring prompt submissions or deterministic
17 changes: 17 additions & 0 deletions packages/think/src/tests/think-session.test.ts
@@ -3862,6 +3862,23 @@ describe("Think — onChatRecovery", () => {
    expect(incident?.status).toBe("scheduled");
  });

  it("does not mark a running submission errored while a recovered retry is scheduled", async () => {
    const agent = await freshRecoveryAgent(
      `recover-retry-sweep-${crypto.randomUUID()}`
    );
    await agent.seedRunningSubmissionForTest("root-RS");
    await agent.preScheduleRecoveryRetryForTest({
      recoveredRequestId: "root-RS",
      targetUserId: "user-RS",
      incidentId: "inc-RS",
      originalRequestId: "root-RS"
    });

    await agent.recoverSubmissionsOnStartForTest();

    expect(await agent.getSubmissionStatusForTest("root-RS")).toBe("running");
  });

  it("exhausts via onExhausted once the stable-state continue budget is spent", async () => {
    const agent = await freshRecoveryAgent(
      `stable-exhaust-${crypto.randomUUID()}`