feat(think): Add retries to Think step.prompt()#1777
Conversation
step.prompt() now accepts an optional retries option with maxAttempts, baseDelayMs, and maxDelayMs. Retryable errors (e.g. Workers AI 3040 capacity errors) trigger fresh prompt attempts with jittered exponential backoff, while timeouts, validation errors, aborted, and skipped prompts are surfaced terminally. Each attempt uses unique workflow step names and idempotency keys so retries are durable across workflow hibernation and replays.
🦋 Changeset detectedLatest commit: c37ba2a The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
| expect(sleepCalls[0].duration).toBeGreaterThanOrEqual(0); | ||
| expect(sleepCalls[0].duration).toBeLessThanOrEqual(100); | ||
| expect(sleepCalls[1].duration).toBeGreaterThanOrEqual(0); | ||
| expect(sleepCalls[1].duration).toBeLessThanOrEqual(200); |
There was a problem hiding this comment.
🚩 Test backoff bounds may be vacuously satisfied due to the jitter bug
The retry test at packages/think/src/tests/workflows.test.ts:297-300 asserts sleepCalls[0].duration <= 100 and sleepCalls[1].duration <= 200. Because of the near-zero jitter fraction bug (BUG-0001), these assertions pass trivially (durations are always ~0). After the jitter bug is fixed, these bounds should still hold mathematically (jitter is in [0, upperBound]), but the test should also verify that durations are not always zero — e.g., asserting that at least one sleep duration is > 0 — to serve as a meaningful regression test for the backoff mechanism.
Was this helpful? React with 👍 or 👎 to provide feedback.
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
- Derive retry jitter from two raw SHA-256 digest bytes so the backoff fraction spans the full [0, 1) range instead of collapsing to ~0ms (base64url char codes / 0xffff was near-zero, defeating jitter). - Keep the original :submit/:wait step names on the first attempt so in-flight workflows replay without re-executing completed steps; only retries get suffixed names. - Persist and read back submitMessages maxRetries through the durable submission queue so modelMaxRetries actually reaches the inference loop (also fixes the maxRetries typecheck error). - Strengthen the backoff test to assert jitter is non-zero and add a regression test for stable first-attempt step names.
…ep-prompt-retries # Conflicts: # packages/think/src/think.ts
|
Remove from maxModelRetries from Think.submitMessages() because you can do it in beforeTurn()/getModel() (this sets the retry count for the whole agent) |
Addresses the chatRecovery race in step.prompt() retries without disabling chatRecovery (which usefully preserves in-flight turn state across DO restarts/stalls). - Before each retry, cancel the abandoned attempt's submission via a durable `:cancel-N` step so its turn and any chatRecovery continuation cannot keep running and race the fresh attempt on the same session (the cause of duplicate/interleaved output). No-op once terminal. - Add `retries.retryOnTimeout` (default true) to fail fast on wait timeouts instead of retrying a likely-to-repeat timeout. - Log each retry via console.warn with step/attempt/delay/error. - Tests: assert abandoned attempts are cancelled before retry and the successful final attempt is not; assert retryOnTimeout=false fails fast without backoff.
The test file's local _promptStep options type omitted retryOnTimeout, failing the tests tsconfig typecheck in CI (the main package tsconfig does not compile the test directory, so it passed locally).
|
This cancel's Think built-in recovery, and we should make it so step.prompt() if it needs to recover, should resume Think and let it do its own chat recovery |
…meout When step.prompt() times out (e.g. DO died during deploy), check inspectSubmission before cancelling. If the DO's built-in recovery is handling the submission (pending/running/completed), re-wait for the original completion event instead of wasting the in-flight turn. - Single eventType across all retries so recovered submissions' events reach any retry's waitForEvent - cancelOnTimeout: false when retries enabled (retry loop owns cancel) - New _tryRecoverSubmission and _processPromptEvent helpers - Recovery only for ThinkPromptTimeoutError; non-timeout errors unchanged - 4 new tests: recovery succeeds, recovery fails (dead/DO down), non-timeout skip
Reworks the DO chat-recovery path added for timeout handling so it is correct and resilient to slow DO restarts. - Per-attempt event types: revert the single shared event type. Each attempt's submission has a distinct type, and Think emits exactly one terminal event per submission keyed by it, so a delivered event maps 1:1 to its submission. This removes the cross-attempt misattribution risk where a stale (e.g. cancellation) event from a prior attempt could be consumed by a later attempt's wait. - DO-downtime tolerant recovery: _tryRecoverSubmission is now a bounded inspect/re-wait loop. An unreachable DO (inspectSubmission RPC failing because it is still restarting after a deploy) is treated as 'still recovering', so the loop backs off and re-checks instead of discarding the durable in-flight submission and resubmitting. - Never throws out of step.prompt(): the recovery wait + parse is wrapped and returns a tagged result. A recovery-wait timeout, a terminal failure of the recovered turn, or invalid recovered output all fall through to the cancel + fresh-resubmit path (matching the changeset's stated contract, which the previous code violated). - cancelOnTimeout: documented and honored. With retries the loop owns cancellation; a final timed-out attempt respects cancelOnTimeout, and abandoned attempts are always cancelled before resubmit to avoid the racing-turn problem. - Tests: DO-comes-back-after-downtime recovery, recovery-wait-timeout re-inspect loop, recovered-turn-fails fallback, and recovery-budget exhaustion fallback.
|
@threepointone would love to get your thoughts on this one. I've been working with workflows and step.prompt() would regularly die on deployments due to the underlying DOs dying. I've been testing this PR in my own setups for the past few days iterating to make it more reliable. It hooks into the Agent's built in chat recovery to avoid replaying previous turns (you can see that in previous iterations, I kept the retries 'dumb' but that was very expensive and my workflows would take forever to complete) |
When I run step.prompt() as part of an agentic workflow, I want to have an option for the step.prompt() to be retried (in case the agent fails due to capacity constraints, etc).
I have kept this 'dumb', which will retry the whole step.prompt(). I think the Think harness itself should have retries or 'continues' to ensure the task is pushed to completion before stopping.