Chat recovery foundation: a shared, host-agnostic recovery engine for agents/chat#1788
🦋 Changeset detected. Latest commit: ac8fe1e. The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
// SKIPPED: known-red gate for an unrelated bug: the flat wall-clock re-attach
// budget (`DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS`) still abandons a healthy,
// still-progressing child as `interrupted` after a deploy. A prior fix
// (#1670, "progress-keyed agent-tool re-attach") only partially closed this;
// making the re-attach fully progress-aware/durable is its own task. This is
// NOT part of the chat-recovery extraction. Re-enable (drop `.skip`) once that
// fix lands; the assertion below should then pass unchanged.
🚩 Think e2e reattach-budget test deliberately skipped with a known-red gate note
The test at packages/think/src/e2e-tests/reattach-budget.test.ts:310 was changed from it(...) to it.skip(...) with a detailed comment explaining this is a known-red gate for an unrelated bug — the wall-clock re-attach budget still abandons a healthy child after a deploy. The comment explicitly says this is NOT part of the chat-recovery extraction and should be re-enabled once the separate fix lands. This is the right approach (skip with context rather than delete or leave red), but it means there is a regression hole in the re-attach budget area until that fix arrives.
Begin work on rfc-chat-recovery-foundation. The durable recovery state machine is currently duplicated verbatim across `@cloudflare/ai-chat` (`AIChatAgent`) and `@cloudflare/think` (`Think`). This is the first, lowest-risk extraction: the incident-budget decision, which is the pure, byte-identical heart of both copies of `_beginChatRecoveryIncident`.
What changed:
- Add `packages/agents/src/chat/recovery-incident.ts` (@internal): the `ChatRecoveryIncident` type, the persisted storage-key/budget constants (the cutover contract, now in one place), and pure helpers `resolveChatRecoveryConfig`, `chatRecoveryIncidentId`, `chatRecoveryIncidentKey`, `selectStaleIncidentKeys`, plus `evaluateChatRecoveryIncident`, a storage-free, deterministic extraction of the budget math. The caller still owns storage I/O, the progress counter, the pending-interaction predicate, and event emission; the function takes resolved inputs + an injected clock and returns { incident, exhausted, events }.
- Add Layer-1 unit tests (`__tests__/recovery-incident.test.ts`, 30 cases) covering: incident open, identity excludes recovery kind, retry/continue share one budget, attempt cap, attempt reset on progress, deploy-storm debounce, no-progress timeout, finite work budget, maxRecoveryWork:Infinity, shouldKeepRecovering true/false/throws, no-progress and work-budget win before the predicate, and HITL-park-is-budget-free.
- Add the golden cutover round-trip gate (`__tests__/recovery-cutover-fixtures.ts` + `recovery-cutover.test.ts`): authentic pre-cutover snapshot envelopes (`__cfAIChatFiberSnapshot` / `__cfThinkChatFiberSnapshot` + legacy raw stash), legacy incident records (missing optional fields, deprecated `max_recovery_window_exceeded` reason), and schedule payloads with Think's extra `recoveredRequestId`.
- Update the RFC: add a "Working cadence" loop (do step -> update plan -> deep-review -> commit) that applies to every phase, a live "Progress log", and Phase 0 checklist ticks.
Deep review findings:
- `evaluateChatRecoveryIncident` reproduces `_beginChatRecoveryIncident`'s budget math, verified line-by-line against both packages (the source is byte-identical apart from log prefix, predicate name, and Think's client-tool rehydration guard).
- `targetAssistantId` is correctly omitted from the pure identity: it lives only in schedule payloads and is a dead param in the incident math.
- Sweep ordering invariant preserved by keeping the sweep in the caller: stale incidents must be swept BEFORE reading `existing` (a >1h incident is both swept and past the no-progress window). Phase 1 wiring must honor this.
- Think's `_restoreClientTools()` hibernation guard is adapter-owned rehydration that must run before `awaitingClientInteraction` is computed; modeled here as an input to the pure function, not engine policy.
- Cutover note: a pre-cutover incident persisted without `lastProgressAt` is bounded by `firstSeenAt`, so a long-orphaned turn can seal on no_progress_timeout immediately on the cutover wake. This is existing behavior, now explicit and tested.
Zero behavior change: both packages still run their inline copies. Wiring them to call the shared function is Phase 1; deleting the inline copies is Phase 4.
No public API change (module is @internal, not exported from the barrel), so no changeset is required.
Gates: 280 chat-project tests pass (36 new); oxlint, oxfmt, and full-repo typecheck (111 projects) clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
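To make the shape of the pure budget function concrete, here is a minimal sketch. It is not the code in `recovery-incident.ts`: the type names, the `madeProgress` input, the `max_attempts_exceeded` reason, and the event labels are assumptions for illustration. Only the `{ incident, exhausted, events }` return shape, the injected clock, the `firstSeenAt` fallback, the `no_progress_timeout` reason, and the conditional `reason` spread come from the commit message above.

```typescript
// Hypothetical, simplified shapes; the real `ChatRecoveryIncident` has more fields.
interface IncidentSketch {
  incidentId: string;
  attempt: number;
  firstSeenAt: number;
  lastProgressAt?: number; // absent on pre-cutover records
  reason?: string; // present only once the budget is spent
}

interface EvaluateInput {
  incidentId: string;
  existing: IncidentSketch | null;
  madeProgress: boolean; // assumption: the caller resolves its progress counter to a flag
  maxAttempts: number;
  noProgressTimeoutMs: number;
  now: () => number; // injected clock keeps the function deterministic
}

interface EvaluateResult {
  incident: IncidentSketch;
  exhausted: boolean;
  events: string[]; // hypothetical labels, not the real event names
}

function evaluateIncidentSketch(input: EvaluateInput): EvaluateResult {
  const now = input.now();
  const existing = input.existing;
  const firstSeenAt = existing?.firstSeenAt ?? now;

  // A record persisted without `lastProgressAt` is bounded by `firstSeenAt`.
  const lastProgressAt = input.madeProgress
    ? now
    : (existing?.lastProgressAt ?? firstSeenAt);

  // Progress resets the attempt counter; otherwise each wake increments it.
  let attempt = 1;
  if (existing && !input.madeProgress) attempt = existing.attempt + 1;

  let reason: string | undefined;
  if (now - lastProgressAt > input.noProgressTimeoutMs) {
    reason = "no_progress_timeout";
  } else if (attempt > input.maxAttempts) {
    reason = "max_attempts_exceeded"; // hypothetical reason name
  }
  const exhausted = reason !== undefined;

  const incident: IncidentSketch = {
    incidentId: input.incidentId,
    attempt,
    firstSeenAt,
    lastProgressAt,
    ...(exhausted ? { reason } : {}),
  };
  return { incident, exhausted, events: [exhausted ? "exhausted" : "open"] };
}
```

Because the function reads no storage and takes its clock as an argument, a unit test can replay the cutover case (a legacy record without `lastProgressAt`) by passing a fixed `now` and asserting on the returned record.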
Make `AIChatAgent` delegate its incident-budget logic to the shared engine
extracted in the previous commit, proving the extraction against real Durable
Object storage with zero behavior change.
What changed:
- `AIChatAgent._resolveChatRecoveryConfig` -> `resolveChatRecoveryConfig`,
`_chatRecoveryIncidentId` -> `chatRecoveryIncidentId`, and the budget
computation inside `_beginChatRecoveryIncident` -> `evaluateChatRecoveryIncident`
(agents/chat). The method now owns only the storage I/O it must own: resolve
config, sweep stale incidents, read the existing record, read the progress
marker, evaluate via the engine, persist, and emit the returned events.
- Re-export the engine surface `@internal` from the `agents/chat` barrel
(`chat/index.ts`) because both consumers import shared chat code through that
entry point. Tighten `resolveChatRecoveryConfig`'s param to
`ChatRecoveryConfig | undefined`, dropping the earlier casts.
- Remove six now-unused local default constants from `ai-chat`
(maxAttempts/maxRecoveryWork/stableTimeout/terminalMessage/noProgressTimeout/
alarm-debounce); those defaults live in the engine now. Net -210 lines in
`ai-chat/src/index.ts`.
Deep review (behavior-preservation):
- Persisted incident JSON is byte-identical: the engine builds the same field
set in the same order, including the `...(exhausted ? { reason } : {})`
spread, so the cutover contract is unaffected.
- `existing` is normalized `?? null`; the engine's `existing != null` / `!existing`
guards match the old `undefined` checks exactly.
- Sweep-before-read ordering invariant preserved (a >TTL incident is also past
the no-progress window, so sweeping first lets an abandoned identity start
fresh) and documented inline.
- `hasPendingClientInteraction()` budget-free path, the `shouldKeepRecovering`
ctx (`recoveryRootRequestId` fallback, `ageMs`), and the `[AIChatAgent]`-
prefixed predicate-error log are all preserved (passed via
`onShouldKeepRecoveringError`).
- Event names/payloads/order unchanged; `_emit` accepts the engine event types.
- The locally-computed `key`/`incidentId` match `incident.incidentId`, so the
stored key and record stay consistent.
Deferred to Phase 4: the local `ChatRecoveryIncident` / `ChatRecoveryKind`
types and the remaining recovery constants stay duplicated for now. Think
wiring is the next Phase 1 step. The other recovery methods
(`_updateChatRecoveryIncident`, `_exhaustChatRecovery`,
`_handleInternalFiberRecovery`, scheduling) move under the adapter in Phase 2.
No public API change (engine is @internal, not documented, no hook/signature
or default changes), so no changeset.
Gates: full ai-chat suite (682) + agents chat unit suite (280) pass; full-repo
typecheck (111 projects) and oxlint clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
Mirror the AIChatAgent wiring on Think so both packages now share one incident state machine. `Think._resolveChatRecoveryConfig`, `_chatRecoveryIncidentId`, and the budget computation inside `_beginChatRecoveryIncident` delegate to the shared `resolveChatRecoveryConfig` / `chatRecoveryIncidentId` / `evaluateChatRecoveryIncident` (agents/chat). Removed the same six now-unused local default constants.
Deep review (Think-specific seams preserved):
- The `_restoreClientTools()` hibernation guard is kept and still runs BEFORE the engine reads `hasPendingInteraction()`: the guard `if` statement executes before the `evaluateChatRecoveryIncident` argument object (which evaluates `awaitingClientInteraction: this.hasPendingInteraction()`) is constructed. On a fresh wake the base Agent runs boot recovery before onStart's restore, so a HITL turn parked on a client-tool orphan would be misread as "stuck" without this; the guard is idempotent with the later onStart restore.
- The predicate is Think's `hasPendingInteraction()` (which excludes server-tool orphans), not ai-chat's `hasPendingClientInteraction()`.
- The predicate-error log keeps the `[Think]` prefix (passed via `onShouldKeepRecoveringError`).
- Persisted incident JSON is byte-identical, `existing` normalized `?? null`, and sweep-before-read ordering preserved (same as the AIChatAgent commit).
Deferred to Phase 2/4 as before: Think's other recovery methods (submissions, messenger/workflow ordering, stall watchdog, tool rollback, agent-tool reconcile) move under a ThinkRecoveryAdapter in Phase 3; the duplicated local incident type/constants are deleted in Phase 4.
No public API change, so no changeset.
Gates: Think workers suite (686) passes; full-repo typecheck (111) and oxlint clean. (React/CLI/e2e suites use their own vitest configs; run the recovery path via `pnpm run test:workers`.)
Co-authored-by: Cursor <cursoragent@cursor.com>
… net
Worked the Phase 0 breadth items (schedule idempotency/non-idempotency,
terminal-before-seal, callback-error coverage, reconnect recovering replay) as
an AUDIT rather than reflexively adding tests, then recorded the result in the
RFC.
Finding: the high-risk Phase 2 invariants are already characterized
symmetrically in both `@cloudflare/ai-chat` and `@cloudflare/think`, so adding
more package-level tests would duplicate (the working cadence explicitly warns
against over-testing). The existing suites ARE the Phase 2 safety net.
Evidence (verified by reading the real scheduling/exhaust/terminal code +
existing tests):
- Non-idempotent stable-timeout reschedule: pinned by the 2-row tests in both
packages ("reschedules a continuation that times out…" + retry twin). The
base scheduler dedups idempotent delayed rows on callback+payload+owner, and
the reschedule deliberately passes { idempotent: false } so it does not dedup
onto the executing one-shot row and vanish.
- Initial-schedule storm-dedup: pinned by the fiber-row-deletion "double
recovery" tests (primary mechanism); { idempotent: true } is belt-and-braces.
- Terminal-before-seal: pinned by the #1730 defer-on-transient tests in both
packages, plus seal-write-best-effort and #1645 terminal-replay-on-reconnect.
- Callback-error handling: onChatRecovery/onExhausted throw (ai-chat) +
shouldKeepRecovering throw (shared engine unit test).
Resolved the Phase 0 checklist and added a "Phase 0 breadth audit" invariant→
test map. Two items are deliberately deferred (NOT current-behavior pins):
1. Adapter-contract tests + a direct { idempotent } flag-value assertion per
scheduling reason — these belong at the engine↔adapter seam, which doesn't
exist yet; specified as the first Layer-2 test to write in Phase 2.
2. ai-chat recovering-status on-connect hydration — confirmed asymmetry vs
Think (ai-chat's own `_setChatRecovering` comment says the live signal is not
replayed on connect). Ships WITH the Phase 2 convergence + changeset, tracked
as an intentional behavior change.
Why this over jumping to Phase 2: converging behavior before the safety net is
verified contradicts the cadence; this audit makes the net explicit so Phase 2
lands against a known-good base.
Docs-only; no code or test changes.
Co-authored-by: Cursor <cursoragent@cursor.com>
…e 1)
Introduce the first recovery engine-seam file
`packages/agents/src/chat/recovery-engine.ts` with
`chatRecoverySchedulePolicy(reason)` as the single source of truth for the
`schedule()` idempotency flag, and route both packages through it:
- "initial" -> idempotent (collapses a deploy-storm of re-detections
into one enqueued continuation)
- "stable_timeout_retry"-> non-idempotent (a reschedule issued from inside the
executing one-shot row, which alarm()
deletes only after we return; an
idempotent reschedule would dedup onto
the doomed row and never fire)
All eight recovery schedule sites now source the flag from the policy:
AIChatAgent (3 initial + 1 reschedule) and Think (3 initial + 1 reschedule).
Per-site comments now point at the policy for the rationale. The four remaining
`{ idempotent }` literals are non-recovery subsystems (stream-buffer cleanup,
scheduled tasks, submission drain) and are intentionally left alone.
This is a cutover invariant that no type error guards, so add the deferred
Layer-2 seam test (`__tests__/recovery-engine.test.ts`): it pins both reasons
directly and through a fake scheduler exercised the way the packages call
`schedule()`. Closes the Phase 0 "direct flag assertion" deferral.
Zero behavior change (the policy returns the same literal each site used).
Gates: ai-chat workers (604) + think workers (686) + shared engine unit (34)
pass; repo typecheck (111 projects) and oxlint clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
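The two policy outcomes described above can be sketched as follows. The policy function mirrors the mapping stated in the commit message; the `FakeScheduler` is a hypothetical stand-in (the real base scheduler also keys its dedup on owner and handles delays), included only to show why the reschedule has to be non-idempotent.

```typescript
type ChatRecoveryScheduleReason = "initial" | "stable_timeout_retry";

// Single source of truth for the idempotency flag, per the mapping above.
function chatRecoverySchedulePolicy(
  reason: ChatRecoveryScheduleReason
): { idempotent: boolean } {
  switch (reason) {
    case "initial":
      return { idempotent: true }; // collapse a deploy-storm into one row
    case "stable_timeout_retry":
      return { idempotent: false }; // must not dedup onto the executing row
  }
}

// Hypothetical fake scheduler: dedups idempotent rows on callback + payload.
class FakeScheduler {
  readonly rows: { callback: string; payload: string }[] = [];

  schedule(
    callback: string,
    payload: unknown,
    opts: { idempotent: boolean }
  ): void {
    const serialized = JSON.stringify(payload);
    const duplicate = this.rows.some(
      (row) => row.callback === callback && row.payload === serialized
    );
    if (opts.idempotent && duplicate) return; // dedup onto the existing row
    this.rows.push({ callback, payload: serialized });
  }
}

const scheduler = new FakeScheduler();
const data = { requestId: "req-1" };

// Three re-detections of the same orphaned turn collapse into one row.
for (let i = 0; i < 3; i++) {
  scheduler.schedule(
    "_chatRecoveryContinue",
    data,
    chatRecoverySchedulePolicy("initial")
  );
}

// The stable-timeout reschedule must add a second row even though an
// identical row (the one currently executing) still exists.
scheduler.schedule(
  "_chatRecoveryContinue",
  data,
  chatRecoverySchedulePolicy("stable_timeout_retry")
);

console.log(scheduler.rows.length); // 2
```

Had the reschedule used `{ idempotent: true }`, it would have matched the still-present executing row and been dropped, which is the "never fires" failure the commit message describes.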
Replace the six inline `"_chatRecoveryContinue" | "_chatRecoveryRetry"` unions in the recovery helper signatures (AIChatAgent ×2, Think ×4) with the shared `ChatRecoveryScheduleCallback` type already exported from `agents/chat`. This gives the seam type a real consumer (it was exported-but-unused after slice 1) and removes the duplicated literal union. Pure type-alias substitution — identical types, zero runtime change. Gates: repo typecheck (111 projects) and oxlint clean. Co-authored-by: Cursor <cursoragent@cursor.com>
…se 2, slice 2a)
Introduce `ChatRecoveryEngine` + `ChatRecoveryAdapter` in `recovery-engine.ts`. The engine owns the begin-incident sequence and its two ordering invariants:
resolve config -> derive key -> sweep stale (before read) -> read existing -> rehydrate interaction state (before predicate) -> read progress -> evaluate budget -> persist -> emit events
The adapter is a thin seam over the package's host I/O (storage, clock, events, interaction predicate). The budget math stays in the pure `evaluateChatRecoveryIncident`; the engine owns only the sequence.
Wire `AIChatAgent._beginChatRecoveryIncident` to a cached engine over an inline adapter, and remove the now-dead `_chatRecoveryIncidentId` and the `evaluateChatRecoveryIncident` import (the engine derives id/key via the pure fns). AIChatAgent omits the optional `ensureInteractionStateLoaded` hook because it has no interaction state to rehydrate; the hook reserves Think's client-tool guard for slice 2b.
Orchestration is byte-identical: the pure `chatRecoveryIncidentKey` matches the removed private method character-for-character (incl. `encodeURIComponent`), the sequence/order is unchanged, and the cached engine is safe because the adapter arrows capture `this` (ctx/storage stable per DO instance) while `resolveConfig` still runs per-incident.
Adds a Layer-2 fake-adapter test pinning the sequence, both ordering invariants, the injected-clock path, event fan-out, and the optional-hook-absent shape.
Gates: ai-chat workers (604) + shared engine unit (10) pass; repo typecheck (111 projects) and oxlint clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
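The begin-incident sequence above can be sketched against a recording fake adapter. This is an illustration under assumptions, not the `ChatRecoveryEngine` source: apart from `ensureInteractionStateLoaded`, `resolveConfig`, and `getIncident`, the adapter method names, the key prefix, and the inlined stand-in for the budget evaluation are all hypothetical.

```typescript
interface IncidentRecord {
  incidentId: string;
  attempt: number;
}

// Hypothetical adapter seam over host I/O.
interface AdapterSketch {
  resolveConfig(): { maxAttempts: number };
  sweepStaleIncidents(): Promise<void>;
  getIncident(key: string): Promise<IncidentRecord | null>;
  ensureInteractionStateLoaded?(): Promise<void>; // optional hook
  readProgress(): Promise<number>;
  hasPendingInteraction(): boolean;
  putIncident(key: string, incident: IncidentRecord): Promise<void>;
  emit(event: string): void;
}

async function beginIncidentSketch(
  adapter: AdapterSketch,
  requestId: string
): Promise<IncidentRecord> {
  const config = adapter.resolveConfig();
  const key = `example:incident:${encodeURIComponent(requestId)}`; // hypothetical prefix
  await adapter.sweepStaleIncidents(); // invariant 1: sweep BEFORE read
  const existing = await adapter.getIncident(key);
  await adapter.ensureInteractionStateLoaded?.(); // invariant 2: rehydrate BEFORE predicate
  const progress = await adapter.readProgress();
  const awaitingClientInteraction = adapter.hasPendingInteraction();

  // Stand-in for the pure budget evaluation (assumption, heavily simplified).
  const attempt = progress > 0 ? 1 : (existing?.attempt ?? 0) + 1;
  const exhausted = !awaitingClientInteraction && attempt > config.maxAttempts;

  const incident: IncidentRecord = { incidentId: requestId, attempt };
  await adapter.putIncident(key, incident);
  adapter.emit(exhausted ? "exhausted" : "open");
  return incident;
}

// Fake adapter that records the order in which the engine touches it.
function recordingAdapter(calls: string[], withHook: boolean): AdapterSketch {
  const adapter: AdapterSketch = {
    resolveConfig: () => (calls.push("resolveConfig"), { maxAttempts: 3 }),
    sweepStaleIncidents: async () => void calls.push("sweep"),
    getIncident: async () => (calls.push("get"), null),
    readProgress: async () => (calls.push("progress"), 0),
    hasPendingInteraction: () => (calls.push("predicate"), false),
    putIncident: async () => void calls.push("put"),
    emit: (event) => void calls.push(`emit:${event}`),
  };
  if (withHook) {
    adapter.ensureInteractionStateLoaded = async () => void calls.push("rehydrate");
  }
  return adapter;
}
```

Passing `withHook: false` models the AIChatAgent shape (hook omitted), and `withHook: true` models the Think shape; in both cases the recorded order shows the sweep landing before the read and the rehydration landing before the predicate.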
… slice 2b) `Think._beginChatRecoveryIncident` now delegates to the shared `ChatRecoveryEngine` via a cached adapter, mirroring AIChatAgent (slice 2a). Its hibernation ordering guard becomes the adapter's `ensureInteractionStateLoaded` hook — the rationale comment moves verbatim onto the hook — and its interaction predicate is `hasPendingInteraction()`. Removed the now-dead `_chatRecoveryIncidentId` plus the `evaluateChatRecoveryIncident` / `chatRecoveryIncidentId` imports (the engine derives id/key via the pure fns). Both packages are now symmetric for incident-begin; the only divergence is the predicate and the presence of the rehydration hook. Byte-identical: the engine order (get -> ensureInteractionStateLoaded -> readProgress -> predicate) matches the old inline order (get -> guard -> progress -> hasPendingInteraction); `_restoreClientTools` and `hasPendingInteraction` keep their other callers; key derivation is unchanged. Gates: Think workers (686) pass; repo typecheck (111 projects) and oxlint clean. Co-authored-by: Cursor <cursoragent@cursor.com>
… 2c)
Both packages' `_exhaustChatRecovery` share a byte-identical head: build the `ChatRecoveryExhaustedContext`, emit `chat:recovery:exhausted`, then run `onExhausted` with a throw-swallow so a bad hook can never block terminal UX. Their tails (terminal-record / banner-broadcast / submission writes AND the order of those writes) are an intentional, documented divergence: ai-chat persists-then-broadcasts (#1645 reconnect reliability), Think broadcasts-then-persists (banner resilience) and also marks the submission interrupted.
So rather than forcing the whole method behind the engine (which would either flatten that divergence or leak a persist-first/broadcast-first knob plus terminal/submission I/O through the seam), extract only the shared head:
- `buildChatRecoveryExhaustedContext`: pure field map with the `reason`/`recoveryRootRequestId` fallbacks.
- `notifyChatRecoveryExhausted`: emit -> onExhausted-swallow -> onError report.
Both packages call these at the top of `_exhaustChatRecovery` and keep their own divergent terminal I/O in their own order. The "throwing onExhausted never blocks terminal UX" invariant now lives in one tested place. Drops the now-unused `ChatRecoveryExhaustedContext` type-import from both packages (public re-export kept).
Zero behavior change. Layer-2 tests pin the field map, both fallbacks, emit-before-hook order, the swallow + onError path, and the no-hook path.
Gates: chat unit (296), ai-chat workers (682), Think workers (686), typecheck (111), oxlint: all clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
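A minimal sketch of the shared head's notify step (emit -> onExhausted-swallow -> onError report). The context fields and the `deps` parameter shape are assumptions for illustration; the real `notifyChatRecoveryExhausted` signature is not shown in this thread.

```typescript
// Hypothetical, reduced context; the real ChatRecoveryExhaustedContext has more fields.
interface ExhaustedContextSketch {
  requestId: string;
  reason: string;
}

interface NotifyDeps {
  emit(type: "chat:recovery:exhausted", ctx: ExhaustedContextSketch): void;
  onExhausted?: (ctx: ExhaustedContextSketch) => void | Promise<void>;
  onError?: (error: unknown) => void;
}

async function notifyExhaustedSketch(
  ctx: ExhaustedContextSketch,
  deps: NotifyDeps
): Promise<void> {
  // 1. Emit first, so observability never depends on user code.
  deps.emit("chat:recovery:exhausted", ctx);

  // 2. Run the user hook, swallowing any throw or rejection.
  if (!deps.onExhausted) return;
  try {
    await deps.onExhausted(ctx);
  } catch (error) {
    // 3. Report the hook failure, but never rethrow: the caller's
    //    terminal UX writes must still run after this returns.
    deps.onError?.(error);
  }
}
```

Each package would call this at the top of its own `_exhaustChatRecovery` and then perform its terminal writes in its own order, so the persist-first versus broadcast-first divergence never crosses the seam.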
The existing recovery e2e proves the state machine in local workerd via SIGKILL. This adds the missing half: a suite that deploys a real Worker, forces a real Durable Object eviction the way production does (a `wrangler deploy` mid-turn), and asserts recovery fires on Cloudflare's edge, then always deletes the Worker.
- `wrangler.deployed.jsonc`: uniquely-named Worker for the suite.
- `deployed-recovery.test.ts`: deploy -> start a hanging turn -> redeploy to evict -> poll until a recovery incident opens; resilient deploy() retry and a send+check turn-start poll (a fresh workers.dev route drops the first WS handshakes during cold start).
- `vitest.deployed.config.ts` + `test:e2e:deployed` script.
- `ChatHangingRecoveryAgent`: its turn hangs forever so it is guaranteed in-flight when the (slow ~18s) redeploy lands; a finite mock turn would complete first and leave nothing to recover.
Double-gated so it never runs in normal CI: its own config (not in `test`/`test:e2e`) plus a `RUN_DEPLOYED_E2E=1` body gate.
Validated green twice against a real account (~70-76s/run) with no leftover resources; typecheck (111) clean; the new DO class + migration v6 are additive (local config dry-run validates).
Co-authored-by: Cursor <cursoragent@cursor.com>
… slice 2d)
First behavior-changing slice of the chat-recovery convergence. AIChatAgent now replays the live "recovering…" status on connect, matching @cloudflare/think. Before this, ai-chat only broadcast cf_agent_chat_recovering live, so a client that connected during the gap between a scheduled recovery continuation and its first chunk saw nothing and appeared frozen until the turn resumed or failed.
- onConnect now sends the recovering frame on the no-active-stream branch via a new _buildRecoveringConnectFrame() (an actively-streaming continuation still gets STREAM_RESUMING, so the two signals never collide). Stale records past the flag TTL are skipped; terminal outcomes still clear it. Mirrors Think's _buildIdleConnectMessages exactly.
- No client change needed: react.tsx isRecovering already handles the frame whenever it arrives (its doc already described on-connect replay for Think).
- Deterministic unit coverage (getRecoveringConnectFrameForTest); corrected the chat-recovering-status e2e doc (it documented the OLD no-replay behavior) and added an opportunistic real-socket on-connect observation.
Validation: 683 ai-chat unit tests green; full local SIGKILL e2e 10/10 green (no regression in the hot connect path); deployed real-edge e2e green (self-cleaning); pnpm run check green (111 projects). Minor changeset shipped (user-visible state).
Co-authored-by: Cursor <cursoragent@cursor.com>
…ing-continue rendering follow-up Enables `chatRecovery = true` on the examples/ai-chat ChatAgent so the showcase demonstrates Durable Object eviction recovery (and the new recovering-on-connect replay) out of the box. AIChatAgent defaults chatRecovery to false, so the example previously couldn't recover an interrupted turn — kill wrangler dev mid-stream and only the last-flushed partial survived. README updated to match. Also records a deferred recovery-UX follow-up in the RFC progress log: on the recovery continue path with a reasoning model, new reasoning emitted after a partial text briefly renders as a second reasoning block under the content (forwarded reasoning-start creates a new client part) before the final persisted message merges it back on top. Pre-existing and an AI SDK v6 protocol limitation; tracked, not fixed, per smoke-test review. Co-authored-by: Cursor <cursoragent@cursor.com>
… shared engine Both AIChatAgent and Think duplicated _updateChatRecoveryIncident near byte-for-byte: the incident state-machine transition that deletes a completed record (else persists), emits the completed/skipped/failed lifecycle event, and drives the #1620 "recovering…" flag. Hoist it into ChatRecoveryEngine.updateIncident — the transition twin of beginIncident — so the state-machine shape lives in one place. Two new adapter hooks carry the package-owned I/O: deleteIncident(key) and setRecovering(active, requestId?) (the latter delegates to each package's existing _setChatRecovering, so its staleness/idempotency/broadcast logic stays package-owned). ChatRecoveryIncidentEvent widens to the five recovery event types with an optional reason; emitRecoveryEvent forwards the cause for skipped/failed. Both _updateChatRecoveryIncident are thin adapter bindings now (~50 lines deleted from each package). Zero behavior change. Key derivation verified byte-identical so the engine can compute the key itself; getIncident normalizes undefined->null under the engine's truthy guard; _emit's payload is an untyped Record so the widened event union carries no per-type risk; _chatRecoveryIncidentKey stays referenced by the resume-handshake paths. 6 new updateIncident fake-adapter tests (agents chat project 22 green); ai-chat 683 + think suites green; typecheck 111; pnpm run check clean. Co-authored-by: Cursor <cursoragent@cursor.com>
…gents/chat Phase 3 start. The incident lifecycle is already shared for Think (slices 2b/2c/2e); Phase 3 is the deeper Think-only recovery surface. The stall watchdog is the #1 convergence-matrix item, so it is the natural shared foundation. Extract Think's _iterateWithStallWatchdog + ChatStreamStalledError verbatim into packages/agents/src/chat/stall-watchdog.ts and re-export them (@internal) from agents/chat. The watchdog generator never referenced `this`, so it lifts to a free function with no seam: Think now imports both, its two stream-loop call sites drop the `this.` prefix, and the two `instanceof ChatStreamStalledError` read-loop catches are unaffected (the thrown error is the same imported class, so instanceof still holds). The onStall closures stay inline at Think's call sites, so package-specific abort/emit stays package-owned — only the generic race/timeout/cancel mechanics moved. Zero behavior change. 4 Layer-2 unit cases added (disabled-passthrough, fast passthrough, stall->throw+onStall-once, consumer-break->source-cancel); Think 686 + e2e suites green; ai-chat untouched (additive export, confirmed by typecheck + check:exports); repo typecheck 111; pnpm run check clean. No changeset. Slice 3b wires AIChatAgent onto this primitive (the user-visible behavior change + changeset). Co-authored-by: Cursor <cursoragent@cursor.com>
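The "generic race/timeout/cancel mechanics" can be sketched as a free async generator. This is an approximation written for illustration, not the extracted `stall-watchdog.ts`; the function name, parameter order, and error class below are assumptions. It covers the four behaviors the unit cases name: disabled passthrough, fast passthrough, stall -> throw + onStall once, and consumer break -> source cancel.

```typescript
// Hypothetical stand-in for ChatStreamStalledError.
class StreamStalledError extends Error {
  constructor(timeoutMs: number) {
    super(`stream produced no chunk for ${timeoutMs}ms`);
    this.name = "StreamStalledError";
  }
}

async function* iterateWithStallWatchdogSketch<T>(
  source: AsyncIterable<T>,
  timeoutMs: number,
  onStall?: () => void
): AsyncGenerator<T> {
  // timeoutMs <= 0 disables the watchdog: plain passthrough.
  if (timeoutMs <= 0) {
    yield* source;
    return;
  }

  const iterator = source[Symbol.asyncIterator]();
  let finished = false;
  try {
    while (true) {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const stalled = new Promise<"stalled">((resolve) => {
        timer = setTimeout(() => resolve("stalled"), timeoutMs);
      });

      let result: IteratorResult<T> | "stalled";
      try {
        // Race the next chunk against the stall timer.
        result = await Promise.race([iterator.next(), stalled]);
      } finally {
        clearTimeout(timer); // never leak a timer, whichever side won
      }

      if (result === "stalled") {
        onStall?.(); // package-specific abort/emit stays with the caller
        throw new StreamStalledError(timeoutMs);
      }
      if (result.done) {
        finished = true;
        return;
      }
      yield result.value;
    }
  } finally {
    // On stall, error, or consumer break: ask the source to cancel.
    // Not awaited, because a hung source may never settle its return().
    if (!finished) void iterator.return?.()?.catch(() => {});
  }
}
```

Because the generator takes `onStall` as a closure and never touches `this`, a caller can keep its own abort and event emission inline at the call site, which is the property the commit message relies on.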
Wire AIChatAgent's _streamSSEReply read loop through the shared iterateWithStallWatchdog (from slice 3a) behind a new opt-in `chatStreamStallTimeoutMs` field (default 0 = off, matching @cloudflare/think). When a stall fires and chatRecovery is enabled, _routeStallToBoundedRecovery opens/reuses the incident under the turn's recovery identity and schedules a _chatRecoveryContinue (or delivers terminal UX once the budget is spent); with recovery off the stall stays a terminal stream error (kills the spinner). Correctness: the partial is persisted from the in-memory `message` via the normal post-stream persistence path (not reconstructed from stored chunks), because on a live stall the stored-chunk buffer can lag the in-memory parts and the orphan reconstructor comes back empty — which would lose the user's partial (the exact #1626 complaint). The continuation re-anchors via targetAssistantId. Tests: 2 Layer-3 integration cases in durable-chat-recovery.test.ts (hanging-SSE onChatMessage mode + driveStallingTurnForTest helper) — stall routes into a continue incident with the partial persisted, and timeout 0 disables the watchdog. ai-chat 685 green; repo check + typecheck (111) clean. Changeset: @cloudflare/ai-chat minor. Co-authored-by: Cursor <cursoragent@cursor.com>
Add an optional `tryHandleNonChatFiberRecovery(ctx)` hook to ChatRecoveryAdapter plus `ChatRecoveryEngine.handleNonChatFiber(ctx)`. Think now routes its messenger/workflow reply-fiber dispatch (_messengerRuntime.handleFiberRecovery) through the engine seam at the top of _handleInternalFiberRecovery instead of calling the runtime directly; AIChatAgent calls the same seam as a structural no-op (it omits the hook, so every recovered fiber stays a chat candidate). The engine now owns the ordering invariant (non-chat dispatch runs BEFORE the chat-fiber-name gate, so a messenger fiber is never misread as an orphaned chat turn); the behavior stays adapter-owned. Byte-equivalent: the prior `if (await _messengerRuntime?.handleFiberRecovery(ctx))` becomes `if (await engine.handleNonChatFiber(ctx))` = `(await hook?.(ctx)) ?? false` — same truthiness, same undefined->skip when no messenger runtime (child facet). FiberRecoveryContext is imported `import type` from ../index (same package, erased — no cycle). Tests: 3 Layer-2 fake-adapter cases (consume->true, decline->false, omitted->false). agents chat project 25 green (was 22), Think 686 + e2e green, ai-chat 685 green, repo typecheck 111, check clean. No changeset (zero behavior change; @internal seam). Also reframes Phase 3 in the RFC: the deep surface map showed durable submissions, agent-tool child-run reconcile, and resume-ACK orphan persist are already correctly adapter-owned, so their convergence is Phase 4 dedup (not new seams) — moved there to avoid indirection without payoff on the risky wake path. Co-authored-by: Cursor <cursoragent@cursor.com>
…+ record e2e verification Before starting Phase 4, re-verified slices 3a/3b/3c with a deep review and the real-`wrangler dev` suites. Review found a coverage gap: no test exercised a *healthy* (non-stalling) stream with the stall watchdog armed (`chatStreamStallTimeoutMs > 0`). The guarded `pull()` path must pass healthy streams through unchanged and clear its timer on completion. Added an ai-chat integration test for exactly that. Also confirmed: `AIChatAgent._routeStallToBoundedRecovery` is structurally byte-equivalent to Think's; and the continuation re-anchor id is safe because the tool-approval early-persist writes under `sanitized.id === message.id`, so `earlyPersistedId === message.id` and the leaf-check cannot mis-skip. Verification run (all green): ai-chat real `wrangler dev` + SIGKILL recovery e2e (5 files / 10 tests), shared agents/chat watchdog + recovery-engine units (29), Think messengers (27), full Think workers (686), full ai-chat workers (608, incl. the new test). Environment blocker (not code): the Think real-`wrangler dev` e2e binds Workers AI with `"remote": true` and needs live wrangler auth (expired here); deferred with the deployed e2e to the Phase 6 merge gate. RFC progress log updated. Co-authored-by: Cursor <cursoragent@cursor.com>
…rface map Maps the remaining duplication into four ordered slices (4a shared types + key/sweep helpers; 4b centralize the schedule-recovery triplet; 4c stable-timeout incident mutations; 4d collapse the ~280-line _handleInternalFiberRecovery bodies into an engine dispatch skeleton). Ordered low-risk -> high-risk so each ships behind its own review + e2e gate; 4d (the wake path) does not start until 4a-4c are green. Co-authored-by: Cursor <cursoragent@cursor.com>
…pers
First Phase 4 (deduplication) slice: the mechanical, zero-behavior band. Both `@cloudflare/think` and `@cloudflare/ai-chat` re-declared symbols that already exist canonically in `agents/chat` (`recovery-incident.ts`):
- the `ChatRecoveryIncident` type and `ChatRecoveryKind` alias,
- a local `CHAT_RECOVERY_INCIDENT_KEY_PREFIX` const,
- a `_chatRecoveryIncidentKey` method (100% duplicate of `chatRecoveryIncidentKey`),
- an inline stale-key loop inside `_sweepStaleChatRecoveryIncidents` (a reimplementation of `selectStaleIncidentKeys`),
- a local `CHAT_RECOVERY_INCIDENT_TTL_MS` const.
Replace all of these with the shared symbols in both packages: import the canonical type/kind/prefix + `chatRecoveryIncidentKey` + `selectStaleIncidentKeys`, route the two stable-timeout call sites through `chatRecoveryIncidentKey(...)`, and collapse each sweep to `selectStaleIncidentKeys(entries, now)`.
Zero behavior change: the canonical type is byte-identical to both local copies; the prefix string is unchanged; and the sweep TTL the shared helper applies (`60*60*1000`) matches the local constant it replaces (verified before deleting the now-unused locals). Net -86 lines of duplication.
Verification: repo typecheck (111 projects) + full `pnpm run check` clean; Think workers 686 and ai-chat workers 608 unchanged; ai-chat real-`wrangler dev` SIGKILL recovery e2e (offline-safe) re-run green. No changeset (internal seam, zero behavior). Think real-edge e2e remains gated on the Phase 6 merge gate (its Workers AI binding needs a stable remote session).
Co-authored-by: Cursor <cursoragent@cursor.com>
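For readers unfamiliar with the two helpers being deduplicated, here is a rough sketch of what a key helper and a stale-key selector of this kind might look like. The prefix value, the record shape, and the choice of `lastProgressAt ?? firstSeenAt` as the staleness timestamp are assumptions; only the `(entries, now)` call shape, the `encodeURIComponent` step, and the one-hour TTL come from the commit messages.

```typescript
const INCIDENT_TTL_MS = 60 * 60 * 1000; // one hour, as quoted above
const KEY_PREFIX = "example:chat-recovery:incident:"; // hypothetical prefix value

// Hypothetical record shape; only the two timestamps matter here.
interface StoredIncident {
  firstSeenAt: number;
  lastProgressAt?: number;
}

// Pure key derivation: prefix + URI-encoded incident id.
function incidentKeySketch(incidentId: string): string {
  return KEY_PREFIX + encodeURIComponent(incidentId);
}

// Pure selection: given a storage listing and a clock reading, return the
// keys the caller should delete. The caller still owns the delete I/O.
function selectStaleKeysSketch(
  entries: Map<string, StoredIncident>,
  now: number
): string[] {
  const stale: string[] = [];
  entries.forEach((incident, key) => {
    // Assumption: staleness is measured from the most recent activity.
    const lastTouched = incident.lastProgressAt ?? incident.firstSeenAt;
    if (now - lastTouched > INCIDENT_TTL_MS) stale.push(key);
  });
  return stale;
}

const now = 3 * 60 * 60 * 1000; // a fixed "clock" three hours in
const entries = new Map<string, StoredIncident>([
  [incidentKeySketch("req/old"), { firstSeenAt: 0 }],
  [incidentKeySketch("req/new"), { firstSeenAt: now - 1_000 }],
]);

console.log(selectStaleKeysSketch(entries, now));
// [ 'example:chat-recovery:incident:req%2Fold' ]
```

Keeping the selector pure means each package's sweep reduces to "list, select, delete", and the sweep-before-read ordering stays a caller responsibility.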
The `updateIncident("scheduled")` + `_emit("chat:recovery:scheduled")` +
`schedule(0, callback, data, chatRecoverySchedulePolicy("initial"))` block was
copied at 7 call sites (AIChat: stall + 3 fiber; Think: stall + 2 fiber).
Collapse it into one engine method `ChatRecoveryEngine.scheduleRecovery({
incident, recoveryKind, callback, data, reason? })` that owns the transition →
emit → enqueue order, behind a new `ChatRecoveryAdapter.scheduleRecovery` hook
(each package: `schedule(0, callback, data, chatRecoverySchedulePolicy(reason))`).
Widen `ChatRecoveryIncidentEvent["type"]` with `"chat:recovery:scheduled"` so the
emit flows through the existing `emitRecoveryEvent` mapping (byte-identical
payload). `recoveryKind` is passed explicitly because AIChat's lost-partial
branch opens a `continue` incident but schedules + reports a `retry`;
`requestId` is read off the incident (the evaluation rewrites it to the current
attempt). Behavior-preserving; the stable-timeout reschedule (non-idempotent
direct put) is untouched — that is slice 4c.
Tests: +4 engine unit tests (order, explicit recoveryKind override, default
reason, verbatim payload); engine unit 29, typecheck 111, check, Think workers
686, ai-chat workers 686, ai-chat real wrangler-dev SIGKILL e2e 10/10. Think
real-edge e2e still gated on Phase 6 (remote Workers AI binding needs re-auth).
Co-authored-by: Cursor <cursoragent@cursor.com>
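The transition, emit, enqueue order that `scheduleRecovery` owns can be sketched as below. This is an illustration under assumptions: the adapter shape, the synchronous hooks, and `"initial"` as the default reason are hypothetical; only the event string and the argument names (`incident`, `recoveryKind`, `callback`, `data`, `reason?`) come from the commit message.

```typescript
// Hypothetical sketch of the transition -> emit -> enqueue order.
// Hooks are synchronous here purely to keep the example small.
type RecoveryKindSketch = "retry" | "continue";

interface IncidentSketch {
  requestId: string;
  status: string;
}

interface ScheduledEventSketch {
  type: "chat:recovery:scheduled";
  requestId: string;
  recoveryKind: RecoveryKindSketch;
}

interface SchedulingAdapterSketch {
  updateIncident(incident: IncidentSketch, status: "scheduled"): void;
  emit(event: ScheduledEventSketch): void;
  scheduleRecovery(callback: string, data: unknown, reason: string): void;
}

function scheduleRecoverySketch(
  adapter: SchedulingAdapterSketch,
  args: {
    incident: IncidentSketch;
    recoveryKind: RecoveryKindSketch;
    callback: string;
    data: unknown;
    reason?: string;
  }
): void {
  // 1. Persist the transition first.
  adapter.updateIncident(args.incident, "scheduled");
  // 2. Emit with the explicit kind; requestId is read off the incident.
  adapter.emit({
    type: "chat:recovery:scheduled",
    requestId: args.incident.requestId,
    recoveryKind: args.recoveryKind
  });
  // 3. Enqueue last. The "initial" default is an assumption of this sketch.
  adapter.scheduleRecovery(args.callback, args.data, args.reason ?? "initial");
}
```

Passing `recoveryKind` separately from the incident is what allows a caller to open a `continue` incident while scheduling and reporting a `retry`.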
`_rescheduleRecoveryAfterStableTimeout` was byte-identical in both packages: read the incident, and if under the attempt cap, bump `attempt`, persist `scheduled`/`stable_timeout_retry`, and issue a NON-idempotent delayed schedule (it runs inside the executing one-shot row, so an idempotent reschedule would dedup onto the doomed row and never fire). Lift it into `ChatRecoveryEngine.rescheduleAfterStableTimeout`; each package method is now a one-line delegation. Generalize the 4b `ChatRecoveryAdapter.scheduleRecovery` hook to carry `delaySeconds` (initial triplet passes 0, the reschedule passes CHAT_RECOVERY_STABLE_RETRY_DELAY_SECONDS) so one schedule seam serves both.
Also remove the private `CHAT_RECOVERY_STABLE_RETRY_DELAY_SECONDS = 3` each package kept shadowing the canonical agents/chat constant — the engine now uses the shared one.
Re-scope: the give-up seal (`_exhaustRecoveryGiveUp` / `_exhaustRecoveryAfterStableTimeout`, ~80% dup) moves from 4c to 4d. Its shared spine interleaves package-specific terminalization + stream/partial reads behind the #1730/#1645 exactly-once invariants and needs ~5 adapter hooks that are exactly 4d's terminalize + stream surface; building them once (in 4d) avoids inconsistent seams.
Tests: +5 engine unit tests (attempt bump + delayed non-idempotent enqueue, missing id, no record, budget spent, maxAttempts fallback); engine unit 34, typecheck 111, check, Think workers 686, ai-chat workers 686, ai-chat real wrangler-dev SIGKILL e2e 10/10, Think remote-Workers-AI recovery e2e 6/6.
Co-authored-by: Cursor <cursoragent@cursor.com>
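A minimal sketch of the reschedule decision this commit describes, assuming a hypothetical synchronous adapter. The record shape and the adapter method names are illustrative; the attempt cap, the `scheduled` / `stable_timeout_retry` values, the 3-second delay, and the non-idempotent enqueue are taken from the message.

```typescript
// Hypothetical sketch; adapter and record shapes are assumptions.
const STABLE_RETRY_DELAY_SECONDS_SKETCH = 3; // value quoted in the commit message

interface StoredIncidentSketch {
  attempt: number;
  status: string;
  reason?: string;
}

interface RescheduleAdapterSketch {
  readIncident(id: string): StoredIncidentSketch | undefined;
  writeIncident(id: string, incident: StoredIncidentSketch): void;
  schedule(delaySeconds: number, options: { idempotent: boolean }): void;
}

// Returns true when a retry was enqueued, false when the caller should give up.
function rescheduleAfterStableTimeoutSketch(
  adapter: RescheduleAdapterSketch,
  incidentId: string | undefined,
  maxAttempts: number
): boolean {
  if (!incidentId) return false; // missing id
  const incident = adapter.readIncident(incidentId);
  if (!incident) return false; // no record
  if (incident.attempt >= maxAttempts) return false; // budget spent
  adapter.writeIncident(incidentId, {
    ...incident,
    attempt: incident.attempt + 1,
    status: "scheduled",
    reason: "stable_timeout_retry"
  });
  // Non-idempotent on purpose: this runs inside the executing one-shot row,
  // so an idempotent reschedule would dedup onto the row that is about to end.
  adapter.schedule(STABLE_RETRY_DELAY_SECONDS_SKETCH, { idempotent: false });
  return true;
}
```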
Both packages ran a byte-identical ~80% give-up spine to terminalize a recovery turn whose retry budget drained (#1645): resolve config -> read the stored incident (best-effort) -> `exhausted` re-entry guard -> build the exhausted incident (reuse or synthesize) -> resolve streamId/partial -> `_exhaustChatRecovery` (terminalize) BEFORE a best-effort seal write. Lift it into `ChatRecoveryEngine.exhaustRecoveryGiveUp({ callback, data, reason })` behind 5 adapter hooks (exhaustChatRecovery, resolveRecoveryStreamId, getPartialStreamText, activeChatRecoveryRootRequestId, onGiveUpBookkeepingError); each package method is now a one-line delegation.
The only divergences are caller parameters: `reason` (Think stable_timeout|recovery_error; AIChat always stable_timeout) and the root-id chain (Think includes recoveredRequestId; AIChat's payload type has none, so the unified chain collapses identically). The terminalize-before-seal ordering (#1730) is preserved and pinned by a test, and the give-up's terminalize + stream/partial hooks are exactly the surface slice 4d-2 reuses, so this de-risks it.
Reading both `_handleInternalFiberRecovery` bodies in full beforehand led to splitting 4d in the plan: the bodies are ~70% structurally similar but the meaty logic has legitimately diverged, so 4d-2 (the wake-frame collapse for Phase 5 genericity) is gated behind a seam-design review.
Tests: engine unit 42 (+9); typecheck 111; check; ai-chat workers 686; Think recovery workers 285; Think remote-Workers-AI give-up e2e 6/6; ai-chat real-wrangler-dev SIGKILL give-up e2e 9/9. Internal @internal seam, zero behavior change -> no changeset.
Co-authored-by: Cursor <cursoragent@cursor.com>
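The ordering this commit pins (re-entry guard, then terminalize, then a best-effort seal) can be sketched as follows. The hook names and return values are hypothetical and the hooks are synchronous for brevity; the real spine also resolves config, stream id, and partial text, which are omitted here.

```typescript
// Hypothetical sketch of the give-up ordering only.
interface GiveUpHooksSketch {
  readIncident(): { status: string } | undefined; // best-effort read
  terminalize(): void; // errors propagate to the caller
  writeSeal(): void; // best-effort seal write
  onBookkeepingError(error: unknown): void;
}

function exhaustGiveUpSketch(
  hooks: GiveUpHooksSketch
): "already_exhausted" | "terminalized" {
  let stored: { status: string } | undefined;
  try {
    stored = hooks.readIncident();
  } catch (error) {
    hooks.onBookkeepingError(error); // a failed read must not block the give-up
  }
  // Re-entry guard: a sealed incident is not terminalized a second time.
  if (stored !== undefined && stored.status === "exhausted") {
    return "already_exhausted";
  }
  // Terminalize BEFORE the seal, so a failing seal write cannot hide the
  // terminal outcome from the user.
  hooks.terminalize();
  try {
    hooks.writeSeal();
  } catch (error) {
    hooks.onBookkeepingError(error);
  }
  return "terminalized";
}
```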
Lift the fiber-recovery wake FRAME into a single reusable engine method so
a third (pi) adapter can drive deploy/crash recovery through the SAME
engine. The value here is genericity, not dedup: the bulk of the logic
stays package-owned in the decision hook, while the engine gains one
linear wake lifecycle.
Engine: add ChatRecoveryEngine.handleChatFiberRecovery(ctx, wake) owning
chat-fiber gate -> requestId parse -> snapshot unwrap -> stream/partial
resolution -> recovery-kind classification -> beginIncident -> exhausted
branch -> onChatRecovery -> persist + complete -> decision ->
catch -> updateIncident("failed") -> rethrow. The package-specific decision
lives behind a method-scoped ChatFiberWakeHooks<TClassify> object passed as
the second arg, NOT bolted onto ChatRecoveryAdapter, so the give-up-spine
adapter and its five unit-test fakes stay focused. TClassify is inferred
from the hooks (no class-level generic, no any/unknown casts).
Dedup: the byte-identical _partialHasSettledToolResults lifts to one shared
pure partialHasSettledToolResults(parts) in agents/chat; both packages drop
their private copy (zero behavior change). Think and AIChatAgent each
collapse _handleInternalFiberRecovery to a one-line delegation and
implement the hooks as private methods. Think keeps its submission
lifecycle + session-leaf + _handleRecoveryCallbackError inside
dispatchRecoveredTurn; AIChatAgent is leaf-only and returns
streamStatus: undefined (terminal-stream handling stays absent per the
"substrate capabilities are optional" decision -- reading status would be a
behavior change).
Records the load-bearing "substrate capabilities are optional, not shared
requirements" decision under Genericity, and reconciles the
classifyRecoveredTurn / dispatchRecoveredTurn / resolveRecoveryStream hook
names with the sibling Think Turns/Actions RFCs.
Tests: full check (sherif + exports + oxfmt + oxlint + typecheck 111) green;
agents 1989; ai-chat 686; think full chain green; new engine unit tests for
handleChatFiberRecovery + partialHasSettledToolResults; local wrangler-dev
SIGKILL e2e -- ai-chat 10/10, think 26 passed + 4 skipped. The only e2e red
is reattach-budget.test.ts, the documented expected-RED regression gate for
the unrelated wall-clock re-attach-budget bug (manual think-e2e project, not
the CI gate) -- untouched by this slice. Internal @internal seam, zero
behavior change -> no changeset.
Co-authored-by: Cursor <cursoragent@cursor.com>
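The "method-scoped hooks with an inferred `TClassify`" design can be illustrated with a reduced sketch. Only the hook names `classifyRecoveredTurn` / `dispatchRecoveredTurn` come from the commit message; the engine class, the string-based inputs, and the return type are assumptions, and the real wake lifecycle has many more steps.

```typescript
// Hypothetical sketch: the generic lives on the method, not on the class.
interface WakeHooksSketch<TClassify> {
  classifyRecoveredTurn(partialText: string): TClassify;
  dispatchRecoveredTurn(classification: TClassify): string;
}

class WakeEngineSketch {
  // TClassify is inferred per call from the hooks object, so the engine
  // needs no class-level generic and no casts.
  handleWake<TClassify>(
    partialText: string,
    hooks: WakeHooksSketch<TClassify>
  ): string {
    const classification = hooks.classifyRecoveredTurn(partialText);
    return hooks.dispatchRecoveredTurn(classification);
  }
}

// Usage: the classification type is whatever this host's classifier returns.
const wakeEngine = new WakeEngineSketch();
const wakeOutcome = wakeEngine.handleWake("partial answer", {
  classifyRecoveredTurn: (partialText: string) => ({
    hasPartial: partialText.length > 0
  }),
  dispatchRecoveredTurn: (classification) =>
    classification.hasPartial ? "continue" : "retry"
});
console.log(wakeOutcome);
```

Because the hooks object is an argument rather than part of a shared adapter interface, a host with a different classification vocabulary can supply its own type without widening anything the other hosts see.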
…nts/chat
A Phase 4 confidence pass (exit-criteria audit + release reviewer checklist) confirmed every recovery ORCHESTRATION engine routes through ChatRecoveryEngine and no public hook signatures changed, but found a small cluster of byte-identical LEAF host-I/O helpers still duplicated across both packages. This slice lifts them into shared agents/chat free functions, leaving each package a thin binding.
recovery-incident.ts gains:
- sweepStaleChatRecoveryIncidents(storage, now) — owns list-by-prefix + TTL select + the batched KV_DELETE_MAX_KEYS delete loop.
- readChatRecoveryProgress(storage) / bumpChatRecoveryProgress(storage) — the durable monotonic no-progress counter.
- AgentToolStreamProgressThrottle + AGENT_TOOL_STREAM_PROGRESS_BUMP_THROTTLE_MS — the N9 parent-progress-credit throttle.
Storage params are typed Pick<DurableObjectStorage, ...> so this.ctx.storage passes with no cast and the helpers stay unit-testable with a fake.
AIChatAgent and Think both: dropped their duplicated _sweepStaleChatRecoveryIncidents (hook points straight at the shared fn), turned _chatRecoveryProgressMarker / _bumpChatRecoveryProgress into one-line bindings, replaced the in-memory _lastAgentToolStreamProgressAt field + inline throttle with new AgentToolStreamProgressThrottle(), and deleted their local duplicate CHAT_RECOVERY_PROGRESS_KEY, AGENT_TOOL_STREAM_PROGRESS_BUMP_THROTTLE_MS, and KV_DELETE_MAX_KEYS constants.
_resolveRecoveryStreamId is deliberately LEFT package-local: lifting it would feed ResumableStream into the engine for ~6 lines, the hook-bloat inversion the 4d-2 fallback warned against. Also extended the @internal barrel comment in chat/index.ts to cover the recovery-engine / stall-watchdog blocks.
Zero behavior change: the throttle gate is identical (a fresh isolate's first forwarded chunk still credits because production `now` is a large epoch >> the window; a unit test pins exactly that).
Tests: 9 new recovery-incident unit tests (sweep prefix-scoping + no-op + 128-batching; progress read/increment; throttle credit/throttle windows); full check (sherif + exports + oxfmt + oxlint + typecheck 111) green; agents / ai-chat / think suites green; local wrangler-dev SIGKILL e2e -- ai-chat 10/10, think chat-recovery + stall-recovery green. Only e2e red remains the documented expected-RED reattach-budget gate (unrelated wall-clock budget; untouched). Internal @internal seam, zero behavior change -> no changeset.
Co-authored-by: Cursor <cursoragent@cursor.com>
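The throttle gate mentioned in this commit (a fresh isolate's first chunk credits because `now` is a large epoch value) can be sketched like this. The class name, the `shouldCredit` method, and the 1000 ms window used in the usage lines are assumptions; the source names the constant but not its value.

```typescript
// Hypothetical sketch of a "credit at most once per window" throttle.
class ProgressThrottleSketch {
  // Starts at 0, so the first call in a fresh isolate credits as long as
  // `now` is an epoch timestamp much larger than the window.
  private lastCreditAt = 0;

  constructor(private readonly windowMs: number) {}

  shouldCredit(now: number): boolean {
    if (now - this.lastCreditAt < this.windowMs) return false;
    this.lastCreditAt = now;
    return true;
  }
}

// Usage with an assumed 1000 ms window and an injected clock value.
const throttleSketch = new ProgressThrottleSketch(1000);
const startedAt = 1700000000000;
console.log(throttleSketch.shouldCredit(startedAt)); // first chunk credits
console.log(throttleSketch.shouldCredit(startedAt + 500)); // inside the window
console.log(throttleSketch.shouldCredit(startedAt + 1000)); // window elapsed
```

Taking `now` as a parameter rather than calling `Date.now()` inside keeps the helper unit-testable, in line with the injected-clock approach the recovery helpers already use.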
…ion surface
Before starting Phase 5 (pi), reviewed the chat machinery across AIChatAgent and Think for what else should move into agents/chat, and recorded the result in the RFC. Plan/docs only — no code change.
Adds:
- "Chat-layer extraction map" section: a four-surface review (message persistence; stream lifecycle + broadcast; inbound request/connection; tool/HITL/terminal) sorted into three tiers. Tier 1 = safe leaf dedup (new Slice 4f); Tier 2 = structural seams the pi adapter should DRIVE during Phase 5 (resume/reconnect handshake + streaming-loop codec); Tier 3 = keep-package-specific (storage model, Think submissions/codemode/media/repair, ai-chat persisted-cache/migration, _persistOrphanedStream id-merge, boot ordering, request-context glue).
- Slice 4f in the Phase 4 slice plan, split by risk: 4f-i (pure leaf lifts — dup'd CHAT_*/STREAM_CLEANUP_* constants, sendIfOpen, terminal KV trio, recovering flag, _getPartialStreamText, stream-cleanup pair, _hasIncompleteToolBatch, the client-interaction predicates) and 4f-ii (behavior-sensitive convergences — ai-chat's local enforceRowSizeLimit reimpl; ai-chat's inline parse vs the shared parseProtocolMessage). Each item carries a verify-byte-equivalence-first gate.
- Better-behavior convergence decision: adopt Think's event-driven, no-timeout, stream-gated parallel-tool barrier (#1650) in AIChatAgent, dropping its in-turn 60s force-continue (#1649). Scoped honestly as a substantial AIChatAgent rearchitecture (barrier out of the turn, new double-fire guard, new SSE-loop finalize hook, reconcile the _continuation machinery), with a deploy-mid-park e2e requirement; semver-minor, changeset required.
- Phase 5 reframed to name the Tier-2 extractions as pi-driven.
Code-grounded confidence pass folded in (verified against the source, not just the review): _hasIncompleteToolBatch and the client-interaction predicates are byte-identical; the shared constant values match (no migration risk). Corrected the auto-continuation scope, pulled enforceRowSizeLimit + parseProtocolMessage out of the zero-behavior bucket, and baked the verify-first gate into all of 4f.
Sequencing: Slice 4f (4f-i then 4f-ii) -> auto-continuation convergence -> Phase 5 (pi drives Tier 2).
Co-authored-by: Cursor <cursoragent@cursor.com>
…nto agents/chat
Slice 4f-i from the chat-layer extraction map: lift the cluster of
near-duplicate leaf helpers that were still copy-maintained across
`AIChatAgent` and `Think`, outside the recovery orchestration engine. Pure
leaf lifts only — zero behavior change, no changeset. The behavior-sensitive
4f-ii items (ai-chat's local enforceRowSizeLimit, the parseProtocolMessage
migration) and the auto-continuation convergence are NOT in this slice.
Verify-first gate (re-diffed at execution time; 2026-06 line numbers had
drifted, so matched by method name): all eight items confirmed byte-equivalent
modulo comments before lifting —
- sendIfOpen / isWebSocketClosedSendError (identical in both packages and a
third copy in continuation-state.ts)
- _getPartialStreamText (one-word comment diff)
- _partAwaitsClientInteraction / _toolPartName / _clientResolvableToolNames
(docblock-only diff)
- _hasIncompleteToolBatch (identical incl. inline comment)
- terminal KV trio _recordChatTerminal / _clearChatTerminal /
_pendingChatTerminal (identical)
- stream-cleanup pair _ensureStreamCleanupScheduled / _cleanupStreamBuffers
(one extra comment clause in Think)
- _setChatRecovering + the recovering-frame builder (identical apart from the
recovering wire-type enum and broadcast wrapper, exactly as predicted)
Landed in agents/chat:
- new connection.ts: sendIfOpen / isWebSocketClosedSendError + a ChatConnection
minimal type. Also deduped continuation-state.ts's private third copy.
- message-builder.ts: getPartialStreamText (over the resumable-stream chunk
reader; composes the shared applyChunkToParts).
- tool-state.ts: hasIncompleteToolBatch + partAwaitsClientInteraction /
clientResolvableToolNames / toolPartName. The broad-vs-client-only asymmetry
stays in each package's hasPendingInteraction / hasPendingClientInteraction
wrapper (both call the identical leaf), so the wrappers stay package-local.
- resumable-stream.ts: STREAM_CLEANUP_DELAY_SECONDS + cleanupStreamBuffers.
- recovery-incident.ts: recordChatTerminal / clearChatTerminal /
pendingChatTerminal + buildChatRecoveringFrame + setChatRecovering (storage
glue home, same precedent as 4e's sweep/progress helpers).
Both packages are now thin per-package bindings. The only per-package
divergence — the recovering wire-type enum (CF_AGENT_CHAT_RECOVERING vs
MSG_CHAT_RECOVERING) and the _broadcastChatMessage / _broadcastChat wrapper —
is threaded as params; the broadcast wrappers themselves stay package-local.
The duplicated CHAT_RECOVERING_KEY / CHAT_LAST_TERMINAL_KEY /
CHAT_RECOVERING_FLAG_TTL_MS local constants were deleted outright (no remaining
direct references once the helpers absorbed them).
Deep review (zero-behavior confirmation): storage keys + values unchanged
(cutover-safe; shared constants are the same strings); wake/hibernation
ordering unchanged (bindings issue the same storage ops in the same order);
stream-cleanup re-arm stays non-idempotent (rearm passes { idempotent: false },
invariant documented on the shared fn); recovering set/clear stays
idempotent-on-active-existing; terminal-before-seal and settled tool results
untouched; observability/recovering-frame payload shape identical;
setChatRecovering now uses a single injected `now` for both the staleness check
and the stored `at` (was two Date.now() calls microseconds apart — not
observable, and matches the engine's injected-clock seam). AIChat<->Think
parity: both now call the identical shared leaves.
Tests: pnpm run check (sherif/exports/oxfmt/oxlint/typecheck 111) green; agents
workers 1996, ai-chat workers 686, Think workers 52 + react 2 green; ai-chat
real-wrangler-dev SIGKILL e2e 10/10; Think chat-recovery + stall-recovery
SIGKILL e2e 6/6. The expected-RED reattach-budget e2e (unrelated wall-clock
budget) was left untouched.
Internal @internal seam (re-barrelled through chat/index.ts, not exported from
the agents root), zero behavior change -> no changeset.
Co-authored-by: Cursor <cursoragent@cursor.com>
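The "single injected `now` for both the staleness check and the stored `at`" point from the deep review can be shown with a small pure function. This is a sketch under assumptions: the flag shape, the function name, and the return value are hypothetical, and the real helper also handles storage and broadcast.

```typescript
// Hypothetical sketch: one clock reading drives both the check and the write.
interface RecoveringFlagSketch {
  at: number; // epoch milliseconds when the flag was set
}

function setRecoveringSketch(
  existing: RecoveringFlagSketch | undefined,
  now: number,
  ttlMs: number
): { flag: RecoveringFlagSketch; wrote: boolean } {
  // Idempotent on an active existing flag: keep it, write nothing.
  if (existing !== undefined && now - existing.at < ttlMs) {
    return { flag: existing, wrote: false };
  }
  // Missing or stale: store a fresh flag stamped with the SAME `now` that
  // was used for the staleness check above.
  return { flag: { at: now }, wrote: true };
}
```

With two separate clock reads the stored `at` could differ from the value the check saw by a few microseconds; injecting one `now` removes that gap and makes the function deterministic under test.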
…uctured compaction + annotations
The verify-first gate showed ai-chat's `_enforceRowSizeLimit` and the shared
`enforceRowSizeLimit` (which Think uses) were NOT a byte-identical lift — they
had drifted in two independent, observable ways, so this is a convergence (not a
4f-i leaf lift), with the correct behavior decided in the RFC and a changeset on
both packages.
1. Tool-output compaction shape. ai-chat replaced an oversized tool output with a
flat English summary string ("…too large to persist… Preview: …"), discarding
the shape; Think used the structured, shape-preserving `truncateToolOutput`.
Structured wins (a model can keep reasoning about a shape-preserving
truncation; the flat string is strictly lossier), so ai-chat now uses
`truncateToolOutput` and its summary string is gone.
2. Compaction annotations + warnings. ai-chat annotated
`metadata.compactedToolOutputs` / `compactedTextParts` and `console.warn`ed;
Think did neither. Annotate + warn on both (additive metadata lets a client
tell a stored row was compacted), so Think now emits them too.
Implemented by extending the shared `enforceRowSizeLimit` to own both the
structured compaction and the annotations, plus an optional `warn` hook
(`EnforceRowSizeLimitOptions`) so each package keeps its own log prefix
(`[AIChatAgent]` / `[Think]`). Both call sites are now thin bindings: ai-chat's
`_enforceRowSizeLimit` (and its `_truncateTextParts` + the now-unused
`chatByteLength` / `ROW_MAX_BYTES` imports are deleted) and a new Think `_rowSafe`
helper (folds in the `sanitizeMessage` it always pairs with, dedups three
identical call sites + the submission serializer).
Deep review: truncation thresholds already matched (both compact tool outputs
>1KB, both truncate text parts oldest→newest until they fit, both use the same
1.8MB ROW_MAX_BYTES byte-length guard incl. multibyte UTF-8); the only
value-level changes are ai-chat's tool-output text (summary → structured marker)
and Think's newly-present annotations; non-assistant messages still fall straight
to text truncation; metadata is merged (spread over existing), never clobbered;
the engine/recovery, hibernation/wake order, terminal-before-seal, and settled
tool results are untouched (this is a pure pre-storage serialization step).
ai-chat's row-size-guard.test.ts assertions that pinned the old summary string
were repointed at the structured "... [truncated N chars]" marker (the
compactedToolOutputs metadata assertion was already correct); Think's row-size
tests were already structure-shaped and unchanged.
Verification: pnpm run check (111 projects); agents workers 1996, ai-chat workers
686, Think workers 686; ai-chat SIGKILL e2e 10/10; Think SIGKILL e2e 11 files /
26 tests with the expected-RED reattach-budget gate (unrelated wall-clock budget)
left untouched. Two changesets (ai-chat minor: structured compaction; think
minor: compaction annotations/warnings).
Co-authored-by: Cursor <cursoragent@cursor.com>
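The converged behavior (structured truncation marker, merged `compactedToolOutputs` metadata, optional `warn` hook) can be sketched in reduced form. Everything here is illustrative: the row and part shapes are invented, tool outputs are plain strings, and size is measured in characters rather than UTF-8 bytes, which the real guard uses. Only the marker text, the metadata key, the >1KB threshold, and the existence of a `warn` option come from the commit message.

```typescript
// Hypothetical, simplified sketch; not the shared enforceRowSizeLimit.
interface ToolPartSketch {
  toolCallId: string;
  output: string;
}

interface RowSketch {
  parts: ToolPartSketch[];
  metadata?: { [key: string]: unknown };
}

interface RowLimitOptionsSketch {
  maxChars: number; // simplification: the real guard measures UTF-8 bytes
  warn?: (message: string) => void; // lets each host keep its own log prefix
}

const TOOL_OUTPUT_LIMIT_SKETCH = 1024;

function enforceRowSizeLimitSketch(
  row: RowSketch,
  options: RowLimitOptionsSketch
): RowSketch {
  if (JSON.stringify(row).length <= options.maxChars) return row;
  const compacted: string[] = [];
  const parts = row.parts.map((part) => {
    if (part.output.length <= TOOL_OUTPUT_LIMIT_SKETCH) return part;
    compacted.push(part.toolCallId);
    const dropped = part.output.length - TOOL_OUTPUT_LIMIT_SKETCH;
    return {
      ...part,
      output:
        part.output.slice(0, TOOL_OUTPUT_LIMIT_SKETCH) +
        "... [truncated " + dropped + " chars]"
    };
  });
  if (compacted.length === 0) return row;
  if (options.warn) options.warn("compacted " + compacted.length + " tool output(s)");
  // Merge over existing metadata rather than replacing it.
  return { parts, metadata: { ...row.metadata, compactedToolOutputs: compacted } };
}
```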
…hared parseProtocolMessage
Migrate AIChatAgent's onMessage wrapper off its inline `JSON.parse` +
`data.type === MessageType.X` switch and onto the shared `parseProtocolMessage`
(which Think already uses), dispatching on the typed `ChatProtocolEvent`
discriminants. This is a classification-only migration: all eight handler bodies
are byte-preserved (`data.` -> `event.`), and in particular the `messages` event
still persists the client snapshot — explicitly NOT converged onto Think's no-op.
Behavior-preservation review:
- The wire strings in ai-chat's `MessageType` and `agents`' `CHAT_MESSAGE_TYPES`
are byte-identical for all eight incoming types (the same client talks to both
packages), so the parser recognizes exactly the set the inline switch did and
routes each to the same body.
- The inline switch gated chat-request on `method === "POST"`, so a non-POST
use-chat request fell through to the consumer's onMessage. Preserved by gating
the delegate on `!(event.type === "chat-request" && event.init.method !==
"POST")` — only the POST branch enters the handler; everything else (including
a parser-null non-JSON/unknown frame) still falls to `_onMessage`.
- Non-JSON and JSON-without-`type` both yield a null parse -> `_onMessage`,
matching the old try/catch + no-`type` fall-through.
- The parser is marginally more robust for malformed frames (defaults a missing
`init` to `{}` rather than throwing, and a missing `toolName` to `""`) — no
change for well-formed traffic.
One type fix: the parser types `clientTools[].parameters` as `unknown` (vs
ai-chat's `ClientToolSchema`/`JSONSchema7`), so the auto-continuation call site
now casts `clientTools as ClientToolSchema[] | undefined`, mirroring the existing
cast on the `_lastClientTools` assignment. Removed the now-unused
`type IncomingMessage` import.
Verification: pnpm run check (111 projects); ai-chat workers 686; ai-chat
real-`wrangler dev` SIGKILL e2e 10/10 (the dispatch-path gate); Think workers 686
for parity. Think and the agents package are byte-unchanged this slice (Think
already routed through the pre-existing parseProtocolMessage), so the Think
SIGKILL e2e cannot regress and was not re-run. No changeset — internal dispatch
refactor, no user-visible behavior change.
Co-authored-by: Cursor <cursoragent@cursor.com>
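The routing rule from the behavior-preservation review can be sketched as a parser plus a gate. This is a reduced illustration: the event union has two members instead of eight, the wire `type` strings are assumed equal to the event discriminants, and `routeSketch` is a hypothetical stand-in for the `onMessage` wrapper. The gate expression itself and the "missing `init` defaults to `{}`" behavior are taken from the message.

```typescript
// Hypothetical sketch of parse-then-gate dispatch.
type ProtocolEventSketch =
  | { type: "chat-request"; init: { method?: string } }
  | { type: "clear" };

function parseProtocolMessageSketch(raw: string): ProtocolEventSketch | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // non-JSON frame
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as { type?: unknown; init?: { method?: string } };
  if (record.type === "chat-request") {
    // A missing `init` defaults to {} instead of throwing.
    return { type: "chat-request", init: record.init ?? {} };
  }
  if (record.type === "clear") return { type: "clear" };
  return null; // JSON without a recognized `type`
}

// "handler" = the chat handler body; "consumer" = the user's onMessage.
function routeSketch(raw: string): "handler" | "consumer" {
  const event = parseProtocolMessageSketch(raw);
  if (
    event !== null &&
    !(event.type === "chat-request" && event.init.method !== "POST")
  ) {
    return "handler";
  }
  return "consumer";
}
```

The negated condition keeps the old fall-through: a non-POST chat request is recognized by the parser but still handed to the consumer, exactly like an unparseable frame.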
…helper
Add `runChatRecoveryExhaustion(input, { emit, onExhausted?, onError, terminalize })`
to `agents/chat`, folding the `buildChatRecoveryExhaustedContext` →
`notifyChatRecoveryExhausted` → host-terminalize sequence that every host's
`_exhaustChatRecovery` repeated. The helper owns the invariant (notify before
any terminal write; a throwing `onExhausted` is swallowed and never blocks
terminal UX) while the host expresses the legitimately-divergent terminal /
broadcast / recovering-clear ordering inside `terminalize(ctx)`.
`partialParts` stays an explicit input (not derived from `RecoveryPartial`) so a
foreign-vocabulary host passes `[]` rather than fabricating AI-SDK parts.
Refactor all four hosts onto it, each PRESERVING its current ordering
(`AIChatAgent` persist-first; `Think` broadcast-first + submission write; the pi
and tanstack harnesses record + clear-recovering). The harnesses also gain a
`_setChatRecovering` wrapper so the duplicated `setChatRecovering` option bag is
built once.
Behavior-neutral plumbing in the @internal `agents/chat` layer (additive,
sibling-only export) — no changeset. Adds a `runChatRecoveryExhaustion` unit
test (notify-before-terminalize order, onExhausted-swallow, shared ctx,
terminalize propagation). Validated: typecheck (113/113), chat (410), ai-chat
(687), think suites.
Co-authored-by: Cursor <cursoragent@cursor.com>
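The invariant the helper owns (notify before any terminal write, swallow a throwing `onExhausted`, let `terminalize` errors propagate) can be sketched as below. The option names mirror the signature quoted in the message; the context shape, the synchronous hooks, and routing the swallowed error to `onError` are assumptions of this sketch.

```typescript
// Hypothetical sketch of the notify -> terminalize sequence.
interface ExhaustedContextSketch {
  requestId: string;
  reason: string;
  partialParts: unknown[]; // explicit input; a foreign-vocabulary host passes []
}

interface ExhaustionHooksSketch {
  emit(ctx: ExhaustedContextSketch): void;
  onExhausted?(ctx: ExhaustedContextSketch): void;
  onError(error: unknown): void;
  terminalize(ctx: ExhaustedContextSketch): void;
}

function runExhaustionSketch(
  input: ExhaustedContextSketch,
  hooks: ExhaustionHooksSketch
): void {
  const ctx: ExhaustedContextSketch = { ...input }; // one shared context
  hooks.emit(ctx);
  try {
    if (hooks.onExhausted) hooks.onExhausted(ctx);
  } catch (error) {
    // A throwing consumer hook never blocks the terminal UX.
    hooks.onError(error);
  }
  // Host-specific ordering (broadcast, persist, clear-recovering) lives
  // inside terminalize; its errors propagate to the caller.
  hooks.terminalize(ctx);
}
```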
…cast-first
`AIChatAgent`'s give-up terminalize now broadcasts the terminal banner BEFORE persisting the durable terminal record, matching `Think`. A terminal-record write can reject in the deploy/storage window a give-up runs in (#1730); under the old persist-first ordering the throw propagated before the banner sent, so the live banner was dropped on that pass and only landed on the healthy re-run.
Broadcasting first keeps the banner resilient to a failing storage write — the throw still propagates and the give-up re-runs idempotently (re-persisting + re-broadcasting, the documented at-least-once edge). Persisting first gained no durability (the re-run persists either way) while losing banner resilience.
Removes the last "legitimately divergent ordering" between the two chat hosts: both now terminalize broadcast-first; only the set of durable writes differs (`Think` also writes a submission row). Updates the engine adapter/helper docs and the stale `Think`/ai-chat cross-reference comments accordingly.
Changeset: patch for @cloudflare/ai-chat (behavior change in a failure mode). Validated: pnpm run check (113/113) + ai-chat suite (687, incl. the give-up transient/seal tests).
Co-authored-by: Cursor <cursoragent@cursor.com>
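The failure mode behind this ordering change can be reproduced with a tiny sketch: a persist step that rejects on the first pass and succeeds on the re-run. The function and callback names are hypothetical, and the hooks are synchronous for brevity.

```typescript
// Hypothetical sketch of broadcast-first terminalization.
function terminalizeBroadcastFirstSketch(
  broadcastBanner: () => void,
  persistTerminal: () => void
): void {
  broadcastBanner(); // live clients see the banner even if the write rejects
  persistTerminal(); // may throw; the caller re-runs the give-up idempotently
}

// Usage: the first persist fails, as it can in a deploy/storage window.
let bannersSent = 0;
let persistCalls = 0;
let persisted = false;
const sendBanner = (): void => { bannersSent += 1; };
const flakyPersist = (): void => {
  persistCalls += 1;
  if (persistCalls === 1) throw new Error("storage write rejected");
  persisted = true;
};

try {
  terminalizeBroadcastFirstSketch(sendBanner, flakyPersist);
} catch {
  // The throw still propagates; the banner has already gone out.
}
terminalizeBroadcastFirstSketch(sendBanner, flakyPersist); // healthy re-run
console.log(bannersSent, persisted);
```

Under persist-first, the first pass would have thrown before `sendBanner` ran; here the banner is sent on both passes, which is the at-least-once edge the commit message accepts.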
… convergence Close API-ergonomics finding #3 in the resume block (move to "Recently landed", renumber the remaining open items, bump the commit ref) and add two newest-first Progress-log entries: the engine-owned `runChatRecoveryExhaustion` helper (behavior-neutral) and the ai-chat broadcast-first convergence (behavior change, with changeset). Annotate the Phase 5 findings list with the one correction to the original sketch (terminalize-closure seam, not raw broadcast/storage). Co-authored-by: Cursor <cursoragent@cursor.com>
…ep verdict + platform north star
Design-only RFC update folding in the recent investigation:
- Orphan-persist 4-step verdict (a reconstruction, b target-id, c tool dedup, d upsert): 3 of 4 unify; only (b) is genuinely storage-coupled, (c) is a latent dedup gap to fix in the shared StreamAccumulator.mergeInto.
- Convergence philosophy (north star) + the 3-bucket litmus test (behavior drift -> converge; bug-asymmetry -> fix in shared primitive; product -> keep per-package); ai-chat as lean subset, Think as product superset.
- Foundation revisit across all three tiers: re-split the persist-orphan responsibilities (merge shared / store-write adapter), corrected stale matrix cells (reconciliation + terminal delivery already converged), fixed the double-assigned "reconstruct partial" seam, and re-scoped Tier-2 item 2 (StreamAccumulator adoption shrinks the codec seam to driver + vocabulary).
- Platform context (non-scope north star): shape the storage seam toward the existing Session provider interface; treat resumable-stream as substrate.
No code changes; reshapes open item #1 into the orphan-persist consolidation.
Co-authored-by: Cursor <cursoragent@cursor.com>
…b), not a bug
Before implementing the proposed standalone "(c) tool-dedup" fix, reading the actual reconstruction code reversed the finding:
- applyChunkToParts is already fully idempotent by toolCallId (#1404 guards), so StreamAccumulator.mergeInto (replace, not append) needs no dedup — the premise that "mergeInto lacks dedup" was wrong.
- Think has no early/mid-stream message persist (persists are at finalize, which a crash skips; tool-approval early-persist is ai-chat-only), so its fresh-id orphan path has nothing to duplicate against — not a gap.
- ai-chat's hand-rolled dedup is purely a consequence of its reconstruct-fresh-then-append-onto-same-id model (downstream of (b) + the ai-chat-only early-persist). It dissolves for free when ai-chat adopts the shared seed-then-replace model in step (a); mergeInto is left unchanged.
Corrected the 4-step table, open item #1, the Tier-3 bullet, the matrix note, and the bucket-2 litmus example (which had cited (c)). Lesson folded into the litmus test: verify the asymmetry is real before propagating a "fix." The recommended next step is now the step-(a) reconstruction migration, not a standalone (c) patch.
Co-authored-by: Cursor <cursoragent@cursor.com>
…econstruction onto shared StreamAccumulator AIChatAgent._persistOrphanedStream now rebuilds the partial via the shared StreamAccumulator instead of a hand-rolled applyChunkToParts loop + inline start/finish/message-metadata extraction (the drift-prone duplication that step (a) targeted; the same primitive Think and the client reducer use). Scoped to reconstruction only. A full seed-then-replace was deliberately NOT adopted: it would change ai-chat's tool-result-merge semantics (its merge keeps an existing in-place-applied tool part rather than letting a replayed output-available chunk re-advance it), so on this highest-risk path the id-resolution (b, #1691) and the append-merge-with-toolCallId-dedup (c/d) are kept verbatim. Reconstruction is provably behavior-identical (the accumulator defaults continuation:false, adopting start.messageId unconditionally like the old code, and merges the same metadata chunks). Validation: ai-chat durable-chat-recovery 63/63, full workers suite 609/609, typecheck 113/113, pnpm run check clean. No changeset — internal refactor of an @internal method with no public-API or observable behavior change. RFC progress log + open item #1 updated to the as-built (narrower) scope; also reflows the prior RFC commits to oxfmt. Co-authored-by: Cursor <cursoragent@cursor.com>
Spike gating orphan-persist (b): the store-write half maps cleanly onto the existing SessionProvider subset (getMessage/getLatestLeaf/appendMessage/ updateMessage); ai-chat's flat array is a degenerate linear provider, so no second storage abstraction is needed. resolveOrphanTargetId is recovery policy (reads #1691 stream message_id / getLatestLeaf), not a store method. chat-sdk's state adapter confirmed orthogonal. (b) consolidation unblocked. Co-authored-by: Cursor <cursoragent@cursor.com>
Land the orphan-persist consolidation the chat-recovery RFC scoped:
- (b) extract AIChatAgent._resolveOrphanTargetId — the #1691 stored-id / last-assistant policy — as a named per-host seam (kept on the host, not the shared engine adapter: hoisting orchestration into the engine would need strip/broadcast/flush hooks that fight the flat-vs-tree split).
- (c) extract the append-merge-with-toolCallId-dedup into a shared pure reconcileOrphanPartial(existing, incoming) in message-reconciler.ts, exported from agents/chat and unit-tested. Not convergeable to Think's whole-message replace (ai-chat's early tool-approval persist applies a client tool result in place that lives only in storage).
- (d) the store-write is now recognizably the same SessionProvider-subset shape on both hosts (flat findIndex/append vs Think's tree upsert).
Pure internal refactor of @internal methods; no public API or observable behavior change. ai-chat 687/687, agents 2067 passed, check clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
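An append-merge with toolCallId dedup, the idea behind `reconcileOrphanPartial(existing, incoming)`, can be sketched as a pure function. The part union below is a reduced stand-in for real message parts, and the exact merge rules of the shared helper are not given in the message, so treat this as one plausible reading: stored tool parts win, everything else from the replay is appended.

```typescript
// Hypothetical sketch; the part shapes are simplified stand-ins.
type PartSketch =
  | { type: "text"; text: string }
  | { type: "tool"; toolCallId: string; state: string };

function reconcileOrphanPartialSketch(
  existing: PartSketch[],
  incoming: PartSketch[]
): PartSketch[] {
  const seen: { [toolCallId: string]: boolean } = {};
  for (const part of existing) {
    if (part.type === "tool") seen[part.toolCallId] = true;
  }
  const merged = existing.slice();
  for (const part of incoming) {
    if (part.type === "tool") {
      // Keep the stored copy: it may hold a client tool result that was
      // applied in place and exists only in storage.
      if (seen[part.toolCallId]) continue;
      seen[part.toolCallId] = true;
    }
    merged.push(part);
  }
  return merged;
}
```

This is why the step cannot simply converge onto a whole-message replace: replacing would let a replayed, less-advanced tool part overwrite the in-place result.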
…chat The (b)/(c)/(d) orphan-persist consolidation added a new public export on the agents `./chat` subpath; record it as an additive agents patch. Co-authored-by: Cursor <cursoragent@cursor.com>
Coverage map for the (a)/(b)/(c)/(d) orphan-persist seams across the workers runtime tests vs real-SIGKILL e2e; re-ran the two ai-chat e2e files that drive the refactored _persistOrphanedStream through a real crash (outcomes 2/2, chat-recovery 3/3, green). One documented/accepted gap: no real-SIGKILL e2e for the (c) tool-approval dedup path (covered at workers + unit layers; pure, runtime-independent logic). Phase-6 exit criteria for converged orphan-persist met. Co-authored-by: Cursor <cursoragent@cursor.com>
…Phase 7) Add a recovery-engine.ts section to chat-shared-layer.md (ChatRecoveryEngine lifecycle + ordering invariants, the ChatRecoveryAdapter / ChatFiberWakeHooks seams, pi-recovery as the non-AI-SDK forcing function, and the four orphan-persist seams). Correct stale references that the orphan-persist refactor invalidated (the orphan path now rebuilds via StreamAccumulator, not applyChunkToParts directly), add reconcileOrphanPartial to the module map, and add a history note linking the RFC. RFC Phase 6/7 item updated + progress log. Co-authored-by: Cursor <cursoragent@cursor.com>
…sistStore interface
Turn the by-convention store-write alignment into a type-enforced contract. Add `OrphanPersistStore<M = UIMessage>` in agents/chat (the SessionProvider write subset, message-type-parameterized so it is not AI-SDK-specific) and route both AIChatAgent and Think's orphan-persist write through a host adapter typed against it. Pure internal refactor — no observable behavior change.
- ai-chat: `_orphanStore()` over the flat messages array + persistMessages.
- think: `_orphanStore()` over the Session with `_rowSafe` at the write boundary; factor the strip + empty-skip rule into `_strippedForPersist` shared by the live (`_persistAssistantMessage`) and orphan paths so it cannot drift.
- tests-d: assert SessionProvider satisfies OrphanPersistStore<SessionMessage>.
- RFC: note when to use this seam vs converge onto SessionProvider.
agents patch changeset added for the additive `OrphanPersistStore` export.
Co-authored-by: Cursor <cursoragent@cursor.com>
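A message-type-parameterized write subset, and a flat-array adapter over it, might look like the sketch below. The four method names follow the SessionProvider subset named earlier in this PR (getMessage / getLatestLeaf / appendMessage / updateMessage); their signatures, the synchronous style, the `FlatMessageSketch` type, and `upsertOrphanSketch` are assumptions for illustration, not the exported `OrphanPersistStore`.

```typescript
// Hypothetical sketch; signatures are assumed, not the exported interface.
interface OrphanPersistStoreSketch<M> {
  getMessage(id: string): M | undefined;
  getLatestLeaf(): M | undefined;
  appendMessage(message: M): void;
  updateMessage(message: M): void;
}

interface FlatMessageSketch {
  id: string;
  text: string;
}

// A flat array treated as a degenerate linear provider.
function flatOrphanStoreSketch(
  messages: FlatMessageSketch[]
): OrphanPersistStoreSketch<FlatMessageSketch> {
  const indexOfId = (id: string): number => {
    for (let i = 0; i < messages.length; i++) {
      if (messages[i].id === id) return i;
    }
    return -1;
  };
  return {
    getMessage: (id) => {
      const i = indexOfId(id);
      return i === -1 ? undefined : messages[i];
    },
    getLatestLeaf: () =>
      messages.length === 0 ? undefined : messages[messages.length - 1],
    appendMessage: (message) => {
      messages.push(message);
    },
    updateMessage: (message) => {
      const i = indexOfId(message.id);
      if (i !== -1) messages[i] = message;
    }
  };
}

// Host-agnostic write: works against any store that satisfies the subset.
function upsertOrphanSketch<M extends { id: string }>(
  store: OrphanPersistStoreSketch<M>,
  message: M
): void {
  if (store.getMessage(message.id) !== undefined) store.updateMessage(message);
  else store.appendMessage(message);
}
```

Typing the write path against the interface, rather than against either host's storage, is what turns the by-convention alignment into something the compiler checks.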
Mechanical tightening of the chat-recovery-foundation branch (no behavior change, internal seams only):
- resumable-stream: drop the third hand-copied sendIfOpen/isWebSocketClosedSendError; import sendIfOpen from ./connection (the documented single source). Connection is structurally assignable to the shared ChatConnection.
- recovery-engine: remove the vestigial `targetAssistantId` from the shared BeginChatRecoveryIncidentInput (engine never reads it; both hosts route it via the schedule data payload). Drop the dead mirror from each host wrapper too.
- pi-recovery: remove unused hasFiberRows() (e2e polls getStatus) and the unread FauxPiModel.registration field (+ now-orphaned FauxProviderRegistration import).
- comments: retense recovery-incident / resume-handshake headers (extraction is done, not pending); qualify ai-chat's `_replayTerminalOnResume` refs as the shared ResumeHandshake method. connection.ts "single shared implementation" is now accurate post-dedup.
- gitignore: ignore **/.smoke-state/ (e2e miniflare/SQLite state).
Note: evaluateChatRecoveryIncident stays async — it awaits config.shouldKeepRecovering(ctx); the "async with no await" review note was wrong.
Co-authored-by: Cursor <cursoragent@cursor.com>
…at barrel

Tighten the agents/chat export surface to match the decision that the chat-recovery engine is an internal seam, not public API. No behavior change.

- @internal grouping: move the recovery-codec and resume-handshake re-exports under the single @internal section alongside recovery-incident / recovery-engine / stall-watchdog, and widen the doc to name all five blocks and the tanstack/pi adapter consumers.
- prune 12 zero-consumer barrel exports (verified: no external `agents/chat` importer, and the agents __tests__ already import them via relative module paths, so no test repointing): getPartialStreamText, partialHasSettledToolResults, AISDKRecoveryCodec (class; the aiSdkRecoveryCodec singleton is kept because hosts use it), isWebSocketClosedSendError, toolPartName, assistantContentKey, evaluateChatRecoveryIncident, chatRecoveryIncidentId, chatRecoveryIncidentKey, selectStaleIncidentKeys, buildChatRecoveryExhaustedContext, notifyChatRecoveryExhausted.
- changeset: no new agents changeset for the internal+unreleased recovery surface (the two genuinely-public additions, OrphanPersistStore and reconcileOrphanPartial, already have changesets). Clarify the stall-watchdog changeset that iterateWithStallWatchdog is an internal agents/chat seam.

Constants (DEFAULT_CHAT_RECOVERY_* / *_THROTTLE_MS) left as-is pending a separate micro-audit.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ep progress log inline

Documentation-only truth-up of stale/contradictory claims (no code change):

- chat-shared-layer.md: complete the module tree (was 12 of 27 files; add connection/parse-protocol/tool-state/agent-tools/orphan-store/lifecycle/recovery* and the @internal recovery-engine group); fix the ai-chat line-count claim (~3700/~1577 -> current ~6.1k/~2.5k); correct reconciler ownership: reconcileMessages/resolveToolMergeId are shared and BOTH hosts call them (Think._handleChatRequest reconciles incoming, Think._persistIncomingMessage resolves assistant tool-merge ids), reframing "Why reconciliation stays in ai-chat" -> "Why reconciliation is shared".
- rfc-chat-recovery-foundation.md: resolve contradictions in place: relabel the "Still open" list (item #1 orphan-persist is fully LANDED), refresh the resume point commit hash (-> 1617593) and recently-landed entries (orphan-store seam, Tier-1/2 cleanup), and correct the stall-watchdog decision: it shipped opt-in (chatStreamStallTimeoutMs default 0 in both packages), not the original default-on-when-chatRecovery plan. Add an explicit "archive the progress log to a sibling file at finalization" marker (kept inline while the branch is in flight, since it is an active working artifact).
- rfc-think-actions.md / rfc-think-turns.md: bump stale cross-refs (said Phases 0-4 done / Slice 4d-2 in flight; actually Phases 0-5 + engine extraction complete, the ChatFiberWakeHooks classify/dispatch hook pair shipped).

Co-authored-by: Cursor <cursoragent@cursor.com>
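The opt-in stall-watchdog decision recorded above can be pictured with a small sketch. It assumes that a timeout of `0` means "disabled", which is how the `chatStreamStallTimeoutMs` default reads here; `withStallWatchdog` and `nextOrStall` are hypothetical names and do not reproduce the real `iterateWithStallWatchdog` signature.

```typescript
// Race one pending chunk against a stall deadline.
// Assumption: stallTimeoutMs <= 0 disables the watchdog entirely.
async function nextOrStall<T>(
  next: Promise<IteratorResult<T>>,
  stallTimeoutMs: number
): Promise<IteratorResult<T>> {
  if (stallTimeoutMs <= 0) return next;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const stalled = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`stream stalled: no chunk for ${stallTimeoutMs}ms`)),
      stallTimeoutMs
    );
  });
  try {
    return await Promise.race([next, stalled]);
  } finally {
    // Always disarm the timer so a delivered chunk cannot reject later.
    clearTimeout(timer);
  }
}

// Hypothetical wrapper: re-yields chunks, throwing if the gap between
// two consecutive chunks exceeds the stall timeout.
async function* withStallWatchdog<T>(
  source: AsyncIterable<T>,
  stallTimeoutMs: number
): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]();
  try {
    while (true) {
      const result = await nextOrStall(iterator.next(), stallTimeoutMs);
      if (result.done) return;
      yield result.value;
    }
  } finally {
    // Best-effort release; not awaited because a stalled source may never settle.
    void iterator.return?.();
  }
}
```

The deadline is per chunk rather than per stream, so a long but steadily progressing stream is never interrupted in this sketch.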
…endingInteraction Mirror @cloudflare/ai-chat's `_streamingMessage` scan so a parallel-batch client-tool `input-available`/`approval-requested` part that has streamed into `_streamingAssistant` but not yet been persisted to `this.messages` is detected. Of the three consumers only the `isAwaitingClientInteraction` incident-eval callback observes a non-null accumulator (same-isolate stall route); `waitUntilStable` checks the predicate only after `waitForIdle()` (accumulator already drained to null) and `_parkRecoveryForPendingInteraction` runs only on the post-wake recovery paths (fresh isolate). Without the scan Think budgets a mid-stream stall that ai-chat treats as "awaiting client" (budget-free) — a self-correcting drift the stall watchdog would expose. The change only ever flips false->true (more conservative), so it cannot wedge `waitUntilStable`. Adds `setStreamingAssistantForTest` and three tests covering the accumulator scan (client-tool pending, server-tool exclusion, resolved no-op). Co-authored-by: Cursor <cursoragent@cursor.com>
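The "scan both" rule can be sketched as follows. The types, the `clientSide` flag, and the function shape are hypothetical illustrations, not Think's actual `hasPendingInteraction` implementation; only the part states `input-available` and `approval-requested` come from the commit message.

```typescript
type ToolPartState = "input-available" | "approval-requested" | "output-available";

interface ToolPart {
  toolCallId: string;
  state: ToolPartState;
  // Hypothetical flag: true when the tool is resolved by the client.
  clientSide: boolean;
}

interface AssistantMessage {
  id: string;
  parts: ToolPart[];
}

// A part is pending when it is a client tool still waiting on the user/client.
function isPendingClientPart(part: ToolPart): boolean {
  return (
    part.clientSide &&
    (part.state === "input-available" || part.state === "approval-requested")
  );
}

// Scan persisted messages AND the in-flight accumulator, so a part that has
// streamed in but is not yet persisted still counts as "awaiting client".
function hasPendingInteraction(
  persisted: AssistantMessage[],
  streaming: AssistantMessage | null
): boolean {
  const candidates = streaming ? [...persisted, streaming] : persisted;
  return candidates.some((message) => message.parts.some(isPendingClientPart));
}
```

Adding the accumulator to the scan can only turn a `false` into a `true`, which matches the commit's note that the change is strictly more conservative.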
The branch converged AIChatAgent + Think onto the shared agents/chat recovery engine, but most of the real-`wrangler dev` + SIGKILL suites that prove it ran in no workflow: nightly only ran think vitest e2e and ai-chat *Playwright* (client-disconnect resume, not process death). Add three nightly jobs after a full local sweep passed green:

- e2e-ai-chat-recovery: ai-chat real-SIGKILL vitest suite (the convergence half), distinct from the existing e2e-ai-chat Playwright job.
- e2e-agents: core runFiber SIGKILL recovery primitives.
- e2e-engine-genericity: pi-recovery + tanstack-recovery proofs that the engine is not AI-SDK-specific (non-UIMessage agent + foreign AG-UI tool vocabulary).

Also refresh the stale "hangs in CI" comment on the agents e2e exclusion: the suite ran clean locally (11/11); it's excluded from the fast unit target only because it spawns wrangler dev, and now runs nightly.

Co-authored-by: Cursor <cursoragent@cursor.com>
…dangling promise `waitForMessagesBroadcast` armed a 10s reject timer, but callers arm it before sending and await it after. If the intervening turn threw (or the broadcast never landed), the promise dangled and its timer rejected later as an unhandled rejection — failing an otherwise-green nightly think e2e run with "Messages broadcast timed out". Make it a best-effort barrier: resolve with the broadcast or `null` on timeout, never reject. The authoritative assertion is the subsequent `getMessages` RPC, so a missed broadcast need not fail here. Drop the now-redundant `.catch`es. Co-authored-by: Cursor <cursoragent@cursor.com>
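A best-effort barrier of this kind might look like the sketch below. `resolveOrNull` is a hypothetical helper, not the test suite's actual `waitForMessagesBroadcast`; mapping an inner rejection to `null` is an extra assumption beyond what the commit states.

```typescript
// Resolve with the awaited value, or null on timeout. Never rejects, so a
// barrier that is armed early and awaited late cannot surface later as an
// unhandled rejection.
function resolveOrNull<T>(pending: Promise<T>, timeoutMs: number): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    const timer = setTimeout(() => resolve(null), timeoutMs);
    pending.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        // Assumption: a failed wait is treated like a missed broadcast.
        clearTimeout(timer);
        resolve(null);
      }
    );
  });
}
```

Callers then treat `null` as "no broadcast observed" and rely on a follow-up authoritative check, as the commit does with the `getMessages` RPC.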
… READMEs

t5-experimental-ci remainder:

- Add pi-recovery `test` script + plain-node vitest.config.ts and pi-codec.test.ts (32 tests), closing the gap vs tanstack-recovery's 31 codec tests. Covers delta accumulation, text_end/done/message_end authority, torn-write recovery, encode/decode round-trips, and the progress/streaming predicate disjointness.
- Add READMEs to pi-recovery and tanstack-recovery explaining each as a shared recovery-engine genericity harness (non-AI-SDK agent / foreign AG-UI tool vocabulary), what they prove, how to run, and their nightly CI wiring.

Co-authored-by: Cursor <cursoragent@cursor.com>
…iring Validated on the real Cloudflare edge (account `agents`). ai-chat (deployed-recovery.test.ts): add a second deterministic scenario — a normally-completed turn is NOT spuriously recovered by reconnect/idle churn and the agent keeps serving fresh turns (the false-positive counterpart to the existing mid-turn-redeploy eviction test). Both reuse already-bound fixtures. Think (chat-recovery-probe): add scripts/run-suite.mjs — deploys the probe under a unique throwaway name, runs the fast, deterministic, abort-driven scenarios (a6 HITL, a7 server-orphan, a8 approval, idem) via the proven driver, and always deletes the Worker. A readiness gate polls /probe/debug before driving. Slow real-deploy-churn scenarios (a1/a2/a4/a5/a9/rapid) stay manual. Automation: root `test:recovery:live` runs both suites; gated nightly jobs `e2e-deployed-ai-chat` + `e2e-deployed-think-probe` (off unless the RUN_DEPLOYED_E2E repo var is set or a manual run_deployed dispatch enables them). Docs note pinning CLOUDFLARE_ACCOUNT_ID to avoid a wrong-account auth error. RFC Layer 5 updated from sketch to LANDED, recording the real-edge realities (hang-to-interrupt timing, racy natural seals seeded via prime-seal). Live results: ai-chat 2/2, probe a6/a7/a8/idem all PASS; teardown verified (worker delete confirmed via code 10007). Co-authored-by: Cursor <cursoragent@cursor.com>
Update the resume-point orientation block (last commit -> d0c9585), add recently-landed entries for the Layer-5 deployed suites and the nightly e2e wiring / pi-codec tests / flake fix, mark tracked item #2's Layer-5 work as landed, and refresh the deferred/post-v1 list (drop Layer 5; name the two host-convergence extractions + harness-unify as the remaining tracked follow-ups). Co-authored-by: Cursor <cursoragent@cursor.com>
Harden the in-repo tracking for the deferred AutoContinuationController + adapter-spine extractions so a future session can pick them up cold without relying on the local cleanup-plan artifact: precondition (barrier behavior already converged → pure de-dup), shared method shape to extract, parameterization, and the merge gate (own PR + changeset + behavior-parity tests). Anchors on stable method names rather than drift-prone line numbers. Co-authored-by: Cursor <cursoragent@cursor.com>
…rallel-matrix flake The agents/ai-chat/think unit suites run under @cloudflare/vitest-pool-workers. Under the full parallel `nx run-many -t test` matrix, miniflare isolate teardown can overrun vitest's 10s default and surface as "Worker exited unexpectedly" — an infra teardown race (Nx flags the think task as flaky), not a test failure that the existing `retry: 3` can catch. Mirror the e2e configs' fix with `teardownTimeout: 60_000` so a slow teardown can't red an otherwise-green run. No product change; test infra only. Co-authored-by: Cursor <cursoragent@cursor.com>
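For orientation, the two options named in this commit would sit in the `test` block of a Vitest config roughly as sketched here. The object is shown standalone and `workersPoolTestConfig` is a hypothetical name; in a real `vitest.config.ts` an equivalent object would be the default export, and the exact helper the suites wrap it in is not shown in this PR text.

```typescript
// Sketch of the relevant Vitest options only; other settings omitted.
const workersPoolTestConfig = {
  test: {
    // Re-runs a failing test up to 3 times. This does not help when the
    // worker process itself dies during teardown.
    retry: 3,
    // Raise the teardown budget from Vitest's 10s default to 60s so a slow
    // isolate shutdown is not reported as a crashed worker.
    teardownTimeout: 60_000
  }
};
```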
… calls Throttle React state updates from streaming chat to ~10/s across every example, smoothing render churn during fast token streams. Applied to all live useAgentChat() call sites (doc-snippet strings left untouched). Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
This branch extracts the chat recovery machinery — the logic that keeps a
streaming agent turn alive across Durable Object eviction, deploys, isolate
replacement, hibernation, and stream stalls — out of the two host agents
(`@cloudflare/ai-chat`'s `AIChatAgent` and `@cloudflare/think`) and into a
single, host-agnostic engine in `agents/chat`. Both hosts are then converged
onto that shared engine, eliminating two divergent, independently-patched copies
of subtle recovery code.
The result is one tested seam (`agents/chat`) that owns the incident-budget state
machine, the stall watchdog, the WebSocket stream-resume handshake, the
recovery codec, and the orphaned-partial persistence primitive — proven to be
not AI-SDK-coupled by two independent genericity harnesses (a hand-rolled
"pi" adapter and a TanStack-AI client over real Workers AI).
`design/rfc-chat-recovery-foundation.md` (authoritative, with an inline progress log),
`design/chat-shared-layer.md`, `design/rfc-think-turns.md`.

What ships to users (changesets)
10 changesets on this branch:

- `agents`: `agents/chat` exports (`OrphanPersistStore` interface, `reconcileOrphanPartial` helper); recovery-engine progress-credit convergence; stream-resolution picks the newest row
- `@cloudflare/ai-chat`
- `@cloudflare/think`

The shared recovery engine itself is `@internal` and intentionally carries no
changeset (not public API).
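The budget decision at the heart of that internal engine can be pictured as a pure function over a stored incident, a progress counter, and an injected clock. The sketch below is a simplified assumption: the field names, the two reason strings, and `evaluateIncident` itself are hypothetical and do not reproduce the real `evaluateChatRecoveryIncident`.

```typescript
interface Incident {
  attempts: number;
  lastProgress: number; // progress counter seen when progress was last made
  lastProgressAt: number; // clock reading (ms) at that moment
}

interface BudgetConfig {
  maxAttempts: number;
  noProgressTimeoutMs: number;
}

interface Evaluation {
  incident: Incident;
  exhausted: boolean;
  reason?: "max_attempts" | "no_progress_timeout"; // hypothetical labels
}

// Storage-free and deterministic: the caller loads/saves the incident and
// supplies the clock, which keeps the budget math unit-testable.
function evaluateIncident(
  previous: Incident | null,
  progress: number,
  config: BudgetConfig,
  now: () => number
): Evaluation {
  const at = now();
  // A new incident, or any forward progress, starts a fresh attempt budget.
  if (previous === null || progress > previous.lastProgress) {
    return {
      incident: { attempts: 1, lastProgress: progress, lastProgressAt: at },
      exhausted: false
    };
  }
  const incident: Incident = { ...previous, attempts: previous.attempts + 1 };
  if (at - previous.lastProgressAt >= config.noProgressTimeoutMs) {
    return { incident, exhausted: true, reason: "no_progress_timeout" };
  }
  if (incident.attempts > config.maxAttempts) {
    return { incident, exhausted: true, reason: "max_attempts" };
  }
  return { incident, exhausted: false };
}
```

Injecting `now` rather than calling `Date.now()` directly is what lets a test step the clock by hand and assert each budget boundary.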
The shared seam (`packages/agents/src/chat/`)
New/owned modules, each unit-tested under `src/chat/__tests__/`:

- `recovery-incident.ts`: the incident-budget state machine (Phase 0/1).
- `recovery-engine.ts`: incident-begin orchestration, scheduling-idempotency policy, the give-up spine, the wake frame, exhaustion-notification core.
- `stall-watchdog.ts`: `iterateWithStallWatchdog` primitive for stream-stall detection (Phase 3).
- `resume-handshake.ts`: host-agnostic driver for the WebSocket stream-resume protocol (`ResumeHandshakeHost` interface).
- `recovery-codec.ts`: `RecoveryPartial`-based, AI-SDK-agnostic codec seam.
- `orphan-store.ts`: `OrphanPersistStore` interface + `reconcileOrphanPartial` for persisting orphaned partial streams after recovery.
- `message-reconciler.ts` / `message-builder.ts`: `StreamAccumulator` + `applyChunkToParts` (idempotent on `toolCallId`).

Per-host convergence
- `AIChatAgent` (`packages/ai-chat`): wired to the shared engine; adopted Think's event-driven, no-timeout, stream-gated auto-continuation barrier (replacing the in-turn 60s-polling barrier); recovering-on-connect replay; structured row-size compaction; give-up converged onto broadcast-first.
- `Think` (`packages/think`): wired to the shared engine; row-size compaction annotations; tier-4 fix: `hasPendingInteraction` now scans the in-flight accumulator (mirrors ai-chat's scan-both), closing a dormant mid-stream park/recovery disagreement.
Genericity proof (the engine is not AI-SDK-coupled)
Two experimental harnesses prove the engine works against foreign client/tool
vocabularies:
- `experimental/pi-recovery`: a hand-rolled "pi" adapter (faux model + custom chunk protocol). Unit codec tests (32) + e2e SIGKILL continuation.
- `experimental/tanstack-recovery`: a TanStack-AI client over the shared resume handshake, validated against real Workers AI. Covers `ResumeHandshake`, SIGKILL continuation, settled-tool persist gating, text-only partial retries.
- `experimental/chat-recovery-probe`: opt-in deployed probe orchestrator for Think's abort-driven scenarios.
Testing
All gates green on the branch tip prior to opening this PR:
- `pnpm run check` (sherif/exports/oxfmt/oxlint/typecheck)
- `pnpm run build`
- `agents` + `ai-chat` + `think` unit suites
- `ai-chat` local SIGKILL recovery e2e
- `agents` local SIGKILL e2e
- `think` local SIGKILL e2e
- `pi-recovery` genericity e2e
- `tanstack-recovery` genericity e2e

Flake hardening (this branch): the workers-pool unit suites (`agents`, `ai-chat`, `think`) gained `teardownTimeout: 60_000`. Under the full parallel `nx run-many -t test` matrix, miniflare isolate teardown could overrun vitest's 10s default and surface as "Worker exited unexpectedly": an infra teardown race (Nx flagged the think task as flaky), not a test failure that the existing `retry: 3` could catch. This mirrors the fix already present in the e2e configs.

CI changes
`.github/workflows/nightly.yml`: promoted the previously manual-only SIGKILL e2e suites to nightly jobs (`e2e-ai-chat-recovery`, `e2e-agents`, `e2e-engine-genericity`), and added opt-in, billable deployed jobs (`e2e-deployed-ai-chat`, `e2e-deployed-think-probe`) gated behind the `RUN_DEPLOYED_E2E` repo var / a manual `run_deployed` dispatch. PR CI is unchanged in cost.
Explicitly deferred / tracked follow-ups (post-merge)
Recorded in the RFC (see the "Tracked follow-up" brief):
- `AutoContinuationController` extraction: de-dup the ~260-line, now-behavior-identical barrier into a shared primitive (own PR + changeset + behavior-parity tests). Behavior already converged; this is pure code de-dup.
- checkpoints; Route 2 (front `AIChatAgent` with a TanStack client); `ResumeHandshakeHost` Approach B; ai-chat/agents wrangler e2e harness unification (pure test hygiene).
Risk & rollback
- `@internal`; the only new public surface is two small `agents/chat` exports (`OrphanPersistStore`, `reconcileOrphanPartial`).
- all green here.
- `it.skip` in `think` e2e is a documented known-red gate for an unrelated bug (#1670 follow-up), explicitly out of scope, with re-enable instructions.
Reviewer guide
- `design/rfc-chat-recovery-foundation.md` (the "Current state & next steps" block + progress log).
- `packages/agents/src/chat/` (the seam) and its `__tests__/`.
- `packages/ai-chat` and `packages/think`.
- `experimental/*-recovery` harnesses are the genericity evidence.