
feat(sdk,core): preserve chat.agent context after cancel / OOM / crash#8

Open
deepshekhardas wants to merge 4 commits into main from pr/3671-chat-recovery

@deepshekhardas deepshekhardas commented May 20, 2026

Owner

When a `chat.agent` run dies mid-stream (the user cancels, the worker OOMs, an unhandled exception kills the process), the next continuation run now reconstructs the conversation context automatically.

Changes

  • Session recovery boot for `chat.agent`
  • New `onRecoveryBoot` hook for custom recovery policies
  • Schema fix: `data` is now `z.unknown()` instead of `z.string()`
  • Added recovery-boot.test.ts and replay tests
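
To illustrate the recovery idea and why typing `data` as unknown rather than string matters, here is a minimal TypeScript sketch. Everything in it (`SessionRecord`, `payloadToText`, `rebuildMessages`, and the record shape) is hypothetical and is not the SDK's actual API. It only assumes that a session keeps an ordered log of records whose payload may be a plain string or a structured object.

```typescript
// Hypothetical record shape for illustration; NOT the real @trigger.dev/core schema.
type SessionRecord = {
  seq: number;
  role: "user" | "assistant";
  data: unknown; // may be a plain string or a structured payload
};

type Message = { role: "user" | "assistant"; text: string };

// Narrow an unknown payload to text without assuming it is a string.
function payloadToText(data: unknown): string {
  if (typeof data === "string") return data;
  if (typeof data === "object" && data !== null && "text" in data) {
    const text = (data as { text: unknown }).text;
    if (typeof text === "string") return text;
  }
  // Fall back to a JSON rendering for any other payload.
  const json = JSON.stringify(data);
  return typeof json === "string" ? json : "";
}

// Rebuild the conversation by replaying the stored records in sequence order.
function rebuildMessages(records: SessionRecord[]): Message[] {
  return [...records]
    .sort((a, b) => a.seq - b.seq)
    .map((r) => ({ role: r.role, text: payloadToText(r.data) }));
}
```

With a string-only schema, the structured payload in a record such as `{ seq: 2, role: "assistant", data: { text: "hi" } }` would fail validation before any replay could happen, which is presumably the kind of failure the schema fix avoids.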

Test plan

  • Build: `pnpm run build --filter @trigger.dev/sdk --filter @trigger.dev/core`
  • SDK unit tests: `pnpm --filter @trigger.dev/sdk exec vitest run`

Closes triggerdotdev#3671


Summary by cubic

Preserves chat.agent context across cancel/OOM/crash so continuation runs resume the same conversation. Delivered via Sessions-backed recovery in @trigger.dev/sdk/@trigger.dev/core with a new boot hook; meets the recovery goals from Linear triggerdotdev#3671.

  • New Features

    • Sessions-backed chat.agent runtime reconstructs conversation state on continuation runs.
    • onBoot lifecycle hook to rehydrate locals/state once per worker before handling messages.
    • chat.history helpers (getPendingToolCalls, extractNewToolResults, etc.) for HITL flows.
    • Stamp gen_ai.conversation.id on all spans/metrics inside chat runs for cross-run tracing.
    • mockChatAgent test harness to unit-test agents offline without the runtime.
  • Bug Fixes

    • Cap idempotencyKey at 2048 chars with a structured 400 at the API boundary.
    • Retry SIGSEGV task crashes under the task’s existing retry policy.
    • Runs API: add region filter/field for list/retrieve.
    • Fix LocalsKey<T> type incompatibility across dual-package builds.
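
The idempotency-key cap listed above can be sketched as a boundary check. This is a hedged illustration only: the function name, the result type, and the error body fields are assumptions, and the summary does not say whether the 2048 limit counts characters, UTF-16 code units, or bytes (this sketch uses JavaScript string length).

```typescript
// Limit taken from the summary; the unit of measurement is an assumption.
const MAX_IDEMPOTENCY_KEY_LENGTH = 2048;

// Hypothetical result type: either accept, or a structured 400 response.
type KeyValidation =
  | { ok: true }
  | {
      ok: false;
      status: 400;
      body: { error: string; field: "idempotencyKey"; maxLength: number };
    };

// Reject oversized keys before they reach storage; a missing key is allowed.
function validateIdempotencyKey(key: string | undefined): KeyValidation {
  if (key === undefined || key.length <= MAX_IDEMPOTENCY_KEY_LENGTH) {
    return { ok: true };
  }
  return {
    ok: false,
    status: 400,
    body: {
      error: `idempotencyKey must be at most ${MAX_IDEMPOTENCY_KEY_LENGTH} characters`,
      field: "idempotencyKey",
      maxLength: MAX_IDEMPOTENCY_KEY_LENGTH,
    },
  };
}
```

Returning a structured body rather than throwing lets the API layer map the failure straight to a 400 response without string parsing.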

Written for commit 98f2903. Summary will update on new commits.

id: release
uses: softprops/action-gh-release@v1
if: github.event_name == 'push'
uses: softprops/action-gh-release@b4309332981a82ec1c5618f44dd2e27cc8bfbfda # v3.0.0

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 1513 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".changeset/runs-list-region-filter.md">

<violation number="1" location=".changeset/runs-list-region-filter.md:1">
P1: Changeset content does not match PR changes. The changeset describes 'Add region to the runs list / retrieve API' while the PR is about preserving chat.agent context after cancel/OOM/crash with session recovery. This will generate incorrect changelog entries for @trigger.dev/core and @trigger.dev/sdk.</violation>
</file>

<file name=".github/workflows/e2e-webapp.yml">

<violation number="1" location=".github/workflows/e2e-webapp.yml:67">
P2: Gate DockerHub login on both username and token; checking only username can run the step with missing credentials and fail the workflow.</violation>
</file>

<file name=".github/workflows/publish-worker-v4.yml">

<violation number="1" location=".github/workflows/publish-worker-v4.yml:69">
P2: Semver releases no longer publish the `v4-beta` alias tag, so consumers tracking that channel will stop receiving updates.</violation>
</file>

<file name=".github/workflows/claude-md-audit.yml">

<violation number="1" location=".github/workflows/claude-md-audit.yml:52">
P2: Use the PR's base branch dynamically instead of hardcoding `origin/main`</violation>
</file>

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
On a pro plan you can use ultrareview for larger PRs.


@@ -0,0 +1,6 @@
---

P1: Changeset content does not match PR changes. The changeset describes 'Add region to the runs list / retrieve API' while the PR is about preserving chat.agent context after cancel/OOM/crash with session recovery. This will generate incorrect changelog entries for @trigger.dev/core and @trigger.dev/sdk.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .changeset/runs-list-region-filter.md, line 1:

<comment>Changeset content does not match PR changes. The changeset describes 'Add region to the runs list / retrieve API' while the PR is about preserving chat.agent context after cancel/OOM/crash with session recovery. This will generate incorrect changelog entries for @trigger.dev/core and @trigger.dev/sdk.</comment>

<file context>
@@ -0,0 +1,6 @@
+---
+"@trigger.dev/core": patch
+"@trigger.dev/sdk": patch
</file context>


# ..to avoid rate limits when pulling images
- name: 🐳 Login to DockerHub
if: ${{ env.DOCKERHUB_USERNAME }}

P2: Gate DockerHub login on both username and token; checking only username can run the step with missing credentials and fail the workflow.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/e2e-webapp.yml, line 67:

<comment>Gate DockerHub login on both username and token; checking only username can run the step with missing credentials and fail the workflow.</comment>

<file context>
@@ -0,0 +1,97 @@
+
+      # ..to avoid rate limits when pulling images
+      - name: 🐳 Login to DockerHub
+        if: ${{ env.DOCKERHUB_USERNAME }}
+        uses: docker/login-action@4907a6ddec9925e35a0a9e82d7399ccc52663121 # v4.1.0
+        with:
</file context>
Suggested change
- if: ${{ env.DOCKERHUB_USERNAME }}
+ if: ${{ secrets.DOCKERHUB_USERNAME && secrets.DOCKERHUB_TOKEN }}

- image_tags=$image_tags,$ref_without_tag:v4-beta
- fi
+ ref_without_tag=ghcr.io/triggerdotdev/${STEPS_GET_REPOSITORY_OUTPUTS_REPO}
+ image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}

P2: Semver releases no longer publish the v4-beta alias tag, so consumers tracking that channel will stop receiving updates.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/publish-worker-v4.yml, line 69:

<comment>Semver releases no longer publish the `v4-beta` alias tag, so consumers tracking that channel will stop receiving updates.</comment>

<file context>
@@ -62,26 +65,24 @@ jobs:
-            image_tags=$image_tags,$ref_without_tag:v4-beta
-          fi
+          ref_without_tag=ghcr.io/triggerdotdev/${STEPS_GET_REPOSITORY_OUTPUTS_REPO}
+          image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}
 
           echo "image_tags=${image_tags}" >> "$GITHUB_OUTPUT"
</file context>
Suggested change
- image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}
+ image_tags=$ref_without_tag:${STEPS_GET_TAG_OUTPUTS_TAG}
+ if [[ "$STEPS_GET_TAG_OUTPUTS_IS_SEMVER" == true ]]; then
+ image_tags=$image_tags,$ref_without_tag:v4-beta
+ fi
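
The tagging rule in this suggestion can be restated as a small pure function, which makes the intended behavior easy to check in isolation. This is only a sketch of the logic in TypeScript, not code from the workflow: `buildImageTags` and its parameter names are hypothetical stand-ins for the shell variables above.

```typescript
// Mirrors the suggested shell logic: always emit the version tag, and add the
// v4-beta alias only for semver releases. Names here are illustrative.
function buildImageTags(
  refWithoutTag: string,
  tag: string,
  isSemver: boolean
): string {
  const tags = [`${refWithoutTag}:${tag}`];
  if (isSemver) {
    tags.push(`${refWithoutTag}:v4-beta`);
  }
  // The workflow writes a comma-separated list to its output.
  return tags.join(",");
}
```

For a semver release the result contains two comma-separated references, the second being the `v4-beta` alias; for any other ref it contains only the version tag.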


## Your task

1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.

P2: Use the PR's base branch dynamically instead of hardcoding origin/main

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/claude-md-audit.yml, line 52:

<comment>Use the PR's base branch dynamically instead of hardcoding `origin/main`</comment>

<file context>
@@ -0,0 +1,73 @@
+
+            ## Your task
+
+            1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.
+            2. For each changed directory, check if there's a CLAUDE.md in that directory or a parent directory.
+            3. Determine if any CLAUDE.md or .claude/rules/ file should be updated based on the changes. Consider:
</file context>

3 participants