Skip to content

fix: stale resumeAt recovery + upstream improvements#56

Open
TerrifiedBug wants to merge 6 commits intomainfrom
dev
Open

fix: stale resumeAt recovery + upstream improvements#56
TerrifiedBug wants to merge 6 commits intomainfrom
dev

Conversation

@TerrifiedBug
Copy link
Owner

Summary

  • Fix stale resumeAt UUID in scheduled tasks: When the Claude API can't find a resumeAt message UUID, the agent-runner now throws to trigger fresh-session retry instead of surfacing the error. The task-scheduler also clears stale positions on failure so subsequent tasks don't repeat the same error.
  • Port upstream improvements: SDK bump, sender allowlist, mount security hardening, test reorganization into __tests__/ subdirectories.
  • Housekeeping: Removed gog-reauth.sh from core repo (belongs in plugin), cleaned up gitignore.

Test plan

  • npm run build passes
  • Existing tests pass (group-queue, etc.)
  • Verify scheduled tasks recover after stale session error (rebuild container + restart)

TARS added 6 commits February 26, 2026 22:20
Move 17 test files out of src/ root into organized directories:
- src/__tests__/ for core module tests
- src/ipc/__tests__/ for IPC module tests
- src/db/__tests__/ for database module tests

Update relative imports in all moved files accordingly.
Upstream feature ports:
- Sender allowlist for per-chat access control
- Secret redaction hardening (third-party model env vars)
- Duplicate scheduled task prevention via runningTaskId tracking
- Task scheduler drift prevention with computeNextRun
- CJK font support in container Dockerfile
- WhatsApp message handler error wrapping

New:
- scripts/gog-reauth.sh for headless OAuth re-authentication
  with multi-account discovery and batch reauth support
Plugin helper script belongs in nanoclaw-skills marketplace only.
Plugin-deployed helper scripts (e.g. gog-reauth.sh) belong in the
marketplace repo, not the core repo. Prevent accidental commits.
When the Claude API can't find a resumeAt message UUID, it returns an
error result (not an exception). The agent-runner's retry logic only
handled exceptions, so stale sessions were never recovered. Additionally
the task-scheduler never cleared the stale position, causing every
subsequent scheduled task for the group to fail with the same UUID.

Agent-runner: detect "No message found with message.uuid" in error
results and throw to trigger the existing fresh-session retry path.

Task-scheduler: clear resumeAt from orchestrator state after any
context_mode=group task error, preventing repeated failures.
@greptile-apps
Copy link

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR fixes stale resumeAt recovery for scheduled tasks and ports several upstream improvements. The agent-runner now throws when it detects a "No message found with message.uuid" error (instead of writing it to stdout as the final result), which triggers main()'s existing retry-without-session logic. The task-scheduler clears the resume position on failure so subsequent runs don't repeat the same stale-UUID error.

Key changes:

  • Stale session recovery (container/agent-runner/src/index.ts): Detection of the stale UUID error pattern, ipcPolling teardown, and throw to the existing retry path in main() is correct and well-integrated.
  • clearResumePosition scope (src/task-scheduler.ts): The clear is applied on any container error when context_mode === 'group', not just stale UUID errors. This means a transient failure (rate limit, timeout, OOM) also clears a valid resume cursor, causing the next scheduled run to start a fresh session and silently lose conversation context. The fix should condition the clear on detecting the specific stale-UUID error message.
  • Sender allowlist (src/sender-allowlist.ts, src/orchestrator.ts, src/index.ts): Clean port — fail-open, per-chat overrides, is_from_me bypass for bot-initiated triggers, drop/trigger mode separation, well-tested (14 tests).
  • Atomic task claims (src/group-queue.ts): runningTaskId prevents re-enqueue of an in-flight task. Correct.
  • Interval drift fix (src/task-scheduler.ts): computeNextRun() anchors to next_run and skips past missed intervals. Correct.
  • Plugin mount security hardening (src/mount-security.ts, src/container-mounts.ts): allowAbsoluteContainerPath option correctly relaxes the relative-path-only requirement for admin plugin mounts while still blocking .. traversal; host-path allowlist validation is unchanged.
  • --add-host networking, SDK bump, CJK fonts, ANTHROPIC_BASE_URL/ANTHROPIC_AUTH_TOKEN secrets, and test reorganization are all straightforward ports with no issues.

Confidence Score: 3/5

  • Safe to merge after reviewing the clearResumePosition scope — the fix is correct for stale UUID errors but has a side-effect of clearing valid cursors on unrelated transient failures.
  • The core stale-session recovery path (agent-runner throw → main() retry) is well-implemented and correct. The majority of upstream ports are clean. The one correctness concern is in task-scheduler.ts: clearResumePosition fires for all errors on group-context tasks, not only the targeted stale-UUID case, which can silently discard valid resume positions on transient failures for recurring tasks. Everything else — sender allowlist, atomic task claims, drift fix, mount security — looks solid.
  • src/task-scheduler.ts (lines 202–205: clearResumePosition scope)

Important Files Changed

Filename Overview
container/agent-runner/src/index.ts Adds stale resumeAt detection: throws on "No message found with message.uuid" to trigger a clean retry in main(). main() already has the retry-without-session logic, so the integration is solid. ipcPolling is set to false before throwing, which is correct since the session is being abandoned.
src/task-scheduler.ts Extracts computeNextRun() with drift-anchored interval logic (correct), and clears resumePosition on error. The clear is unconditionally applied to all errors for group-context tasks, not just stale UUID errors — this is the main correctness concern in this PR.
src/orchestrator.ts Loads sender allowlist once in constructor and applies isTriggerAllowed() check alongside the trigger pattern in both processGroupMessages() and startMessageLoop(). is_from_me messages bypass the allowlist (bots can always trigger themselves), which is correct. Integration is clean.
src/sender-allowlist.ts New module implementing per-chat sender allowlist with trigger/drop modes. Fail-open design (missing file defaults to allow-all), valid JSON/schema validation, per-chat overrides, and logDenied flag. Logic is correct and well-tested.
src/mount-security.ts Adds allowAbsoluteContainerPath option for admin plugin mounts. When set, skips the relative-path-only check but still blocks ".." traversal and empty paths. Host-path allowlist validation continues unconditionally. Change is narrowly scoped and correctly gated.
src/container-mounts.ts Plugin mounts now validated per-mount via validateMount() with allowAbsoluteContainerPath:true, replacing the batch validateAdditionalMounts() call. Absolute container paths are used as-is; relative ones get the /workspace/extra/ prefix. Rejected mounts are logged and skipped rather than crashing.
src/group-queue.ts Adds runningTaskId field to GroupState to prevent the scheduler from re-enqueuing a task that's already executing. Set at runTask() start, cleared at completion. Correctly handles the race condition described in upstream f794185.
src/db/messages.ts Adds is_from_me to both getNewMessages and getMessagesSince SELECT queries and maps the SQLite integer result to a boolean. This is required for the orchestrator's sender allowlist to correctly pass-through bot-sent trigger messages.

Sequence Diagram

sequenceDiagram
    participant Scheduler as task-scheduler.ts
    participant Queue as group-queue.ts
    participant Runner as container-runner.ts
    participant Agent as agent-runner (container)
    participant Orch as orchestrator.ts

    Scheduler->>Queue: "enqueueTask(taskId, fn)"
    Note over Queue: "Check runningTaskId — skip if already running"
    Queue->>Queue: "state.runningTaskId = taskId"
    Queue->>Runner: "runContainerAgent(group, prompt, resumeAt)"
    Runner->>Agent: "spawn container with resumeAt UUID"

    alt stale resumeAt UUID
        Agent->>Agent: "detect No message found with message.uuid"
        Agent->>Agent: "ipcPolling=false, throw"
        Agent->>Agent: "main() retries without resumeAt"
        Agent-->>Runner: "status: error"
        Runner-->>Scheduler: "output.status === error"
        Scheduler->>Orch: "clearResumePosition(group_folder)"
    else success
        Agent-->>Runner: "status: success, resumeAt: newUUID"
        Runner-->>Scheduler: "completed"
    end

    Queue->>Queue: "state.runningTaskId = null"
    Scheduler->>Scheduler: "computeNextRun(task) drift-anchored"
Loading

Last reviewed commit: 2432841

Comment on lines 202 to +205

const durationMs = Date.now() - startTime;

// Clear stale resume position on error so subsequent tasks don't hit the same
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resume position cleared on any error, not just stale UUID

The clearResumePosition call is guarded only by error && context_mode === 'group', which means it fires for every container failure — API rate limits, network timeouts, container OOM, etc. — not only the stale resumeAt UUID case this fix is targeting.

For a recurring interval/cron task whose last run was cut short by a transient error, the resume position it had established is still valid. Clearing it here causes the next run to start a brand-new session instead of continuing from the last known cursor, silently discarding conversation context that would otherwise be available.

The agent-runner already writes the stale-UUID error message into output.error. A targeted fix would only clear when that specific message is detected:

const isStaleSession = error != null && /No message found with message\.uuid/i.test(error);
if (isStaleSession && task.context_mode === 'group') {
  deps.clearResumePosition(task.group_folder);
}

This matches the intent described in the comment above the block and avoids discarding valid cursors on unrelated failures.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/task-scheduler.ts
Line: 202-205

Comment:
**Resume position cleared on any error, not just stale UUID**

The `clearResumePosition` call is guarded only by `error && context_mode === 'group'`, which means it fires for every container failure — API rate limits, network timeouts, container OOM, etc. — not only the stale `resumeAt` UUID case this fix is targeting.

For a recurring interval/cron task whose last run was cut short by a transient error, the resume position it had established is still valid. Clearing it here causes the _next_ run to start a brand-new session instead of continuing from the last known cursor, silently discarding conversation context that would otherwise be available.

The agent-runner already writes the stale-UUID error message into `output.error`. A targeted fix would only clear when that specific message is detected:

```typescript
const isStaleSession = error != null && /No message found with message\.uuid/i.test(error);
if (isStaleSession && task.context_mode === 'group') {
  deps.clearResumePosition(task.group_folder);
}
```

This matches the intent described in the comment above the block and avoids discarding valid cursors on unrelated failures.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant