fix: stale resumeAt recovery + upstream improvements by TerrifiedBug · Pull Request #56 · TerrifiedBug/nanotars

TerrifiedBug · 2026-03-11T11:45:29Z

Summary

Fix stale resumeAt UUID in scheduled tasks: When the Claude API can't find a resumeAt message UUID, the agent-runner now throws to trigger fresh-session retry instead of surfacing the error. The task-scheduler also clears stale positions on failure so subsequent tasks don't repeat the same error.
Port upstream improvements: SDK bump, sender allowlist, mount security hardening, test reorganization into __tests__/ subdirectories.
Housekeeping: Removed gog-reauth.sh from core repo (belongs in plugin), cleaned up gitignore.

Test plan

npm run build passes
Existing tests pass (group-queue, etc.)
Verify scheduled tasks recover after stale session error (rebuild container + restart)

Move 17 test files out of src/ root into organized directories: - src/__tests__/ for core module tests - src/ipc/__tests__/ for IPC module tests - src/db/__tests__/ for database module tests Update relative imports in all moved files accordingly.

Upstream feature ports: - Sender allowlist for per-chat access control - Secret redaction hardening (third-party model env vars) - Duplicate scheduled task prevention via runningTaskId tracking - Task scheduler drift prevention with computeNextRun - CJK font support in container Dockerfile - WhatsApp message handler error wrapping New: - scripts/gog-reauth.sh for headless OAuth re-authentication with multi-account discovery and batch reauth support

Plugin helper script belongs in nanoclaw-skills marketplace only.

Plugin-deployed helper scripts (e.g. gog-reauth.sh) belong in the marketplace repo, not the core repo. Prevent accidental commits.

This reverts commit 38a9ed3.

When the Claude API can't find a resumeAt message UUID, it returns an error result (not an exception). The agent-runner's retry logic only handled exceptions, so stale sessions were never recovered. Additionally the task-scheduler never cleared the stale position, causing every subsequent scheduled task for the group to fail with the same UUID. Agent-runner: detect "No message found with message.uuid" in error results and throw to trigger the existing fresh-session retry path. Task-scheduler: clear resumeAt from orchestrator state after any context_mode=group task error, preventing repeated failures.

greptile-apps · 2026-03-11T11:49:43Z

Greptile Summary

This PR fixes stale resumeAt recovery for scheduled tasks and ports several upstream improvements. The agent-runner now throws when it detects a "No message found with message.uuid" error (instead of writing it to stdout as the final result), which triggers main()'s existing retry-without-session logic. The task-scheduler clears the resume position on failure so subsequent runs don't repeat the same stale-UUID error.

Key changes:

Stale session recovery (container/agent-runner/src/index.ts): Detection of the stale UUID error pattern, ipcPolling teardown, and throw to the existing retry path in main() is correct and well-integrated.
clearResumePosition scope (src/task-scheduler.ts): The clear is applied on any container error when context_mode === 'group', not just stale UUID errors. This means a transient failure (rate limit, timeout, OOM) also clears a valid resume cursor, causing the next scheduled run to start a fresh session and silently lose conversation context. The fix should condition the clear on detecting the specific stale-UUID error message.
Sender allowlist (src/sender-allowlist.ts, src/orchestrator.ts, src/index.ts): Clean port — fail-open, per-chat overrides, is_from_me bypass for bot-initiated triggers, drop/trigger mode separation, well-tested (14 tests).
Atomic task claims (src/group-queue.ts): runningTaskId prevents re-enqueue of an in-flight task. Correct.
Interval drift fix (src/task-scheduler.ts): computeNextRun() anchors to next_run and skips past missed intervals. Correct.
Plugin mount security hardening (src/mount-security.ts, src/container-mounts.ts): allowAbsoluteContainerPath option correctly relaxes the relative-path-only requirement for admin plugin mounts while still blocking .. traversal; host-path allowlist validation is unchanged.
--add-host networking, SDK bump, CJK fonts, ANTHROPIC_BASE_URL/ANTHROPIC_AUTH_TOKEN secrets, and test reorganization are all straightforward ports with no issues.

Confidence Score: 3/5

Safe to merge after reviewing the clearResumePosition scope — the fix is correct for stale UUID errors but has a side-effect of clearing valid cursors on unrelated transient failures.
The core stale-session recovery path (agent-runner throw → main() retry) is well-implemented and correct. The majority of upstream ports are clean. The one correctness concern is in task-scheduler.ts: clearResumePosition fires for all errors on group-context tasks, not only the targeted stale-UUID case, which can silently discard valid resume positions on transient failures for recurring tasks. Everything else — sender allowlist, atomic task claims, drift fix, mount security — looks solid.
src/task-scheduler.ts (lines 202–205: clearResumePosition scope)

Important Files Changed

Filename	Overview
container/agent-runner/src/index.ts	Adds stale resumeAt detection: throws on "No message found with message.uuid" to trigger a clean retry in main(). main() already has the retry-without-session logic, so the integration is solid. ipcPolling is set to false before throwing, which is correct since the session is being abandoned.
src/task-scheduler.ts	Extracts computeNextRun() with drift-anchored interval logic (correct), and clears resumePosition on error. The clear is unconditionally applied to all errors for group-context tasks, not just stale UUID errors — this is the main correctness concern in this PR.
src/orchestrator.ts	Loads sender allowlist once in constructor and applies isTriggerAllowed() check alongside the trigger pattern in both processGroupMessages() and startMessageLoop(). is_from_me messages bypass the allowlist (bots can always trigger themselves), which is correct. Integration is clean.
src/sender-allowlist.ts	New module implementing per-chat sender allowlist with trigger/drop modes. Fail-open design (missing file defaults to allow-all), valid JSON/schema validation, per-chat overrides, and logDenied flag. Logic is correct and well-tested.
src/mount-security.ts	Adds allowAbsoluteContainerPath option for admin plugin mounts. When set, skips the relative-path-only check but still blocks ".." traversal and empty paths. Host-path allowlist validation continues unconditionally. Change is narrowly scoped and correctly gated.
src/container-mounts.ts	Plugin mounts now validated per-mount via validateMount() with allowAbsoluteContainerPath:true, replacing the batch validateAdditionalMounts() call. Absolute container paths are used as-is; relative ones get the /workspace/extra/ prefix. Rejected mounts are logged and skipped rather than crashing.
src/group-queue.ts	Adds runningTaskId field to GroupState to prevent the scheduler from re-enqueuing a task that's already executing. Set at runTask() start, cleared at completion. Correctly handles the race condition described in upstream `f794185`.
src/db/messages.ts	Adds is_from_me to both getNewMessages and getMessagesSince SELECT queries and maps the SQLite integer result to a boolean. This is required for the orchestrator's sender allowlist to correctly pass-through bot-sent trigger messages.

Sequence Diagram

sequenceDiagram
    participant Scheduler as task-scheduler.ts
    participant Queue as group-queue.ts
    participant Runner as container-runner.ts
    participant Agent as agent-runner (container)
    participant Orch as orchestrator.ts

    Scheduler->>Queue: "enqueueTask(taskId, fn)"
    Note over Queue: "Check runningTaskId — skip if already running"
    Queue->>Queue: "state.runningTaskId = taskId"
    Queue->>Runner: "runContainerAgent(group, prompt, resumeAt)"
    Runner->>Agent: "spawn container with resumeAt UUID"

    alt stale resumeAt UUID
        Agent->>Agent: "detect No message found with message.uuid"
        Agent->>Agent: "ipcPolling=false, throw"
        Agent->>Agent: "main() retries without resumeAt"
        Agent-->>Runner: "status: error"
        Runner-->>Scheduler: "output.status === error"
        Scheduler->>Orch: "clearResumePosition(group_folder)"
    else success
        Agent-->>Runner: "status: success, resumeAt: newUUID"
        Runner-->>Scheduler: "completed"
    end

    Queue->>Queue: "state.runningTaskId = null"
    Scheduler->>Scheduler: "computeNextRun(task) drift-anchored"

_{Last reviewed commit: 2432841}

greptile-apps · 2026-03-11T11:49:47Z

src/task-scheduler.ts


  const durationMs = Date.now() - startTime;

+  // Clear stale resume position on error so subsequent tasks don't hit the same


Resume position cleared on any error, not just stale UUID

The clearResumePosition call is guarded only by error && context_mode === 'group', which means it fires for every container failure — API rate limits, network timeouts, container OOM, etc. — not only the stale resumeAt UUID case this fix is targeting.

For a recurring interval/cron task whose last run was cut short by a transient error, the resume position it had established is still valid. Clearing it here causes the next run to start a brand-new session instead of continuing from the last known cursor, silently discarding conversation context that would otherwise be available.

The agent-runner already writes the stale-UUID error message into output.error. A targeted fix would only clear when that specific message is detected:

const isStaleSession = error != null && /No message found with message\.uuid/i.test(error); if (isStaleSession && task.context_mode === 'group') { deps.clearResumePosition(task.group_folder); }

This matches the intent described in the comment above the block and avoids discarding valid cursors on unrelated failures.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/task-scheduler.ts Line: 202-205 Comment: **Resume position cleared on any error, not just stale UUID** The `clearResumePosition` call is guarded only by `error && context_mode === 'group'`, which means it fires for every container failure — API rate limits, network timeouts, container OOM, etc. — not only the stale `resumeAt` UUID case this fix is targeting. For a recurring interval/cron task whose last run was cut short by a transient error, the resume position it had established is still valid. Clearing it here causes the _next_ run to start a brand-new session instead of continuing from the last known cursor, silently discarding conversation context that would otherwise be available. The agent-runner already writes the stale-UUID error message into `output.error`. A targeted fix would only clear when that specific message is detected: ```typescript const isStaleSession = error != null && /No message found with message\.uuid/i.test(error); if (isStaleSession && task.context_mode === 'group') { deps.clearResumePosition(task.group_folder); } ``` This matches the intent described in the comment above the block and avoids discarding valid cursors on unrelated failures. How can I resolve this? If you propose a fix, please make it concise.

TARS added 6 commits February 26, 2026 22:20

remove gog-reauth.sh from core repo

bcb8dd9

Plugin helper script belongs in nanoclaw-skills marketplace only.

chore: gitignore scripts/ directory

38a9ed3

Plugin-deployed helper scripts (e.g. gog-reauth.sh) belong in the marketplace repo, not the core repo. Prevent accidental commits.

Revert "chore: gitignore scripts/ directory"

e3b76a9

This reverts commit 38a9ed3.

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: stale resumeAt recovery + upstream improvements#56

fix: stale resumeAt recovery + upstream improvements#56
TerrifiedBug wants to merge 6 commits intomainfrom
dev

TerrifiedBug commented Mar 11, 2026

Uh oh!

greptile-apps bot commented Mar 11, 2026

Uh oh!

greptile-apps bot Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		const durationMs = Date.now() - startTime;

		// Clear stale resume position on error so subsequent tasks don't hit the same

Conversation

TerrifiedBug commented Mar 11, 2026

Summary

Test plan

Uh oh!

greptile-apps bot commented Mar 11, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant