fix: stale resumeAt recovery + upstream improvements #56

TerrifiedBug wants to merge 6 commits into main
Conversation
Move 17 test files out of src/ root into organized directories:
- src/__tests__/ for core module tests
- src/ipc/__tests__/ for IPC module tests
- src/db/__tests__/ for database module tests

Update relative imports in all moved files accordingly.
Upstream feature ports:
- Sender allowlist for per-chat access control
- Secret redaction hardening (third-party model env vars)
- Duplicate scheduled task prevention via runningTaskId tracking
- Task scheduler drift prevention with computeNextRun
- CJK font support in container Dockerfile
- WhatsApp message handler error wrapping

New:
- scripts/gog-reauth.sh for headless OAuth re-authentication with multi-account discovery and batch reauth support
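The drift-prevention port listed above can be sketched as follows. `computeNextRun` is named in this PR, but the signature, millisecond units, and interval-only handling here are illustrative assumptions, not the actual implementation:

```typescript
// Sketch of drift-anchored scheduling for interval tasks (assumed shape).
// Naive "now + interval" accumulates drift whenever a run starts late;
// anchoring to the original schedule picks the next tick after "now" instead.
function computeNextRun(anchorMs: number, intervalMs: number, nowMs: number): number {
  const elapsed = nowMs - anchorMs;
  // Whole intervals already elapsed since the anchor, then advance one tick.
  const ticks = Math.floor(elapsed / intervalMs) + 1;
  return anchorMs + ticks * intervalMs;
}
```

With an anchor at t=0 and a 1000 ms interval, a run that finishes late at t=2500 still schedules the next run for t=3000, keeping the cadence aligned to the anchor rather than to each run's completion time.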
Plugin helper script belongs in nanoclaw-skills marketplace only.
Plugin-deployed helper scripts (e.g. gog-reauth.sh) belong in the marketplace repo, not the core repo. Prevent accidental commits.
This reverts commit 38a9ed3.
When the Claude API can't find a resumeAt message UUID, it returns an error result (not an exception). The agent-runner's retry logic only handled exceptions, so stale sessions were never recovered. Additionally, the task-scheduler never cleared the stale position, causing every subsequent scheduled task for the group to fail with the same UUID.

Agent-runner: detect "No message found with message.uuid" in error results and throw to trigger the existing fresh-session retry path.

Task-scheduler: clear resumeAt from orchestrator state after any context_mode=group task error, preventing repeated failures.
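A minimal sketch of the agent-runner change described above; the result shape and helper name are hypothetical, and only the error string and the detect-then-throw behavior come from this PR:

```typescript
// Hypothetical shape of an error *result* (not an exception) from the API wrapper.
interface AgentResult {
  status: 'success' | 'error';
  error?: string;
}

// Convert the stale-session error result into a thrown error so the
// existing exception-based retry path can start a fresh session.
function throwIfStaleSession(result: AgentResult): void {
  if (
    result.status === 'error' &&
    result.error != null &&
    /No message found with message\.uuid/i.test(result.error)
  ) {
    throw new Error(`stale resumeAt session: ${result.error}`);
  }
}
```

Non-matching error results pass through unchanged, so unrelated failures (rate limits, timeouts) keep their original handling.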
Greptile Summary

This PR fixes stale resumeAt recovery. Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Scheduler as task-scheduler.ts
    participant Queue as group-queue.ts
    participant Runner as container-runner.ts
    participant Agent as agent-runner (container)
    participant Orch as orchestrator.ts
    Scheduler->>Queue: "enqueueTask(taskId, fn)"
    Note over Queue: "Check runningTaskId — skip if already running"
    Queue->>Queue: "state.runningTaskId = taskId"
    Queue->>Runner: "runContainerAgent(group, prompt, resumeAt)"
    Runner->>Agent: "spawn container with resumeAt UUID"
    alt stale resumeAt UUID
        Agent->>Agent: "detect No message found with message.uuid"
        Agent->>Agent: "ipcPolling=false, throw"
        Agent->>Agent: "main() retries without resumeAt"
        Agent-->>Runner: "status: error"
        Runner-->>Scheduler: "output.status === error"
        Scheduler->>Orch: "clearResumePosition(group_folder)"
    else success
        Agent-->>Runner: "status: success, resumeAt: newUUID"
        Runner-->>Scheduler: "completed"
    end
    Queue->>Queue: "state.runningTaskId = null"
    Scheduler->>Scheduler: "computeNextRun(task) drift-anchored"
```
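The duplicate-prevention step in the diagram can be sketched as a small guard in the queue. The state shape and function signature below are assumptions for illustration; only the runningTaskId tracking itself is from this PR:

```typescript
// Hypothetical per-group queue state; only runningTaskId tracking is from the PR.
interface GroupQueueState {
  runningTaskId: string | null;
}

// Runs the task unless the same taskId is already in flight.
// Returns true if the task ran, false if it was skipped as a duplicate.
async function enqueueTask(
  state: GroupQueueState,
  taskId: string,
  fn: () => Promise<void>,
): Promise<boolean> {
  if (state.runningTaskId === taskId) return false;
  state.runningTaskId = taskId;
  try {
    await fn();
    return true;
  } finally {
    // Always clear, even on error, so the task can run again on the next tick.
    state.runningTaskId = null;
  }
}
```

Clearing in a `finally` block matches the diagram's unconditional `state.runningTaskId = null` step: a failed run must not permanently block the task.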
Last reviewed commit: 2432841

```typescript
    const durationMs = Date.now() - startTime;

    // Clear stale resume position on error so subsequent tasks don't hit the same
```
Resume position cleared on any error, not just stale UUID
The clearResumePosition call is guarded only by error && context_mode === 'group', which means it fires for every container failure — API rate limits, network timeouts, container OOM, etc. — not only the stale resumeAt UUID case this fix is targeting.
For a recurring interval/cron task whose last run was cut short by a transient error, the resume position it had established is still valid. Clearing it here causes the next run to start a brand-new session instead of continuing from the last known cursor, silently discarding conversation context that would otherwise be available.
The agent-runner already writes the stale-UUID error message into output.error. A targeted fix would only clear when that specific message is detected:
```typescript
const isStaleSession = error != null && /No message found with message\.uuid/i.test(error);
if (isStaleSession && task.context_mode === 'group') {
  deps.clearResumePosition(task.group_folder);
}
```

This matches the intent described in the comment above the block and avoids discarding valid cursors on unrelated failures.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/task-scheduler.ts
Line: 202-205
Comment:
**Resume position cleared on any error, not just stale UUID**
The `clearResumePosition` call is guarded only by `error && context_mode === 'group'`, which means it fires for every container failure — API rate limits, network timeouts, container OOM, etc. — not only the stale `resumeAt` UUID case this fix is targeting.
For a recurring interval/cron task whose last run was cut short by a transient error, the resume position it had established is still valid. Clearing it here causes the _next_ run to start a brand-new session instead of continuing from the last known cursor, silently discarding conversation context that would otherwise be available.
The agent-runner already writes the stale-UUID error message into `output.error`. A targeted fix would only clear when that specific message is detected:
```typescript
const isStaleSession = error != null && /No message found with message\.uuid/i.test(error);
if (isStaleSession && task.context_mode === 'group') {
deps.clearResumePosition(task.group_folder);
}
```
This matches the intent described in the comment above the block and avoids discarding valid cursors on unrelated failures.
How can I resolve this? If you propose a fix, please make it concise.
Summary
- When the Claude API can't find the resumeAt message UUID, the agent-runner now throws to trigger fresh-session retry instead of surfacing the error. The task-scheduler also clears stale positions on failure so subsequent tasks don't repeat the same error.
- Moved 17 test files into __tests__/ subdirectories.
- Removed gog-reauth.sh from core repo (belongs in plugin), cleaned up gitignore.

Test plan
- npm run build passes
- Tests pass (group-queue, etc.)