Skip to content

Sub-agent runtime sub-issue: checkpoint and continue child work across turns #2029

@Hmbown

Description

@Hmbown

Parent

Sub-issue of #1806.

Problem

#1806 correctly identifies the hard 120s sub-agent API-response ceiling. The deeper product problem is that useful child work is treated as one final response instead of a resumable, checkpointed activity.

When a child needs more runtime or more than one thinking/execution pass, the parent has no clean way to:

  • See partial progress.
  • Continue the child for another turn.
  • Ask the child to checkpoint intermediate findings.
  • Recover useful work when final synthesis times out.
  • Route large results through files/handles instead of the final response body.

Evidence:

  • Public report Subject: Sub-agent 120s API timeout renders agent_open nearly unusable (v0.8.39) #1806 describes five parallel sub-agents that ran for about 12 minutes, then failed with a 120s API timeout and no result.
  • A redacted scan of maintainer-private CodeWhale logs found 19 spawn_agent calls, 13 wait_agent calls, and 11 close_agent calls.
  • 5 of the observed wait_agent calls used a timeout argument above 120s, but the API-response ceiling still matters because the child result path is final-response shaped.
  • 4 observed wait outputs returned an empty status before later completion, which is consistent with parent polling needing an explicit progress/continuation contract.

No prompts, raw child summaries, secrets, local paths, or transcript text are copied here.

Desired Behavior

Sub-agents should become resumable child work items rather than one-shot response generators.

Required shape:

  • Each child has a durable task id, role, budget, model/effort route, and status.
  • Child progress can be polled without requiring completion.
  • Child can emit checkpoint receipts: summary, files touched/read, evidence handles, blockers, next suggested action.
  • Parent can ask a child to continue for another turn with bounded context and explicit budget.
  • Child can write or hand off large output through approved handles/artifacts instead of the final response body.
  • Timeout should preserve last checkpoint and mark the child needs_continuation or timed_out_with_checkpoint, not simply failed with null result.

Acceptance Criteria

  • Sub-agent timeout, max runtime, and max turns are configurable under [subagents].
  • A child that exceeds one response window can resume from its last checkpoint.
  • Parent UI shows running, checkpointed, needs_continuation, failed, and complete distinctly.
  • agent_eval or equivalent continuation can send a follow-up assignment to the same child/session.
  • If final response times out, prior checkpoint data remains visible and inspectable.
  • Tests cover a child that checkpoints, times out, resumes, and completes without losing the intermediate summary.
  • Approval/sandbox/provider policies are enforced on every continuation turn.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcontextContext management / contextenhancementNew feature or request

    Projects

    Status

    Backlog

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions