Feature request: first-class detached ("background") sub-agent runs with a durable completion hook #1752

@rwdaigle

Feature request: first-class detached ("background") sub-agent runs + completion notification

Context

We build on the Agents SDK's sub-agent primitive (runAgentTool / a sub-agent facet). It's great: each sub-agent run gets its own SQLite-backed transcript, a run row in cf_agent_tool_runs, live agent-tool-event broadcast to clients, child recovery, and the parent's onAgentToolStart / onAgentToolFinish lifecycle hooks (we use the latter to merge child token cost into the parent).

The one thing it doesn't model is a detached run: dispatch a sub-agent, let the parent's turn continue, and get notified when the child finishes. runAgentTool is await-shaped — the parent turn is parked for the child's entire lifecycle.

Motivation (why we needed this)

Long-running delegated work blocks the agent. Our worst case is data import: our load_data task ingests a user-uploaded spreadsheet (spin-up → schema design → multi-minute ingest → verification). On an 80k-row file that's 4–6 minutes of serialized wall-clock during which the top-level agent can't start building the app, even though building doesn't depend on the load finishing.

Letting the import run detached while the parent builds schema/UI/queries, then folding the result back in, was a large, measurable win:

  • request fully done 257s → 171s (−33%)
  • app live 235s → 131s (−44%)
  • wall-clock spread 163s → 18s (child hiccups — including ingest retries — disappear into the overlap instead of landing on the user's wait)

This generalizes well beyond imports to any "kick off slow work, keep going, reconcile on completion" pattern (builds, batch jobs, long external calls).

What we built on top of the SDK (our workaround)

We got it working entirely with existing SDK primitives, which is the encouraging part — but it took ~200 lines of glue we'd rather not own:

  1. Detached dispatch. Mint the runId ourselves (RunAgentToolOptions already accepts a caller-supplied id), call runAgentTool(SubAgentClass, …) without awaiting, and return { runId, status: "running" } to the model immediately. The framework handles the run identically either way — run row, UI broadcast, recovery, and onAgentToolStart/Finish all fire whether or not the dispatch promise is held. We deliberately thread no AbortSignal (the child must outlive the spawning turn).

  2. Completion notification via two redundant paths.

    • Durable backbone: a scheduled callback (this.schedule, 20s cadence, ~2h budget) that calls inspectAgentToolRun(runId) on the child facet until it reports a terminal status. This survives parent DO eviction: we rely on the framework's reconciler lazily materializing the child's terminal status after the parent has been evicted.
    • Fast path: a waitUntil off the floating dispatch promise to cut latency while the parent isolate stays alive.

  3. Idempotent delivery. Both paths submit the completion message through the SDK's durable submitMessages, keyed bg-task:<sessionId>:<runId>, so overlap and eviction can't double-notify. The synthetic message is injected as the parent's next-turn input (queues FIFO behind any running turn) and hidden from the rendered transcript via a metadata.source filter.
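
For readers who want the shape of the three steps above without our glue, here is a minimal in-memory sketch. It is not our production code and it does not call the SDK: ChildRuns, Inbox, dispatchDetached, and pollOnce are hypothetical stand-ins for runAgentTool / inspectAgentToolRun / submitMessages, and their signatures are assumptions made for illustration only.

```typescript
// All names below are illustrative stand-ins, not the Agents SDK's real types.
type RunStatus = "running" | "completed" | "failed";
interface RunSnapshot { status: RunStatus; summary?: string }

// Assumed shape of the child-run surface the workaround relies on.
interface ChildRuns {
  start(runId: string, input: string): Promise<string>; // resolves with the child's summary
  inspect(runId: string): RunSnapshot | null;           // null: no run row visible (yet)
}

// Stand-in for a durable message queue that deduplicates on a caller-supplied key.
class Inbox {
  private seen = new Set<string>();
  readonly messages: string[] = [];
  submit(key: string, text: string): boolean {
    if (this.seen.has(key)) return false; // a second delivery with the same key is dropped
    this.seen.add(key);
    this.messages.push(text);
    return true;
  }
}

const completionKey = (sessionId: string, runId: string): string =>
  `bg-task:${sessionId}:${runId}`;

// Step 1 plus the fast path: start the child without awaiting it, return a handle at once.
function dispatchDetached(
  runs: ChildRuns, inbox: Inbox, sessionId: string, runId: string, input: string,
) {
  const fastPath = runs.start(runId, input).then(
    (summary) => { inbox.submit(completionKey(sessionId, runId), summary); },
    () => { /* leave failures to the durable poll */ },
  );
  return { handle: { runId, status: "running" as const }, fastPath };
}

// Durable path: one poll tick. Returns true once the run is terminal and delivered.
function pollOnce(runs: ChildRuns, inbox: Inbox, sessionId: string, runId: string): boolean {
  const snap = runs.inspect(runId);
  if (snap === null || snap.status === "running") return false;
  inbox.submit(completionKey(sessionId, runId), snap.summary ?? snap.status);
  return true;
}
```

Because both paths submit under the same completion key, whichever one observes the terminal state first wins and the other becomes a no-op. In the real system the scheduled callback would invoke something like pollOnce on each tick and the fast-path promise would be handed to waitUntil.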

The sharp edges we had to discover the hard way (and would love the framework to own):

  • A single null from inspectAgentToolRun is not proof the run is gone — the poll can race the child facet's first write. Insta-failing on the first null turned every race into a spurious "outcome unconfirmed" notification (a real prod incident for us). We now tolerate N consecutive nulls.
  • The "gave up polling" notification needs a separate idempotency key from the success notification — when they shared one, a premature give-up consumed the key and the child's real summary (delivered later by the fast path) was silently deduped away.
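
Both sharp edges can be captured in one pure decision function. The sketch below is illustrative only: decide, PollState, the threshold of 3 nulls, and the ":gave-up" key suffix are assumptions for this example (the issue text only says "N consecutive nulls" and "a separate idempotency key"); the 360-tick budget is simply ~2h divided by the 20s cadence.

```typescript
// Illustrative sketch; names, thresholds, and the key suffix are assumptions.
type Snapshot =
  | { status: "running" | "completed" | "failed"; summary?: string }
  | null; // null: the run row is not visible, which may just be a race

interface PollState { consecutiveNulls: number; ticks: number }

type PollDecision =
  | { kind: "keep-polling"; state: PollState }
  | { kind: "notify"; key: string; text: string };

const MAX_CONSECUTIVE_NULLS = 3; // assumed value for "N"
const MAX_TICKS = 360;           // ~2h budget at a 20s cadence

function decide(prev: PollState, snap: Snapshot, baseKey: string): PollDecision {
  const ticks = prev.ticks + 1;

  // Terminal status: deliver the real outcome under the base key.
  if (snap !== null && snap.status !== "running") {
    return { kind: "notify", key: baseKey, text: snap.summary ?? snap.status };
  }

  // A single null is tolerated; any non-null observation resets the streak.
  const consecutiveNulls = snap === null ? prev.consecutiveNulls + 1 : 0;

  if (consecutiveNulls >= MAX_CONSECUTIVE_NULLS || ticks >= MAX_TICKS) {
    // Give-up uses its own key so it cannot consume the success key
    // and cause a later real summary to be deduplicated away.
    return { kind: "notify", key: `${baseKey}:gave-up`, text: "outcome unconfirmed" };
  }
  return { kind: "keep-polling", state: { consecutiveNulls, ticks } };
}
```

Keeping the state in a plain object means it can be carried in the scheduled callback's payload between ticks rather than in isolate memory.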

What we'd like from the framework

First-class support for detached sub-agent runs, so consumers don't each re-implement the poll/fast-path/idempotency/race-handling glue:

  1. A supported "detached" mode on runAgentTool — e.g. runAgentTool(Cls, { input, detached: true }) returning { runId } without awaiting, with a documented guarantee that the full run lifecycle (run row, broadcast, recovery, onAgentToolFinish, cost) fires regardless.

  2. A durable completion hook for detached runs that is guaranteed to fire exactly once even across parent DO eviction — today onAgentToolFinish only reliably fires on the awaited path, which is why we built the poll. Ideally the framework owns the eviction-survival + dedupe and just calls us back.

  3. Documented contract for inspectAgentToolRun after eviction — the lazy reconcile behavior we depend on, plus the null-means-maybe-racing semantics, should be specified rather than reverse-engineered.
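
To make the ask concrete, here is one possible shape for items 1 and 2. Everything in this sketch is hypothetical: DetachedRunHandle, DetachedRunResult, DeliveredStore, and ExactlyOnceHook do not exist in the SDK today, and the in-memory store merely stands in for whatever eviction-surviving storage the framework would use.

```typescript
// Hypothetical API shape for discussion; none of these names exist in the SDK.
interface DetachedRunHandle { runId: string }
interface DetachedRunResult {
  runId: string;
  status: "completed" | "failed";
  summary?: string;
}
type OnDetachedFinish = (result: DetachedRunResult) => void;

// Minimal persisted-set surface. Assumed to be backed by storage that
// survives parent eviction; a Set is enough for this sketch.
interface DeliveredStore {
  has(runId: string): boolean;
  add(runId: string): void;
}

// Wraps the consumer's hook so that repeat deliveries for a runId are dropped,
// even when a fresh instance is created after an eviction.
class ExactlyOnceHook {
  private store: DeliveredStore;
  private handler: OnDetachedFinish;

  constructor(store: DeliveredStore, handler: OnDetachedFinish) {
    this.store = store;
    this.handler = handler;
  }

  deliver(result: DetachedRunResult): boolean {
    if (this.store.has(result.runId)) return false;
    this.store.add(result.runId); // mark first so a retry cannot re-fire the hook
    this.handler(result);
    return true;
  }
}
```

Marking before calling the handler gives at-most-once delivery if the handler throws; a true exactly-once guarantee would need the mark and the callback's side effects to commit atomically, which is the part we'd like the framework to own.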

We're happy to share our implementation or open a PR if there's interest in a particular shape. Thanks!
