Feature request: first-class detached ("background") sub-agent runs + completion notification
Context
We build on the Agents SDK's sub-agent primitive (runAgentTool / a sub-agent facet). It's great: each sub-agent run gets its own SQLite-backed transcript, a run row in cf_agent_tool_runs, live agent-tool-event broadcast to clients, child recovery, and the parent's onAgentToolStart / onAgentToolFinish lifecycle hooks (we use the latter to merge child token cost into the parent).
The one thing it doesn't model is a detached run: dispatch a sub-agent, let the parent's turn continue, and get notified when the child finishes. runAgentTool is await-shaped — the parent turn is parked for the child's entire lifecycle.
Motivation (why we needed this)
Long-running delegated work blocks the agent. Our worst case is data import: our load_data task ingests a user-uploaded spreadsheet (spin-up → schema design → multi-minute ingest → verification). On an 80k-row file that's 4–6 minutes of serialized wall-clock during which the top-level agent can't start building the app, even though building doesn't depend on the load finishing.
Letting the import run detached while the parent builds schema/UI/queries, then folding the result back in, was a large, measurable win:
- request fully done 257s → 171s (−33%)
- app live 235s → 131s (−44%)
- wall-clock spread 163s → 18s (child hiccups — including ingest retries — disappear into the overlap instead of landing on the user's wait)
This generalizes well beyond imports to any "kick off slow work, keep going, reconcile on completion" pattern (builds, batch jobs, long external calls).
What we built on top of the SDK (our workaround)
We got it working entirely with existing SDK primitives, which is the encouraging part — but it took ~200 lines of glue we'd rather not own:
-
Detached dispatch. Mint the runId ourselves (RunAgentToolOptions already accepts a caller-supplied id), call runAgentTool(SubAgentClass, …) without awaiting, and return { runId, status: "running" } to the model immediately. The framework handles the run identically either way — run row, UI broadcast, recovery, and onAgentToolStart/Finish all fire whether or not the dispatch promise is held. We deliberately thread no AbortSignal (the child must outlive the spawning turn).
-
Completion notification via two redundant paths.
- Durable backbone: a scheduled callback (
this.schedule, 20s cadence, ~2h budget) that calls inspectAgentToolRun(runId) on the child facet until it reports a terminal status. This survives parent DO eviction — we rely on the framework's reconciler lazily materializing a post-eviction child terminal.
- Fast path: a
waitUntil off the floating dispatch promise to cut latency while the parent isolate stays alive.
-
Idempotent delivery. Both paths submit the completion message through the SDK's durable submitMessages, keyed bg-task:<sessionId>:<runId>, so overlap and eviction can't double-notify. The synthetic message is injected as the parent's next-turn input (queues FIFO behind any running turn) and hidden from the rendered transcript via a metadata.source filter.
The sharp edges we had to discover the hard way (and would love the framework to own):
- A single
null from inspectAgentToolRun is not proof the run is gone — the poll can race the child facet's first write. Insta-failing on the first null turned every race into a spurious "outcome unconfirmed" notification (a real prod incident for us). We now tolerate N consecutive nulls.
- The "gave up polling" notification needs a separate idempotency key from the success notification — when they shared one, a premature give-up consumed the key and the child's real summary (delivered later by the fast path) was silently deduped away.
What we'd like from the framework
First-class support for detached sub-agent runs, so consumers don't each re-implement the poll/fast-path/idempotency/race-handling glue:
-
A supported "detached" mode on runAgentTool — e.g. runAgentTool(Cls, { input, detached: true }) returning { runId } without awaiting, with a documented guarantee that the full run lifecycle (run row, broadcast, recovery, onAgentToolFinish, cost) fires regardless.
-
A durable completion hook for detached runs that is guaranteed to fire exactly once even across parent DO eviction — today onAgentToolFinish only reliably fires on the awaited path, which is why we built the poll. Ideally the framework owns the eviction-survival + dedupe and just calls us back.
-
Documented contract for inspectAgentToolRun after eviction — the lazy reconcile behavior we depend on, plus the null-means-maybe-racing semantics, should be specified rather than reverse-engineered.
We're happy to share our implementation or open a PR if there's interest in a particular shape. Thanks!
Feature request: first-class detached ("background") sub-agent runs + completion notification
Context
We build on the Agents SDK's sub-agent primitive (
runAgentTool/ a sub-agent facet). It's great: each sub-agent run gets its own SQLite-backed transcript, a run row incf_agent_tool_runs, liveagent-tool-eventbroadcast to clients, child recovery, and the parent'sonAgentToolStart/onAgentToolFinishlifecycle hooks (we use the latter to merge child token cost into the parent).The one thing it doesn't model is a detached run: dispatch a sub-agent, let the parent's turn continue, and get notified when the child finishes.
runAgentToolis await-shaped — the parent turn is parked for the child's entire lifecycle.Motivation (why we needed this)
Long-running delegated work blocks the agent. Our worst case is data import: our
load_datatask ingests a user-uploaded spreadsheet (spin-up → schema design → multi-minute ingest → verification). On an 80k-row file that's 4–6 minutes of serialized wall-clock during which the top-level agent can't start building the app, even though building doesn't depend on the load finishing.Letting the import run detached while the parent builds schema/UI/queries, then folding the result back in, was a large, measurable win:
This generalizes well beyond imports to any "kick off slow work, keep going, reconcile on completion" pattern (builds, batch jobs, long external calls).
What we built on top of the SDK (our workaround)
We got it working entirely with existing SDK primitives, which is the encouraging part — but it took ~200 lines of glue we'd rather not own:
Detached dispatch. Mint the
runIdourselves (RunAgentToolOptionsalready accepts a caller-supplied id), callrunAgentTool(SubAgentClass, …)without awaiting, and return{ runId, status: "running" }to the model immediately. The framework handles the run identically either way — run row, UI broadcast, recovery, andonAgentToolStart/Finishall fire whether or not the dispatch promise is held. We deliberately thread noAbortSignal(the child must outlive the spawning turn).Completion notification via two redundant paths.
this.schedule, 20s cadence, ~2h budget) that callsinspectAgentToolRun(runId)on the child facet until it reports a terminal status. This survives parent DO eviction — we rely on the framework's reconciler lazily materializing a post-eviction child terminal.waitUntiloff the floating dispatch promise to cut latency while the parent isolate stays alive.Idempotent delivery. Both paths submit the completion message through the SDK's durable
submitMessages, keyedbg-task:<sessionId>:<runId>, so overlap and eviction can't double-notify. The synthetic message is injected as the parent's next-turn input (queues FIFO behind any running turn) and hidden from the rendered transcript via ametadata.sourcefilter.The sharp edges we had to discover the hard way (and would love the framework to own):
nullfrominspectAgentToolRunis not proof the run is gone — the poll can race the child facet's first write. Insta-failing on the first null turned every race into a spurious "outcome unconfirmed" notification (a real prod incident for us). We now tolerate N consecutive nulls.What we'd like from the framework
First-class support for detached sub-agent runs, so consumers don't each re-implement the poll/fast-path/idempotency/race-handling glue:
A supported "detached" mode on
runAgentTool— e.g.runAgentTool(Cls, { input, detached: true })returning{ runId }without awaiting, with a documented guarantee that the full run lifecycle (run row, broadcast, recovery,onAgentToolFinish, cost) fires regardless.A durable completion hook for detached runs that is guaranteed to fire exactly once even across parent DO eviction — today
onAgentToolFinishonly reliably fires on the awaited path, which is why we built the poll. Ideally the framework owns the eviction-survival + dedupe and just calls us back.Documented contract for
inspectAgentToolRunafter eviction — the lazy reconcile behavior we depend on, plus thenull-means-maybe-racing semantics, should be specified rather than reverse-engineered.We're happy to share our implementation or open a PR if there's interest in a particular shape. Thanks!