Feature request: first-class detached ("background") sub-agent runs with a durable completion hook

## Feature request: first-class detached ("background") sub-agent runs + completion notification

### Context

We build on the Agents SDK's sub-agent primitive (`runAgentTool` / a sub-agent facet). It's great: each sub-agent run gets its own SQLite-backed transcript, a run row in `cf_agent_tool_runs`, live `agent-tool-event` broadcast to clients, child recovery, and the parent's `onAgentToolStart` / `onAgentToolFinish` lifecycle hooks (we use the latter to merge child token cost into the parent).

The one thing it doesn't model is a **detached run**: dispatch a sub-agent, let the parent's turn continue, and get notified when the child finishes. `runAgentTool` is await-shaped — the parent turn is parked for the child's entire lifecycle.

### Motivation (why we needed this)

Long-running delegated work blocks the agent. Our worst case is **data import**: our `load_data` task ingests a user-uploaded spreadsheet (spin-up → schema design → multi-minute ingest → verification). On an 80k-row file that's **4–6 minutes of serialized wall-clock** during which the top-level agent can't start building the app, even though building doesn't depend on the load finishing.

Letting the import run detached while the parent builds schema/UI/queries, then folding the result back in, was a large, measurable win:

- request fully done **257s → 171s (−33%)**
- app live **235s → 131s (−44%)**
- wall-clock spread **163s → 18s** (child hiccups — including ingest retries — disappear into the overlap instead of landing on the user's wait)

This generalizes well beyond imports to any "kick off slow work, keep going, reconcile on completion" pattern (builds, batch jobs, long external calls).

### What we built on top of the SDK (our workaround)

We got it working *entirely* with existing SDK primitives, which is the encouraging part — but it took ~200 lines of glue we'd rather not own:

1. **Detached dispatch.** Mint the `runId` ourselves (`RunAgentToolOptions` already accepts a caller-supplied id), call `runAgentTool(SubAgentClass, …)` **without awaiting**, and return `{ runId, status: "running" }` to the model immediately. The framework handles the run identically either way — run row, UI broadcast, recovery, and `onAgentToolStart/Finish` all fire whether or not the dispatch promise is held. We deliberately thread **no** `AbortSignal` (the child must outlive the spawning turn).

2. **Completion notification via two redundant paths.**
   - **Durable backbone:** a scheduled callback (`this.schedule`, 20s cadence, ~2h budget) that calls `inspectAgentToolRun(runId)` on the child facet until it reports a terminal status. This survives **parent DO eviction** — we rely on the framework's reconciler lazily materializing a post-eviction child terminal.
   - **Fast path:** a `waitUntil` off the floating dispatch promise to cut latency while the parent isolate stays alive.

3. **Idempotent delivery.** Both paths submit the completion message through the SDK's durable `submitMessages`, keyed `bg-task:<sessionId>:<runId>`, so overlap and eviction can't double-notify. The synthetic message is injected as the parent's next-turn input (queues FIFO behind any running turn) and hidden from the rendered transcript via a `metadata.source` filter.

The sharp edges we had to discover the hard way (and would love the framework to own):

- A single `null` from `inspectAgentToolRun` is **not** proof the run is gone — the poll can race the child facet's first write. Insta-failing on the first null turned every race into a spurious "outcome unconfirmed" notification (a real prod incident for us). We now tolerate N consecutive nulls.
- The "gave up polling" notification needs a **separate** idempotency key from the success notification — when they shared one, a premature give-up consumed the key and the child's *real* summary (delivered later by the fast path) was silently deduped away.

### What we'd like from the framework

First-class support for detached sub-agent runs, so consumers don't each re-implement the poll/fast-path/idempotency/race-handling glue:

1. **A supported "detached" mode on `runAgentTool`** — e.g. `runAgentTool(Cls, { input, detached: true })` returning `{ runId }` without awaiting, with a documented guarantee that the full run lifecycle (run row, broadcast, recovery, `onAgentToolFinish`, cost) fires regardless.

2. **A durable completion hook for detached runs** that is guaranteed to fire exactly once even across parent DO eviction — today `onAgentToolFinish` only reliably fires on the awaited path, which is why we built the poll. Ideally the framework owns the eviction-survival + dedupe and just calls us back.

3. **Documented contract for `inspectAgentToolRun` after eviction** — the lazy reconcile behavior we depend on, plus the `null`-means-maybe-racing semantics, should be specified rather than reverse-engineered.

We're happy to share our implementation or open a PR if there's interest in a particular shape. Thanks!


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: first-class detached ("background") sub-agent runs with a durable completion hook #1752

Feature request: first-class detached ("background") sub-agent runs + completion notification

Context

Motivation (why we needed this)

What we built on top of the SDK (our workaround)

What we'd like from the framework

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Feature request: first-class detached ("background") sub-agent runs with a durable completion hook #1752

Description

Feature request: first-class detached ("background") sub-agent runs + completion notification

Context

Motivation (why we needed this)

What we built on top of the SDK (our workaround)

What we'd like from the framework

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions