From c90689cf8d90f459cb9e6b0eeaa92019371c55a4 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Thu, 18 Jun 2026 21:29:24 +0000
Subject: [PATCH 01/20] Add spec: atomic df.start() to eliminate df/_duroxide
 divergence

Proposes making the duroxide start-enqueue part of the caller's backend
transaction (Option 3) so df.start() is atomic: rollback leaves neither
df.* rows nor a _duroxide orchestrator-queue orphan, and the worker only
picks up an instance after its df.* rows commit.

Grounded in the current start path (single orchestrator_queue INSERT via
_duroxide.enqueue_orchestrator_work), the SECURITY DEFINER privilege model
needed for the enqueue, the resulting removal of the load-function-graph
5s race, alternatives considered, and an Upgrade & Migration section.
---
 docs/spec-atomic-start.md | 370 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 370 insertions(+)
 create mode 100644 docs/spec-atomic-start.md

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
new file mode 100644
index 0000000..06c9903
--- /dev/null
+++ b/docs/spec-atomic-start.md
@@ -0,0 +1,370 @@
+# Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence
+
+**Status:** Proposal
+**Author:** pg_durable team
+**Date:** June 2026
+
+## Overview
+
+`df.start()` writes to two independent stores that are **not** committed together:
+
+- the **pg_durable control plane** (`df.nodes`, `df.instances`) — written on the
+  caller's backend transaction via SPI; and
+- the **duroxide runtime** (`_duroxide.orchestrator_queue`) — enqueued
+  out-of-band on a *separate* sqlx connection pool, committed independently.
+
+Because the two halves are not atomic, the stores can diverge. The most common
+failure is the "rolled-back start" leak: a `df.start()` whose surrounding
+transaction is rolled back (or whose batch subtransaction aborts on a PK
+collision) leaves an **orphaned orchestration** in `_duroxide` with no
+corresponding `df.*` rows. The background worker then repeatedly tries to load a
+graph that does not exist, fails after a 5 s timeout, and the orphan persists in
+`_duroxide.instances`.
+
+This spec proposes making the duroxide start-enqueue part of the caller's
+backend transaction (design **Option 3** from the design exploration), so that
+`df.start()` is atomic: either everything commits, or nothing does.
+
+### Evidence (observed during investigation)
+
+- A `df.start()` executed inside a transaction that was then `ROLLBACK`-ed left a
+  row in `_duroxide.instances` (e.g. `48819f12`) with **zero** rows in
+  `df.instances`. The orphan survived the rollback.
+- Under load, orphaned start items self-sustain via retry and saturate the
+  2-thread worker with 5 s "Instance not found … transaction may have been
+  rolled back" failures.
+- The short-id PK-collision path (8 hex chars ≈ 2³²) reliably aborts `df.start()`
+  well under 1M instances (observed first-collision draws: 9248, 34845, 40445,
+  60776). That abort is *safe* today (it happens before the duroxide enqueue),
+  but it exercises the same dual-write seam.
+
+## Background: how a start is written today
+
+`crate::dsl::start()` (src/dsl.rs) executes, in order, on the **caller's**
+transaction unless noted:
+
+1. `instance_id = short_id()` (src/dsl.rs:645).
+2. Read-only validation (database exists, caller has `LOGIN`, superuser policy).
+3. `insert_nodes()` → INSERT rows into `df.nodes` (src/dsl.rs:842).
+4. INSERT row into `df.instances` (src/dsl.rs:888).
+5. Capture `df.vars` snapshot.
+6. `start_durable_function()` → `Client::start_orchestration()` →
+   `store.enqueue_for_orchestrator(StartOrchestration)` (src/dsl.rs:923) — runs on
+   the cached `DUROXIDE_CLIENT` sqlx pool (src/client.rs), **separate
+   connection, independently committed.** Failures here are logged, not raised
+   (src/dsl.rs:928).
+
+Key facts that make the fix small:
+
+- **The start is a single enqueue.** `Client::start_orchestration` builds one
+  `WorkItem::StartOrchestration` and calls `enqueue_for_orchestrator`
+  (duroxide `client/mod.rs`). The duroxide-pg provider implements that as one
+  call to the SQL function `_duroxide.enqueue_orchestrator_work(...)`, whose body
+  is a single `INSERT INTO _duroxide.orchestrator_queue`. The
+  `_duroxide.instances`/`history` rows are materialized **later** by the worker.
+- **The atomic unit is therefore just** `{df.nodes INSERTs, df.instances INSERT,
+  one orchestrator_queue INSERT}`.
+- **duroxide-pg is SQL-function based.** Every provider operation is a function
+  in the `_duroxide` schema, so the enqueue can be invoked directly via SPI on
+  the caller's transaction — no async pool required for the start path.
+
+### What each store owns (so we know what must *not* change)
+
+| Data | Source of truth | Notes |
+|------|-----------------|-------|
+| Graph (`df.nodes`) | `df.*` | duroxide does not store it; the worker loads it via the `load-function-graph` activity. |
+| Execution history | `_duroxide.history` | duroxide-internal; transactional per turn via `ack_orchestration_item`. |
+| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | needed by the worker to load + authenticate. |
+| `df.instances.status` | best-effort mirror | written by the `update-instance-status` activity; not atomic with duroxide, self-heals. |
+
+`submitted_by` is a **security boundary**: it is captured as `current_user` at
+`df.start()` time and pinned by a composite FK; the worker re-checks the
+superuser policy (src/activities/load_function_graph.rs). Any change must keep it
+unforgeable.
+
+## Goals
+
+- `df.start()` is **atomic with the caller's transaction**: on rollback, no
+  `df.*` rows *and* no `_duroxide` start item remain; on commit, both are present
+  and the worker picks the instance up only after the `df.*` rows are visible.
+- Eliminate the rolled-back-start orphan leak at the source.
+- Eliminate the `load-function-graph` "wait up to 5 s for the row to appear"
+  race (it becomes structurally impossible).
+- No change to the `df.*` schema's read surface (status/result/monitoring/explain
+  keep working unchanged).
+
+## Non-goals
+
+- Collapsing `df.*` into `_duroxide` (Option 1) — out of scope; tracked
+  separately as a possible long-term direction.
+- Fixing the best-effort `df.instances.status` mirror (it is eventually
+  consistent by design).
+- Changing short-id generation / the id space. (Independent; can be addressed
+  separately. This spec makes a collision a clean, fully-atomic abort.)
+- Migrating `df.signal()` / `df.cancel()` to the in-transaction path — same
+  mechanism, scoped as a follow-up (see Phasing).
+
+## Considered alternatives (summary)
+
+1. **Single source of truth in `_duroxide`** — remove `df.*`, serialize the graph
+   into the orchestration input, re-expose reads as views over `_duroxide`.
+   Strongest invariant but large; security/RLS/read-API rework and a destructive
+   migration. Rejected for this change.
+2. **duroxide-pg runs pg_durable's SQL inside *its* sqlx transaction** — makes
+   `df.*` ↔ `_duroxide` atomic but on the worker-pool connection, so `df.start`
+   stops participating in the caller's transaction, and the `df.*` writes run as
+   the pool role (breaks `current_user` capture). Strictly weaker than Option 3.
+3. **Run the enqueue inside the caller's backend transaction via SPI (this
+   spec)** — smallest change, strongest guarantee, keeps `df.*` and identity
+   capture intact.
+4. **Tolerate divergence + async GC/reconciler** — additive safety net; does not
+   prevent transient inconsistency. Recommended as a *complementary* backstop,
+   not a replacement.
+
+## Design
+
+### Mechanism
+
+In `crate::dsl::start()`, replace step 6 (the out-of-band
+`start_durable_function` pool call) with an **SPI call** that enqueues the start
+work item on the caller's transaction, immediately after the `df.instances`
+INSERT:
+
+```text
+-- still on the caller's transaction (SPI):
+INSERT INTO df.nodes ...                          -- step 3 (unchanged)
+INSERT INTO df.instances ...                      -- step 4 (unchanged)
+SELECT df.__enqueue_start($instance_id, $work_item_json, $visible_at);  -- step 6 (new)
+```
+
+Because the enqueue is now an SPI statement, it shares the caller's transaction:
+rollback removes the queue row along with the `df.*` rows; commit makes them
+visible together.
+
+### The work item
+
+pg_durable already depends on the `duroxide` crate, so it should **construct and
+serialize the real `WorkItem` type** rather than hand-rolling JSON, guaranteeing
+wire compatibility:
+
+```rust
+let item = duroxide::providers::WorkItem::StartOrchestration {
+    instance: instance_id.clone(),
+    orchestration: orchestrations::execute_function_graph::NAME.into(),
+    input: input_json,                       // existing FunctionInput JSON
+    version: None,                           // runtime resolves from registry
+    parent_instance: None,
+    parent_id: None,
+    execution_id: duroxide::INITIAL_EXECUTION_ID,   // = 1
+};
+let work_item_json = serde_json::to_string(&item)?;
+```
+
+`WorkItem` is externally tagged (default serde), i.e.
+`{"StartOrchestration":{...}}`. These are exactly the values the current
+`Client::start_orchestration` path uses, so worker behavior is unchanged.
+
+The enqueue target is the existing function (body is a single INSERT; the
+`p_orchestration_name/version/execution_id` params are accepted but unused — the
+work item is self-contained):
+
+```sql
+_duroxide.enqueue_orchestrator_work(
+    p_instance_id   text,
+    p_work_item     text,                 -- work_item_json
+    p_visible_at    timestamptz,          -- now() for immediate start
+    p_orchestration_name text DEFAULT NULL,
+    p_orchestration_version text DEFAULT NULL,
+    p_execution_id  bigint DEFAULT NULL)
+```
+
+### Privilege model (the one real wrinkle)
+
+`_duroxide.orchestrator_queue` grants INSERT to its **owner only** (the
+`pg_durable.worker_role`; `relacl` is empty → no PUBLIC), and
+`_duroxide.enqueue_orchestrator_work` is `SECURITY INVOKER`. `df.start()` is also
+`SECURITY INVOKER` and runs SPI as the calling user. Therefore a normal caller
+cannot insert into the queue directly — confirmed: for a fresh `LOGIN` role,
+`has_table_privilege(... 'INSERT') = false`.
+
+The enqueue must run through a **`SECURITY DEFINER`** entrypoint owned by a role
+that owns the `_duroxide` queue tables. Two placements:
+
+- **Recommended — provide it in duroxide-pg** (proper layering; pg_durable never
+  touches `_duroxide` internals). Add a stable, documented `SECURITY DEFINER`
+  client-enqueue function in the `_duroxide` schema, owned by the schema owner
+  (`worker_role`), e.g. `_duroxide.client_enqueue_start(p_instance_id text,
+  p_work_item text, p_visible_at timestamptz)`. It performs the single queue
+  INSERT and is `GRANT EXECUTE`-ed to the roles allowed to start instances.
+  Applied by the BGW migration runner (`ApplyAll`); no pg_durable extension
+  upgrade script needed (consistent with docs/bgw-applies-migrations.md).
+- **Fallback — a `df` wrapper in pg_durable.** A `df.__enqueue_start(...)`
+  `SECURITY DEFINER` function owned by the extension owner that calls
+  `_duroxide.enqueue_orchestrator_work(...)`. Works in the common configuration
+  where the extension owner is a superuser (or otherwise owns/【has INSERT on】the
+  queue). In hardened multi-role deployments where `_duroxide` is owned by a
+  non-superuser `worker_role` distinct from the extension owner, this requires
+  the `worker_role` to `GRANT INSERT` on the queue tables (and `EXECUTE` on the
+  enqueue fn) to the wrapper's owner. The duroxide-pg placement avoids this
+  coordination.
+
+Either wrapper **must** pin `search_path` and schema-qualify every reference
+(CVE-2018-1058; enforced by the pgspot gate, scripts/pgspot-gate.sh):
+
+```sql
+CREATE FUNCTION _duroxide.client_enqueue_start(...)
+RETURNS void
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $$ BEGIN
+  INSERT INTO _duroxide.orchestrator_queue (instance_id, work_item, visible_at, created_at)
+  VALUES (p_instance_id, p_work_item, p_visible_at, pg_catalog.now());
+END $$;
+```
+
+The wrapper takes **only** an opaque `instance_id`, a `work_item` string, and a
+timestamp — it executes no caller-supplied SQL, so the `SECURITY DEFINER`
+surface is a single fixed INSERT.
+
+### Identity / security
+
+`submitted_by` capture is unchanged: it is still read as `current_user` inside
+`df.start()` (`SECURITY INVOKER`) **before** the enqueue and stored in
+`df.instances`. The `SECURITY DEFINER` wrapper does not touch identity; the work
+item carries only the orchestration name/input. The worker continues to
+authenticate using `df.instances.submitted_by` and re-check the superuser policy
+(unchanged). No new forgery surface is introduced.
+
+### Worker readiness and ordering
+
+- Keep the existing worker-readiness gate: if `_worker_ready` is absent/stale,
+  `df.start()` errors as it does today (the `_duroxide` schema may not yet
+  exist). This check moves from "before acquiring the client" to "before the SPI
+  enqueue."
+- The enqueue is the **last** write in `df.start()`; a PK collision on
+  `df.nodes`/`df.instances` still aborts before it (no behavior change for the
+  collision case, now fully inside one transaction).
+
+### Consequence: the 5 s load race disappears
+
+`load-function-graph` currently polls for up to `MAX_WAIT_SECS = 5` because the
+worker could dequeue a start item before the caller's `df.*` rows committed
+(src/activities/load_function_graph.rs:21,60). With an atomic start, a queue row
+becomes visible to the worker **only** after the `df.*` rows commit in the same
+transaction, so the "not yet visible / rolled back" branch cannot occur. The
+retry loop can be simplified to a single read (optionally retain a short grace
+for replication scenarios). This also removes the worker-saturation failure mode.
+
+## Transaction-semantics decision (must be made explicit)
+
+This change makes `df.start()` **atomic with the caller's transaction**:
+
+```sql
+BEGIN;
+SELECT df.start('SELECT 1');   -- enqueued on THIS transaction
+ROLLBACK;                      -- start is fully undone; nothing runs
+```
+
+Today the behavior is inconsistent (the `df.*` rows roll back but the duroxide
+enqueue does not). The new behavior is the least-surprising contract and is the
+explicit, documented decision of this spec. Note for users: a `df.start()` in a
+transaction that later rolls back will **not** run — which is the desired fix,
+but a behavior change from "fire-and-forget at statement time." This must be
+called out in `USER_GUIDE.md`.
+
+## Signals & cancellation (follow-up)
+
+`df.signal()` (ExternalRaised) and `df.cancel()` (CancelInstance) are also single
+orchestrator enqueues and can use the same in-transaction SPI path for the same
+consistency benefit. They are **out of scope** here (signal/cancel target
+already-committed instances, so the rolled-back-start leak does not apply), but
+the `SECURITY DEFINER` enqueue entrypoint should be designed to serve them too
+(generalize to `client_enqueue(instance, work_item, visible_at)`).
+
+## Complementary backstop (optional, recommended)
+
+Independently of this change, a lightweight **reconciler** — ideally a built-in
+durable function (a `@>` loop + `df.wait_for_schedule`) — should sweep for
+residual divergence that an atomic start cannot prevent (crashes mid-execution,
+not-yet-migrated signal/cancel paths, `status` mirror drift), using existing
+`_duroxide` read functions (`list_instances`, `get_instance_info`,
+`get_queue_depths`) and `delete_instances_atomic`. Tracked separately as
+"Option 4"; not required for this spec but cheap insurance.
+
+## Upgrade & Migration
+
+**B1 — binary backward compatibility (new `.so` vs all prior schemas).** The new
+`.so` calls the enqueue entrypoint via SPI.
+- If the entrypoint is the **duroxide-pg-provided** `SECURITY DEFINER` function,
+  the BGW applies the duroxide migration at startup (`ApplyAll`,
+  src/worker.rs), and the function exists for every running worker. Backends
+  detect readiness via `_worker_ready` (existing gate) before enqueuing.
+  Versioning: bump the pinned `duroxide`/`duroxide-pg` pair together (see
+  docs/upgrade-testing.md "Updating duroxide-pg") and run `cargo update`. **No
+  `pg_durable` extension upgrade script is required** for duroxide-schema changes
+  (docs/bgw-applies-migrations.md).
+- If the entrypoint is the **`df`-schema fallback wrapper**, it is a `df` schema
+  object and **does** require an extension upgrade script
+  `sql/pg_durable--<prev>--<current>.sql` (CREATE FUNCTION + GRANT), plus the
+  fresh-install SQL, and a "Version-Specific Changes" entry in
+  docs/upgrade-testing.md. The new `.so` must still operate against older `df`
+  schemas that predate the wrapper: gate the SPI path on the wrapper's existence
+  (catalog probe) and **fall back to the existing out-of-band client enqueue**
+  when absent. This preserves B1 while the upgrade rolls out.
+
+**Runtime schema detection.** Reuse the existing `_worker_ready` /
+`backend_duroxide_schema()` resolution. Add a one-time probe (cached per session)
+for the enqueue entrypoint; if missing (old worker), fall back to the legacy
+path. This makes the change safe under mixed binary/worker versions.
+
+**Data migration.** None. No `df.*` or `_duroxide` table shapes change. Existing
+in-flight instances are unaffected (their queue items were already enqueued).
+
+**Rollback.** Reverting the `.so` restores the out-of-band enqueue; the optional
+`df` wrapper, if added, is inert when unused.
+
+## Testing
+
+E2E (tests/e2e/sql/) and unit coverage:
+
+- **Atomic rollback (the leak):** `BEGIN; df.start(...); ROLLBACK;` then assert
+  **both** `df.instances` and `_duroxide.instances`/`orchestrator_queue` have no
+  row for the instance. (Today this fails: `_duroxide` keeps an orphan.)
+- **Atomic commit:** `BEGIN; df.start(...); COMMIT;` runs to completion exactly
+  as before.
+- **Subtransaction abort:** a `df.start()` inside a `BEGIN/EXCEPTION` block that
+  raises leaves no `_duroxide` orphan.
+- **PK collision (existing repro):** the short-id collision aborts cleanly with
+  no `_duroxide` residue (extends the manual `short_id_collision` repro).
+- **Worker-not-ready:** `df.start()` before `_worker_ready` errors as today.
+- **Privilege/hardened config:** a non-superuser caller (and, for the fallback
+  wrapper, a distinct `worker_role`) can start successfully and cannot insert
+  into `_duroxide.orchestrator_queue` directly.
+- **No 5 s waits:** worker logs contain no "transaction may have been rolled
+  back" entries during normal start/rollback flows.
+
+## Phasing
+
+1. **Phase 1 (this spec):** atomic `df.start()` via the `SECURITY DEFINER`
+   enqueue entrypoint; readiness gate + legacy fallback; simplify the
+   `load-function-graph` retry; docs + tests.
+2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` /
+   `df.cancel()`.
+3. **Phase 3 (optional):** ship the reconciler backstop (Option 4).
+
+## Open questions / risks
+
+- **Entrypoint placement:** duroxide-pg (clean layering, coordinated release) vs
+  a `df` fallback wrapper (no duroxide-pg release, but cross-role grants in
+  hardened setups). Recommendation: duroxide-pg, with the `df` fallback retained
+  for B1 during rollout.
+- **Contract stability:** pg_durable now depends on the enqueue entrypoint
+  signature and the `WorkItem` JSON. Mitigated by reusing the `duroxide`
+  `WorkItem` type and the version-locked duroxide/duroxide-pg pair.
+- **`visible_at` clock:** use `now()` (server time) for immediate starts to match
+  the duroxide-pg provider's behavior; confirm no reliance on the Rust-side
+  `now_ms` for ordering at start.
+- **Idempotency:** `enqueue_orchestrator_work` is a bare INSERT (no dedup); the
+  `df.instances`/`df.nodes` PKs already prevent duplicate instance ids within the
+  transaction, so no additional dedup is required.

From 00de1101843bde44bea3d3542d6a3cbfcdcfe3c8 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Thu, 18 Jun 2026 21:39:49 +0000
Subject: [PATCH 02/20] spec: add dual-write inventory; clarify df.cancel vs
 df.signal

Add a Dual-write inventory section that classifies every df.* writer vs
duroxide-client call into two seams: backend session-side dual-writes
(df.start, df.cancel) that Option 3 fixes, and worker-side best-effort
mirror activities (update-instance-status, update-node-status) that it does
not. Make explicit that df.cancel is a genuine dual-write (Phase 2) while
df.signal writes only _duroxide and is not a cross-store divergence.
---
 docs/spec-atomic-start.md | 65 +++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 6 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 06c9903..b921bc2 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -76,12 +76,56 @@ Key facts that make the fix small:
 | Execution history | `_duroxide.history` | duroxide-internal; transactional per turn via `ack_orchestration_item`. |
 | Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | needed by the worker to load + authenticate. |
 | `df.instances.status` | best-effort mirror | written by the `update-instance-status` activity; not atomic with duroxide, self-heals. |
+| `df.nodes.status/result/error` | best-effort mirror | written by the `update-node-status` activity; same properties as above. |
 
 `submitted_by` is a **security boundary**: it is captured as `current_user` at
 `df.start()` time and pinned by a composite FK; the worker re-checks the
 superuser policy (src/activities/load_function_graph.rs). Any change must keep it
 unforgeable.
 
+### Dual-write inventory
+
+Auditing every `df.*` writer against every duroxide-client call, the cross-store
+seams are bounded and fall into two distinct classes.
+
+**Backend (session) side — `df.*` on the caller's transaction + an out-of-band
+duroxide-client enqueue.** This is the class Option 3 fixes.
+
+| Site | `df.*` write (user tx, SPI) | `_duroxide` write (out-of-band pool) | Dual-write? |
+|------|------|------|------|
+| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** — Phase 1 |
+| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `cancel_instance` enqueue (src/dsl.rs:955) | **Yes** — Phase 2 |
+| `df.signal()` | **none** (only an RLS read of `df.instances`) | `raise_event` + descendant fan-out (src/dsl.rs:616) | **No** — single store |
+
+Notes:
+
+- **`df.cancel()` is a genuine dual-write** with two twists vs `df.start()`: it
+  enqueues to duroxide **first** and then does the `df.*` UPDATE (opposite
+  order), and the UPDATE is only an *optimistic* mirror — the authoritative
+  `cancelled` status is re-applied when the worker processes the cancel via
+  `update-instance-status`, so divergence here is usually transient and
+  self-healing. It still belongs in the same in-transaction treatment for strict
+  atomicity.
+- **`df.signal()` is not a dual-write**: it writes only `_duroxide`. There is no
+  cross-store inconsistency to create; moving it in-transaction only changes
+  "don't deliver the signal if my surrounding tx rolls back" semantics — a
+  nice-to-have, not a consistency fix.
+- Not seams: `df.run()` is a no-op stub; `df.result/status/explain/monitoring`
+  are reads; `df.vars` and `df._worker_epoch` touch only `df.*`.
+
+**Worker side — `df.*` mirror maintained by activities, on a connection separate
+from duroxide's history ack.** Option 3 does **not** address this class.
+
+| Activity | `df.*` write | Property |
+|----------|--------------|----------|
+| `update-instance-status` | UPDATE `df.instances.status` | eventually consistent, self-healing |
+| `update-node-status` | UPDATE `df.nodes.status/result/error` | eventually consistent, self-healing |
+
+These are at-least-once activities re-applied on replay, so a crash between the
+duroxide ack and the mirror update is transient lag, not permanent divergence.
+This is the "best-effort mirror"; any persistent drift is the reconciler's job
+(see Complementary backstop), not Option 3's.
+
 ## Goals
 
 - `df.start()` is **atomic with the caller's transaction**: on rollback, no
@@ -275,12 +319,21 @@ called out in `USER_GUIDE.md`.
 
 ## Signals & cancellation (follow-up)
 
-`df.signal()` (ExternalRaised) and `df.cancel()` (CancelInstance) are also single
-orchestrator enqueues and can use the same in-transaction SPI path for the same
-consistency benefit. They are **out of scope** here (signal/cancel target
-already-committed instances, so the rolled-back-start leak does not apply), but
-the `SECURITY DEFINER` enqueue entrypoint should be designed to serve them too
-(generalize to `client_enqueue(instance, work_item, visible_at)`).
+See the Dual-write inventory above for the precise classification. In short:
+
+- **`df.cancel()` is a genuine dual-write** (`df.instances.status` + a
+  `CancelInstance` enqueue) and should get the same in-transaction treatment as
+  `df.start()`. It is scoped as **Phase 2** rather than Phase 1 only because it
+  targets an already-committed instance (so the rolled-back-*start* leak does not
+  apply) and because the worker's `update-instance-status` activity already
+  re-applies the authoritative `cancelled` status, making its divergence
+  transient. The `SECURITY DEFINER` enqueue entrypoint should be generalized to
+  carry an arbitrary work item (e.g. `client_enqueue(instance, work_item,
+  visible_at)`) so it serves cancel directly.
+- **`df.signal()` is not a dual-write** (it writes only `_duroxide`), so it
+  creates no cross-store inconsistency. It can ride on the same entrypoint for
+  "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not
+  a consistency fix — lowest priority.
 
 ## Complementary backstop (optional, recommended)
 

From 7aeff23adf92394d26236e3b2df24338067a7487 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Thu, 18 Jun 2026 22:10:25 +0000
Subject: [PATCH 03/20] spec: surface the four design options up front; state
 Option 3 + Option 4 decision

Add a 'Design options considered' section near the top enumerating the four
originally-explored options (single source of truth, our-SQL-in-their-tx,
their-enqueue-in-our-tx, and tolerate-and-reconcile) with a Decision
subsection, so the Option N references throughout resolve. Update the Overview
to state we implement Option 3 (primary) + Option 4 (backstop), remove the now
-redundant mid-document summary, and reframe the reconciler as in-scope Phase 3.
---
 docs/spec-atomic-start.md | 94 ++++++++++++++++++++++++++++-----------
 1 file changed, 67 insertions(+), 27 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index b921bc2..9999134 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -21,9 +21,12 @@ corresponding `df.*` rows. The background worker then repeatedly tries to load a
 graph that does not exist, fails after a 5 s timeout, and the orphan persists in
 `_duroxide.instances`.
 
-This spec proposes making the duroxide start-enqueue part of the caller's
-backend transaction (design **Option 3** from the design exploration), so that
-`df.start()` is atomic: either everything commits, or nothing does.
+Four designs were explored to keep the two stores consistent (see **Design
+options considered** below). This spec **implements Option 3 (primary) plus
+Option 4 (complementary backstop)**: make the duroxide start-enqueue part of the
+caller's backend transaction so `df.start()` is atomic — either everything
+commits or nothing does — and add an asynchronous reconciler to repair any
+residual drift that start-time atomicity cannot prevent.
 
 ### Evidence (observed during investigation)
 
@@ -38,6 +41,60 @@ backend transaction (design **Option 3** from the design exploration), so that
   60776). That abort is *safe* today (it happens before the duroxide enqueue),
   but it exercises the same dual-write seam.
 
+## Design options considered
+
+Four approaches were explored for keeping the `df.*` control plane and the
+`_duroxide` runtime consistent. The mechanics of the chosen approach are detailed
+in **Design** below; this section records all four and the decision.
+
+**Option 1 — Single source of truth in `_duroxide`.** Drop the `df.instances` /
+`df.nodes` control tables entirely, keep all instance state in `_duroxide`,
+serialize the graph into the orchestration input, and have duroxide expose
+whatever state pg_durable needs (via views / read APIs).
+*Guarantee:* total — there is only one store, so nothing can diverge.
+*Cost:* large — moves the `submitted_by` security boundary, requires rebuilding
+per-user RLS over duroxide-owned tables, re-exposing every `df.*` read surface,
+and a destructive migration with B1 backward-compat implications. A possible
+long-term direction, not this change.
+
+**Option 2 — Hand duroxide-pg our SQL to run inside *its* transaction.** Keep
+both schemas; extend duroxide-pg so the start-enqueue also executes the
+pg_durable `df.*` writes in the same (sqlx, worker-pool) transaction.
+*Guarantee:* `df.*` ↔ `_duroxide` atomic — but on the worker-pool connection, so
+`df.start()` stops participating in the *caller's* transaction and the `df.*`
+writes run as the pool role, breaking `current_user` / `submitted_by` capture and
+RLS. Strictly weaker than Option 3.
+
+**Option 3 — Hand duroxide-pg *our* (the caller's) transaction.** Keep both
+schemas; run duroxide's start-enqueue inside the caller's backend transaction via
+SPI, alongside the `df.*` writes.
+*Guarantee:* strongest — `df.nodes` + `df.instances` + the queue row + the
+caller's surrounding statements all commit or roll back together; identity
+capture and the `df.*` read surface are untouched; the 5 s load race disappears.
+*Cost:* small — the start is a single SQL-function call — with one wrinkle: a
+`SECURITY DEFINER` enqueue entrypoint is needed because the queue INSERT is
+owner-only. **Chosen as the primary fix.**
+
+**Option 4 — Tolerate divergence; repair asynchronously.** Accept a transient
+window and add a reconciler — ideally a built-in durable function — that detects
+and repairs mismatches (orphaned `_duroxide` instances, stuck `df.instances`
+rows, `status`-mirror drift).
+*Guarantee:* none added; eventual consistency with bounded repair latency.
+*Value:* the only option that also catches divergence from crashes
+*mid-execution* and from paths Option 3 does not cover (the worker-side mirror
+activities, and signal/cancel until they are migrated). **Chosen as a
+complementary backstop.**
+
+### Decision
+
+Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source
+for backend-initiated starts (and, in a later phase, `df.cancel`); Option 4 is a
+lightweight safety net for the residual divergence that no start-time atomicity
+can prevent (crashes mid-execution, the best-effort `df.*` mirror, and
+not-yet-migrated signal/cancel). Options 1 and 2 are rejected — Option 1 is
+disproportionate to the problem, and Option 2 is a strictly weaker variant of
+Option 3.
+
 ## Background: how a start is written today
 
 `crate::dsl::start()` (src/dsl.rs) executes, in order, on the **caller's**
@@ -148,23 +205,6 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job
 - Migrating `df.signal()` / `df.cancel()` to the in-transaction path — same
   mechanism, scoped as a follow-up (see Phasing).
 
-## Considered alternatives (summary)
-
-1. **Single source of truth in `_duroxide`** — remove `df.*`, serialize the graph
-   into the orchestration input, re-expose reads as views over `_duroxide`.
-   Strongest invariant but large; security/RLS/read-API rework and a destructive
-   migration. Rejected for this change.
-2. **duroxide-pg runs pg_durable's SQL inside *its* sqlx transaction** — makes
-   `df.*` ↔ `_duroxide` atomic but on the worker-pool connection, so `df.start`
-   stops participating in the caller's transaction, and the `df.*` writes run as
-   the pool role (breaks `current_user` capture). Strictly weaker than Option 3.
-3. **Run the enqueue inside the caller's backend transaction via SPI (this
-   spec)** — smallest change, strongest guarantee, keeps `df.*` and identity
-   capture intact.
-4. **Tolerate divergence + async GC/reconciler** — additive safety net; does not
-   prevent transient inconsistency. Recommended as a *complementary* backstop,
-   not a replacement.
-
 ## Design
 
 ### Mechanism
@@ -335,15 +375,15 @@ See the Dual-write inventory above for the precise classification. In short:
   "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not
   a consistency fix — lowest priority.
 
-## Complementary backstop (optional, recommended)
+## Complementary backstop — Option 4 (in scope, Phase 3)
 
-Independently of this change, a lightweight **reconciler** — ideally a built-in
-durable function (a `@>` loop + `df.wait_for_schedule`) — should sweep for
-residual divergence that an atomic start cannot prevent (crashes mid-execution,
+As part of the chosen approach, a lightweight **reconciler** — ideally a built-in
+durable function (a `@>` loop + `df.wait_for_schedule`) — sweeps for residual
+divergence that an atomic start cannot prevent (crashes mid-execution,
 not-yet-migrated signal/cancel paths, `status` mirror drift), using existing
 `_duroxide` read functions (`list_instances`, `get_instance_info`,
-`get_queue_depths`) and `delete_instances_atomic`. Tracked separately as
-"Option 4"; not required for this spec but cheap insurance.
+`get_queue_depths`) and `delete_instances_atomic`. It complements Option 3 rather
+than replacing it, and is sequenced as Phase 3 (see Phasing).
 
 ## Upgrade & Migration
 
@@ -404,7 +444,7 @@ E2E (tests/e2e/sql/) and unit coverage:
    `load-function-graph` retry; docs + tests.
 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` /
    `df.cancel()`.
-3. **Phase 3 (optional):** ship the reconciler backstop (Option 4).
+3. **Phase 3:** ship the reconciler backstop (Option 4).
 
 ## Open questions / risks
 

From 3f64e60322c37c75adc86bed135e6bf86b1446e0 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Thu, 18 Jun 2026 22:19:55 +0000
Subject: [PATCH 04/20] spec: add options comparison table

Add the trade-off matrix (atomic with df.*, atomic with caller tx, kills the
5s load race, fixes crash-time drift, effort) to the Design options considered
section, with footnotes explaining Option 1's convergence with Option 3 and why
Option 4 is kept as a crash-drift backstop.
---
 docs/spec-atomic-start.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 9999134..6f11d9d 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -85,6 +85,21 @@ rows, `status`-mirror drift).
 activities, and signal/cancel until they are migrated). **Chosen as a
 complementary backstop.**
 
+### Comparison
+
+| Option | Atomic with `df.*` | Atomic with caller tx | Kills 5 s load race | Fixes crash-time drift | Effort |
+|--------|:--:|:--:|:--:|:--:|:--:|
+| 1 — single source of truth in `_duroxide` | n/a (one store) | no¹ | yes | yes | Large |
+| 2 — our SQL in duroxide's tx | yes | **no** | yes | no | Medium |
+| 3 — enqueue in caller's tx (SPI) | yes | **yes** | **yes** | no | Small–Med |
+| 4 — async GC / reconciler | repairs | — | no | **yes** | Small–Med |
+
+¹ Option 1 also decouples from the caller's transaction unless the enqueue is
+itself done via SPI — at which point its start path converges with Option 3.
+"Fixes crash-time drift" = repairs divergence from a crash *mid-execution* (not
+just at start); only Options 1 and 4 do this, which is why Option 4 is kept as a
+backstop alongside Option 3.
+
 ### Decision
 
 Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source

From efb818c14d39c2f1fbebaffa4772a03bddd2610a Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Thu, 18 Jun 2026 22:50:10 +0000
Subject: [PATCH 05/20] POC: atomic df.start() via in-transaction SPI enqueue
 (Option 3)

Enqueue the StartOrchestration work item inside the caller's backend
transaction instead of out-of-band on the duroxide client pool, so the
df.nodes/df.instances INSERTs and the orchestrator-queue INSERT commit or
roll back together.

- src/lib.rs: add df._enqueue_orchestrator_start(text,text), a SECURITY
  DEFINER wrapper (SET search_path=pg_catalog) that performs the privileged
  INSERT via _duroxide.enqueue_orchestrator_work(); EXECUTE revoked from
  PUBLIC and granted through df.grant_usage().
- src/dsl.rs: df.start() builds the duroxide WorkItem::StartOrchestration,
  gates on worker readiness, and enqueues via SPI through the wrapper,
  raising (aborting the tx) on failure.
- src/client.rs: drop the now-unused out-of-band start_durable_function;
  expose is_worker_ready() to the start path.

Verified: committed start runs to completion; rolled-back start leaves no
df.* rows AND no _duroxide orphan (the leak this fixes); E2E subset incl.
join/race/signals/cancel/loops passes; non-superuser df_e2e_user path works.

POC scope: fresh-install only (no upgrade script); _duroxide schema name
hard-coded; wrapper hardening deferred per docs/spec-atomic-start.md.
---
 src/client.rs | 25 +------------------------
 src/dsl.rs    | 46 +++++++++++++++++++++++++++++++++++-----------
 src/lib.rs    | 44 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 79 insertions(+), 36 deletions(-)

diff --git a/src/client.rs b/src/client.rs
index 625ea73..ab70491 100644
--- a/src/client.rs
+++ b/src/client.rs
@@ -27,7 +27,7 @@ static DUROXIDE_CLIENT: OnceLock<Client> = OnceLock::new();
 /// row, or has a `schema_version` below `WORKER_SCHEMA_VERSION`. This is a fast
 /// SPI read called once per session on the first call to any `df.*` function
 /// that needs the duroxide client.
-fn is_worker_ready() -> bool {
+pub(crate) fn is_worker_ready() -> bool {
     let schema = backend_duroxide_schema();
 
     // First check if the readiness table exists via the catalogue.  Querying
@@ -135,29 +135,6 @@ async fn list_running_descendants(client: &Client, root_instance_id: &str) -> Ve
     descendants
 }
 
-/// Start a durable function via the shared PostgreSQL store.
-pub fn start_durable_function(
-    function_name: &str,
-    instance_id: &str,
-    input: &str,
-) -> Result<(), String> {
-    log!(
-        "pg_durable: start_durable_function for instance {}",
-        instance_id
-    );
-
-    let rt = get_client_runtime();
-    let client = get_duroxide_client()?;
-
-    rt.block_on(async {
-        client
-            .start_orchestration(instance_id, function_name, input)
-            .await
-            .map_err(|e| format!("Failed to start durable function: {e:?}"))?;
-        Ok(())
-    })
-}
-
 /// Cancel a durable function.
 pub fn cancel_durable_function(instance_id: &str, reason: &str) -> Result<(), String> {
     let rt = get_client_runtime();
diff --git a/src/dsl.rs b/src/dsl.rs
index 9e2c95e..e813bcf 100644
--- a/src/dsl.rs
+++ b/src/dsl.rs
@@ -12,7 +12,6 @@ use std::str::FromStr;
 use std::cell::RefCell;
 use std::time::Instant;
 
-use crate::client::start_durable_function;
 use crate::types::{
     mark_non_future_helper_call, short_id, validate_result_name, Durofut, FunctionInput,
 };
@@ -912,23 +911,48 @@ pub fn start(
         vars
     });
 
-    // Start the orchestration via duroxide
+    // Start the orchestration durably.
+    //
+    // Option 3 (atomic start): the StartOrchestration work item is enqueued via
+    // SPI inside the CALLER'S transaction, so the df.nodes/df.instances INSERTs
+    // above and this orchestrator-queue INSERT commit or roll back together. A
+    // rolled-back df.start() therefore leaves no _duroxide orphan, and the worker
+    // only observes the queue row after the df.* rows are visible (which also
+    // removes the load-function-graph "wait for commit" race).
     let input = FunctionInput {
         instance_id: instance_id.clone(),
         label: label.map(|s| s.to_string()),
         vars,
     };
-    let input_json = serde_json::to_string(&input).unwrap_or(instance_id.clone());
+    let input_json = serde_json::to_string(&input).unwrap_or_else(|_| instance_id.clone());
 
-    if let Err(e) = start_durable_function(
-        crate::orchestrations::execute_function_graph::NAME,
-        &instance_id,
-        &input_json,
+    // Fail clearly (and atomically — the df.* INSERTs roll back) if the worker
+    // has not yet initialised the duroxide schema.
+    if !crate::client::is_worker_ready() {
+        pgrx::error!("pg_durable background worker not yet initialized — try again in a moment");
+    }
+
+    // Build the exact same work item the duroxide client would enqueue, by
+    // serializing the real duroxide type (guarantees wire compatibility).
+    let work_item = duroxide::providers::WorkItem::StartOrchestration {
+        instance: instance_id.clone(),
+        orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(),
+        input: input_json,
+        version: None,
+        parent_instance: None,
+        parent_id: None,
+        execution_id: duroxide::INITIAL_EXECUTION_ID,
+    };
+    let work_item_json = serde_json::to_string(&work_item)
+        .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e));
+
+    // SECURITY DEFINER wrapper performs the privileged orchestrator-queue INSERT
+    // on the caller's transaction. Raising on failure aborts the whole df.start.
+    if let Err(e) = Spi::run_with_args(
+        "SELECT df._enqueue_orchestrator_start($1, $2)",
+        &[instance_id.as_str().into(), work_item_json.as_str().into()],
     ) {
-        pgrx::log!(
-            "pg_durable: Warning - failed to start durable function: {}",
-            e
-        );
+        pgrx::error!("Failed to enqueue durable function start: {:?}", e);
     }
 
     instance_id
diff --git a/src/lib.rs b/src/lib.rs
index 681b88d..7f8c723 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -413,7 +413,9 @@ DECLARE
         'df.version()',
         'df.debug_connection()',
         'df.explain(text)',
-        'df.target_database()'
+        'df.target_database()',
+        -- Atomic-start enqueue wrapper (called by df.start() via SPI)
+        'df._enqueue_orchestrator_start(text, text)'
     ];
 BEGIN
     -- Validate the role exists
@@ -561,6 +563,46 @@ REVOKE EXECUTE ON FUNCTION df.revoke_usage(text) FROM PUBLIC;
     requires = ["create_tables", dsl::http]
 );
 
+// ============================================================================
+// Atomic start enqueue (Option 3) — SECURITY DEFINER wrapper
+// ============================================================================
+//
+// df.start() enqueues the StartOrchestration work item by calling this wrapper
+// via SPI, inside the caller's transaction. The duroxide orchestrator queue
+// grants INSERT to its owner only, so the privileged INSERT must run as a
+// definer that owns the _duroxide tables; the call still commits/rolls back as
+// part of the caller's transaction, giving an atomic df.start().
+//
+// The body references _duroxide.enqueue_orchestrator_work, which is created by
+// the background worker's migrations (not this extension). plpgsql resolves it
+// at call time, by which point df.start()'s worker-readiness gate guarantees the
+// _duroxide schema exists.
+//
+// SECURITY (POC): this wrapper can enqueue an arbitrary orchestrator work item.
+// EXECUTE is revoked from PUBLIC and granted only via df.grant_usage(). A
+// production version should move this entrypoint into duroxide-pg (owned by the
+// _duroxide schema owner) and/or validate the work item — see
+// docs/spec-atomic-start.md. The _duroxide schema name is also hard-coded here
+// for the POC (it is resolved dynamically elsewhere).
+extension_sql!(
+    r#"
+CREATE FUNCTION df._enqueue_orchestrator_start(p_instance_id text, p_work_item text)
+RETURNS void
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+BEGIN
+    PERFORM _duroxide.enqueue_orchestrator_work(p_instance_id, p_work_item, pg_catalog.now());
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLIC;
+"#,
+    name = "atomic_start_enqueue",
+    requires = ["create_tables"]
+);
+
 // ============================================================================
 // Extension Validation (must run before duroxide schema creation)
 // ============================================================================

From 6495ee12cb70a053cbcc4e0acb0ab84103dbb3a7 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 00:20:13 +0000
Subject: [PATCH 06/20] spec: record validated Option 3 POC results

Add a 'Proof of concept (validated)' section documenting what was built
(the SECURITY DEFINER df._enqueue_orchestrator_start wrapper + in-transaction
SPI enqueue), the confirmed findings (privilege model is the only wrinkle;
visible_at=now() suffices; reusing the duroxide WorkItem type), and the
validation matrix (atomicity, 8/8 E2E subset, 5/5 JOIN, non-superuser path).
Update Status, Phase 1, and the visible_at open question to reflect the POC;
note the DROP EXTENSION CASCADE / _duroxide recovery gotcha.
---
 docs/spec-atomic-start.md | 69 +++++++++++++++++++++++++++++++++++----
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 6f11d9d..4b3d89e 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -1,6 +1,6 @@
 # Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence
 
-**Status:** Proposal
+**Status:** Proposal — Option 3 prototyped & validated (POC)
 **Author:** pg_durable team
 **Date:** June 2026
 
@@ -355,6 +355,59 @@ transaction, so the "not yet visible / rolled back" branch cannot occur. The
 retry loop can be simplified to a single read (optionally retain a short grace
 for replication scenarios). This also removes the worker-saturation failure mode.
 
+## Proof of concept (validated)
+
+Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch
+`tjgreen42/atomic-start`, ~80 lines across 3 files).
+
+**What was built:**
+
+- `df._enqueue_orchestrator_start(text, text)` — a `SECURITY DEFINER` wrapper
+  (`SET search_path = pg_catalog`) that performs the privileged
+  `_duroxide.enqueue_orchestrator_work(instance, work_item, now())` INSERT.
+  EXECUTE is revoked from PUBLIC and granted through `df.grant_usage()`.
+- `df.start()` builds the real `duroxide::providers::WorkItem::StartOrchestration`
+  (guaranteeing wire compatibility), gates on `is_worker_ready()`, and enqueues
+  via SPI through the wrapper — raising (aborting the transaction) on failure.
+- The out-of-band `start_durable_function` client path is removed from start
+  (the duroxide client pool is still used by `df.signal` / `df.cancel`).
+
+**Confirmed findings:**
+
+- **The privilege model is the only real wrinkle.** `_duroxide.orchestrator_queue`
+  grants INSERT to its owner only and the enqueue function is `SECURITY INVOKER`,
+  so a plain caller gets *permission denied*. The `SECURITY DEFINER` wrapper
+  resolves this while still committing in the caller's transaction. A
+  non-superuser role (`df_e2e_user`) granted via `df.grant_usage()` starts
+  instances successfully and cannot INSERT into the queue directly.
+- **`visible_at = now()`** (SQL transaction time) is sufficient for immediate
+  starts; no reliance on the Rust-side `now_ms`.
+- **Reusing the `duroxide` `WorkItem` type** makes the enqueued row
+  byte-identical to the client path, so the worker behaves identically.
+
+**Validated behavior:**
+
+| Check | Result |
+|-------|:------:|
+| Committed `df.start()` runs to completion | pass |
+| Rolled-back `df.start()` leaves no `df.*` **and** no `_duroxide` orphan (queue + instances) | pass |
+| E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 |
+| JOIN determinism under the new (faster) timing | 5/5 |
+| Non-superuser (`df_e2e_user`) start path | pass |
+| `cargo fmt --check`; no new clippy warnings | pass |
+
+**POC scope / deferred to productionization:** fresh install only (no upgrade
+script); the `_duroxide` schema name is hard-coded (resolved dynamically in the
+real change — see Upgrade & Migration); wrapper hardening (move to duroxide-pg
+and/or validate the work item).
+
+> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the
+> BGW-owned `_duroxide` schema half-broken (migrations row present, functions
+> dropped), which surfaces as *spurious* JOIN nondeterminism and a missing
+> `_worker_ready`. Recover with `DROP SCHEMA _duroxide CASCADE` + restart so the
+> BGW re-applies its migrations. (This cost real debugging time during the POC;
+> the failures were environmental, not a regression.)
+
 ## Transaction-semantics decision (must be made explicit)
 
 This change makes `df.start()` **atomic with the caller's transaction**:
@@ -454,9 +507,11 @@ E2E (tests/e2e/sql/) and unit coverage:
 
 ## Phasing
 
-1. **Phase 1 (this spec):** atomic `df.start()` via the `SECURITY DEFINER`
-   enqueue entrypoint; readiness gate + legacy fallback; simplify the
-   `load-function-graph` retry; docs + tests.
+1. **Phase 1 (this spec) — prototyped & validated:** atomic `df.start()` via the
+   `SECURITY DEFINER` enqueue entrypoint; readiness gate + legacy fallback;
+   simplify the `load-function-graph` retry; docs + tests. (POC complete; see
+   Proof of concept. Remaining for production: upgrade script, dynamic
+   `_duroxide` schema resolution, entrypoint hardening.)
 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` /
    `df.cancel()`.
 3. **Phase 3:** ship the reconciler backstop (Option 4).
@@ -470,9 +525,9 @@ E2E (tests/e2e/sql/) and unit coverage:
 - **Contract stability:** pg_durable now depends on the enqueue entrypoint
   signature and the `WorkItem` JSON. Mitigated by reusing the `duroxide`
   `WorkItem` type and the version-locked duroxide/duroxide-pg pair.
-- **`visible_at` clock:** use `now()` (server time) for immediate starts to match
-  the duroxide-pg provider's behavior; confirm no reliance on the Rust-side
-  `now_ms` for ordering at start.
+- **`visible_at` clock:** the POC uses `now()` (server transaction time) for
+  immediate starts and it works; no reliance on the Rust-side `now_ms` for
+  ordering at start.
 - **Idempotency:** `enqueue_orchestrator_work` is a bare INSERT (no dedup); the
   `df.instances`/`df.nodes` PKs already prevent duplicate instance ids within the
   transaction, so no additional dedup is required.

From 66dd6526b05028dcea9ab3d9deaf20dc1862badc Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 00:46:13 +0000
Subject: [PATCH 07/20] POC round 2: dynamic schema, provider probe + fallback,
 Option 4 reconciler

Address the abstraction-breaking direct _duroxide call and add the GC backstop.

- df._enqueue_orchestrator_start now resolves the duroxide schema dynamically
  via df.duroxide_schema() and EXECUTE format(%I.enqueue_orchestrator_work),
  instead of hard-coding _duroxide.
- df.start() probes for the duroxide-pg SQL surface (enqueue_orchestrator_work
  in the resolved schema): present -> atomic in-transaction SPI enqueue; absent
  (non-pg provider / pre-wrapper schema) -> fall back to the out-of-band client
  path. start_durable_function restored for the fallback.
- Add df.reconcile(grace) (Option 4): SECURITY DEFINER, admin-only. GCs orphaned
  ROOT duroxide instances (parent_instance_id IS NULL, no df row) via
  delete_instances_atomic and fails stuck df.instances; excludes
  sub-orchestrations. Schedulable as a durable cron loop.

Validated: atomicity holds via dynamic schema; reconcile deletes a planted root
orphan and a stuck instance while leaving sub-orchestrations + live instances
intact; E2E subset 8/8; fmt clean, no new clippy warnings. Spec updated.
---
 docs/spec-atomic-start.md |  83 ++++++++++++++++++++---------
 src/client.rs             |  26 +++++++++
 src/dsl.rs                | 107 ++++++++++++++++++++++++++------------
 src/lib.rs                | 101 +++++++++++++++++++++++++++++++----
 4 files changed, 251 insertions(+), 66 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 4b3d89e..3679439 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -364,26 +364,43 @@ Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch
 
 - `df._enqueue_orchestrator_start(text, text)` — a `SECURITY DEFINER` wrapper
   (`SET search_path = pg_catalog`) that performs the privileged
-  `_duroxide.enqueue_orchestrator_work(instance, work_item, now())` INSERT.
-  EXECUTE is revoked from PUBLIC and granted through `df.grant_usage()`.
-- `df.start()` builds the real `duroxide::providers::WorkItem::StartOrchestration`
-  (guaranteeing wire compatibility), gates on `is_worker_ready()`, and enqueues
-  via SPI through the wrapper — raising (aborting the transaction) on failure.
-- The out-of-band `start_durable_function` client path is removed from start
-  (the duroxide client pool is still used by `df.signal` / `df.cancel`).
+  `<schema>.enqueue_orchestrator_work(instance, work_item, now())` INSERT. The
+  duroxide schema is **resolved dynamically** via `df.duroxide_schema()`
+  (`_duroxide` fresh / legacy `duroxide`) and quoted with `%I`. EXECUTE is
+  revoked from PUBLIC and granted through `df.grant_usage()`.
+- `df.start()` **probes** for the duroxide-pg SQL surface
+  (`enqueue_orchestrator_work` in the resolved schema). When present it builds
+  the real `duroxide::providers::WorkItem::StartOrchestration` (guaranteeing wire
+  compatibility), gates on `is_worker_ready()`, and enqueues via SPI through the
+  wrapper — raising (aborting the transaction) on failure. When **absent** (a
+  non-pg provider, or a schema predating the wrapper) it **falls back** to the
+  out-of-band `start_durable_function` client path (legacy, non-atomic).
+- `df.reconcile(grace_seconds int default 60)` (Option 4) — a `SECURITY DEFINER`,
+  admin-only (revoked from PUBLIC) reconciler that deletes orphaned **root**
+  duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row) via
+  `delete_instances_atomic`, and fails `df.instances` stuck non-terminal with no
+  live duroxide instance and no queued start. Schedulable as a durable cron loop.
+
+**Provider requirement:** the atomic path only works with the **duroxide-pg
+(SQL-backed) provider**, because it calls the provider's SQL function directly.
+The probe + fallback make this non-breaking for any other provider.
 
 **Confirmed findings:**
 
-- **The privilege model is the only real wrinkle.** `_duroxide.orchestrator_queue`
-  grants INSERT to its owner only and the enqueue function is `SECURITY INVOKER`,
-  so a plain caller gets *permission denied*. The `SECURITY DEFINER` wrapper
-  resolves this while still committing in the caller's transaction. A
-  non-superuser role (`df_e2e_user`) granted via `df.grant_usage()` starts
-  instances successfully and cannot INSERT into the queue directly.
+- **The privilege model is the only real wrinkle.** The duroxide orchestrator
+  queue grants INSERT to its owner only and the enqueue function is
+  `SECURITY INVOKER`, so a plain caller gets *permission denied*. The
+  `SECURITY DEFINER` wrapper resolves this while still committing in the caller's
+  transaction. A non-superuser role (`df_e2e_user`) granted via
+  `df.grant_usage()` starts instances successfully and cannot INSERT into the
+  queue directly.
 - **`visible_at = now()`** (SQL transaction time) is sufficient for immediate
   starts; no reliance on the Rust-side `now_ms`.
 - **Reusing the `duroxide` `WorkItem` type** makes the enqueued row
   byte-identical to the client path, so the worker behaves identically.
+- **The reconciler must exclude sub-orchestrations.** JOIN/RACE branches and loop
+  generations are legitimate duroxide instances with no `df.instances` row;
+  filtering on `parent_instance_id IS NULL` (root only) prevents collecting them.
 
 **Validated behavior:**
 
@@ -391,15 +408,19 @@ Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch
 |-------|:------:|
 | Committed `df.start()` runs to completion | pass |
 | Rolled-back `df.start()` leaves no `df.*` **and** no `_duroxide` orphan (queue + instances) | pass |
+| Dynamic schema resolution drives the SPI path (atomicity holds) | pass |
 | E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 |
 | JOIN determinism under the new (faster) timing | 5/5 |
 | Non-superuser (`df_e2e_user`) start path | pass |
+| `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass |
+| `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass |
 | `cargo fmt --check`; no new clippy warnings | pass |
 
 **POC scope / deferred to productionization:** fresh install only (no upgrade
-script); the `_duroxide` schema name is hard-coded (resolved dynamically in the
-real change — see Upgrade & Migration); wrapper hardening (move to duroxide-pg
-and/or validate the work item).
+script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg
+owned by the schema owner, and/or validate the work item); the reconciler is a
+plain SQL function (Phase 3 wires it into a built-in durable cron loop and tunes
+grace/policy).
 
 > **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the
 > BGW-owned `_duroxide` schema half-broken (migrations row present, functions
@@ -443,15 +464,27 @@ See the Dual-write inventory above for the precise classification. In short:
   "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not
   a consistency fix — lowest priority.
 
-## Complementary backstop — Option 4 (in scope, Phase 3)
+## Complementary backstop — Option 4 (prototyped, Phase 3)
 
-As part of the chosen approach, a lightweight **reconciler** — ideally a built-in
-durable function (a `@>` loop + `df.wait_for_schedule`) — sweeps for residual
+As part of the chosen approach, a lightweight **reconciler** sweeps for residual
 divergence that an atomic start cannot prevent (crashes mid-execution,
-not-yet-migrated signal/cancel paths, `status` mirror drift), using existing
-`_duroxide` read functions (`list_instances`, `get_instance_info`,
-`get_queue_depths`) and `delete_instances_atomic`. It complements Option 3 rather
-than replacing it, and is sequenced as Phase 3 (see Phasing).
+not-yet-migrated signal/cancel paths, `status` mirror drift, and legacy/fallback
+orphans). The POC ships it as **`df.reconcile(grace_seconds int default 60)`**
+(`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances
+(`parent_instance_id IS NULL`, no `df.instances` row, older than the grace
+window) via `delete_instances_atomic`, and marks `df.instances` rows stuck
+non-terminal with no live duroxide instance and no queued start as `failed`.
+Sub-orchestrations (JOIN/RACE branches, loop generations) are excluded by the
+root filter so they are never collected. Phase 3 wires this into a built-in
+durable cron loop (dogfooding pg_durable):
+
+```sql
+SELECT df.start(
+  @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ),
+  'reconciler');
+```
+
+It complements Option 3 rather than replacing it.
 
 ## Upgrade & Migration
 
@@ -514,7 +547,9 @@ E2E (tests/e2e/sql/) and unit coverage:
    `_duroxide` schema resolution, entrypoint hardening.)
 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` /
    `df.cancel()`.
-3. **Phase 3:** ship the reconciler backstop (Option 4).
+3. **Phase 3 — reconciler prototyped:** `df.reconcile()` exists and is validated
+   (orphan GC + stuck-instance failover); remaining work is wiring it into a
+   built-in durable cron loop and tuning the grace/policy.
 
 ## Open questions / risks
 
diff --git a/src/client.rs b/src/client.rs
index ab70491..14b1091 100644
--- a/src/client.rs
+++ b/src/client.rs
@@ -135,6 +135,32 @@ async fn list_running_descendants(client: &Client, root_instance_id: &str) -> Ve
     descendants
 }
 
+/// Start a durable function via the duroxide client (out-of-band, on the cached
+/// pool). This is the provider-agnostic fallback used by df.start() when the
+/// duroxide-pg SQL enqueue surface is not available; it is NOT atomic with the
+/// caller's transaction.
+pub fn start_durable_function(
+    function_name: &str,
+    instance_id: &str,
+    input: &str,
+) -> Result<(), String> {
+    log!(
+        "pg_durable: start_durable_function for instance {}",
+        instance_id
+    );
+
+    let rt = get_client_runtime();
+    let client = get_duroxide_client()?;
+
+    rt.block_on(async {
+        client
+            .start_orchestration(instance_id, function_name, input)
+            .await
+            .map_err(|e| format!("Failed to start durable function: {e:?}"))?;
+        Ok(())
+    })
+}
+
 /// Cancel a durable function.
 pub fn cancel_durable_function(instance_id: &str, reason: &str) -> Result<(), String> {
     let rt = get_client_runtime();
diff --git a/src/dsl.rs b/src/dsl.rs
index e813bcf..532f667 100644
--- a/src/dsl.rs
+++ b/src/dsl.rs
@@ -622,6 +622,24 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str,
 // Orchestration Control Functions
 // ============================================================================
 
+/// True when the duroxide store exposes the SQL enqueue surface — i.e. the
+/// duroxide-pg (PostgreSQL) provider — in the given schema. Detected by the
+/// presence of `enqueue_orchestrator_work`. When false (a non-pg provider, or a
+/// schema predating the wrapper), df.start() falls back to the out-of-band
+/// client path. The catalog probe never raises, so it is safe in a backend
+/// session even when the schema is absent.
+fn in_tx_enqueue_supported(schema: &str) -> bool {
+    Spi::get_one_with_args::<bool>(
+        "SELECT EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \
+         JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \
+         WHERE n.nspname = $1 AND p.proname = 'enqueue_orchestrator_work')",
+        &[schema.into()],
+    )
+    .ok()
+    .flatten()
+    .unwrap_or(false)
+}
+
 /// Starts a durable SQL function.
 /// The fut argument can be either Durofut JSON or plain SQL string (auto-wrapped).
 /// Variables from df.vars are captured and passed to the orchestration.
@@ -912,13 +930,6 @@ pub fn start(
     });
 
     // Start the orchestration durably.
-    //
-    // Option 3 (atomic start): the StartOrchestration work item is enqueued via
-    // SPI inside the CALLER'S transaction, so the df.nodes/df.instances INSERTs
-    // above and this orchestrator-queue INSERT commit or roll back together. A
-    // rolled-back df.start() therefore leaves no _duroxide orphan, and the worker
-    // only observes the queue row after the df.* rows are visible (which also
-    // removes the load-function-graph "wait for commit" race).
     let input = FunctionInput {
         instance_id: instance_id.clone(),
         label: label.map(|s| s.to_string()),
@@ -926,33 +937,63 @@ pub fn start(
     };
     let input_json = serde_json::to_string(&input).unwrap_or_else(|_| instance_id.clone());
 
-    // Fail clearly (and atomically — the df.* INSERTs roll back) if the worker
-    // has not yet initialised the duroxide schema.
-    if !crate::client::is_worker_ready() {
-        pgrx::error!("pg_durable background worker not yet initialized — try again in a moment");
-    }
+    // Prefer the atomic in-transaction enqueue (Option 3) when the duroxide-pg
+    // SQL provider is present; otherwise fall back to the provider-agnostic
+    // out-of-band client path (legacy behavior).
+    let schema = crate::types::backend_duroxide_schema();
+    if in_tx_enqueue_supported(schema) {
+        // Option 3 (atomic start): enqueue the StartOrchestration work item via
+        // SPI inside the CALLER'S transaction, so the df.nodes/df.instances
+        // INSERTs above and this orchestrator-queue INSERT commit or roll back
+        // together. A rolled-back df.start() leaves no duroxide orphan, and the
+        // worker only observes the queue row after the df.* rows are visible
+        // (which also removes the load-function-graph "wait for commit" race).
+        //
+        // Fail clearly (and atomically — the df.* INSERTs roll back) if the
+        // worker has not yet initialised the duroxide schema.
+        if !crate::client::is_worker_ready() {
+            pgrx::error!(
+                "pg_durable background worker not yet initialized — try again in a moment"
+            );
+        }
 
-    // Build the exact same work item the duroxide client would enqueue, by
-    // serializing the real duroxide type (guarantees wire compatibility).
-    let work_item = duroxide::providers::WorkItem::StartOrchestration {
-        instance: instance_id.clone(),
-        orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(),
-        input: input_json,
-        version: None,
-        parent_instance: None,
-        parent_id: None,
-        execution_id: duroxide::INITIAL_EXECUTION_ID,
-    };
-    let work_item_json = serde_json::to_string(&work_item)
-        .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e));
-
-    // SECURITY DEFINER wrapper performs the privileged orchestrator-queue INSERT
-    // on the caller's transaction. Raising on failure aborts the whole df.start.
-    if let Err(e) = Spi::run_with_args(
-        "SELECT df._enqueue_orchestrator_start($1, $2)",
-        &[instance_id.as_str().into(), work_item_json.as_str().into()],
-    ) {
-        pgrx::error!("Failed to enqueue durable function start: {:?}", e);
+        // Build the exact same work item the duroxide client would enqueue, by
+        // serializing the real duroxide type (guarantees wire compatibility).
+        let work_item = duroxide::providers::WorkItem::StartOrchestration {
+            instance: instance_id.clone(),
+            orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(),
+            input: input_json,
+            version: None,
+            parent_instance: None,
+            parent_id: None,
+            execution_id: duroxide::INITIAL_EXECUTION_ID,
+        };
+        let work_item_json = serde_json::to_string(&work_item)
+            .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e));
+
+        // SECURITY DEFINER wrapper performs the privileged orchestrator-queue
+        // INSERT on the caller's transaction. Raising on failure aborts df.start.
+        if let Err(e) = Spi::run_with_args(
+            "SELECT df._enqueue_orchestrator_start($1, $2)",
+            &[instance_id.as_str().into(), work_item_json.as_str().into()],
+        ) {
+            pgrx::error!("Failed to enqueue durable function start: {:?}", e);
+        }
+    } else {
+        // Fallback: no duroxide-pg SQL enqueue surface detected (a non-pg
+        // provider, or a schema predating the wrapper). Enqueue out-of-band via
+        // the duroxide client. NOTE: this is the legacy path and is NOT atomic
+        // with the caller's transaction.
+        if let Err(e) = crate::client::start_durable_function(
+            crate::orchestrations::execute_function_graph::NAME,
+            &instance_id,
+            &input_json,
+        ) {
+            pgrx::log!(
+                "pg_durable: Warning - failed to start durable function: {}",
+                e
+            );
+        }
     }
 
     instance_id
diff --git a/src/lib.rs b/src/lib.rs
index 7f8c723..7da9639 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -570,20 +570,24 @@ REVOKE EXECUTE ON FUNCTION df.revoke_usage(text) FROM PUBLIC;
 // df.start() enqueues the StartOrchestration work item by calling this wrapper
 // via SPI, inside the caller's transaction. The duroxide orchestrator queue
 // grants INSERT to its owner only, so the privileged INSERT must run as a
-// definer that owns the _duroxide tables; the call still commits/rolls back as
+// definer that owns the duroxide tables; the call still commits/rolls back as
 // part of the caller's transaction, giving an atomic df.start().
 //
-// The body references _duroxide.enqueue_orchestrator_work, which is created by
-// the background worker's migrations (not this extension). plpgsql resolves it
-// at call time, by which point df.start()'s worker-readiness gate guarantees the
-// _duroxide schema exists.
+// PROVIDER REQUIREMENT: this path only works with the duroxide-pg (SQL-backed)
+// provider, because it calls the provider's `enqueue_orchestrator_work` SQL
+// function directly. df.start() probes for that function and falls back to the
+// provider-agnostic out-of-band client path when it is absent (see dsl::start).
+//
+// The schema is resolved dynamically via df.duroxide_schema() ('_duroxide' on
+// fresh installs, legacy 'duroxide' on upgraded ones); %I quotes the identifier.
+// The function is created by the worker's migrations (not this extension);
+// plpgsql resolves it at call time.
 //
 // SECURITY (POC): this wrapper can enqueue an arbitrary orchestrator work item.
 // EXECUTE is revoked from PUBLIC and granted only via df.grant_usage(). A
 // production version should move this entrypoint into duroxide-pg (owned by the
-// _duroxide schema owner) and/or validate the work item — see
-// docs/spec-atomic-start.md. The _duroxide schema name is also hard-coded here
-// for the POC (it is resolved dynamically elsewhere).
+// duroxide schema owner) and/or validate the work item — see
+// docs/spec-atomic-start.md.
 extension_sql!(
     r#"
 CREATE FUNCTION df._enqueue_orchestrator_start(p_instance_id text, p_work_item text)
@@ -592,8 +596,11 @@ LANGUAGE plpgsql
 SECURITY DEFINER
 SET search_path = pg_catalog
 AS $fn$
+DECLARE
+    sch text := df.duroxide_schema();
 BEGIN
-    PERFORM _duroxide.enqueue_orchestrator_work(p_instance_id, p_work_item, pg_catalog.now());
+    EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch)
+        USING p_instance_id, p_work_item, pg_catalog.now();
 END;
 $fn$;
 
@@ -603,6 +610,82 @@ REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLI
     requires = ["create_tables"]
 );
 
+// ============================================================================
+// Reconciler (Option 4) — repair residual df.* / duroxide divergence
+// ============================================================================
+//
+// Atomic df.start() prevents the rolled-back-start orphan leak at the source,
+// but some divergence can still arise: legacy orphans, the non-atomic fallback
+// path, df.signal()/df.cancel() (not yet migrated), and crashes mid-execution.
+// df.reconcile() is a best-effort garbage collector for that residue. It is
+// admin-only (EXECUTE revoked from PUBLIC) and SECURITY DEFINER so it can touch
+// the duroxide-owned tables and all users' df.instances.
+//
+// Schedule it durably (dogfooding pg_durable), e.g. every 5 minutes:
+//   SELECT df.start(
+//     @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ),
+//     'reconciler');
+//
+// Only ROOT duroxide instances (parent_instance_id IS NULL) are considered —
+// sub-orchestrations (JOIN/RACE branches, loop generations) intentionally have
+// no df.instances row and must not be collected. A grace window avoids racing
+// legitimately in-flight operations.
+extension_sql!(
+    r#"
+CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60)
+RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint)
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+DECLARE
+    sch        text := df.duroxide_schema();
+    orphan_ids text[];
+    stuck      bigint := 0;
+BEGIN
+    -- 1) Root duroxide instances with no df.instances row, older than the grace
+    --    window → orphans (e.g. rolled-back starts on the fallback path). Delete.
+    EXECUTE pg_catalog.format(
+        'SELECT pg_catalog.array_agg(d.instance_id) '
+        'FROM %I.instances d '
+        'LEFT JOIN df.instances i ON i.id = d.instance_id '
+        'WHERE i.id IS NULL '
+        '  AND d.parent_instance_id IS NULL '
+        '  AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1)',
+        sch)
+    INTO orphan_ids
+    USING p_grace_seconds;
+
+    IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN
+        EXECUTE pg_catalog.format('SELECT %I.delete_instances_atomic($1, $2)', sch)
+            USING orphan_ids, true;
+    END IF;
+
+    -- 2) df.instances stuck non-terminal with no live duroxide instance and no
+    --    queued start (lost enqueue) → mark failed.
+    EXECUTE pg_catalog.format(
+        'UPDATE df.instances i '
+        'SET status = ''failed'', updated_at = pg_catalog.now() '
+        'WHERE i.status IN (''pending'', ''running'') '
+        '  AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) '
+        '  AND NOT EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = i.id) '
+        '  AND NOT EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = i.id)',
+        sch, sch)
+    USING p_grace_seconds;
+    GET DIAGNOSTICS stuck = ROW_COUNT;
+
+    duroxide_orphans_deleted := COALESCE(pg_catalog.array_length(orphan_ids, 1), 0);
+    stuck_instances_failed := stuck;
+    RETURN NEXT;
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df.reconcile(integer) FROM PUBLIC;
+"#,
+    name = "reconcile",
+    requires = ["create_tables"]
+);
+
 // ============================================================================
 // Extension Validation (must run before duroxide schema creation)
 // ============================================================================

From f72c6a07cf179a21ec0bacb0c0f3d6914a416c06 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 01:48:36 +0000
Subject: [PATCH 08/20] Option 4: built-in durable reconciler cron loop

The background worker now keeps one reconciler instance running per cluster,
dogfooding pg_durable as a durable cron loop:
  df.start(df.loop(df.seq('SELECT * FROM df.reconcile()',
                          df.wait_for_schedule(<cron>))), 'df_reconciler')

- worker::ensure_reconciler runs each epoch: idempotent (skips if a reconciler
  is pending/running; self-heals if it died), submits as a dedicated
  NON-superuser role df_reconciler (worker-created; granted df.grant_usage() +
  EXECUTE on df.reconcile()). Non-superuser identity avoids the
  enable_superuser_instances guard and bounds blast radius to 'trigger GC'.
- New GUC pg_durable.reconciler_cron (default '*/5 * * * *'; empty disables).

Validated: auto-starts as df_reconciler; exactly one instance (idempotent);
scheduled loop GCs a planted orphan on its next tick without self-GC'ing; E2E
subset 6/6 unaffected with it running. Two gotchas handled: pg_-prefixed role
names are reserved (-> df_reconciler), and a set-returning reconciler must be
called as SELECT * FROM df.reconcile() (bare call yields an unsupported RECORD
column in execute-sql). Spec updated.
---
 docs/spec-atomic-start.md |  57 ++++++++++++++------
 src/lib.rs                |  26 +++++++--
 src/worker.rs             | 107 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 171 insertions(+), 19 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 3679439..d0b86cb 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -379,7 +379,18 @@ Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch
   admin-only (revoked from PUBLIC) reconciler that deletes orphaned **root**
   duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row) via
   `delete_instances_atomic`, and fails `df.instances` stuck non-terminal with no
-  live duroxide instance and no queued start. Schedulable as a durable cron loop.
+  live duroxide instance and no queued start.
+- **Built-in durable cron loop** — the background worker
+  (`worker::ensure_reconciler`) keeps exactly one reconciler instance running per
+  cluster, dogfooding pg_durable:
+  `df.start(df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule(<cron>))), 'df_reconciler')`.
+  It is idempotent (skips if one is pending/running; self-heals if it died),
+  driven by the `pg_durable.reconciler_cron` GUC (default `*/5 * * * *`; empty
+  disables), and submitted by a dedicated **non-superuser** role `df_reconciler`
+  (created by the worker, granted only `df.grant_usage()` + EXECUTE on
+  `df.reconcile()`). A non-superuser identity sidesteps the
+  `enable_superuser_instances` guard and bounds the blast radius to "trigger GC"
+  even if the instance were forged.
 
 **Provider requirement:** the atomic path only works with the **duroxide-pg
 (SQL-backed) provider**, because it calls the provider's SQL function directly.
@@ -401,6 +412,11 @@ The probe + fallback make this non-breaking for any other provider.
 - **The reconciler must exclude sub-orchestrations.** JOIN/RACE branches and loop
   generations are legitimate duroxide instances with no `df.instances` row;
   filtering on `parent_instance_id IS NULL` (root only) prevents collecting them.
+- **Two gotchas in the cron loop:** the reconciler role cannot be named with a
+  `pg_` prefix (PostgreSQL reserves it) — hence `df_reconciler`; and the loop
+  node must call the set-returning reconciler as `SELECT * FROM df.reconcile()`
+  (a bare `SELECT df.reconcile()` yields an unsupported `RECORD` column in
+  `execute-sql`).
 
 **Validated behavior:**
 
@@ -414,13 +430,16 @@ The probe + fallback make this non-breaking for any other provider.
 | Non-superuser (`df_e2e_user`) start path | pass |
 | `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass |
 | `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass |
+| Built-in reconciler auto-starts as non-superuser `df_reconciler`, idempotent (exactly one) | pass |
+| Scheduled loop GCs a planted orphan on its next tick, without self-GC'ing | pass |
+| E2E subset unaffected with the reconciler running (incl. heartbeat + join) | 6/6 |
 | `cargo fmt --check`; no new clippy warnings | pass |
 
 **POC scope / deferred to productionization:** fresh install only (no upgrade
 script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg
-owned by the schema owner, and/or validate the work item); the reconciler is a
-plain SQL function (Phase 3 wires it into a built-in durable cron loop and tunes
-grace/policy).
+owned by the schema owner, and/or validate the work item); the reconciler's
+grace/cadence policy still needs tuning, and the `df_reconciler` role is created
+by the worker rather than via managed provisioning.
 
 > **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the
 > BGW-owned `_duroxide` schema half-broken (migrations row present, functions
@@ -464,27 +483,33 @@ See the Dual-write inventory above for the precise classification. In short:
   "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not
   a consistency fix — lowest priority.
 
-## Complementary backstop — Option 4 (prototyped, Phase 3)
+## Complementary backstop — Option 4 (implemented)
 
 As part of the chosen approach, a lightweight **reconciler** sweeps for residual
 divergence that an atomic start cannot prevent (crashes mid-execution,
 not-yet-migrated signal/cancel paths, `status` mirror drift, and legacy/fallback
-orphans). The POC ships it as **`df.reconcile(grace_seconds int default 60)`**
+orphans). It ships as **`df.reconcile(grace_seconds int default 60)`**
 (`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances
 (`parent_instance_id IS NULL`, no `df.instances` row, older than the grace
 window) via `delete_instances_atomic`, and marks `df.instances` rows stuck
 non-terminal with no live duroxide instance and no queued start as `failed`.
 Sub-orchestrations (JOIN/RACE branches, loop generations) are excluded by the
-root filter so they are never collected. Phase 3 wires this into a built-in
-durable cron loop (dogfooding pg_durable):
+root filter so they are never collected.
+
+The background worker runs it as a **built-in durable cron loop** (dogfooding
+pg_durable), ensuring exactly one instance per cluster:
 
 ```sql
-SELECT df.start(
-  @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ),
-  'reconciler');
+df.start(
+  df.loop(df.seq('SELECT * FROM df.reconcile()',
+                 df.wait_for_schedule(current_setting('pg_durable.reconciler_cron')))),
+  'df_reconciler');
 ```
 
-It complements Option 3 rather than replacing it.
+started by `worker::ensure_reconciler` each epoch (idempotent; self-heals if it
+died), submitted by the dedicated non-superuser `df_reconciler` role, and driven
+by `pg_durable.reconciler_cron` (default `*/5 * * * *`; empty disables). It
+complements Option 3 rather than replacing it.
 
 ## Upgrade & Migration
 
@@ -547,9 +572,11 @@ E2E (tests/e2e/sql/) and unit coverage:
    `_duroxide` schema resolution, entrypoint hardening.)
 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` /
    `df.cancel()`.
-3. **Phase 3 — reconciler prototyped:** `df.reconcile()` exists and is validated
-   (orphan GC + stuck-instance failover); remaining work is wiring it into a
-   built-in durable cron loop and tuning the grace/policy.
+3. **Phase 3 — reconciler implemented:** `df.reconcile()` plus a built-in durable
+   cron loop (worker-managed `df_reconciler` instance on
+   `pg_durable.reconciler_cron`) are validated (auto-start, idempotency,
+   scheduled orphan GC, no self-GC). Remaining: tune the grace/cadence policy and
+   role provisioning.
 
 ## Open questions / risks
 
diff --git a/src/lib.rs b/src/lib.rs
index 7da9639..4a29fd7 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -29,6 +29,12 @@ pub static EXECUTION_ACQUIRE_TIMEOUT: GucSetting<i32> = GucSetting::<i32>::new(3
 /// functions are explicitly desired. See docs/superuser_guc.md.
 pub static ENABLE_SUPERUSER_INSTANCES: GucSetting<bool> = GucSetting::<bool>::new(false);
 
+/// Cron schedule for the built-in durable reconciler (Option 4 / df.reconcile()).
+/// The background worker ensures one reconciler instance per cluster, running on
+/// this schedule. Set to an empty string to disable the built-in reconciler.
+pub static RECONCILER_CRON: GucSetting<Option<CString>> =
+    GucSetting::<Option<CString>>::new(Some(c"*/5 * * * *"));
+
 // Module declarations
 pub mod activities;
 pub mod client;
@@ -135,6 +141,15 @@ pub extern "C-unwind" fn _PG_init() {
         GucFlags::SUPERUSER_ONLY,
     );
 
+    GucRegistry::define_string_guc(
+        c"pg_durable.reconciler_cron",
+        c"Cron schedule for the built-in durable reconciler (df.reconcile()); empty disables it",
+        c"The background worker keeps one reconciler instance running on this schedule to repair residual df.* / duroxide divergence. Empty string disables the built-in reconciler. Requires server restart to change.",
+        &RECONCILER_CRON,
+        GucContext::Postmaster,
+        GucFlags::default(),
+    );
+
     worker::register_background_worker();
 }
 
@@ -621,10 +636,13 @@ REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLI
 // admin-only (EXECUTE revoked from PUBLIC) and SECURITY DEFINER so it can touch
 // the duroxide-owned tables and all users' df.instances.
 //
-// Schedule it durably (dogfooding pg_durable), e.g. every 5 minutes:
-//   SELECT df.start(
-//     @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ),
-//     'reconciler');
+// The background worker keeps one reconciler instance running automatically
+// (worker::ensure_reconciler), dogfooding pg_durable as a durable cron loop:
+//   df.start(df.loop(df.seq('SELECT * FROM df.reconcile()',
+//                           df.wait_for_schedule(<pg_durable.reconciler_cron>))),
+//            'df_reconciler')
+// submitted by the dedicated non-superuser role df_reconciler. Set
+// pg_durable.reconciler_cron = '' to disable it.
 //
 // Only ROOT duroxide instances (parent_instance_id IS NULL) are considered —
 // sub-orchestrations (JOIN/RACE branches, loop generations) intentionally have
diff --git a/src/worker.rs b/src/worker.rs
index 4f6d66a..c9c99d1 100644
--- a/src/worker.rs
+++ b/src/worker.rs
@@ -202,6 +202,10 @@ async fn run_duroxide_runtime() {
             }
         };
 
+        // Ensure the built-in durable reconciler is running for this epoch
+        // (Option 4). Idempotent: starts one only if none is pending/running.
+        ensure_reconciler(&mgmt_pool).await;
+
         run_until_extension_dropped_or_shutdown(
             &mgmt_pool,
             duroxide_runtime,
@@ -215,6 +219,109 @@ async fn run_duroxide_runtime() {
     mgmt_pool.close().await;
 }
 
+/// Built-in durable reconciler (Option 4). Ensures exactly one reconciler
+/// instance is running per cluster: an infinite durable loop that calls
+/// `df.reconcile()` on the `pg_durable.reconciler_cron` schedule.
+///
+/// Idempotent — does nothing if a reconciler is already pending/running, and
+/// (re)starts one if it has died. The instance is submitted by a dedicated
+/// **non-superuser** role (`pg_durable_reconciler`) granted only df usage plus
+/// EXECUTE on the SECURITY DEFINER `df.reconcile()`. Using a non-superuser
+/// identity avoids the enable_superuser_instances guard, and bounds the blast
+/// radius to "trigger GC" even if the instance were forged.
+async fn ensure_reconciler(pool: &sqlx::PgPool) {
+    const ROLE: &str = "df_reconciler";
+    const LABEL: &str = "df_reconciler";
+
+    let cron = crate::RECONCILER_CRON
+        .get()
+        .map(|c| c.to_string_lossy().into_owned())
+        .unwrap_or_default();
+    let cron = cron.trim().to_string();
+    if cron.is_empty() {
+        return; // built-in reconciler disabled
+    }
+
+    // Skip if a reconciler is already pending/running (idempotent across epochs;
+    // self-heals if a previous one failed/was cancelled). Runs as the worker
+    // role, which bypasses RLS, so it sees the instance regardless of owner.
+    let already_running: i64 = match sqlx::query_scalar(
+        "SELECT count(*) FROM df.instances \
+         WHERE label = $1 AND status IN ('pending', 'running')",
+    )
+    .bind(LABEL)
+    .fetch_one(pool)
+    .await
+    {
+        Ok(n) => n,
+        Err(e) => {
+            log!("pg_durable: reconciler liveness check failed: {}", e);
+            return;
+        }
+    };
+    if already_running > 0 {
+        return;
+    }
+
+    // Ensure the dedicated role exists with the minimal grants.
+    if let Err(e) = ensure_reconciler_role(pool, ROLE).await {
+        log!("pg_durable: failed to provision reconciler role: {}", e);
+        return;
+    }
+
+    // Start the durable loop as the dedicated role on a single connection so the
+    // SET ROLE applies to df.start() (which captures submitted_by = current_user).
+    let started = async {
+        let mut conn = pool.acquire().await?;
+        sqlx::query(&format!("SET ROLE {ROLE}"))
+            .execute(&mut *conn)
+            .await?;
+        let res = sqlx::query(
+            "SELECT df.start(\
+               df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule($1))), \
+               $2)",
+        )
+        .bind(&cron)
+        .bind(LABEL)
+        .execute(&mut *conn)
+        .await;
+        // Always reset the role before the connection returns to the pool.
+        let _ = sqlx::query("RESET ROLE").execute(&mut *conn).await;
+        res.map(|_| ())
+    }
+    .await;
+
+    match started {
+        Ok(()) => log!("pg_durable: started built-in reconciler (cron='{}')", cron),
+        Err(e) => log!("pg_durable: failed to start built-in reconciler: {}", e),
+    }
+}
+
+/// Create the dedicated non-superuser reconciler role (if absent) and grant it
+/// df usage plus EXECUTE on df.reconcile(). Idempotent.
+async fn ensure_reconciler_role(pool: &sqlx::PgPool, role: &str) -> Result<(), sqlx::Error> {
+    // role is a fixed compile-time constant, so the format! is injection-safe.
+    sqlx::query(&format!(
+        "DO $$ BEGIN \
+           IF NOT EXISTS (SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '{role}') THEN \
+             CREATE ROLE {role} LOGIN; \
+           END IF; \
+         END $$;"
+    ))
+    .execute(pool)
+    .await?;
+    sqlx::query("SELECT df.grant_usage($1)")
+        .bind(role)
+        .execute(pool)
+        .await?;
+    sqlx::query(&format!(
+        "GRANT EXECUTE ON FUNCTION df.reconcile(integer) TO {role}"
+    ))
+    .execute(pool)
+    .await?;
+    Ok(())
+}
+
 async fn wait_for_extension_creation(poll_pool: &sqlx::PgPool, poll_interval: Duration) -> bool {
     log!("pg_durable: waiting for CREATE EXTENSION pg_durable...");
 

From 33e9648c55e9310410a1236c277e579aadb7c99a Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 02:43:53 +0000
Subject: [PATCH 09/20] Address code-review findings (security + correctness)

Fixes from an SoT review of the atomic-start + reconciler work:

- HIGH (security): df._enqueue_orchestrator_start no longer accepts an opaque
  work item. It builds the StartOrchestration server-side and authorizes only a
  brand-new, not-yet-started instance (pending df.instances row, no queue entry,
  no duroxide instance), so a df user can no longer forge Cancel/Signal/Activity
  work items against another user's instance. Wrapper is now (text,text,text).
- HIGH (correctness): df.reconcile() gathers the FULL orphan subtree (recursive
  on parent_instance_id) before delete_instances_atomic, which refuses to delete
  a parent without its children; each pass is wrapped in EXCEPTION so a failure
  never aborts reconcile or kills the built-in loop. Count now reflects rows
  actually deleted.
- HIGH (security): reconciler singleton liveness keys on the unforgeable
  submitted_by = df_reconciler role, not the user-writable label, so a user
  cannot suppress GC with a look-alike instance.
- MED/HIGH: WAIT_SCHEDULE recomputes its delay from the deterministic clock
  (ctx.utc_now) each generation instead of replaying a fixed build-time offset,
  fixing a df.loop(df.wait_for_schedule) busy-loop (calculate_cron_wait_from).
- MED: ensure_reconciler is re-asserted from the steady-state poll loop, so a
  reconciler that dies mid-epoch restarts within the poll interval.
- MED: df.start() warns when it takes the non-atomic fallback path.
- Docs: keep df_reconciler LOGIN (connect_as_user needs it); correct stale role
  name in comments; qualify the 5s-race claim to the atomic path; spec Review
  hardening section.

Verified: forged enqueues denied; root+child orphan GC'd with no error; label
spoof does not suppress GC; reconciler fires on cron (no spin) and self-heals in
~4s after cancel; E2E subset 6/6; fmt clean, no new clippy warnings.
---
 docs/spec-atomic-start.md                    |  70 +++++++--
 src/dsl.rs                                   |  38 +++--
 src/lib.rs                                   | 153 ++++++++++++++-----
 src/orchestrations/execute_function_graph.rs |  29 +++-
 src/types.rs                                 |  20 ++-
 src/worker.rs                                |  46 ++++--
 6 files changed, 259 insertions(+), 97 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index d0b86cb..5022f92 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -70,7 +70,8 @@ schemas; run duroxide's start-enqueue inside the caller's backend transaction vi
 SPI, alongside the `df.*` writes.
 *Guarantee:* strongest — `df.nodes` + `df.instances` + the queue row + the
 caller's surrounding statements all commit or roll back together; identity
-capture and the `df.*` read surface are untouched; the 5 s load race disappears.
+capture and the `df.*` read surface are untouched; the 5 s load race disappears
+on the atomic path.
 *Cost:* small — the start is a single SQL-function call — with one wrinkle: a
 `SECURITY DEFINER` enqueue entrypoint is needed because the queue INSERT is
 owner-only. **Chosen as the primary fix.**
@@ -205,7 +206,7 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job
   and the worker picks the instance up only after the `df.*` rows are visible.
 - Eliminate the rolled-back-start orphan leak at the source.
 - Eliminate the `load-function-graph` "wait up to 5 s for the row to appear"
-  race (it becomes structurally impossible).
+  race on the atomic path (the retry is retained for the non-atomic fallback).
 - No change to the `df.*` schema's read surface (status/result/monitoring/explain
   keep working unchanged).
 
@@ -322,9 +323,17 @@ AS $$ BEGIN
 END $$;
 ```
 
-The wrapper takes **only** an opaque `instance_id`, a `work_item` string, and a
-timestamp — it executes no caller-supplied SQL, so the `SECURITY DEFINER`
-surface is a single fixed INSERT.
+The wrapper takes the caller's `instance_id`, orchestration name, and input
+string; it **constructs the `StartOrchestration` work item server-side** (so the
+caller cannot choose the work-item variant or target a foreign instance) and
+**authorizes** the enqueue only for a brand-new, not-yet-started instance
+(`pending` `df.instances` row, no orchestrator-queue entry, no duroxide
+instance — under atomic-start semantics reachable only for the row `df.start`
+just inserted in the current transaction). It executes no caller-supplied SQL.
+This is what prevents the wrapper from being a generic, cross-tenant
+arbitrary-work-item enqueue primitive (which would otherwise let any df user
+forge `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another
+user's instance, bypassing the RLS gate `df.signal`/`df.cancel` enforce).
 
 ### Identity / security
 
@@ -345,15 +354,17 @@ authenticate using `df.instances.submitted_by` and re-check the superuser policy
   `df.nodes`/`df.instances` still aborts before it (no behavior change for the
   collision case, now fully inside one transaction).
 
-### Consequence: the 5 s load race disappears
+### Consequence: the 5 s load race disappears (on the atomic path)
 
-`load-function-graph` currently polls for up to `MAX_WAIT_SECS = 5` because the
-worker could dequeue a start item before the caller's `df.*` rows committed
-(src/activities/load_function_graph.rs:21,60). With an atomic start, a queue row
+`load-function-graph` polls for up to `MAX_WAIT_SECS = 5` because the worker
+could dequeue a start item before the caller's `df.*` rows committed
+(src/activities/load_function_graph.rs:21,60). On the **atomic path** a queue row
 becomes visible to the worker **only** after the `df.*` rows commit in the same
 transaction, so the "not yet visible / rolled back" branch cannot occur. The
-retry loop can be simplified to a single read (optionally retain a short grace
-for replication scenarios). This also removes the worker-saturation failure mode.
+retry is **retained** rather than removed, because the non-atomic **fallback**
+path can still enqueue before the `df.*` rows commit; it is a no-op on the atomic
+path. So this removes the worker-saturation failure mode for atomic starts while
+staying correct for the fallback.
 
 ## Proof of concept (validated)
 
@@ -435,6 +446,43 @@ The probe + fallback make this non-breaking for any other provider.
 | E2E subset unaffected with the reconciler running (incl. heartbeat + join) | 6/6 |
 | `cargo fmt --check`; no new clippy warnings | pass |
 
+### Review hardening (applied)
+
+A source-of-truth code review surfaced and fixed several issues:
+
+- **Cross-tenant enqueue bypass (HIGH).** The wrapper originally accepted an
+  opaque `work_item` and was granted to every df user, letting a caller forge
+  `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another user's
+  instance. Fixed: the wrapper now **builds the `StartOrchestration` server-side**
+  and **authorizes** only a brand-new, not-yet-started instance — verified that a
+  non-superuser can no longer forge an enqueue for a foreign or arbitrary id.
+- **Reconciler self-destruct on parallel-workflow orphans (HIGH).**
+  `delete_instances_atomic` refuses to delete a parent without its children (even
+  with `force`), so deleting only orphan **roots** raised on any JOIN/RACE/loop
+  orphan and aborted GC. Fixed: gather the **full subtree** (recursive on
+  `parent_instance_id`) and wrap each pass in an exception block so one failure
+  never aborts reconcile or kills the loop — verified root+child GC with no error.
+- **GC-suppression via user label (HIGH).** The singleton key was the
+  user-writable `label`; any user could park a `df_reconciler`-labelled instance
+  to stop the GC from starting. Fixed: key liveness on the unforgeable
+  `submitted_by = df_reconciler` role.
+- **Cron busy-loop (MED/HIGH).** `df.wait_for_schedule` baked a fixed offset at
+  build time that a `df.loop` replayed every generation (worst case ~1 s spin).
+  Fixed: the WAIT_SCHEDULE node now recomputes the delay from the orchestration's
+  deterministic clock (`ctx.utc_now()`) each generation — verified the reconciler
+  fires on the cron tick and does not spin. (Note: adding the clock read changes
+  the orchestration history shape, so a wait-in-loop instance in flight across the
+  upgrade may need to restart.)
+- **No in-epoch self-healing (MED).** The reconciler was (re)started only per
+  epoch. Fixed: `ensure_reconciler` is also re-asserted from the steady-state poll
+  loop — verified a cancelled reconciler restarts within the poll interval.
+- **Silent non-atomic fallback (MED).** `df.start()` now emits a `WARNING` when it
+  takes the non-atomic fallback path.
+- **Debated/rejected:** the `df_reconciler` role keeps `LOGIN` — the worker
+  executes the reconcile node by connecting *as* the role via `connect_as_user`,
+  exactly like every other durable-function role; the proposed `NOLOGIN` would
+  break node execution.
+
 **POC scope / deferred to productionization:** fresh install only (no upgrade
 script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg
 owned by the schema owner, and/or validate the work item); the reconciler's
diff --git a/src/dsl.rs b/src/dsl.rs
index 532f667..de786cb 100644
--- a/src/dsl.rs
+++ b/src/dsl.rs
@@ -957,33 +957,31 @@ pub fn start(
             );
         }
 
-        // Build the exact same work item the duroxide client would enqueue, by
-        // serializing the real duroxide type (guarantees wire compatibility).
-        let work_item = duroxide::providers::WorkItem::StartOrchestration {
-            instance: instance_id.clone(),
-            orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(),
-            input: input_json,
-            version: None,
-            parent_instance: None,
-            parent_id: None,
-            execution_id: duroxide::INITIAL_EXECUTION_ID,
-        };
-        let work_item_json = serde_json::to_string(&work_item)
-            .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e));
-
-        // SECURITY DEFINER wrapper performs the privileged orchestrator-queue
-        // INSERT on the caller's transaction. Raising on failure aborts df.start.
+        // The SECURITY DEFINER wrapper builds the StartOrchestration work item
+        // server-side from these trusted arguments (the caller cannot choose the
+        // work-item variant or target a foreign instance) and performs the
+        // privileged orchestrator-queue INSERT on the caller's transaction.
+        // Raising on failure aborts the whole df.start atomically.
         if let Err(e) = Spi::run_with_args(
-            "SELECT df._enqueue_orchestrator_start($1, $2)",
-            &[instance_id.as_str().into(), work_item_json.as_str().into()],
+            "SELECT df._enqueue_orchestrator_start($1, $2, $3)",
+            &[
+                instance_id.as_str().into(),
+                crate::orchestrations::execute_function_graph::NAME.into(),
+                input_json.as_str().into(),
+            ],
         ) {
             pgrx::error!("Failed to enqueue durable function start: {:?}", e);
         }
     } else {
         // Fallback: no duroxide-pg SQL enqueue surface detected (a non-pg
         // provider, or a schema predating the wrapper). Enqueue out-of-band via
-        // the duroxide client. NOTE: this is the legacy path and is NOT atomic
-        // with the caller's transaction.
+        // the duroxide client. NOTE: this path is NOT atomic with the caller's
+        // transaction — a rollback will NOT undo the start. Warn so the
+        // non-atomic semantics are observable.
+        pgrx::warning!(
+            "pg_durable: df.start() is using the non-atomic fallback enqueue \
+             (duroxide-pg SQL surface not detected); a rollback will not undo this start"
+        );
         if let Err(e) = crate::client::start_durable_function(
             crate::orchestrations::execute_function_graph::NAME,
             &instance_id,
diff --git a/src/lib.rs b/src/lib.rs
index 4a29fd7..d7a6b15 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -430,7 +430,7 @@ DECLARE
         'df.explain(text)',
         'df.target_database()',
         -- Atomic-start enqueue wrapper (called by df.start() via SPI)
-        'df._enqueue_orchestrator_start(text, text)'
+        'df._enqueue_orchestrator_start(text, text, text)'
     ];
 BEGIN
     -- Validate the role exists
@@ -598,28 +598,74 @@ REVOKE EXECUTE ON FUNCTION df.revoke_usage(text) FROM PUBLIC;
 // The function is created by the worker's migrations (not this extension);
 // plpgsql resolves it at call time.
 //
-// SECURITY (POC): this wrapper can enqueue an arbitrary orchestrator work item.
-// EXECUTE is revoked from PUBLIC and granted only via df.grant_usage(). A
-// production version should move this entrypoint into duroxide-pg (owned by the
-// duroxide schema owner) and/or validate the work item — see
-// docs/spec-atomic-start.md.
+// SECURITY: this wrapper is SECURITY DEFINER and granted to every df user via
+// df.grant_usage() (df.start() runs SECURITY INVOKER and calls it via SPI). To
+// avoid handing users an arbitrary-work-item enqueue primitive (which would let
+// a caller forge CancelInstance/ExternalRaised/etc. against another user's
+// instance), it (1) constructs the StartOrchestration work item server-side from
+// the caller's arguments — the caller cannot choose the variant or a foreign
+// target — and (2) authorizes the enqueue only for a brand-new, not-yet-started
+// instance (pending df.instances row, no queue entry, no duroxide instance),
+// which under atomic-start semantics is reachable only for the row df.start just
+// inserted in the current transaction. Hardening direction (move the entrypoint
+// into duroxide-pg, owned by the schema owner) is in docs/spec-atomic-start.md.
 extension_sql!(
     r#"
-CREATE FUNCTION df._enqueue_orchestrator_start(p_instance_id text, p_work_item text)
+CREATE FUNCTION df._enqueue_orchestrator_start(
+    p_instance_id   text,
+    p_orchestration text,
+    p_input         text)
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
 SET search_path = pg_catalog
 AS $fn$
 DECLARE
-    sch text := df.duroxide_schema();
+    sch       text := df.duroxide_schema();
+    work_item text;
+    v_blocked boolean;
 BEGIN
+    -- Authorization. This runs as the (privileged) definer, so it must not
+    -- trust the caller to only target their own instance. Permit the enqueue
+    -- only for a brand-new, not-yet-started instance: a 'pending' df.instances
+    -- row with no orchestrator-queue entry and no duroxide instance. Under the
+    -- atomic-start path a committed df.instances row always has its queue row in
+    -- the same transaction, so this state is reachable only for the instance the
+    -- caller (df.start) just inserted in the current transaction — never another
+    -- user's committed instance. This is what stops a caller from forging work
+    -- against a foreign instance.
+    EXECUTE pg_catalog.format(
+        'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') '
+        '    OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) '
+        '    OR EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = $1)',
+        sch, sch)
+    INTO v_blocked
+    USING p_instance_id;
+
+    IF v_blocked THEN
+        RAISE EXCEPTION 'pg_durable: not authorized to enqueue a start for instance %', p_instance_id
+            USING ERRCODE = 'insufficient_privilege';
+    END IF;
+
+    -- Build the StartOrchestration work item server-side so the caller cannot
+    -- choose the work-item variant (no CancelInstance/ExternalRaised/etc.) or
+    -- target a different instance. Mirrors duroxide's WorkItem::StartOrchestration.
+    work_item := pg_catalog.json_build_object(
+        'StartOrchestration', pg_catalog.json_build_object(
+            'instance',        p_instance_id,
+            'orchestration',   p_orchestration,
+            'input',           p_input,
+            'version',         NULL,
+            'parent_instance', NULL,
+            'parent_id',       NULL,
+            'execution_id',    1))::text;
+
     EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch)
-        USING p_instance_id, p_work_item, pg_catalog.now();
+        USING p_instance_id, work_item, pg_catalog.now();
 END;
 $fn$;
 
-REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLIC;
+REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM PUBLIC;
 "#,
     name = "atomic_start_enqueue",
     requires = ["create_tables"]
@@ -659,40 +705,71 @@ AS $fn$
 DECLARE
     sch        text := df.duroxide_schema();
     orphan_ids text[];
+    deleted    bigint := 0;
     stuck      bigint := 0;
 BEGIN
-    -- 1) Root duroxide instances with no df.instances row, older than the grace
-    --    window → orphans (e.g. rolled-back starts on the fallback path). Delete.
-    EXECUTE pg_catalog.format(
-        'SELECT pg_catalog.array_agg(d.instance_id) '
-        'FROM %I.instances d '
-        'LEFT JOIN df.instances i ON i.id = d.instance_id '
-        'WHERE i.id IS NULL '
-        '  AND d.parent_instance_id IS NULL '
-        '  AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1)',
-        sch)
-    INTO orphan_ids
-    USING p_grace_seconds;
-
-    IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN
-        EXECUTE pg_catalog.format('SELECT %I.delete_instances_atomic($1, $2)', sch)
+    -- 1) Delete orphaned duroxide subtrees: every duroxide instance whose ROOT
+    --    ancestor has no df.instances row and is older than the grace window.
+    --    We must gather the FULL subtree (root + all descendants) because
+    --    delete_instances_atomic refuses (even with force) to delete a parent
+    --    whose children are not also in the list. Sub-orchestrations (JOIN/RACE
+    --    branches, loop generations) have no df.instances row and would be
+    --    mis-detected as roots if we keyed on "no df row" alone, so the orphan
+    --    seed is restricted to parent_instance_id IS NULL.
+    --    Wrapped so a GC failure never aborts reconcile or kills the built-in
+    --    reconciler loop.
+    BEGIN
+        EXECUTE pg_catalog.format(
+            'WITH RECURSIVE orphan_root AS ( '
+            '    SELECT d.instance_id '
+            '    FROM %1$I.instances d '
+            '    LEFT JOIN df.instances i ON i.id = d.instance_id '
+            '    WHERE i.id IS NULL '
+            '      AND d.parent_instance_id IS NULL '
+            '      AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) '
+            '), subtree AS ( '
+            '    SELECT instance_id FROM orphan_root '
+            '    UNION '
+            '    SELECT c.instance_id FROM %1$I.instances c '
+            '    JOIN subtree s ON c.parent_instance_id = s.instance_id '
+            ') SELECT pg_catalog.array_agg(instance_id) FROM subtree',
+            sch)
+        INTO orphan_ids
+        USING p_grace_seconds;
+
+        IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN
+            EXECUTE pg_catalog.format(
+                'SELECT instances_deleted FROM %I.delete_instances_atomic($1, $2)', sch)
+            INTO deleted
             USING orphan_ids, true;
-    END IF;
+        END IF;
+    EXCEPTION WHEN OTHERS THEN
+        deleted := 0;
+        RAISE WARNING 'pg_durable: reconcile orphan-GC pass failed: %', SQLERRM;
+    END;
 
     -- 2) df.instances stuck non-terminal with no live duroxide instance and no
-    --    queued start (lost enqueue) → mark failed.
-    EXECUTE pg_catalog.format(
-        'UPDATE df.instances i '
-        'SET status = ''failed'', updated_at = pg_catalog.now() '
-        'WHERE i.status IN (''pending'', ''running'') '
-        '  AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) '
-        '  AND NOT EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = i.id) '
-        '  AND NOT EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = i.id)',
-        sch, sch)
-    USING p_grace_seconds;
-    GET DIAGNOSTICS stuck = ROW_COUNT;
+    --    queued start (lost enqueue) → mark failed. The duroxide queue row
+    --    persists (locked) until ack, and the instance row is created at ack, so
+    --    a healthy in-flight start always matches one of the NOT EXISTS guards
+    --    and is never failed here. Best-effort; wrapped like step 1.
+    BEGIN
+        EXECUTE pg_catalog.format(
+            'UPDATE df.instances i '
+            'SET status = ''failed'', updated_at = pg_catalog.now() '
+            'WHERE i.status IN (''pending'', ''running'') '
+            '  AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) '
+            '  AND NOT EXISTS (SELECT 1 FROM %1$I.instances d WHERE d.instance_id = i.id) '
+            '  AND NOT EXISTS (SELECT 1 FROM %1$I.orchestrator_queue q WHERE q.instance_id = i.id)',
+            sch)
+        USING p_grace_seconds;
+        GET DIAGNOSTICS stuck = ROW_COUNT;
+    EXCEPTION WHEN OTHERS THEN
+        stuck := 0;
+        RAISE WARNING 'pg_durable: reconcile stuck-failover pass failed: %', SQLERRM;
+    END;
 
-    duroxide_orphans_deleted := COALESCE(pg_catalog.array_length(orphan_ids, 1), 0);
+    duroxide_orphans_deleted := deleted;
     stuck_instances_failed := stuck;
     RETURN NEXT;
 END;
diff --git a/src/orchestrations/execute_function_graph.rs b/src/orchestrations/execute_function_graph.rs
index fd17722..a652231 100644
--- a/src/orchestrations/execute_function_graph.rs
+++ b/src/orchestrations/execute_function_graph.rs
@@ -506,16 +506,31 @@ async fn execute_wait_schedule_node(
     let config: serde_json::Value = serde_json::from_str(config_str)
         .map_err(|e| format!("Invalid WAIT_SCHEDULE config: {e}"))?;
 
-    let wait_seconds = config["wait_seconds"]
-        .as_u64()
-        .ok_or_else(|| "WAIT_SCHEDULE missing wait_seconds".to_string())?;
-
-    let cron_expr = config["cron_expr"].as_str().unwrap_or("?");
+    let baked_wait_seconds = config["wait_seconds"].as_u64().unwrap_or(0);
+    let cron_expr = config["cron_expr"].as_str();
+
+    // Recompute the delay from the orchestration's deterministic clock on each
+    // execution, so a df.loop around df.wait_for_schedule waits until the NEXT
+    // cron tick rather than replaying the fixed offset baked at DSL-build time
+    // (which would make every loop generation wait the same — possibly ~0s —
+    // interval, busy-looping). ctx.utc_now() is recorded in history, so this
+    // stays deterministic on replay. Falls back to the baked seconds if the
+    // clock read or cron parse is unavailable.
+    let wait = match (cron_expr, ctx.utc_now().await) {
+        (Some(cron), Ok(now_st)) => {
+            let now_dt: chrono::DateTime<chrono::Utc> = now_st.into();
+            crate::types::calculate_cron_wait_from(cron, now_dt)
+                .unwrap_or_else(|_| Duration::from_secs(baked_wait_seconds))
+        }
+        _ => Duration::from_secs(baked_wait_seconds),
+    };
 
     ctx.trace_info(format!(
-        "Waiting {wait_seconds} seconds until schedule: {cron_expr}"
+        "Waiting {}s until schedule: {}",
+        wait.as_secs(),
+        cron_expr.unwrap_or("?")
     ));
-    ctx.schedule_timer(Duration::from_secs(wait_seconds)).await;
+    ctx.schedule_timer(wait).await;
 
     Ok(r#"{"scheduled": true}"#.to_string())
 }
diff --git a/src/types.rs b/src/types.rs
index 28aacbe..961e1fc 100644
--- a/src/types.rs
+++ b/src/types.rs
@@ -353,25 +353,29 @@ pub fn worker_provider_config(
     config
 }
 
-/// Calculate the duration until the next cron schedule match
-pub fn calculate_cron_wait(cron_expr: &str) -> Result<Duration, String> {
+/// Calculate the duration until the next cron schedule match, relative to a
+/// caller-supplied `now`. Used by the WAIT_SCHEDULE node executor with the
+/// orchestration's deterministic clock so a looped `df.wait_for_schedule` waits
+/// until the *next* tick each generation instead of replaying a fixed offset.
+pub fn calculate_cron_wait_from(cron_expr: &str, now: DateTime<Utc>) -> Result<Duration, String> {
     let cron_with_seconds = format!("0 {cron_expr}");
 
     let schedule = CronSchedule::from_str(&cron_with_seconds)
         .map_err(|e| format!("Invalid cron expression '{cron_expr}': {e}"))?;
 
-    let now: DateTime<Utc> = Utc::now();
-
     let next = schedule
-        .upcoming(Utc)
+        .after(&now)
         .next()
         .ok_or_else(|| "No upcoming schedule found".to_string())?;
 
-    let duration = (next - now)
+    (next - now)
         .to_std()
-        .map_err(|_| "Failed to calculate wait duration".to_string())?;
+        .map_err(|_| "Failed to calculate wait duration".to_string())
+}
 
-    Ok(duration)
+/// Calculate the duration until the next cron schedule match
+pub fn calculate_cron_wait(cron_expr: &str) -> Result<Duration, String> {
+    calculate_cron_wait_from(cron_expr, Utc::now())
 }
 
 /// Evaluate a condition result to determine if it's truthy.
diff --git a/src/worker.rs b/src/worker.rs
index c9c99d1..2b6d285 100644
--- a/src/worker.rs
+++ b/src/worker.rs
@@ -219,16 +219,19 @@ async fn run_duroxide_runtime() {
     mgmt_pool.close().await;
 }
 
-/// Built-in durable reconciler (Option 4). Ensures exactly one reconciler
-/// instance is running per cluster: an infinite durable loop that calls
-/// `df.reconcile()` on the `pg_durable.reconciler_cron` schedule.
+/// Built-in durable reconciler (Option 4). Ensures one reconciler instance is
+/// running per cluster: an infinite durable loop that calls `df.reconcile()` on
+/// the `pg_durable.reconciler_cron` schedule.
 ///
 /// Idempotent — does nothing if a reconciler is already pending/running, and
-/// (re)starts one if it has died. The instance is submitted by a dedicated
-/// **non-superuser** role (`pg_durable_reconciler`) granted only df usage plus
-/// EXECUTE on the SECURITY DEFINER `df.reconcile()`. Using a non-superuser
-/// identity avoids the enable_superuser_instances guard, and bounds the blast
-/// radius to "trigger GC" even if the instance were forged.
+/// (re)starts one if it has died. Called once per epoch and then periodically
+/// from the steady-state poll loop, so a reconciler that fails mid-epoch is
+/// restarted within the poll interval rather than only at the next epoch. The
+/// instance is submitted by a dedicated **non-superuser** role
+/// (`df_reconciler`) granted only df usage plus EXECUTE on the SECURITY DEFINER
+/// `df.reconcile()`. Using a non-superuser identity avoids the
+/// enable_superuser_instances guard, and bounds the blast radius to "trigger GC"
+/// even if the instance were forged.
 async fn ensure_reconciler(pool: &sqlx::PgPool) {
     const ROLE: &str = "df_reconciler";
     const LABEL: &str = "df_reconciler";
@@ -242,14 +245,20 @@ async fn ensure_reconciler(pool: &sqlx::PgPool) {
         return; // built-in reconciler disabled
     }
 
-    // Skip if a reconciler is already pending/running (idempotent across epochs;
-    // self-heals if a previous one failed/was cancelled). Runs as the worker
-    // role, which bypasses RLS, so it sees the instance regardless of owner.
+    // Skip if a reconciler is already pending/running (idempotent; self-heals if
+    // a previous one failed/was cancelled). Runs as the worker role (bypasses
+    // RLS). The liveness key is the unforgeable submitted_by = df_reconciler
+    // role, NOT the user-supplied `label`: any df user can call df.start() with
+    // label 'df_reconciler', so keying on label would let an unprivileged user
+    // suppress the GC by parking a look-alike instance. A regular user cannot
+    // submit as df_reconciler (submitted_by = current_user at start time), so
+    // submitted_by is a sound singleton key. `::text` avoids a regrole cast
+    // error before the role exists (it simply matches no rows).
     let already_running: i64 = match sqlx::query_scalar(
         "SELECT count(*) FROM df.instances \
-         WHERE label = $1 AND status IN ('pending', 'running')",
+         WHERE submitted_by::text = $1 AND status IN ('pending', 'running')",
     )
-    .bind(LABEL)
+    .bind(ROLE)
     .fetch_one(pool)
     .await
     {
@@ -299,6 +308,13 @@ async fn ensure_reconciler(pool: &sqlx::PgPool) {
 
 /// Create the dedicated non-superuser reconciler role (if absent) and grant it
 /// df usage plus EXECUTE on df.reconcile(). Idempotent.
+///
+/// The role is `LOGIN` on purpose: the worker executes the reconcile SQL node by
+/// opening a real connection AS submitted_by via `connect_as_user` (see
+/// types.rs), exactly as for every other durable-function role, so it must be
+/// connectable. It is non-superuser and holds only df usage + EXECUTE on
+/// df.reconcile(); password/auth exposure is governed by the deployment's
+/// pg_hba.conf, the same as any other df submitted_by role.
 async fn ensure_reconciler_role(pool: &sqlx::PgPool, role: &str) -> Result<(), sqlx::Error> {
     // role is a fixed compile-time constant, so the format! is injection-safe.
     sqlx::query(&format!(
@@ -734,6 +750,10 @@ async fn run_until_extension_dropped_or_shutdown(
                     log!("pg_durable: epoch sentinel gone — extension dropped or recreated");
                     break;
                 }
+                // Self-heal: re-assert the built-in reconciler so a loop that
+                // died mid-epoch is restarted within the poll interval rather
+                // than only at the next epoch. Idempotent (no-op while alive).
+                ensure_reconciler(poll_pool).await;
             }
         }
     }

From bbb4fed121b1690edc2559183438bb60f2d50095 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 04:36:25 +0000
Subject: [PATCH 10/20] Enqueue df.signal/df.cancel on the caller's transaction

Route df.signal()/df.cancel() through SECURITY DEFINER wrappers
(df._enqueue_orchestrator_signal, df._enqueue_orchestrator_cancel)
called via SPI, so their _duroxide enqueue commits or rolls back with
the caller's transaction, matching the atomic df.start() path.

- Work items (ExternalRaised / CancelInstance) are built server-side,
  so a caller cannot choose the variant or target a foreign instance.
- Ownership is authorized via pg_has_role(session_user,
  submitted_by::oid, 'MEMBER'): session_user is unforgeable under
  SECURITY DEFINER and membership honors SET ROLE.
- Signal preserves the root + running-descendant fan-out via a
  recursive CTE over parent_instance_id.
- Both retain the out-of-band fallback (with a WARNING) when the
  duroxide-pg SQL surface is absent.

Update docs/spec-atomic-start.md to reflect signal/cancel as migrated.
---
 docs/spec-atomic-start.md | 123 ++++++++++++++++++++++++--------------
 src/dsl.rs                |  51 ++++++++++++++--
 src/lib.rs                | 107 +++++++++++++++++++++++++++++++--
 3 files changed, 226 insertions(+), 55 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 5022f92..527583d 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -1,6 +1,6 @@
 # Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence
 
-**Status:** Proposal — Option 3 prototyped & validated (POC)
+**Status:** Implemented & validated — Option 3 (atomic `df.start`/`df.cancel`/`df.signal`) + Option 4 (reconciler)
 **Author:** pg_durable team
 **Date:** June 2026
 
@@ -104,12 +104,12 @@ backstop alongside Option 3.
 ### Decision
 
 Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source
-for backend-initiated starts (and, in a later phase, `df.cancel`); Option 4 is a
-lightweight safety net for the residual divergence that no start-time atomicity
-can prevent (crashes mid-execution, the best-effort `df.*` mirror, and
-not-yet-migrated signal/cancel). Options 1 and 2 are rejected — Option 1 is
-disproportionate to the problem, and Option 2 is a strictly weaker variant of
-Option 3.
+for every backend-initiated cross-store write — `df.start()`, `df.cancel()`, and
+`df.signal()` all enqueue on the caller's transaction; Option 4 is a lightweight
+safety net for the residual divergence that no in-transaction enqueue can prevent
+(crashes mid-execution and the best-effort `df.*` mirror). Options 1 and 2 are
+rejected — Option 1 is disproportionate to the problem, and Option 2 is a strictly
+weaker variant of Option 3.
 
 ## Background: how a start is written today
 
@@ -164,25 +164,27 @@ seams are bounded and fall into two distinct classes.
 **Backend (session) side — `df.*` on the caller's transaction + an out-of-band
 duroxide-client enqueue.** This is the class Option 3 fixes.
 
-| Site | `df.*` write (user tx, SPI) | `_duroxide` write (out-of-band pool) | Dual-write? |
+| Site | `df.*` write (user tx, SPI) | `_duroxide` write | Now in caller tx? |
 |------|------|------|------|
-| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** — Phase 1 |
-| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `cancel_instance` enqueue (src/dsl.rs:955) | **Yes** — Phase 2 |
-| `df.signal()` | **none** (only an RLS read of `df.instances`) | `raise_event` + descendant fan-out (src/dsl.rs:616) | **No** — single store |
+| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** |
+| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `CancelInstance` enqueue | **Yes** |
+| `df.signal()` | **none** (only an RLS read of `df.instances`) | `ExternalRaised` + descendant fan-out (src/dsl.rs:616) | **Yes** |
 
 Notes:
 
 - **`df.cancel()` is a genuine dual-write** with two twists vs `df.start()`: it
-  enqueues to duroxide **first** and then does the `df.*` UPDATE (opposite
-  order), and the UPDATE is only an *optimistic* mirror — the authoritative
-  `cancelled` status is re-applied when the worker processes the cancel via
+  enqueued to duroxide **first** and then did the `df.*` UPDATE (opposite order),
+  and the UPDATE is only an *optimistic* mirror — the authoritative `cancelled`
+  status is re-applied when the worker processes the cancel via
   `update-instance-status`, so divergence here is usually transient and
-  self-healing. It still belongs in the same in-transaction treatment for strict
-  atomicity.
+  self-healing. It now uses the same in-transaction enqueue as `df.start()`, so
+  the optimistic UPDATE and the `CancelInstance` enqueue commit or roll back
+  together.
 - **`df.signal()` is not a dual-write**: it writes only `_duroxide`. There is no
-  cross-store inconsistency to create; moving it in-transaction only changes
-  "don't deliver the signal if my surrounding tx rolls back" semantics — a
-  nice-to-have, not a consistency fix.
+  cross-store inconsistency to create; moving it in-transaction changes
+  "don't deliver the signal if my surrounding tx rolls back" semantics, which is
+  the least-surprising contract and is applied here for consistency with the
+  other paths.
 - Not seams: `df.run()` is a no-op stub; `df.result/status/explain/monitoring`
   are reads; `df.vars` and `df._worker_epoch` touch only `df.*`.
 
@@ -204,6 +206,9 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job
 - `df.start()` is **atomic with the caller's transaction**: on rollback, no
   `df.*` rows *and* no `_duroxide` start item remain; on commit, both are present
   and the worker picks the instance up only after the `df.*` rows are visible.
+- `df.cancel()` and `df.signal()` enqueue on the **caller's transaction** too: a
+  rolled-back cancel/signal leaves nothing in `_duroxide`, and a committed one is
+  durable with the caller's other work.
 - Eliminate the rolled-back-start orphan leak at the source.
 - Eliminate the `load-function-graph` "wait up to 5 s for the row to appear"
   race on the atomic path (the retry is retained for the non-atomic fallback).
@@ -218,8 +223,6 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job
   consistent by design).
 - Changing short-id generation / the id space. (Independent; can be addressed
   separately. This spec makes a collision a clean, fully-atomic abort.)
-- Migrating `df.signal()` / `df.cancel()` to the in-transaction path — same
-  mechanism, scoped as a follow-up (see Phasing).
 
 ## Design
 
@@ -344,6 +347,34 @@ item carries only the orchestration name/input. The worker continues to
 authenticate using `df.instances.submitted_by` and re-check the superuser policy
 (unchanged). No new forgery surface is introduced.
 
+### Cancel and signal wrappers
+
+`df.cancel()` and `df.signal()` enqueue on the caller's transaction through their
+own `SECURITY DEFINER` wrappers (`df._enqueue_orchestrator_cancel`,
+`df._enqueue_orchestrator_signal`). Both build the work item server-side
+(`CancelInstance` / `ExternalRaised`) via `json_build_object`, so a caller cannot
+choose the variant or smuggle a foreign target.
+
+The start wrapper can authorize structurally (a brand-new `pending` instance is
+only reachable for the row the caller just inserted). Cancel and signal target an
+**already-committed** instance, so they authorize explicitly: the wrapper checks
+`pg_has_role(session_user, df.instances.submitted_by::oid, 'MEMBER')`
+(`submitted_by` is a `regrole`, so `::oid` yields the owning role's OID).
+`session_user` is unforgeable inside `SECURITY DEFINER` (unlike `current_user`,
+which is the definer), and membership — rather than an exact-name match — lets a
+role that owns the instance via `SET ROLE` still cancel/signal it. A caller that
+is not a member of the owning role is rejected before any enqueue, mirroring the
+RLS gate `df.cancel`/`df.signal` already enforce. The wrappers run no
+caller-supplied SQL.
+
+**Signal fan-out.** A signal must reach the root instance **and** every running
+sub-orchestration (a descendant may be the one blocked on
+`wait_for_signal`). The wrapper recurses `_duroxide.instances.parent_instance_id`
+from the root, joins each instance to its current execution, keeps only
+executions whose `status` is `Running` (lower-cased), and enqueues one
+`ExternalRaised` per surviving instance — all in the caller's transaction, so the
+whole fan-out commits or rolls back atomically.
+
 ### Worker readiness and ordering
 
 - Keep the existing worker-readiness gate: if `_worker_ready` is absent/stale,
@@ -439,6 +470,10 @@ The probe + fallback make this non-breaking for any other provider.
 | E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 |
 | JOIN determinism under the new (faster) timing | 5/5 |
 | Non-superuser (`df_e2e_user`) start path | pass |
+| Committed `df.cancel()` cancels; rolled-back `df.cancel()` leaves the instance running and rolls back the queue enqueue | pass |
+| Committed `df.signal()` delivers (incl. fan-out to a sub-orchestration in a race); rolled-back `df.signal()` rolls back the enqueue | pass |
+| Cancel/signal authz: a non-member caller is denied on the wrapper and via the RLS gate; the owning role (incl. via `SET ROLE`) succeeds | pass |
+| E2E: `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 |
 | `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass |
 | `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass |
 | Built-in reconciler auto-starts as non-superuser `df_reconciler`, idempotent (exactly one) | pass |
@@ -513,30 +548,26 @@ transaction that later rolls back will **not** run — which is the desired fix,
 but a behavior change from "fire-and-forget at statement time." This must be
 called out in `USER_GUIDE.md`.
 
-## Signals & cancellation (follow-up)
+## Signals & cancellation (migrated)
 
-See the Dual-write inventory above for the precise classification. In short:
+`df.cancel()` and `df.signal()` use the same in-transaction enqueue as
+`df.start()` (see Cancel and signal wrappers above). In short:
 
-- **`df.cancel()` is a genuine dual-write** (`df.instances.status` + a
-  `CancelInstance` enqueue) and should get the same in-transaction treatment as
-  `df.start()`. It is scoped as **Phase 2** rather than Phase 1 only because it
-  targets an already-committed instance (so the rolled-back-*start* leak does not
-  apply) and because the worker's `update-instance-status` activity already
-  re-applies the authoritative `cancelled` status, making its divergence
-  transient. The `SECURITY DEFINER` enqueue entrypoint should be generalized to
-  carry an arbitrary work item (e.g. `client_enqueue(instance, work_item,
-  visible_at)`) so it serves cancel directly.
-- **`df.signal()` is not a dual-write** (it writes only `_duroxide`), so it
-  creates no cross-store inconsistency. It can ride on the same entrypoint for
-  "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not
-  a consistency fix — lowest priority.
+- **`df.cancel()`** — the optimistic `df.instances.status='cancelled'` UPDATE and
+  the `CancelInstance` enqueue now commit or roll back together on the caller's
+  transaction. The worker's `update-instance-status` activity still re-applies the
+  authoritative status, so a worker-side crash remains transient/self-healing.
+- **`df.signal()`** — the `ExternalRaised` fan-out (root + running
+  sub-orchestrations) is enqueued on the caller's transaction, giving "don't
+  deliver if my tx rolls back" semantics. It writes only `_duroxide`, so this is a
+  consistency-of-contract change rather than a cross-store fix.
 
 ## Complementary backstop — Option 4 (implemented)
 
 As part of the chosen approach, a lightweight **reconciler** sweeps for residual
-divergence that an atomic start cannot prevent (crashes mid-execution,
-not-yet-migrated signal/cancel paths, `status` mirror drift, and legacy/fallback
-orphans). It ships as **`df.reconcile(grace_seconds int default 60)`**
+divergence that an in-transaction enqueue cannot prevent (crashes mid-execution,
+the best-effort `df.*` `status` mirror, and legacy/fallback orphans). It ships as
+**`df.reconcile(grace_seconds int default 60)`**
 (`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances
 (`parent_instance_id IS NULL`, no `df.instances` row, older than the grace
 window) via `delete_instances_atomic`, and marks `df.instances` rows stuck
@@ -613,13 +644,15 @@ E2E (tests/e2e/sql/) and unit coverage:
 
 ## Phasing
 
-1. **Phase 1 (this spec) — prototyped & validated:** atomic `df.start()` via the
+1. **Phase 1 — implemented & validated:** atomic `df.start()` via the
    `SECURITY DEFINER` enqueue entrypoint; readiness gate + legacy fallback;
-   simplify the `load-function-graph` retry; docs + tests. (POC complete; see
-   Proof of concept. Remaining for production: upgrade script, dynamic
-   `_duroxide` schema resolution, entrypoint hardening.)
-2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` /
-   `df.cancel()`.
+   simplify the `load-function-graph` retry; docs + tests. Remaining for
+   production: upgrade script and entrypoint hardening.
+2. **Phase 2 — implemented & validated:** the same in-transaction enqueue for
+   `df.signal()` (root + running-descendant `ExternalRaised` fan-out) and
+   `df.cancel()` (`CancelInstance`), each via a `SECURITY DEFINER` wrapper that
+   builds the work item server-side and authorizes the caller via
+   `pg_has_role(session_user, ...)`.
 3. **Phase 3 — reconciler implemented:** `df.reconcile()` plus a built-in durable
    cron loop (worker-managed `df_reconciler` instance on
    `pg_durable.reconciler_cron`) are validated (auto-start, idempotency,
diff --git a/src/dsl.rs b/src/dsl.rs
index de786cb..75c3b42 100644
--- a/src/dsl.rs
+++ b/src/dsl.rs
@@ -612,9 +612,32 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str,
         pgrx::error!("Instance not found or access denied: {}", instance_id);
     }
 
-    match raise_external_event(instance_id, signal_name, &signal_data) {
-        Ok(_) => "OK".to_string(),
-        Err(e) => pgrx::error!("Failed to send signal: {}", e),
+    // Enqueue the ExternalRaised work item. Prefer the in-transaction SPI path
+    // (atomic with the caller's transaction; fans out to running descendants in
+    // the wrapper) when the duroxide-pg SQL surface is present; otherwise fall
+    // back to the out-of-band client path.
+    let schema = crate::types::backend_duroxide_schema();
+    if in_tx_enqueue_supported(schema) {
+        if let Err(e) = Spi::run_with_args(
+            "SELECT df._enqueue_orchestrator_signal($1, $2, $3)",
+            &[
+                instance_id.into(),
+                signal_name.into(),
+                signal_data.as_str().into(),
+            ],
+        ) {
+            pgrx::error!("Failed to send signal: {:?}", e);
+        }
+        "OK".to_string()
+    } else {
+        pgrx::warning!(
+            "pg_durable: df.signal() is using the non-atomic fallback enqueue \
+             (duroxide-pg SQL surface not detected)"
+        );
+        match raise_external_event(instance_id, signal_name, &signal_data) {
+            Ok(_) => "OK".to_string(),
+            Err(e) => pgrx::error!("Failed to send signal: {}", e),
+        }
     }
 }
 
@@ -1015,8 +1038,26 @@ pub fn cancel(instance_id: &str, reason: default!(&str, "'Cancelled by user'"))
         pgrx::error!("Instance not found or access denied: {}", instance_id);
     }
 
-    if let Err(e) = cancel_durable_function(instance_id, reason) {
-        return format!("Failed to cancel: {e}");
+    // Enqueue the CancelInstance work item and flip the status mirror together.
+    // On the in-transaction path the enqueue and the status UPDATE commit
+    // atomically (no df.* / duroxide divergence on cancel); otherwise fall back
+    // to the out-of-band client path.
+    let schema = crate::types::backend_duroxide_schema();
+    if in_tx_enqueue_supported(schema) {
+        if let Err(e) = Spi::run_with_args(
+            "SELECT df._enqueue_orchestrator_cancel($1, $2)",
+            &[instance_id.into(), reason.into()],
+        ) {
+            pgrx::error!("Failed to cancel: {:?}", e);
+        }
+    } else {
+        pgrx::warning!(
+            "pg_durable: df.cancel() is using the non-atomic fallback enqueue \
+             (duroxide-pg SQL surface not detected)"
+        );
+        if let Err(e) = cancel_durable_function(instance_id, reason) {
+            return format!("Failed to cancel: {e}");
+        }
     }
 
     // Update the instance status to 'cancelled' via SPI only when the instance is not
diff --git a/src/lib.rs b/src/lib.rs
index d7a6b15..a7b2af2 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -430,7 +430,9 @@ DECLARE
         'df.explain(text)',
         'df.target_database()',
         -- Atomic-start enqueue wrapper (called by df.start() via SPI)
-        'df._enqueue_orchestrator_start(text, text, text)'
+        'df._enqueue_orchestrator_start(text, text, text)',
+        'df._enqueue_orchestrator_cancel(text, text)',
+        'df._enqueue_orchestrator_signal(text, text, text)'
     ];
 BEGIN
     -- Validate the role exists
@@ -671,14 +673,109 @@ REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM
     requires = ["create_tables"]
 );
 
+// ============================================================================
+// In-transaction signal / cancel enqueue (Part 1, extended) — SECURITY DEFINER
+// ============================================================================
+//
+// df.signal()/df.cancel() enqueue their work items (ExternalRaised /
+// CancelInstance) via SPI through these wrappers, inside the caller's
+// transaction, instead of out-of-band on the duroxide client pool. Both target
+// an already-committed instance, so the start wrapper's "brand-new instance"
+// guard does not apply; instead they authorize on ownership.
+//
+// AUTHORIZATION: these are SECURITY DEFINER (the orchestrator queue is
+// owner-only) and granted to every df user via df.grant_usage(), so they must
+// not let a caller signal/cancel a foreign instance. current_user is the definer
+// here, so ownership is checked against session_user (the unforgeable
+// authenticated role) plus membership in the instance's submitted_by — i.e. the
+// caller's session must be able to act as the instance owner. The work items are
+// built server-side from the arguments (no opaque caller-supplied work item).
+extension_sql!(
+    r#"
+CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason text)
+RETURNS void
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+DECLARE
+    sch       text := df.duroxide_schema();
+    owner_oid oid;
+BEGIN
+    SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id;
+    IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN
+        RAISE EXCEPTION 'pg_durable: not authorized to cancel instance %', p_instance_id
+            USING ERRCODE = 'insufficient_privilege';
+    END IF;
+
+    EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch)
+        USING p_instance_id,
+              pg_catalog.json_build_object('CancelInstance',
+                  pg_catalog.json_build_object('instance', p_instance_id, 'reason', p_reason))::text,
+              pg_catalog.now();
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) FROM PUBLIC;
+
+CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text, p_data text)
+RETURNS void
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+DECLARE
+    sch       text := df.duroxide_schema();
+    owner_oid oid;
+BEGIN
+    SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id;
+    IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN
+        RAISE EXCEPTION 'pg_durable: not authorized to signal instance %', p_instance_id
+            USING ERRCODE = 'insufficient_privilege';
+    END IF;
+
+    -- Raise the event for the target instance and every RUNNING descendant
+    -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one
+    -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema.
+    EXECUTE pg_catalog.format(
+        'INSERT INTO %1$I.orchestrator_queue (instance_id, work_item, visible_at, created_at) '
+        'SELECT t.instance_id, '
+        '       pg_catalog.json_build_object(''ExternalRaised'', '
+        '           pg_catalog.json_build_object(''instance'', t.instance_id, ''name'', $2, ''data'', $3))::text, '
+        '       pg_catalog.now(), pg_catalog.now() '
+        'FROM ( '
+        '    WITH RECURSIVE tree AS ( '
+        '        SELECT i.instance_id, i.current_execution_id, true AS is_root '
+        '        FROM %1$I.instances i WHERE i.instance_id = $1 '
+        '        UNION '
+        '        SELECT c.instance_id, c.current_execution_id, false '
+        '        FROM %1$I.instances c JOIN tree p ON c.parent_instance_id = p.instance_id '
+        '    ) '
+        '    SELECT tr.instance_id '
+        '    FROM tree tr '
+        '    LEFT JOIN %1$I.executions e '
+        '      ON e.instance_id = tr.instance_id AND e.execution_id = tr.current_execution_id '
+        '    WHERE tr.is_root OR pg_catalog.lower(COALESCE(e.status, '''')) = ''running'' '
+        ') t',
+        sch)
+    USING p_instance_id, p_name, p_data;
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM PUBLIC;
+"#,
+    name = "atomic_signal_cancel_enqueue",
+    requires = ["create_tables"]
+);
+
 // ============================================================================
 // Reconciler (Option 4) — repair residual df.* / duroxide divergence
 // ============================================================================
 //
-// Atomic df.start() prevents the rolled-back-start orphan leak at the source,
-// but some divergence can still arise: legacy orphans, the non-atomic fallback
-// path, df.signal()/df.cancel() (not yet migrated), and crashes mid-execution.
-// df.reconcile() is a best-effort garbage collector for that residue. It is
+// Atomic df.start()/df.signal()/df.cancel() commit their duroxide enqueue with
+// the caller's transaction, but some divergence can still arise: legacy orphans,
+// the non-atomic fallback path, and crashes mid-execution. df.reconcile() is a
+// best-effort garbage collector for that residue. It is
 // admin-only (EXECUTE revoked from PUBLIC) and SECURITY DEFINER so it can touch
 // the duroxide-owned tables and all users' df.instances.
 //

From bf10f6f5dcf3dc52e4315a1e325111e05d99ff96 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 14:52:23 +0000
Subject: [PATCH 11/20] Rewrite atomic-start spec: align with implementation,
 tighten, de-jargon

The spec had drifted from the shipped code (Rust-side work-item
construction, a duroxide-pg wrapper placement that never landed, stale
df.__enqueue_start naming, POC/phasing framing) and triplicated the
reconciler and signal/cancel descriptions.

Rewrite end-to-end against reality: server-side work-item build in the
df-schema SECURITY DEFINER wrappers, start authorized by instance state
and cancel/signal by pg_has_role(session_user, ...), the actual
worker reconciler DSL, and a single prevent/repair structure. Drop POC
language and dead jargon. 676 -> 368 lines; no code changes.
---
 docs/spec-atomic-start.md | 984 +++++++++++++-------------------------
 1 file changed, 338 insertions(+), 646 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 527583d..17f64e7 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -1,676 +1,368 @@
-# Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence
+# Keeping the control plane and the runtime consistent
 
-**Status:** Implemented & validated — Option 3 (atomic `df.start`/`df.cancel`/`df.signal`) + Option 4 (reconciler)
+**Status:** Implemented and validated. Atomic `df.start`/`df.cancel`/`df.signal`,
+plus a reconciler that repairs leftover drift.
 **Author:** pg_durable team
 **Date:** June 2026
 
-## Overview
-
-`df.start()` writes to two independent stores that are **not** committed together:
-
-- the **pg_durable control plane** (`df.nodes`, `df.instances`) — written on the
-  caller's backend transaction via SPI; and
-- the **duroxide runtime** (`_duroxide.orchestrator_queue`) — enqueued
-  out-of-band on a *separate* sqlx connection pool, committed independently.
-
-Because the two halves are not atomic, the stores can diverge. The most common
-failure is the "rolled-back start" leak: a `df.start()` whose surrounding
-transaction is rolled back (or whose batch subtransaction aborts on a PK
-collision) leaves an **orphaned orchestration** in `_duroxide` with no
-corresponding `df.*` rows. The background worker then repeatedly tries to load a
-graph that does not exist, fails after a 5 s timeout, and the orphan persists in
-`_duroxide.instances`.
-
-Four designs were explored to keep the two stores consistent (see **Design
-options considered** below). This spec **implements Option 3 (primary) plus
-Option 4 (complementary backstop)**: make the duroxide start-enqueue part of the
-caller's backend transaction so `df.start()` is atomic — either everything
-commits or nothing does — and add an asynchronous reconciler to repair any
-residual drift that start-time atomicity cannot prevent.
-
-### Evidence (observed during investigation)
-
-- A `df.start()` executed inside a transaction that was then `ROLLBACK`-ed left a
-  row in `_duroxide.instances` (e.g. `48819f12`) with **zero** rows in
-  `df.instances`. The orphan survived the rollback.
-- Under load, orphaned start items self-sustain via retry and saturate the
-  2-thread worker with 5 s "Instance not found … transaction may have been
-  rolled back" failures.
-- The short-id PK-collision path (8 hex chars ≈ 2³²) reliably aborts `df.start()`
-  well under 1M instances (observed first-collision draws: 9248, 34845, 40445,
-  60776). That abort is *safe* today (it happens before the duroxide enqueue),
-  but it exercises the same dual-write seam.
-
-## Design options considered
-
-Four approaches were explored for keeping the `df.*` control plane and the
-`_duroxide` runtime consistent. The mechanics of the chosen approach are detailed
-in **Design** below; this section records all four and the decision.
-
-**Option 1 — Single source of truth in `_duroxide`.** Drop the `df.instances` /
-`df.nodes` control tables entirely, keep all instance state in `_duroxide`,
-serialize the graph into the orchestration input, and have duroxide expose
-whatever state pg_durable needs (via views / read APIs).
-*Guarantee:* total — there is only one store, so nothing can diverge.
-*Cost:* large — moves the `submitted_by` security boundary, requires rebuilding
-per-user RLS over duroxide-owned tables, re-exposing every `df.*` read surface,
-and a destructive migration with B1 backward-compat implications. A possible
-long-term direction, not this change.
-
-**Option 2 — Hand duroxide-pg our SQL to run inside *its* transaction.** Keep
-both schemas; extend duroxide-pg so the start-enqueue also executes the
-pg_durable `df.*` writes in the same (sqlx, worker-pool) transaction.
-*Guarantee:* `df.*` ↔ `_duroxide` atomic — but on the worker-pool connection, so
-`df.start()` stops participating in the *caller's* transaction and the `df.*`
-writes run as the pool role, breaking `current_user` / `submitted_by` capture and
-RLS. Strictly weaker than Option 3.
-
-**Option 3 — Hand duroxide-pg *our* (the caller's) transaction.** Keep both
-schemas; run duroxide's start-enqueue inside the caller's backend transaction via
-SPI, alongside the `df.*` writes.
-*Guarantee:* strongest — `df.nodes` + `df.instances` + the queue row + the
-caller's surrounding statements all commit or roll back together; identity
-capture and the `df.*` read surface are untouched; the 5 s load race disappears
-on the atomic path.
-*Cost:* small — the start is a single SQL-function call — with one wrinkle: a
-`SECURITY DEFINER` enqueue entrypoint is needed because the queue INSERT is
-owner-only. **Chosen as the primary fix.**
-
-**Option 4 — Tolerate divergence; repair asynchronously.** Accept a transient
-window and add a reconciler — ideally a built-in durable function — that detects
-and repairs mismatches (orphaned `_duroxide` instances, stuck `df.instances`
-rows, `status`-mirror drift).
-*Guarantee:* none added; eventual consistency with bounded repair latency.
-*Value:* the only option that also catches divergence from crashes
-*mid-execution* and from paths Option 3 does not cover (the worker-side mirror
-activities, and signal/cancel until they are migrated). **Chosen as a
-complementary backstop.**
-
-### Comparison
-
-| Option | Atomic with `df.*` | Atomic with caller tx | Kills 5 s load race | Fixes crash-time drift | Effort |
+## The problem
+
+A durable function lives in two stores that are written separately:
+
+- the **pg_durable control plane** — `df.nodes` and `df.instances`, written on the
+  caller's transaction; and
+- the **duroxide runtime** — `_duroxide.orchestrator_queue`, where the orchestration
+  is enqueued.
+
+Originally `df.start()` wrote the first on the caller's transaction but enqueued the
+second out-of-band, on a separate connection pool that committed on its own. The two
+were never atomic, so they could disagree.
+
+The common failure was an **orphaned orchestration**: a `df.start()` whose
+surrounding transaction rolled back (or whose batch aborted on a primary-key
+collision) left a row in `_duroxide` with no matching `df.*` rows. The background
+worker then kept trying to load a graph that did not exist, failed after a 5-second
+timeout, and the orphan stayed behind. Under load these orphans accumulated and
+saturated the worker's two threads.
+
+We observed this directly: a `df.start()` inside a rolled-back transaction left
+`_duroxide.instances` row `48819f12` with zero rows in `df.instances`, surviving the
+rollback.
+
+## What we did
+
+Two complementary changes:
+
+1. **Prevent** the drift at the source. `df.start()`, `df.cancel()`, and
+   `df.signal()` now enqueue their runtime work on the **caller's transaction**, so
+   the control-plane writes and the runtime enqueue commit or roll back together.
+2. **Repair** what prevention cannot cover. A reconciler periodically deletes
+   leftover orphans and marks stuck instances failed — catching drift from crashes
+   mid-execution and from the legacy fallback path.
+
+Both are pg_durable-only. No `duroxide` / `duroxide-pg` or `Cargo.*` changes.
+
+## Designs we considered
+
+Four approaches were weighed for keeping the two stores in agreement.
+
+**Option 1 — One store.** Drop `df.nodes` / `df.instances` entirely and keep all
+state in `_duroxide`, exposing what pg_durable needs through views.
+*Strongest guarantee* (only one store can't diverge) but the most disruptive: it
+moves the `submitted_by` security boundary, rebuilds per-user row security over
+runtime-owned tables, re-exposes every read path, and needs a destructive migration.
+A possible long-term direction, not this change.
+
+**Option 2 — Run our writes inside duroxide's transaction.** Extend duroxide-pg so
+the enqueue also performs the `df.*` writes, on duroxide's worker-pool connection.
+This makes the two stores atomic — but on the *worker's* connection, so `df.start()`
+no longer joins the *caller's* transaction and the `df.*` writes run as the pool
+role, breaking identity capture and row security. Strictly worse than Option 3.
+
+**Option 3 — Run duroxide's enqueue inside the caller's transaction.** Keep both
+stores; call the runtime's enqueue over SPI on the caller's transaction, next to the
+`df.*` writes. Everything commits or rolls back together, identity capture and the
+read paths are untouched, and the 5-second load race disappears. The only cost is a
+`SECURITY DEFINER` wrapper, because the runtime queue is writable by its owner only.
+**Chosen as the primary fix.**
+
+**Option 4 — Tolerate drift, repair it later.** Accept a brief inconsistency window
+and add a reconciler that finds and fixes mismatches. Adds no guarantee on its own,
+but it is the only option that also catches drift from crashes *mid-execution* —
+something no start-time atomicity can prevent. **Chosen as a backstop.**
+
+| Option | Atomic with `df.*` | Atomic with caller's tx | Removes 5 s race | Fixes crash-time drift | Effort |
 |--------|:--:|:--:|:--:|:--:|:--:|
-| 1 — single source of truth in `_duroxide` | n/a (one store) | no¹ | yes | yes | Large |
-| 2 — our SQL in duroxide's tx | yes | **no** | yes | no | Medium |
-| 3 — enqueue in caller's tx (SPI) | yes | **yes** | **yes** | no | Small–Med |
-| 4 — async GC / reconciler | repairs | — | no | **yes** | Small–Med |
-
-¹ Option 1 also decouples from the caller's transaction unless the enqueue is
-itself done via SPI — at which point its start path converges with Option 3.
-"Fixes crash-time drift" = repairs divergence from a crash *mid-execution* (not
-just at start); only Options 1 and 4 do this, which is why Option 4 is kept as a
-backstop alongside Option 3.
-
-### Decision
-
-Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source
-for every backend-initiated cross-store write — `df.start()`, `df.cancel()`, and
-`df.signal()` all enqueue on the caller's transaction; Option 4 is a lightweight
-safety net for the residual divergence that no in-transaction enqueue can prevent
-(crashes mid-execution and the best-effort `df.*` mirror). Options 1 and 2 are
-rejected — Option 1 is disproportionate to the problem, and Option 2 is a strictly
-weaker variant of Option 3.
-
-## Background: how a start is written today
-
-`crate::dsl::start()` (src/dsl.rs) executes, in order, on the **caller's**
-transaction unless noted:
-
-1. `instance_id = short_id()` (src/dsl.rs:645).
-2. Read-only validation (database exists, caller has `LOGIN`, superuser policy).
-3. `insert_nodes()` → INSERT rows into `df.nodes` (src/dsl.rs:842).
-4. INSERT row into `df.instances` (src/dsl.rs:888).
-5. Capture `df.vars` snapshot.
-6. `start_durable_function()` → `Client::start_orchestration()` →
-   `store.enqueue_for_orchestrator(StartOrchestration)` (src/dsl.rs:923) — runs on
-   the cached `DUROXIDE_CLIENT` sqlx pool (src/client.rs), **separate
-   connection, independently committed.** Failures here are logged, not raised
-   (src/dsl.rs:928).
-
-Key facts that make the fix small:
-
-- **The start is a single enqueue.** `Client::start_orchestration` builds one
-  `WorkItem::StartOrchestration` and calls `enqueue_for_orchestrator`
-  (duroxide `client/mod.rs`). The duroxide-pg provider implements that as one
-  call to the SQL function `_duroxide.enqueue_orchestrator_work(...)`, whose body
-  is a single `INSERT INTO _duroxide.orchestrator_queue`. The
-  `_duroxide.instances`/`history` rows are materialized **later** by the worker.
-- **The atomic unit is therefore just** `{df.nodes INSERTs, df.instances INSERT,
-  one orchestrator_queue INSERT}`.
-- **duroxide-pg is SQL-function based.** Every provider operation is a function
-  in the `_duroxide` schema, so the enqueue can be invoked directly via SPI on
-  the caller's transaction — no async pool required for the start path.
-
-### What each store owns (so we know what must *not* change)
+| 1 — one store | n/a | no | yes | yes | Large |
+| 2 — our writes in duroxide's tx | yes | **no** | yes | no | Medium |
+| 3 — enqueue in caller's tx | yes | **yes** | **yes** | no | Small |
+| 4 — async reconciler | repairs | — | no | **yes** | Small |
+
+We implemented **Option 3 (prevent) plus Option 4 (repair)**. Options 1 and 2 were
+rejected: Option 1 is out of proportion to the problem, and Option 2 is a weaker
+version of Option 3.
+
+## How a start is written
+
+`df.start()` (`src/dsl.rs`) runs these steps on the caller's transaction:
+
+1. Pick an instance id (`short_id()` — eight hex characters).
+2. Validate (database exists, caller can log in, superuser policy).
+3. Insert the graph rows into `df.nodes`.
+4. Insert the instance row into `df.instances`.
+5. Snapshot the caller's `df.vars`.
+6. Enqueue the orchestration.
+
+The fix is small because the start is **a single enqueue**: one
+`StartOrchestration` row in `_duroxide.orchestrator_queue`. The worker builds the
+`_duroxide.instances` and history rows later. And because duroxide-pg exposes every
+operation as a SQL function in the `_duroxide` schema, that enqueue can run over SPI
+on the caller's transaction — no separate pool needed.
+
+### What each store owns
+
+The fix must not move any of these boundaries.
 
 | Data | Source of truth | Notes |
 |------|-----------------|-------|
-| Graph (`df.nodes`) | `df.*` | duroxide does not store it; the worker loads it via the `load-function-graph` activity. |
-| Execution history | `_duroxide.history` | duroxide-internal; transactional per turn via `ack_orchestration_item`. |
-| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | needed by the worker to load + authenticate. |
-| `df.instances.status` | best-effort mirror | written by the `update-instance-status` activity; not atomic with duroxide, self-heals. |
-| `df.nodes.status/result/error` | best-effort mirror | written by the `update-node-status` activity; same properties as above. |
-
-`submitted_by` is a **security boundary**: it is captured as `current_user` at
-`df.start()` time and pinned by a composite FK; the worker re-checks the
-superuser policy (src/activities/load_function_graph.rs). Any change must keep it
-unforgeable.
-
-### Dual-write inventory
-
-Auditing every `df.*` writer against every duroxide-client call, the cross-store
-seams are bounded and fall into two distinct classes.
-
-**Backend (session) side — `df.*` on the caller's transaction + an out-of-band
-duroxide-client enqueue.** This is the class Option 3 fixes.
-
-| Site | `df.*` write (user tx, SPI) | `_duroxide` write | Now in caller tx? |
-|------|------|------|------|
-| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** |
-| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `CancelInstance` enqueue | **Yes** |
-| `df.signal()` | **none** (only an RLS read of `df.instances`) | `ExternalRaised` + descendant fan-out (src/dsl.rs:616) | **Yes** |
-
-Notes:
-
-- **`df.cancel()` is a genuine dual-write** with two twists vs `df.start()`: it
-  enqueued to duroxide **first** and then did the `df.*` UPDATE (opposite order),
-  and the UPDATE is only an *optimistic* mirror — the authoritative `cancelled`
-  status is re-applied when the worker processes the cancel via
-  `update-instance-status`, so divergence here is usually transient and
-  self-healing. It now uses the same in-transaction enqueue as `df.start()`, so
-  the optimistic UPDATE and the `CancelInstance` enqueue commit or roll back
-  together.
-- **`df.signal()` is not a dual-write**: it writes only `_duroxide`. There is no
-  cross-store inconsistency to create; moving it in-transaction changes
-  "don't deliver the signal if my surrounding tx rolls back" semantics, which is
-  the least-surprising contract and is applied here for consistency with the
-  other paths.
-- Not seams: `df.run()` is a no-op stub; `df.result/status/explain/monitoring`
-  are reads; `df.vars` and `df._worker_epoch` touch only `df.*`.
-
-**Worker side — `df.*` mirror maintained by activities, on a connection separate
-from duroxide's history ack.** Option 3 does **not** address this class.
-
-| Activity | `df.*` write | Property |
-|----------|--------------|----------|
-| `update-instance-status` | UPDATE `df.instances.status` | eventually consistent, self-healing |
-| `update-node-status` | UPDATE `df.nodes.status/result/error` | eventually consistent, self-healing |
-
-These are at-least-once activities re-applied on replay, so a crash between the
-duroxide ack and the mirror update is transient lag, not permanent divergence.
-This is the "best-effort mirror"; any persistent drift is the reconciler's job
-(see Complementary backstop), not Option 3's.
-
-## Goals
-
-- `df.start()` is **atomic with the caller's transaction**: on rollback, no
-  `df.*` rows *and* no `_duroxide` start item remain; on commit, both are present
-  and the worker picks the instance up only after the `df.*` rows are visible.
-- `df.cancel()` and `df.signal()` enqueue on the **caller's transaction** too: a
-  rolled-back cancel/signal leaves nothing in `_duroxide`, and a committed one is
-  durable with the caller's other work.
-- Eliminate the rolled-back-start orphan leak at the source.
-- Eliminate the `load-function-graph` "wait up to 5 s for the row to appear"
-  race on the atomic path (the retry is retained for the non-atomic fallback).
-- No change to the `df.*` schema's read surface (status/result/monitoring/explain
-  keep working unchanged).
-
-## Non-goals
-
-- Collapsing `df.*` into `_duroxide` (Option 1) — out of scope; tracked
-  separately as a possible long-term direction.
-- Fixing the best-effort `df.instances.status` mirror (it is eventually
-  consistent by design).
-- Changing short-id generation / the id space. (Independent; can be addressed
-  separately. This spec makes a collision a clean, fully-atomic abort.)
-
-## Design
-
-### Mechanism
-
-In `crate::dsl::start()`, replace step 6 (the out-of-band
-`start_durable_function` pool call) with an **SPI call** that enqueues the start
-work item on the caller's transaction, immediately after the `df.instances`
-INSERT:
+| Graph (`df.nodes`) | `df.*` | The worker loads it; duroxide does not store it. |
+| Execution history | `_duroxide.history` | Runtime-internal; transactional per turn. |
+| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | The worker needs it to load and authenticate. |
+| `df.instances.status` | best-effort mirror | Written by the worker; self-heals, not atomic with the runtime. |
+| `df.nodes.status/result/error` | best-effort mirror | Same as above. |
+
+`submitted_by` is a **security boundary**: it is captured as the calling role at
+start time and must stay unforgeable. The worker re-checks the superuser policy when
+it loads the graph.
+
+### Where the two stores were written together
+
+Auditing every writer, the cross-store paths are bounded. All three now enqueue on
+the caller's transaction:
+
+| Caller | `df.*` write | `_duroxide` write | On caller's tx now? |
+|--------|------|------|:--:|
+| `df.start()` | insert `df.nodes` + `df.instances` | enqueue `StartOrchestration` | yes |
+| `df.cancel()` | update `df.instances.status` | enqueue `CancelInstance` | yes |
+| `df.signal()` | none (only a row-security read) | enqueue `ExternalRaised` (+ fan-out) | yes |
+
+- **`df.cancel()`** was a genuine dual-write: it enqueued the cancel, then updated
+  the status mirror. The status update is only an optimistic hint — the worker
+  re-applies the authoritative `cancelled` status when it processes the cancel — so
+  any disagreement was already transient. Both now commit together.
+- **`df.signal()`** writes only `_duroxide`, so it never created cross-store drift.
+  Moving it onto the caller's transaction simply gives it the same rule as the
+  others: *if my transaction rolls back, the signal is not delivered.*
+
+The worker also keeps the `df.*` status columns in sync as activities run. Those
+writes are eventually consistent and self-healing by design; the reconciler cleans
+up any lasting drift.
+
+## The mechanism
+
+Step 6 of `df.start()` is now an SPI call on the caller's transaction, right after
+the `df.instances` insert:
 
 ```text
--- still on the caller's transaction (SPI):
-INSERT INTO df.nodes ...                          -- step 3 (unchanged)
-INSERT INTO df.instances ...                      -- step 4 (unchanged)
-SELECT df.__enqueue_start($instance_id, $work_item_json, $visible_at);  -- step 6 (new)
+INSERT INTO df.nodes ...                              -- step 3
+INSERT INTO df.instances ...                          -- step 4
+SELECT df._enqueue_orchestrator_start($id, $name, $input);  -- step 6 (SPI)
 ```
 
-Because the enqueue is now an SPI statement, it shares the caller's transaction:
-rollback removes the queue row along with the `df.*` rows; commit makes them
-visible together.
-
-### The work item
-
-pg_durable already depends on the `duroxide` crate, so it should **construct and
-serialize the real `WorkItem` type** rather than hand-rolling JSON, guaranteeing
-wire compatibility:
-
-```rust
-let item = duroxide::providers::WorkItem::StartOrchestration {
-    instance: instance_id.clone(),
-    orchestration: orchestrations::execute_function_graph::NAME.into(),
-    input: input_json,                       // existing FunctionInput JSON
-    version: None,                           // runtime resolves from registry
-    parent_instance: None,
-    parent_id: None,
-    execution_id: duroxide::INITIAL_EXECUTION_ID,   // = 1
-};
-let work_item_json = serde_json::to_string(&item)?;
-```
+Because the enqueue is an ordinary SPI statement, it shares the caller's
+transaction: a rollback drops the queue row along with the `df.*` rows, and a commit
+makes them visible together. `df.cancel()` and `df.signal()` work the same way
+through their own wrappers.
+
+### Why a privileged wrapper is needed
+
+`_duroxide.orchestrator_queue` is writable by its owner only (the worker role), so a
+normal caller cannot insert into it. The enqueue therefore runs through a
+`SECURITY DEFINER` wrapper in the `df` schema, owned by the extension owner. Each
+wrapper pins `search_path` and schema-qualifies every reference (per CVE-2018-1058,
+enforced by `scripts/pgspot-gate.sh`).
+
+The wrappers resolve the runtime schema at call time via `df.duroxide_schema()`
+(`_duroxide` on current installs, legacy `duroxide` on older ones) and quote it
+with `%I`. They are revoked from `PUBLIC` and granted through `df.grant_usage()`.
+
+Because a `SECURITY DEFINER` function runs as its owner, the wrappers cannot trust
+the caller. Two safeguards make them safe to grant to every user:
+
+- **The work item is built inside the wrapper**, from the caller's arguments — never
+  passed in. A caller cannot choose the work-item type or smuggle in a foreign
+  target. (`df.start` builds `StartOrchestration`, `df.cancel` builds
+  `CancelInstance`, `df.signal` builds `ExternalRaised`, each with
+  `json_build_object`, matching duroxide's wire format.)
+- **The caller is authorized before any enqueue**, by two different rules:
+  - *Start* targets a brand-new instance, so it authorizes by *state*: it permits
+    the enqueue only for a `pending` `df.instances` row that has no queue entry and
+    no runtime instance yet. Under the atomic path a committed instance always has
+    its queue row in the same transaction, so this state is reachable only for the
+    row the caller just inserted — never someone else's instance.
+  - *Cancel* and *signal* target an already-committed instance, so they authorize by
+    *ownership*: `pg_has_role(session_user, <instance owner>, 'MEMBER')`.
+    `session_user` is the real authenticated role (it cannot be spoofed inside a
+    `SECURITY DEFINER` function), and checking membership rather than an exact name
+    means a role that owns the instance through `SET ROLE` still qualifies. A
+    non-member is rejected before anything is enqueued — the same gate `df.cancel` /
+    `df.signal` already enforce through row security.
+
+### Signal fan-out
+
+A signal must reach the root instance **and** every running sub-orchestration,
+because a child (a `JOIN`/`RACE` branch or a loop generation) may be the one waiting
+on it. The `df.signal` wrapper walks the instance tree from the root
+(`_duroxide.instances.parent_instance_id`), keeps the instances whose current
+execution is still running, and enqueues one `ExternalRaised` per surviving
+instance — all on the caller's transaction, so the whole fan-out is atomic.
 
-`WorkItem` is externally tagged (default serde), i.e.
-`{"StartOrchestration":{...}}`. These are exactly the values the current
-`Client::start_orchestration` path uses, so worker behavior is unchanged.
+### Worker readiness and ordering
 
-The enqueue target is the existing function (body is a single INSERT; the
-`p_orchestration_name/version/execution_id` params are accepted but unused — the
-work item is self-contained):
+- The existing readiness gate still applies: if the worker has not initialized the
+  runtime schema yet, `df.start()` fails clearly (and atomically — the `df.*`
+  inserts roll back).
+- The enqueue is the last write in `df.start()`. A primary-key collision on
+  `df.nodes` / `df.instances` still aborts before it, now cleanly inside one
+  transaction.
 
-```sql
-_duroxide.enqueue_orchestrator_work(
-    p_instance_id   text,
-    p_work_item     text,                 -- work_item_json
-    p_visible_at    timestamptz,          -- now() for immediate start
-    p_orchestration_name text DEFAULT NULL,
-    p_orchestration_version text DEFAULT NULL,
-    p_execution_id  bigint DEFAULT NULL)
-```
+### The 5-second load race is gone (on the atomic path)
+
+The worker's `load-function-graph` activity used to wait up to five seconds for the
+`df.*` rows to appear, because it could dequeue a start before they committed. On
+the atomic path the queue row becomes visible only *after* those rows commit in the
+same transaction, so that wait can't trigger. The retry is kept for the
+non-atomic fallback path, where it remains a no-op on the atomic path.
+
+## The reconciler
+
+`df.reconcile(p_grace_seconds integer DEFAULT 60)` is a `SECURITY DEFINER`,
+admin-only function that repairs leftover drift:
 
-### Privilege model (the one real wrinkle)
-
-`_duroxide.orchestrator_queue` grants INSERT to its **owner only** (the
-`pg_durable.worker_role`; `relacl` is empty → no PUBLIC), and
-`_duroxide.enqueue_orchestrator_work` is `SECURITY INVOKER`. `df.start()` is also
-`SECURITY INVOKER` and runs SPI as the calling user. Therefore a normal caller
-cannot insert into the queue directly — confirmed: for a fresh `LOGIN` role,
-`has_table_privilege(... 'INSERT') = false`.
-
-The enqueue must run through a **`SECURITY DEFINER`** entrypoint owned by a role
-that owns the `_duroxide` queue tables. Two placements:
-
-- **Recommended — provide it in duroxide-pg** (proper layering; pg_durable never
-  touches `_duroxide` internals). Add a stable, documented `SECURITY DEFINER`
-  client-enqueue function in the `_duroxide` schema, owned by the schema owner
-  (`worker_role`), e.g. `_duroxide.client_enqueue_start(p_instance_id text,
-  p_work_item text, p_visible_at timestamptz)`. It performs the single queue
-  INSERT and is `GRANT EXECUTE`-ed to the roles allowed to start instances.
-  Applied by the BGW migration runner (`ApplyAll`); no pg_durable extension
-  upgrade script needed (consistent with docs/bgw-applies-migrations.md).
-- **Fallback — a `df` wrapper in pg_durable.** A `df.__enqueue_start(...)`
-  `SECURITY DEFINER` function owned by the extension owner that calls
-  `_duroxide.enqueue_orchestrator_work(...)`. Works in the common configuration
-  where the extension owner is a superuser (or otherwise owns/【has INSERT on】the
-  queue). In hardened multi-role deployments where `_duroxide` is owned by a
-  non-superuser `worker_role` distinct from the extension owner, this requires
-  the `worker_role` to `GRANT INSERT` on the queue tables (and `EXECUTE` on the
-  enqueue fn) to the wrapper's owner. The duroxide-pg placement avoids this
-  coordination.
-
-Either wrapper **must** pin `search_path` and schema-qualify every reference
-(CVE-2018-1058; enforced by the pgspot gate, scripts/pgspot-gate.sh):
+- It deletes orphaned **root** runtime instances — those with no parent, no matching
+  `df.instances` row, and older than the grace window — gathering each orphan's full
+  subtree so the delete is accepted. Sub-orchestrations (branches and loop
+  generations) have no `df.instances` row of their own and are excluded by the root
+  filter, so they are never collected on their own.
+- It marks `df.instances` rows that are stuck non-terminal — no live runtime
+  instance and no queued start — as `failed`.
+
+Each pass is wrapped so a single failure never aborts the reconcile or stops the
+loop.
+
+The background worker runs the reconciler as a **built-in durable cron loop**,
+keeping exactly one instance per cluster. It reads the schedule, connects as the
+`df_reconciler` role (`SET ROLE`), and starts:
 
 ```sql
-CREATE FUNCTION _duroxide.client_enqueue_start(...)
-RETURNS void
-LANGUAGE plpgsql
-SECURITY DEFINER
-SET search_path = pg_catalog
-AS $$ BEGIN
-  INSERT INTO _duroxide.orchestrator_queue (instance_id, work_item, visible_at, created_at)
-  VALUES (p_instance_id, p_work_item, p_visible_at, pg_catalog.now());
-END $$;
+-- $1 = the pg_durable.reconciler_cron value, $2 = 'df_reconciler'
+SELECT df.start(
+  df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule($1))),
+  $2);
 ```
 
-The wrapper takes the caller's `instance_id`, orchestration name, and input
-string; it **constructs the `StartOrchestration` work item server-side** (so the
-caller cannot choose the work-item variant or target a foreign instance) and
-**authorizes** the enqueue only for a brand-new, not-yet-started instance
-(`pending` `df.instances` row, no orchestrator-queue entry, no duroxide
-instance — under atomic-start semantics reachable only for the row `df.start`
-just inserted in the current transaction). It executes no caller-supplied SQL.
-This is what prevents the wrapper from being a generic, cross-tenant
-arbitrary-work-item enqueue primitive (which would otherwise let any df user
-forge `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another
-user's instance, bypassing the RLS gate `df.signal`/`df.cancel` enforce).
-
-### Identity / security
-
-`submitted_by` capture is unchanged: it is still read as `current_user` inside
-`df.start()` (`SECURITY INVOKER`) **before** the enqueue and stored in
-`df.instances`. The `SECURITY DEFINER` wrapper does not touch identity; the work
-item carries only the orchestration name/input. The worker continues to
-authenticate using `df.instances.submitted_by` and re-check the superuser policy
-(unchanged). No new forgery surface is introduced.
-
-### Cancel and signal wrappers
-
-`df.cancel()` and `df.signal()` enqueue on the caller's transaction through their
-own `SECURITY DEFINER` wrappers (`df._enqueue_orchestrator_cancel`,
-`df._enqueue_orchestrator_signal`). Both build the work item server-side
-(`CancelInstance` / `ExternalRaised`) via `json_build_object`, so a caller cannot
-choose the variant or smuggle a foreign target.
-
-The start wrapper can authorize structurally (a brand-new `pending` instance is
-only reachable for the row the caller just inserted). Cancel and signal target an
-**already-committed** instance, so they authorize explicitly: the wrapper checks
-`pg_has_role(session_user, df.instances.submitted_by::oid, 'MEMBER')`
-(`submitted_by` is a `regrole`, so `::oid` yields the owning role's OID).
-`session_user` is unforgeable inside `SECURITY DEFINER` (unlike `current_user`,
-which is the definer), and membership — rather than an exact-name match — lets a
-role that owns the instance via `SET ROLE` still cancel/signal it. A caller that
-is not a member of the owning role is rejected before any enqueue, mirroring the
-RLS gate `df.cancel`/`df.signal` already enforce. The wrappers run no
-caller-supplied SQL.
-
-**Signal fan-out.** A signal must reach the root instance **and** every running
-sub-orchestration (a descendant may be the one blocked on
-`wait_for_signal`). The wrapper recurses `_duroxide.instances.parent_instance_id`
-from the root, joins each instance to its current execution, keeps only
-executions whose `status` is `Running` (lower-cased), and enqueues one
-`ExternalRaised` per surviving instance — all in the caller's transaction, so the
-whole fan-out commits or rolls back atomically.
-
-### Worker readiness and ordering
+`worker::ensure_reconciler` starts it, skips if one is already pending or running,
+and restarts it if it died. It is driven by the `pg_durable.reconciler_cron` setting
+(default `*/5 * * * *`; empty disables it) and submitted by a dedicated
+**non-superuser** role, `df_reconciler` (so `submitted_by` is `df_reconciler`). A
+non-superuser identity keeps it clear of the superuser-instance guard and limits what
+the instance can do to "run the reconciler" even if it were somehow forged.
 
-- Keep the existing worker-readiness gate: if `_worker_ready` is absent/stale,
-  `df.start()` errors as it does today (the `_duroxide` schema may not yet
-  exist). This check moves from "before acquiring the client" to "before the SPI
-  enqueue."
-- The enqueue is the **last** write in `df.start()`; a PK collision on
-  `df.nodes`/`df.instances` still aborts before it (no behavior change for the
-  collision case, now fully inside one transaction).
-
-### Consequence: the 5 s load race disappears (on the atomic path)
-
-`load-function-graph` polls for up to `MAX_WAIT_SECS = 5` because the worker
-could dequeue a start item before the caller's `df.*` rows committed
-(src/activities/load_function_graph.rs:21,60). On the **atomic path** a queue row
-becomes visible to the worker **only** after the `df.*` rows commit in the same
-transaction, so the "not yet visible / rolled back" branch cannot occur. The
-retry is **retained** rather than removed, because the non-atomic **fallback**
-path can still enqueue before the `df.*` rows commit; it is a no-op on the atomic
-path. So this removes the worker-saturation failure mode for atomic starts while
-staying correct for the fallback.
-
-## Proof of concept (validated)
-
-Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch
-`tjgreen42/atomic-start`, ~80 lines across 3 files).
-
-**What was built:**
-
-- `df._enqueue_orchestrator_start(text, text)` — a `SECURITY DEFINER` wrapper
-  (`SET search_path = pg_catalog`) that performs the privileged
-  `<schema>.enqueue_orchestrator_work(instance, work_item, now())` INSERT. The
-  duroxide schema is **resolved dynamically** via `df.duroxide_schema()`
-  (`_duroxide` fresh / legacy `duroxide`) and quoted with `%I`. EXECUTE is
-  revoked from PUBLIC and granted through `df.grant_usage()`.
-- `df.start()` **probes** for the duroxide-pg SQL surface
-  (`enqueue_orchestrator_work` in the resolved schema). When present it builds
-  the real `duroxide::providers::WorkItem::StartOrchestration` (guaranteeing wire
-  compatibility), gates on `is_worker_ready()`, and enqueues via SPI through the
-  wrapper — raising (aborting the transaction) on failure. When **absent** (a
-  non-pg provider, or a schema predating the wrapper) it **falls back** to the
-  out-of-band `start_durable_function` client path (legacy, non-atomic).
-- `df.reconcile(grace_seconds int default 60)` (Option 4) — a `SECURITY DEFINER`,
-  admin-only (revoked from PUBLIC) reconciler that deletes orphaned **root**
-  duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row) via
-  `delete_instances_atomic`, and fails `df.instances` stuck non-terminal with no
-  live duroxide instance and no queued start.
-- **Built-in durable cron loop** — the background worker
-  (`worker::ensure_reconciler`) keeps exactly one reconciler instance running per
-  cluster, dogfooding pg_durable:
-  `df.start(df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule(<cron>))), 'df_reconciler')`.
-  It is idempotent (skips if one is pending/running; self-heals if it died),
-  driven by the `pg_durable.reconciler_cron` GUC (default `*/5 * * * *`; empty
-  disables), and submitted by a dedicated **non-superuser** role `df_reconciler`
-  (created by the worker, granted only `df.grant_usage()` + EXECUTE on
-  `df.reconcile()`). A non-superuser identity sidesteps the
-  `enable_superuser_instances` guard and bounds the blast radius to "trigger GC"
-  even if the instance were forged.
-
-**Provider requirement:** the atomic path only works with the **duroxide-pg
-(SQL-backed) provider**, because it calls the provider's SQL function directly.
-The probe + fallback make this non-breaking for any other provider.
-
-**Confirmed findings:**
-
-- **The privilege model is the only real wrinkle.** The duroxide orchestrator
-  queue grants INSERT to its owner only and the enqueue function is
-  `SECURITY INVOKER`, so a plain caller gets *permission denied*. The
-  `SECURITY DEFINER` wrapper resolves this while still committing in the caller's
-  transaction. A non-superuser role (`df_e2e_user`) granted via
-  `df.grant_usage()` starts instances successfully and cannot INSERT into the
-  queue directly.
-- **`visible_at = now()`** (SQL transaction time) is sufficient for immediate
-  starts; no reliance on the Rust-side `now_ms`.
-- **Reusing the `duroxide` `WorkItem` type** makes the enqueued row
-  byte-identical to the client path, so the worker behaves identically.
-- **The reconciler must exclude sub-orchestrations.** JOIN/RACE branches and loop
-  generations are legitimate duroxide instances with no `df.instances` row;
-  filtering on `parent_instance_id IS NULL` (root only) prevents collecting them.
-- **Two gotchas in the cron loop:** the reconciler role cannot be named with a
-  `pg_` prefix (PostgreSQL reserves it) — hence `df_reconciler`; and the loop
-  node must call the set-returning reconciler as `SELECT * FROM df.reconcile()`
-  (a bare `SELECT df.reconcile()` yields an unsupported `RECORD` column in
-  `execute-sql`).
-
-**Validated behavior:**
+## Behavior change for users
 
-| Check | Result |
-|-------|:------:|
-| Committed `df.start()` runs to completion | pass |
-| Rolled-back `df.start()` leaves no `df.*` **and** no `_duroxide` orphan (queue + instances) | pass |
-| Dynamic schema resolution drives the SPI path (atomicity holds) | pass |
-| E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 |
-| JOIN determinism under the new (faster) timing | 5/5 |
-| Non-superuser (`df_e2e_user`) start path | pass |
-| Committed `df.cancel()` cancels; rolled-back `df.cancel()` leaves the instance running and rolls back the queue enqueue | pass |
-| Committed `df.signal()` delivers (incl. fan-out to a sub-orchestration in a race); rolled-back `df.signal()` rolls back the enqueue | pass |
-| Cancel/signal authz: a non-member caller is denied on the wrapper and via the RLS gate; the owning role (incl. via `SET ROLE`) succeeds | pass |
-| E2E: `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 |
-| `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass |
-| `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass |
-| Built-in reconciler auto-starts as non-superuser `df_reconciler`, idempotent (exactly one) | pass |
-| Scheduled loop GCs a planted orphan on its next tick, without self-GC'ing | pass |
-| E2E subset unaffected with the reconciler running (incl. heartbeat + join) | 6/6 |
-| `cargo fmt --check`; no new clippy warnings | pass |
-
-### Review hardening (applied)
-
-A source-of-truth code review surfaced and fixed several issues:
-
-- **Cross-tenant enqueue bypass (HIGH).** The wrapper originally accepted an
-  opaque `work_item` and was granted to every df user, letting a caller forge
-  `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another user's
-  instance. Fixed: the wrapper now **builds the `StartOrchestration` server-side**
-  and **authorizes** only a brand-new, not-yet-started instance — verified that a
-  non-superuser can no longer forge an enqueue for a foreign or arbitrary id.
-- **Reconciler self-destruct on parallel-workflow orphans (HIGH).**
-  `delete_instances_atomic` refuses to delete a parent without its children (even
-  with `force`), so deleting only orphan **roots** raised on any JOIN/RACE/loop
-  orphan and aborted GC. Fixed: gather the **full subtree** (recursive on
-  `parent_instance_id`) and wrap each pass in an exception block so one failure
-  never aborts reconcile or kills the loop — verified root+child GC with no error.
-- **GC-suppression via user label (HIGH).** The singleton key was the
-  user-writable `label`; any user could park a `df_reconciler`-labelled instance
-  to stop the GC from starting. Fixed: key liveness on the unforgeable
-  `submitted_by = df_reconciler` role.
-- **Cron busy-loop (MED/HIGH).** `df.wait_for_schedule` baked a fixed offset at
-  build time that a `df.loop` replayed every generation (worst case ~1 s spin).
-  Fixed: the WAIT_SCHEDULE node now recomputes the delay from the orchestration's
-  deterministic clock (`ctx.utc_now()`) each generation — verified the reconciler
-  fires on the cron tick and does not spin. (Note: adding the clock read changes
-  the orchestration history shape, so a wait-in-loop instance in flight across the
-  upgrade may need to restart.)
-- **No in-epoch self-healing (MED).** The reconciler was (re)started only per
-  epoch. Fixed: `ensure_reconciler` is also re-asserted from the steady-state poll
-  loop — verified a cancelled reconciler restarts within the poll interval.
-- **Silent non-atomic fallback (MED).** `df.start()` now emits a `WARNING` when it
-  takes the non-atomic fallback path.
-- **Debated/rejected:** the `df_reconciler` role keeps `LOGIN` — the worker
-  executes the reconcile node by connecting *as* the role via `connect_as_user`,
-  exactly like every other durable-function role; the proposed `NOLOGIN` would
-  break node execution.
-
-**POC scope / deferred to productionization:** fresh install only (no upgrade
-script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg
-owned by the schema owner, and/or validate the work item); the reconciler's
-grace/cadence policy still needs tuning, and the `df_reconciler` role is created
-by the worker rather than via managed provisioning.
-
-> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the
-> BGW-owned `_duroxide` schema half-broken (migrations row present, functions
-> dropped), which surfaces as *spurious* JOIN nondeterminism and a missing
-> `_worker_ready`. Recover with `DROP SCHEMA _duroxide CASCADE` + restart so the
-> BGW re-applies its migrations. (This cost real debugging time during the POC;
-> the failures were environmental, not a regression.)
-
-## Transaction-semantics decision (must be made explicit)
-
-This change makes `df.start()` **atomic with the caller's transaction**:
+`df.start()` now takes part in the caller's transaction:
 
 ```sql
 BEGIN;
 SELECT df.start('SELECT 1');   -- enqueued on THIS transaction
-ROLLBACK;                      -- start is fully undone; nothing runs
+ROLLBACK;                      -- fully undone; nothing runs
 ```
 
-Today the behavior is inconsistent (the `df.*` rows roll back but the duroxide
-enqueue does not). The new behavior is the least-surprising contract and is the
-explicit, documented decision of this spec. Note for users: a `df.start()` in a
-transaction that later rolls back will **not** run — which is the desired fix,
-but a behavior change from "fire-and-forget at statement time." This must be
-called out in `USER_GUIDE.md`.
-
-## Signals & cancellation (migrated)
-
-`df.cancel()` and `df.signal()` use the same in-transaction enqueue as
-`df.start()` (see Cancel and signal wrappers above). In short:
-
-- **`df.cancel()`** — the optimistic `df.instances.status='cancelled'` UPDATE and
-  the `CancelInstance` enqueue now commit or roll back together on the caller's
-  transaction. The worker's `update-instance-status` activity still re-applies the
-  authoritative status, so a worker-side crash remains transient/self-healing.
-- **`df.signal()`** — the `ExternalRaised` fan-out (root + running
-  sub-orchestrations) is enqueued on the caller's transaction, giving "don't
-  deliver if my tx rolls back" semantics. It writes only `_duroxide`, so this is a
-  consistency-of-contract change rather than a cross-store fix.
-
-## Complementary backstop — Option 4 (implemented)
-
-As part of the chosen approach, a lightweight **reconciler** sweeps for residual
-divergence that an in-transaction enqueue cannot prevent (crashes mid-execution,
-the best-effort `df.*` `status` mirror, and legacy/fallback orphans). It ships as
-**`df.reconcile(grace_seconds int default 60)`**
-(`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances
-(`parent_instance_id IS NULL`, no `df.instances` row, older than the grace
-window) via `delete_instances_atomic`, and marks `df.instances` rows stuck
-non-terminal with no live duroxide instance and no queued start as `failed`.
-Sub-orchestrations (JOIN/RACE branches, loop generations) are excluded by the
-root filter so they are never collected.
-
-The background worker runs it as a **built-in durable cron loop** (dogfooding
-pg_durable), ensuring exactly one instance per cluster:
+Previously the `df.*` rows rolled back but the orchestration still ran. The new
+behavior is the least surprising one and is the deliberate decision of this spec,
+but it *is* a change from "fires the moment the statement runs." `df.cancel()` and
+`df.signal()` gain the same property. This should be called out in `USER_GUIDE.md`.
+
+## Implementation notes
+
+- The atomic path only works with the **duroxide-pg (SQL-backed) provider**, because
+  it calls the provider's SQL functions directly. `df.start` / `df.cancel` /
+  `df.signal` each probe for that surface (`enqueue_orchestrator_work` in the
+  resolved schema); when it is absent — a different provider, or a schema predating
+  the wrappers — they fall back to the old out-of-band path and emit a warning, so
+  the change never breaks another provider.
+- `visible_at = now()` (transaction time) is enough for immediate starts.
+- The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker
+  behaves exactly as before.
+
+### Hardening applied during review
+
+- **Cross-tenant enqueue (high).** The start wrapper first accepted an opaque work
+  item and was granted to everyone, which would let a caller forge a cancel or
+  signal against another instance. Fixed by building the work item inside the
+  wrapper and authorizing only a brand-new instance.
+- **Reconciler deleting live work (high).** The runtime refuses to delete a parent
+  without its children, so deleting only orphan roots failed on any parallel-workflow
+  orphan. Fixed by gathering the full subtree and isolating each pass so one failure
+  can't stop the loop.
+- **Suppressing the reconciler (high).** Its single-instance check first keyed on the
+  user-writable label, so a user could park a same-labelled instance to block it.
+  Fixed by keying on the unforgeable `df_reconciler` submitter.
+- **Cron busy-loop (medium).** `df.wait_for_schedule` baked a fixed delay at build
+  time that the loop replayed every generation. Fixed so the wait node recomputes the
+  delay each generation from the orchestration's recorded clock (deterministic on
+  replay). Note: this changes the recorded history shape, so a wait-in-loop instance
+  in flight across the upgrade may need a restart.
+- **Self-healing (medium).** The reconciler is re-checked from the steady-state poll
+  loop, not only once per worker epoch, so a cancelled one comes back within the poll
+  interval.
+- **Silent fallback (medium).** `df.start` now warns when it uses the non-atomic
+  fallback.
+- **Kept `LOGIN` on `df_reconciler`** (debated): the worker runs the reconcile node
+  by connecting *as* the role, like every other durable-function role, so `NOLOGIN`
+  would break it.
+
+> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the worker-owned
+> `_duroxide` schema half-dropped (its migration row remains but functions are gone),
+> which shows up as flaky `JOIN` behavior and a missing `_worker_ready`. Recover with
+> `DROP SCHEMA _duroxide CASCADE` and a restart so the worker re-applies its
+> migrations.
+
+## Upgrade and migration
+
+**New binary against an older schema.** The wrappers live in the `df` schema, so
+they ship in the extension's install SQL and an upgrade script
+(`sql/pg_durable--<prev>--<current>.sql`, `CREATE FUNCTION` + `GRANT`), with a
+"Version-Specific Changes" entry in `docs/upgrade-testing.md`. A new binary running
+against an older `df` schema that predates the wrappers must still work: each caller
+probes for the enqueue surface and falls back to the old out-of-band path when it is
+missing. This keeps the binary compatible with every prior schema while the upgrade
+rolls out.
+
+**Data migration.** None. No table shapes change, and in-flight instances are
+unaffected (their queue items were already enqueued).
+
+**Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if
+left in place, are simply unused.
+
+**Future direction.** The privileged enqueue could move into duroxide-pg, owned by
+the runtime schema owner — cleaner layering, at the cost of a coordinated
+duroxide-pg release. The `df`-schema wrapper avoids that coordination and is what
+ships today.
+
+## Validation (PostgreSQL 17)
+
+Prevent:
 
-```sql
-df.start(
-  df.loop(df.seq('SELECT * FROM df.reconcile()',
-                 df.wait_for_schedule(current_setting('pg_durable.reconciler_cron')))),
-  'df_reconciler');
-```
+| Check | Result |
+|-------|:--:|
+| Committed `df.start()` runs to completion | pass |
+| Rolled-back `df.start()` leaves no `df.*` and no `_duroxide` orphan | pass |
+| Rolled-back `df.cancel()` leaves the instance running and rolls back the enqueue; committed cancel takes effect | pass |
+| Rolled-back `df.signal()` rolls back the fan-out enqueue; committed signal delivers (incl. fan-out into a sub-orchestration) | pass |
+| A caller cannot enqueue a start, cancel, or signal against a foreign or arbitrary instance; cancel/signal succeed for the owning role (incl. via `SET ROLE`) | pass |
+| Schema resolved dynamically; atomicity holds | pass |
+| E2E `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 |
+| Broader E2E subset (core, conditionals, loops, variables, join, race) | pass |
+
+Repair:
 
-started by `worker::ensure_reconciler` each epoch (idempotent; self-heals if it
-died), submitted by the dedicated non-superuser `df_reconciler` role, and driven
-by `pg_durable.reconciler_cron` (default `*/5 * * * *`; empty disables). It
-complements Option 3 rather than replacing it.
-
-## Upgrade & Migration
-
-**B1 — binary backward compatibility (new `.so` vs all prior schemas).** The new
-`.so` calls the enqueue entrypoint via SPI.
-- If the entrypoint is the **duroxide-pg-provided** `SECURITY DEFINER` function,
-  the BGW applies the duroxide migration at startup (`ApplyAll`,
-  src/worker.rs), and the function exists for every running worker. Backends
-  detect readiness via `_worker_ready` (existing gate) before enqueuing.
-  Versioning: bump the pinned `duroxide`/`duroxide-pg` pair together (see
-  docs/upgrade-testing.md "Updating duroxide-pg") and run `cargo update`. **No
-  `pg_durable` extension upgrade script is required** for duroxide-schema changes
-  (docs/bgw-applies-migrations.md).
-- If the entrypoint is the **`df`-schema fallback wrapper**, it is a `df` schema
-  object and **does** require an extension upgrade script
-  `sql/pg_durable--<prev>--<current>.sql` (CREATE FUNCTION + GRANT), plus the
-  fresh-install SQL, and a "Version-Specific Changes" entry in
-  docs/upgrade-testing.md. The new `.so` must still operate against older `df`
-  schemas that predate the wrapper: gate the SPI path on the wrapper's existence
-  (catalog probe) and **fall back to the existing out-of-band client enqueue**
-  when absent. This preserves B1 while the upgrade rolls out.
-
-**Runtime schema detection.** Reuse the existing `_worker_ready` /
-`backend_duroxide_schema()` resolution. Add a one-time probe (cached per session)
-for the enqueue entrypoint; if missing (old worker), fall back to the legacy
-path. This makes the change safe under mixed binary/worker versions.
-
-**Data migration.** None. No `df.*` or `_duroxide` table shapes change. Existing
-in-flight instances are unaffected (their queue items were already enqueued).
-
-**Rollback.** Reverting the `.so` restores the out-of-band enqueue; the optional
-`df` wrapper, if added, is inert when unused.
-
-## Testing
-
-E2E (tests/e2e/sql/) and unit coverage:
-
-- **Atomic rollback (the leak):** `BEGIN; df.start(...); ROLLBACK;` then assert
-  **both** `df.instances` and `_duroxide.instances`/`orchestrator_queue` have no
-  row for the instance. (Today this fails: `_duroxide` keeps an orphan.)
-- **Atomic commit:** `BEGIN; df.start(...); COMMIT;` runs to completion exactly
-  as before.
-- **Subtransaction abort:** a `df.start()` inside a `BEGIN/EXCEPTION` block that
-  raises leaves no `_duroxide` orphan.
-- **PK collision (existing repro):** the short-id collision aborts cleanly with
-  no `_duroxide` residue (extends the manual `short_id_collision` repro).
-- **Worker-not-ready:** `df.start()` before `_worker_ready` errors as today.
-- **Privilege/hardened config:** a non-superuser caller (and, for the fallback
-  wrapper, a distinct `worker_role`) can start successfully and cannot insert
-  into `_duroxide.orchestrator_queue` directly.
-- **No 5 s waits:** worker logs contain no "transaction may have been rolled
-  back" entries during normal start/rollback flows.
-
-## Phasing
-
-1. **Phase 1 — implemented & validated:** atomic `df.start()` via the
-   `SECURITY DEFINER` enqueue entrypoint; readiness gate + legacy fallback;
-   simplify the `load-function-graph` retry; docs + tests. Remaining for
-   production: upgrade script and entrypoint hardening.
-2. **Phase 2 — implemented & validated:** the same in-transaction enqueue for
-   `df.signal()` (root + running-descendant `ExternalRaised` fan-out) and
-   `df.cancel()` (`CancelInstance`), each via a `SECURITY DEFINER` wrapper that
-   builds the work item server-side and authorizes the caller via
-   `pg_has_role(session_user, ...)`.
-3. **Phase 3 — reconciler implemented:** `df.reconcile()` plus a built-in durable
-   cron loop (worker-managed `df_reconciler` instance on
-   `pg_durable.reconciler_cron`) are validated (auto-start, idempotency,
-   scheduled orphan GC, no self-GC). Remaining: tune the grace/cadence policy and
-   role provisioning.
-
-## Open questions / risks
-
-- **Entrypoint placement:** duroxide-pg (clean layering, coordinated release) vs
-  a `df` fallback wrapper (no duroxide-pg release, but cross-role grants in
-  hardened setups). Recommendation: duroxide-pg, with the `df` fallback retained
-  for B1 during rollout.
-- **Contract stability:** pg_durable now depends on the enqueue entrypoint
-  signature and the `WorkItem` JSON. Mitigated by reusing the `duroxide`
-  `WorkItem` type and the version-locked duroxide/duroxide-pg pair.
-- **`visible_at` clock:** the POC uses `now()` (server transaction time) for
-  immediate starts and it works; no reliance on the Rust-side `now_ms` for
-  ordering at start.
-- **Idempotency:** `enqueue_orchestrator_work` is a bare INSERT (no dedup); the
-  `df.instances`/`df.nodes` PKs already prevent duplicate instance ids within the
-  transaction, so no additional dedup is required.
+| Check | Result |
+|-------|:--:|
+| `df.reconcile()` deletes an orphan root with children, leaves sub-orchestrations and live instances intact | pass |
+| `df.reconcile()` marks a stuck instance failed | pass |
+| Reconciler auto-starts as non-superuser `df_reconciler`, stays a single instance | pass |
+| Reconciler fires on the cron schedule without spinning, and restarts after being cancelled | pass |
+| E2E subset unaffected with the reconciler running | pass |
+| `cargo fmt --check` clean; no new clippy warnings | pass |
+
+## Open questions
+
+- **Enqueue placement** — keep the `df`-schema wrapper, or move it into duroxide-pg
+  (cleaner layering, but a coordinated release). The wrapper ships today; the move is
+  a future option.
+- **Contract stability** — pg_durable now depends on the enqueue function signature
+  and the `WorkItem` JSON shape. Mitigated by the version-locked
+  duroxide / duroxide-pg pair.
+- **Reconciler policy** — the grace window and cron cadence still want tuning, and
+  `df_reconciler` is created by the worker rather than provisioned externally.

From f03298211b2d6a8102151a60316ab0f004bbaae3 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 16:19:55 +0000
Subject: [PATCH 12/20] Add E2E regression tests for atomic start/cancel/signal
 and reconcile

24_atomic_rollback: a rolled-back df.start/df.cancel/df.signal leaves no
  _duroxide effect (no orphan; the instance is not cancelled; the signal is
  not delivered). Fails against the pre-change out-of-band enqueue.
25_enqueue_wrapper_authz: the SECURITY DEFINER enqueue wrappers deny a
  non-owner forging a cancel/signal against a foreign instance, while the
  owner is allowed. Fails if the pg_has_role authorization check is removed.
26_reconcile_orphan_gc: df.reconcile() collects a planted root orphan that
  has running child sub-orchestrations (parallel join), gathering the full
  subtree, and leaves healthy instances untouched. Fails if reconcile
  collects only orphan roots (delete_instances_atomic refuses a parent
  without its children).

Each test fails against a behavioral defect, verified by mutation testing,
not merely the absence of the new functions.
---
 tests/e2e/sql/24_atomic_rollback.sql       | 174 +++++++++++++++++++++
 tests/e2e/sql/25_enqueue_wrapper_authz.sql | 136 ++++++++++++++++
 tests/e2e/sql/26_reconcile_orphan_gc.sql   | 149 ++++++++++++++++++
 3 files changed, 459 insertions(+)
 create mode 100644 tests/e2e/sql/24_atomic_rollback.sql
 create mode 100644 tests/e2e/sql/25_enqueue_wrapper_authz.sql
 create mode 100644 tests/e2e/sql/26_reconcile_orphan_gc.sql

diff --git a/tests/e2e/sql/24_atomic_rollback.sql b/tests/e2e/sql/24_atomic_rollback.sql
new file mode 100644
index 0000000..b2d8cc5
--- /dev/null
+++ b/tests/e2e/sql/24_atomic_rollback.sql
@@ -0,0 +1,174 @@
+-- Copyright (c) Microsoft Corporation.
+-- Licensed under the PostgreSQL License.
+
+-- Tests: df.start / df.cancel / df.signal enqueue on the CALLER'S transaction.
+--
+-- Regression test for df/_duroxide divergence. Originally the duroxide enqueue
+-- happened out-of-band on a separate connection that committed independently, so
+-- rolling back the caller's transaction left the runtime store ahead of the
+-- control plane:
+--   * a rolled-back df.start()  left an orphaned orchestration in _duroxide;
+--   * a rolled-back df.cancel() still cancelled the (committed) instance;
+--   * a rolled-back df.signal() still delivered the signal.
+--
+-- With the in-transaction enqueue, a ROLLBACK undoes the runtime work too. Each
+-- scenario below FAILS without that change.
+--
+-- _duroxide is owned by the worker role, so its tables are read as the superuser
+-- (postgres); the durable functions themselves run as the non-superuser
+-- df_e2e_user.
+
+-- Helper (superuser): assert no _duroxide residue remains for an instance id.
+-- Resolves the runtime schema dynamically (df.duroxide_schema()).
+CREATE OR REPLACE FUNCTION pg_temp.assert_no_duroxide_residue(p_id text)
+RETURNS void LANGUAGE plpgsql AS $$
+DECLARE
+    sch text := df.duroxide_schema();
+    n   int;
+BEGIN
+    EXECUTE format(
+        'SELECT (SELECT count(*) FROM %I.orchestrator_queue WHERE instance_id = $1) '
+        '     + (SELECT count(*) FROM %I.instances        WHERE instance_id = $1)',
+        sch, sch)
+    INTO n USING p_id;
+
+    IF n > 0 THEN
+        RAISE EXCEPTION 'TEST FAILED: rolled-back df.start left % _duroxide row(s) for instance % (non-atomic enqueue leaked)', n, p_id;
+    END IF;
+    RAISE NOTICE 'PASSED [start_rollback]: no _duroxide residue for %', p_id;
+END $$;
+
+-- ===========================================================================
+-- Scenario 1: a rolled-back df.start() leaves no _duroxide orphan
+-- ===========================================================================
+
+SET SESSION AUTHORIZATION df_e2e_user;
+BEGIN;
+SELECT df.start('SELECT 42', 'atomic-rb-start') AS rb_id \gset
+ROLLBACK;
+RESET SESSION AUTHORIZATION;
+
+-- Give any (buggy) out-of-band enqueue a moment to surface before asserting.
+SELECT pg_sleep(1);
+SELECT pg_temp.assert_no_duroxide_residue(:'rb_id');
+
+-- ===========================================================================
+-- Scenario 2: a rolled-back df.cancel() does NOT cancel the instance
+-- ===========================================================================
+
+SET SESSION AUTHORIZATION df_e2e_user;
+
+CREATE TEMP TABLE _t_cancel (instance_id TEXT);
+INSERT INTO _t_cancel
+SELECT df.start(df.loop(df.seq('SELECT 1', df.sleep(1))), 'atomic-rb-cancel');
+
+-- Wait until the instance is genuinely running.
+DO $$
+DECLARE inst_id TEXT; status TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_cancel;
+    LOOP
+        SELECT s INTO status FROM df.status(inst_id) s;
+        EXIT WHEN lower(status) = 'running' OR attempts > 300;
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    IF lower(status) <> 'running' THEN
+        RAISE EXCEPTION 'Scenario 2 setup: cancel victim never reached running (status=%)', status;
+    END IF;
+END $$;
+
+-- Issue a cancel, then roll it back.
+BEGIN;
+SELECT df.cancel((SELECT instance_id FROM _t_cancel), 'rolled-back-cancel');
+ROLLBACK;
+
+-- The instance must NOT become cancelled: the cancel enqueue was rolled back.
+-- (Without the change the out-of-band CancelInstance committed, so the worker
+-- cancels the instance within ~1s.)
+DO $$
+DECLARE inst_id TEXT; status TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_cancel;
+    LOOP
+        SELECT s INTO status FROM df.status(inst_id) s;
+        IF lower(status) = 'cancelled' THEN
+            RAISE EXCEPTION 'TEST FAILED: rolled-back df.cancel still cancelled instance % (non-atomic enqueue leaked)', inst_id;
+        END IF;
+        EXIT WHEN attempts > 50;   -- watch for ~5s
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    RAISE NOTICE 'PASSED [cancel_rollback]: instance % still % after a rolled-back cancel', inst_id, status;
+END $$;
+
+-- Cleanup: cancel the (infinite) loop for real.
+SELECT df.cancel((SELECT instance_id FROM _t_cancel), 'cleanup');
+DROP TABLE _t_cancel;
+
+RESET SESSION AUTHORIZATION;
+
+-- ===========================================================================
+-- Scenario 3: a rolled-back df.signal() does NOT deliver the signal
+-- ===========================================================================
+
+SET SESSION AUTHORIZATION df_e2e_user;
+
+CREATE TEMP TABLE _t_signal (instance_id TEXT);
+INSERT INTO _t_signal
+SELECT df.start('SELECT 1' ~> (df.wait_for_signal('go') |=> 'sig') ~> 'SELECT 1',
+                'atomic-rb-signal');
+
+-- Wait until the instance is running (blocked on the signal).
+DO $$
+DECLARE inst_id TEXT; status TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_signal;
+    LOOP
+        SELECT s INTO status FROM df.status(inst_id) s;
+        EXIT WHEN lower(status) = 'running' OR attempts > 300;
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    IF lower(status) <> 'running' THEN
+        RAISE EXCEPTION 'Scenario 3 setup: signal victim never reached running (status=%)', status;
+    END IF;
+END $$;
+
+-- Send a signal, then roll it back.
+BEGIN;
+SELECT df.signal((SELECT instance_id FROM _t_signal), 'go', '{}');
+ROLLBACK;
+
+-- The instance must NOT complete: the signal enqueue was rolled back.
+-- (Without the change the out-of-band ExternalRaised committed, so the instance
+-- receives the signal and completes within ~1s.)
+DO $$
+DECLARE inst_id TEXT; status TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_signal;
+    LOOP
+        SELECT s INTO status FROM df.status(inst_id) s;
+        IF lower(status) = 'completed' THEN
+            RAISE EXCEPTION 'TEST FAILED: rolled-back df.signal still delivered to instance % (non-atomic enqueue leaked)', inst_id;
+        END IF;
+        EXIT WHEN attempts > 50;   -- watch for ~5s
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    RAISE NOTICE 'PASSED [signal_rollback]: instance % still % after a rolled-back signal', inst_id, status;
+END $$;
+
+-- Cleanup: deliver the signal for real and let it finish.
+SELECT df.signal((SELECT instance_id FROM _t_signal), 'go', '{}');
+DO $$
+DECLARE inst_id TEXT; status TEXT;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_signal;
+    SELECT df.wait_for_completion(inst_id, 30) INTO status;
+END $$;
+DROP TABLE _t_signal;
+
+RESET SESSION AUTHORIZATION;
+
+SELECT 'TEST PASSED: atomic rollback (start/cancel/signal)' AS result;
diff --git a/tests/e2e/sql/25_enqueue_wrapper_authz.sql b/tests/e2e/sql/25_enqueue_wrapper_authz.sql
new file mode 100644
index 0000000..68bb61d
--- /dev/null
+++ b/tests/e2e/sql/25_enqueue_wrapper_authz.sql
@@ -0,0 +1,136 @@
+-- Copyright (c) Microsoft Corporation.
+-- Licensed under the PostgreSQL License.
+
+-- Tests: authorization on the in-transaction enqueue wrappers used by
+-- df.cancel() / df.signal().
+--
+-- df._enqueue_orchestrator_cancel/_signal are SECURITY DEFINER (the runtime
+-- queue is writable by its owner only) and granted to every df user via
+-- df.grant_usage(). They must therefore refuse to enqueue work against an
+-- instance the caller does not own. Ownership is checked with
+-- pg_has_role(session_user, <instance owner>, 'MEMBER'): session_user cannot be
+-- spoofed inside a SECURITY DEFINER function, and membership lets a role that
+-- owns the instance through SET ROLE still qualify.
+--
+-- Without the change these wrappers do not exist, so the forge attempts below
+-- raise undefined_function and the test fails.
+
+-- Fresh, non-superuser roles. (Superusers bypass pg_has_role, so the denial can
+-- only be exercised by a non-superuser caller.)
+DO $setup$
+BEGIN
+    BEGIN DROP OWNED BY authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END;
+    BEGIN DROP OWNED BY authz_other; EXCEPTION WHEN undefined_object THEN NULL; END;
+    BEGIN DROP ROLE authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END;
+    BEGIN DROP ROLE authz_other; EXCEPTION WHEN undefined_object THEN NULL; END;
+END $setup$;
+
+CREATE ROLE authz_owner LOGIN;
+CREATE ROLE authz_other LOGIN;
+SELECT df.grant_usage('authz_owner');
+SELECT df.grant_usage('authz_other');
+
+-- The owner starts a long-running instance.
+SET SESSION AUTHORIZATION authz_owner;
+CREATE TEMP TABLE _t_authz (instance_id TEXT);
+INSERT INTO _t_authz
+SELECT df.start(df.loop(df.seq('SELECT 1', df.sleep(1))), 'authz-owner-inst');
+RESET SESSION AUTHORIZATION;
+
+-- Let the non-owner read the instance id (so the forge attempts can target it).
+GRANT SELECT ON _t_authz TO authz_other;
+
+-- Wait until it is running.
+DO $$
+DECLARE inst_id TEXT; status TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_authz;
+    LOOP
+        SELECT s INTO status FROM df.status(inst_id) s;
+        EXIT WHEN lower(status) = 'running' OR attempts > 300;
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    IF lower(status) <> 'running' THEN
+        RAISE EXCEPTION 'Setup: owner instance never reached running (status=%)', status;
+    END IF;
+END $$;
+
+-- ===========================================================================
+-- A non-owner cannot forge a cancel/signal against the owner's instance.
+-- ===========================================================================
+
+SET SESSION AUTHORIZATION authz_other;
+DO $$
+DECLARE inst_id TEXT;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_authz;
+
+    -- Forge a cancel.
+    BEGIN
+        PERFORM df._enqueue_orchestrator_cancel(inst_id, 'forged');
+        RAISE EXCEPTION 'TEST FAILED: authz_other was allowed to enqueue a cancel for owner instance %', inst_id;
+    EXCEPTION
+        WHEN insufficient_privilege THEN
+            IF SQLERRM LIKE '%not authorized%' THEN
+                RAISE NOTICE 'PASSED [cancel_forge_denied]: %', SQLERRM;
+            ELSE
+                RAISE EXCEPTION 'TEST FAILED: cancel denied, but not by the wrapper authorization check: %', SQLERRM;
+            END IF;
+        WHEN undefined_function THEN
+            RAISE EXCEPTION 'TEST FAILED: df._enqueue_orchestrator_cancel is missing (change not present): %', SQLERRM;
+    END;
+
+    -- Forge a signal.
+    BEGIN
+        PERFORM df._enqueue_orchestrator_signal(inst_id, 'go', '{}');
+        RAISE EXCEPTION 'TEST FAILED: authz_other was allowed to enqueue a signal for owner instance %', inst_id;
+    EXCEPTION
+        WHEN insufficient_privilege THEN
+            IF SQLERRM LIKE '%not authorized%' THEN
+                RAISE NOTICE 'PASSED [signal_forge_denied]: %', SQLERRM;
+            ELSE
+                RAISE EXCEPTION 'TEST FAILED: signal denied, but not by the wrapper authorization check: %', SQLERRM;
+            END IF;
+        WHEN undefined_function THEN
+            RAISE EXCEPTION 'TEST FAILED: df._enqueue_orchestrator_signal is missing (change not present): %', SQLERRM;
+    END;
+END $$;
+RESET SESSION AUTHORIZATION;
+
+-- ===========================================================================
+-- The owner can still cancel its own instance (the authorized path works).
+-- ===========================================================================
+
+SET SESSION AUTHORIZATION authz_owner;
+SELECT df.cancel((SELECT instance_id FROM _t_authz), 'owner-cancel');
+RESET SESSION AUTHORIZATION;
+
+DO $$
+DECLARE inst_id TEXT; status TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_authz;
+    LOOP
+        SELECT s INTO status FROM df.status(inst_id) s;
+        EXIT WHEN lower(status) = 'cancelled' OR attempts > 100;
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    IF lower(status) <> 'cancelled' THEN
+        RAISE EXCEPTION 'TEST FAILED: owner could not cancel its own instance % (status=%)', inst_id, status;
+    END IF;
+    RAISE NOTICE 'PASSED [owner_cancel_allowed]: instance % cancelled by its owner', inst_id;
+END $$;
+
+DROP TABLE _t_authz;
+
+-- Cleanup roles.
+DO $cleanup$
+BEGIN
+    BEGIN DROP OWNED BY authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END;
+    BEGIN DROP OWNED BY authz_other; EXCEPTION WHEN undefined_object THEN NULL; END;
+    BEGIN DROP ROLE authz_owner;     EXCEPTION WHEN undefined_object THEN NULL; END;
+    BEGIN DROP ROLE authz_other;     EXCEPTION WHEN undefined_object THEN NULL; END;
+END $cleanup$;
+
+SELECT 'TEST PASSED: enqueue wrapper authorization' AS result;
diff --git a/tests/e2e/sql/26_reconcile_orphan_gc.sql b/tests/e2e/sql/26_reconcile_orphan_gc.sql
new file mode 100644
index 0000000..c59c37b
--- /dev/null
+++ b/tests/e2e/sql/26_reconcile_orphan_gc.sql
@@ -0,0 +1,149 @@
+-- Copyright (c) Microsoft Corporation.
+-- Licensed under the PostgreSQL License.
+
+-- Tests: df.reconcile() repairs leftover df/_duroxide drift.
+--
+-- df.reconcile() deletes orphaned ROOT runtime instances (no df.instances row,
+-- older than the grace window), gathering each orphan's full subtree so the
+-- delete is accepted, and leaves healthy instances untouched.
+--
+-- Without the change df.reconcile() does not exist, so the calls below raise
+-- undefined_function and the test fails.
+--
+-- _duroxide is read as the superuser (postgres); durable functions run as
+-- df_e2e_user.
+
+-- Helper (superuser): does a _duroxide instance row exist for this id?
+CREATE OR REPLACE FUNCTION pg_temp.duroxide_instance_exists(p_id text)
+RETURNS boolean LANGUAGE plpgsql AS $$
+DECLARE sch text := df.duroxide_schema(); ex boolean;
+BEGIN
+    EXECUTE format('SELECT EXISTS(SELECT 1 FROM %I.instances WHERE instance_id = $1)', sch)
+        INTO ex USING p_id;
+    RETURN ex;
+END $$;
+
+-- Helper (superuser): number of direct children (sub-orchestrations) of a root.
+CREATE OR REPLACE FUNCTION pg_temp.duroxide_child_count(p_root text)
+RETURNS bigint LANGUAGE plpgsql AS $$
+DECLARE sch text := df.duroxide_schema(); n bigint;
+BEGIN
+    EXECUTE format('SELECT count(*) FROM %I.instances WHERE parent_instance_id = $1', sch)
+        INTO n USING p_root;
+    RETURN n;
+END $$;
+
+-- Helper (superuser): size of the full instance subtree rooted at p_root.
+CREATE OR REPLACE FUNCTION pg_temp.duroxide_subtree_count(p_root text)
+RETURNS bigint LANGUAGE plpgsql AS $$
+DECLARE sch text := df.duroxide_schema(); n bigint;
+BEGIN
+    EXECUTE format(
+        'WITH RECURSIVE t AS ( '
+        '    SELECT instance_id FROM %1$I.instances WHERE instance_id = $1 '
+        '    UNION '
+        '    SELECT c.instance_id FROM %1$I.instances c JOIN t ON c.parent_instance_id = t.instance_id '
+        ') SELECT count(*) FROM t', sch)
+        INTO n USING p_root;
+    RETURN n;
+END $$;
+
+-- ===========================================================================
+-- Scenario 1: a planted root orphan WITH a running subtree is fully collected
+-- ===========================================================================
+
+-- Start a parallel JOIN: each branch runs as its own sub-orchestration, which
+-- has no df.instances row of its own. Orphaning the root then forces reconcile
+-- to gather the FULL subtree (root + children) -- deleting only the root would be
+-- refused by delete_instances_atomic, leaving the orphan behind. Long branches
+-- keep the children running for the duration of the test.
+SET SESSION AUTHORIZATION df_e2e_user;
+CREATE TEMP TABLE _t_orphan (instance_id TEXT);
+INSERT INTO _t_orphan
+SELECT df.start('SELECT pg_sleep(30)' & 'SELECT pg_sleep(30)', 'reconcile-orphan-victim');
+RESET SESSION AUTHORIZATION;
+
+-- Wait until the worker has materialized the root AND at least one child, so the
+-- subtree genuinely exists when we reconcile.
+DO $$
+DECLARE inst_id TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_orphan;
+    LOOP
+        EXIT WHEN pg_temp.duroxide_child_count(inst_id) >= 1 OR attempts > 300;
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    IF pg_temp.duroxide_child_count(inst_id) < 1 THEN
+        RAISE EXCEPTION 'Setup: orphan victim never spawned a child sub-orchestration';
+    END IF;
+    RAISE NOTICE 'Setup: orphan victim % has % child sub-orchestration(s)', inst_id, pg_temp.duroxide_child_count(inst_id);
+END $$;
+
+-- Orphan it: drop the control-plane rows (superuser bypasses row security).
+-- df.instances and df.nodes reference each other, but the FKs are
+-- DEFERRABLE INITIALLY DEFERRED, so deleting both in one transaction is fine.
+BEGIN;
+DELETE FROM df.instances WHERE id          = (SELECT instance_id FROM _t_orphan);
+DELETE FROM df.nodes     WHERE instance_id = (SELECT instance_id FROM _t_orphan);
+COMMIT;
+
+-- Reconcile with a zero grace window: the orphaned root AND its subtree must be
+-- collected.
+SELECT df.reconcile(0);
+
+DO $$
+DECLARE inst_id TEXT; remaining bigint;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_orphan;
+    remaining := pg_temp.duroxide_subtree_count(inst_id);
+    IF remaining > 0 THEN
+        RAISE EXCEPTION 'TEST FAILED: df.reconcile left % subtree row(s) for orphaned root % (subtree not gathered)', remaining, inst_id;
+    END IF;
+    RAISE NOTICE 'PASSED [orphan_subtree_collected]: root % and all children removed', inst_id;
+END $$;
+
+DROP TABLE _t_orphan;
+
+-- ===========================================================================
+-- Scenario 2: a healthy, running instance is left untouched
+-- ===========================================================================
+
+SET SESSION AUTHORIZATION df_e2e_user;
+CREATE TEMP TABLE _t_live (instance_id TEXT);
+INSERT INTO _t_live SELECT df.start('SELECT pg_sleep(3)', 'reconcile-live');
+RESET SESSION AUTHORIZATION;
+
+-- Ensure it is running before reconciling.
+DO $$
+DECLARE inst_id TEXT; status TEXT; attempts INT := 0;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_live;
+    LOOP
+        SELECT s INTO status FROM df.status(inst_id) s;
+        EXIT WHEN lower(status) = 'running' OR attempts > 300;
+        PERFORM pg_sleep(0.1);
+        attempts := attempts + 1;
+    END LOOP;
+    IF lower(status) <> 'running' THEN
+        RAISE EXCEPTION 'Setup: live instance never reached running (status=%)', status;
+    END IF;
+END $$;
+
+-- Reconcile must not disturb a healthy instance (it has a df.instances row).
+SELECT df.reconcile(0);
+
+DO $$
+DECLARE inst_id TEXT; status TEXT;
+BEGIN
+    SELECT instance_id INTO inst_id FROM _t_live;
+    SELECT df.wait_for_completion(inst_id, 30) INTO status;
+    IF lower(status) <> 'completed' THEN
+        RAISE EXCEPTION 'TEST FAILED: df.reconcile disturbed a healthy instance % (status=%)', inst_id, status;
+    END IF;
+    RAISE NOTICE 'PASSED [live_untouched]: % completed normally despite reconcile', inst_id;
+END $$;
+
+DROP TABLE _t_live;
+
+SELECT 'TEST PASSED: reconcile orphan GC' AS result;

From 0053a187d5a7f8dff071d056df68d5f31e8b1899 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 16:57:31 +0000
Subject: [PATCH 13/20] spec: call out direct duroxide-pg coupling and the
 PG-provider requirement

The in-transaction enqueue and df.reconcile() are the first code to call
duroxide-pg's SQL surface directly (enqueue_orchestrator_work,
delete_instances_atomic, reads of orchestrator_queue/instances/executions),
unlike the rest of pg_durable which uses duroxide's Rust provider/client API.
Document why (only direct SQL can share the caller's transaction), the existing
precedent (the _worker_ready table co-located in _duroxide; duroxide-pg
migrations), and that non-PG providers fall back to the out-of-band path.
---
 docs/spec-atomic-start.md | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 17f64e7..4fc0d97 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -154,6 +154,32 @@ transaction: a rollback drops the queue row along with the `df.*` rows, and a co
 makes them visible together. `df.cancel()` and `df.signal()` work the same way
 through their own wrappers.
 
+### Why this reaches into duroxide-pg directly (and needs the PG provider)
+
+Everywhere else, pg_durable talks to the runtime through duroxide's Rust
+provider/client API — the out-of-band start/cancel/signal fallback (`src/client.rs`)
+and the monitoring/explain reads (`src/monitoring.rs`, `src/explain.rs`) all go
+through `Client`/`Provider` methods, which are provider-agnostic at the API level.
+The in-transaction enqueue and `df.reconcile()` are the **first places pg_durable
+calls duroxide-pg's SQL surface directly**: `enqueue_orchestrator_work` and
+`delete_instances_atomic`, plus reads of `_duroxide.orchestrator_queue`,
+`instances`, and `executions`. They therefore depend on duroxide-pg's table shapes
+and function signatures, not just on the runtime existing.
+
+This is deliberate, and it is the whole point of the change: only direct SQL (via
+SPI) can run inside the **caller's** transaction. The Rust client uses a separate
+connection pool, so it can never be atomic with the backend's SPI work. The trade is
+a deeper coupling to duroxide-pg in exchange for atomicity.
+
+It would not work for a non-PostgreSQL provider — there would be no SQL functions to
+call — but pg_durable already assumes a duroxide-pg provider in a known schema: it
+places its own `_worker_ready` table inside `_duroxide`, resolves the schema via
+`df.duroxide_schema()`, and relies on duroxide-pg applying its migrations at startup.
+In the pg_durable context — a PostgreSQL extension whose runtime state lives in the
+same database — there is no real use case for another provider. The probe-and-
+fallback (below) keeps any non-PG provider working, non-atomically, rather than
+broken.
+
 ### Why a privileged wrapper is needed
 
 `_duroxide.orchestrator_queue` is writable by its owner only (the worker role), so a
@@ -265,12 +291,12 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a
 
 ## Implementation notes
 
-- The atomic path only works with the **duroxide-pg (SQL-backed) provider**, because
-  it calls the provider's SQL functions directly. `df.start` / `df.cancel` /
-  `df.signal` each probe for that surface (`enqueue_orchestrator_work` in the
-  resolved schema); when it is absent — a different provider, or a schema predating
-  the wrappers — they fall back to the old out-of-band path and emit a warning, so
-  the change never breaks another provider.
+- The atomic path requires the duroxide-pg provider (see *Why this reaches into
+  duroxide-pg directly* above). `df.start` / `df.cancel` / `df.signal` each probe for
+  its SQL surface (`enqueue_orchestrator_work` in the resolved schema); when it is
+  absent — a different provider, or a schema predating the wrappers — they fall back
+  to the old out-of-band path and emit a warning, so the change never breaks another
+  provider.
 - `visible_at = now()` (transaction time) is enough for immediate starts.
 - The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker
   behaves exactly as before.

From d46f431d8a102b8a0005c0042d2dcd62d676dceb Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 17:56:14 +0000
Subject: [PATCH 14/20] Address self-review findings: upgrade path, fallback
 probe, start-wrapper hardening

Fixes from the PR self-review:

- Complete the 0.2.3->0.2.4 upgrade script with the atomic enqueue wrappers,
  df.reconcile(), and matching grant_usage()/revoke_usage() grants/revokes.
- Make the in-transaction enqueue probe require both surfaces: the duroxide-pg
  enqueue SQL function and the df._enqueue_orchestrator_* wrappers. This keeps
  the new .so B1-compatible with older df schemas that have not run ALTER
  EXTENSION UPDATE, falling back to the out-of-band client path instead of
  calling missing wrappers.
- Log (server-side) rather than emit a client-visible WARNING for the fallback,
  so scripts that capture SELECT df.start(...) output are not contaminated.
- Harden df._enqueue_orchestrator_start so it is not a generic privileged
  start-any-orchestration primitive: it only accepts the root graph executor,
  validates input.instance_id, and enqueues the hard-coded root orchestration.
- Extend the authz E2E test to cover the start-wrapper hardening and make its
  cleanup robust across interrupted prior runs.

Verified with scripts/test-upgrade.sh --verbose and targeted E2E tests 24/25/26.
---
 docs/spec-atomic-start.md                  |  24 +-
 docs/upgrade-testing.md                    |  12 +-
 sql/pg_durable--0.2.3--0.2.4.sql           | 283 ++++++++++++++++++++-
 src/dsl.rs                                 |  41 ++-
 src/lib.rs                                 |  16 +-
 tests/e2e/sql/25_enqueue_wrapper_authz.sql |  78 +++++-
 6 files changed, 416 insertions(+), 38 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 4fc0d97..1f2c3d0 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -195,11 +195,12 @@ with `%I`. They are revoked from `PUBLIC` and granted through `df.grant_usage()`
 Because a `SECURITY DEFINER` function runs as its owner, the wrappers cannot trust
 the caller. Two safeguards make them safe to grant to every user:
 
-- **The work item is built inside the wrapper**, from the caller's arguments — never
+- **The work item is built inside the wrapper**, from validated arguments — never
   passed in. A caller cannot choose the work-item type or smuggle in a foreign
-  target. (`df.start` builds `StartOrchestration`, `df.cancel` builds
-  `CancelInstance`, `df.signal` builds `ExternalRaised`, each with
-  `json_build_object`, matching duroxide's wire format.)
+  target. `df.start` only accepts the public root graph-executor orchestration and
+  requires the input JSON's `instance_id` to match the target id; `df.cancel` builds
+  `CancelInstance`; `df.signal` builds `ExternalRaised`. All three use
+  `json_build_object`, matching duroxide's wire format.
 - **The caller is authorized before any enqueue**, by two different rules:
   - *Start* targets a brand-new instance, so it authorizes by *state*: it permits
     the enqueue only for a `pending` `df.instances` row that has no queue entry and
@@ -294,9 +295,9 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a
 - The atomic path requires the duroxide-pg provider (see *Why this reaches into
   duroxide-pg directly* above). `df.start` / `df.cancel` / `df.signal` each probe for
   its SQL surface (`enqueue_orchestrator_work` in the resolved schema); when it is
-  absent — a different provider, or a schema predating the wrappers — they fall back
-  to the old out-of-band path and emit a warning, so the change never breaks another
-  provider.
+  absent — a different provider, or a schema predating the wrappers — they log and
+  fall back to the old out-of-band path, so the change never breaks another provider
+  or contaminates `SELECT df.start(...)` output with client-visible warnings.
 - `visible_at = now()` (transaction time) is enough for immediate starts.
 - The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker
   behaves exactly as before.
@@ -322,7 +323,7 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a
 - **Self-healing (medium).** The reconciler is re-checked from the steady-state poll
   loop, not only once per worker epoch, so a cancelled one comes back within the poll
   interval.
-- **Silent fallback (medium).** `df.start` now warns when it uses the non-atomic
+- **Silent fallback (medium).** `df.start` now logs when it uses the non-atomic
   fallback.
 - **Kept `LOGIN` on `df_reconciler`** (debated): the worker runs the reconcile node
   by connecting *as* the role, like every other durable-function role, so `NOLOGIN`
@@ -341,9 +342,10 @@ they ship in the extension's install SQL and an upgrade script
 (`sql/pg_durable--<prev>--<current>.sql`, `CREATE FUNCTION` + `GRANT`), with a
 "Version-Specific Changes" entry in `docs/upgrade-testing.md`. A new binary running
 against an older `df` schema that predates the wrappers must still work: each caller
-probes for the enqueue surface and falls back to the old out-of-band path when it is
-missing. This keeps the binary compatible with every prior schema while the upgrade
-rolls out.
+uses the atomic path only when both the duroxide-pg SQL enqueue function **and** the
+`df._enqueue_orchestrator_*` wrappers exist; otherwise it falls back to the old
+out-of-band path. This keeps the binary compatible with every prior schema while
+the upgrade rolls out.
 
 **Data migration.** None. No table shapes change, and in-flight instances are
 unaffected (their queue items were already enqueued).
diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md
index 586e359..e4dd1eb 100644
--- a/docs/upgrade-testing.md
+++ b/docs/upgrade-testing.md
@@ -206,14 +206,22 @@ what the upgrade script handles, and any backward compatibility considerations.
 ### v0.2.3 → v0.2.4
 
 #### Simplify `df.grant_usage()` — drop the explicit function allowlist
-- **DDL change (df schema):** `df.grant_usage()` no longer loops over a hard-coded `func_sigs` array issuing `GRANT EXECUTE` per function. Fresh installs (`src/lib.rs`) and the upgrade script (`sql/pg_durable--0.2.3--0.2.4.sql`) both `CREATE OR REPLACE` the function with a body that grants `USAGE ON SCHEMA df` plus the table privileges, and conditionally grants `df.http()` / the admin helpers. The signature `df.grant_usage(text, boolean, boolean)` is unchanged.
-- **DDL change (df schema):** `df.revoke_usage()` is made symmetric with the new `grant_usage()`. It no longer loops over every `df.*` function in `pg_proc` issuing `REVOKE EXECUTE` (which, post-simplification, only produced "no privileges could be revoked" warnings since ordinary functions are never granted per-function EXECUTE). The new body revokes only what `grant_usage()` grants: schema `USAGE`, EXECUTE on the sensitive functions (`df.http`, `df.grant_usage`, `df.revoke_usage`), and the table privileges. The signature `df.revoke_usage(text)` is unchanged.
+- **DDL change (df schema):** `df.grant_usage()` no longer loops over a hard-coded `func_sigs` array issuing `GRANT EXECUTE` per function. Fresh installs (`src/lib.rs`) and the upgrade script (`sql/pg_durable--0.2.3--0.2.4.sql`) both `CREATE OR REPLACE` the function with a body that grants `USAGE ON SCHEMA df` plus the table privileges, conditionally grants `df.http()` / the admin helpers, and explicitly grants the new private enqueue wrappers (`df._enqueue_orchestrator_start`, `df._enqueue_orchestrator_cancel`, `df._enqueue_orchestrator_signal`) to ordinary df users. The signature `df.grant_usage(text, boolean, boolean)` is unchanged.
+- **DDL change (df schema):** `df.revoke_usage()` is made symmetric with the new `grant_usage()`. It no longer loops over every `df.*` function in `pg_proc` issuing `REVOKE EXECUTE` (which, post-simplification, only produced "no privileges could be revoked" warnings since ordinary functions are never granted per-function EXECUTE). The new body revokes only what `grant_usage()` grants: schema `USAGE`, EXECUTE on the sensitive/admin functions, EXECUTE on the private enqueue wrappers, and the table privileges. The signature `df.revoke_usage(text)` is unchanged.
 - **Rationale:** The ordinary `df.*` functions retain PostgreSQL's default PUBLIC `EXECUTE`, so schema `USAGE` is the real access gate; the per-function grants/revokes were redundant. The sensitive functions have PUBLIC `EXECUTE` revoked at install time and were never in the allowlist, so their protection is unchanged.
 - **Behavioral note:** A newly added `df.*` function is now callable by any role with schema `USAGE` by default. To keep a future function private, `REVOKE EXECUTE ... FROM PUBLIC` at install time and grant it explicitly in `df.grant_usage()`.
 - **Legacy cleanup caveat:** A role that was granted under the *old* `grant_usage()` (explicit per-function EXECUTE) and is later revoked under the new `revoke_usage()` may retain inert EXECUTE entries on ordinary functions. These are harmless — revoking schema `USAGE` fully locks the role out — and clear on the next drop/regrant cycle.
 - **Scenario A considerations:** Signatures are identical on the fresh-install and upgrade paths (only the bodies differ), so the function-signature equivalence contract passes.
 - **Scenario B1/B2 considerations:** No schema/data migration and no new objects. The replaced bodies work against the existing schema and change no privileges already granted.
 
+#### Atomic in-transaction enqueue wrappers + `df.reconcile()`
+- **DDL change (df schema):** Adds three private `SECURITY DEFINER` wrappers: `df._enqueue_orchestrator_start(text, text, text)`, `df._enqueue_orchestrator_cancel(text, text)`, and `df._enqueue_orchestrator_signal(text, text, text)`. Fresh installs and the upgrade script create the same functions and `REVOKE EXECUTE ... FROM PUBLIC`; `df.grant_usage()` grants them explicitly to df users because `df.start()` / `df.cancel()` / `df.signal()` call them via SPI as the caller.
+- **DDL change (df schema):** Adds admin-only `df.reconcile(integer)` (`SECURITY DEFINER`, `REVOKE EXECUTE ... FROM PUBLIC`) to delete orphaned duroxide instance subtrees and mark stuck `df.instances` rows failed. The background worker starts it through a built-in durable loop as the dedicated `df_reconciler` role.
+- **Provider coupling:** These wrappers and `df.reconcile()` call the duroxide-pg SQL surface directly (`enqueue_orchestrator_work`, `delete_instances_atomic`, and `_duroxide` tables). This is intentional: only direct SQL via SPI can share the caller's transaction. Non-PG providers (or older schemas without the wrappers) use the legacy out-of-band fallback.
+- **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`.
+- **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output.
+- **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers.
+
 #### Rename `df.wait_for_completion()` to `df.await_instance()`
 - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working.
 - **Grant behavior:** No explicit grant migration is required. PostgreSQL grants `EXECUTE` on newly created functions to `PUBLIC` by default, and `df.await_instance` is not a sensitive helper whose default PUBLIC grant is revoked.
diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql
index efaab86..9308bc1 100644
--- a/sql/pg_durable--0.2.3--0.2.4.sql
+++ b/sql/pg_durable--0.2.3--0.2.4.sql
@@ -40,7 +40,10 @@ DROP FUNCTION IF EXISTS df.debug_connection();
 --
 -- The sensitive functions (df.http, df.grant_usage, df.revoke_usage) had their
 -- PUBLIC EXECUTE revoked at install time and were never in the list; they are
--- still granted explicitly here, so their protection is unchanged.
+-- still granted explicitly here, so their protection is unchanged. The new
+-- in-transaction enqueue wrappers are also private (REVOKE FROM PUBLIC below)
+-- and are granted explicitly to df users because df.start()/df.cancel()/df.signal()
+-- call them via SPI as the calling role.
 --
 -- This CREATE OR REPLACE brings pre-existing installs in line with fresh
 -- 0.2.4 installs (see src/lib.rs). The new body works against the existing
@@ -76,6 +79,14 @@ BEGIN
         EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df.revoke_usage(text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt;
     END IF;
 
+    -- In-transaction enqueue wrappers — SECURITY DEFINER, revoked from PUBLIC at
+    -- install. Granted unconditionally to every df user because df.start() /
+    -- df.cancel() / df.signal() call them via SPI as the calling role; their own
+    -- internal authorization checks gate access to other users' instances.
+    EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt;
+    EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt;
+    EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt;
+
     -- Table privileges
     EXECUTE pg_catalog.format('GRANT SELECT ON df.instances TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt;
     EXECUTE pg_catalog.format('GRANT UPDATE (status, updated_at) ON df.instances TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt;
@@ -100,10 +111,11 @@ $fn$;
 -- the role out of every ordinary df.* function.
 --
 -- The new body undoes exactly what grant_usage() grants: schema USAGE, EXECUTE
--- on the sensitive functions, and the table privileges. Note: a role granted
--- under the OLD grant_usage() (explicit per-function EXECUTE) may retain inert
--- EXECUTE entries on ordinary functions after this revoke; they are harmless
--- because schema USAGE is gone, and clear on the next drop/regrant cycle.
+-- on the sensitive/admin functions, EXECUTE on the new enqueue wrappers, and
+-- the table privileges. Note: a role granted under the OLD grant_usage()
+-- (explicit per-function EXECUTE) may retain inert EXECUTE entries on ordinary
+-- functions after this revoke; they are harmless because schema USAGE is gone,
+-- and clear on the next drop/regrant cycle.
 -- ============================================================================
 CREATE OR REPLACE FUNCTION df.revoke_usage(p_role TEXT)
 RETURNS VOID
@@ -134,6 +146,24 @@ BEGIN
         NULL;
     END;
 
+    -- In-transaction enqueue wrappers (granted unconditionally by grant_usage()).
+    -- A delegated admin may not be the grantor of these; skip if so.
+    BEGIN
+        EXECUTE pg_catalog.format('REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM %I CASCADE', p_role);
+    EXCEPTION WHEN insufficient_privilege THEN
+        NULL;
+    END;
+    BEGIN
+        EXECUTE pg_catalog.format('REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) FROM %I CASCADE', p_role);
+    EXCEPTION WHEN insufficient_privilege THEN
+        NULL;
+    END;
+    BEGIN
+        EXECUTE pg_catalog.format('REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM %I CASCADE', p_role);
+    EXCEPTION WHEN insufficient_privilege THEN
+        NULL;
+    END;
+
     -- Table privileges.
     -- Column-level revokes must match the column-level grants from grant_usage().
     EXECUTE pg_catalog.format('REVOKE SELECT, INSERT, UPDATE, DELETE ON df.vars FROM %I CASCADE', p_role);
@@ -166,3 +196,246 @@ CREATE FUNCTION df."await_instance"(
 STRICT
 LANGUAGE c
 AS 'MODULE_PATHNAME', 'await_instance_wrapper';
+
+-- ============================================================================
+-- In-transaction enqueue wrappers + reconciler (atomic df.start/cancel/signal,
+-- and df.reconcile()).
+--
+-- df.start()/df.cancel()/df.signal() now enqueue the duroxide work item over SPI
+-- inside the CALLER'S transaction (so a rollback undoes the enqueue), through
+-- these SECURITY DEFINER wrappers. The orchestrator queue is owner-only, so the
+-- wrappers perform the privileged INSERT; each builds the work item server-side
+-- and authorizes the caller (df.start by brand-new-instance state; df.cancel /
+-- df.signal by pg_has_role(session_user, <owner>, 'MEMBER')). They are revoked
+-- from PUBLIC and granted to df users by df.grant_usage() (above).
+--
+-- df.reconcile() is the admin-only backstop that deletes orphaned duroxide
+-- instance subtrees with no df.instances row and fails stuck df.instances rows.
+--
+-- The wrappers resolve the duroxide schema via df.duroxide_schema() (defined in
+-- the 0.2.2→0.2.3 upgrade) and require the duroxide-pg provider; df.start /
+-- df.cancel / df.signal fall back to the out-of-band path when it is absent, so
+-- this upgrade is safe on a fresh '_duroxide' or a legacy 'duroxide' schema.
+-- ============================================================================
+CREATE FUNCTION df._enqueue_orchestrator_start(
+    p_instance_id   text,
+    p_orchestration text,
+    p_input         text)
+RETURNS void
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+DECLARE
+    sch       text := df.duroxide_schema();
+    work_item text;
+    v_blocked boolean;
+BEGIN
+    -- This wrapper is not a generic privileged "start any orchestration" entry
+    -- point. df.start() passes the root graph-executor name and FunctionInput
+    -- JSON; reject anything else so a caller cannot use the SECURITY DEFINER
+    -- privilege to enqueue an internal sub-orchestration with crafted input.
+    IF p_orchestration OPERATOR(pg_catalog.<>) 'pg_durable::orchestration::execute-function-graph' THEN
+        RAISE EXCEPTION 'pg_durable: invalid start orchestration %', p_orchestration
+            USING ERRCODE = 'invalid_parameter_value';
+    END IF;
+
+    IF (p_input::jsonb ->> 'instance_id') IS DISTINCT FROM p_instance_id THEN
+        RAISE EXCEPTION 'pg_durable: start input instance_id does not match %', p_instance_id
+            USING ERRCODE = 'invalid_parameter_value';
+    END IF;
+
+    -- Authorization. This runs as the (privileged) definer, so it must not
+    -- trust the caller to only target their own instance. Permit the enqueue
+    -- only for a brand-new, not-yet-started instance: a 'pending' df.instances
+    -- row with no orchestrator-queue entry and no duroxide instance. Under the
+    -- atomic-start path a committed df.instances row always has its queue row in
+    -- the same transaction, so this state is reachable only for the instance the
+    -- caller (df.start) just inserted in the current transaction — never another
+    -- user's committed instance. This is what stops a caller from forging work
+    -- against a foreign instance.
+    EXECUTE pg_catalog.format(
+        'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') '
+        '    OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) '
+        '    OR EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = $1)',
+        sch, sch)
+    INTO v_blocked
+    USING p_instance_id;
+
+    IF v_blocked THEN
+        RAISE EXCEPTION 'pg_durable: not authorized to enqueue a start for instance %', p_instance_id
+            USING ERRCODE = 'insufficient_privilege';
+    END IF;
+
+    -- Build the StartOrchestration work item server-side so the caller cannot
+    -- choose the work-item variant (no CancelInstance/ExternalRaised/etc.) or
+    -- target a different instance. Mirrors duroxide's WorkItem::StartOrchestration.
+    work_item := pg_catalog.json_build_object(
+        'StartOrchestration', pg_catalog.json_build_object(
+            'instance',        p_instance_id,
+        'orchestration',   'pg_durable::orchestration::execute-function-graph',
+            'input',           p_input,
+            'version',         NULL,
+            'parent_instance', NULL,
+            'parent_id',       NULL,
+            'execution_id',    1))::text;
+
+    EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch)
+        USING p_instance_id, work_item, pg_catalog.now();
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM PUBLIC;
+
+CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason text)
+RETURNS void
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+DECLARE
+    sch       text := df.duroxide_schema();
+    owner_oid oid;
+BEGIN
+    SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id;
+    IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN
+        RAISE EXCEPTION 'pg_durable: not authorized to cancel instance %', p_instance_id
+            USING ERRCODE = 'insufficient_privilege';
+    END IF;
+
+    EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch)
+        USING p_instance_id,
+              pg_catalog.json_build_object('CancelInstance',
+                  pg_catalog.json_build_object('instance', p_instance_id, 'reason', p_reason))::text,
+              pg_catalog.now();
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) FROM PUBLIC;
+
+CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text, p_data text)
+RETURNS void
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+DECLARE
+    sch       text := df.duroxide_schema();
+    owner_oid oid;
+BEGIN
+    SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id;
+    IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN
+        RAISE EXCEPTION 'pg_durable: not authorized to signal instance %', p_instance_id
+            USING ERRCODE = 'insufficient_privilege';
+    END IF;
+
+    -- Raise the event for the target instance and every RUNNING descendant
+    -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one
+    -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema.
+    EXECUTE pg_catalog.format(
+        'INSERT INTO %1$I.orchestrator_queue (instance_id, work_item, visible_at, created_at) '
+        'SELECT t.instance_id, '
+        '       pg_catalog.json_build_object(''ExternalRaised'', '
+        '           pg_catalog.json_build_object(''instance'', t.instance_id, ''name'', $2, ''data'', $3))::text, '
+        '       pg_catalog.now(), pg_catalog.now() '
+        'FROM ( '
+        '    WITH RECURSIVE tree AS ( '
+        '        SELECT i.instance_id, i.current_execution_id, true AS is_root '
+        '        FROM %1$I.instances i WHERE i.instance_id = $1 '
+        '        UNION '
+        '        SELECT c.instance_id, c.current_execution_id, false '
+        '        FROM %1$I.instances c JOIN tree p ON c.parent_instance_id = p.instance_id '
+        '    ) '
+        '    SELECT tr.instance_id '
+        '    FROM tree tr '
+        '    LEFT JOIN %1$I.executions e '
+        '      ON e.instance_id = tr.instance_id AND e.execution_id = tr.current_execution_id '
+        '    WHERE tr.is_root OR pg_catalog.lower(COALESCE(e.status, '''')) = ''running'' '
+        ') t',
+        sch)
+    USING p_instance_id, p_name, p_data;
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM PUBLIC;
+
+CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60)
+RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint)
+LANGUAGE plpgsql
+SECURITY DEFINER
+SET search_path = pg_catalog
+AS $fn$
+DECLARE
+    sch        text := df.duroxide_schema();
+    orphan_ids text[];
+    deleted    bigint := 0;
+    stuck      bigint := 0;
+BEGIN
+    -- 1) Delete orphaned duroxide subtrees: every duroxide instance whose ROOT
+    --    ancestor has no df.instances row and is older than the grace window.
+    --    We must gather the FULL subtree (root + all descendants) because
+    --    delete_instances_atomic refuses (even with force) to delete a parent
+    --    whose children are not also in the list. Sub-orchestrations (JOIN/RACE
+    --    branches, loop generations) have no df.instances row and would be
+    --    mis-detected as roots if we keyed on "no df row" alone, so the orphan
+    --    seed is restricted to parent_instance_id IS NULL.
+    --    Wrapped so a GC failure never aborts reconcile or kills the built-in
+    --    reconciler loop.
+    BEGIN
+        EXECUTE pg_catalog.format(
+            'WITH RECURSIVE orphan_root AS ( '
+            '    SELECT d.instance_id '
+            '    FROM %1$I.instances d '
+            '    LEFT JOIN df.instances i ON i.id = d.instance_id '
+            '    WHERE i.id IS NULL '
+            '      AND d.parent_instance_id IS NULL '
+            '      AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) '
+            '), subtree AS ( '
+            '    SELECT instance_id FROM orphan_root '
+            '    UNION '
+            '    SELECT c.instance_id FROM %1$I.instances c '
+            '    JOIN subtree s ON c.parent_instance_id = s.instance_id '
+            ') SELECT pg_catalog.array_agg(instance_id) FROM subtree',
+            sch)
+        INTO orphan_ids
+        USING p_grace_seconds;
+
+        IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN
+            EXECUTE pg_catalog.format(
+                'SELECT instances_deleted FROM %I.delete_instances_atomic($1, $2)', sch)
+            INTO deleted
+            USING orphan_ids, true;
+        END IF;
+    EXCEPTION WHEN OTHERS THEN
+        deleted := 0;
+        RAISE WARNING 'pg_durable: reconcile orphan-GC pass failed: %', SQLERRM;
+    END;
+
+    -- 2) df.instances stuck non-terminal with no live duroxide instance and no
+    --    queued start (lost enqueue) → mark failed. The duroxide queue row
+    --    persists (locked) until ack, and the instance row is created at ack, so
+    --    a healthy in-flight start always matches one of the NOT EXISTS guards
+    --    and is never failed here. Best-effort; wrapped like step 1.
+    BEGIN
+        EXECUTE pg_catalog.format(
+            'UPDATE df.instances i '
+            'SET status = ''failed'', updated_at = pg_catalog.now() '
+            'WHERE i.status IN (''pending'', ''running'') '
+            '  AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) '
+            '  AND NOT EXISTS (SELECT 1 FROM %1$I.instances d WHERE d.instance_id = i.id) '
+            '  AND NOT EXISTS (SELECT 1 FROM %1$I.orchestrator_queue q WHERE q.instance_id = i.id)',
+            sch)
+        USING p_grace_seconds;
+        GET DIAGNOSTICS stuck = ROW_COUNT;
+    EXCEPTION WHEN OTHERS THEN
+        stuck := 0;
+        RAISE WARNING 'pg_durable: reconcile stuck-failover pass failed: %', SQLERRM;
+    END;
+
+    duroxide_orphans_deleted := deleted;
+    stuck_instances_failed := stuck;
+    RETURN NEXT;
+END;
+$fn$;
+
+REVOKE EXECUTE ON FUNCTION df.reconcile(integer) FROM PUBLIC;
diff --git a/src/dsl.rs b/src/dsl.rs
index 604cd61..6475420 100644
--- a/src/dsl.rs
+++ b/src/dsl.rs
@@ -642,7 +642,7 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str,
         }
         "OK".to_string()
     } else {
-        pgrx::warning!(
+        pgrx::log!(
             "pg_durable: df.signal() is using the non-atomic fallback enqueue \
              (duroxide-pg SQL surface not detected)"
         );
@@ -657,17 +657,34 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str,
 // Orchestration Control Functions
 // ============================================================================
 
-/// True when the duroxide store exposes the SQL enqueue surface — i.e. the
-/// duroxide-pg (PostgreSQL) provider — in the given schema. Detected by the
-/// presence of `enqueue_orchestrator_work`. When false (a non-pg provider, or a
-/// schema predating the wrapper), df.start() falls back to the out-of-band
-/// client path. The catalog probe never raises, so it is safe in a backend
-/// session even when the schema is absent.
+/// True when both required pieces of the in-transaction enqueue path exist:
+/// the duroxide-pg SQL enqueue surface in the provider schema and the df-schema
+/// SECURITY DEFINER wrappers created by the extension/upgrade script. When
+/// false (a non-pg provider, or a schema that predates the wrappers), callers
+/// fall back to the out-of-band client path. The catalog probe never raises, so
+/// it is safe in a backend session even when the schema is absent.
 fn in_tx_enqueue_supported(schema: &str) -> bool {
     Spi::get_one_with_args::<bool>(
-        "SELECT EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \
-         JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \
-         WHERE n.nspname = $1 AND p.proname = 'enqueue_orchestrator_work')",
+        "SELECT \
+            EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \
+                   JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \
+                   WHERE n.nspname = $1 \
+                     AND p.proname = 'enqueue_orchestrator_work') \
+            AND EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \
+                       JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \
+                       WHERE n.nspname = 'df' \
+                         AND p.proname = '_enqueue_orchestrator_start' \
+                         AND p.pronargs = 3) \
+            AND EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \
+                       JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \
+                       WHERE n.nspname = 'df' \
+                         AND p.proname = '_enqueue_orchestrator_cancel' \
+                         AND p.pronargs = 2) \
+            AND EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \
+                       JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \
+                       WHERE n.nspname = 'df' \
+                         AND p.proname = '_enqueue_orchestrator_signal' \
+                         AND p.pronargs = 3)",
         &[schema.into()],
     )
     .ok()
@@ -1014,7 +1031,7 @@ pub fn start(
         // the duroxide client. NOTE: this path is NOT atomic with the caller's
         // transaction — a rollback will NOT undo the start. Warn so the
         // non-atomic semantics are observable.
-        pgrx::warning!(
+        pgrx::log!(
             "pg_durable: df.start() is using the non-atomic fallback enqueue \
              (duroxide-pg SQL surface not detected); a rollback will not undo this start"
         );
@@ -1064,7 +1081,7 @@ pub fn cancel(instance_id: &str, reason: default!(&str, "'Cancelled by user'"))
             pgrx::error!("Failed to cancel: {:?}", e);
         }
     } else {
-        pgrx::warning!(
+        pgrx::log!(
             "pg_durable: df.cancel() is using the non-atomic fallback enqueue \
              (duroxide-pg SQL surface not detected)"
         );
diff --git a/src/lib.rs b/src/lib.rs
index 754b206..1143da7 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -554,6 +554,20 @@ DECLARE
     work_item text;
     v_blocked boolean;
 BEGIN
+    -- This wrapper is not a generic privileged "start any orchestration" entry
+    -- point. df.start() passes the root graph-executor name and FunctionInput
+    -- JSON; reject anything else so a caller cannot use the SECURITY DEFINER
+    -- privilege to enqueue an internal sub-orchestration with crafted input.
+    IF p_orchestration OPERATOR(pg_catalog.<>) 'pg_durable::orchestration::execute-function-graph' THEN
+        RAISE EXCEPTION 'pg_durable: invalid start orchestration %', p_orchestration
+            USING ERRCODE = 'invalid_parameter_value';
+    END IF;
+
+    IF (p_input::jsonb ->> 'instance_id') IS DISTINCT FROM p_instance_id THEN
+        RAISE EXCEPTION 'pg_durable: start input instance_id does not match %', p_instance_id
+            USING ERRCODE = 'invalid_parameter_value';
+    END IF;
+
     -- Authorization. This runs as the (privileged) definer, so it must not
     -- trust the caller to only target their own instance. Permit the enqueue
     -- only for a brand-new, not-yet-started instance: a 'pending' df.instances
@@ -582,7 +596,7 @@ BEGIN
     work_item := pg_catalog.json_build_object(
         'StartOrchestration', pg_catalog.json_build_object(
             'instance',        p_instance_id,
-            'orchestration',   p_orchestration,
+        'orchestration',   'pg_durable::orchestration::execute-function-graph',
             'input',           p_input,
             'version',         NULL,
             'parent_instance', NULL,
diff --git a/tests/e2e/sql/25_enqueue_wrapper_authz.sql b/tests/e2e/sql/25_enqueue_wrapper_authz.sql
index 68bb61d..6944b88 100644
--- a/tests/e2e/sql/25_enqueue_wrapper_authz.sql
+++ b/tests/e2e/sql/25_enqueue_wrapper_authz.sql
@@ -1,22 +1,43 @@
 -- Copyright (c) Microsoft Corporation.
 -- Licensed under the PostgreSQL License.
 
--- Tests: authorization on the in-transaction enqueue wrappers used by
--- df.cancel() / df.signal().
+-- Tests: authorization and hardening on the in-transaction enqueue wrappers.
 --
--- df._enqueue_orchestrator_cancel/_signal are SECURITY DEFINER (the runtime
+-- The df._enqueue_orchestrator_* functions are SECURITY DEFINER (the runtime
 -- queue is writable by its owner only) and granted to every df user via
 -- df.grant_usage(). They must therefore refuse to enqueue work against an
--- instance the caller does not own. Ownership is checked with
--- pg_has_role(session_user, <instance owner>, 'MEMBER'): session_user cannot be
--- spoofed inside a SECURITY DEFINER function, and membership lets a role that
--- owns the instance through SET ROLE still qualify.
+-- instance the caller does not own, and the start wrapper must not become a
+-- generic "start any internal orchestration with arbitrary input" primitive.
+-- Cancel/signal ownership is checked with pg_has_role(session_user,
+-- <instance owner>, 'MEMBER'): session_user cannot be spoofed inside a
+-- SECURITY DEFINER function, and membership lets a role that owns the instance
+-- through SET ROLE still qualify.
 --
 -- Without the change these wrappers do not exist, so the forge attempts below
 -- raise undefined_function and the test fails.
 
 -- Fresh, non-superuser roles. (Superusers bypass pg_has_role, so the denial can
 -- only be exercised by a non-superuser caller.)
+DO $precleanup$
+DECLARE
+    inst_id TEXT;
+BEGIN
+    -- A failed/interrupted prior run can leave a loop instance submitted by one
+    -- of these roles. Cancel it before DROP ROLE so the worker stops reconnecting
+    -- as that role while this test rebuilds privileges.
+    FOR inst_id IN
+        SELECT i.id
+        FROM df.instances i
+        JOIN pg_catalog.pg_roles r ON r.oid = i.submitted_by::oid
+        WHERE r.rolname IN ('authz_owner', 'authz_other')
+          AND i.status NOT IN ('completed', 'failed', 'cancelled')
+    LOOP
+        PERFORM df.cancel(inst_id, 'authz-test-precleanup');
+    END LOOP;
+END $precleanup$;
+
+SELECT pg_sleep(0.5);
+
 DO $setup$
 BEGIN
     BEGIN DROP OWNED BY authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END;
@@ -98,6 +119,49 @@ BEGIN
 END $$;
 RESET SESSION AUTHORIZATION;
 
+-- =========================================================================== 
+-- A caller cannot use the start wrapper as a generic privileged entrypoint.
+-- =========================================================================== 
+
+SET SESSION AUTHORIZATION authz_other;
+DO $$
+DECLARE
+    inst_id CONSTANT TEXT := 'feedf00d';
+    root_id CONSTANT TEXT := 'deadbeef';
+BEGIN
+    -- Create a legitimate brand-new, pending instance owned by authz_other. The
+    -- start wrapper's state check should pass for this row; the test verifies
+    -- the wrapper still rejects a non-root internal orchestration name.
+    INSERT INTO df.instances (id, label, root_node, submitted_by, database)
+    VALUES (inst_id, 'authz-start-wrapper-hardening', root_id, current_user::regrole, 'postgres');
+
+    INSERT INTO df.nodes (id, instance_id, node_type, query, submitted_by, database)
+    VALUES (root_id, inst_id, 'SQL', 'SELECT 1', current_user::regrole, 'postgres');
+
+    BEGIN
+        PERFORM df._enqueue_orchestrator_start(
+            inst_id,
+            'pg_durable::orchestration::execute-subtree',
+            json_build_object('instance_id', inst_id)::text);
+        RAISE EXCEPTION 'TEST FAILED: start wrapper accepted a non-root orchestration for instance %', inst_id;
+    EXCEPTION
+        WHEN invalid_parameter_value THEN
+            IF SQLERRM LIKE '%invalid start orchestration%' THEN
+                RAISE NOTICE 'PASSED [start_wrapper_rejects_internal_orchestration]: %', SQLERRM;
+            ELSE
+                RAISE EXCEPTION 'TEST FAILED: start wrapper rejected with an unexpected validation error: %', SQLERRM;
+            END IF;
+        WHEN undefined_function THEN
+            RAISE EXCEPTION 'TEST FAILED: df._enqueue_orchestrator_start is missing (change not present): %', SQLERRM;
+    END;
+END $$;
+RESET SESSION AUTHORIZATION;
+
+BEGIN;
+DELETE FROM df.instances WHERE id = 'feedf00d';
+DELETE FROM df.nodes WHERE instance_id = 'feedf00d';
+COMMIT;
+
 -- ===========================================================================
 -- The owner can still cancel its own instance (the authorized path works).
 -- ===========================================================================

From 8472e1931e6b74cdddbd0070adcdb18234ae05b6 Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 18:10:49 +0000
Subject: [PATCH 15/20] Address rubber-duck feedback on early signals and
 upgrade docs

Duroxide does not buffer external events until an orchestration has a pending
subscription. A signal sent before the root runtime row exists would be accepted
but skipped before the workflow can observe it, so df.signal now rejects that
case with object_not_in_prerequisite_state instead of returning OK.

- Add the root-materialization check to df._enqueue_orchestrator_signal in both
  fresh-install SQL and the 0.2.3->0.2.4 upgrade script.
- Extend 24_atomic_rollback to assert same-transaction signal attempts are
  rejected before runtime materialization.
- Align the spec/upgrade docs: wait_for_schedule in-flight instances may need
  restart after the history-shape change; early signals are rejected, not
  buffered.
---
 docs/spec-atomic-start.md            | 17 +++++++++-----
 docs/upgrade-testing.md              |  1 +
 sql/pg_durable--0.2.3--0.2.4.sql     | 13 +++++++++++
 src/lib.rs                           | 13 +++++++++++
 tests/e2e/sql/24_atomic_rollback.sql | 33 ++++++++++++++++++++++++++++
 5 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 1f2c3d0..f901c7a 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -219,10 +219,13 @@ the caller. Two safeguards make them safe to grant to every user:
 
 A signal must reach the root instance **and** every running sub-orchestration,
 because a child (a `JOIN`/`RACE` branch or a loop generation) may be the one waiting
-on it. The `df.signal` wrapper walks the instance tree from the root
-(`_duroxide.instances.parent_instance_id`), keeps the instances whose current
-execution is still running, and enqueues one `ExternalRaised` per surviving
-instance — all on the caller's transaction, so the whole fan-out is atomic.
+on it. Duroxide does not buffer external events until an orchestration has a pending
+subscription, so `df.signal` first requires the root runtime row to exist; a signal
+sent in the same transaction as `df.start()` is rejected instead of returning `OK`
+and being skipped before the workflow can observe it. Once the root is materialized,
+the wrapper walks the instance tree (`_duroxide.instances.parent_instance_id`) and
+enqueues one event for the root plus each running descendant — all on the caller's
+transaction, so the whole fan-out is atomic.
 
 ### Worker readiness and ordering
 
@@ -347,8 +350,10 @@ uses the atomic path only when both the duroxide-pg SQL enqueue function **and**
 out-of-band path. This keeps the binary compatible with every prior schema while
 the upgrade rolls out.
 
-**Data migration.** None. No table shapes change, and in-flight instances are
-unaffected (their queue items were already enqueued).
+**Data migration.** None. No table shapes change, and already-enqueued work items
+are left in place. One caveat: the `df.wait_for_schedule` fix changes the recorded
+history shape for a wait-in-loop orchestration, so such in-flight instances may need
+to be restarted after upgrade if they replay across the change.
 
 **Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if
 left in place, are simply unused.
diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md
index e4dd1eb..ac95791 100644
--- a/docs/upgrade-testing.md
+++ b/docs/upgrade-testing.md
@@ -221,6 +221,7 @@ what the upgrade script handles, and any backward compatibility considerations.
 - **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`.
 - **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output.
 - **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers.
+- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the history shape for a wait-in-loop instance that replays across the upgrade; customers should drain or restart such instances if they encounter nondeterministic replay.
 
 #### Rename `df.wait_for_completion()` to `df.await_instance()`
 - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working.
diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql
index 9308bc1..f5f49f5 100644
--- a/sql/pg_durable--0.2.3--0.2.4.sql
+++ b/sql/pg_durable--0.2.3--0.2.4.sql
@@ -322,6 +322,7 @@ AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
     owner_oid oid;
+    root_exists boolean;
 BEGIN
     SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id;
     IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN
@@ -329,6 +330,18 @@ BEGIN
             USING ERRCODE = 'insufficient_privilege';
     END IF;
 
+    -- Duroxide does not buffer external events until an orchestration has a
+    -- pending subscription. If the root runtime row is not materialized yet, a
+    -- signal would be accepted but dropped before the workflow can observe it.
+    EXECUTE pg_catalog.format(
+        'SELECT EXISTS (SELECT 1 FROM %I.instances WHERE instance_id = $1)', sch)
+    INTO root_exists
+    USING p_instance_id;
+    IF NOT root_exists THEN
+        RAISE EXCEPTION 'pg_durable: instance % is not ready to receive signals', p_instance_id
+            USING ERRCODE = 'object_not_in_prerequisite_state';
+    END IF;
+
     -- Raise the event for the target instance and every RUNNING descendant
     -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one
     -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema.
diff --git a/src/lib.rs b/src/lib.rs
index 1143da7..d97fdff 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -668,6 +668,7 @@ AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
     owner_oid oid;
+    root_exists boolean;
 BEGIN
     SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id;
     IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN
@@ -675,6 +676,18 @@ BEGIN
             USING ERRCODE = 'insufficient_privilege';
     END IF;
 
+    -- Duroxide does not buffer external events until an orchestration has a
+    -- pending subscription. If the root runtime row is not materialized yet, a
+    -- signal would be accepted but dropped before the workflow can observe it.
+    EXECUTE pg_catalog.format(
+        'SELECT EXISTS (SELECT 1 FROM %I.instances WHERE instance_id = $1)', sch)
+    INTO root_exists
+    USING p_instance_id;
+    IF NOT root_exists THEN
+        RAISE EXCEPTION 'pg_durable: instance % is not ready to receive signals', p_instance_id
+            USING ERRCODE = 'object_not_in_prerequisite_state';
+    END IF;
+
     -- Raise the event for the target instance and every RUNNING descendant
     -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one
     -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema.
diff --git a/tests/e2e/sql/24_atomic_rollback.sql b/tests/e2e/sql/24_atomic_rollback.sql
index b2d8cc5..d9de3d4 100644
--- a/tests/e2e/sql/24_atomic_rollback.sql
+++ b/tests/e2e/sql/24_atomic_rollback.sql
@@ -171,4 +171,37 @@ DROP TABLE _t_signal;
 
 RESET SESSION AUTHORIZATION;
 
+-- ===========================================================================
+-- Scenario 4: a signal sent before runtime materialization is rejected
+-- ===========================================================================
+
+SET SESSION AUTHORIZATION df_e2e_user;
+
+DO $$
+DECLARE inst_id TEXT;
+BEGIN
+    BEGIN
+        inst_id := df.start(
+            'SELECT 1' ~> (df.wait_for_signal('go') |=> 'sig') ~> 'SELECT 1',
+            'atomic-signal-before-runtime-ready'
+        );
+
+        -- The worker cannot have materialized the root _duroxide.instances row
+        -- yet (we are still inside the same backend statement/transaction). A
+        -- signal here would be accepted by the runtime but skipped before the
+        -- workflow subscribes, so df.signal must fail instead of returning OK.
+        PERFORM df.signal(inst_id, 'go', '{}');
+        RAISE EXCEPTION 'TEST FAILED: df.signal accepted instance % before runtime materialization', inst_id;
+    EXCEPTION
+        WHEN object_not_in_prerequisite_state THEN
+            IF SQLERRM LIKE '%not ready to receive signals%' THEN
+                RAISE NOTICE 'PASSED [signal_before_runtime_ready_rejected]: %', SQLERRM;
+            ELSE
+                RAISE EXCEPTION 'TEST FAILED: unexpected not-ready signal error: %', SQLERRM;
+            END IF;
+    END;
+END $$;
+
+RESET SESSION AUTHORIZATION;
+
 SELECT 'TEST PASSED: atomic rollback (start/cancel/signal)' AS result;

From 00108fd8aebaf046f392e8178c5eebe60475fd6d Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 18:17:55 +0000
Subject: [PATCH 16/20] docs: flag wait_for_schedule replay-sequence upgrade
 caveat

The wait_for_schedule loop fix records an utc_now event before the timer, which
changes the durable replay event sequence for in-flight scheduled-loop instances.
Document the drain/restart requirement explicitly in the spec and upgrade notes.
---
 docs/spec-atomic-start.md | 17 ++++++++++++-----
 docs/upgrade-testing.md   |  2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index f901c7a..983a203 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -321,8 +321,11 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a
 - **Cron busy-loop (medium).** `df.wait_for_schedule` baked a fixed delay at build
   time that the loop replayed every generation. Fixed so the wait node recomputes the
   delay each generation from the orchestration's recorded clock (deterministic on
-  replay). Note: this changes the recorded history shape, so a wait-in-loop instance
-  in flight across the upgrade may need a restart.
+  replay). **Upgrade caveat:** this changes the recorded replay event sequence
+  (`utc_now` is recorded before the timer). A `df.loop(df.wait_for_schedule(...))`
+  instance that is already waiting when the binary changes may replay expecting the
+  old sequence and fail as nondeterministic; drain or restart those scheduled
+  instances during upgrade.
 - **Self-healing (medium).** The reconciler is re-checked from the steady-state poll
   loop, not only once per worker epoch, so a cancelled one comes back within the poll
   interval.
@@ -351,9 +354,13 @@ out-of-band path. This keeps the binary compatible with every prior schema while
 the upgrade rolls out.
 
 **Data migration.** None. No table shapes change, and already-enqueued work items
-are left in place. One caveat: the `df.wait_for_schedule` fix changes the recorded
-history shape for a wait-in-loop orchestration, so such in-flight instances may need
-to be restarted after upgrade if they replay across the change.
+are left in place.
+
+**In-flight scheduled loops.** `df.wait_for_schedule` now records an `utc_now` event
+before the timer so a loop computes the next cron tick each generation. That changes
+the recorded replay event sequence for existing wait-in-loop instances. Drain or
+restart any in-flight `df.loop(df.wait_for_schedule(...))` instances during upgrade;
+otherwise they may fail on replay as nondeterministic.
 
 **Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if
 left in place, are simply unused.
diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md
index ac95791..b7f8672 100644
--- a/docs/upgrade-testing.md
+++ b/docs/upgrade-testing.md
@@ -221,7 +221,7 @@ what the upgrade script handles, and any backward compatibility considerations.
 - **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`.
 - **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output.
 - **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers.
-- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the history shape for a wait-in-loop instance that replays across the upgrade; customers should drain or restart such instances if they encounter nondeterministic replay.
+- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the recorded replay event sequence (`utc_now` is recorded before the timer). Existing `df.loop(df.wait_for_schedule(...))` instances that are already waiting when the binary changes should be drained or restarted during upgrade; otherwise they may replay expecting the old sequence and fail as nondeterministic.
 
 #### Rename `df.wait_for_completion()` to `df.await_instance()`
 - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working.

From 084562038805bd8a4b6d6e38e38960aaceeb70cf Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 18:43:35 +0000
Subject: [PATCH 17/20] Address correctness review: backfill wrapper grants and
 broaden schedule caveat

- During 0.2.3->0.2.4 upgrade, backfill EXECUTE on the new private atomic
  enqueue wrappers to roles that already had explicit USAGE on schema df,
  preserving grant option. Without this, existing df users could lose
  df.start()/df.cancel()/df.signal() after ALTER EXTENSION UPDATE because the
  new binary would choose the atomic path and then fail on wrapper EXECUTE.
- Extend scripts/test-upgrade.sh with a B2 assertion that a pre-upgrade df user
  retains wrapper privileges after upgrade.
- Broaden the wait_for_schedule upgrade caveat from looped schedules to any
  in-flight WAIT_SCHEDULE node, because the recorded replay event sequence now
  includes utc_now before the timer.
---
 docs/spec-atomic-start.md        | 18 +++++++++---------
 docs/upgrade-testing.md          |  6 +++---
 scripts/test-upgrade.sh          | 17 +++++++++++++++++
 sql/pg_durable--0.2.3--0.2.4.sql | 27 +++++++++++++++++++++++++++
 4 files changed, 56 insertions(+), 12 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index 983a203..e10cbe7 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -322,10 +322,10 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a
   time that the loop replayed every generation. Fixed so the wait node recomputes the
   delay each generation from the orchestration's recorded clock (deterministic on
   replay). **Upgrade caveat:** this changes the recorded replay event sequence
-  (`utc_now` is recorded before the timer). A `df.loop(df.wait_for_schedule(...))`
-  instance that is already waiting when the binary changes may replay expecting the
-  old sequence and fail as nondeterministic; drain or restart those scheduled
-  instances during upgrade.
+  (`utc_now` is recorded before the timer). Any instance already waiting in a
+  `WAIT_SCHEDULE` node when the binary changes may replay expecting the old sequence
+  and fail as nondeterministic; drain or restart those scheduled instances during
+  upgrade.
 - **Self-healing (medium).** The reconciler is re-checked from the steady-state poll
   loop, not only once per worker epoch, so a cancelled one comes back within the poll
   interval.
@@ -356,11 +356,11 @@ the upgrade rolls out.
 **Data migration.** None. No table shapes change, and already-enqueued work items
 are left in place.
 
-**In-flight scheduled loops.** `df.wait_for_schedule` now records an `utc_now` event
-before the timer so a loop computes the next cron tick each generation. That changes
-the recorded replay event sequence for existing wait-in-loop instances. Drain or
-restart any in-flight `df.loop(df.wait_for_schedule(...))` instances during upgrade;
-otherwise they may fail on replay as nondeterministic.
+**In-flight scheduled waits.** `df.wait_for_schedule` now records an `utc_now` event
+before the timer. That changes the recorded replay event sequence for existing
+instances already waiting in any `WAIT_SCHEDULE` node. Drain or restart those
+in-flight scheduled waits during upgrade; otherwise they may fail on replay as
+nondeterministic.
 
 **Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if
 left in place, are simply unused.
diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md
index b7f8672..03d5ca1 100644
--- a/docs/upgrade-testing.md
+++ b/docs/upgrade-testing.md
@@ -215,13 +215,13 @@ what the upgrade script handles, and any backward compatibility considerations.
 - **Scenario B1/B2 considerations:** No schema/data migration and no new objects. The replaced bodies work against the existing schema and change no privileges already granted.
 
 #### Atomic in-transaction enqueue wrappers + `df.reconcile()`
-- **DDL change (df schema):** Adds three private `SECURITY DEFINER` wrappers: `df._enqueue_orchestrator_start(text, text, text)`, `df._enqueue_orchestrator_cancel(text, text)`, and `df._enqueue_orchestrator_signal(text, text, text)`. Fresh installs and the upgrade script create the same functions and `REVOKE EXECUTE ... FROM PUBLIC`; `df.grant_usage()` grants them explicitly to df users because `df.start()` / `df.cancel()` / `df.signal()` call them via SPI as the caller.
+- **DDL change (df schema):** Adds three private `SECURITY DEFINER` wrappers: `df._enqueue_orchestrator_start(text, text, text)`, `df._enqueue_orchestrator_cancel(text, text)`, and `df._enqueue_orchestrator_signal(text, text, text)`. Fresh installs and the upgrade script create the same functions and `REVOKE EXECUTE ... FROM PUBLIC`; `df.grant_usage()` grants them explicitly to df users because `df.start()` / `df.cancel()` / `df.signal()` call them via SPI as the caller. The upgrade script also backfills these wrapper grants to roles that already had explicit `USAGE` on schema `df` before `ALTER EXTENSION UPDATE`, preserving grant option where present.
 - **DDL change (df schema):** Adds admin-only `df.reconcile(integer)` (`SECURITY DEFINER`, `REVOKE EXECUTE ... FROM PUBLIC`) to delete orphaned duroxide instance subtrees and mark stuck `df.instances` rows failed. The background worker starts it through a built-in durable loop as the dedicated `df_reconciler` role.
 - **Provider coupling:** These wrappers and `df.reconcile()` call the duroxide-pg SQL surface directly (`enqueue_orchestrator_work`, `delete_instances_atomic`, and `_duroxide` tables). This is intentional: only direct SQL via SPI can share the caller's transaction. Non-PG providers (or older schemas without the wrappers) use the legacy out-of-band fallback.
-- **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`.
+- **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler, updates `grant_usage()`/`revoke_usage()`, and backfills existing df users in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`.
 - **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output.
 - **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers.
-- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the recorded replay event sequence (`utc_now` is recorded before the timer). Existing `df.loop(df.wait_for_schedule(...))` instances that are already waiting when the binary changes should be drained or restarted during upgrade; otherwise they may replay expecting the old sequence and fail as nondeterministic.
+- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` to compute from the orchestration's recorded clock. That fixes a loop busy-loop bug, but changes the recorded replay event sequence (`utc_now` is recorded before the timer). Any in-flight instance already waiting in a `WAIT_SCHEDULE` node when the binary changes should be drained or restarted during upgrade; otherwise it may replay expecting the old sequence and fail as nondeterministic.
 
 #### Rename `df.wait_for_completion()` to `df.await_instance()`
 - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working.
diff --git a/scripts/test-upgrade.sh b/scripts/test-upgrade.sh
index 1778543..bb0cfeb 100755
--- a/scripts/test-upgrade.sh
+++ b/scripts/test-upgrade.sh
@@ -914,6 +914,7 @@ echo ""
 B2_PRE_INSTANCE_ID=""
 B2_INFLIGHT_INSTANCE_ID=""
 B2_POST_INSTANCE_ID=""
+B2_PRE_GRANTED_ROLE="durable_b2_pre_grant"
 
 test_b2_data_survives_upgrade() {
     # Step 1: Install previous version and create test data
@@ -924,6 +925,14 @@ test_b2_data_survives_upgrade() {
     assert_sql_equals "SELECT df.clearvars();" "OK" || return 1
     assert_sql_equals "SELECT df.setvar('b2_key', 'b2_value');" "OK" || return 1
 
+    # Role granted under the previous schema. The upgrade must backfill EXECUTE
+    # on new private wrappers so this existing df user can keep calling
+    # df.start()/df.cancel()/df.signal() without re-running df.grant_usage().
+    run_sql_capture "DROP OWNED BY ${B2_PRE_GRANTED_ROLE};" >/dev/null 2>&1 || true
+    run_sql_capture "DROP ROLE IF EXISTS ${B2_PRE_GRANTED_ROLE};" >/dev/null 2>&1 || true
+    run_sql_capture "CREATE ROLE ${B2_PRE_GRANTED_ROLE} LOGIN;" >/dev/null || return 1
+    run_sql_capture "SELECT df.grant_usage('${B2_PRE_GRANTED_ROLE}');" >/dev/null || return 1
+
     B2_PRE_INSTANCE_ID=$(run_sql_capture "SELECT df.start('INSERT INTO test_upgrade_b2_log (kind, msg) VALUES (''pre'', ''{b2_key}'') RETURNING msg', 'b2-pre-upgrade');") || return 1
     B2_INFLIGHT_INSTANCE_ID=$(run_sql_capture "SELECT df.start(df.sleep(2) ~> 'SELECT ''b2-running'' AS value', 'b2-inflight');") || return 1
 
@@ -968,6 +977,12 @@ test_b2_new_data_after_upgrade() {
     assert_sql_equals "SELECT msg FROM test_upgrade_b2_log WHERE kind = 'post' ORDER BY id DESC LIMIT 1;" "new_value"
 }
 
+test_b2_existing_grants_after_upgrade() {
+    assert_sql_equals "SELECT has_function_privilege('${B2_PRE_GRANTED_ROLE}', 'df._enqueue_orchestrator_start(text, text, text)', 'EXECUTE');" "t" &&
+    assert_sql_equals "SELECT has_function_privilege('${B2_PRE_GRANTED_ROLE}', 'df._enqueue_orchestrator_cancel(text, text)', 'EXECUTE');" "t" &&
+    assert_sql_equals "SELECT has_function_privilege('${B2_PRE_GRANTED_ROLE}', 'df._enqueue_orchestrator_signal(text, text, text)', 'EXECUTE');" "t"
+}
+
 test_b2_grant_usage_after_upgrade() {
     # Regression guard for #110: after ALTER EXTENSION UPDATE, df.debug_connection()
     # must be gone from the catalog. Scenario A only compares function name/args/
@@ -996,6 +1011,7 @@ test_b2_grant_usage_after_upgrade() {
 
     # Clean up the probe role.
     run_sql_capture "DROP OWNED BY ${probe_role}; DROP ROLE IF EXISTS ${probe_role};" >/dev/null 2>&1 || true
+    run_sql_capture "DROP OWNED BY ${B2_PRE_GRANTED_ROLE}; DROP ROLE IF EXISTS ${B2_PRE_GRANTED_ROLE};" >/dev/null 2>&1 || true
 }
 
 if [ "$HAS_COMPAT_PREV" = true ]; then
@@ -1003,6 +1019,7 @@ if [ "$HAS_COMPAT_PREV" = true ]; then
     run_test "B2: Pre-upgrade instance remains queryable" test_b2_pre_upgrade_instance_after_upgrade
     run_test "B2: In-flight work completes after upgrade" test_b2_inflight_work_after_upgrade
     run_test "B2: New data and execution after upgrade" test_b2_new_data_after_upgrade
+    run_test "B2: Existing df users retain wrapper privileges after upgrade" test_b2_existing_grants_after_upgrade
     run_test "B2: df.grant_usage() works and df.debug_connection() is gone after upgrade" test_b2_grant_usage_after_upgrade
 fi
 
diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql
index f5f49f5..e6fc0fa 100644
--- a/sql/pg_durable--0.2.3--0.2.4.sql
+++ b/sql/pg_durable--0.2.3--0.2.4.sql
@@ -372,6 +372,33 @@ $fn$;
 
 REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM PUBLIC;
 
+-- Backfill wrapper EXECUTE to roles that already had df usage before ALTER
+-- EXTENSION UPDATE. New calls to df.grant_usage() grant these wrappers via the
+-- function body above, but existing users would otherwise lose df.start() /
+-- df.cancel() / df.signal() when the new .so chooses the atomic path.
+DO $$
+DECLARE
+    r RECORD;
+    grant_opt TEXT;
+BEGIN
+    FOR r IN
+        SELECT
+            pg_catalog.quote_ident(pg_catalog.pg_get_userbyid(a.grantee)) AS grantee,
+            pg_catalog.bool_or(a.is_grantable) AS with_grant_option
+        FROM pg_catalog.pg_namespace n
+        CROSS JOIN LATERAL pg_catalog.aclexplode(n.nspacl) AS a
+        WHERE n.nspname OPERATOR(pg_catalog.=) 'df'
+          AND a.privilege_type OPERATOR(pg_catalog.=) 'USAGE'
+          AND a.grantee OPERATOR(pg_catalog.<>) 0  -- skip PUBLIC
+        GROUP BY a.grantee
+    LOOP
+        grant_opt := CASE WHEN r.with_grant_option THEN ' WITH GRANT OPTION' ELSE '' END;
+        EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) TO %s', r.grantee) OPERATOR(pg_catalog.||) grant_opt;
+        EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) TO %s', r.grantee) OPERATOR(pg_catalog.||) grant_opt;
+        EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) TO %s', r.grantee) OPERATOR(pg_catalog.||) grant_opt;
+    END LOOP;
+END $$;
+
 CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60)
 RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint)
 LANGUAGE plpgsql

From 458a422ea1397d6eac65a123252006b5346ff35b Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 19:00:16 +0000
Subject: [PATCH 18/20] Fix pgspot search_path finding for SECURITY DEFINER
 functions

Use SET search_path = pg_catalog, pg_temp for the new SECURITY DEFINER wrappers
and df.reconcile() in both fresh-install SQL and the 0.2.3->0.2.4 upgrade
script. This satisfies the pgspot PS004 gate while preserving the fully qualified
SQL references inside the functions.
---
 sql/pg_durable--0.2.3--0.2.4.sql | 8 ++++----
 src/lib.rs                       | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql
index e6fc0fa..49f7d7c 100644
--- a/sql/pg_durable--0.2.3--0.2.4.sql
+++ b/sql/pg_durable--0.2.3--0.2.4.sql
@@ -224,7 +224,7 @@ CREATE FUNCTION df._enqueue_orchestrator_start(
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
@@ -291,7 +291,7 @@ CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason tex
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
@@ -317,7 +317,7 @@ CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text,
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
@@ -403,7 +403,7 @@ CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60)
 RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint)
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch        text := df.duroxide_schema();
diff --git a/src/lib.rs b/src/lib.rs
index d97fdff..dd7f7ac 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -547,7 +547,7 @@ CREATE FUNCTION df._enqueue_orchestrator_start(
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
@@ -637,7 +637,7 @@ CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason tex
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
@@ -663,7 +663,7 @@ CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text,
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch       text := df.duroxide_schema();
@@ -751,7 +751,7 @@ CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60)
 RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint)
 LANGUAGE plpgsql
 SECURITY DEFINER
-SET search_path = pg_catalog
+SET search_path = pg_catalog, pg_temp
 AS $fn$
 DECLARE
     sch        text := df.duroxide_schema();

From a38eb9f8aee33e072052c4635ac4aab93345307e Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 22:36:51 +0000
Subject: [PATCH 19/20] docs: shorten atomic consistency spec to high-level
 design

Trim implementation details that are visible in the diff and keep the spec focused
on the problem, chosen prevent/repair design, provider coupling, behavior changes,
upgrade caveats, validation, and remaining risks.
---
 docs/spec-atomic-start.md | 547 ++++++++++----------------------------
 1 file changed, 146 insertions(+), 401 deletions(-)

diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md
index e10cbe7..683ecf1 100644
--- a/docs/spec-atomic-start.md
+++ b/docs/spec-atomic-start.md
@@ -1,408 +1,153 @@
 # Keeping the control plane and the runtime consistent
 
-**Status:** Implemented and validated. Atomic `df.start`/`df.cancel`/`df.signal`,
-plus a reconciler that repairs leftover drift.
+**Status:** Implemented and validated. Atomic `df.start` / `df.cancel` /
+`df.signal`, plus a reconciler that repairs leftover drift.
 **Author:** pg_durable team
 **Date:** June 2026
 
-## The problem
+## Problem
 
-A durable function lives in two stores that are written separately:
-
-- the **pg_durable control plane** — `df.nodes` and `df.instances`, written on the
-  caller's transaction; and
-- the **duroxide runtime** — `_duroxide.orchestrator_queue`, where the orchestration
-  is enqueued.
-
-Originally `df.start()` wrote the first on the caller's transaction but enqueued the
-second out-of-band, on a separate connection pool that committed on its own. The two
-were never atomic, so they could disagree.
+A durable function has state in two places:
 
-The common failure was an **orphaned orchestration**: a `df.start()` whose
-surrounding transaction rolled back (or whose batch aborted on a primary-key
-collision) left a row in `_duroxide` with no matching `df.*` rows. The background
-worker then kept trying to load a graph that did not exist, failed after a 5-second
-timeout, and the orphan stayed behind. Under load these orphans accumulated and
-saturated the worker's two threads.
-
-We observed this directly: a `df.start()` inside a rolled-back transaction left
-`_duroxide.instances` row `48819f12` with zero rows in `df.instances`, surviving the
-rollback.
-
-## What we did
-
-Two complementary changes:
-
-1. **Prevent** the drift at the source. `df.start()`, `df.cancel()`, and
-   `df.signal()` now enqueue their runtime work on the **caller's transaction**, so
-   the control-plane writes and the runtime enqueue commit or roll back together.
-2. **Repair** what prevention cannot cover. A reconciler periodically deletes
-   leftover orphans and marks stuck instances failed — catching drift from crashes
-   mid-execution and from the legacy fallback path.
-
-Both are pg_durable-only. No `duroxide` / `duroxide-pg` or `Cargo.*` changes.
-
-## Designs we considered
-
-Four approaches were weighed for keeping the two stores in agreement.
-
-**Option 1 — One store.** Drop `df.nodes` / `df.instances` entirely and keep all
-state in `_duroxide`, exposing what pg_durable needs through views.
-*Strongest guarantee* (only one store can't diverge) but the most disruptive: it
-moves the `submitted_by` security boundary, rebuilds per-user row security over
-runtime-owned tables, re-exposes every read path, and needs a destructive migration.
-A possible long-term direction, not this change.
-
-**Option 2 — Run our writes inside duroxide's transaction.** Extend duroxide-pg so
-the enqueue also performs the `df.*` writes, on duroxide's worker-pool connection.
-This makes the two stores atomic — but on the *worker's* connection, so `df.start()`
-no longer joins the *caller's* transaction and the `df.*` writes run as the pool
-role, breaking identity capture and row security. Strictly worse than Option 3.
-
-**Option 3 — Run duroxide's enqueue inside the caller's transaction.** Keep both
-stores; call the runtime's enqueue over SPI on the caller's transaction, next to the
-`df.*` writes. Everything commits or rolls back together, identity capture and the
-read paths are untouched, and the 5-second load race disappears. The only cost is a
-`SECURITY DEFINER` wrapper, because the runtime queue is writable by its owner only.
-**Chosen as the primary fix.**
-
-**Option 4 — Tolerate drift, repair it later.** Accept a brief inconsistency window
-and add a reconciler that finds and fixes mismatches. Adds no guarantee on its own,
-but it is the only option that also catches drift from crashes *mid-execution* —
-something no start-time atomicity can prevent. **Chosen as a backstop.**
-
-| Option | Atomic with `df.*` | Atomic with caller's tx | Removes 5 s race | Fixes crash-time drift | Effort |
-|--------|:--:|:--:|:--:|:--:|:--:|
-| 1 — one store | n/a | no | yes | yes | Large |
-| 2 — our writes in duroxide's tx | yes | **no** | yes | no | Medium |
-| 3 — enqueue in caller's tx | yes | **yes** | **yes** | no | Small |
-| 4 — async reconciler | repairs | — | no | **yes** | Small |
-
-We implemented **Option 3 (prevent) plus Option 4 (repair)**. Options 1 and 2 were
-rejected: Option 1 is out of proportion to the problem, and Option 2 is a weaker
-version of Option 3.
-
-## How a start is written
-
-`df.start()` (`src/dsl.rs`) runs these steps on the caller's transaction:
-
-1. Pick an instance id (`short_id()` — eight hex characters).
-2. Validate (database exists, caller can log in, superuser policy).
-3. Insert the graph rows into `df.nodes`.
-4. Insert the instance row into `df.instances`.
-5. Snapshot the caller's `df.vars`.
-6. Enqueue the orchestration.
-
-The fix is small because the start is **a single enqueue**: one
-`StartOrchestration` row in `_duroxide.orchestrator_queue`. The worker builds the
-`_duroxide.instances` and history rows later. And because duroxide-pg exposes every
-operation as a SQL function in the `_duroxide` schema, that enqueue can run over SPI
-on the caller's transaction — no separate pool needed.
-
-### What each store owns
-
-The fix must not move any of these boundaries.
-
-| Data | Source of truth | Notes |
-|------|-----------------|-------|
-| Graph (`df.nodes`) | `df.*` | The worker loads it; duroxide does not store it. |
-| Execution history | `_duroxide.history` | Runtime-internal; transactional per turn. |
-| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | The worker needs it to load and authenticate. |
-| `df.instances.status` | best-effort mirror | Written by the worker; self-heals, not atomic with the runtime. |
-| `df.nodes.status/result/error` | best-effort mirror | Same as above. |
-
-`submitted_by` is a **security boundary**: it is captured as the calling role at
-start time and must stay unforgeable. The worker re-checks the superuser policy when
-it loads the graph.
-
-### Where the two stores were written together
-
-Auditing every writer, the cross-store paths are bounded. All three now enqueue on
-the caller's transaction:
-
-| Caller | `df.*` write | `_duroxide` write | On caller's tx now? |
-|--------|------|------|:--:|
-| `df.start()` | insert `df.nodes` + `df.instances` | enqueue `StartOrchestration` | yes |
-| `df.cancel()` | update `df.instances.status` | enqueue `CancelInstance` | yes |
-| `df.signal()` | none (only a row-security read) | enqueue `ExternalRaised` (+ fan-out) | yes |
-
-- **`df.cancel()`** was a genuine dual-write: it enqueued the cancel, then updated
-  the status mirror. The status update is only an optimistic hint — the worker
-  re-applies the authoritative `cancelled` status when it processes the cancel — so
-  any disagreement was already transient. Both now commit together.
-- **`df.signal()`** writes only `_duroxide`, so it never created cross-store drift.
-  Moving it onto the caller's transaction simply gives it the same rule as the
-  others: *if my transaction rolls back, the signal is not delivered.*
-
-The worker also keeps the `df.*` status columns in sync as activities run. Those
-writes are eventually consistent and self-healing by design; the reconciler cleans
-up any lasting drift.
-
-## The mechanism
-
-Step 6 of `df.start()` is now an SPI call on the caller's transaction, right after
-the `df.instances` insert:
-
-```text
-INSERT INTO df.nodes ...                              -- step 3
-INSERT INTO df.instances ...                          -- step 4
-SELECT df._enqueue_orchestrator_start($id, $name, $input);  -- step 6 (SPI)
-```
-
-Because the enqueue is an ordinary SPI statement, it shares the caller's
-transaction: a rollback drops the queue row along with the `df.*` rows, and a commit
-makes them visible together. `df.cancel()` and `df.signal()` work the same way
-through their own wrappers.
-
-### Why this reaches into duroxide-pg directly (and needs the PG provider)
-
-Everywhere else, pg_durable talks to the runtime through duroxide's Rust
-provider/client API — the out-of-band start/cancel/signal fallback (`src/client.rs`)
-and the monitoring/explain reads (`src/monitoring.rs`, `src/explain.rs`) all go
-through `Client`/`Provider` methods, which are provider-agnostic at the API level.
-The in-transaction enqueue and `df.reconcile()` are the **first places pg_durable
-calls duroxide-pg's SQL surface directly**: `enqueue_orchestrator_work` and
-`delete_instances_atomic`, plus reads of `_duroxide.orchestrator_queue`,
-`instances`, and `executions`. They therefore depend on duroxide-pg's table shapes
-and function signatures, not just on the runtime existing.
-
-This is deliberate, and it is the whole point of the change: only direct SQL (via
-SPI) can run inside the **caller's** transaction. The Rust client uses a separate
-connection pool, so it can never be atomic with the backend's SPI work. The trade is
-a deeper coupling to duroxide-pg in exchange for atomicity.
-
-It would not work for a non-PostgreSQL provider — there would be no SQL functions to
-call — but pg_durable already assumes a duroxide-pg provider in a known schema: it
-places its own `_worker_ready` table inside `_duroxide`, resolves the schema via
-`df.duroxide_schema()`, and relies on duroxide-pg applying its migrations at startup.
-In the pg_durable context — a PostgreSQL extension whose runtime state lives in the
-same database — there is no real use case for another provider. The probe-and-
-fallback (below) keeps any non-PG provider working, non-atomically, rather than
-broken.
-
-### Why a privileged wrapper is needed
-
-`_duroxide.orchestrator_queue` is writable by its owner only (the worker role), so a
-normal caller cannot insert into it. The enqueue therefore runs through a
-`SECURITY DEFINER` wrapper in the `df` schema, owned by the extension owner. Each
-wrapper pins `search_path` and schema-qualifies every reference (per CVE-2018-1058,
-enforced by `scripts/pgspot-gate.sh`).
-
-The wrappers resolve the runtime schema at call time via `df.duroxide_schema()`
-(`_duroxide` on current installs, legacy `duroxide` on older ones) and quote it
-with `%I`. They are revoked from `PUBLIC` and granted through `df.grant_usage()`.
-
-Because a `SECURITY DEFINER` function runs as its owner, the wrappers cannot trust
-the caller. Two safeguards make them safe to grant to every user:
-
-- **The work item is built inside the wrapper**, from validated arguments — never
-  passed in. A caller cannot choose the work-item type or smuggle in a foreign
-  target. `df.start` only accepts the public root graph-executor orchestration and
-  requires the input JSON's `instance_id` to match the target id; `df.cancel` builds
-  `CancelInstance`; `df.signal` builds `ExternalRaised`. All three use
-  `json_build_object`, matching duroxide's wire format.
-- **The caller is authorized before any enqueue**, by two different rules:
-  - *Start* targets a brand-new instance, so it authorizes by *state*: it permits
-    the enqueue only for a `pending` `df.instances` row that has no queue entry and
-    no runtime instance yet. Under the atomic path a committed instance always has
-    its queue row in the same transaction, so this state is reachable only for the
-    row the caller just inserted — never someone else's instance.
-  - *Cancel* and *signal* target an already-committed instance, so they authorize by
-    *ownership*: `pg_has_role(session_user, <instance owner>, 'MEMBER')`.
-    `session_user` is the real authenticated role (it cannot be spoofed inside a
-    `SECURITY DEFINER` function), and checking membership rather than an exact name
-    means a role that owns the instance through `SET ROLE` still qualifies. A
-    non-member is rejected before anything is enqueued — the same gate `df.cancel` /
-    `df.signal` already enforce through row security.
-
-### Signal fan-out
-
-A signal must reach the root instance **and** every running sub-orchestration,
-because a child (a `JOIN`/`RACE` branch or a loop generation) may be the one waiting
-on it. Duroxide does not buffer external events until an orchestration has a pending
-subscription, so `df.signal` first requires the root runtime row to exist; a signal
-sent in the same transaction as `df.start()` is rejected instead of returning `OK`
-and being skipped before the workflow can observe it. Once the root is materialized,
-the wrapper walks the instance tree (`_duroxide.instances.parent_instance_id`) and
-enqueues one event for the root plus each running descendant — all on the caller's
-transaction, so the whole fan-out is atomic.
-
-### Worker readiness and ordering
-
-- The existing readiness gate still applies: if the worker has not initialized the
-  runtime schema yet, `df.start()` fails clearly (and atomically — the `df.*`
-  inserts roll back).
-- The enqueue is the last write in `df.start()`. A primary-key collision on
-  `df.nodes` / `df.instances` still aborts before it, now cleanly inside one
-  transaction.
-
-### The 5-second load race is gone (on the atomic path)
-
-The worker's `load-function-graph` activity used to wait up to five seconds for the
-`df.*` rows to appear, because it could dequeue a start before they committed. On
-the atomic path the queue row becomes visible only *after* those rows commit in the
-same transaction, so that wait can't trigger. The retry is kept for the
-non-atomic fallback path, where it remains a no-op on the atomic path.
-
-## The reconciler
-
-`df.reconcile(p_grace_seconds integer DEFAULT 60)` is a `SECURITY DEFINER`,
-admin-only function that repairs leftover drift:
-
-- It deletes orphaned **root** runtime instances — those with no parent, no matching
-  `df.instances` row, and older than the grace window — gathering each orphan's full
-  subtree so the delete is accepted. Sub-orchestrations (branches and loop
-  generations) have no `df.instances` row of their own and are excluded by the root
-  filter, so they are never collected on their own.
-- It marks `df.instances` rows that are stuck non-terminal — no live runtime
-  instance and no queued start — as `failed`.
-
-Each pass is wrapped so a single failure never aborts the reconcile or stops the
-loop.
-
-The background worker runs the reconciler as a **built-in durable cron loop**,
-keeping exactly one instance per cluster. It reads the schedule, connects as the
-`df_reconciler` role (`SET ROLE`), and starts:
-
-```sql
--- $1 = the pg_durable.reconciler_cron value, $2 = 'df_reconciler'
-SELECT df.start(
-  df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule($1))),
-  $2);
-```
-
-`worker::ensure_reconciler` starts it, skips if one is already pending or running,
-and restarts it if it died. It is driven by the `pg_durable.reconciler_cron` setting
-(default `*/5 * * * *`; empty disables it) and submitted by a dedicated
-**non-superuser** role, `df_reconciler` (so `submitted_by` is `df_reconciler`). A
-non-superuser identity keeps it clear of the superuser-instance guard and limits what
-the instance can do to "run the reconciler" even if it were somehow forged.
-
-## Behavior change for users
-
-`df.start()` now takes part in the caller's transaction:
-
-```sql
-BEGIN;
-SELECT df.start('SELECT 1');   -- enqueued on THIS transaction
-ROLLBACK;                      -- fully undone; nothing runs
-```
-
-Previously the `df.*` rows rolled back but the orchestration still ran. The new
-behavior is the least surprising one and is the deliberate decision of this spec,
-but it *is* a change from "fires the moment the statement runs." `df.cancel()` and
-`df.signal()` gain the same property. This should be called out in `USER_GUIDE.md`.
-
-## Implementation notes
-
-- The atomic path requires the duroxide-pg provider (see *Why this reaches into
-  duroxide-pg directly* above). `df.start` / `df.cancel` / `df.signal` each probe for
-  its SQL surface (`enqueue_orchestrator_work` in the resolved schema); when it is
-  absent — a different provider, or a schema predating the wrappers — they log and
-  fall back to the old out-of-band path, so the change never breaks another provider
-  or contaminates `SELECT df.start(...)` output with client-visible warnings.
-- `visible_at = now()` (transaction time) is enough for immediate starts.
-- The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker
-  behaves exactly as before.
-
-### Hardening applied during review
-
-- **Cross-tenant enqueue (high).** The start wrapper first accepted an opaque work
-  item and was granted to everyone, which would let a caller forge a cancel or
-  signal against another instance. Fixed by building the work item inside the
-  wrapper and authorizing only a brand-new instance.
-- **Reconciler deleting live work (high).** The runtime refuses to delete a parent
-  without its children, so deleting only orphan roots failed on any parallel-workflow
-  orphan. Fixed by gathering the full subtree and isolating each pass so one failure
-  can't stop the loop.
-- **Suppressing the reconciler (high).** Its single-instance check first keyed on the
-  user-writable label, so a user could park a same-labelled instance to block it.
-  Fixed by keying on the unforgeable `df_reconciler` submitter.
-- **Cron busy-loop (medium).** `df.wait_for_schedule` baked a fixed delay at build
-  time that the loop replayed every generation. Fixed so the wait node recomputes the
-  delay each generation from the orchestration's recorded clock (deterministic on
-  replay). **Upgrade caveat:** this changes the recorded replay event sequence
-  (`utc_now` is recorded before the timer). Any instance already waiting in a
-  `WAIT_SCHEDULE` node when the binary changes may replay expecting the old sequence
-  and fail as nondeterministic; drain or restart those scheduled instances during
-  upgrade.
-- **Self-healing (medium).** The reconciler is re-checked from the steady-state poll
-  loop, not only once per worker epoch, so a cancelled one comes back within the poll
-  interval.
-- **Silent fallback (medium).** `df.start` now logs when it uses the non-atomic
-  fallback.
-- **Kept `LOGIN` on `df_reconciler`** (debated): the worker runs the reconcile node
-  by connecting *as* the role, like every other durable-function role, so `NOLOGIN`
-  would break it.
-
-> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the worker-owned
-> `_duroxide` schema half-dropped (its migration row remains but functions are gone),
-> which shows up as flaky `JOIN` behavior and a missing `_worker_ready`. Recover with
-> `DROP SCHEMA _duroxide CASCADE` and a restart so the worker re-applies its
-> migrations.
-
-## Upgrade and migration
-
-**New binary against an older schema.** The wrappers live in the `df` schema, so
-they ship in the extension's install SQL and an upgrade script
-(`sql/pg_durable--<prev>--<current>.sql`, `CREATE FUNCTION` + `GRANT`), with a
-"Version-Specific Changes" entry in `docs/upgrade-testing.md`. A new binary running
-against an older `df` schema that predates the wrappers must still work: each caller
-uses the atomic path only when both the duroxide-pg SQL enqueue function **and** the
-`df._enqueue_orchestrator_*` wrappers exist; otherwise it falls back to the old
-out-of-band path. This keeps the binary compatible with every prior schema while
-the upgrade rolls out.
-
-**Data migration.** None. No table shapes change, and already-enqueued work items
-are left in place.
-
-**In-flight scheduled waits.** `df.wait_for_schedule` now records an `utc_now` event
-before the timer. That changes the recorded replay event sequence for existing
-instances already waiting in any `WAIT_SCHEDULE` node. Drain or restart those
-in-flight scheduled waits during upgrade; otherwise they may fail on replay as
-nondeterministic.
-
-**Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if
-left in place, are simply unused.
-
-**Future direction.** The privileged enqueue could move into duroxide-pg, owned by
-the runtime schema owner — cleaner layering, at the cost of a coordinated
-duroxide-pg release. The `df`-schema wrapper avoids that coordination and is what
-ships today.
-
-## Validation (PostgreSQL 17)
-
-Prevent:
-
-| Check | Result |
-|-------|:--:|
-| Committed `df.start()` runs to completion | pass |
-| Rolled-back `df.start()` leaves no `df.*` and no `_duroxide` orphan | pass |
-| Rolled-back `df.cancel()` leaves the instance running and rolls back the enqueue; committed cancel takes effect | pass |
-| Rolled-back `df.signal()` rolls back the fan-out enqueue; committed signal delivers (incl. fan-out into a sub-orchestration) | pass |
-| A caller cannot enqueue a start, cancel, or signal against a foreign or arbitrary instance; cancel/signal succeed for the owning role (incl. via `SET ROLE`) | pass |
-| Schema resolved dynamically; atomicity holds | pass |
-| E2E `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 |
-| Broader E2E subset (core, conditionals, loops, variables, join, race) | pass |
-
-Repair:
-
-| Check | Result |
-|-------|:--:|
-| `df.reconcile()` deletes an orphan root with children, leaves sub-orchestrations and live instances intact | pass |
-| `df.reconcile()` marks a stuck instance failed | pass |
-| Reconciler auto-starts as non-superuser `df_reconciler`, stays a single instance | pass |
-| Reconciler fires on the cron schedule without spinning, and restarts after being cancelled | pass |
-| E2E subset unaffected with the reconciler running | pass |
-| `cargo fmt --check` clean; no new clippy warnings | pass |
-
-## Open questions
-
-- **Enqueue placement** — keep the `df`-schema wrapper, or move it into duroxide-pg
-  (cleaner layering, but a coordinated release). The wrapper ships today; the move is
-  a future option.
-- **Contract stability** — pg_durable now depends on the enqueue function signature
-  and the `WorkItem` JSON shape. Mitigated by the version-locked
-  duroxide / duroxide-pg pair.
-- **Reconciler policy** — the grace window and cron cadence still want tuning, and
-  `df_reconciler` is created by the worker rather than provisioned externally.
+- **pg_durable control plane** — `df.nodes` and `df.instances`, written on the
+  caller's PostgreSQL transaction.
+- **duroxide runtime** — `_duroxide` queue/history/instance state, previously
+  written out-of-band through a separate connection.
+
+Those writes were not atomic. The visible failure was a rolled-back `df.start()`:
+`df.*` rows rolled back, but the duroxide enqueue survived. The worker then retried
+an orchestration whose graph no longer existed, waited up to 5 seconds for rows that
+would never appear, and left an orphan behind. We reproduced this with rolled-back
+starts leaving `_duroxide.instances` rows that had no matching `df.instances` row.
+
+`df.cancel()` and `df.signal()` had the same transaction-boundary problem for their
+runtime enqueue: a rollback of the caller's transaction did not roll back the runtime
+work.
+
+## Decision
+
+Implement **prevent + repair**:
+
+1. **Prevent:** enqueue `df.start()`, `df.cancel()`, and `df.signal()` runtime work
+   inside the caller's transaction via SPI. The control-plane writes and runtime
+   enqueue now commit or roll back together.
+2. **Repair:** run a lightweight reconciler that removes leftover runtime orphans
+   and marks stuck control-plane rows failed. This catches legacy/fallback drift and
+   crash-time drift that transaction-local enqueue cannot prevent.
+
+## Alternatives considered
+
+| Option | Summary | Decision |
+|---|---|---|
+| Single source of truth in `_duroxide` | Move all state into the runtime schema and rebuild pg_durable reads/security over it. | Too large and migration-heavy for this fix. |
+| Run pg_durable writes inside duroxide's transaction | Let duroxide-pg perform the `df.*` writes on its worker-pool connection. | Atomic, but not with the caller's transaction; breaks identity capture and row security expectations. |
+| Run duroxide enqueue inside the caller's transaction | Keep both stores and use SPI to enqueue runtime work next to `df.*` writes. | Chosen primary fix. |
+| Async reconciler only | Tolerate drift and repair it later. | Useful backstop, but insufficient alone. |
+
+## Design overview
+
+`df.start()` still creates the graph and instance rows in `df.*`. The final enqueue
+step now happens through SQL on the same transaction, so a rollback undoes both the
+control-plane rows and the runtime queue row. `df.cancel()` and `df.signal()` follow
+the same transaction rule.
+
+The runtime queue is not writable by ordinary users, so the enqueue goes through
+private `SECURITY DEFINER` wrappers in the `df` schema. These wrappers are granted
+through `df.grant_usage()`, build the runtime work items themselves, and perform
+their own authorization checks before writing to `_duroxide`.
+
+The start wrapper is intentionally **not** a general-purpose privileged runtime
+entrypoint. It only starts the root function-graph orchestration and validates that
+the input targets the same instance id. Cancel/signal wrappers authorize against the
+instance owner.
+
+### Direct duroxide-pg coupling
+
+This is the abstraction break in the design. Most pg_durable code talks to the
+runtime through duroxide's Rust provider/client API. That API uses a separate
+connection pool, so it cannot share the caller's backend transaction.
+
+To get caller-transaction atomicity, this PR calls the duroxide-pg SQL surface
+directly (`enqueue_orchestrator_work`, `delete_instances_atomic`, and selected
+runtime tables). That only works with a PostgreSQL-backed provider. In practice,
+pg_durable is itself a PostgreSQL extension whose runtime state lives in the same
+database, so a non-PG provider is not a meaningful deployment target; still, the
+code probes for the SQL surface and falls back to the old out-of-band path when it
+is absent.
+
+### Signals
+
+`df.signal()` fans out to the root instance and any running sub-orchestrations,
+because a child branch may be the one waiting on the signal.
+
+Duroxide does not buffer external events until an orchestration is ready to receive
+them. Therefore a signal sent before the root runtime row exists is rejected instead
+of returning `OK` and being silently skipped. Once the runtime row exists, the
+signal enqueue is atomic with the caller's transaction.
+
+### Reconciler
+
+`df.reconcile()` is an admin-only backstop. It:
+
+- deletes orphaned runtime **root** instances whose full subtree has no matching
+  `df.instances` row; and
+- marks stuck `df.instances` rows failed when there is no live runtime instance and
+  no queued start.
+
+The background worker keeps one reconciler durable loop running per cluster on
+`pg_durable.reconciler_cron` (default `*/5 * * * *`; empty disables it), submitted
+by the dedicated non-superuser role `df_reconciler`.
+
+## Behavior changes
+
+- `df.start()`, `df.cancel()`, and `df.signal()` now participate in the caller's
+  transaction. For example, `BEGIN; SELECT df.start(...); ROLLBACK;` no longer
+  starts the workflow on the atomic path.
+- If the duroxide-pg SQL surface or the `df` wrappers are missing, pg_durable logs
+  and falls back to the previous non-atomic client path. The fallback is not emitted
+  as a client-visible SQL `WARNING`, so scripts that capture `SELECT df.start(...)`
+  output remain compatible.
+- `df.wait_for_schedule` now records an `utc_now` event before the timer so repeated
+  schedule waits compute the next cron tick each generation. This fixes a loop
+  busy-loop bug, but changes the recorded replay event sequence.
+
+## Upgrade and compatibility
+
+The wrappers and `df.reconcile()` are `df`-schema objects, so they are included in
+both fresh-install SQL and the `0.2.3 -> 0.2.4` upgrade script. The upgrade script
+also updates `df.grant_usage()` / `df.revoke_usage()` and backfills wrapper EXECUTE
+privileges to existing roles that already had explicit `USAGE` on schema `df`.
+
+Binary backward compatibility is preserved for old schemas that have not yet run
+`ALTER EXTENSION UPDATE`: the new binary uses the atomic path only when both the
+duroxide-pg enqueue function and the `df._enqueue_orchestrator_*` wrappers exist;
+otherwise it falls back to the old out-of-band client path.
+
+There is no table data migration.
+
+**Upgrade caveat:** drain or restart any in-flight instance already waiting in a
+`WAIT_SCHEDULE` node during upgrade. Those histories may replay expecting the old
+sequence and fail as nondeterministic after the `utc_now` change.
+
+## Validation
+
+Automated coverage added in this PR:
+
+- `24_atomic_rollback` — rolled-back start/cancel/signal leave no runtime effect;
+  signaling before runtime materialization is rejected.
+- `25_enqueue_wrapper_authz` — wrapper authorization and start-wrapper hardening.
+- `26_reconcile_orphan_gc` — orphan root with child sub-orchestrations is collected
+  as a full subtree; healthy instances remain untouched.
+- `scripts/test-upgrade.sh` — schema equivalence, binary compatibility, data
+  compatibility, and existing-role wrapper-grant backfill.
+
+Manual/targeted validation included signal fan-out, cancel consistency, reconciler
+liveness, cron scheduling, formatting, clippy, pgspot, and upgrade tests.
+
+## Remaining risks / follow-up
+
+- The direct duroxide-pg SQL dependency is intentional but should remain small and
+  well documented. Moving the privileged enqueue surface into duroxide-pg would be a
+  cleaner long-term boundary.
+- Reconciler grace/cadence and role provisioning may need tuning after operational
+  experience.

From 73a8cfe5792a10e1b994a31d2cd9a88ec8e072fd Mon Sep 17 00:00:00 2001
From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com>
Date: Fri, 19 Jun 2026 23:22:10 +0000
Subject: [PATCH 20/20] docs: clarify start wrapper authorization comment

The SECURITY DEFINER start wrapper allows same-owner pending instances that have
not yet been enqueued; that state can be created directly by a df user, not only
by df.start in the same transaction. Clarify the comment and point to the actual
safety properties: fixed root orchestration, matching input instance id, and no
foreign already-started target.
---
 sql/pg_durable--0.2.3--0.2.4.sql | 13 +++++++------
 src/lib.rs                       | 13 +++++++------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql
index 49f7d7c..947c9b7 100644
--- a/sql/pg_durable--0.2.3--0.2.4.sql
+++ b/sql/pg_durable--0.2.3--0.2.4.sql
@@ -248,12 +248,13 @@ BEGIN
     -- Authorization. This runs as the (privileged) definer, so it must not
     -- trust the caller to only target their own instance. Permit the enqueue
     -- only for a brand-new, not-yet-started instance: a 'pending' df.instances
-    -- row with no orchestrator-queue entry and no duroxide instance. Under the
-    -- atomic-start path a committed df.instances row always has its queue row in
-    -- the same transaction, so this state is reachable only for the instance the
-    -- caller (df.start) just inserted in the current transaction — never another
-    -- user's committed instance. This is what stops a caller from forging work
-    -- against a foreign instance.
+    -- row with no orchestrator-queue entry and no duroxide instance. A df user
+    -- can create such a pending instance directly (not only through df.start),
+    -- so this is a same-owner/not-yet-started check rather than proof that the
+    -- row was inserted in this transaction. The wrapper is safe because it also
+    -- fixes the orchestration to the root graph executor and validates the input
+    -- instance id, so callers cannot start internal orchestrations or target
+    -- someone else's already-started instance.
     EXECUTE pg_catalog.format(
         'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') '
         '    OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) '
diff --git a/src/lib.rs b/src/lib.rs
index dd7f7ac..2debab9 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -571,12 +571,13 @@ BEGIN
     -- Authorization. This runs as the (privileged) definer, so it must not
     -- trust the caller to only target their own instance. Permit the enqueue
     -- only for a brand-new, not-yet-started instance: a 'pending' df.instances
-    -- row with no orchestrator-queue entry and no duroxide instance. Under the
-    -- atomic-start path a committed df.instances row always has its queue row in
-    -- the same transaction, so this state is reachable only for the instance the
-    -- caller (df.start) just inserted in the current transaction — never another
-    -- user's committed instance. This is what stops a caller from forging work
-    -- against a foreign instance.
+    -- row with no orchestrator-queue entry and no duroxide instance. A df user
+    -- can create such a pending instance directly (not only through df.start),
+    -- so this is a same-owner/not-yet-started check rather than proof that the
+    -- row was inserted in this transaction. The wrapper is safe because it also
+    -- fixes the orchestration to the root graph executor and validates the input
+    -- instance id, so callers cannot start internal orchestrations or target
+    -- someone else's already-started instance.
     EXECUTE pg_catalog.format(
         'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') '
         '    OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) '