From c90689cf8d90f459cb9e6b0eeaa92019371c55a4 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Thu, 18 Jun 2026 21:29:24 +0000 Subject: [PATCH 01/20] Add spec: atomic df.start() to eliminate df/_duroxide divergence Proposes making the duroxide start-enqueue part of the caller's backend transaction (Option 3) so df.start() is atomic: rollback leaves neither df.* rows nor a _duroxide orchestrator-queue orphan, and the worker only picks up an instance after its df.* rows commit. Grounded in the current start path (single orchestrator_queue INSERT via _duroxide.enqueue_orchestrator_work), the SECURITY DEFINER privilege model needed for the enqueue, the resulting removal of the load-function-graph 5s race, alternatives considered, and an Upgrade & Migration section. --- docs/spec-atomic-start.md | 370 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 370 insertions(+) create mode 100644 docs/spec-atomic-start.md diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md new file mode 100644 index 0000000..06c9903 --- /dev/null +++ b/docs/spec-atomic-start.md @@ -0,0 +1,370 @@ +# Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence + +**Status:** Proposal +**Author:** pg_durable team +**Date:** June 2026 + +## Overview + +`df.start()` writes to two independent stores that are **not** committed together: + +- the **pg_durable control plane** (`df.nodes`, `df.instances`) — written on the + caller's backend transaction via SPI; and +- the **duroxide runtime** (`_duroxide.orchestrator_queue`) — enqueued + out-of-band on a *separate* sqlx connection pool, committed independently. + +Because the two halves are not atomic, the stores can diverge. The most common +failure is the "rolled-back start" leak: a `df.start()` whose surrounding +transaction is rolled back (or whose batch subtransaction aborts on a PK +collision) leaves an **orphaned orchestration** in `_duroxide` with no +corresponding `df.*` rows. The background worker then repeatedly tries to load a +graph that does not exist, fails after a 5 s timeout, and the orphan persists in +`_duroxide.instances`. + +This spec proposes making the duroxide start-enqueue part of the caller's +backend transaction (design **Option 3** from the design exploration), so that +`df.start()` is atomic: either everything commits, or nothing does. + +### Evidence (observed during investigation) + +- A `df.start()` executed inside a transaction that was then `ROLLBACK`-ed left a + row in `_duroxide.instances` (e.g. `48819f12`) with **zero** rows in + `df.instances`. The orphan survived the rollback. +- Under load, orphaned start items self-sustain via retry and saturate the + 2-thread worker with 5 s "Instance not found … transaction may have been + rolled back" failures. +- The short-id PK-collision path (8 hex chars ≈ 2³²) reliably aborts `df.start()` + well under 1M instances (observed first-collision draws: 9248, 34845, 40445, + 60776). That abort is *safe* today (it happens before the duroxide enqueue), + but it exercises the same dual-write seam. + +## Background: how a start is written today + +`crate::dsl::start()` (src/dsl.rs) executes, in order, on the **caller's** +transaction unless noted: + +1. `instance_id = short_id()` (src/dsl.rs:645). +2. Read-only validation (database exists, caller has `LOGIN`, superuser policy). +3. `insert_nodes()` → INSERT rows into `df.nodes` (src/dsl.rs:842). +4. INSERT row into `df.instances` (src/dsl.rs:888). +5. Capture `df.vars` snapshot. +6. `start_durable_function()` → `Client::start_orchestration()` → + `store.enqueue_for_orchestrator(StartOrchestration)` (src/dsl.rs:923) — runs on + the cached `DUROXIDE_CLIENT` sqlx pool (src/client.rs), **separate + connection, independently committed.** Failures here are logged, not raised + (src/dsl.rs:928). + +Key facts that make the fix small: + +- **The start is a single enqueue.** `Client::start_orchestration` builds one + `WorkItem::StartOrchestration` and calls `enqueue_for_orchestrator` + (duroxide `client/mod.rs`). The duroxide-pg provider implements that as one + call to the SQL function `_duroxide.enqueue_orchestrator_work(...)`, whose body + is a single `INSERT INTO _duroxide.orchestrator_queue`. The + `_duroxide.instances`/`history` rows are materialized **later** by the worker. +- **The atomic unit is therefore just** `{df.nodes INSERTs, df.instances INSERT, + one orchestrator_queue INSERT}`. +- **duroxide-pg is SQL-function based.** Every provider operation is a function + in the `_duroxide` schema, so the enqueue can be invoked directly via SPI on + the caller's transaction — no async pool required for the start path. + +### What each store owns (so we know what must *not* change) + +| Data | Source of truth | Notes | +|------|-----------------|-------| +| Graph (`df.nodes`) | `df.*` | duroxide does not store it; the worker loads it via the `load-function-graph` activity. | +| Execution history | `_duroxide.history` | duroxide-internal; transactional per turn via `ack_orchestration_item`. | +| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | needed by the worker to load + authenticate. | +| `df.instances.status` | best-effort mirror | written by the `update-instance-status` activity; not atomic with duroxide, self-heals. | + +`submitted_by` is a **security boundary**: it is captured as `current_user` at +`df.start()` time and pinned by a composite FK; the worker re-checks the +superuser policy (src/activities/load_function_graph.rs). Any change must keep it +unforgeable. + +## Goals + +- `df.start()` is **atomic with the caller's transaction**: on rollback, no + `df.*` rows *and* no `_duroxide` start item remain; on commit, both are present + and the worker picks the instance up only after the `df.*` rows are visible. +- Eliminate the rolled-back-start orphan leak at the source. +- Eliminate the `load-function-graph` "wait up to 5 s for the row to appear" + race (it becomes structurally impossible). +- No change to the `df.*` schema's read surface (status/result/monitoring/explain + keep working unchanged). + +## Non-goals + +- Collapsing `df.*` into `_duroxide` (Option 1) — out of scope; tracked + separately as a possible long-term direction. +- Fixing the best-effort `df.instances.status` mirror (it is eventually + consistent by design). +- Changing short-id generation / the id space. (Independent; can be addressed + separately. This spec makes a collision a clean, fully-atomic abort.) +- Migrating `df.signal()` / `df.cancel()` to the in-transaction path — same + mechanism, scoped as a follow-up (see Phasing). + +## Considered alternatives (summary) + +1. **Single source of truth in `_duroxide`** — remove `df.*`, serialize the graph + into the orchestration input, re-expose reads as views over `_duroxide`. + Strongest invariant but large; security/RLS/read-API rework and a destructive + migration. Rejected for this change. +2. **duroxide-pg runs pg_durable's SQL inside *its* sqlx transaction** — makes + `df.*` ↔ `_duroxide` atomic but on the worker-pool connection, so `df.start` + stops participating in the caller's transaction, and the `df.*` writes run as + the pool role (breaks `current_user` capture). Strictly weaker than Option 3. +3. **Run the enqueue inside the caller's backend transaction via SPI (this + spec)** — smallest change, strongest guarantee, keeps `df.*` and identity + capture intact. +4. **Tolerate divergence + async GC/reconciler** — additive safety net; does not + prevent transient inconsistency. Recommended as a *complementary* backstop, + not a replacement. + +## Design + +### Mechanism + +In `crate::dsl::start()`, replace step 6 (the out-of-band +`start_durable_function` pool call) with an **SPI call** that enqueues the start +work item on the caller's transaction, immediately after the `df.instances` +INSERT: + +```text +-- still on the caller's transaction (SPI): +INSERT INTO df.nodes ... -- step 3 (unchanged) +INSERT INTO df.instances ... -- step 4 (unchanged) +SELECT df.__enqueue_start($instance_id, $work_item_json, $visible_at); -- step 6 (new) +``` + +Because the enqueue is now an SPI statement, it shares the caller's transaction: +rollback removes the queue row along with the `df.*` rows; commit makes them +visible together. + +### The work item + +pg_durable already depends on the `duroxide` crate, so it should **construct and +serialize the real `WorkItem` type** rather than hand-rolling JSON, guaranteeing +wire compatibility: + +```rust +let item = duroxide::providers::WorkItem::StartOrchestration { + instance: instance_id.clone(), + orchestration: orchestrations::execute_function_graph::NAME.into(), + input: input_json, // existing FunctionInput JSON + version: None, // runtime resolves from registry + parent_instance: None, + parent_id: None, + execution_id: duroxide::INITIAL_EXECUTION_ID, // = 1 +}; +let work_item_json = serde_json::to_string(&item)?; +``` + +`WorkItem` is externally tagged (default serde), i.e. +`{"StartOrchestration":{...}}`. These are exactly the values the current +`Client::start_orchestration` path uses, so worker behavior is unchanged. + +The enqueue target is the existing function (body is a single INSERT; the +`p_orchestration_name/version/execution_id` params are accepted but unused — the +work item is self-contained): + +```sql +_duroxide.enqueue_orchestrator_work( + p_instance_id text, + p_work_item text, -- work_item_json + p_visible_at timestamptz, -- now() for immediate start + p_orchestration_name text DEFAULT NULL, + p_orchestration_version text DEFAULT NULL, + p_execution_id bigint DEFAULT NULL) +``` + +### Privilege model (the one real wrinkle) + +`_duroxide.orchestrator_queue` grants INSERT to its **owner only** (the +`pg_durable.worker_role`; `relacl` is empty → no PUBLIC), and +`_duroxide.enqueue_orchestrator_work` is `SECURITY INVOKER`. `df.start()` is also +`SECURITY INVOKER` and runs SPI as the calling user. Therefore a normal caller +cannot insert into the queue directly — confirmed: for a fresh `LOGIN` role, +`has_table_privilege(... 'INSERT') = false`. + +The enqueue must run through a **`SECURITY DEFINER`** entrypoint owned by a role +that owns the `_duroxide` queue tables. Two placements: + +- **Recommended — provide it in duroxide-pg** (proper layering; pg_durable never + touches `_duroxide` internals). Add a stable, documented `SECURITY DEFINER` + client-enqueue function in the `_duroxide` schema, owned by the schema owner + (`worker_role`), e.g. `_duroxide.client_enqueue_start(p_instance_id text, + p_work_item text, p_visible_at timestamptz)`. It performs the single queue + INSERT and is `GRANT EXECUTE`-ed to the roles allowed to start instances. + Applied by the BGW migration runner (`ApplyAll`); no pg_durable extension + upgrade script needed (consistent with docs/bgw-applies-migrations.md). +- **Fallback — a `df` wrapper in pg_durable.** A `df.__enqueue_start(...)` + `SECURITY DEFINER` function owned by the extension owner that calls + `_duroxide.enqueue_orchestrator_work(...)`. Works in the common configuration + where the extension owner is a superuser (or otherwise owns/【has INSERT on】the + queue). In hardened multi-role deployments where `_duroxide` is owned by a + non-superuser `worker_role` distinct from the extension owner, this requires + the `worker_role` to `GRANT INSERT` on the queue tables (and `EXECUTE` on the + enqueue fn) to the wrapper's owner. The duroxide-pg placement avoids this + coordination. + +Either wrapper **must** pin `search_path` and schema-qualify every reference +(CVE-2018-1058; enforced by the pgspot gate, scripts/pgspot-gate.sh): + +```sql +CREATE FUNCTION _duroxide.client_enqueue_start(...) +RETURNS void +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $$ BEGIN + INSERT INTO _duroxide.orchestrator_queue (instance_id, work_item, visible_at, created_at) + VALUES (p_instance_id, p_work_item, p_visible_at, pg_catalog.now()); +END $$; +``` + +The wrapper takes **only** an opaque `instance_id`, a `work_item` string, and a +timestamp — it executes no caller-supplied SQL, so the `SECURITY DEFINER` +surface is a single fixed INSERT. + +### Identity / security + +`submitted_by` capture is unchanged: it is still read as `current_user` inside +`df.start()` (`SECURITY INVOKER`) **before** the enqueue and stored in +`df.instances`. The `SECURITY DEFINER` wrapper does not touch identity; the work +item carries only the orchestration name/input. The worker continues to +authenticate using `df.instances.submitted_by` and re-check the superuser policy +(unchanged). No new forgery surface is introduced. + +### Worker readiness and ordering + +- Keep the existing worker-readiness gate: if `_worker_ready` is absent/stale, + `df.start()` errors as it does today (the `_duroxide` schema may not yet + exist). This check moves from "before acquiring the client" to "before the SPI + enqueue." +- The enqueue is the **last** write in `df.start()`; a PK collision on + `df.nodes`/`df.instances` still aborts before it (no behavior change for the + collision case, now fully inside one transaction). + +### Consequence: the 5 s load race disappears + +`load-function-graph` currently polls for up to `MAX_WAIT_SECS = 5` because the +worker could dequeue a start item before the caller's `df.*` rows committed +(src/activities/load_function_graph.rs:21,60). With an atomic start, a queue row +becomes visible to the worker **only** after the `df.*` rows commit in the same +transaction, so the "not yet visible / rolled back" branch cannot occur. The +retry loop can be simplified to a single read (optionally retain a short grace +for replication scenarios). This also removes the worker-saturation failure mode. + +## Transaction-semantics decision (must be made explicit) + +This change makes `df.start()` **atomic with the caller's transaction**: + +```sql +BEGIN; +SELECT df.start('SELECT 1'); -- enqueued on THIS transaction +ROLLBACK; -- start is fully undone; nothing runs +``` + +Today the behavior is inconsistent (the `df.*` rows roll back but the duroxide +enqueue does not). The new behavior is the least-surprising contract and is the +explicit, documented decision of this spec. Note for users: a `df.start()` in a +transaction that later rolls back will **not** run — which is the desired fix, +but a behavior change from "fire-and-forget at statement time." This must be +called out in `USER_GUIDE.md`. + +## Signals & cancellation (follow-up) + +`df.signal()` (ExternalRaised) and `df.cancel()` (CancelInstance) are also single +orchestrator enqueues and can use the same in-transaction SPI path for the same +consistency benefit. They are **out of scope** here (signal/cancel target +already-committed instances, so the rolled-back-start leak does not apply), but +the `SECURITY DEFINER` enqueue entrypoint should be designed to serve them too +(generalize to `client_enqueue(instance, work_item, visible_at)`). + +## Complementary backstop (optional, recommended) + +Independently of this change, a lightweight **reconciler** — ideally a built-in +durable function (a `@>` loop + `df.wait_for_schedule`) — should sweep for +residual divergence that an atomic start cannot prevent (crashes mid-execution, +not-yet-migrated signal/cancel paths, `status` mirror drift), using existing +`_duroxide` read functions (`list_instances`, `get_instance_info`, +`get_queue_depths`) and `delete_instances_atomic`. Tracked separately as +"Option 4"; not required for this spec but cheap insurance. + +## Upgrade & Migration + +**B1 — binary backward compatibility (new `.so` vs all prior schemas).** The new +`.so` calls the enqueue entrypoint via SPI. +- If the entrypoint is the **duroxide-pg-provided** `SECURITY DEFINER` function, + the BGW applies the duroxide migration at startup (`ApplyAll`, + src/worker.rs), and the function exists for every running worker. Backends + detect readiness via `_worker_ready` (existing gate) before enqueuing. + Versioning: bump the pinned `duroxide`/`duroxide-pg` pair together (see + docs/upgrade-testing.md "Updating duroxide-pg") and run `cargo update`. **No + `pg_durable` extension upgrade script is required** for duroxide-schema changes + (docs/bgw-applies-migrations.md). +- If the entrypoint is the **`df`-schema fallback wrapper**, it is a `df` schema + object and **does** require an extension upgrade script + `sql/pg_durable----.sql` (CREATE FUNCTION + GRANT), plus the + fresh-install SQL, and a "Version-Specific Changes" entry in + docs/upgrade-testing.md. The new `.so` must still operate against older `df` + schemas that predate the wrapper: gate the SPI path on the wrapper's existence + (catalog probe) and **fall back to the existing out-of-band client enqueue** + when absent. This preserves B1 while the upgrade rolls out. + +**Runtime schema detection.** Reuse the existing `_worker_ready` / +`backend_duroxide_schema()` resolution. Add a one-time probe (cached per session) +for the enqueue entrypoint; if missing (old worker), fall back to the legacy +path. This makes the change safe under mixed binary/worker versions. + +**Data migration.** None. No `df.*` or `_duroxide` table shapes change. Existing +in-flight instances are unaffected (their queue items were already enqueued). + +**Rollback.** Reverting the `.so` restores the out-of-band enqueue; the optional +`df` wrapper, if added, is inert when unused. + +## Testing + +E2E (tests/e2e/sql/) and unit coverage: + +- **Atomic rollback (the leak):** `BEGIN; df.start(...); ROLLBACK;` then assert + **both** `df.instances` and `_duroxide.instances`/`orchestrator_queue` have no + row for the instance. (Today this fails: `_duroxide` keeps an orphan.) +- **Atomic commit:** `BEGIN; df.start(...); COMMIT;` runs to completion exactly + as before. +- **Subtransaction abort:** a `df.start()` inside a `BEGIN/EXCEPTION` block that + raises leaves no `_duroxide` orphan. +- **PK collision (existing repro):** the short-id collision aborts cleanly with + no `_duroxide` residue (extends the manual `short_id_collision` repro). +- **Worker-not-ready:** `df.start()` before `_worker_ready` errors as today. +- **Privilege/hardened config:** a non-superuser caller (and, for the fallback + wrapper, a distinct `worker_role`) can start successfully and cannot insert + into `_duroxide.orchestrator_queue` directly. +- **No 5 s waits:** worker logs contain no "transaction may have been rolled + back" entries during normal start/rollback flows. + +## Phasing + +1. **Phase 1 (this spec):** atomic `df.start()` via the `SECURITY DEFINER` + enqueue entrypoint; readiness gate + legacy fallback; simplify the + `load-function-graph` retry; docs + tests. +2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` / + `df.cancel()`. +3. **Phase 3 (optional):** ship the reconciler backstop (Option 4). + +## Open questions / risks + +- **Entrypoint placement:** duroxide-pg (clean layering, coordinated release) vs + a `df` fallback wrapper (no duroxide-pg release, but cross-role grants in + hardened setups). Recommendation: duroxide-pg, with the `df` fallback retained + for B1 during rollout. +- **Contract stability:** pg_durable now depends on the enqueue entrypoint + signature and the `WorkItem` JSON. Mitigated by reusing the `duroxide` + `WorkItem` type and the version-locked duroxide/duroxide-pg pair. +- **`visible_at` clock:** use `now()` (server time) for immediate starts to match + the duroxide-pg provider's behavior; confirm no reliance on the Rust-side + `now_ms` for ordering at start. +- **Idempotency:** `enqueue_orchestrator_work` is a bare INSERT (no dedup); the + `df.instances`/`df.nodes` PKs already prevent duplicate instance ids within the + transaction, so no additional dedup is required. From 00de1101843bde44bea3d3542d6a3cbfcdcfe3c8 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Thu, 18 Jun 2026 21:39:49 +0000 Subject: [PATCH 02/20] spec: add dual-write inventory; clarify df.cancel vs df.signal Add a Dual-write inventory section that classifies every df.* writer vs duroxide-client call into two seams: backend session-side dual-writes (df.start, df.cancel) that Option 3 fixes, and worker-side best-effort mirror activities (update-instance-status, update-node-status) that it does not. Make explicit that df.cancel is a genuine dual-write (Phase 2) while df.signal writes only _duroxide and is not a cross-store divergence. --- docs/spec-atomic-start.md | 65 +++++++++++++++++++++++++++++++++++---- 1 file changed, 59 insertions(+), 6 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 06c9903..b921bc2 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -76,12 +76,56 @@ Key facts that make the fix small: | Execution history | `_duroxide.history` | duroxide-internal; transactional per turn via `ack_orchestration_item`. | | Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | needed by the worker to load + authenticate. | | `df.instances.status` | best-effort mirror | written by the `update-instance-status` activity; not atomic with duroxide, self-heals. | +| `df.nodes.status/result/error` | best-effort mirror | written by the `update-node-status` activity; same properties as above. | `submitted_by` is a **security boundary**: it is captured as `current_user` at `df.start()` time and pinned by a composite FK; the worker re-checks the superuser policy (src/activities/load_function_graph.rs). Any change must keep it unforgeable. +### Dual-write inventory + +Auditing every `df.*` writer against every duroxide-client call, the cross-store +seams are bounded and fall into two distinct classes. + +**Backend (session) side — `df.*` on the caller's transaction + an out-of-band +duroxide-client enqueue.** This is the class Option 3 fixes. + +| Site | `df.*` write (user tx, SPI) | `_duroxide` write (out-of-band pool) | Dual-write? | +|------|------|------|------| +| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** — Phase 1 | +| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `cancel_instance` enqueue (src/dsl.rs:955) | **Yes** — Phase 2 | +| `df.signal()` | **none** (only an RLS read of `df.instances`) | `raise_event` + descendant fan-out (src/dsl.rs:616) | **No** — single store | + +Notes: + +- **`df.cancel()` is a genuine dual-write** with two twists vs `df.start()`: it + enqueues to duroxide **first** and then does the `df.*` UPDATE (opposite + order), and the UPDATE is only an *optimistic* mirror — the authoritative + `cancelled` status is re-applied when the worker processes the cancel via + `update-instance-status`, so divergence here is usually transient and + self-healing. It still belongs in the same in-transaction treatment for strict + atomicity. +- **`df.signal()` is not a dual-write**: it writes only `_duroxide`. There is no + cross-store inconsistency to create; moving it in-transaction only changes + "don't deliver the signal if my surrounding tx rolls back" semantics — a + nice-to-have, not a consistency fix. +- Not seams: `df.run()` is a no-op stub; `df.result/status/explain/monitoring` + are reads; `df.vars` and `df._worker_epoch` touch only `df.*`. + +**Worker side — `df.*` mirror maintained by activities, on a connection separate +from duroxide's history ack.** Option 3 does **not** address this class. + +| Activity | `df.*` write | Property | +|----------|--------------|----------| +| `update-instance-status` | UPDATE `df.instances.status` | eventually consistent, self-healing | +| `update-node-status` | UPDATE `df.nodes.status/result/error` | eventually consistent, self-healing | + +These are at-least-once activities re-applied on replay, so a crash between the +duroxide ack and the mirror update is transient lag, not permanent divergence. +This is the "best-effort mirror"; any persistent drift is the reconciler's job +(see Complementary backstop), not Option 3's. + ## Goals - `df.start()` is **atomic with the caller's transaction**: on rollback, no @@ -275,12 +319,21 @@ called out in `USER_GUIDE.md`. ## Signals & cancellation (follow-up) -`df.signal()` (ExternalRaised) and `df.cancel()` (CancelInstance) are also single -orchestrator enqueues and can use the same in-transaction SPI path for the same -consistency benefit. They are **out of scope** here (signal/cancel target -already-committed instances, so the rolled-back-start leak does not apply), but -the `SECURITY DEFINER` enqueue entrypoint should be designed to serve them too -(generalize to `client_enqueue(instance, work_item, visible_at)`). +See the Dual-write inventory above for the precise classification. In short: + +- **`df.cancel()` is a genuine dual-write** (`df.instances.status` + a + `CancelInstance` enqueue) and should get the same in-transaction treatment as + `df.start()`. It is scoped as **Phase 2** rather than Phase 1 only because it + targets an already-committed instance (so the rolled-back-*start* leak does not + apply) and because the worker's `update-instance-status` activity already + re-applies the authoritative `cancelled` status, making its divergence + transient. The `SECURITY DEFINER` enqueue entrypoint should be generalized to + carry an arbitrary work item (e.g. `client_enqueue(instance, work_item, + visible_at)`) so it serves cancel directly. +- **`df.signal()` is not a dual-write** (it writes only `_duroxide`), so it + creates no cross-store inconsistency. It can ride on the same entrypoint for + "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not + a consistency fix — lowest priority. ## Complementary backstop (optional, recommended) From 7aeff23adf92394d26236e3b2df24338067a7487 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Thu, 18 Jun 2026 22:10:25 +0000 Subject: [PATCH 03/20] spec: surface the four design options up front; state Option 3 + Option 4 decision Add a 'Design options considered' section near the top enumerating the four originally-explored options (single source of truth, our-SQL-in-their-tx, their-enqueue-in-our-tx, and tolerate-and-reconcile) with a Decision subsection, so the Option N references throughout resolve. Update the Overview to state we implement Option 3 (primary) + Option 4 (backstop), remove the now -redundant mid-document summary, and reframe the reconciler as in-scope Phase 3. --- docs/spec-atomic-start.md | 94 ++++++++++++++++++++++++++++----------- 1 file changed, 67 insertions(+), 27 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index b921bc2..9999134 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -21,9 +21,12 @@ corresponding `df.*` rows. The background worker then repeatedly tries to load a graph that does not exist, fails after a 5 s timeout, and the orphan persists in `_duroxide.instances`. -This spec proposes making the duroxide start-enqueue part of the caller's -backend transaction (design **Option 3** from the design exploration), so that -`df.start()` is atomic: either everything commits, or nothing does. +Four designs were explored to keep the two stores consistent (see **Design +options considered** below). This spec **implements Option 3 (primary) plus +Option 4 (complementary backstop)**: make the duroxide start-enqueue part of the +caller's backend transaction so `df.start()` is atomic — either everything +commits or nothing does — and add an asynchronous reconciler to repair any +residual drift that start-time atomicity cannot prevent. ### Evidence (observed during investigation) @@ -38,6 +41,60 @@ backend transaction (design **Option 3** from the design exploration), so that 60776). That abort is *safe* today (it happens before the duroxide enqueue), but it exercises the same dual-write seam. +## Design options considered + +Four approaches were explored for keeping the `df.*` control plane and the +`_duroxide` runtime consistent. The mechanics of the chosen approach are detailed +in **Design** below; this section records all four and the decision. + +**Option 1 — Single source of truth in `_duroxide`.** Drop the `df.instances` / +`df.nodes` control tables entirely, keep all instance state in `_duroxide`, +serialize the graph into the orchestration input, and have duroxide expose +whatever state pg_durable needs (via views / read APIs). +*Guarantee:* total — there is only one store, so nothing can diverge. +*Cost:* large — moves the `submitted_by` security boundary, requires rebuilding +per-user RLS over duroxide-owned tables, re-exposing every `df.*` read surface, +and a destructive migration with B1 backward-compat implications. A possible +long-term direction, not this change. + +**Option 2 — Hand duroxide-pg our SQL to run inside *its* transaction.** Keep +both schemas; extend duroxide-pg so the start-enqueue also executes the +pg_durable `df.*` writes in the same (sqlx, worker-pool) transaction. +*Guarantee:* `df.*` ↔ `_duroxide` atomic — but on the worker-pool connection, so +`df.start()` stops participating in the *caller's* transaction and the `df.*` +writes run as the pool role, breaking `current_user` / `submitted_by` capture and +RLS. Strictly weaker than Option 3. + +**Option 3 — Hand duroxide-pg *our* (the caller's) transaction.** Keep both +schemas; run duroxide's start-enqueue inside the caller's backend transaction via +SPI, alongside the `df.*` writes. +*Guarantee:* strongest — `df.nodes` + `df.instances` + the queue row + the +caller's surrounding statements all commit or roll back together; identity +capture and the `df.*` read surface are untouched; the 5 s load race disappears. +*Cost:* small — the start is a single SQL-function call — with one wrinkle: a +`SECURITY DEFINER` enqueue entrypoint is needed because the queue INSERT is +owner-only. **Chosen as the primary fix.** + +**Option 4 — Tolerate divergence; repair asynchronously.** Accept a transient +window and add a reconciler — ideally a built-in durable function — that detects +and repairs mismatches (orphaned `_duroxide` instances, stuck `df.instances` +rows, `status`-mirror drift). +*Guarantee:* none added; eventual consistency with bounded repair latency. +*Value:* the only option that also catches divergence from crashes +*mid-execution* and from paths Option 3 does not cover (the worker-side mirror +activities, and signal/cancel until they are migrated). **Chosen as a +complementary backstop.** + +### Decision + +Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source +for backend-initiated starts (and, in a later phase, `df.cancel`); Option 4 is a +lightweight safety net for the residual divergence that no start-time atomicity +can prevent (crashes mid-execution, the best-effort `df.*` mirror, and +not-yet-migrated signal/cancel). Options 1 and 2 are rejected — Option 1 is +disproportionate to the problem, and Option 2 is a strictly weaker variant of +Option 3. + ## Background: how a start is written today `crate::dsl::start()` (src/dsl.rs) executes, in order, on the **caller's** @@ -148,23 +205,6 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job - Migrating `df.signal()` / `df.cancel()` to the in-transaction path — same mechanism, scoped as a follow-up (see Phasing). -## Considered alternatives (summary) - -1. **Single source of truth in `_duroxide`** — remove `df.*`, serialize the graph - into the orchestration input, re-expose reads as views over `_duroxide`. - Strongest invariant but large; security/RLS/read-API rework and a destructive - migration. Rejected for this change. -2. **duroxide-pg runs pg_durable's SQL inside *its* sqlx transaction** — makes - `df.*` ↔ `_duroxide` atomic but on the worker-pool connection, so `df.start` - stops participating in the caller's transaction, and the `df.*` writes run as - the pool role (breaks `current_user` capture). Strictly weaker than Option 3. -3. **Run the enqueue inside the caller's backend transaction via SPI (this - spec)** — smallest change, strongest guarantee, keeps `df.*` and identity - capture intact. -4. **Tolerate divergence + async GC/reconciler** — additive safety net; does not - prevent transient inconsistency. Recommended as a *complementary* backstop, - not a replacement. - ## Design ### Mechanism @@ -335,15 +375,15 @@ See the Dual-write inventory above for the precise classification. In short: "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not a consistency fix — lowest priority. -## Complementary backstop (optional, recommended) +## Complementary backstop — Option 4 (in scope, Phase 3) -Independently of this change, a lightweight **reconciler** — ideally a built-in -durable function (a `@>` loop + `df.wait_for_schedule`) — should sweep for -residual divergence that an atomic start cannot prevent (crashes mid-execution, +As part of the chosen approach, a lightweight **reconciler** — ideally a built-in +durable function (a `@>` loop + `df.wait_for_schedule`) — sweeps for residual +divergence that an atomic start cannot prevent (crashes mid-execution, not-yet-migrated signal/cancel paths, `status` mirror drift), using existing `_duroxide` read functions (`list_instances`, `get_instance_info`, -`get_queue_depths`) and `delete_instances_atomic`. Tracked separately as -"Option 4"; not required for this spec but cheap insurance. +`get_queue_depths`) and `delete_instances_atomic`. It complements Option 3 rather +than replacing it, and is sequenced as Phase 3 (see Phasing). ## Upgrade & Migration @@ -404,7 +444,7 @@ E2E (tests/e2e/sql/) and unit coverage: `load-function-graph` retry; docs + tests. 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` / `df.cancel()`. -3. **Phase 3 (optional):** ship the reconciler backstop (Option 4). +3. **Phase 3:** ship the reconciler backstop (Option 4). ## Open questions / risks From 3f64e60322c37c75adc86bed135e6bf86b1446e0 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Thu, 18 Jun 2026 22:19:55 +0000 Subject: [PATCH 04/20] spec: add options comparison table Add the trade-off matrix (atomic with df.*, atomic with caller tx, kills the 5s load race, fixes crash-time drift, effort) to the Design options considered section, with footnotes explaining Option 1's convergence with Option 3 and why Option 4 is kept as a crash-drift backstop. --- docs/spec-atomic-start.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 9999134..6f11d9d 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -85,6 +85,21 @@ rows, `status`-mirror drift). activities, and signal/cancel until they are migrated). **Chosen as a complementary backstop.** +### Comparison + +| Option | Atomic with `df.*` | Atomic with caller tx | Kills 5 s load race | Fixes crash-time drift | Effort | +|--------|:--:|:--:|:--:|:--:|:--:| +| 1 — single source of truth in `_duroxide` | n/a (one store) | no¹ | yes | yes | Large | +| 2 — our SQL in duroxide's tx | yes | **no** | yes | no | Medium | +| 3 — enqueue in caller's tx (SPI) | yes | **yes** | **yes** | no | Small–Med | +| 4 — async GC / reconciler | repairs | — | no | **yes** | Small–Med | + +¹ Option 1 also decouples from the caller's transaction unless the enqueue is +itself done via SPI — at which point its start path converges with Option 3. +"Fixes crash-time drift" = repairs divergence from a crash *mid-execution* (not +just at start); only Options 1 and 4 do this, which is why Option 4 is kept as a +backstop alongside Option 3. + ### Decision Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source From efb818c14d39c2f1fbebaffa4772a03bddd2610a Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Thu, 18 Jun 2026 22:50:10 +0000 Subject: [PATCH 05/20] POC: atomic df.start() via in-transaction SPI enqueue (Option 3) Enqueue the StartOrchestration work item inside the caller's backend transaction instead of out-of-band on the duroxide client pool, so the df.nodes/df.instances INSERTs and the orchestrator-queue INSERT commit or roll back together. - src/lib.rs: add df._enqueue_orchestrator_start(text,text), a SECURITY DEFINER wrapper (SET search_path=pg_catalog) that performs the privileged INSERT via _duroxide.enqueue_orchestrator_work(); EXECUTE revoked from PUBLIC and granted through df.grant_usage(). - src/dsl.rs: df.start() builds the duroxide WorkItem::StartOrchestration, gates on worker readiness, and enqueues via SPI through the wrapper, raising (aborting the tx) on failure. - src/client.rs: drop the now-unused out-of-band start_durable_function; expose is_worker_ready() to the start path. Verified: committed start runs to completion; rolled-back start leaves no df.* rows AND no _duroxide orphan (the leak this fixes); E2E subset incl. join/race/signals/cancel/loops passes; non-superuser df_e2e_user path works. POC scope: fresh-install only (no upgrade script); _duroxide schema name hard-coded; wrapper hardening deferred per docs/spec-atomic-start.md. --- src/client.rs | 25 +------------------------ src/dsl.rs | 46 +++++++++++++++++++++++++++++++++++----------- src/lib.rs | 44 +++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 79 insertions(+), 36 deletions(-) diff --git a/src/client.rs b/src/client.rs index 625ea73..ab70491 100644 --- a/src/client.rs +++ b/src/client.rs @@ -27,7 +27,7 @@ static DUROXIDE_CLIENT: OnceLock = OnceLock::new(); /// row, or has a `schema_version` below `WORKER_SCHEMA_VERSION`. This is a fast /// SPI read called once per session on the first call to any `df.*` function /// that needs the duroxide client. -fn is_worker_ready() -> bool { +pub(crate) fn is_worker_ready() -> bool { let schema = backend_duroxide_schema(); // First check if the readiness table exists via the catalogue. Querying @@ -135,29 +135,6 @@ async fn list_running_descendants(client: &Client, root_instance_id: &str) -> Ve descendants } -/// Start a durable function via the shared PostgreSQL store. -pub fn start_durable_function( - function_name: &str, - instance_id: &str, - input: &str, -) -> Result<(), String> { - log!( - "pg_durable: start_durable_function for instance {}", - instance_id - ); - - let rt = get_client_runtime(); - let client = get_duroxide_client()?; - - rt.block_on(async { - client - .start_orchestration(instance_id, function_name, input) - .await - .map_err(|e| format!("Failed to start durable function: {e:?}"))?; - Ok(()) - }) -} - /// Cancel a durable function. pub fn cancel_durable_function(instance_id: &str, reason: &str) -> Result<(), String> { let rt = get_client_runtime(); diff --git a/src/dsl.rs b/src/dsl.rs index 9e2c95e..e813bcf 100644 --- a/src/dsl.rs +++ b/src/dsl.rs @@ -12,7 +12,6 @@ use std::str::FromStr; use std::cell::RefCell; use std::time::Instant; -use crate::client::start_durable_function; use crate::types::{ mark_non_future_helper_call, short_id, validate_result_name, Durofut, FunctionInput, }; @@ -912,23 +911,48 @@ pub fn start( vars }); - // Start the orchestration via duroxide + // Start the orchestration durably. + // + // Option 3 (atomic start): the StartOrchestration work item is enqueued via + // SPI inside the CALLER'S transaction, so the df.nodes/df.instances INSERTs + // above and this orchestrator-queue INSERT commit or roll back together. A + // rolled-back df.start() therefore leaves no _duroxide orphan, and the worker + // only observes the queue row after the df.* rows are visible (which also + // removes the load-function-graph "wait for commit" race). let input = FunctionInput { instance_id: instance_id.clone(), label: label.map(|s| s.to_string()), vars, }; - let input_json = serde_json::to_string(&input).unwrap_or(instance_id.clone()); + let input_json = serde_json::to_string(&input).unwrap_or_else(|_| instance_id.clone()); - if let Err(e) = start_durable_function( - crate::orchestrations::execute_function_graph::NAME, - &instance_id, - &input_json, + // Fail clearly (and atomically — the df.* INSERTs roll back) if the worker + // has not yet initialised the duroxide schema. + if !crate::client::is_worker_ready() { + pgrx::error!("pg_durable background worker not yet initialized — try again in a moment"); + } + + // Build the exact same work item the duroxide client would enqueue, by + // serializing the real duroxide type (guarantees wire compatibility). + let work_item = duroxide::providers::WorkItem::StartOrchestration { + instance: instance_id.clone(), + orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(), + input: input_json, + version: None, + parent_instance: None, + parent_id: None, + execution_id: duroxide::INITIAL_EXECUTION_ID, + }; + let work_item_json = serde_json::to_string(&work_item) + .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e)); + + // SECURITY DEFINER wrapper performs the privileged orchestrator-queue INSERT + // on the caller's transaction. Raising on failure aborts the whole df.start. + if let Err(e) = Spi::run_with_args( + "SELECT df._enqueue_orchestrator_start($1, $2)", + &[instance_id.as_str().into(), work_item_json.as_str().into()], ) { - pgrx::log!( - "pg_durable: Warning - failed to start durable function: {}", - e - ); + pgrx::error!("Failed to enqueue durable function start: {:?}", e); } instance_id diff --git a/src/lib.rs b/src/lib.rs index 681b88d..7f8c723 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -413,7 +413,9 @@ DECLARE 'df.version()', 'df.debug_connection()', 'df.explain(text)', - 'df.target_database()' + 'df.target_database()', + -- Atomic-start enqueue wrapper (called by df.start() via SPI) + 'df._enqueue_orchestrator_start(text, text)' ]; BEGIN -- Validate the role exists @@ -561,6 +563,46 @@ REVOKE EXECUTE ON FUNCTION df.revoke_usage(text) FROM PUBLIC; requires = ["create_tables", dsl::http] ); +// ============================================================================ +// Atomic start enqueue (Option 3) — SECURITY DEFINER wrapper +// ============================================================================ +// +// df.start() enqueues the StartOrchestration work item by calling this wrapper +// via SPI, inside the caller's transaction. The duroxide orchestrator queue +// grants INSERT to its owner only, so the privileged INSERT must run as a +// definer that owns the _duroxide tables; the call still commits/rolls back as +// part of the caller's transaction, giving an atomic df.start(). +// +// The body references _duroxide.enqueue_orchestrator_work, which is created by +// the background worker's migrations (not this extension). plpgsql resolves it +// at call time, by which point df.start()'s worker-readiness gate guarantees the +// _duroxide schema exists. +// +// SECURITY (POC): this wrapper can enqueue an arbitrary orchestrator work item. +// EXECUTE is revoked from PUBLIC and granted only via df.grant_usage(). A +// production version should move this entrypoint into duroxide-pg (owned by the +// _duroxide schema owner) and/or validate the work item — see +// docs/spec-atomic-start.md. The _duroxide schema name is also hard-coded here +// for the POC (it is resolved dynamically elsewhere). +extension_sql!( + r#" +CREATE FUNCTION df._enqueue_orchestrator_start(p_instance_id text, p_work_item text) +RETURNS void +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +BEGIN + PERFORM _duroxide.enqueue_orchestrator_work(p_instance_id, p_work_item, pg_catalog.now()); +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLIC; +"#, + name = "atomic_start_enqueue", + requires = ["create_tables"] +); + // ============================================================================ // Extension Validation (must run before duroxide schema creation) // ============================================================================ From 6495ee12cb70a053cbcc4e0acb0ab84103dbb3a7 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 00:20:13 +0000 Subject: [PATCH 06/20] spec: record validated Option 3 POC results Add a 'Proof of concept (validated)' section documenting what was built (the SECURITY DEFINER df._enqueue_orchestrator_start wrapper + in-transaction SPI enqueue), the confirmed findings (privilege model is the only wrinkle; visible_at=now() suffices; reusing the duroxide WorkItem type), and the validation matrix (atomicity, 8/8 E2E subset, 5/5 JOIN, non-superuser path). Update Status, Phase 1, and the visible_at open question to reflect the POC; note the DROP EXTENSION CASCADE / _duroxide recovery gotcha. --- docs/spec-atomic-start.md | 69 +++++++++++++++++++++++++++++++++++---- 1 file changed, 62 insertions(+), 7 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 6f11d9d..4b3d89e 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -1,6 +1,6 @@ # Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence -**Status:** Proposal +**Status:** Proposal — Option 3 prototyped & validated (POC) **Author:** pg_durable team **Date:** June 2026 @@ -355,6 +355,59 @@ transaction, so the "not yet visible / rolled back" branch cannot occur. The retry loop can be simplified to a single read (optionally retain a short grace for replication scenarios). This also removes the worker-saturation failure mode. +## Proof of concept (validated) + +Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch +`tjgreen42/atomic-start`, ~80 lines across 3 files). + +**What was built:** + +- `df._enqueue_orchestrator_start(text, text)` — a `SECURITY DEFINER` wrapper + (`SET search_path = pg_catalog`) that performs the privileged + `_duroxide.enqueue_orchestrator_work(instance, work_item, now())` INSERT. + EXECUTE is revoked from PUBLIC and granted through `df.grant_usage()`. +- `df.start()` builds the real `duroxide::providers::WorkItem::StartOrchestration` + (guaranteeing wire compatibility), gates on `is_worker_ready()`, and enqueues + via SPI through the wrapper — raising (aborting the transaction) on failure. +- The out-of-band `start_durable_function` client path is removed from start + (the duroxide client pool is still used by `df.signal` / `df.cancel`). + +**Confirmed findings:** + +- **The privilege model is the only real wrinkle.** `_duroxide.orchestrator_queue` + grants INSERT to its owner only and the enqueue function is `SECURITY INVOKER`, + so a plain caller gets *permission denied*. The `SECURITY DEFINER` wrapper + resolves this while still committing in the caller's transaction. A + non-superuser role (`df_e2e_user`) granted via `df.grant_usage()` starts + instances successfully and cannot INSERT into the queue directly. +- **`visible_at = now()`** (SQL transaction time) is sufficient for immediate + starts; no reliance on the Rust-side `now_ms`. +- **Reusing the `duroxide` `WorkItem` type** makes the enqueued row + byte-identical to the client path, so the worker behaves identically. + +**Validated behavior:** + +| Check | Result | +|-------|:------:| +| Committed `df.start()` runs to completion | pass | +| Rolled-back `df.start()` leaves no `df.*` **and** no `_duroxide` orphan (queue + instances) | pass | +| E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 | +| JOIN determinism under the new (faster) timing | 5/5 | +| Non-superuser (`df_e2e_user`) start path | pass | +| `cargo fmt --check`; no new clippy warnings | pass | + +**POC scope / deferred to productionization:** fresh install only (no upgrade +script); the `_duroxide` schema name is hard-coded (resolved dynamically in the +real change — see Upgrade & Migration); wrapper hardening (move to duroxide-pg +and/or validate the work item). + +> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the +> BGW-owned `_duroxide` schema half-broken (migrations row present, functions +> dropped), which surfaces as *spurious* JOIN nondeterminism and a missing +> `_worker_ready`. Recover with `DROP SCHEMA _duroxide CASCADE` + restart so the +> BGW re-applies its migrations. (This cost real debugging time during the POC; +> the failures were environmental, not a regression.) + ## Transaction-semantics decision (must be made explicit) This change makes `df.start()` **atomic with the caller's transaction**: @@ -454,9 +507,11 @@ E2E (tests/e2e/sql/) and unit coverage: ## Phasing -1. **Phase 1 (this spec):** atomic `df.start()` via the `SECURITY DEFINER` - enqueue entrypoint; readiness gate + legacy fallback; simplify the - `load-function-graph` retry; docs + tests. +1. **Phase 1 (this spec) — prototyped & validated:** atomic `df.start()` via the + `SECURITY DEFINER` enqueue entrypoint; readiness gate + legacy fallback; + simplify the `load-function-graph` retry; docs + tests. (POC complete; see + Proof of concept. Remaining for production: upgrade script, dynamic + `_duroxide` schema resolution, entrypoint hardening.) 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` / `df.cancel()`. 3. **Phase 3:** ship the reconciler backstop (Option 4). @@ -470,9 +525,9 @@ E2E (tests/e2e/sql/) and unit coverage: - **Contract stability:** pg_durable now depends on the enqueue entrypoint signature and the `WorkItem` JSON. Mitigated by reusing the `duroxide` `WorkItem` type and the version-locked duroxide/duroxide-pg pair. -- **`visible_at` clock:** use `now()` (server time) for immediate starts to match - the duroxide-pg provider's behavior; confirm no reliance on the Rust-side - `now_ms` for ordering at start. +- **`visible_at` clock:** the POC uses `now()` (server transaction time) for + immediate starts and it works; no reliance on the Rust-side `now_ms` for + ordering at start. - **Idempotency:** `enqueue_orchestrator_work` is a bare INSERT (no dedup); the `df.instances`/`df.nodes` PKs already prevent duplicate instance ids within the transaction, so no additional dedup is required. From 66dd6526b05028dcea9ab3d9deaf20dc1862badc Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 00:46:13 +0000 Subject: [PATCH 07/20] POC round 2: dynamic schema, provider probe + fallback, Option 4 reconciler Address the abstraction-breaking direct _duroxide call and add the GC backstop. - df._enqueue_orchestrator_start now resolves the duroxide schema dynamically via df.duroxide_schema() and EXECUTE format(%I.enqueue_orchestrator_work), instead of hard-coding _duroxide. - df.start() probes for the duroxide-pg SQL surface (enqueue_orchestrator_work in the resolved schema): present -> atomic in-transaction SPI enqueue; absent (non-pg provider / pre-wrapper schema) -> fall back to the out-of-band client path. start_durable_function restored for the fallback. - Add df.reconcile(grace) (Option 4): SECURITY DEFINER, admin-only. GCs orphaned ROOT duroxide instances (parent_instance_id IS NULL, no df row) via delete_instances_atomic and fails stuck df.instances; excludes sub-orchestrations. Schedulable as a durable cron loop. Validated: atomicity holds via dynamic schema; reconcile deletes a planted root orphan and a stuck instance while leaving sub-orchestrations + live instances intact; E2E subset 8/8; fmt clean, no new clippy warnings. Spec updated. --- docs/spec-atomic-start.md | 83 ++++++++++++++++++++--------- src/client.rs | 26 +++++++++ src/dsl.rs | 107 ++++++++++++++++++++++++++------------ src/lib.rs | 101 +++++++++++++++++++++++++++++++---- 4 files changed, 251 insertions(+), 66 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 4b3d89e..3679439 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -364,26 +364,43 @@ Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch - `df._enqueue_orchestrator_start(text, text)` — a `SECURITY DEFINER` wrapper (`SET search_path = pg_catalog`) that performs the privileged - `_duroxide.enqueue_orchestrator_work(instance, work_item, now())` INSERT. - EXECUTE is revoked from PUBLIC and granted through `df.grant_usage()`. -- `df.start()` builds the real `duroxide::providers::WorkItem::StartOrchestration` - (guaranteeing wire compatibility), gates on `is_worker_ready()`, and enqueues - via SPI through the wrapper — raising (aborting the transaction) on failure. -- The out-of-band `start_durable_function` client path is removed from start - (the duroxide client pool is still used by `df.signal` / `df.cancel`). + `.enqueue_orchestrator_work(instance, work_item, now())` INSERT. The + duroxide schema is **resolved dynamically** via `df.duroxide_schema()` + (`_duroxide` fresh / legacy `duroxide`) and quoted with `%I`. EXECUTE is + revoked from PUBLIC and granted through `df.grant_usage()`. +- `df.start()` **probes** for the duroxide-pg SQL surface + (`enqueue_orchestrator_work` in the resolved schema). When present it builds + the real `duroxide::providers::WorkItem::StartOrchestration` (guaranteeing wire + compatibility), gates on `is_worker_ready()`, and enqueues via SPI through the + wrapper — raising (aborting the transaction) on failure. When **absent** (a + non-pg provider, or a schema predating the wrapper) it **falls back** to the + out-of-band `start_durable_function` client path (legacy, non-atomic). +- `df.reconcile(grace_seconds int default 60)` (Option 4) — a `SECURITY DEFINER`, + admin-only (revoked from PUBLIC) reconciler that deletes orphaned **root** + duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row) via + `delete_instances_atomic`, and fails `df.instances` stuck non-terminal with no + live duroxide instance and no queued start. Schedulable as a durable cron loop. + +**Provider requirement:** the atomic path only works with the **duroxide-pg +(SQL-backed) provider**, because it calls the provider's SQL function directly. +The probe + fallback make this non-breaking for any other provider. **Confirmed findings:** -- **The privilege model is the only real wrinkle.** `_duroxide.orchestrator_queue` - grants INSERT to its owner only and the enqueue function is `SECURITY INVOKER`, - so a plain caller gets *permission denied*. The `SECURITY DEFINER` wrapper - resolves this while still committing in the caller's transaction. A - non-superuser role (`df_e2e_user`) granted via `df.grant_usage()` starts - instances successfully and cannot INSERT into the queue directly. +- **The privilege model is the only real wrinkle.** The duroxide orchestrator + queue grants INSERT to its owner only and the enqueue function is + `SECURITY INVOKER`, so a plain caller gets *permission denied*. The + `SECURITY DEFINER` wrapper resolves this while still committing in the caller's + transaction. A non-superuser role (`df_e2e_user`) granted via + `df.grant_usage()` starts instances successfully and cannot INSERT into the + queue directly. - **`visible_at = now()`** (SQL transaction time) is sufficient for immediate starts; no reliance on the Rust-side `now_ms`. - **Reusing the `duroxide` `WorkItem` type** makes the enqueued row byte-identical to the client path, so the worker behaves identically. +- **The reconciler must exclude sub-orchestrations.** JOIN/RACE branches and loop + generations are legitimate duroxide instances with no `df.instances` row; + filtering on `parent_instance_id IS NULL` (root only) prevents collecting them. **Validated behavior:** @@ -391,15 +408,19 @@ Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch |-------|:------:| | Committed `df.start()` runs to completion | pass | | Rolled-back `df.start()` leaves no `df.*` **and** no `_duroxide` orphan (queue + instances) | pass | +| Dynamic schema resolution drives the SPI path (atomicity holds) | pass | | E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 | | JOIN determinism under the new (faster) timing | 5/5 | | Non-superuser (`df_e2e_user`) start path | pass | +| `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass | +| `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass | | `cargo fmt --check`; no new clippy warnings | pass | **POC scope / deferred to productionization:** fresh install only (no upgrade -script); the `_duroxide` schema name is hard-coded (resolved dynamically in the -real change — see Upgrade & Migration); wrapper hardening (move to duroxide-pg -and/or validate the work item). +script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg +owned by the schema owner, and/or validate the work item); the reconciler is a +plain SQL function (Phase 3 wires it into a built-in durable cron loop and tunes +grace/policy). > **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the > BGW-owned `_duroxide` schema half-broken (migrations row present, functions @@ -443,15 +464,27 @@ See the Dual-write inventory above for the precise classification. In short: "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not a consistency fix — lowest priority. -## Complementary backstop — Option 4 (in scope, Phase 3) +## Complementary backstop — Option 4 (prototyped, Phase 3) -As part of the chosen approach, a lightweight **reconciler** — ideally a built-in -durable function (a `@>` loop + `df.wait_for_schedule`) — sweeps for residual +As part of the chosen approach, a lightweight **reconciler** sweeps for residual divergence that an atomic start cannot prevent (crashes mid-execution, -not-yet-migrated signal/cancel paths, `status` mirror drift), using existing -`_duroxide` read functions (`list_instances`, `get_instance_info`, -`get_queue_depths`) and `delete_instances_atomic`. It complements Option 3 rather -than replacing it, and is sequenced as Phase 3 (see Phasing). +not-yet-migrated signal/cancel paths, `status` mirror drift, and legacy/fallback +orphans). The POC ships it as **`df.reconcile(grace_seconds int default 60)`** +(`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances +(`parent_instance_id IS NULL`, no `df.instances` row, older than the grace +window) via `delete_instances_atomic`, and marks `df.instances` rows stuck +non-terminal with no live duroxide instance and no queued start as `failed`. +Sub-orchestrations (JOIN/RACE branches, loop generations) are excluded by the +root filter so they are never collected. Phase 3 wires this into a built-in +durable cron loop (dogfooding pg_durable): + +```sql +SELECT df.start( + @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ), + 'reconciler'); +``` + +It complements Option 3 rather than replacing it. ## Upgrade & Migration @@ -514,7 +547,9 @@ E2E (tests/e2e/sql/) and unit coverage: `_duroxide` schema resolution, entrypoint hardening.) 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` / `df.cancel()`. -3. **Phase 3:** ship the reconciler backstop (Option 4). +3. **Phase 3 — reconciler prototyped:** `df.reconcile()` exists and is validated + (orphan GC + stuck-instance failover); remaining work is wiring it into a + built-in durable cron loop and tuning the grace/policy. ## Open questions / risks diff --git a/src/client.rs b/src/client.rs index ab70491..14b1091 100644 --- a/src/client.rs +++ b/src/client.rs @@ -135,6 +135,32 @@ async fn list_running_descendants(client: &Client, root_instance_id: &str) -> Ve descendants } +/// Start a durable function via the duroxide client (out-of-band, on the cached +/// pool). This is the provider-agnostic fallback used by df.start() when the +/// duroxide-pg SQL enqueue surface is not available; it is NOT atomic with the +/// caller's transaction. +pub fn start_durable_function( + function_name: &str, + instance_id: &str, + input: &str, +) -> Result<(), String> { + log!( + "pg_durable: start_durable_function for instance {}", + instance_id + ); + + let rt = get_client_runtime(); + let client = get_duroxide_client()?; + + rt.block_on(async { + client + .start_orchestration(instance_id, function_name, input) + .await + .map_err(|e| format!("Failed to start durable function: {e:?}"))?; + Ok(()) + }) +} + /// Cancel a durable function. pub fn cancel_durable_function(instance_id: &str, reason: &str) -> Result<(), String> { let rt = get_client_runtime(); diff --git a/src/dsl.rs b/src/dsl.rs index e813bcf..532f667 100644 --- a/src/dsl.rs +++ b/src/dsl.rs @@ -622,6 +622,24 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str, // Orchestration Control Functions // ============================================================================ +/// True when the duroxide store exposes the SQL enqueue surface — i.e. the +/// duroxide-pg (PostgreSQL) provider — in the given schema. Detected by the +/// presence of `enqueue_orchestrator_work`. When false (a non-pg provider, or a +/// schema predating the wrapper), df.start() falls back to the out-of-band +/// client path. The catalog probe never raises, so it is safe in a backend +/// session even when the schema is absent. +fn in_tx_enqueue_supported(schema: &str) -> bool { + Spi::get_one_with_args::( + "SELECT EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \ + JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \ + WHERE n.nspname = $1 AND p.proname = 'enqueue_orchestrator_work')", + &[schema.into()], + ) + .ok() + .flatten() + .unwrap_or(false) +} + /// Starts a durable SQL function. /// The fut argument can be either Durofut JSON or plain SQL string (auto-wrapped). /// Variables from df.vars are captured and passed to the orchestration. @@ -912,13 +930,6 @@ pub fn start( }); // Start the orchestration durably. - // - // Option 3 (atomic start): the StartOrchestration work item is enqueued via - // SPI inside the CALLER'S transaction, so the df.nodes/df.instances INSERTs - // above and this orchestrator-queue INSERT commit or roll back together. A - // rolled-back df.start() therefore leaves no _duroxide orphan, and the worker - // only observes the queue row after the df.* rows are visible (which also - // removes the load-function-graph "wait for commit" race). let input = FunctionInput { instance_id: instance_id.clone(), label: label.map(|s| s.to_string()), @@ -926,33 +937,63 @@ pub fn start( }; let input_json = serde_json::to_string(&input).unwrap_or_else(|_| instance_id.clone()); - // Fail clearly (and atomically — the df.* INSERTs roll back) if the worker - // has not yet initialised the duroxide schema. - if !crate::client::is_worker_ready() { - pgrx::error!("pg_durable background worker not yet initialized — try again in a moment"); - } + // Prefer the atomic in-transaction enqueue (Option 3) when the duroxide-pg + // SQL provider is present; otherwise fall back to the provider-agnostic + // out-of-band client path (legacy behavior). + let schema = crate::types::backend_duroxide_schema(); + if in_tx_enqueue_supported(schema) { + // Option 3 (atomic start): enqueue the StartOrchestration work item via + // SPI inside the CALLER'S transaction, so the df.nodes/df.instances + // INSERTs above and this orchestrator-queue INSERT commit or roll back + // together. A rolled-back df.start() leaves no duroxide orphan, and the + // worker only observes the queue row after the df.* rows are visible + // (which also removes the load-function-graph "wait for commit" race). + // + // Fail clearly (and atomically — the df.* INSERTs roll back) if the + // worker has not yet initialised the duroxide schema. + if !crate::client::is_worker_ready() { + pgrx::error!( + "pg_durable background worker not yet initialized — try again in a moment" + ); + } - // Build the exact same work item the duroxide client would enqueue, by - // serializing the real duroxide type (guarantees wire compatibility). - let work_item = duroxide::providers::WorkItem::StartOrchestration { - instance: instance_id.clone(), - orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(), - input: input_json, - version: None, - parent_instance: None, - parent_id: None, - execution_id: duroxide::INITIAL_EXECUTION_ID, - }; - let work_item_json = serde_json::to_string(&work_item) - .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e)); - - // SECURITY DEFINER wrapper performs the privileged orchestrator-queue INSERT - // on the caller's transaction. Raising on failure aborts the whole df.start. - if let Err(e) = Spi::run_with_args( - "SELECT df._enqueue_orchestrator_start($1, $2)", - &[instance_id.as_str().into(), work_item_json.as_str().into()], - ) { - pgrx::error!("Failed to enqueue durable function start: {:?}", e); + // Build the exact same work item the duroxide client would enqueue, by + // serializing the real duroxide type (guarantees wire compatibility). + let work_item = duroxide::providers::WorkItem::StartOrchestration { + instance: instance_id.clone(), + orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(), + input: input_json, + version: None, + parent_instance: None, + parent_id: None, + execution_id: duroxide::INITIAL_EXECUTION_ID, + }; + let work_item_json = serde_json::to_string(&work_item) + .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e)); + + // SECURITY DEFINER wrapper performs the privileged orchestrator-queue + // INSERT on the caller's transaction. Raising on failure aborts df.start. + if let Err(e) = Spi::run_with_args( + "SELECT df._enqueue_orchestrator_start($1, $2)", + &[instance_id.as_str().into(), work_item_json.as_str().into()], + ) { + pgrx::error!("Failed to enqueue durable function start: {:?}", e); + } + } else { + // Fallback: no duroxide-pg SQL enqueue surface detected (a non-pg + // provider, or a schema predating the wrapper). Enqueue out-of-band via + // the duroxide client. NOTE: this is the legacy path and is NOT atomic + // with the caller's transaction. + if let Err(e) = crate::client::start_durable_function( + crate::orchestrations::execute_function_graph::NAME, + &instance_id, + &input_json, + ) { + pgrx::log!( + "pg_durable: Warning - failed to start durable function: {}", + e + ); + } } instance_id diff --git a/src/lib.rs b/src/lib.rs index 7f8c723..7da9639 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -570,20 +570,24 @@ REVOKE EXECUTE ON FUNCTION df.revoke_usage(text) FROM PUBLIC; // df.start() enqueues the StartOrchestration work item by calling this wrapper // via SPI, inside the caller's transaction. The duroxide orchestrator queue // grants INSERT to its owner only, so the privileged INSERT must run as a -// definer that owns the _duroxide tables; the call still commits/rolls back as +// definer that owns the duroxide tables; the call still commits/rolls back as // part of the caller's transaction, giving an atomic df.start(). // -// The body references _duroxide.enqueue_orchestrator_work, which is created by -// the background worker's migrations (not this extension). plpgsql resolves it -// at call time, by which point df.start()'s worker-readiness gate guarantees the -// _duroxide schema exists. +// PROVIDER REQUIREMENT: this path only works with the duroxide-pg (SQL-backed) +// provider, because it calls the provider's `enqueue_orchestrator_work` SQL +// function directly. df.start() probes for that function and falls back to the +// provider-agnostic out-of-band client path when it is absent (see dsl::start). +// +// The schema is resolved dynamically via df.duroxide_schema() ('_duroxide' on +// fresh installs, legacy 'duroxide' on upgraded ones); %I quotes the identifier. +// The function is created by the worker's migrations (not this extension); +// plpgsql resolves it at call time. // // SECURITY (POC): this wrapper can enqueue an arbitrary orchestrator work item. // EXECUTE is revoked from PUBLIC and granted only via df.grant_usage(). A // production version should move this entrypoint into duroxide-pg (owned by the -// _duroxide schema owner) and/or validate the work item — see -// docs/spec-atomic-start.md. The _duroxide schema name is also hard-coded here -// for the POC (it is resolved dynamically elsewhere). +// duroxide schema owner) and/or validate the work item — see +// docs/spec-atomic-start.md. extension_sql!( r#" CREATE FUNCTION df._enqueue_orchestrator_start(p_instance_id text, p_work_item text) @@ -592,8 +596,11 @@ LANGUAGE plpgsql SECURITY DEFINER SET search_path = pg_catalog AS $fn$ +DECLARE + sch text := df.duroxide_schema(); BEGIN - PERFORM _duroxide.enqueue_orchestrator_work(p_instance_id, p_work_item, pg_catalog.now()); + EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch) + USING p_instance_id, p_work_item, pg_catalog.now(); END; $fn$; @@ -603,6 +610,82 @@ REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLI requires = ["create_tables"] ); +// ============================================================================ +// Reconciler (Option 4) — repair residual df.* / duroxide divergence +// ============================================================================ +// +// Atomic df.start() prevents the rolled-back-start orphan leak at the source, +// but some divergence can still arise: legacy orphans, the non-atomic fallback +// path, df.signal()/df.cancel() (not yet migrated), and crashes mid-execution. +// df.reconcile() is a best-effort garbage collector for that residue. It is +// admin-only (EXECUTE revoked from PUBLIC) and SECURITY DEFINER so it can touch +// the duroxide-owned tables and all users' df.instances. +// +// Schedule it durably (dogfooding pg_durable), e.g. every 5 minutes: +// SELECT df.start( +// @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ), +// 'reconciler'); +// +// Only ROOT duroxide instances (parent_instance_id IS NULL) are considered — +// sub-orchestrations (JOIN/RACE branches, loop generations) intentionally have +// no df.instances row and must not be collected. A grace window avoids racing +// legitimately in-flight operations. +extension_sql!( + r#" +CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60) +RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint) +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +DECLARE + sch text := df.duroxide_schema(); + orphan_ids text[]; + stuck bigint := 0; +BEGIN + -- 1) Root duroxide instances with no df.instances row, older than the grace + -- window → orphans (e.g. rolled-back starts on the fallback path). Delete. + EXECUTE pg_catalog.format( + 'SELECT pg_catalog.array_agg(d.instance_id) ' + 'FROM %I.instances d ' + 'LEFT JOIN df.instances i ON i.id = d.instance_id ' + 'WHERE i.id IS NULL ' + ' AND d.parent_instance_id IS NULL ' + ' AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1)', + sch) + INTO orphan_ids + USING p_grace_seconds; + + IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN + EXECUTE pg_catalog.format('SELECT %I.delete_instances_atomic($1, $2)', sch) + USING orphan_ids, true; + END IF; + + -- 2) df.instances stuck non-terminal with no live duroxide instance and no + -- queued start (lost enqueue) → mark failed. + EXECUTE pg_catalog.format( + 'UPDATE df.instances i ' + 'SET status = ''failed'', updated_at = pg_catalog.now() ' + 'WHERE i.status IN (''pending'', ''running'') ' + ' AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) ' + ' AND NOT EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = i.id) ' + ' AND NOT EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = i.id)', + sch, sch) + USING p_grace_seconds; + GET DIAGNOSTICS stuck = ROW_COUNT; + + duroxide_orphans_deleted := COALESCE(pg_catalog.array_length(orphan_ids, 1), 0); + stuck_instances_failed := stuck; + RETURN NEXT; +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df.reconcile(integer) FROM PUBLIC; +"#, + name = "reconcile", + requires = ["create_tables"] +); + // ============================================================================ // Extension Validation (must run before duroxide schema creation) // ============================================================================ From f72c6a07cf179a21ec0bacb0c0f3d6914a416c06 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 01:48:36 +0000 Subject: [PATCH 08/20] Option 4: built-in durable reconciler cron loop The background worker now keeps one reconciler instance running per cluster, dogfooding pg_durable as a durable cron loop: df.start(df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule())), 'df_reconciler') - worker::ensure_reconciler runs each epoch: idempotent (skips if a reconciler is pending/running; self-heals if it died), submits as a dedicated NON-superuser role df_reconciler (worker-created; granted df.grant_usage() + EXECUTE on df.reconcile()). Non-superuser identity avoids the enable_superuser_instances guard and bounds blast radius to 'trigger GC'. - New GUC pg_durable.reconciler_cron (default '*/5 * * * *'; empty disables). Validated: auto-starts as df_reconciler; exactly one instance (idempotent); scheduled loop GCs a planted orphan on its next tick without self-GC'ing; E2E subset 6/6 unaffected with it running. Two gotchas handled: pg_-prefixed role names are reserved (-> df_reconciler), and a set-returning reconciler must be called as SELECT * FROM df.reconcile() (bare call yields an unsupported RECORD column in execute-sql). Spec updated. --- docs/spec-atomic-start.md | 57 ++++++++++++++------ src/lib.rs | 26 +++++++-- src/worker.rs | 107 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 171 insertions(+), 19 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 3679439..d0b86cb 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -379,7 +379,18 @@ Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch admin-only (revoked from PUBLIC) reconciler that deletes orphaned **root** duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row) via `delete_instances_atomic`, and fails `df.instances` stuck non-terminal with no - live duroxide instance and no queued start. Schedulable as a durable cron loop. + live duroxide instance and no queued start. +- **Built-in durable cron loop** — the background worker + (`worker::ensure_reconciler`) keeps exactly one reconciler instance running per + cluster, dogfooding pg_durable: + `df.start(df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule())), 'df_reconciler')`. + It is idempotent (skips if one is pending/running; self-heals if it died), + driven by the `pg_durable.reconciler_cron` GUC (default `*/5 * * * *`; empty + disables), and submitted by a dedicated **non-superuser** role `df_reconciler` + (created by the worker, granted only `df.grant_usage()` + EXECUTE on + `df.reconcile()`). A non-superuser identity sidesteps the + `enable_superuser_instances` guard and bounds the blast radius to "trigger GC" + even if the instance were forged. **Provider requirement:** the atomic path only works with the **duroxide-pg (SQL-backed) provider**, because it calls the provider's SQL function directly. @@ -401,6 +412,11 @@ The probe + fallback make this non-breaking for any other provider. - **The reconciler must exclude sub-orchestrations.** JOIN/RACE branches and loop generations are legitimate duroxide instances with no `df.instances` row; filtering on `parent_instance_id IS NULL` (root only) prevents collecting them. +- **Two gotchas in the cron loop:** the reconciler role cannot be named with a + `pg_` prefix (PostgreSQL reserves it) — hence `df_reconciler`; and the loop + node must call the set-returning reconciler as `SELECT * FROM df.reconcile()` + (a bare `SELECT df.reconcile()` yields an unsupported `RECORD` column in + `execute-sql`). **Validated behavior:** @@ -414,13 +430,16 @@ The probe + fallback make this non-breaking for any other provider. | Non-superuser (`df_e2e_user`) start path | pass | | `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass | | `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass | +| Built-in reconciler auto-starts as non-superuser `df_reconciler`, idempotent (exactly one) | pass | +| Scheduled loop GCs a planted orphan on its next tick, without self-GC'ing | pass | +| E2E subset unaffected with the reconciler running (incl. heartbeat + join) | 6/6 | | `cargo fmt --check`; no new clippy warnings | pass | **POC scope / deferred to productionization:** fresh install only (no upgrade script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg -owned by the schema owner, and/or validate the work item); the reconciler is a -plain SQL function (Phase 3 wires it into a built-in durable cron loop and tunes -grace/policy). +owned by the schema owner, and/or validate the work item); the reconciler's +grace/cadence policy still needs tuning, and the `df_reconciler` role is created +by the worker rather than via managed provisioning. > **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the > BGW-owned `_duroxide` schema half-broken (migrations row present, functions @@ -464,27 +483,33 @@ See the Dual-write inventory above for the precise classification. In short: "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not a consistency fix — lowest priority. -## Complementary backstop — Option 4 (prototyped, Phase 3) +## Complementary backstop — Option 4 (implemented) As part of the chosen approach, a lightweight **reconciler** sweeps for residual divergence that an atomic start cannot prevent (crashes mid-execution, not-yet-migrated signal/cancel paths, `status` mirror drift, and legacy/fallback -orphans). The POC ships it as **`df.reconcile(grace_seconds int default 60)`** +orphans). It ships as **`df.reconcile(grace_seconds int default 60)`** (`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row, older than the grace window) via `delete_instances_atomic`, and marks `df.instances` rows stuck non-terminal with no live duroxide instance and no queued start as `failed`. Sub-orchestrations (JOIN/RACE branches, loop generations) are excluded by the -root filter so they are never collected. Phase 3 wires this into a built-in -durable cron loop (dogfooding pg_durable): +root filter so they are never collected. + +The background worker runs it as a **built-in durable cron loop** (dogfooding +pg_durable), ensuring exactly one instance per cluster: ```sql -SELECT df.start( - @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ), - 'reconciler'); +df.start( + df.loop(df.seq('SELECT * FROM df.reconcile()', + df.wait_for_schedule(current_setting('pg_durable.reconciler_cron')))), + 'df_reconciler'); ``` -It complements Option 3 rather than replacing it. +started by `worker::ensure_reconciler` each epoch (idempotent; self-heals if it +died), submitted by the dedicated non-superuser `df_reconciler` role, and driven +by `pg_durable.reconciler_cron` (default `*/5 * * * *`; empty disables). It +complements Option 3 rather than replacing it. ## Upgrade & Migration @@ -547,9 +572,11 @@ E2E (tests/e2e/sql/) and unit coverage: `_duroxide` schema resolution, entrypoint hardening.) 2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` / `df.cancel()`. -3. **Phase 3 — reconciler prototyped:** `df.reconcile()` exists and is validated - (orphan GC + stuck-instance failover); remaining work is wiring it into a - built-in durable cron loop and tuning the grace/policy. +3. **Phase 3 — reconciler implemented:** `df.reconcile()` plus a built-in durable + cron loop (worker-managed `df_reconciler` instance on + `pg_durable.reconciler_cron`) are validated (auto-start, idempotency, + scheduled orphan GC, no self-GC). Remaining: tune the grace/cadence policy and + role provisioning. ## Open questions / risks diff --git a/src/lib.rs b/src/lib.rs index 7da9639..4a29fd7 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -29,6 +29,12 @@ pub static EXECUTION_ACQUIRE_TIMEOUT: GucSetting = GucSetting::::new(3 /// functions are explicitly desired. See docs/superuser_guc.md. pub static ENABLE_SUPERUSER_INSTANCES: GucSetting = GucSetting::::new(false); +/// Cron schedule for the built-in durable reconciler (Option 4 / df.reconcile()). +/// The background worker ensures one reconciler instance per cluster, running on +/// this schedule. Set to an empty string to disable the built-in reconciler. +pub static RECONCILER_CRON: GucSetting> = + GucSetting::>::new(Some(c"*/5 * * * *")); + // Module declarations pub mod activities; pub mod client; @@ -135,6 +141,15 @@ pub extern "C-unwind" fn _PG_init() { GucFlags::SUPERUSER_ONLY, ); + GucRegistry::define_string_guc( + c"pg_durable.reconciler_cron", + c"Cron schedule for the built-in durable reconciler (df.reconcile()); empty disables it", + c"The background worker keeps one reconciler instance running on this schedule to repair residual df.* / duroxide divergence. Empty string disables the built-in reconciler. Requires server restart to change.", + &RECONCILER_CRON, + GucContext::Postmaster, + GucFlags::default(), + ); + worker::register_background_worker(); } @@ -621,10 +636,13 @@ REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLI // admin-only (EXECUTE revoked from PUBLIC) and SECURITY DEFINER so it can touch // the duroxide-owned tables and all users' df.instances. // -// Schedule it durably (dogfooding pg_durable), e.g. every 5 minutes: -// SELECT df.start( -// @> ( 'SELECT df.reconcile()' ~> df.wait_for_schedule('*/5 * * * *') ), -// 'reconciler'); +// The background worker keeps one reconciler instance running automatically +// (worker::ensure_reconciler), dogfooding pg_durable as a durable cron loop: +// df.start(df.loop(df.seq('SELECT * FROM df.reconcile()', +// df.wait_for_schedule())), +// 'df_reconciler') +// submitted by the dedicated non-superuser role df_reconciler. Set +// pg_durable.reconciler_cron = '' to disable it. // // Only ROOT duroxide instances (parent_instance_id IS NULL) are considered — // sub-orchestrations (JOIN/RACE branches, loop generations) intentionally have diff --git a/src/worker.rs b/src/worker.rs index 4f6d66a..c9c99d1 100644 --- a/src/worker.rs +++ b/src/worker.rs @@ -202,6 +202,10 @@ async fn run_duroxide_runtime() { } }; + // Ensure the built-in durable reconciler is running for this epoch + // (Option 4). Idempotent: starts one only if none is pending/running. + ensure_reconciler(&mgmt_pool).await; + run_until_extension_dropped_or_shutdown( &mgmt_pool, duroxide_runtime, @@ -215,6 +219,109 @@ async fn run_duroxide_runtime() { mgmt_pool.close().await; } +/// Built-in durable reconciler (Option 4). Ensures exactly one reconciler +/// instance is running per cluster: an infinite durable loop that calls +/// `df.reconcile()` on the `pg_durable.reconciler_cron` schedule. +/// +/// Idempotent — does nothing if a reconciler is already pending/running, and +/// (re)starts one if it has died. The instance is submitted by a dedicated +/// **non-superuser** role (`pg_durable_reconciler`) granted only df usage plus +/// EXECUTE on the SECURITY DEFINER `df.reconcile()`. Using a non-superuser +/// identity avoids the enable_superuser_instances guard, and bounds the blast +/// radius to "trigger GC" even if the instance were forged. +async fn ensure_reconciler(pool: &sqlx::PgPool) { + const ROLE: &str = "df_reconciler"; + const LABEL: &str = "df_reconciler"; + + let cron = crate::RECONCILER_CRON + .get() + .map(|c| c.to_string_lossy().into_owned()) + .unwrap_or_default(); + let cron = cron.trim().to_string(); + if cron.is_empty() { + return; // built-in reconciler disabled + } + + // Skip if a reconciler is already pending/running (idempotent across epochs; + // self-heals if a previous one failed/was cancelled). Runs as the worker + // role, which bypasses RLS, so it sees the instance regardless of owner. + let already_running: i64 = match sqlx::query_scalar( + "SELECT count(*) FROM df.instances \ + WHERE label = $1 AND status IN ('pending', 'running')", + ) + .bind(LABEL) + .fetch_one(pool) + .await + { + Ok(n) => n, + Err(e) => { + log!("pg_durable: reconciler liveness check failed: {}", e); + return; + } + }; + if already_running > 0 { + return; + } + + // Ensure the dedicated role exists with the minimal grants. + if let Err(e) = ensure_reconciler_role(pool, ROLE).await { + log!("pg_durable: failed to provision reconciler role: {}", e); + return; + } + + // Start the durable loop as the dedicated role on a single connection so the + // SET ROLE applies to df.start() (which captures submitted_by = current_user). + let started = async { + let mut conn = pool.acquire().await?; + sqlx::query(&format!("SET ROLE {ROLE}")) + .execute(&mut *conn) + .await?; + let res = sqlx::query( + "SELECT df.start(\ + df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule($1))), \ + $2)", + ) + .bind(&cron) + .bind(LABEL) + .execute(&mut *conn) + .await; + // Always reset the role before the connection returns to the pool. + let _ = sqlx::query("RESET ROLE").execute(&mut *conn).await; + res.map(|_| ()) + } + .await; + + match started { + Ok(()) => log!("pg_durable: started built-in reconciler (cron='{}')", cron), + Err(e) => log!("pg_durable: failed to start built-in reconciler: {}", e), + } +} + +/// Create the dedicated non-superuser reconciler role (if absent) and grant it +/// df usage plus EXECUTE on df.reconcile(). Idempotent. +async fn ensure_reconciler_role(pool: &sqlx::PgPool, role: &str) -> Result<(), sqlx::Error> { + // role is a fixed compile-time constant, so the format! is injection-safe. + sqlx::query(&format!( + "DO $$ BEGIN \ + IF NOT EXISTS (SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '{role}') THEN \ + CREATE ROLE {role} LOGIN; \ + END IF; \ + END $$;" + )) + .execute(pool) + .await?; + sqlx::query("SELECT df.grant_usage($1)") + .bind(role) + .execute(pool) + .await?; + sqlx::query(&format!( + "GRANT EXECUTE ON FUNCTION df.reconcile(integer) TO {role}" + )) + .execute(pool) + .await?; + Ok(()) +} + async fn wait_for_extension_creation(poll_pool: &sqlx::PgPool, poll_interval: Duration) -> bool { log!("pg_durable: waiting for CREATE EXTENSION pg_durable..."); From 33e9648c55e9310410a1236c277e579aadb7c99a Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 02:43:53 +0000 Subject: [PATCH 09/20] Address code-review findings (security + correctness) Fixes from an SoT review of the atomic-start + reconciler work: - HIGH (security): df._enqueue_orchestrator_start no longer accepts an opaque work item. It builds the StartOrchestration server-side and authorizes only a brand-new, not-yet-started instance (pending df.instances row, no queue entry, no duroxide instance), so a df user can no longer forge Cancel/Signal/Activity work items against another user's instance. Wrapper is now (text,text,text). - HIGH (correctness): df.reconcile() gathers the FULL orphan subtree (recursive on parent_instance_id) before delete_instances_atomic, which refuses to delete a parent without its children; each pass is wrapped in EXCEPTION so a failure never aborts reconcile or kills the built-in loop. Count now reflects rows actually deleted. - HIGH (security): reconciler singleton liveness keys on the unforgeable submitted_by = df_reconciler role, not the user-writable label, so a user cannot suppress GC with a look-alike instance. - MED/HIGH: WAIT_SCHEDULE recomputes its delay from the deterministic clock (ctx.utc_now) each generation instead of replaying a fixed build-time offset, fixing a df.loop(df.wait_for_schedule) busy-loop (calculate_cron_wait_from). - MED: ensure_reconciler is re-asserted from the steady-state poll loop, so a reconciler that dies mid-epoch restarts within the poll interval. - MED: df.start() warns when it takes the non-atomic fallback path. - Docs: keep df_reconciler LOGIN (connect_as_user needs it); correct stale role name in comments; qualify the 5s-race claim to the atomic path; spec Review hardening section. Verified: forged enqueues denied; root+child orphan GC'd with no error; label spoof does not suppress GC; reconciler fires on cron (no spin) and self-heals in ~4s after cancel; E2E subset 6/6; fmt clean, no new clippy warnings. --- docs/spec-atomic-start.md | 70 +++++++-- src/dsl.rs | 38 +++-- src/lib.rs | 153 ++++++++++++++----- src/orchestrations/execute_function_graph.rs | 29 +++- src/types.rs | 20 ++- src/worker.rs | 46 ++++-- 6 files changed, 259 insertions(+), 97 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index d0b86cb..5022f92 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -70,7 +70,8 @@ schemas; run duroxide's start-enqueue inside the caller's backend transaction vi SPI, alongside the `df.*` writes. *Guarantee:* strongest — `df.nodes` + `df.instances` + the queue row + the caller's surrounding statements all commit or roll back together; identity -capture and the `df.*` read surface are untouched; the 5 s load race disappears. +capture and the `df.*` read surface are untouched; the 5 s load race disappears +on the atomic path. *Cost:* small — the start is a single SQL-function call — with one wrinkle: a `SECURITY DEFINER` enqueue entrypoint is needed because the queue INSERT is owner-only. **Chosen as the primary fix.** @@ -205,7 +206,7 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job and the worker picks the instance up only after the `df.*` rows are visible. - Eliminate the rolled-back-start orphan leak at the source. - Eliminate the `load-function-graph` "wait up to 5 s for the row to appear" - race (it becomes structurally impossible). + race on the atomic path (the retry is retained for the non-atomic fallback). - No change to the `df.*` schema's read surface (status/result/monitoring/explain keep working unchanged). @@ -322,9 +323,17 @@ AS $$ BEGIN END $$; ``` -The wrapper takes **only** an opaque `instance_id`, a `work_item` string, and a -timestamp — it executes no caller-supplied SQL, so the `SECURITY DEFINER` -surface is a single fixed INSERT. +The wrapper takes the caller's `instance_id`, orchestration name, and input +string; it **constructs the `StartOrchestration` work item server-side** (so the +caller cannot choose the work-item variant or target a foreign instance) and +**authorizes** the enqueue only for a brand-new, not-yet-started instance +(`pending` `df.instances` row, no orchestrator-queue entry, no duroxide +instance — under atomic-start semantics reachable only for the row `df.start` +just inserted in the current transaction). It executes no caller-supplied SQL. +This is what prevents the wrapper from being a generic, cross-tenant +arbitrary-work-item enqueue primitive (which would otherwise let any df user +forge `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another +user's instance, bypassing the RLS gate `df.signal`/`df.cancel` enforce). ### Identity / security @@ -345,15 +354,17 @@ authenticate using `df.instances.submitted_by` and re-check the superuser policy `df.nodes`/`df.instances` still aborts before it (no behavior change for the collision case, now fully inside one transaction). -### Consequence: the 5 s load race disappears +### Consequence: the 5 s load race disappears (on the atomic path) -`load-function-graph` currently polls for up to `MAX_WAIT_SECS = 5` because the -worker could dequeue a start item before the caller's `df.*` rows committed -(src/activities/load_function_graph.rs:21,60). With an atomic start, a queue row +`load-function-graph` polls for up to `MAX_WAIT_SECS = 5` because the worker +could dequeue a start item before the caller's `df.*` rows committed +(src/activities/load_function_graph.rs:21,60). On the **atomic path** a queue row becomes visible to the worker **only** after the `df.*` rows commit in the same transaction, so the "not yet visible / rolled back" branch cannot occur. The -retry loop can be simplified to a single read (optionally retain a short grace -for replication scenarios). This also removes the worker-saturation failure mode. +retry is **retained** rather than removed, because the non-atomic **fallback** +path can still enqueue before the `df.*` rows commit; it is a no-op on the atomic +path. So this removes the worker-saturation failure mode for atomic starts while +staying correct for the fallback. ## Proof of concept (validated) @@ -435,6 +446,43 @@ The probe + fallback make this non-breaking for any other provider. | E2E subset unaffected with the reconciler running (incl. heartbeat + join) | 6/6 | | `cargo fmt --check`; no new clippy warnings | pass | +### Review hardening (applied) + +A source-of-truth code review surfaced and fixed several issues: + +- **Cross-tenant enqueue bypass (HIGH).** The wrapper originally accepted an + opaque `work_item` and was granted to every df user, letting a caller forge + `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another user's + instance. Fixed: the wrapper now **builds the `StartOrchestration` server-side** + and **authorizes** only a brand-new, not-yet-started instance — verified that a + non-superuser can no longer forge an enqueue for a foreign or arbitrary id. +- **Reconciler self-destruct on parallel-workflow orphans (HIGH).** + `delete_instances_atomic` refuses to delete a parent without its children (even + with `force`), so deleting only orphan **roots** raised on any JOIN/RACE/loop + orphan and aborted GC. Fixed: gather the **full subtree** (recursive on + `parent_instance_id`) and wrap each pass in an exception block so one failure + never aborts reconcile or kills the loop — verified root+child GC with no error. +- **GC-suppression via user label (HIGH).** The singleton key was the + user-writable `label`; any user could park a `df_reconciler`-labelled instance + to stop the GC from starting. Fixed: key liveness on the unforgeable + `submitted_by = df_reconciler` role. +- **Cron busy-loop (MED/HIGH).** `df.wait_for_schedule` baked a fixed offset at + build time that a `df.loop` replayed every generation (worst case ~1 s spin). + Fixed: the WAIT_SCHEDULE node now recomputes the delay from the orchestration's + deterministic clock (`ctx.utc_now()`) each generation — verified the reconciler + fires on the cron tick and does not spin. (Note: adding the clock read changes + the orchestration history shape, so a wait-in-loop instance in flight across the + upgrade may need to restart.) +- **No in-epoch self-healing (MED).** The reconciler was (re)started only per + epoch. Fixed: `ensure_reconciler` is also re-asserted from the steady-state poll + loop — verified a cancelled reconciler restarts within the poll interval. +- **Silent non-atomic fallback (MED).** `df.start()` now emits a `WARNING` when it + takes the non-atomic fallback path. +- **Debated/rejected:** the `df_reconciler` role keeps `LOGIN` — the worker + executes the reconcile node by connecting *as* the role via `connect_as_user`, + exactly like every other durable-function role; the proposed `NOLOGIN` would + break node execution. + **POC scope / deferred to productionization:** fresh install only (no upgrade script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg owned by the schema owner, and/or validate the work item); the reconciler's diff --git a/src/dsl.rs b/src/dsl.rs index 532f667..de786cb 100644 --- a/src/dsl.rs +++ b/src/dsl.rs @@ -957,33 +957,31 @@ pub fn start( ); } - // Build the exact same work item the duroxide client would enqueue, by - // serializing the real duroxide type (guarantees wire compatibility). - let work_item = duroxide::providers::WorkItem::StartOrchestration { - instance: instance_id.clone(), - orchestration: crate::orchestrations::execute_function_graph::NAME.to_string(), - input: input_json, - version: None, - parent_instance: None, - parent_id: None, - execution_id: duroxide::INITIAL_EXECUTION_ID, - }; - let work_item_json = serde_json::to_string(&work_item) - .unwrap_or_else(|e| pgrx::error!("Failed to serialize start work item: {}", e)); - - // SECURITY DEFINER wrapper performs the privileged orchestrator-queue - // INSERT on the caller's transaction. Raising on failure aborts df.start. + // The SECURITY DEFINER wrapper builds the StartOrchestration work item + // server-side from these trusted arguments (the caller cannot choose the + // work-item variant or target a foreign instance) and performs the + // privileged orchestrator-queue INSERT on the caller's transaction. + // Raising on failure aborts the whole df.start atomically. if let Err(e) = Spi::run_with_args( - "SELECT df._enqueue_orchestrator_start($1, $2)", - &[instance_id.as_str().into(), work_item_json.as_str().into()], + "SELECT df._enqueue_orchestrator_start($1, $2, $3)", + &[ + instance_id.as_str().into(), + crate::orchestrations::execute_function_graph::NAME.into(), + input_json.as_str().into(), + ], ) { pgrx::error!("Failed to enqueue durable function start: {:?}", e); } } else { // Fallback: no duroxide-pg SQL enqueue surface detected (a non-pg // provider, or a schema predating the wrapper). Enqueue out-of-band via - // the duroxide client. NOTE: this is the legacy path and is NOT atomic - // with the caller's transaction. + // the duroxide client. NOTE: this path is NOT atomic with the caller's + // transaction — a rollback will NOT undo the start. Warn so the + // non-atomic semantics are observable. + pgrx::warning!( + "pg_durable: df.start() is using the non-atomic fallback enqueue \ + (duroxide-pg SQL surface not detected); a rollback will not undo this start" + ); if let Err(e) = crate::client::start_durable_function( crate::orchestrations::execute_function_graph::NAME, &instance_id, diff --git a/src/lib.rs b/src/lib.rs index 4a29fd7..d7a6b15 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -430,7 +430,7 @@ DECLARE 'df.explain(text)', 'df.target_database()', -- Atomic-start enqueue wrapper (called by df.start() via SPI) - 'df._enqueue_orchestrator_start(text, text)' + 'df._enqueue_orchestrator_start(text, text, text)' ]; BEGIN -- Validate the role exists @@ -598,28 +598,74 @@ REVOKE EXECUTE ON FUNCTION df.revoke_usage(text) FROM PUBLIC; // The function is created by the worker's migrations (not this extension); // plpgsql resolves it at call time. // -// SECURITY (POC): this wrapper can enqueue an arbitrary orchestrator work item. -// EXECUTE is revoked from PUBLIC and granted only via df.grant_usage(). A -// production version should move this entrypoint into duroxide-pg (owned by the -// duroxide schema owner) and/or validate the work item — see -// docs/spec-atomic-start.md. +// SECURITY: this wrapper is SECURITY DEFINER and granted to every df user via +// df.grant_usage() (df.start() runs SECURITY INVOKER and calls it via SPI). To +// avoid handing users an arbitrary-work-item enqueue primitive (which would let +// a caller forge CancelInstance/ExternalRaised/etc. against another user's +// instance), it (1) constructs the StartOrchestration work item server-side from +// the caller's arguments — the caller cannot choose the variant or a foreign +// target — and (2) authorizes the enqueue only for a brand-new, not-yet-started +// instance (pending df.instances row, no queue entry, no duroxide instance), +// which under atomic-start semantics is reachable only for the row df.start just +// inserted in the current transaction. Hardening direction (move the entrypoint +// into duroxide-pg, owned by the schema owner) is in docs/spec-atomic-start.md. extension_sql!( r#" -CREATE FUNCTION df._enqueue_orchestrator_start(p_instance_id text, p_work_item text) +CREATE FUNCTION df._enqueue_orchestrator_start( + p_instance_id text, + p_orchestration text, + p_input text) RETURNS void LANGUAGE plpgsql SECURITY DEFINER SET search_path = pg_catalog AS $fn$ DECLARE - sch text := df.duroxide_schema(); + sch text := df.duroxide_schema(); + work_item text; + v_blocked boolean; BEGIN + -- Authorization. This runs as the (privileged) definer, so it must not + -- trust the caller to only target their own instance. Permit the enqueue + -- only for a brand-new, not-yet-started instance: a 'pending' df.instances + -- row with no orchestrator-queue entry and no duroxide instance. Under the + -- atomic-start path a committed df.instances row always has its queue row in + -- the same transaction, so this state is reachable only for the instance the + -- caller (df.start) just inserted in the current transaction — never another + -- user's committed instance. This is what stops a caller from forging work + -- against a foreign instance. + EXECUTE pg_catalog.format( + 'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') ' + ' OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) ' + ' OR EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = $1)', + sch, sch) + INTO v_blocked + USING p_instance_id; + + IF v_blocked THEN + RAISE EXCEPTION 'pg_durable: not authorized to enqueue a start for instance %', p_instance_id + USING ERRCODE = 'insufficient_privilege'; + END IF; + + -- Build the StartOrchestration work item server-side so the caller cannot + -- choose the work-item variant (no CancelInstance/ExternalRaised/etc.) or + -- target a different instance. Mirrors duroxide's WorkItem::StartOrchestration. + work_item := pg_catalog.json_build_object( + 'StartOrchestration', pg_catalog.json_build_object( + 'instance', p_instance_id, + 'orchestration', p_orchestration, + 'input', p_input, + 'version', NULL, + 'parent_instance', NULL, + 'parent_id', NULL, + 'execution_id', 1))::text; + EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch) - USING p_instance_id, p_work_item, pg_catalog.now(); + USING p_instance_id, work_item, pg_catalog.now(); END; $fn$; -REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text) FROM PUBLIC; +REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM PUBLIC; "#, name = "atomic_start_enqueue", requires = ["create_tables"] @@ -659,40 +705,71 @@ AS $fn$ DECLARE sch text := df.duroxide_schema(); orphan_ids text[]; + deleted bigint := 0; stuck bigint := 0; BEGIN - -- 1) Root duroxide instances with no df.instances row, older than the grace - -- window → orphans (e.g. rolled-back starts on the fallback path). Delete. - EXECUTE pg_catalog.format( - 'SELECT pg_catalog.array_agg(d.instance_id) ' - 'FROM %I.instances d ' - 'LEFT JOIN df.instances i ON i.id = d.instance_id ' - 'WHERE i.id IS NULL ' - ' AND d.parent_instance_id IS NULL ' - ' AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1)', - sch) - INTO orphan_ids - USING p_grace_seconds; - - IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN - EXECUTE pg_catalog.format('SELECT %I.delete_instances_atomic($1, $2)', sch) + -- 1) Delete orphaned duroxide subtrees: every duroxide instance whose ROOT + -- ancestor has no df.instances row and is older than the grace window. + -- We must gather the FULL subtree (root + all descendants) because + -- delete_instances_atomic refuses (even with force) to delete a parent + -- whose children are not also in the list. Sub-orchestrations (JOIN/RACE + -- branches, loop generations) have no df.instances row and would be + -- mis-detected as roots if we keyed on "no df row" alone, so the orphan + -- seed is restricted to parent_instance_id IS NULL. + -- Wrapped so a GC failure never aborts reconcile or kills the built-in + -- reconciler loop. + BEGIN + EXECUTE pg_catalog.format( + 'WITH RECURSIVE orphan_root AS ( ' + ' SELECT d.instance_id ' + ' FROM %1$I.instances d ' + ' LEFT JOIN df.instances i ON i.id = d.instance_id ' + ' WHERE i.id IS NULL ' + ' AND d.parent_instance_id IS NULL ' + ' AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) ' + '), subtree AS ( ' + ' SELECT instance_id FROM orphan_root ' + ' UNION ' + ' SELECT c.instance_id FROM %1$I.instances c ' + ' JOIN subtree s ON c.parent_instance_id = s.instance_id ' + ') SELECT pg_catalog.array_agg(instance_id) FROM subtree', + sch) + INTO orphan_ids + USING p_grace_seconds; + + IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN + EXECUTE pg_catalog.format( + 'SELECT instances_deleted FROM %I.delete_instances_atomic($1, $2)', sch) + INTO deleted USING orphan_ids, true; - END IF; + END IF; + EXCEPTION WHEN OTHERS THEN + deleted := 0; + RAISE WARNING 'pg_durable: reconcile orphan-GC pass failed: %', SQLERRM; + END; -- 2) df.instances stuck non-terminal with no live duroxide instance and no - -- queued start (lost enqueue) → mark failed. - EXECUTE pg_catalog.format( - 'UPDATE df.instances i ' - 'SET status = ''failed'', updated_at = pg_catalog.now() ' - 'WHERE i.status IN (''pending'', ''running'') ' - ' AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) ' - ' AND NOT EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = i.id) ' - ' AND NOT EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = i.id)', - sch, sch) - USING p_grace_seconds; - GET DIAGNOSTICS stuck = ROW_COUNT; + -- queued start (lost enqueue) → mark failed. The duroxide queue row + -- persists (locked) until ack, and the instance row is created at ack, so + -- a healthy in-flight start always matches one of the NOT EXISTS guards + -- and is never failed here. Best-effort; wrapped like step 1. + BEGIN + EXECUTE pg_catalog.format( + 'UPDATE df.instances i ' + 'SET status = ''failed'', updated_at = pg_catalog.now() ' + 'WHERE i.status IN (''pending'', ''running'') ' + ' AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) ' + ' AND NOT EXISTS (SELECT 1 FROM %1$I.instances d WHERE d.instance_id = i.id) ' + ' AND NOT EXISTS (SELECT 1 FROM %1$I.orchestrator_queue q WHERE q.instance_id = i.id)', + sch) + USING p_grace_seconds; + GET DIAGNOSTICS stuck = ROW_COUNT; + EXCEPTION WHEN OTHERS THEN + stuck := 0; + RAISE WARNING 'pg_durable: reconcile stuck-failover pass failed: %', SQLERRM; + END; - duroxide_orphans_deleted := COALESCE(pg_catalog.array_length(orphan_ids, 1), 0); + duroxide_orphans_deleted := deleted; stuck_instances_failed := stuck; RETURN NEXT; END; diff --git a/src/orchestrations/execute_function_graph.rs b/src/orchestrations/execute_function_graph.rs index fd17722..a652231 100644 --- a/src/orchestrations/execute_function_graph.rs +++ b/src/orchestrations/execute_function_graph.rs @@ -506,16 +506,31 @@ async fn execute_wait_schedule_node( let config: serde_json::Value = serde_json::from_str(config_str) .map_err(|e| format!("Invalid WAIT_SCHEDULE config: {e}"))?; - let wait_seconds = config["wait_seconds"] - .as_u64() - .ok_or_else(|| "WAIT_SCHEDULE missing wait_seconds".to_string())?; - - let cron_expr = config["cron_expr"].as_str().unwrap_or("?"); + let baked_wait_seconds = config["wait_seconds"].as_u64().unwrap_or(0); + let cron_expr = config["cron_expr"].as_str(); + + // Recompute the delay from the orchestration's deterministic clock on each + // execution, so a df.loop around df.wait_for_schedule waits until the NEXT + // cron tick rather than replaying the fixed offset baked at DSL-build time + // (which would make every loop generation wait the same — possibly ~0s — + // interval, busy-looping). ctx.utc_now() is recorded in history, so this + // stays deterministic on replay. Falls back to the baked seconds if the + // clock read or cron parse is unavailable. + let wait = match (cron_expr, ctx.utc_now().await) { + (Some(cron), Ok(now_st)) => { + let now_dt: chrono::DateTime = now_st.into(); + crate::types::calculate_cron_wait_from(cron, now_dt) + .unwrap_or_else(|_| Duration::from_secs(baked_wait_seconds)) + } + _ => Duration::from_secs(baked_wait_seconds), + }; ctx.trace_info(format!( - "Waiting {wait_seconds} seconds until schedule: {cron_expr}" + "Waiting {}s until schedule: {}", + wait.as_secs(), + cron_expr.unwrap_or("?") )); - ctx.schedule_timer(Duration::from_secs(wait_seconds)).await; + ctx.schedule_timer(wait).await; Ok(r#"{"scheduled": true}"#.to_string()) } diff --git a/src/types.rs b/src/types.rs index 28aacbe..961e1fc 100644 --- a/src/types.rs +++ b/src/types.rs @@ -353,25 +353,29 @@ pub fn worker_provider_config( config } -/// Calculate the duration until the next cron schedule match -pub fn calculate_cron_wait(cron_expr: &str) -> Result { +/// Calculate the duration until the next cron schedule match, relative to a +/// caller-supplied `now`. Used by the WAIT_SCHEDULE node executor with the +/// orchestration's deterministic clock so a looped `df.wait_for_schedule` waits +/// until the *next* tick each generation instead of replaying a fixed offset. +pub fn calculate_cron_wait_from(cron_expr: &str, now: DateTime) -> Result { let cron_with_seconds = format!("0 {cron_expr}"); let schedule = CronSchedule::from_str(&cron_with_seconds) .map_err(|e| format!("Invalid cron expression '{cron_expr}': {e}"))?; - let now: DateTime = Utc::now(); - let next = schedule - .upcoming(Utc) + .after(&now) .next() .ok_or_else(|| "No upcoming schedule found".to_string())?; - let duration = (next - now) + (next - now) .to_std() - .map_err(|_| "Failed to calculate wait duration".to_string())?; + .map_err(|_| "Failed to calculate wait duration".to_string()) +} - Ok(duration) +/// Calculate the duration until the next cron schedule match +pub fn calculate_cron_wait(cron_expr: &str) -> Result { + calculate_cron_wait_from(cron_expr, Utc::now()) } /// Evaluate a condition result to determine if it's truthy. diff --git a/src/worker.rs b/src/worker.rs index c9c99d1..2b6d285 100644 --- a/src/worker.rs +++ b/src/worker.rs @@ -219,16 +219,19 @@ async fn run_duroxide_runtime() { mgmt_pool.close().await; } -/// Built-in durable reconciler (Option 4). Ensures exactly one reconciler -/// instance is running per cluster: an infinite durable loop that calls -/// `df.reconcile()` on the `pg_durable.reconciler_cron` schedule. +/// Built-in durable reconciler (Option 4). Ensures one reconciler instance is +/// running per cluster: an infinite durable loop that calls `df.reconcile()` on +/// the `pg_durable.reconciler_cron` schedule. /// /// Idempotent — does nothing if a reconciler is already pending/running, and -/// (re)starts one if it has died. The instance is submitted by a dedicated -/// **non-superuser** role (`pg_durable_reconciler`) granted only df usage plus -/// EXECUTE on the SECURITY DEFINER `df.reconcile()`. Using a non-superuser -/// identity avoids the enable_superuser_instances guard, and bounds the blast -/// radius to "trigger GC" even if the instance were forged. +/// (re)starts one if it has died. Called once per epoch and then periodically +/// from the steady-state poll loop, so a reconciler that fails mid-epoch is +/// restarted within the poll interval rather than only at the next epoch. The +/// instance is submitted by a dedicated **non-superuser** role +/// (`df_reconciler`) granted only df usage plus EXECUTE on the SECURITY DEFINER +/// `df.reconcile()`. Using a non-superuser identity avoids the +/// enable_superuser_instances guard, and bounds the blast radius to "trigger GC" +/// even if the instance were forged. async fn ensure_reconciler(pool: &sqlx::PgPool) { const ROLE: &str = "df_reconciler"; const LABEL: &str = "df_reconciler"; @@ -242,14 +245,20 @@ async fn ensure_reconciler(pool: &sqlx::PgPool) { return; // built-in reconciler disabled } - // Skip if a reconciler is already pending/running (idempotent across epochs; - // self-heals if a previous one failed/was cancelled). Runs as the worker - // role, which bypasses RLS, so it sees the instance regardless of owner. + // Skip if a reconciler is already pending/running (idempotent; self-heals if + // a previous one failed/was cancelled). Runs as the worker role (bypasses + // RLS). The liveness key is the unforgeable submitted_by = df_reconciler + // role, NOT the user-supplied `label`: any df user can call df.start() with + // label 'df_reconciler', so keying on label would let an unprivileged user + // suppress the GC by parking a look-alike instance. A regular user cannot + // submit as df_reconciler (submitted_by = current_user at start time), so + // submitted_by is a sound singleton key. `::text` avoids a regrole cast + // error before the role exists (it simply matches no rows). let already_running: i64 = match sqlx::query_scalar( "SELECT count(*) FROM df.instances \ - WHERE label = $1 AND status IN ('pending', 'running')", + WHERE submitted_by::text = $1 AND status IN ('pending', 'running')", ) - .bind(LABEL) + .bind(ROLE) .fetch_one(pool) .await { @@ -299,6 +308,13 @@ async fn ensure_reconciler(pool: &sqlx::PgPool) { /// Create the dedicated non-superuser reconciler role (if absent) and grant it /// df usage plus EXECUTE on df.reconcile(). Idempotent. +/// +/// The role is `LOGIN` on purpose: the worker executes the reconcile SQL node by +/// opening a real connection AS submitted_by via `connect_as_user` (see +/// types.rs), exactly as for every other durable-function role, so it must be +/// connectable. It is non-superuser and holds only df usage + EXECUTE on +/// df.reconcile(); password/auth exposure is governed by the deployment's +/// pg_hba.conf, the same as any other df submitted_by role. async fn ensure_reconciler_role(pool: &sqlx::PgPool, role: &str) -> Result<(), sqlx::Error> { // role is a fixed compile-time constant, so the format! is injection-safe. sqlx::query(&format!( @@ -734,6 +750,10 @@ async fn run_until_extension_dropped_or_shutdown( log!("pg_durable: epoch sentinel gone — extension dropped or recreated"); break; } + // Self-heal: re-assert the built-in reconciler so a loop that + // died mid-epoch is restarted within the poll interval rather + // than only at the next epoch. Idempotent (no-op while alive). + ensure_reconciler(poll_pool).await; } } } From bbb4fed121b1690edc2559183438bb60f2d50095 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 04:36:25 +0000 Subject: [PATCH 10/20] Enqueue df.signal/df.cancel on the caller's transaction Route df.signal()/df.cancel() through SECURITY DEFINER wrappers (df._enqueue_orchestrator_signal, df._enqueue_orchestrator_cancel) called via SPI, so their _duroxide enqueue commits or rolls back with the caller's transaction, matching the atomic df.start() path. - Work items (ExternalRaised / CancelInstance) are built server-side, so a caller cannot choose the variant or target a foreign instance. - Ownership is authorized via pg_has_role(session_user, submitted_by::oid, 'MEMBER'): session_user is unforgeable under SECURITY DEFINER and membership honors SET ROLE. - Signal preserves the root + running-descendant fan-out via a recursive CTE over parent_instance_id. - Both retain the out-of-band fallback (with a WARNING) when the duroxide-pg SQL surface is absent. Update docs/spec-atomic-start.md to reflect signal/cancel as migrated. --- docs/spec-atomic-start.md | 123 ++++++++++++++++++++++++-------------- src/dsl.rs | 51 ++++++++++++++-- src/lib.rs | 107 +++++++++++++++++++++++++++++++-- 3 files changed, 226 insertions(+), 55 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 5022f92..527583d 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -1,6 +1,6 @@ # Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence -**Status:** Proposal — Option 3 prototyped & validated (POC) +**Status:** Implemented & validated — Option 3 (atomic `df.start`/`df.cancel`/`df.signal`) + Option 4 (reconciler) **Author:** pg_durable team **Date:** June 2026 @@ -104,12 +104,12 @@ backstop alongside Option 3. ### Decision Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source -for backend-initiated starts (and, in a later phase, `df.cancel`); Option 4 is a -lightweight safety net for the residual divergence that no start-time atomicity -can prevent (crashes mid-execution, the best-effort `df.*` mirror, and -not-yet-migrated signal/cancel). Options 1 and 2 are rejected — Option 1 is -disproportionate to the problem, and Option 2 is a strictly weaker variant of -Option 3. +for every backend-initiated cross-store write — `df.start()`, `df.cancel()`, and +`df.signal()` all enqueue on the caller's transaction; Option 4 is a lightweight +safety net for the residual divergence that no in-transaction enqueue can prevent +(crashes mid-execution and the best-effort `df.*` mirror). Options 1 and 2 are +rejected — Option 1 is disproportionate to the problem, and Option 2 is a strictly +weaker variant of Option 3. ## Background: how a start is written today @@ -164,25 +164,27 @@ seams are bounded and fall into two distinct classes. **Backend (session) side — `df.*` on the caller's transaction + an out-of-band duroxide-client enqueue.** This is the class Option 3 fixes. -| Site | `df.*` write (user tx, SPI) | `_duroxide` write (out-of-band pool) | Dual-write? | +| Site | `df.*` write (user tx, SPI) | `_duroxide` write | Now in caller tx? | |------|------|------|------| -| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** — Phase 1 | -| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `cancel_instance` enqueue (src/dsl.rs:955) | **Yes** — Phase 2 | -| `df.signal()` | **none** (only an RLS read of `df.instances`) | `raise_event` + descendant fan-out (src/dsl.rs:616) | **No** — single store | +| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** | +| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `CancelInstance` enqueue | **Yes** | +| `df.signal()` | **none** (only an RLS read of `df.instances`) | `ExternalRaised` + descendant fan-out (src/dsl.rs:616) | **Yes** | Notes: - **`df.cancel()` is a genuine dual-write** with two twists vs `df.start()`: it - enqueues to duroxide **first** and then does the `df.*` UPDATE (opposite - order), and the UPDATE is only an *optimistic* mirror — the authoritative - `cancelled` status is re-applied when the worker processes the cancel via + enqueued to duroxide **first** and then did the `df.*` UPDATE (opposite order), + and the UPDATE is only an *optimistic* mirror — the authoritative `cancelled` + status is re-applied when the worker processes the cancel via `update-instance-status`, so divergence here is usually transient and - self-healing. It still belongs in the same in-transaction treatment for strict - atomicity. + self-healing. It now uses the same in-transaction enqueue as `df.start()`, so + the optimistic UPDATE and the `CancelInstance` enqueue commit or roll back + together. - **`df.signal()` is not a dual-write**: it writes only `_duroxide`. There is no - cross-store inconsistency to create; moving it in-transaction only changes - "don't deliver the signal if my surrounding tx rolls back" semantics — a - nice-to-have, not a consistency fix. + cross-store inconsistency to create; moving it in-transaction changes + "don't deliver the signal if my surrounding tx rolls back" semantics, which is + the least-surprising contract and is applied here for consistency with the + other paths. - Not seams: `df.run()` is a no-op stub; `df.result/status/explain/monitoring` are reads; `df.vars` and `df._worker_epoch` touch only `df.*`. @@ -204,6 +206,9 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job - `df.start()` is **atomic with the caller's transaction**: on rollback, no `df.*` rows *and* no `_duroxide` start item remain; on commit, both are present and the worker picks the instance up only after the `df.*` rows are visible. +- `df.cancel()` and `df.signal()` enqueue on the **caller's transaction** too: a + rolled-back cancel/signal leaves nothing in `_duroxide`, and a committed one is + durable with the caller's other work. - Eliminate the rolled-back-start orphan leak at the source. - Eliminate the `load-function-graph` "wait up to 5 s for the row to appear" race on the atomic path (the retry is retained for the non-atomic fallback). @@ -218,8 +223,6 @@ This is the "best-effort mirror"; any persistent drift is the reconciler's job consistent by design). - Changing short-id generation / the id space. (Independent; can be addressed separately. This spec makes a collision a clean, fully-atomic abort.) -- Migrating `df.signal()` / `df.cancel()` to the in-transaction path — same - mechanism, scoped as a follow-up (see Phasing). ## Design @@ -344,6 +347,34 @@ item carries only the orchestration name/input. The worker continues to authenticate using `df.instances.submitted_by` and re-check the superuser policy (unchanged). No new forgery surface is introduced. +### Cancel and signal wrappers + +`df.cancel()` and `df.signal()` enqueue on the caller's transaction through their +own `SECURITY DEFINER` wrappers (`df._enqueue_orchestrator_cancel`, +`df._enqueue_orchestrator_signal`). Both build the work item server-side +(`CancelInstance` / `ExternalRaised`) via `json_build_object`, so a caller cannot +choose the variant or smuggle a foreign target. + +The start wrapper can authorize structurally (a brand-new `pending` instance is +only reachable for the row the caller just inserted). Cancel and signal target an +**already-committed** instance, so they authorize explicitly: the wrapper checks +`pg_has_role(session_user, df.instances.submitted_by::oid, 'MEMBER')` +(`submitted_by` is a `regrole`, so `::oid` yields the owning role's OID). +`session_user` is unforgeable inside `SECURITY DEFINER` (unlike `current_user`, +which is the definer), and membership — rather than an exact-name match — lets a +role that owns the instance via `SET ROLE` still cancel/signal it. A caller that +is not a member of the owning role is rejected before any enqueue, mirroring the +RLS gate `df.cancel`/`df.signal` already enforce. The wrappers run no +caller-supplied SQL. + +**Signal fan-out.** A signal must reach the root instance **and** every running +sub-orchestration (a descendant may be the one blocked on +`wait_for_signal`). The wrapper recurses `_duroxide.instances.parent_instance_id` +from the root, joins each instance to its current execution, keeps only +executions whose `status` is `Running` (lower-cased), and enqueues one +`ExternalRaised` per surviving instance — all in the caller's transaction, so the +whole fan-out commits or rolls back atomically. + ### Worker readiness and ordering - Keep the existing worker-readiness gate: if `_worker_ready` is absent/stale, @@ -439,6 +470,10 @@ The probe + fallback make this non-breaking for any other provider. | E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 | | JOIN determinism under the new (faster) timing | 5/5 | | Non-superuser (`df_e2e_user`) start path | pass | +| Committed `df.cancel()` cancels; rolled-back `df.cancel()` leaves the instance running and rolls back the queue enqueue | pass | +| Committed `df.signal()` delivers (incl. fan-out to a sub-orchestration in a race); rolled-back `df.signal()` rolls back the enqueue | pass | +| Cancel/signal authz: a non-member caller is denied on the wrapper and via the RLS gate; the owning role (incl. via `SET ROLE`) succeeds | pass | +| E2E: `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 | | `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass | | `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass | | Built-in reconciler auto-starts as non-superuser `df_reconciler`, idempotent (exactly one) | pass | @@ -513,30 +548,26 @@ transaction that later rolls back will **not** run — which is the desired fix, but a behavior change from "fire-and-forget at statement time." This must be called out in `USER_GUIDE.md`. -## Signals & cancellation (follow-up) +## Signals & cancellation (migrated) -See the Dual-write inventory above for the precise classification. In short: +`df.cancel()` and `df.signal()` use the same in-transaction enqueue as +`df.start()` (see Cancel and signal wrappers above). In short: -- **`df.cancel()` is a genuine dual-write** (`df.instances.status` + a - `CancelInstance` enqueue) and should get the same in-transaction treatment as - `df.start()`. It is scoped as **Phase 2** rather than Phase 1 only because it - targets an already-committed instance (so the rolled-back-*start* leak does not - apply) and because the worker's `update-instance-status` activity already - re-applies the authoritative `cancelled` status, making its divergence - transient. The `SECURITY DEFINER` enqueue entrypoint should be generalized to - carry an arbitrary work item (e.g. `client_enqueue(instance, work_item, - visible_at)`) so it serves cancel directly. -- **`df.signal()` is not a dual-write** (it writes only `_duroxide`), so it - creates no cross-store inconsistency. It can ride on the same entrypoint for - "don't deliver if my tx rolls back" semantics, but that is a nice-to-have, not - a consistency fix — lowest priority. +- **`df.cancel()`** — the optimistic `df.instances.status='cancelled'` UPDATE and + the `CancelInstance` enqueue now commit or roll back together on the caller's + transaction. The worker's `update-instance-status` activity still re-applies the + authoritative status, so a worker-side crash remains transient/self-healing. +- **`df.signal()`** — the `ExternalRaised` fan-out (root + running + sub-orchestrations) is enqueued on the caller's transaction, giving "don't + deliver if my tx rolls back" semantics. It writes only `_duroxide`, so this is a + consistency-of-contract change rather than a cross-store fix. ## Complementary backstop — Option 4 (implemented) As part of the chosen approach, a lightweight **reconciler** sweeps for residual -divergence that an atomic start cannot prevent (crashes mid-execution, -not-yet-migrated signal/cancel paths, `status` mirror drift, and legacy/fallback -orphans). It ships as **`df.reconcile(grace_seconds int default 60)`** +divergence that an in-transaction enqueue cannot prevent (crashes mid-execution, +the best-effort `df.*` `status` mirror, and legacy/fallback orphans). It ships as +**`df.reconcile(grace_seconds int default 60)`** (`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row, older than the grace window) via `delete_instances_atomic`, and marks `df.instances` rows stuck @@ -613,13 +644,15 @@ E2E (tests/e2e/sql/) and unit coverage: ## Phasing -1. **Phase 1 (this spec) — prototyped & validated:** atomic `df.start()` via the +1. **Phase 1 — implemented & validated:** atomic `df.start()` via the `SECURITY DEFINER` enqueue entrypoint; readiness gate + legacy fallback; - simplify the `load-function-graph` retry; docs + tests. (POC complete; see - Proof of concept. Remaining for production: upgrade script, dynamic - `_duroxide` schema resolution, entrypoint hardening.) -2. **Phase 2:** apply the same in-transaction enqueue to `df.signal()` / - `df.cancel()`. + simplify the `load-function-graph` retry; docs + tests. Remaining for + production: upgrade script and entrypoint hardening. +2. **Phase 2 — implemented & validated:** the same in-transaction enqueue for + `df.signal()` (root + running-descendant `ExternalRaised` fan-out) and + `df.cancel()` (`CancelInstance`), each via a `SECURITY DEFINER` wrapper that + builds the work item server-side and authorizes the caller via + `pg_has_role(session_user, ...)`. 3. **Phase 3 — reconciler implemented:** `df.reconcile()` plus a built-in durable cron loop (worker-managed `df_reconciler` instance on `pg_durable.reconciler_cron`) are validated (auto-start, idempotency, diff --git a/src/dsl.rs b/src/dsl.rs index de786cb..75c3b42 100644 --- a/src/dsl.rs +++ b/src/dsl.rs @@ -612,9 +612,32 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str, pgrx::error!("Instance not found or access denied: {}", instance_id); } - match raise_external_event(instance_id, signal_name, &signal_data) { - Ok(_) => "OK".to_string(), - Err(e) => pgrx::error!("Failed to send signal: {}", e), + // Enqueue the ExternalRaised work item. Prefer the in-transaction SPI path + // (atomic with the caller's transaction; fans out to running descendants in + // the wrapper) when the duroxide-pg SQL surface is present; otherwise fall + // back to the out-of-band client path. + let schema = crate::types::backend_duroxide_schema(); + if in_tx_enqueue_supported(schema) { + if let Err(e) = Spi::run_with_args( + "SELECT df._enqueue_orchestrator_signal($1, $2, $3)", + &[ + instance_id.into(), + signal_name.into(), + signal_data.as_str().into(), + ], + ) { + pgrx::error!("Failed to send signal: {:?}", e); + } + "OK".to_string() + } else { + pgrx::warning!( + "pg_durable: df.signal() is using the non-atomic fallback enqueue \ + (duroxide-pg SQL surface not detected)" + ); + match raise_external_event(instance_id, signal_name, &signal_data) { + Ok(_) => "OK".to_string(), + Err(e) => pgrx::error!("Failed to send signal: {}", e), + } } } @@ -1015,8 +1038,26 @@ pub fn cancel(instance_id: &str, reason: default!(&str, "'Cancelled by user'")) pgrx::error!("Instance not found or access denied: {}", instance_id); } - if let Err(e) = cancel_durable_function(instance_id, reason) { - return format!("Failed to cancel: {e}"); + // Enqueue the CancelInstance work item and flip the status mirror together. + // On the in-transaction path the enqueue and the status UPDATE commit + // atomically (no df.* / duroxide divergence on cancel); otherwise fall back + // to the out-of-band client path. + let schema = crate::types::backend_duroxide_schema(); + if in_tx_enqueue_supported(schema) { + if let Err(e) = Spi::run_with_args( + "SELECT df._enqueue_orchestrator_cancel($1, $2)", + &[instance_id.into(), reason.into()], + ) { + pgrx::error!("Failed to cancel: {:?}", e); + } + } else { + pgrx::warning!( + "pg_durable: df.cancel() is using the non-atomic fallback enqueue \ + (duroxide-pg SQL surface not detected)" + ); + if let Err(e) = cancel_durable_function(instance_id, reason) { + return format!("Failed to cancel: {e}"); + } } // Update the instance status to 'cancelled' via SPI only when the instance is not diff --git a/src/lib.rs b/src/lib.rs index d7a6b15..a7b2af2 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -430,7 +430,9 @@ DECLARE 'df.explain(text)', 'df.target_database()', -- Atomic-start enqueue wrapper (called by df.start() via SPI) - 'df._enqueue_orchestrator_start(text, text, text)' + 'df._enqueue_orchestrator_start(text, text, text)', + 'df._enqueue_orchestrator_cancel(text, text)', + 'df._enqueue_orchestrator_signal(text, text, text)' ]; BEGIN -- Validate the role exists @@ -671,14 +673,109 @@ REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM requires = ["create_tables"] ); +// ============================================================================ +// In-transaction signal / cancel enqueue (Part 1, extended) — SECURITY DEFINER +// ============================================================================ +// +// df.signal()/df.cancel() enqueue their work items (ExternalRaised / +// CancelInstance) via SPI through these wrappers, inside the caller's +// transaction, instead of out-of-band on the duroxide client pool. Both target +// an already-committed instance, so the start wrapper's "brand-new instance" +// guard does not apply; instead they authorize on ownership. +// +// AUTHORIZATION: these are SECURITY DEFINER (the orchestrator queue is +// owner-only) and granted to every df user via df.grant_usage(), so they must +// not let a caller signal/cancel a foreign instance. current_user is the definer +// here, so ownership is checked against session_user (the unforgeable +// authenticated role) plus membership in the instance's submitted_by — i.e. the +// caller's session must be able to act as the instance owner. The work items are +// built server-side from the arguments (no opaque caller-supplied work item). +extension_sql!( + r#" +CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason text) +RETURNS void +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +DECLARE + sch text := df.duroxide_schema(); + owner_oid oid; +BEGIN + SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id; + IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN + RAISE EXCEPTION 'pg_durable: not authorized to cancel instance %', p_instance_id + USING ERRCODE = 'insufficient_privilege'; + END IF; + + EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch) + USING p_instance_id, + pg_catalog.json_build_object('CancelInstance', + pg_catalog.json_build_object('instance', p_instance_id, 'reason', p_reason))::text, + pg_catalog.now(); +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) FROM PUBLIC; + +CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text, p_data text) +RETURNS void +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +DECLARE + sch text := df.duroxide_schema(); + owner_oid oid; +BEGIN + SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id; + IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN + RAISE EXCEPTION 'pg_durable: not authorized to signal instance %', p_instance_id + USING ERRCODE = 'insufficient_privilege'; + END IF; + + -- Raise the event for the target instance and every RUNNING descendant + -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one + -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema. + EXECUTE pg_catalog.format( + 'INSERT INTO %1$I.orchestrator_queue (instance_id, work_item, visible_at, created_at) ' + 'SELECT t.instance_id, ' + ' pg_catalog.json_build_object(''ExternalRaised'', ' + ' pg_catalog.json_build_object(''instance'', t.instance_id, ''name'', $2, ''data'', $3))::text, ' + ' pg_catalog.now(), pg_catalog.now() ' + 'FROM ( ' + ' WITH RECURSIVE tree AS ( ' + ' SELECT i.instance_id, i.current_execution_id, true AS is_root ' + ' FROM %1$I.instances i WHERE i.instance_id = $1 ' + ' UNION ' + ' SELECT c.instance_id, c.current_execution_id, false ' + ' FROM %1$I.instances c JOIN tree p ON c.parent_instance_id = p.instance_id ' + ' ) ' + ' SELECT tr.instance_id ' + ' FROM tree tr ' + ' LEFT JOIN %1$I.executions e ' + ' ON e.instance_id = tr.instance_id AND e.execution_id = tr.current_execution_id ' + ' WHERE tr.is_root OR pg_catalog.lower(COALESCE(e.status, '''')) = ''running'' ' + ') t', + sch) + USING p_instance_id, p_name, p_data; +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM PUBLIC; +"#, + name = "atomic_signal_cancel_enqueue", + requires = ["create_tables"] +); + // ============================================================================ // Reconciler (Option 4) — repair residual df.* / duroxide divergence // ============================================================================ // -// Atomic df.start() prevents the rolled-back-start orphan leak at the source, -// but some divergence can still arise: legacy orphans, the non-atomic fallback -// path, df.signal()/df.cancel() (not yet migrated), and crashes mid-execution. -// df.reconcile() is a best-effort garbage collector for that residue. It is +// Atomic df.start()/df.signal()/df.cancel() commit their duroxide enqueue with +// the caller's transaction, but some divergence can still arise: legacy orphans, +// the non-atomic fallback path, and crashes mid-execution. df.reconcile() is a +// best-effort garbage collector for that residue. It is // admin-only (EXECUTE revoked from PUBLIC) and SECURITY DEFINER so it can touch // the duroxide-owned tables and all users' df.instances. // From bf10f6f5dcf3dc52e4315a1e325111e05d99ff96 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 14:52:23 +0000 Subject: [PATCH 11/20] Rewrite atomic-start spec: align with implementation, tighten, de-jargon The spec had drifted from the shipped code (Rust-side work-item construction, a duroxide-pg wrapper placement that never landed, stale df.__enqueue_start naming, POC/phasing framing) and triplicated the reconciler and signal/cancel descriptions. Rewrite end-to-end against reality: server-side work-item build in the df-schema SECURITY DEFINER wrappers, start authorized by instance state and cancel/signal by pg_has_role(session_user, ...), the actual worker reconciler DSL, and a single prevent/repair structure. Drop POC language and dead jargon. 676 -> 368 lines; no code changes. --- docs/spec-atomic-start.md | 984 +++++++++++++------------------------- 1 file changed, 338 insertions(+), 646 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 527583d..17f64e7 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -1,676 +1,368 @@ -# Spec: Atomic `df.start()` — Eliminating `df.*` ↔ `_duroxide` Divergence +# Keeping the control plane and the runtime consistent -**Status:** Implemented & validated — Option 3 (atomic `df.start`/`df.cancel`/`df.signal`) + Option 4 (reconciler) +**Status:** Implemented and validated. Atomic `df.start`/`df.cancel`/`df.signal`, +plus a reconciler that repairs leftover drift. **Author:** pg_durable team **Date:** June 2026 -## Overview - -`df.start()` writes to two independent stores that are **not** committed together: - -- the **pg_durable control plane** (`df.nodes`, `df.instances`) — written on the - caller's backend transaction via SPI; and -- the **duroxide runtime** (`_duroxide.orchestrator_queue`) — enqueued - out-of-band on a *separate* sqlx connection pool, committed independently. - -Because the two halves are not atomic, the stores can diverge. The most common -failure is the "rolled-back start" leak: a `df.start()` whose surrounding -transaction is rolled back (or whose batch subtransaction aborts on a PK -collision) leaves an **orphaned orchestration** in `_duroxide` with no -corresponding `df.*` rows. The background worker then repeatedly tries to load a -graph that does not exist, fails after a 5 s timeout, and the orphan persists in -`_duroxide.instances`. - -Four designs were explored to keep the two stores consistent (see **Design -options considered** below). This spec **implements Option 3 (primary) plus -Option 4 (complementary backstop)**: make the duroxide start-enqueue part of the -caller's backend transaction so `df.start()` is atomic — either everything -commits or nothing does — and add an asynchronous reconciler to repair any -residual drift that start-time atomicity cannot prevent. - -### Evidence (observed during investigation) - -- A `df.start()` executed inside a transaction that was then `ROLLBACK`-ed left a - row in `_duroxide.instances` (e.g. `48819f12`) with **zero** rows in - `df.instances`. The orphan survived the rollback. -- Under load, orphaned start items self-sustain via retry and saturate the - 2-thread worker with 5 s "Instance not found … transaction may have been - rolled back" failures. -- The short-id PK-collision path (8 hex chars ≈ 2³²) reliably aborts `df.start()` - well under 1M instances (observed first-collision draws: 9248, 34845, 40445, - 60776). That abort is *safe* today (it happens before the duroxide enqueue), - but it exercises the same dual-write seam. - -## Design options considered - -Four approaches were explored for keeping the `df.*` control plane and the -`_duroxide` runtime consistent. The mechanics of the chosen approach are detailed -in **Design** below; this section records all four and the decision. - -**Option 1 — Single source of truth in `_duroxide`.** Drop the `df.instances` / -`df.nodes` control tables entirely, keep all instance state in `_duroxide`, -serialize the graph into the orchestration input, and have duroxide expose -whatever state pg_durable needs (via views / read APIs). -*Guarantee:* total — there is only one store, so nothing can diverge. -*Cost:* large — moves the `submitted_by` security boundary, requires rebuilding -per-user RLS over duroxide-owned tables, re-exposing every `df.*` read surface, -and a destructive migration with B1 backward-compat implications. A possible -long-term direction, not this change. - -**Option 2 — Hand duroxide-pg our SQL to run inside *its* transaction.** Keep -both schemas; extend duroxide-pg so the start-enqueue also executes the -pg_durable `df.*` writes in the same (sqlx, worker-pool) transaction. -*Guarantee:* `df.*` ↔ `_duroxide` atomic — but on the worker-pool connection, so -`df.start()` stops participating in the *caller's* transaction and the `df.*` -writes run as the pool role, breaking `current_user` / `submitted_by` capture and -RLS. Strictly weaker than Option 3. - -**Option 3 — Hand duroxide-pg *our* (the caller's) transaction.** Keep both -schemas; run duroxide's start-enqueue inside the caller's backend transaction via -SPI, alongside the `df.*` writes. -*Guarantee:* strongest — `df.nodes` + `df.instances` + the queue row + the -caller's surrounding statements all commit or roll back together; identity -capture and the `df.*` read surface are untouched; the 5 s load race disappears -on the atomic path. -*Cost:* small — the start is a single SQL-function call — with one wrinkle: a -`SECURITY DEFINER` enqueue entrypoint is needed because the queue INSERT is -owner-only. **Chosen as the primary fix.** - -**Option 4 — Tolerate divergence; repair asynchronously.** Accept a transient -window and add a reconciler — ideally a built-in durable function — that detects -and repairs mismatches (orphaned `_duroxide` instances, stuck `df.instances` -rows, `status`-mirror drift). -*Guarantee:* none added; eventual consistency with bounded repair latency. -*Value:* the only option that also catches divergence from crashes -*mid-execution* and from paths Option 3 does not cover (the worker-side mirror -activities, and signal/cancel until they are migrated). **Chosen as a -complementary backstop.** - -### Comparison - -| Option | Atomic with `df.*` | Atomic with caller tx | Kills 5 s load race | Fixes crash-time drift | Effort | +## The problem + +A durable function lives in two stores that are written separately: + +- the **pg_durable control plane** — `df.nodes` and `df.instances`, written on the + caller's transaction; and +- the **duroxide runtime** — `_duroxide.orchestrator_queue`, where the orchestration + is enqueued. + +Originally `df.start()` wrote the first on the caller's transaction but enqueued the +second out-of-band, on a separate connection pool that committed on its own. The two +were never atomic, so they could disagree. + +The common failure was an **orphaned orchestration**: a `df.start()` whose +surrounding transaction rolled back (or whose batch aborted on a primary-key +collision) left a row in `_duroxide` with no matching `df.*` rows. The background +worker then kept trying to load a graph that did not exist, failed after a 5-second +timeout, and the orphan stayed behind. Under load these orphans accumulated and +saturated the worker's two threads. + +We observed this directly: a `df.start()` inside a rolled-back transaction left +`_duroxide.instances` row `48819f12` with zero rows in `df.instances`, surviving the +rollback. + +## What we did + +Two complementary changes: + +1. **Prevent** the drift at the source. `df.start()`, `df.cancel()`, and + `df.signal()` now enqueue their runtime work on the **caller's transaction**, so + the control-plane writes and the runtime enqueue commit or roll back together. +2. **Repair** what prevention cannot cover. A reconciler periodically deletes + leftover orphans and marks stuck instances failed — catching drift from crashes + mid-execution and from the legacy fallback path. + +Both are pg_durable-only. No `duroxide` / `duroxide-pg` or `Cargo.*` changes. + +## Designs we considered + +Four approaches were weighed for keeping the two stores in agreement. + +**Option 1 — One store.** Drop `df.nodes` / `df.instances` entirely and keep all +state in `_duroxide`, exposing what pg_durable needs through views. +*Strongest guarantee* (only one store can't diverge) but the most disruptive: it +moves the `submitted_by` security boundary, rebuilds per-user row security over +runtime-owned tables, re-exposes every read path, and needs a destructive migration. +A possible long-term direction, not this change. + +**Option 2 — Run our writes inside duroxide's transaction.** Extend duroxide-pg so +the enqueue also performs the `df.*` writes, on duroxide's worker-pool connection. +This makes the two stores atomic — but on the *worker's* connection, so `df.start()` +no longer joins the *caller's* transaction and the `df.*` writes run as the pool +role, breaking identity capture and row security. Strictly worse than Option 3. + +**Option 3 — Run duroxide's enqueue inside the caller's transaction.** Keep both +stores; call the runtime's enqueue over SPI on the caller's transaction, next to the +`df.*` writes. Everything commits or rolls back together, identity capture and the +read paths are untouched, and the 5-second load race disappears. The only cost is a +`SECURITY DEFINER` wrapper, because the runtime queue is writable by its owner only. +**Chosen as the primary fix.** + +**Option 4 — Tolerate drift, repair it later.** Accept a brief inconsistency window +and add a reconciler that finds and fixes mismatches. Adds no guarantee on its own, +but it is the only option that also catches drift from crashes *mid-execution* — +something no start-time atomicity can prevent. **Chosen as a backstop.** + +| Option | Atomic with `df.*` | Atomic with caller's tx | Removes 5 s race | Fixes crash-time drift | Effort | |--------|:--:|:--:|:--:|:--:|:--:| -| 1 — single source of truth in `_duroxide` | n/a (one store) | no¹ | yes | yes | Large | -| 2 — our SQL in duroxide's tx | yes | **no** | yes | no | Medium | -| 3 — enqueue in caller's tx (SPI) | yes | **yes** | **yes** | no | Small–Med | -| 4 — async GC / reconciler | repairs | — | no | **yes** | Small–Med | - -¹ Option 1 also decouples from the caller's transaction unless the enqueue is -itself done via SPI — at which point its start path converges with Option 3. -"Fixes crash-time drift" = repairs divergence from a crash *mid-execution* (not -just at start); only Options 1 and 4 do this, which is why Option 4 is kept as a -backstop alongside Option 3. - -### Decision - -Implement **Option 3 + Option 4**. Option 3 removes the dual-write at the source -for every backend-initiated cross-store write — `df.start()`, `df.cancel()`, and -`df.signal()` all enqueue on the caller's transaction; Option 4 is a lightweight -safety net for the residual divergence that no in-transaction enqueue can prevent -(crashes mid-execution and the best-effort `df.*` mirror). Options 1 and 2 are -rejected — Option 1 is disproportionate to the problem, and Option 2 is a strictly -weaker variant of Option 3. - -## Background: how a start is written today - -`crate::dsl::start()` (src/dsl.rs) executes, in order, on the **caller's** -transaction unless noted: - -1. `instance_id = short_id()` (src/dsl.rs:645). -2. Read-only validation (database exists, caller has `LOGIN`, superuser policy). -3. `insert_nodes()` → INSERT rows into `df.nodes` (src/dsl.rs:842). -4. INSERT row into `df.instances` (src/dsl.rs:888). -5. Capture `df.vars` snapshot. -6. `start_durable_function()` → `Client::start_orchestration()` → - `store.enqueue_for_orchestrator(StartOrchestration)` (src/dsl.rs:923) — runs on - the cached `DUROXIDE_CLIENT` sqlx pool (src/client.rs), **separate - connection, independently committed.** Failures here are logged, not raised - (src/dsl.rs:928). - -Key facts that make the fix small: - -- **The start is a single enqueue.** `Client::start_orchestration` builds one - `WorkItem::StartOrchestration` and calls `enqueue_for_orchestrator` - (duroxide `client/mod.rs`). The duroxide-pg provider implements that as one - call to the SQL function `_duroxide.enqueue_orchestrator_work(...)`, whose body - is a single `INSERT INTO _duroxide.orchestrator_queue`. The - `_duroxide.instances`/`history` rows are materialized **later** by the worker. -- **The atomic unit is therefore just** `{df.nodes INSERTs, df.instances INSERT, - one orchestrator_queue INSERT}`. -- **duroxide-pg is SQL-function based.** Every provider operation is a function - in the `_duroxide` schema, so the enqueue can be invoked directly via SPI on - the caller's transaction — no async pool required for the start path. - -### What each store owns (so we know what must *not* change) +| 1 — one store | n/a | no | yes | yes | Large | +| 2 — our writes in duroxide's tx | yes | **no** | yes | no | Medium | +| 3 — enqueue in caller's tx | yes | **yes** | **yes** | no | Small | +| 4 — async reconciler | repairs | — | no | **yes** | Small | + +We implemented **Option 3 (prevent) plus Option 4 (repair)**. Options 1 and 2 were +rejected: Option 1 is out of proportion to the problem, and Option 2 is a weaker +version of Option 3. + +## How a start is written + +`df.start()` (`src/dsl.rs`) runs these steps on the caller's transaction: + +1. Pick an instance id (`short_id()` — eight hex characters). +2. Validate (database exists, caller can log in, superuser policy). +3. Insert the graph rows into `df.nodes`. +4. Insert the instance row into `df.instances`. +5. Snapshot the caller's `df.vars`. +6. Enqueue the orchestration. + +The fix is small because the start is **a single enqueue**: one +`StartOrchestration` row in `_duroxide.orchestrator_queue`. The worker builds the +`_duroxide.instances` and history rows later. And because duroxide-pg exposes every +operation as a SQL function in the `_duroxide` schema, that enqueue can run over SPI +on the caller's transaction — no separate pool needed. + +### What each store owns + +The fix must not move any of these boundaries. | Data | Source of truth | Notes | |------|-----------------|-------| -| Graph (`df.nodes`) | `df.*` | duroxide does not store it; the worker loads it via the `load-function-graph` activity. | -| Execution history | `_duroxide.history` | duroxide-internal; transactional per turn via `ack_orchestration_item`. | -| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | needed by the worker to load + authenticate. | -| `df.instances.status` | best-effort mirror | written by the `update-instance-status` activity; not atomic with duroxide, self-heals. | -| `df.nodes.status/result/error` | best-effort mirror | written by the `update-node-status` activity; same properties as above. | - -`submitted_by` is a **security boundary**: it is captured as `current_user` at -`df.start()` time and pinned by a composite FK; the worker re-checks the -superuser policy (src/activities/load_function_graph.rs). Any change must keep it -unforgeable. - -### Dual-write inventory - -Auditing every `df.*` writer against every duroxide-client call, the cross-store -seams are bounded and fall into two distinct classes. - -**Backend (session) side — `df.*` on the caller's transaction + an out-of-band -duroxide-client enqueue.** This is the class Option 3 fixes. - -| Site | `df.*` write (user tx, SPI) | `_duroxide` write | Now in caller tx? | -|------|------|------|------| -| `df.start()` | INSERT `df.nodes` + `df.instances` (src/dsl.rs:842,888) | enqueue `StartOrchestration` (src/dsl.rs:923) | **Yes** | -| `df.cancel()` | UPDATE `df.instances.status='cancelled'` (src/dsl.rs:966) | `CancelInstance` enqueue | **Yes** | -| `df.signal()` | **none** (only an RLS read of `df.instances`) | `ExternalRaised` + descendant fan-out (src/dsl.rs:616) | **Yes** | - -Notes: - -- **`df.cancel()` is a genuine dual-write** with two twists vs `df.start()`: it - enqueued to duroxide **first** and then did the `df.*` UPDATE (opposite order), - and the UPDATE is only an *optimistic* mirror — the authoritative `cancelled` - status is re-applied when the worker processes the cancel via - `update-instance-status`, so divergence here is usually transient and - self-healing. It now uses the same in-transaction enqueue as `df.start()`, so - the optimistic UPDATE and the `CancelInstance` enqueue commit or roll back - together. -- **`df.signal()` is not a dual-write**: it writes only `_duroxide`. There is no - cross-store inconsistency to create; moving it in-transaction changes - "don't deliver the signal if my surrounding tx rolls back" semantics, which is - the least-surprising contract and is applied here for consistency with the - other paths. -- Not seams: `df.run()` is a no-op stub; `df.result/status/explain/monitoring` - are reads; `df.vars` and `df._worker_epoch` touch only `df.*`. - -**Worker side — `df.*` mirror maintained by activities, on a connection separate -from duroxide's history ack.** Option 3 does **not** address this class. - -| Activity | `df.*` write | Property | -|----------|--------------|----------| -| `update-instance-status` | UPDATE `df.instances.status` | eventually consistent, self-healing | -| `update-node-status` | UPDATE `df.nodes.status/result/error` | eventually consistent, self-healing | - -These are at-least-once activities re-applied on replay, so a crash between the -duroxide ack and the mirror update is transient lag, not permanent divergence. -This is the "best-effort mirror"; any persistent drift is the reconciler's job -(see Complementary backstop), not Option 3's. - -## Goals - -- `df.start()` is **atomic with the caller's transaction**: on rollback, no - `df.*` rows *and* no `_duroxide` start item remain; on commit, both are present - and the worker picks the instance up only after the `df.*` rows are visible. -- `df.cancel()` and `df.signal()` enqueue on the **caller's transaction** too: a - rolled-back cancel/signal leaves nothing in `_duroxide`, and a committed one is - durable with the caller's other work. -- Eliminate the rolled-back-start orphan leak at the source. -- Eliminate the `load-function-graph` "wait up to 5 s for the row to appear" - race on the atomic path (the retry is retained for the non-atomic fallback). -- No change to the `df.*` schema's read surface (status/result/monitoring/explain - keep working unchanged). - -## Non-goals - -- Collapsing `df.*` into `_duroxide` (Option 1) — out of scope; tracked - separately as a possible long-term direction. -- Fixing the best-effort `df.instances.status` mirror (it is eventually - consistent by design). -- Changing short-id generation / the id space. (Independent; can be addressed - separately. This spec makes a collision a clean, fully-atomic abort.) - -## Design - -### Mechanism - -In `crate::dsl::start()`, replace step 6 (the out-of-band -`start_durable_function` pool call) with an **SPI call** that enqueues the start -work item on the caller's transaction, immediately after the `df.instances` -INSERT: +| Graph (`df.nodes`) | `df.*` | The worker loads it; duroxide does not store it. | +| Execution history | `_duroxide.history` | Runtime-internal; transactional per turn. | +| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | The worker needs it to load and authenticate. | +| `df.instances.status` | best-effort mirror | Written by the worker; self-heals, not atomic with the runtime. | +| `df.nodes.status/result/error` | best-effort mirror | Same as above. | + +`submitted_by` is a **security boundary**: it is captured as the calling role at +start time and must stay unforgeable. The worker re-checks the superuser policy when +it loads the graph. + +### Where the two stores were written together + +Auditing every writer, the cross-store paths are bounded. All three now enqueue on +the caller's transaction: + +| Caller | `df.*` write | `_duroxide` write | On caller's tx now? | +|--------|------|------|:--:| +| `df.start()` | insert `df.nodes` + `df.instances` | enqueue `StartOrchestration` | yes | +| `df.cancel()` | update `df.instances.status` | enqueue `CancelInstance` | yes | +| `df.signal()` | none (only a row-security read) | enqueue `ExternalRaised` (+ fan-out) | yes | + +- **`df.cancel()`** was a genuine dual-write: it enqueued the cancel, then updated + the status mirror. The status update is only an optimistic hint — the worker + re-applies the authoritative `cancelled` status when it processes the cancel — so + any disagreement was already transient. Both now commit together. +- **`df.signal()`** writes only `_duroxide`, so it never created cross-store drift. + Moving it onto the caller's transaction simply gives it the same rule as the + others: *if my transaction rolls back, the signal is not delivered.* + +The worker also keeps the `df.*` status columns in sync as activities run. Those +writes are eventually consistent and self-healing by design; the reconciler cleans +up any lasting drift. + +## The mechanism + +Step 6 of `df.start()` is now an SPI call on the caller's transaction, right after +the `df.instances` insert: ```text --- still on the caller's transaction (SPI): -INSERT INTO df.nodes ... -- step 3 (unchanged) -INSERT INTO df.instances ... -- step 4 (unchanged) -SELECT df.__enqueue_start($instance_id, $work_item_json, $visible_at); -- step 6 (new) +INSERT INTO df.nodes ... -- step 3 +INSERT INTO df.instances ... -- step 4 +SELECT df._enqueue_orchestrator_start($id, $name, $input); -- step 6 (SPI) ``` -Because the enqueue is now an SPI statement, it shares the caller's transaction: -rollback removes the queue row along with the `df.*` rows; commit makes them -visible together. - -### The work item - -pg_durable already depends on the `duroxide` crate, so it should **construct and -serialize the real `WorkItem` type** rather than hand-rolling JSON, guaranteeing -wire compatibility: - -```rust -let item = duroxide::providers::WorkItem::StartOrchestration { - instance: instance_id.clone(), - orchestration: orchestrations::execute_function_graph::NAME.into(), - input: input_json, // existing FunctionInput JSON - version: None, // runtime resolves from registry - parent_instance: None, - parent_id: None, - execution_id: duroxide::INITIAL_EXECUTION_ID, // = 1 -}; -let work_item_json = serde_json::to_string(&item)?; -``` +Because the enqueue is an ordinary SPI statement, it shares the caller's +transaction: a rollback drops the queue row along with the `df.*` rows, and a commit +makes them visible together. `df.cancel()` and `df.signal()` work the same way +through their own wrappers. + +### Why a privileged wrapper is needed + +`_duroxide.orchestrator_queue` is writable by its owner only (the worker role), so a +normal caller cannot insert into it. The enqueue therefore runs through a +`SECURITY DEFINER` wrapper in the `df` schema, owned by the extension owner. Each +wrapper pins `search_path` and schema-qualifies every reference (per CVE-2018-1058, +enforced by `scripts/pgspot-gate.sh`). + +The wrappers resolve the runtime schema at call time via `df.duroxide_schema()` +(`_duroxide` on current installs, legacy `duroxide` on older ones) and quote it +with `%I`. They are revoked from `PUBLIC` and granted through `df.grant_usage()`. + +Because a `SECURITY DEFINER` function runs as its owner, the wrappers cannot trust +the caller. Two safeguards make them safe to grant to every user: + +- **The work item is built inside the wrapper**, from the caller's arguments — never + passed in. A caller cannot choose the work-item type or smuggle in a foreign + target. (`df.start` builds `StartOrchestration`, `df.cancel` builds + `CancelInstance`, `df.signal` builds `ExternalRaised`, each with + `json_build_object`, matching duroxide's wire format.) +- **The caller is authorized before any enqueue**, by two different rules: + - *Start* targets a brand-new instance, so it authorizes by *state*: it permits + the enqueue only for a `pending` `df.instances` row that has no queue entry and + no runtime instance yet. Under the atomic path a committed instance always has + its queue row in the same transaction, so this state is reachable only for the + row the caller just inserted — never someone else's instance. + - *Cancel* and *signal* target an already-committed instance, so they authorize by + *ownership*: `pg_has_role(session_user, , 'MEMBER')`. + `session_user` is the real authenticated role (it cannot be spoofed inside a + `SECURITY DEFINER` function), and checking membership rather than an exact name + means a role that owns the instance through `SET ROLE` still qualifies. A + non-member is rejected before anything is enqueued — the same gate `df.cancel` / + `df.signal` already enforce through row security. + +### Signal fan-out + +A signal must reach the root instance **and** every running sub-orchestration, +because a child (a `JOIN`/`RACE` branch or a loop generation) may be the one waiting +on it. The `df.signal` wrapper walks the instance tree from the root +(`_duroxide.instances.parent_instance_id`), keeps the instances whose current +execution is still running, and enqueues one `ExternalRaised` per surviving +instance — all on the caller's transaction, so the whole fan-out is atomic. -`WorkItem` is externally tagged (default serde), i.e. -`{"StartOrchestration":{...}}`. These are exactly the values the current -`Client::start_orchestration` path uses, so worker behavior is unchanged. +### Worker readiness and ordering -The enqueue target is the existing function (body is a single INSERT; the -`p_orchestration_name/version/execution_id` params are accepted but unused — the -work item is self-contained): +- The existing readiness gate still applies: if the worker has not initialized the + runtime schema yet, `df.start()` fails clearly (and atomically — the `df.*` + inserts roll back). +- The enqueue is the last write in `df.start()`. A primary-key collision on + `df.nodes` / `df.instances` still aborts before it, now cleanly inside one + transaction. -```sql -_duroxide.enqueue_orchestrator_work( - p_instance_id text, - p_work_item text, -- work_item_json - p_visible_at timestamptz, -- now() for immediate start - p_orchestration_name text DEFAULT NULL, - p_orchestration_version text DEFAULT NULL, - p_execution_id bigint DEFAULT NULL) -``` +### The 5-second load race is gone (on the atomic path) + +The worker's `load-function-graph` activity used to wait up to five seconds for the +`df.*` rows to appear, because it could dequeue a start before they committed. On +the atomic path the queue row becomes visible only *after* those rows commit in the +same transaction, so that wait can't trigger. The retry is kept for the +non-atomic fallback path, where it remains a no-op on the atomic path. + +## The reconciler + +`df.reconcile(p_grace_seconds integer DEFAULT 60)` is a `SECURITY DEFINER`, +admin-only function that repairs leftover drift: -### Privilege model (the one real wrinkle) - -`_duroxide.orchestrator_queue` grants INSERT to its **owner only** (the -`pg_durable.worker_role`; `relacl` is empty → no PUBLIC), and -`_duroxide.enqueue_orchestrator_work` is `SECURITY INVOKER`. `df.start()` is also -`SECURITY INVOKER` and runs SPI as the calling user. Therefore a normal caller -cannot insert into the queue directly — confirmed: for a fresh `LOGIN` role, -`has_table_privilege(... 'INSERT') = false`. - -The enqueue must run through a **`SECURITY DEFINER`** entrypoint owned by a role -that owns the `_duroxide` queue tables. Two placements: - -- **Recommended — provide it in duroxide-pg** (proper layering; pg_durable never - touches `_duroxide` internals). Add a stable, documented `SECURITY DEFINER` - client-enqueue function in the `_duroxide` schema, owned by the schema owner - (`worker_role`), e.g. `_duroxide.client_enqueue_start(p_instance_id text, - p_work_item text, p_visible_at timestamptz)`. It performs the single queue - INSERT and is `GRANT EXECUTE`-ed to the roles allowed to start instances. - Applied by the BGW migration runner (`ApplyAll`); no pg_durable extension - upgrade script needed (consistent with docs/bgw-applies-migrations.md). -- **Fallback — a `df` wrapper in pg_durable.** A `df.__enqueue_start(...)` - `SECURITY DEFINER` function owned by the extension owner that calls - `_duroxide.enqueue_orchestrator_work(...)`. Works in the common configuration - where the extension owner is a superuser (or otherwise owns/【has INSERT on】the - queue). In hardened multi-role deployments where `_duroxide` is owned by a - non-superuser `worker_role` distinct from the extension owner, this requires - the `worker_role` to `GRANT INSERT` on the queue tables (and `EXECUTE` on the - enqueue fn) to the wrapper's owner. The duroxide-pg placement avoids this - coordination. - -Either wrapper **must** pin `search_path` and schema-qualify every reference -(CVE-2018-1058; enforced by the pgspot gate, scripts/pgspot-gate.sh): +- It deletes orphaned **root** runtime instances — those with no parent, no matching + `df.instances` row, and older than the grace window — gathering each orphan's full + subtree so the delete is accepted. Sub-orchestrations (branches and loop + generations) have no `df.instances` row of their own and are excluded by the root + filter, so they are never collected on their own. +- It marks `df.instances` rows that are stuck non-terminal — no live runtime + instance and no queued start — as `failed`. + +Each pass is wrapped so a single failure never aborts the reconcile or stops the +loop. + +The background worker runs the reconciler as a **built-in durable cron loop**, +keeping exactly one instance per cluster. It reads the schedule, connects as the +`df_reconciler` role (`SET ROLE`), and starts: ```sql -CREATE FUNCTION _duroxide.client_enqueue_start(...) -RETURNS void -LANGUAGE plpgsql -SECURITY DEFINER -SET search_path = pg_catalog -AS $$ BEGIN - INSERT INTO _duroxide.orchestrator_queue (instance_id, work_item, visible_at, created_at) - VALUES (p_instance_id, p_work_item, p_visible_at, pg_catalog.now()); -END $$; +-- $1 = the pg_durable.reconciler_cron value, $2 = 'df_reconciler' +SELECT df.start( + df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule($1))), + $2); ``` -The wrapper takes the caller's `instance_id`, orchestration name, and input -string; it **constructs the `StartOrchestration` work item server-side** (so the -caller cannot choose the work-item variant or target a foreign instance) and -**authorizes** the enqueue only for a brand-new, not-yet-started instance -(`pending` `df.instances` row, no orchestrator-queue entry, no duroxide -instance — under atomic-start semantics reachable only for the row `df.start` -just inserted in the current transaction). It executes no caller-supplied SQL. -This is what prevents the wrapper from being a generic, cross-tenant -arbitrary-work-item enqueue primitive (which would otherwise let any df user -forge `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another -user's instance, bypassing the RLS gate `df.signal`/`df.cancel` enforce). - -### Identity / security - -`submitted_by` capture is unchanged: it is still read as `current_user` inside -`df.start()` (`SECURITY INVOKER`) **before** the enqueue and stored in -`df.instances`. The `SECURITY DEFINER` wrapper does not touch identity; the work -item carries only the orchestration name/input. The worker continues to -authenticate using `df.instances.submitted_by` and re-check the superuser policy -(unchanged). No new forgery surface is introduced. - -### Cancel and signal wrappers - -`df.cancel()` and `df.signal()` enqueue on the caller's transaction through their -own `SECURITY DEFINER` wrappers (`df._enqueue_orchestrator_cancel`, -`df._enqueue_orchestrator_signal`). Both build the work item server-side -(`CancelInstance` / `ExternalRaised`) via `json_build_object`, so a caller cannot -choose the variant or smuggle a foreign target. - -The start wrapper can authorize structurally (a brand-new `pending` instance is -only reachable for the row the caller just inserted). Cancel and signal target an -**already-committed** instance, so they authorize explicitly: the wrapper checks -`pg_has_role(session_user, df.instances.submitted_by::oid, 'MEMBER')` -(`submitted_by` is a `regrole`, so `::oid` yields the owning role's OID). -`session_user` is unforgeable inside `SECURITY DEFINER` (unlike `current_user`, -which is the definer), and membership — rather than an exact-name match — lets a -role that owns the instance via `SET ROLE` still cancel/signal it. A caller that -is not a member of the owning role is rejected before any enqueue, mirroring the -RLS gate `df.cancel`/`df.signal` already enforce. The wrappers run no -caller-supplied SQL. - -**Signal fan-out.** A signal must reach the root instance **and** every running -sub-orchestration (a descendant may be the one blocked on -`wait_for_signal`). The wrapper recurses `_duroxide.instances.parent_instance_id` -from the root, joins each instance to its current execution, keeps only -executions whose `status` is `Running` (lower-cased), and enqueues one -`ExternalRaised` per surviving instance — all in the caller's transaction, so the -whole fan-out commits or rolls back atomically. - -### Worker readiness and ordering +`worker::ensure_reconciler` starts it, skips if one is already pending or running, +and restarts it if it died. It is driven by the `pg_durable.reconciler_cron` setting +(default `*/5 * * * *`; empty disables it) and submitted by a dedicated +**non-superuser** role, `df_reconciler` (so `submitted_by` is `df_reconciler`). A +non-superuser identity keeps it clear of the superuser-instance guard and limits what +the instance can do to "run the reconciler" even if it were somehow forged. -- Keep the existing worker-readiness gate: if `_worker_ready` is absent/stale, - `df.start()` errors as it does today (the `_duroxide` schema may not yet - exist). This check moves from "before acquiring the client" to "before the SPI - enqueue." -- The enqueue is the **last** write in `df.start()`; a PK collision on - `df.nodes`/`df.instances` still aborts before it (no behavior change for the - collision case, now fully inside one transaction). - -### Consequence: the 5 s load race disappears (on the atomic path) - -`load-function-graph` polls for up to `MAX_WAIT_SECS = 5` because the worker -could dequeue a start item before the caller's `df.*` rows committed -(src/activities/load_function_graph.rs:21,60). On the **atomic path** a queue row -becomes visible to the worker **only** after the `df.*` rows commit in the same -transaction, so the "not yet visible / rolled back" branch cannot occur. The -retry is **retained** rather than removed, because the non-atomic **fallback** -path can still enqueue before the `df.*` rows commit; it is a no-op on the atomic -path. So this removes the worker-saturation failure mode for atomic starts while -staying correct for the fallback. - -## Proof of concept (validated) - -Option 3 was prototyped end-to-end and validated on PostgreSQL 17 (branch -`tjgreen42/atomic-start`, ~80 lines across 3 files). - -**What was built:** - -- `df._enqueue_orchestrator_start(text, text)` — a `SECURITY DEFINER` wrapper - (`SET search_path = pg_catalog`) that performs the privileged - `.enqueue_orchestrator_work(instance, work_item, now())` INSERT. The - duroxide schema is **resolved dynamically** via `df.duroxide_schema()` - (`_duroxide` fresh / legacy `duroxide`) and quoted with `%I`. EXECUTE is - revoked from PUBLIC and granted through `df.grant_usage()`. -- `df.start()` **probes** for the duroxide-pg SQL surface - (`enqueue_orchestrator_work` in the resolved schema). When present it builds - the real `duroxide::providers::WorkItem::StartOrchestration` (guaranteeing wire - compatibility), gates on `is_worker_ready()`, and enqueues via SPI through the - wrapper — raising (aborting the transaction) on failure. When **absent** (a - non-pg provider, or a schema predating the wrapper) it **falls back** to the - out-of-band `start_durable_function` client path (legacy, non-atomic). -- `df.reconcile(grace_seconds int default 60)` (Option 4) — a `SECURITY DEFINER`, - admin-only (revoked from PUBLIC) reconciler that deletes orphaned **root** - duroxide instances (`parent_instance_id IS NULL`, no `df.instances` row) via - `delete_instances_atomic`, and fails `df.instances` stuck non-terminal with no - live duroxide instance and no queued start. -- **Built-in durable cron loop** — the background worker - (`worker::ensure_reconciler`) keeps exactly one reconciler instance running per - cluster, dogfooding pg_durable: - `df.start(df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule())), 'df_reconciler')`. - It is idempotent (skips if one is pending/running; self-heals if it died), - driven by the `pg_durable.reconciler_cron` GUC (default `*/5 * * * *`; empty - disables), and submitted by a dedicated **non-superuser** role `df_reconciler` - (created by the worker, granted only `df.grant_usage()` + EXECUTE on - `df.reconcile()`). A non-superuser identity sidesteps the - `enable_superuser_instances` guard and bounds the blast radius to "trigger GC" - even if the instance were forged. - -**Provider requirement:** the atomic path only works with the **duroxide-pg -(SQL-backed) provider**, because it calls the provider's SQL function directly. -The probe + fallback make this non-breaking for any other provider. - -**Confirmed findings:** - -- **The privilege model is the only real wrinkle.** The duroxide orchestrator - queue grants INSERT to its owner only and the enqueue function is - `SECURITY INVOKER`, so a plain caller gets *permission denied*. The - `SECURITY DEFINER` wrapper resolves this while still committing in the caller's - transaction. A non-superuser role (`df_e2e_user`) granted via - `df.grant_usage()` starts instances successfully and cannot INSERT into the - queue directly. -- **`visible_at = now()`** (SQL transaction time) is sufficient for immediate - starts; no reliance on the Rust-side `now_ms`. -- **Reusing the `duroxide` `WorkItem` type** makes the enqueued row - byte-identical to the client path, so the worker behaves identically. -- **The reconciler must exclude sub-orchestrations.** JOIN/RACE branches and loop - generations are legitimate duroxide instances with no `df.instances` row; - filtering on `parent_instance_id IS NULL` (root only) prevents collecting them. -- **Two gotchas in the cron loop:** the reconciler role cannot be named with a - `pg_` prefix (PostgreSQL reserves it) — hence `df_reconciler`; and the loop - node must call the set-returning reconciler as `SELECT * FROM df.reconcile()` - (a bare `SELECT df.reconcile()` yields an unsupported `RECORD` column in - `execute-sql`). - -**Validated behavior:** +## Behavior change for users -| Check | Result | -|-------|:------:| -| Committed `df.start()` runs to completion | pass | -| Rolled-back `df.start()` leaves no `df.*` **and** no `_duroxide` orphan (queue + instances) | pass | -| Dynamic schema resolution drives the SPI path (atomicity holds) | pass | -| E2E subset: core, conditionals, loops, variables, signals, join, race, cancel | 8/8 | -| JOIN determinism under the new (faster) timing | 5/5 | -| Non-superuser (`df_e2e_user`) start path | pass | -| Committed `df.cancel()` cancels; rolled-back `df.cancel()` leaves the instance running and rolls back the queue enqueue | pass | -| Committed `df.signal()` delivers (incl. fan-out to a sub-orchestration in a race); rolled-back `df.signal()` rolls back the enqueue | pass | -| Cancel/signal authz: a non-member caller is denied on the wrapper and via the RLS gate; the owning role (incl. via `SET ROLE`) succeeds | pass | -| E2E: `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 | -| `df.reconcile()` deletes a planted root orphan; leaves sub-orchestrations + live instances untouched | pass | -| `df.reconcile()` fails a stuck (lost-enqueue) `df.instances` | pass | -| Built-in reconciler auto-starts as non-superuser `df_reconciler`, idempotent (exactly one) | pass | -| Scheduled loop GCs a planted orphan on its next tick, without self-GC'ing | pass | -| E2E subset unaffected with the reconciler running (incl. heartbeat + join) | 6/6 | -| `cargo fmt --check`; no new clippy warnings | pass | - -### Review hardening (applied) - -A source-of-truth code review surfaced and fixed several issues: - -- **Cross-tenant enqueue bypass (HIGH).** The wrapper originally accepted an - opaque `work_item` and was granted to every df user, letting a caller forge - `CancelInstance`/`ExternalRaised`/`ActivityCompleted` against another user's - instance. Fixed: the wrapper now **builds the `StartOrchestration` server-side** - and **authorizes** only a brand-new, not-yet-started instance — verified that a - non-superuser can no longer forge an enqueue for a foreign or arbitrary id. -- **Reconciler self-destruct on parallel-workflow orphans (HIGH).** - `delete_instances_atomic` refuses to delete a parent without its children (even - with `force`), so deleting only orphan **roots** raised on any JOIN/RACE/loop - orphan and aborted GC. Fixed: gather the **full subtree** (recursive on - `parent_instance_id`) and wrap each pass in an exception block so one failure - never aborts reconcile or kills the loop — verified root+child GC with no error. -- **GC-suppression via user label (HIGH).** The singleton key was the - user-writable `label`; any user could park a `df_reconciler`-labelled instance - to stop the GC from starting. Fixed: key liveness on the unforgeable - `submitted_by = df_reconciler` role. -- **Cron busy-loop (MED/HIGH).** `df.wait_for_schedule` baked a fixed offset at - build time that a `df.loop` replayed every generation (worst case ~1 s spin). - Fixed: the WAIT_SCHEDULE node now recomputes the delay from the orchestration's - deterministic clock (`ctx.utc_now()`) each generation — verified the reconciler - fires on the cron tick and does not spin. (Note: adding the clock read changes - the orchestration history shape, so a wait-in-loop instance in flight across the - upgrade may need to restart.) -- **No in-epoch self-healing (MED).** The reconciler was (re)started only per - epoch. Fixed: `ensure_reconciler` is also re-asserted from the steady-state poll - loop — verified a cancelled reconciler restarts within the poll interval. -- **Silent non-atomic fallback (MED).** `df.start()` now emits a `WARNING` when it - takes the non-atomic fallback path. -- **Debated/rejected:** the `df_reconciler` role keeps `LOGIN` — the worker - executes the reconcile node by connecting *as* the role via `connect_as_user`, - exactly like every other durable-function role; the proposed `NOLOGIN` would - break node execution. - -**POC scope / deferred to productionization:** fresh install only (no upgrade -script); the atomic-start wrapper hardening (move the entrypoint into duroxide-pg -owned by the schema owner, and/or validate the work item); the reconciler's -grace/cadence policy still needs tuning, and the `df_reconciler` role is created -by the worker rather than via managed provisioning. - -> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the -> BGW-owned `_duroxide` schema half-broken (migrations row present, functions -> dropped), which surfaces as *spurious* JOIN nondeterminism and a missing -> `_worker_ready`. Recover with `DROP SCHEMA _duroxide CASCADE` + restart so the -> BGW re-applies its migrations. (This cost real debugging time during the POC; -> the failures were environmental, not a regression.) - -## Transaction-semantics decision (must be made explicit) - -This change makes `df.start()` **atomic with the caller's transaction**: +`df.start()` now takes part in the caller's transaction: ```sql BEGIN; SELECT df.start('SELECT 1'); -- enqueued on THIS transaction -ROLLBACK; -- start is fully undone; nothing runs +ROLLBACK; -- fully undone; nothing runs ``` -Today the behavior is inconsistent (the `df.*` rows roll back but the duroxide -enqueue does not). The new behavior is the least-surprising contract and is the -explicit, documented decision of this spec. Note for users: a `df.start()` in a -transaction that later rolls back will **not** run — which is the desired fix, -but a behavior change from "fire-and-forget at statement time." This must be -called out in `USER_GUIDE.md`. - -## Signals & cancellation (migrated) - -`df.cancel()` and `df.signal()` use the same in-transaction enqueue as -`df.start()` (see Cancel and signal wrappers above). In short: - -- **`df.cancel()`** — the optimistic `df.instances.status='cancelled'` UPDATE and - the `CancelInstance` enqueue now commit or roll back together on the caller's - transaction. The worker's `update-instance-status` activity still re-applies the - authoritative status, so a worker-side crash remains transient/self-healing. -- **`df.signal()`** — the `ExternalRaised` fan-out (root + running - sub-orchestrations) is enqueued on the caller's transaction, giving "don't - deliver if my tx rolls back" semantics. It writes only `_duroxide`, so this is a - consistency-of-contract change rather than a cross-store fix. - -## Complementary backstop — Option 4 (implemented) - -As part of the chosen approach, a lightweight **reconciler** sweeps for residual -divergence that an in-transaction enqueue cannot prevent (crashes mid-execution, -the best-effort `df.*` `status` mirror, and legacy/fallback orphans). It ships as -**`df.reconcile(grace_seconds int default 60)`** -(`SECURITY DEFINER`, admin-only): it deletes orphaned **root** duroxide instances -(`parent_instance_id IS NULL`, no `df.instances` row, older than the grace -window) via `delete_instances_atomic`, and marks `df.instances` rows stuck -non-terminal with no live duroxide instance and no queued start as `failed`. -Sub-orchestrations (JOIN/RACE branches, loop generations) are excluded by the -root filter so they are never collected. - -The background worker runs it as a **built-in durable cron loop** (dogfooding -pg_durable), ensuring exactly one instance per cluster: +Previously the `df.*` rows rolled back but the orchestration still ran. The new +behavior is the least surprising one and is the deliberate decision of this spec, +but it *is* a change from "fires the moment the statement runs." `df.cancel()` and +`df.signal()` gain the same property. This should be called out in `USER_GUIDE.md`. + +## Implementation notes + +- The atomic path only works with the **duroxide-pg (SQL-backed) provider**, because + it calls the provider's SQL functions directly. `df.start` / `df.cancel` / + `df.signal` each probe for that surface (`enqueue_orchestrator_work` in the + resolved schema); when it is absent — a different provider, or a schema predating + the wrappers — they fall back to the old out-of-band path and emit a warning, so + the change never breaks another provider. +- `visible_at = now()` (transaction time) is enough for immediate starts. +- The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker + behaves exactly as before. + +### Hardening applied during review + +- **Cross-tenant enqueue (high).** The start wrapper first accepted an opaque work + item and was granted to everyone, which would let a caller forge a cancel or + signal against another instance. Fixed by building the work item inside the + wrapper and authorizing only a brand-new instance. +- **Reconciler deleting live work (high).** The runtime refuses to delete a parent + without its children, so deleting only orphan roots failed on any parallel-workflow + orphan. Fixed by gathering the full subtree and isolating each pass so one failure + can't stop the loop. +- **Suppressing the reconciler (high).** Its single-instance check first keyed on the + user-writable label, so a user could park a same-labelled instance to block it. + Fixed by keying on the unforgeable `df_reconciler` submitter. +- **Cron busy-loop (medium).** `df.wait_for_schedule` baked a fixed delay at build + time that the loop replayed every generation. Fixed so the wait node recomputes the + delay each generation from the orchestration's recorded clock (deterministic on + replay). Note: this changes the recorded history shape, so a wait-in-loop instance + in flight across the upgrade may need a restart. +- **Self-healing (medium).** The reconciler is re-checked from the steady-state poll + loop, not only once per worker epoch, so a cancelled one comes back within the poll + interval. +- **Silent fallback (medium).** `df.start` now warns when it uses the non-atomic + fallback. +- **Kept `LOGIN` on `df_reconciler`** (debated): the worker runs the reconcile node + by connecting *as* the role, like every other durable-function role, so `NOLOGIN` + would break it. + +> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the worker-owned +> `_duroxide` schema half-dropped (its migration row remains but functions are gone), +> which shows up as flaky `JOIN` behavior and a missing `_worker_ready`. Recover with +> `DROP SCHEMA _duroxide CASCADE` and a restart so the worker re-applies its +> migrations. + +## Upgrade and migration + +**New binary against an older schema.** The wrappers live in the `df` schema, so +they ship in the extension's install SQL and an upgrade script +(`sql/pg_durable----.sql`, `CREATE FUNCTION` + `GRANT`), with a +"Version-Specific Changes" entry in `docs/upgrade-testing.md`. A new binary running +against an older `df` schema that predates the wrappers must still work: each caller +probes for the enqueue surface and falls back to the old out-of-band path when it is +missing. This keeps the binary compatible with every prior schema while the upgrade +rolls out. + +**Data migration.** None. No table shapes change, and in-flight instances are +unaffected (their queue items were already enqueued). + +**Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if +left in place, are simply unused. + +**Future direction.** The privileged enqueue could move into duroxide-pg, owned by +the runtime schema owner — cleaner layering, at the cost of a coordinated +duroxide-pg release. The `df`-schema wrapper avoids that coordination and is what +ships today. + +## Validation (PostgreSQL 17) + +Prevent: -```sql -df.start( - df.loop(df.seq('SELECT * FROM df.reconcile()', - df.wait_for_schedule(current_setting('pg_durable.reconciler_cron')))), - 'df_reconciler'); -``` +| Check | Result | +|-------|:--:| +| Committed `df.start()` runs to completion | pass | +| Rolled-back `df.start()` leaves no `df.*` and no `_duroxide` orphan | pass | +| Rolled-back `df.cancel()` leaves the instance running and rolls back the enqueue; committed cancel takes effect | pass | +| Rolled-back `df.signal()` rolls back the fan-out enqueue; committed signal delivers (incl. fan-out into a sub-orchestration) | pass | +| A caller cannot enqueue a start, cancel, or signal against a foreign or arbitrary instance; cancel/signal succeed for the owning role (incl. via `SET ROLE`) | pass | +| Schema resolved dynamically; atomicity holds | pass | +| E2E `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 | +| Broader E2E subset (core, conditionals, loops, variables, join, race) | pass | + +Repair: -started by `worker::ensure_reconciler` each epoch (idempotent; self-heals if it -died), submitted by the dedicated non-superuser `df_reconciler` role, and driven -by `pg_durable.reconciler_cron` (default `*/5 * * * *`; empty disables). It -complements Option 3 rather than replacing it. - -## Upgrade & Migration - -**B1 — binary backward compatibility (new `.so` vs all prior schemas).** The new -`.so` calls the enqueue entrypoint via SPI. -- If the entrypoint is the **duroxide-pg-provided** `SECURITY DEFINER` function, - the BGW applies the duroxide migration at startup (`ApplyAll`, - src/worker.rs), and the function exists for every running worker. Backends - detect readiness via `_worker_ready` (existing gate) before enqueuing. - Versioning: bump the pinned `duroxide`/`duroxide-pg` pair together (see - docs/upgrade-testing.md "Updating duroxide-pg") and run `cargo update`. **No - `pg_durable` extension upgrade script is required** for duroxide-schema changes - (docs/bgw-applies-migrations.md). -- If the entrypoint is the **`df`-schema fallback wrapper**, it is a `df` schema - object and **does** require an extension upgrade script - `sql/pg_durable----.sql` (CREATE FUNCTION + GRANT), plus the - fresh-install SQL, and a "Version-Specific Changes" entry in - docs/upgrade-testing.md. The new `.so` must still operate against older `df` - schemas that predate the wrapper: gate the SPI path on the wrapper's existence - (catalog probe) and **fall back to the existing out-of-band client enqueue** - when absent. This preserves B1 while the upgrade rolls out. - -**Runtime schema detection.** Reuse the existing `_worker_ready` / -`backend_duroxide_schema()` resolution. Add a one-time probe (cached per session) -for the enqueue entrypoint; if missing (old worker), fall back to the legacy -path. This makes the change safe under mixed binary/worker versions. - -**Data migration.** None. No `df.*` or `_duroxide` table shapes change. Existing -in-flight instances are unaffected (their queue items were already enqueued). - -**Rollback.** Reverting the `.so` restores the out-of-band enqueue; the optional -`df` wrapper, if added, is inert when unused. - -## Testing - -E2E (tests/e2e/sql/) and unit coverage: - -- **Atomic rollback (the leak):** `BEGIN; df.start(...); ROLLBACK;` then assert - **both** `df.instances` and `_duroxide.instances`/`orchestrator_queue` have no - row for the instance. (Today this fails: `_duroxide` keeps an orphan.) -- **Atomic commit:** `BEGIN; df.start(...); COMMIT;` runs to completion exactly - as before. -- **Subtransaction abort:** a `df.start()` inside a `BEGIN/EXCEPTION` block that - raises leaves no `_duroxide` orphan. -- **PK collision (existing repro):** the short-id collision aborts cleanly with - no `_duroxide` residue (extends the manual `short_id_collision` repro). -- **Worker-not-ready:** `df.start()` before `_worker_ready` errors as today. -- **Privilege/hardened config:** a non-superuser caller (and, for the fallback - wrapper, a distinct `worker_role`) can start successfully and cannot insert - into `_duroxide.orchestrator_queue` directly. -- **No 5 s waits:** worker logs contain no "transaction may have been rolled - back" entries during normal start/rollback flows. - -## Phasing - -1. **Phase 1 — implemented & validated:** atomic `df.start()` via the - `SECURITY DEFINER` enqueue entrypoint; readiness gate + legacy fallback; - simplify the `load-function-graph` retry; docs + tests. Remaining for - production: upgrade script and entrypoint hardening. -2. **Phase 2 — implemented & validated:** the same in-transaction enqueue for - `df.signal()` (root + running-descendant `ExternalRaised` fan-out) and - `df.cancel()` (`CancelInstance`), each via a `SECURITY DEFINER` wrapper that - builds the work item server-side and authorizes the caller via - `pg_has_role(session_user, ...)`. -3. **Phase 3 — reconciler implemented:** `df.reconcile()` plus a built-in durable - cron loop (worker-managed `df_reconciler` instance on - `pg_durable.reconciler_cron`) are validated (auto-start, idempotency, - scheduled orphan GC, no self-GC). Remaining: tune the grace/cadence policy and - role provisioning. - -## Open questions / risks - -- **Entrypoint placement:** duroxide-pg (clean layering, coordinated release) vs - a `df` fallback wrapper (no duroxide-pg release, but cross-role grants in - hardened setups). Recommendation: duroxide-pg, with the `df` fallback retained - for B1 during rollout. -- **Contract stability:** pg_durable now depends on the enqueue entrypoint - signature and the `WorkItem` JSON. Mitigated by reusing the `duroxide` - `WorkItem` type and the version-locked duroxide/duroxide-pg pair. -- **`visible_at` clock:** the POC uses `now()` (server transaction time) for - immediate starts and it works; no reliance on the Rust-side `now_ms` for - ordering at start. -- **Idempotency:** `enqueue_orchestrator_work` is a bare INSERT (no dedup); the - `df.instances`/`df.nodes` PKs already prevent duplicate instance ids within the - transaction, so no additional dedup is required. +| Check | Result | +|-------|:--:| +| `df.reconcile()` deletes an orphan root with children, leaves sub-orchestrations and live instances intact | pass | +| `df.reconcile()` marks a stuck instance failed | pass | +| Reconciler auto-starts as non-superuser `df_reconciler`, stays a single instance | pass | +| Reconciler fires on the cron schedule without spinning, and restarts after being cancelled | pass | +| E2E subset unaffected with the reconciler running | pass | +| `cargo fmt --check` clean; no new clippy warnings | pass | + +## Open questions + +- **Enqueue placement** — keep the `df`-schema wrapper, or move it into duroxide-pg + (cleaner layering, but a coordinated release). The wrapper ships today; the move is + a future option. +- **Contract stability** — pg_durable now depends on the enqueue function signature + and the `WorkItem` JSON shape. Mitigated by the version-locked + duroxide / duroxide-pg pair. +- **Reconciler policy** — the grace window and cron cadence still want tuning, and + `df_reconciler` is created by the worker rather than provisioned externally. From f03298211b2d6a8102151a60316ab0f004bbaae3 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 16:19:55 +0000 Subject: [PATCH 12/20] Add E2E regression tests for atomic start/cancel/signal and reconcile 24_atomic_rollback: a rolled-back df.start/df.cancel/df.signal leaves no _duroxide effect (no orphan; the instance is not cancelled; the signal is not delivered). Fails against the pre-change out-of-band enqueue. 25_enqueue_wrapper_authz: the SECURITY DEFINER enqueue wrappers deny a non-owner forging a cancel/signal against a foreign instance, while the owner is allowed. Fails if the pg_has_role authorization check is removed. 26_reconcile_orphan_gc: df.reconcile() collects a planted root orphan that has running child sub-orchestrations (parallel join), gathering the full subtree, and leaves healthy instances untouched. Fails if reconcile collects only orphan roots (delete_instances_atomic refuses a parent without its children). Each test fails against a behavioral defect, verified by mutation testing, not merely the absence of the new functions. --- tests/e2e/sql/24_atomic_rollback.sql | 174 +++++++++++++++++++++ tests/e2e/sql/25_enqueue_wrapper_authz.sql | 136 ++++++++++++++++ tests/e2e/sql/26_reconcile_orphan_gc.sql | 149 ++++++++++++++++++ 3 files changed, 459 insertions(+) create mode 100644 tests/e2e/sql/24_atomic_rollback.sql create mode 100644 tests/e2e/sql/25_enqueue_wrapper_authz.sql create mode 100644 tests/e2e/sql/26_reconcile_orphan_gc.sql diff --git a/tests/e2e/sql/24_atomic_rollback.sql b/tests/e2e/sql/24_atomic_rollback.sql new file mode 100644 index 0000000..b2d8cc5 --- /dev/null +++ b/tests/e2e/sql/24_atomic_rollback.sql @@ -0,0 +1,174 @@ +-- Copyright (c) Microsoft Corporation. +-- Licensed under the PostgreSQL License. + +-- Tests: df.start / df.cancel / df.signal enqueue on the CALLER'S transaction. +-- +-- Regression test for df/_duroxide divergence. Originally the duroxide enqueue +-- happened out-of-band on a separate connection that committed independently, so +-- rolling back the caller's transaction left the runtime store ahead of the +-- control plane: +-- * a rolled-back df.start() left an orphaned orchestration in _duroxide; +-- * a rolled-back df.cancel() still cancelled the (committed) instance; +-- * a rolled-back df.signal() still delivered the signal. +-- +-- With the in-transaction enqueue, a ROLLBACK undoes the runtime work too. Each +-- scenario below FAILS without that change. +-- +-- _duroxide is owned by the worker role, so its tables are read as the superuser +-- (postgres); the durable functions themselves run as the non-superuser +-- df_e2e_user. + +-- Helper (superuser): assert no _duroxide residue remains for an instance id. +-- Resolves the runtime schema dynamically (df.duroxide_schema()). +CREATE OR REPLACE FUNCTION pg_temp.assert_no_duroxide_residue(p_id text) +RETURNS void LANGUAGE plpgsql AS $$ +DECLARE + sch text := df.duroxide_schema(); + n int; +BEGIN + EXECUTE format( + 'SELECT (SELECT count(*) FROM %I.orchestrator_queue WHERE instance_id = $1) ' + ' + (SELECT count(*) FROM %I.instances WHERE instance_id = $1)', + sch, sch) + INTO n USING p_id; + + IF n > 0 THEN + RAISE EXCEPTION 'TEST FAILED: rolled-back df.start left % _duroxide row(s) for instance % (non-atomic enqueue leaked)', n, p_id; + END IF; + RAISE NOTICE 'PASSED [start_rollback]: no _duroxide residue for %', p_id; +END $$; + +-- =========================================================================== +-- Scenario 1: a rolled-back df.start() leaves no _duroxide orphan +-- =========================================================================== + +SET SESSION AUTHORIZATION df_e2e_user; +BEGIN; +SELECT df.start('SELECT 42', 'atomic-rb-start') AS rb_id \gset +ROLLBACK; +RESET SESSION AUTHORIZATION; + +-- Give any (buggy) out-of-band enqueue a moment to surface before asserting. +SELECT pg_sleep(1); +SELECT pg_temp.assert_no_duroxide_residue(:'rb_id'); + +-- =========================================================================== +-- Scenario 2: a rolled-back df.cancel() does NOT cancel the instance +-- =========================================================================== + +SET SESSION AUTHORIZATION df_e2e_user; + +CREATE TEMP TABLE _t_cancel (instance_id TEXT); +INSERT INTO _t_cancel +SELECT df.start(df.loop(df.seq('SELECT 1', df.sleep(1))), 'atomic-rb-cancel'); + +-- Wait until the instance is genuinely running. +DO $$ +DECLARE inst_id TEXT; status TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_cancel; + LOOP + SELECT s INTO status FROM df.status(inst_id) s; + EXIT WHEN lower(status) = 'running' OR attempts > 300; + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + IF lower(status) <> 'running' THEN + RAISE EXCEPTION 'Scenario 2 setup: cancel victim never reached running (status=%)', status; + END IF; +END $$; + +-- Issue a cancel, then roll it back. +BEGIN; +SELECT df.cancel((SELECT instance_id FROM _t_cancel), 'rolled-back-cancel'); +ROLLBACK; + +-- The instance must NOT become cancelled: the cancel enqueue was rolled back. +-- (Without the change the out-of-band CancelInstance committed, so the worker +-- cancels the instance within ~1s.) +DO $$ +DECLARE inst_id TEXT; status TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_cancel; + LOOP + SELECT s INTO status FROM df.status(inst_id) s; + IF lower(status) = 'cancelled' THEN + RAISE EXCEPTION 'TEST FAILED: rolled-back df.cancel still cancelled instance % (non-atomic enqueue leaked)', inst_id; + END IF; + EXIT WHEN attempts > 50; -- watch for ~5s + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + RAISE NOTICE 'PASSED [cancel_rollback]: instance % still % after a rolled-back cancel', inst_id, status; +END $$; + +-- Cleanup: cancel the (infinite) loop for real. +SELECT df.cancel((SELECT instance_id FROM _t_cancel), 'cleanup'); +DROP TABLE _t_cancel; + +RESET SESSION AUTHORIZATION; + +-- =========================================================================== +-- Scenario 3: a rolled-back df.signal() does NOT deliver the signal +-- =========================================================================== + +SET SESSION AUTHORIZATION df_e2e_user; + +CREATE TEMP TABLE _t_signal (instance_id TEXT); +INSERT INTO _t_signal +SELECT df.start('SELECT 1' ~> (df.wait_for_signal('go') |=> 'sig') ~> 'SELECT 1', + 'atomic-rb-signal'); + +-- Wait until the instance is running (blocked on the signal). +DO $$ +DECLARE inst_id TEXT; status TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_signal; + LOOP + SELECT s INTO status FROM df.status(inst_id) s; + EXIT WHEN lower(status) = 'running' OR attempts > 300; + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + IF lower(status) <> 'running' THEN + RAISE EXCEPTION 'Scenario 3 setup: signal victim never reached running (status=%)', status; + END IF; +END $$; + +-- Send a signal, then roll it back. +BEGIN; +SELECT df.signal((SELECT instance_id FROM _t_signal), 'go', '{}'); +ROLLBACK; + +-- The instance must NOT complete: the signal enqueue was rolled back. +-- (Without the change the out-of-band ExternalRaised committed, so the instance +-- receives the signal and completes within ~1s.) +DO $$ +DECLARE inst_id TEXT; status TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_signal; + LOOP + SELECT s INTO status FROM df.status(inst_id) s; + IF lower(status) = 'completed' THEN + RAISE EXCEPTION 'TEST FAILED: rolled-back df.signal still delivered to instance % (non-atomic enqueue leaked)', inst_id; + END IF; + EXIT WHEN attempts > 50; -- watch for ~5s + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + RAISE NOTICE 'PASSED [signal_rollback]: instance % still % after a rolled-back signal', inst_id, status; +END $$; + +-- Cleanup: deliver the signal for real and let it finish. +SELECT df.signal((SELECT instance_id FROM _t_signal), 'go', '{}'); +DO $$ +DECLARE inst_id TEXT; status TEXT; +BEGIN + SELECT instance_id INTO inst_id FROM _t_signal; + SELECT df.wait_for_completion(inst_id, 30) INTO status; +END $$; +DROP TABLE _t_signal; + +RESET SESSION AUTHORIZATION; + +SELECT 'TEST PASSED: atomic rollback (start/cancel/signal)' AS result; diff --git a/tests/e2e/sql/25_enqueue_wrapper_authz.sql b/tests/e2e/sql/25_enqueue_wrapper_authz.sql new file mode 100644 index 0000000..68bb61d --- /dev/null +++ b/tests/e2e/sql/25_enqueue_wrapper_authz.sql @@ -0,0 +1,136 @@ +-- Copyright (c) Microsoft Corporation. +-- Licensed under the PostgreSQL License. + +-- Tests: authorization on the in-transaction enqueue wrappers used by +-- df.cancel() / df.signal(). +-- +-- df._enqueue_orchestrator_cancel/_signal are SECURITY DEFINER (the runtime +-- queue is writable by its owner only) and granted to every df user via +-- df.grant_usage(). They must therefore refuse to enqueue work against an +-- instance the caller does not own. Ownership is checked with +-- pg_has_role(session_user, , 'MEMBER'): session_user cannot be +-- spoofed inside a SECURITY DEFINER function, and membership lets a role that +-- owns the instance through SET ROLE still qualify. +-- +-- Without the change these wrappers do not exist, so the forge attempts below +-- raise undefined_function and the test fails. + +-- Fresh, non-superuser roles. (Superusers bypass pg_has_role, so the denial can +-- only be exercised by a non-superuser caller.) +DO $setup$ +BEGIN + BEGIN DROP OWNED BY authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END; + BEGIN DROP OWNED BY authz_other; EXCEPTION WHEN undefined_object THEN NULL; END; + BEGIN DROP ROLE authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END; + BEGIN DROP ROLE authz_other; EXCEPTION WHEN undefined_object THEN NULL; END; +END $setup$; + +CREATE ROLE authz_owner LOGIN; +CREATE ROLE authz_other LOGIN; +SELECT df.grant_usage('authz_owner'); +SELECT df.grant_usage('authz_other'); + +-- The owner starts a long-running instance. +SET SESSION AUTHORIZATION authz_owner; +CREATE TEMP TABLE _t_authz (instance_id TEXT); +INSERT INTO _t_authz +SELECT df.start(df.loop(df.seq('SELECT 1', df.sleep(1))), 'authz-owner-inst'); +RESET SESSION AUTHORIZATION; + +-- Let the non-owner read the instance id (so the forge attempts can target it). +GRANT SELECT ON _t_authz TO authz_other; + +-- Wait until it is running. +DO $$ +DECLARE inst_id TEXT; status TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_authz; + LOOP + SELECT s INTO status FROM df.status(inst_id) s; + EXIT WHEN lower(status) = 'running' OR attempts > 300; + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + IF lower(status) <> 'running' THEN + RAISE EXCEPTION 'Setup: owner instance never reached running (status=%)', status; + END IF; +END $$; + +-- =========================================================================== +-- A non-owner cannot forge a cancel/signal against the owner's instance. +-- =========================================================================== + +SET SESSION AUTHORIZATION authz_other; +DO $$ +DECLARE inst_id TEXT; +BEGIN + SELECT instance_id INTO inst_id FROM _t_authz; + + -- Forge a cancel. + BEGIN + PERFORM df._enqueue_orchestrator_cancel(inst_id, 'forged'); + RAISE EXCEPTION 'TEST FAILED: authz_other was allowed to enqueue a cancel for owner instance %', inst_id; + EXCEPTION + WHEN insufficient_privilege THEN + IF SQLERRM LIKE '%not authorized%' THEN + RAISE NOTICE 'PASSED [cancel_forge_denied]: %', SQLERRM; + ELSE + RAISE EXCEPTION 'TEST FAILED: cancel denied, but not by the wrapper authorization check: %', SQLERRM; + END IF; + WHEN undefined_function THEN + RAISE EXCEPTION 'TEST FAILED: df._enqueue_orchestrator_cancel is missing (change not present): %', SQLERRM; + END; + + -- Forge a signal. + BEGIN + PERFORM df._enqueue_orchestrator_signal(inst_id, 'go', '{}'); + RAISE EXCEPTION 'TEST FAILED: authz_other was allowed to enqueue a signal for owner instance %', inst_id; + EXCEPTION + WHEN insufficient_privilege THEN + IF SQLERRM LIKE '%not authorized%' THEN + RAISE NOTICE 'PASSED [signal_forge_denied]: %', SQLERRM; + ELSE + RAISE EXCEPTION 'TEST FAILED: signal denied, but not by the wrapper authorization check: %', SQLERRM; + END IF; + WHEN undefined_function THEN + RAISE EXCEPTION 'TEST FAILED: df._enqueue_orchestrator_signal is missing (change not present): %', SQLERRM; + END; +END $$; +RESET SESSION AUTHORIZATION; + +-- =========================================================================== +-- The owner can still cancel its own instance (the authorized path works). +-- =========================================================================== + +SET SESSION AUTHORIZATION authz_owner; +SELECT df.cancel((SELECT instance_id FROM _t_authz), 'owner-cancel'); +RESET SESSION AUTHORIZATION; + +DO $$ +DECLARE inst_id TEXT; status TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_authz; + LOOP + SELECT s INTO status FROM df.status(inst_id) s; + EXIT WHEN lower(status) = 'cancelled' OR attempts > 100; + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + IF lower(status) <> 'cancelled' THEN + RAISE EXCEPTION 'TEST FAILED: owner could not cancel its own instance % (status=%)', inst_id, status; + END IF; + RAISE NOTICE 'PASSED [owner_cancel_allowed]: instance % cancelled by its owner', inst_id; +END $$; + +DROP TABLE _t_authz; + +-- Cleanup roles. +DO $cleanup$ +BEGIN + BEGIN DROP OWNED BY authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END; + BEGIN DROP OWNED BY authz_other; EXCEPTION WHEN undefined_object THEN NULL; END; + BEGIN DROP ROLE authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END; + BEGIN DROP ROLE authz_other; EXCEPTION WHEN undefined_object THEN NULL; END; +END $cleanup$; + +SELECT 'TEST PASSED: enqueue wrapper authorization' AS result; diff --git a/tests/e2e/sql/26_reconcile_orphan_gc.sql b/tests/e2e/sql/26_reconcile_orphan_gc.sql new file mode 100644 index 0000000..c59c37b --- /dev/null +++ b/tests/e2e/sql/26_reconcile_orphan_gc.sql @@ -0,0 +1,149 @@ +-- Copyright (c) Microsoft Corporation. +-- Licensed under the PostgreSQL License. + +-- Tests: df.reconcile() repairs leftover df/_duroxide drift. +-- +-- df.reconcile() deletes orphaned ROOT runtime instances (no df.instances row, +-- older than the grace window), gathering each orphan's full subtree so the +-- delete is accepted, and leaves healthy instances untouched. +-- +-- Without the change df.reconcile() does not exist, so the calls below raise +-- undefined_function and the test fails. +-- +-- _duroxide is read as the superuser (postgres); durable functions run as +-- df_e2e_user. + +-- Helper (superuser): does a _duroxide instance row exist for this id? +CREATE OR REPLACE FUNCTION pg_temp.duroxide_instance_exists(p_id text) +RETURNS boolean LANGUAGE plpgsql AS $$ +DECLARE sch text := df.duroxide_schema(); ex boolean; +BEGIN + EXECUTE format('SELECT EXISTS(SELECT 1 FROM %I.instances WHERE instance_id = $1)', sch) + INTO ex USING p_id; + RETURN ex; +END $$; + +-- Helper (superuser): number of direct children (sub-orchestrations) of a root. +CREATE OR REPLACE FUNCTION pg_temp.duroxide_child_count(p_root text) +RETURNS bigint LANGUAGE plpgsql AS $$ +DECLARE sch text := df.duroxide_schema(); n bigint; +BEGIN + EXECUTE format('SELECT count(*) FROM %I.instances WHERE parent_instance_id = $1', sch) + INTO n USING p_root; + RETURN n; +END $$; + +-- Helper (superuser): size of the full instance subtree rooted at p_root. +CREATE OR REPLACE FUNCTION pg_temp.duroxide_subtree_count(p_root text) +RETURNS bigint LANGUAGE plpgsql AS $$ +DECLARE sch text := df.duroxide_schema(); n bigint; +BEGIN + EXECUTE format( + 'WITH RECURSIVE t AS ( ' + ' SELECT instance_id FROM %1$I.instances WHERE instance_id = $1 ' + ' UNION ' + ' SELECT c.instance_id FROM %1$I.instances c JOIN t ON c.parent_instance_id = t.instance_id ' + ') SELECT count(*) FROM t', sch) + INTO n USING p_root; + RETURN n; +END $$; + +-- =========================================================================== +-- Scenario 1: a planted root orphan WITH a running subtree is fully collected +-- =========================================================================== + +-- Start a parallel JOIN: each branch runs as its own sub-orchestration, which +-- has no df.instances row of its own. Orphaning the root then forces reconcile +-- to gather the FULL subtree (root + children) -- deleting only the root would be +-- refused by delete_instances_atomic, leaving the orphan behind. Long branches +-- keep the children running for the duration of the test. +SET SESSION AUTHORIZATION df_e2e_user; +CREATE TEMP TABLE _t_orphan (instance_id TEXT); +INSERT INTO _t_orphan +SELECT df.start('SELECT pg_sleep(30)' & 'SELECT pg_sleep(30)', 'reconcile-orphan-victim'); +RESET SESSION AUTHORIZATION; + +-- Wait until the worker has materialized the root AND at least one child, so the +-- subtree genuinely exists when we reconcile. +DO $$ +DECLARE inst_id TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_orphan; + LOOP + EXIT WHEN pg_temp.duroxide_child_count(inst_id) >= 1 OR attempts > 300; + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + IF pg_temp.duroxide_child_count(inst_id) < 1 THEN + RAISE EXCEPTION 'Setup: orphan victim never spawned a child sub-orchestration'; + END IF; + RAISE NOTICE 'Setup: orphan victim % has % child sub-orchestration(s)', inst_id, pg_temp.duroxide_child_count(inst_id); +END $$; + +-- Orphan it: drop the control-plane rows (superuser bypasses row security). +-- df.instances and df.nodes reference each other, but the FKs are +-- DEFERRABLE INITIALLY DEFERRED, so deleting both in one transaction is fine. +BEGIN; +DELETE FROM df.instances WHERE id = (SELECT instance_id FROM _t_orphan); +DELETE FROM df.nodes WHERE instance_id = (SELECT instance_id FROM _t_orphan); +COMMIT; + +-- Reconcile with a zero grace window: the orphaned root AND its subtree must be +-- collected. +SELECT df.reconcile(0); + +DO $$ +DECLARE inst_id TEXT; remaining bigint; +BEGIN + SELECT instance_id INTO inst_id FROM _t_orphan; + remaining := pg_temp.duroxide_subtree_count(inst_id); + IF remaining > 0 THEN + RAISE EXCEPTION 'TEST FAILED: df.reconcile left % subtree row(s) for orphaned root % (subtree not gathered)', remaining, inst_id; + END IF; + RAISE NOTICE 'PASSED [orphan_subtree_collected]: root % and all children removed', inst_id; +END $$; + +DROP TABLE _t_orphan; + +-- =========================================================================== +-- Scenario 2: a healthy, running instance is left untouched +-- =========================================================================== + +SET SESSION AUTHORIZATION df_e2e_user; +CREATE TEMP TABLE _t_live (instance_id TEXT); +INSERT INTO _t_live SELECT df.start('SELECT pg_sleep(3)', 'reconcile-live'); +RESET SESSION AUTHORIZATION; + +-- Ensure it is running before reconciling. +DO $$ +DECLARE inst_id TEXT; status TEXT; attempts INT := 0; +BEGIN + SELECT instance_id INTO inst_id FROM _t_live; + LOOP + SELECT s INTO status FROM df.status(inst_id) s; + EXIT WHEN lower(status) = 'running' OR attempts > 300; + PERFORM pg_sleep(0.1); + attempts := attempts + 1; + END LOOP; + IF lower(status) <> 'running' THEN + RAISE EXCEPTION 'Setup: live instance never reached running (status=%)', status; + END IF; +END $$; + +-- Reconcile must not disturb a healthy instance (it has a df.instances row). +SELECT df.reconcile(0); + +DO $$ +DECLARE inst_id TEXT; status TEXT; +BEGIN + SELECT instance_id INTO inst_id FROM _t_live; + SELECT df.wait_for_completion(inst_id, 30) INTO status; + IF lower(status) <> 'completed' THEN + RAISE EXCEPTION 'TEST FAILED: df.reconcile disturbed a healthy instance % (status=%)', inst_id, status; + END IF; + RAISE NOTICE 'PASSED [live_untouched]: % completed normally despite reconcile', inst_id; +END $$; + +DROP TABLE _t_live; + +SELECT 'TEST PASSED: reconcile orphan GC' AS result; From 0053a187d5a7f8dff071d056df68d5f31e8b1899 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 16:57:31 +0000 Subject: [PATCH 13/20] spec: call out direct duroxide-pg coupling and the PG-provider requirement The in-transaction enqueue and df.reconcile() are the first code to call duroxide-pg's SQL surface directly (enqueue_orchestrator_work, delete_instances_atomic, reads of orchestrator_queue/instances/executions), unlike the rest of pg_durable which uses duroxide's Rust provider/client API. Document why (only direct SQL can share the caller's transaction), the existing precedent (the _worker_ready table co-located in _duroxide; duroxide-pg migrations), and that non-PG providers fall back to the out-of-band path. --- docs/spec-atomic-start.md | 38 ++++++++++++++++++++++++++++++++------ 1 file changed, 32 insertions(+), 6 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 17f64e7..4fc0d97 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -154,6 +154,32 @@ transaction: a rollback drops the queue row along with the `df.*` rows, and a co makes them visible together. `df.cancel()` and `df.signal()` work the same way through their own wrappers. +### Why this reaches into duroxide-pg directly (and needs the PG provider) + +Everywhere else, pg_durable talks to the runtime through duroxide's Rust +provider/client API — the out-of-band start/cancel/signal fallback (`src/client.rs`) +and the monitoring/explain reads (`src/monitoring.rs`, `src/explain.rs`) all go +through `Client`/`Provider` methods, which are provider-agnostic at the API level. +The in-transaction enqueue and `df.reconcile()` are the **first places pg_durable +calls duroxide-pg's SQL surface directly**: `enqueue_orchestrator_work` and +`delete_instances_atomic`, plus reads of `_duroxide.orchestrator_queue`, +`instances`, and `executions`. They therefore depend on duroxide-pg's table shapes +and function signatures, not just on the runtime existing. + +This is deliberate, and it is the whole point of the change: only direct SQL (via +SPI) can run inside the **caller's** transaction. The Rust client uses a separate +connection pool, so it can never be atomic with the backend's SPI work. The trade is +a deeper coupling to duroxide-pg in exchange for atomicity. + +It would not work for a non-PostgreSQL provider — there would be no SQL functions to +call — but pg_durable already assumes a duroxide-pg provider in a known schema: it +places its own `_worker_ready` table inside `_duroxide`, resolves the schema via +`df.duroxide_schema()`, and relies on duroxide-pg applying its migrations at startup. +In the pg_durable context — a PostgreSQL extension whose runtime state lives in the +same database — there is no real use case for another provider. The probe-and- +fallback (below) keeps any non-PG provider working, non-atomically, rather than +broken. + ### Why a privileged wrapper is needed `_duroxide.orchestrator_queue` is writable by its owner only (the worker role), so a @@ -265,12 +291,12 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a ## Implementation notes -- The atomic path only works with the **duroxide-pg (SQL-backed) provider**, because - it calls the provider's SQL functions directly. `df.start` / `df.cancel` / - `df.signal` each probe for that surface (`enqueue_orchestrator_work` in the - resolved schema); when it is absent — a different provider, or a schema predating - the wrappers — they fall back to the old out-of-band path and emit a warning, so - the change never breaks another provider. +- The atomic path requires the duroxide-pg provider (see *Why this reaches into + duroxide-pg directly* above). `df.start` / `df.cancel` / `df.signal` each probe for + its SQL surface (`enqueue_orchestrator_work` in the resolved schema); when it is + absent — a different provider, or a schema predating the wrappers — they fall back + to the old out-of-band path and emit a warning, so the change never breaks another + provider. - `visible_at = now()` (transaction time) is enough for immediate starts. - The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker behaves exactly as before. From d46f431d8a102b8a0005c0042d2dcd62d676dceb Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 17:56:14 +0000 Subject: [PATCH 14/20] Address self-review findings: upgrade path, fallback probe, start-wrapper hardening Fixes from the PR self-review: - Complete the 0.2.3->0.2.4 upgrade script with the atomic enqueue wrappers, df.reconcile(), and matching grant_usage()/revoke_usage() grants/revokes. - Make the in-transaction enqueue probe require both surfaces: the duroxide-pg enqueue SQL function and the df._enqueue_orchestrator_* wrappers. This keeps the new .so B1-compatible with older df schemas that have not run ALTER EXTENSION UPDATE, falling back to the out-of-band client path instead of calling missing wrappers. - Log (server-side) rather than emit a client-visible WARNING for the fallback, so scripts that capture SELECT df.start(...) output are not contaminated. - Harden df._enqueue_orchestrator_start so it is not a generic privileged start-any-orchestration primitive: it only accepts the root graph executor, validates input.instance_id, and enqueues the hard-coded root orchestration. - Extend the authz E2E test to cover the start-wrapper hardening and make its cleanup robust across interrupted prior runs. Verified with scripts/test-upgrade.sh --verbose and targeted E2E tests 24/25/26. --- docs/spec-atomic-start.md | 24 +- docs/upgrade-testing.md | 12 +- sql/pg_durable--0.2.3--0.2.4.sql | 283 ++++++++++++++++++++- src/dsl.rs | 41 ++- src/lib.rs | 16 +- tests/e2e/sql/25_enqueue_wrapper_authz.sql | 78 +++++- 6 files changed, 416 insertions(+), 38 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 4fc0d97..1f2c3d0 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -195,11 +195,12 @@ with `%I`. They are revoked from `PUBLIC` and granted through `df.grant_usage()` Because a `SECURITY DEFINER` function runs as its owner, the wrappers cannot trust the caller. Two safeguards make them safe to grant to every user: -- **The work item is built inside the wrapper**, from the caller's arguments — never +- **The work item is built inside the wrapper**, from validated arguments — never passed in. A caller cannot choose the work-item type or smuggle in a foreign - target. (`df.start` builds `StartOrchestration`, `df.cancel` builds - `CancelInstance`, `df.signal` builds `ExternalRaised`, each with - `json_build_object`, matching duroxide's wire format.) + target. `df.start` only accepts the public root graph-executor orchestration and + requires the input JSON's `instance_id` to match the target id; `df.cancel` builds + `CancelInstance`; `df.signal` builds `ExternalRaised`. All three use + `json_build_object`, matching duroxide's wire format. - **The caller is authorized before any enqueue**, by two different rules: - *Start* targets a brand-new instance, so it authorizes by *state*: it permits the enqueue only for a `pending` `df.instances` row that has no queue entry and @@ -294,9 +295,9 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a - The atomic path requires the duroxide-pg provider (see *Why this reaches into duroxide-pg directly* above). `df.start` / `df.cancel` / `df.signal` each probe for its SQL surface (`enqueue_orchestrator_work` in the resolved schema); when it is - absent — a different provider, or a schema predating the wrappers — they fall back - to the old out-of-band path and emit a warning, so the change never breaks another - provider. + absent — a different provider, or a schema predating the wrappers — they log and + fall back to the old out-of-band path, so the change never breaks another provider + or contaminates `SELECT df.start(...)` output with client-visible warnings. - `visible_at = now()` (transaction time) is enough for immediate starts. - The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker behaves exactly as before. @@ -322,7 +323,7 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a - **Self-healing (medium).** The reconciler is re-checked from the steady-state poll loop, not only once per worker epoch, so a cancelled one comes back within the poll interval. -- **Silent fallback (medium).** `df.start` now warns when it uses the non-atomic +- **Silent fallback (medium).** `df.start` now logs when it uses the non-atomic fallback. - **Kept `LOGIN` on `df_reconciler`** (debated): the worker runs the reconcile node by connecting *as* the role, like every other durable-function role, so `NOLOGIN` @@ -341,9 +342,10 @@ they ship in the extension's install SQL and an upgrade script (`sql/pg_durable----.sql`, `CREATE FUNCTION` + `GRANT`), with a "Version-Specific Changes" entry in `docs/upgrade-testing.md`. A new binary running against an older `df` schema that predates the wrappers must still work: each caller -probes for the enqueue surface and falls back to the old out-of-band path when it is -missing. This keeps the binary compatible with every prior schema while the upgrade -rolls out. +uses the atomic path only when both the duroxide-pg SQL enqueue function **and** the +`df._enqueue_orchestrator_*` wrappers exist; otherwise it falls back to the old +out-of-band path. This keeps the binary compatible with every prior schema while +the upgrade rolls out. **Data migration.** None. No table shapes change, and in-flight instances are unaffected (their queue items were already enqueued). diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md index 586e359..e4dd1eb 100644 --- a/docs/upgrade-testing.md +++ b/docs/upgrade-testing.md @@ -206,14 +206,22 @@ what the upgrade script handles, and any backward compatibility considerations. ### v0.2.3 → v0.2.4 #### Simplify `df.grant_usage()` — drop the explicit function allowlist -- **DDL change (df schema):** `df.grant_usage()` no longer loops over a hard-coded `func_sigs` array issuing `GRANT EXECUTE` per function. Fresh installs (`src/lib.rs`) and the upgrade script (`sql/pg_durable--0.2.3--0.2.4.sql`) both `CREATE OR REPLACE` the function with a body that grants `USAGE ON SCHEMA df` plus the table privileges, and conditionally grants `df.http()` / the admin helpers. The signature `df.grant_usage(text, boolean, boolean)` is unchanged. -- **DDL change (df schema):** `df.revoke_usage()` is made symmetric with the new `grant_usage()`. It no longer loops over every `df.*` function in `pg_proc` issuing `REVOKE EXECUTE` (which, post-simplification, only produced "no privileges could be revoked" warnings since ordinary functions are never granted per-function EXECUTE). The new body revokes only what `grant_usage()` grants: schema `USAGE`, EXECUTE on the sensitive functions (`df.http`, `df.grant_usage`, `df.revoke_usage`), and the table privileges. The signature `df.revoke_usage(text)` is unchanged. +- **DDL change (df schema):** `df.grant_usage()` no longer loops over a hard-coded `func_sigs` array issuing `GRANT EXECUTE` per function. Fresh installs (`src/lib.rs`) and the upgrade script (`sql/pg_durable--0.2.3--0.2.4.sql`) both `CREATE OR REPLACE` the function with a body that grants `USAGE ON SCHEMA df` plus the table privileges, conditionally grants `df.http()` / the admin helpers, and explicitly grants the new private enqueue wrappers (`df._enqueue_orchestrator_start`, `df._enqueue_orchestrator_cancel`, `df._enqueue_orchestrator_signal`) to ordinary df users. The signature `df.grant_usage(text, boolean, boolean)` is unchanged. +- **DDL change (df schema):** `df.revoke_usage()` is made symmetric with the new `grant_usage()`. It no longer loops over every `df.*` function in `pg_proc` issuing `REVOKE EXECUTE` (which, post-simplification, only produced "no privileges could be revoked" warnings since ordinary functions are never granted per-function EXECUTE). The new body revokes only what `grant_usage()` grants: schema `USAGE`, EXECUTE on the sensitive/admin functions, EXECUTE on the private enqueue wrappers, and the table privileges. The signature `df.revoke_usage(text)` is unchanged. - **Rationale:** The ordinary `df.*` functions retain PostgreSQL's default PUBLIC `EXECUTE`, so schema `USAGE` is the real access gate; the per-function grants/revokes were redundant. The sensitive functions have PUBLIC `EXECUTE` revoked at install time and were never in the allowlist, so their protection is unchanged. - **Behavioral note:** A newly added `df.*` function is now callable by any role with schema `USAGE` by default. To keep a future function private, `REVOKE EXECUTE ... FROM PUBLIC` at install time and grant it explicitly in `df.grant_usage()`. - **Legacy cleanup caveat:** A role that was granted under the *old* `grant_usage()` (explicit per-function EXECUTE) and is later revoked under the new `revoke_usage()` may retain inert EXECUTE entries on ordinary functions. These are harmless — revoking schema `USAGE` fully locks the role out — and clear on the next drop/regrant cycle. - **Scenario A considerations:** Signatures are identical on the fresh-install and upgrade paths (only the bodies differ), so the function-signature equivalence contract passes. - **Scenario B1/B2 considerations:** No schema/data migration and no new objects. The replaced bodies work against the existing schema and change no privileges already granted. +#### Atomic in-transaction enqueue wrappers + `df.reconcile()` +- **DDL change (df schema):** Adds three private `SECURITY DEFINER` wrappers: `df._enqueue_orchestrator_start(text, text, text)`, `df._enqueue_orchestrator_cancel(text, text)`, and `df._enqueue_orchestrator_signal(text, text, text)`. Fresh installs and the upgrade script create the same functions and `REVOKE EXECUTE ... FROM PUBLIC`; `df.grant_usage()` grants them explicitly to df users because `df.start()` / `df.cancel()` / `df.signal()` call them via SPI as the caller. +- **DDL change (df schema):** Adds admin-only `df.reconcile(integer)` (`SECURITY DEFINER`, `REVOKE EXECUTE ... FROM PUBLIC`) to delete orphaned duroxide instance subtrees and mark stuck `df.instances` rows failed. The background worker starts it through a built-in durable loop as the dedicated `df_reconciler` role. +- **Provider coupling:** These wrappers and `df.reconcile()` call the duroxide-pg SQL surface directly (`enqueue_orchestrator_work`, `delete_instances_atomic`, and `_duroxide` tables). This is intentional: only direct SQL via SPI can share the caller's transaction. Non-PG providers (or older schemas without the wrappers) use the legacy out-of-band fallback. +- **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`. +- **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output. +- **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers. + #### Rename `df.wait_for_completion()` to `df.await_instance()` - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working. - **Grant behavior:** No explicit grant migration is required. PostgreSQL grants `EXECUTE` on newly created functions to `PUBLIC` by default, and `df.await_instance` is not a sensitive helper whose default PUBLIC grant is revoked. diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql index efaab86..9308bc1 100644 --- a/sql/pg_durable--0.2.3--0.2.4.sql +++ b/sql/pg_durable--0.2.3--0.2.4.sql @@ -40,7 +40,10 @@ DROP FUNCTION IF EXISTS df.debug_connection(); -- -- The sensitive functions (df.http, df.grant_usage, df.revoke_usage) had their -- PUBLIC EXECUTE revoked at install time and were never in the list; they are --- still granted explicitly here, so their protection is unchanged. +-- still granted explicitly here, so their protection is unchanged. The new +-- in-transaction enqueue wrappers are also private (REVOKE FROM PUBLIC below) +-- and are granted explicitly to df users because df.start()/df.cancel()/df.signal() +-- call them via SPI as the calling role. -- -- This CREATE OR REPLACE brings pre-existing installs in line with fresh -- 0.2.4 installs (see src/lib.rs). The new body works against the existing @@ -76,6 +79,14 @@ BEGIN EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df.revoke_usage(text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt; END IF; + -- In-transaction enqueue wrappers — SECURITY DEFINER, revoked from PUBLIC at + -- install. Granted unconditionally to every df user because df.start() / + -- df.cancel() / df.signal() call them via SPI as the calling role; their own + -- internal authorization checks gate access to other users' instances. + EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt; + EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt; + EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt; + -- Table privileges EXECUTE pg_catalog.format('GRANT SELECT ON df.instances TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt; EXECUTE pg_catalog.format('GRANT UPDATE (status, updated_at) ON df.instances TO %I', p_role) OPERATOR(pg_catalog.||) grant_opt; @@ -100,10 +111,11 @@ $fn$; -- the role out of every ordinary df.* function. -- -- The new body undoes exactly what grant_usage() grants: schema USAGE, EXECUTE --- on the sensitive functions, and the table privileges. Note: a role granted --- under the OLD grant_usage() (explicit per-function EXECUTE) may retain inert --- EXECUTE entries on ordinary functions after this revoke; they are harmless --- because schema USAGE is gone, and clear on the next drop/regrant cycle. +-- on the sensitive/admin functions, EXECUTE on the new enqueue wrappers, and +-- the table privileges. Note: a role granted under the OLD grant_usage() +-- (explicit per-function EXECUTE) may retain inert EXECUTE entries on ordinary +-- functions after this revoke; they are harmless because schema USAGE is gone, +-- and clear on the next drop/regrant cycle. -- ============================================================================ CREATE OR REPLACE FUNCTION df.revoke_usage(p_role TEXT) RETURNS VOID @@ -134,6 +146,24 @@ BEGIN NULL; END; + -- In-transaction enqueue wrappers (granted unconditionally by grant_usage()). + -- A delegated admin may not be the grantor of these; skip if so. + BEGIN + EXECUTE pg_catalog.format('REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM %I CASCADE', p_role); + EXCEPTION WHEN insufficient_privilege THEN + NULL; + END; + BEGIN + EXECUTE pg_catalog.format('REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) FROM %I CASCADE', p_role); + EXCEPTION WHEN insufficient_privilege THEN + NULL; + END; + BEGIN + EXECUTE pg_catalog.format('REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM %I CASCADE', p_role); + EXCEPTION WHEN insufficient_privilege THEN + NULL; + END; + -- Table privileges. -- Column-level revokes must match the column-level grants from grant_usage(). EXECUTE pg_catalog.format('REVOKE SELECT, INSERT, UPDATE, DELETE ON df.vars FROM %I CASCADE', p_role); @@ -166,3 +196,246 @@ CREATE FUNCTION df."await_instance"( STRICT LANGUAGE c AS 'MODULE_PATHNAME', 'await_instance_wrapper'; + +-- ============================================================================ +-- In-transaction enqueue wrappers + reconciler (atomic df.start/cancel/signal, +-- and df.reconcile()). +-- +-- df.start()/df.cancel()/df.signal() now enqueue the duroxide work item over SPI +-- inside the CALLER'S transaction (so a rollback undoes the enqueue), through +-- these SECURITY DEFINER wrappers. The orchestrator queue is owner-only, so the +-- wrappers perform the privileged INSERT; each builds the work item server-side +-- and authorizes the caller (df.start by brand-new-instance state; df.cancel / +-- df.signal by pg_has_role(session_user, , 'MEMBER')). They are revoked +-- from PUBLIC and granted to df users by df.grant_usage() (above). +-- +-- df.reconcile() is the admin-only backstop that deletes orphaned duroxide +-- instance subtrees with no df.instances row and fails stuck df.instances rows. +-- +-- The wrappers resolve the duroxide schema via df.duroxide_schema() (defined in +-- the 0.2.2→0.2.3 upgrade) and require the duroxide-pg provider; df.start / +-- df.cancel / df.signal fall back to the out-of-band path when it is absent, so +-- this upgrade is safe on a fresh '_duroxide' or a legacy 'duroxide' schema. +-- ============================================================================ +CREATE FUNCTION df._enqueue_orchestrator_start( + p_instance_id text, + p_orchestration text, + p_input text) +RETURNS void +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +DECLARE + sch text := df.duroxide_schema(); + work_item text; + v_blocked boolean; +BEGIN + -- This wrapper is not a generic privileged "start any orchestration" entry + -- point. df.start() passes the root graph-executor name and FunctionInput + -- JSON; reject anything else so a caller cannot use the SECURITY DEFINER + -- privilege to enqueue an internal sub-orchestration with crafted input. + IF p_orchestration OPERATOR(pg_catalog.<>) 'pg_durable::orchestration::execute-function-graph' THEN + RAISE EXCEPTION 'pg_durable: invalid start orchestration %', p_orchestration + USING ERRCODE = 'invalid_parameter_value'; + END IF; + + IF (p_input::jsonb ->> 'instance_id') IS DISTINCT FROM p_instance_id THEN + RAISE EXCEPTION 'pg_durable: start input instance_id does not match %', p_instance_id + USING ERRCODE = 'invalid_parameter_value'; + END IF; + + -- Authorization. This runs as the (privileged) definer, so it must not + -- trust the caller to only target their own instance. Permit the enqueue + -- only for a brand-new, not-yet-started instance: a 'pending' df.instances + -- row with no orchestrator-queue entry and no duroxide instance. Under the + -- atomic-start path a committed df.instances row always has its queue row in + -- the same transaction, so this state is reachable only for the instance the + -- caller (df.start) just inserted in the current transaction — never another + -- user's committed instance. This is what stops a caller from forging work + -- against a foreign instance. + EXECUTE pg_catalog.format( + 'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') ' + ' OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) ' + ' OR EXISTS (SELECT 1 FROM %I.instances d WHERE d.instance_id = $1)', + sch, sch) + INTO v_blocked + USING p_instance_id; + + IF v_blocked THEN + RAISE EXCEPTION 'pg_durable: not authorized to enqueue a start for instance %', p_instance_id + USING ERRCODE = 'insufficient_privilege'; + END IF; + + -- Build the StartOrchestration work item server-side so the caller cannot + -- choose the work-item variant (no CancelInstance/ExternalRaised/etc.) or + -- target a different instance. Mirrors duroxide's WorkItem::StartOrchestration. + work_item := pg_catalog.json_build_object( + 'StartOrchestration', pg_catalog.json_build_object( + 'instance', p_instance_id, + 'orchestration', 'pg_durable::orchestration::execute-function-graph', + 'input', p_input, + 'version', NULL, + 'parent_instance', NULL, + 'parent_id', NULL, + 'execution_id', 1))::text; + + EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch) + USING p_instance_id, work_item, pg_catalog.now(); +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) FROM PUBLIC; + +CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason text) +RETURNS void +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +DECLARE + sch text := df.duroxide_schema(); + owner_oid oid; +BEGIN + SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id; + IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN + RAISE EXCEPTION 'pg_durable: not authorized to cancel instance %', p_instance_id + USING ERRCODE = 'insufficient_privilege'; + END IF; + + EXECUTE pg_catalog.format('SELECT %I.enqueue_orchestrator_work($1, $2, $3)', sch) + USING p_instance_id, + pg_catalog.json_build_object('CancelInstance', + pg_catalog.json_build_object('instance', p_instance_id, 'reason', p_reason))::text, + pg_catalog.now(); +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) FROM PUBLIC; + +CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text, p_data text) +RETURNS void +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +DECLARE + sch text := df.duroxide_schema(); + owner_oid oid; +BEGIN + SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id; + IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN + RAISE EXCEPTION 'pg_durable: not authorized to signal instance %', p_instance_id + USING ERRCODE = 'insufficient_privilege'; + END IF; + + -- Raise the event for the target instance and every RUNNING descendant + -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one + -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema. + EXECUTE pg_catalog.format( + 'INSERT INTO %1$I.orchestrator_queue (instance_id, work_item, visible_at, created_at) ' + 'SELECT t.instance_id, ' + ' pg_catalog.json_build_object(''ExternalRaised'', ' + ' pg_catalog.json_build_object(''instance'', t.instance_id, ''name'', $2, ''data'', $3))::text, ' + ' pg_catalog.now(), pg_catalog.now() ' + 'FROM ( ' + ' WITH RECURSIVE tree AS ( ' + ' SELECT i.instance_id, i.current_execution_id, true AS is_root ' + ' FROM %1$I.instances i WHERE i.instance_id = $1 ' + ' UNION ' + ' SELECT c.instance_id, c.current_execution_id, false ' + ' FROM %1$I.instances c JOIN tree p ON c.parent_instance_id = p.instance_id ' + ' ) ' + ' SELECT tr.instance_id ' + ' FROM tree tr ' + ' LEFT JOIN %1$I.executions e ' + ' ON e.instance_id = tr.instance_id AND e.execution_id = tr.current_execution_id ' + ' WHERE tr.is_root OR pg_catalog.lower(COALESCE(e.status, '''')) = ''running'' ' + ') t', + sch) + USING p_instance_id, p_name, p_data; +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM PUBLIC; + +CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60) +RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint) +LANGUAGE plpgsql +SECURITY DEFINER +SET search_path = pg_catalog +AS $fn$ +DECLARE + sch text := df.duroxide_schema(); + orphan_ids text[]; + deleted bigint := 0; + stuck bigint := 0; +BEGIN + -- 1) Delete orphaned duroxide subtrees: every duroxide instance whose ROOT + -- ancestor has no df.instances row and is older than the grace window. + -- We must gather the FULL subtree (root + all descendants) because + -- delete_instances_atomic refuses (even with force) to delete a parent + -- whose children are not also in the list. Sub-orchestrations (JOIN/RACE + -- branches, loop generations) have no df.instances row and would be + -- mis-detected as roots if we keyed on "no df row" alone, so the orphan + -- seed is restricted to parent_instance_id IS NULL. + -- Wrapped so a GC failure never aborts reconcile or kills the built-in + -- reconciler loop. + BEGIN + EXECUTE pg_catalog.format( + 'WITH RECURSIVE orphan_root AS ( ' + ' SELECT d.instance_id ' + ' FROM %1$I.instances d ' + ' LEFT JOIN df.instances i ON i.id = d.instance_id ' + ' WHERE i.id IS NULL ' + ' AND d.parent_instance_id IS NULL ' + ' AND d.created_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) ' + '), subtree AS ( ' + ' SELECT instance_id FROM orphan_root ' + ' UNION ' + ' SELECT c.instance_id FROM %1$I.instances c ' + ' JOIN subtree s ON c.parent_instance_id = s.instance_id ' + ') SELECT pg_catalog.array_agg(instance_id) FROM subtree', + sch) + INTO orphan_ids + USING p_grace_seconds; + + IF orphan_ids IS NOT NULL AND pg_catalog.array_length(orphan_ids, 1) > 0 THEN + EXECUTE pg_catalog.format( + 'SELECT instances_deleted FROM %I.delete_instances_atomic($1, $2)', sch) + INTO deleted + USING orphan_ids, true; + END IF; + EXCEPTION WHEN OTHERS THEN + deleted := 0; + RAISE WARNING 'pg_durable: reconcile orphan-GC pass failed: %', SQLERRM; + END; + + -- 2) df.instances stuck non-terminal with no live duroxide instance and no + -- queued start (lost enqueue) → mark failed. The duroxide queue row + -- persists (locked) until ack, and the instance row is created at ack, so + -- a healthy in-flight start always matches one of the NOT EXISTS guards + -- and is never failed here. Best-effort; wrapped like step 1. + BEGIN + EXECUTE pg_catalog.format( + 'UPDATE df.instances i ' + 'SET status = ''failed'', updated_at = pg_catalog.now() ' + 'WHERE i.status IN (''pending'', ''running'') ' + ' AND i.updated_at < pg_catalog.now() - pg_catalog.make_interval(secs => $1) ' + ' AND NOT EXISTS (SELECT 1 FROM %1$I.instances d WHERE d.instance_id = i.id) ' + ' AND NOT EXISTS (SELECT 1 FROM %1$I.orchestrator_queue q WHERE q.instance_id = i.id)', + sch) + USING p_grace_seconds; + GET DIAGNOSTICS stuck = ROW_COUNT; + EXCEPTION WHEN OTHERS THEN + stuck := 0; + RAISE WARNING 'pg_durable: reconcile stuck-failover pass failed: %', SQLERRM; + END; + + duroxide_orphans_deleted := deleted; + stuck_instances_failed := stuck; + RETURN NEXT; +END; +$fn$; + +REVOKE EXECUTE ON FUNCTION df.reconcile(integer) FROM PUBLIC; diff --git a/src/dsl.rs b/src/dsl.rs index 604cd61..6475420 100644 --- a/src/dsl.rs +++ b/src/dsl.rs @@ -642,7 +642,7 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str, } "OK".to_string() } else { - pgrx::warning!( + pgrx::log!( "pg_durable: df.signal() is using the non-atomic fallback enqueue \ (duroxide-pg SQL surface not detected)" ); @@ -657,17 +657,34 @@ pub fn signal(instance_id: &str, signal_name: &str, signal_data: default!(&str, // Orchestration Control Functions // ============================================================================ -/// True when the duroxide store exposes the SQL enqueue surface — i.e. the -/// duroxide-pg (PostgreSQL) provider — in the given schema. Detected by the -/// presence of `enqueue_orchestrator_work`. When false (a non-pg provider, or a -/// schema predating the wrapper), df.start() falls back to the out-of-band -/// client path. The catalog probe never raises, so it is safe in a backend -/// session even when the schema is absent. +/// True when both required pieces of the in-transaction enqueue path exist: +/// the duroxide-pg SQL enqueue surface in the provider schema and the df-schema +/// SECURITY DEFINER wrappers created by the extension/upgrade script. When +/// false (a non-pg provider, or a schema that predates the wrappers), callers +/// fall back to the out-of-band client path. The catalog probe never raises, so +/// it is safe in a backend session even when the schema is absent. fn in_tx_enqueue_supported(schema: &str) -> bool { Spi::get_one_with_args::( - "SELECT EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \ - JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \ - WHERE n.nspname = $1 AND p.proname = 'enqueue_orchestrator_work')", + "SELECT \ + EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \ + JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \ + WHERE n.nspname = $1 \ + AND p.proname = 'enqueue_orchestrator_work') \ + AND EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \ + JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \ + WHERE n.nspname = 'df' \ + AND p.proname = '_enqueue_orchestrator_start' \ + AND p.pronargs = 3) \ + AND EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \ + JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \ + WHERE n.nspname = 'df' \ + AND p.proname = '_enqueue_orchestrator_cancel' \ + AND p.pronargs = 2) \ + AND EXISTS(SELECT 1 FROM pg_catalog.pg_proc p \ + JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace \ + WHERE n.nspname = 'df' \ + AND p.proname = '_enqueue_orchestrator_signal' \ + AND p.pronargs = 3)", &[schema.into()], ) .ok() @@ -1014,7 +1031,7 @@ pub fn start( // the duroxide client. NOTE: this path is NOT atomic with the caller's // transaction — a rollback will NOT undo the start. Warn so the // non-atomic semantics are observable. - pgrx::warning!( + pgrx::log!( "pg_durable: df.start() is using the non-atomic fallback enqueue \ (duroxide-pg SQL surface not detected); a rollback will not undo this start" ); @@ -1064,7 +1081,7 @@ pub fn cancel(instance_id: &str, reason: default!(&str, "'Cancelled by user'")) pgrx::error!("Failed to cancel: {:?}", e); } } else { - pgrx::warning!( + pgrx::log!( "pg_durable: df.cancel() is using the non-atomic fallback enqueue \ (duroxide-pg SQL surface not detected)" ); diff --git a/src/lib.rs b/src/lib.rs index 754b206..1143da7 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -554,6 +554,20 @@ DECLARE work_item text; v_blocked boolean; BEGIN + -- This wrapper is not a generic privileged "start any orchestration" entry + -- point. df.start() passes the root graph-executor name and FunctionInput + -- JSON; reject anything else so a caller cannot use the SECURITY DEFINER + -- privilege to enqueue an internal sub-orchestration with crafted input. + IF p_orchestration OPERATOR(pg_catalog.<>) 'pg_durable::orchestration::execute-function-graph' THEN + RAISE EXCEPTION 'pg_durable: invalid start orchestration %', p_orchestration + USING ERRCODE = 'invalid_parameter_value'; + END IF; + + IF (p_input::jsonb ->> 'instance_id') IS DISTINCT FROM p_instance_id THEN + RAISE EXCEPTION 'pg_durable: start input instance_id does not match %', p_instance_id + USING ERRCODE = 'invalid_parameter_value'; + END IF; + -- Authorization. This runs as the (privileged) definer, so it must not -- trust the caller to only target their own instance. Permit the enqueue -- only for a brand-new, not-yet-started instance: a 'pending' df.instances @@ -582,7 +596,7 @@ BEGIN work_item := pg_catalog.json_build_object( 'StartOrchestration', pg_catalog.json_build_object( 'instance', p_instance_id, - 'orchestration', p_orchestration, + 'orchestration', 'pg_durable::orchestration::execute-function-graph', 'input', p_input, 'version', NULL, 'parent_instance', NULL, diff --git a/tests/e2e/sql/25_enqueue_wrapper_authz.sql b/tests/e2e/sql/25_enqueue_wrapper_authz.sql index 68bb61d..6944b88 100644 --- a/tests/e2e/sql/25_enqueue_wrapper_authz.sql +++ b/tests/e2e/sql/25_enqueue_wrapper_authz.sql @@ -1,22 +1,43 @@ -- Copyright (c) Microsoft Corporation. -- Licensed under the PostgreSQL License. --- Tests: authorization on the in-transaction enqueue wrappers used by --- df.cancel() / df.signal(). +-- Tests: authorization and hardening on the in-transaction enqueue wrappers. -- --- df._enqueue_orchestrator_cancel/_signal are SECURITY DEFINER (the runtime +-- The df._enqueue_orchestrator_* functions are SECURITY DEFINER (the runtime -- queue is writable by its owner only) and granted to every df user via -- df.grant_usage(). They must therefore refuse to enqueue work against an --- instance the caller does not own. Ownership is checked with --- pg_has_role(session_user, , 'MEMBER'): session_user cannot be --- spoofed inside a SECURITY DEFINER function, and membership lets a role that --- owns the instance through SET ROLE still qualify. +-- instance the caller does not own, and the start wrapper must not become a +-- generic "start any internal orchestration with arbitrary input" primitive. +-- Cancel/signal ownership is checked with pg_has_role(session_user, +-- , 'MEMBER'): session_user cannot be spoofed inside a +-- SECURITY DEFINER function, and membership lets a role that owns the instance +-- through SET ROLE still qualify. -- -- Without the change these wrappers do not exist, so the forge attempts below -- raise undefined_function and the test fails. -- Fresh, non-superuser roles. (Superusers bypass pg_has_role, so the denial can -- only be exercised by a non-superuser caller.) +DO $precleanup$ +DECLARE + inst_id TEXT; +BEGIN + -- A failed/interrupted prior run can leave a loop instance submitted by one + -- of these roles. Cancel it before DROP ROLE so the worker stops reconnecting + -- as that role while this test rebuilds privileges. + FOR inst_id IN + SELECT i.id + FROM df.instances i + JOIN pg_catalog.pg_roles r ON r.oid = i.submitted_by::oid + WHERE r.rolname IN ('authz_owner', 'authz_other') + AND i.status NOT IN ('completed', 'failed', 'cancelled') + LOOP + PERFORM df.cancel(inst_id, 'authz-test-precleanup'); + END LOOP; +END $precleanup$; + +SELECT pg_sleep(0.5); + DO $setup$ BEGIN BEGIN DROP OWNED BY authz_owner; EXCEPTION WHEN undefined_object THEN NULL; END; @@ -98,6 +119,49 @@ BEGIN END $$; RESET SESSION AUTHORIZATION; +-- =========================================================================== +-- A caller cannot use the start wrapper as a generic privileged entrypoint. +-- =========================================================================== + +SET SESSION AUTHORIZATION authz_other; +DO $$ +DECLARE + inst_id CONSTANT TEXT := 'feedf00d'; + root_id CONSTANT TEXT := 'deadbeef'; +BEGIN + -- Create a legitimate brand-new, pending instance owned by authz_other. The + -- start wrapper's state check should pass for this row; the test verifies + -- the wrapper still rejects a non-root internal orchestration name. + INSERT INTO df.instances (id, label, root_node, submitted_by, database) + VALUES (inst_id, 'authz-start-wrapper-hardening', root_id, current_user::regrole, 'postgres'); + + INSERT INTO df.nodes (id, instance_id, node_type, query, submitted_by, database) + VALUES (root_id, inst_id, 'SQL', 'SELECT 1', current_user::regrole, 'postgres'); + + BEGIN + PERFORM df._enqueue_orchestrator_start( + inst_id, + 'pg_durable::orchestration::execute-subtree', + json_build_object('instance_id', inst_id)::text); + RAISE EXCEPTION 'TEST FAILED: start wrapper accepted a non-root orchestration for instance %', inst_id; + EXCEPTION + WHEN invalid_parameter_value THEN + IF SQLERRM LIKE '%invalid start orchestration%' THEN + RAISE NOTICE 'PASSED [start_wrapper_rejects_internal_orchestration]: %', SQLERRM; + ELSE + RAISE EXCEPTION 'TEST FAILED: start wrapper rejected with an unexpected validation error: %', SQLERRM; + END IF; + WHEN undefined_function THEN + RAISE EXCEPTION 'TEST FAILED: df._enqueue_orchestrator_start is missing (change not present): %', SQLERRM; + END; +END $$; +RESET SESSION AUTHORIZATION; + +BEGIN; +DELETE FROM df.instances WHERE id = 'feedf00d'; +DELETE FROM df.nodes WHERE instance_id = 'feedf00d'; +COMMIT; + -- =========================================================================== -- The owner can still cancel its own instance (the authorized path works). -- =========================================================================== From 8472e1931e6b74cdddbd0070adcdb18234ae05b6 Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 18:10:49 +0000 Subject: [PATCH 15/20] Address rubber-duck feedback on early signals and upgrade docs Duroxide does not buffer external events until an orchestration has a pending subscription. A signal sent before the root runtime row exists would be accepted but skipped before the workflow can observe it, so df.signal now rejects that case with object_not_in_prerequisite_state instead of returning OK. - Add the root-materialization check to df._enqueue_orchestrator_signal in both fresh-install SQL and the 0.2.3->0.2.4 upgrade script. - Extend 24_atomic_rollback to assert same-transaction signal attempts are rejected before runtime materialization. - Align the spec/upgrade docs: wait_for_schedule in-flight instances may need restart after the history-shape change; early signals are rejected, not buffered. --- docs/spec-atomic-start.md | 17 +++++++++----- docs/upgrade-testing.md | 1 + sql/pg_durable--0.2.3--0.2.4.sql | 13 +++++++++++ src/lib.rs | 13 +++++++++++ tests/e2e/sql/24_atomic_rollback.sql | 33 ++++++++++++++++++++++++++++ 5 files changed, 71 insertions(+), 6 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 1f2c3d0..f901c7a 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -219,10 +219,13 @@ the caller. Two safeguards make them safe to grant to every user: A signal must reach the root instance **and** every running sub-orchestration, because a child (a `JOIN`/`RACE` branch or a loop generation) may be the one waiting -on it. The `df.signal` wrapper walks the instance tree from the root -(`_duroxide.instances.parent_instance_id`), keeps the instances whose current -execution is still running, and enqueues one `ExternalRaised` per surviving -instance — all on the caller's transaction, so the whole fan-out is atomic. +on it. Duroxide does not buffer external events until an orchestration has a pending +subscription, so `df.signal` first requires the root runtime row to exist; a signal +sent in the same transaction as `df.start()` is rejected instead of returning `OK` +and being skipped before the workflow can observe it. Once the root is materialized, +the wrapper walks the instance tree (`_duroxide.instances.parent_instance_id`) and +enqueues one event for the root plus each running descendant — all on the caller's +transaction, so the whole fan-out is atomic. ### Worker readiness and ordering @@ -347,8 +350,10 @@ uses the atomic path only when both the duroxide-pg SQL enqueue function **and** out-of-band path. This keeps the binary compatible with every prior schema while the upgrade rolls out. -**Data migration.** None. No table shapes change, and in-flight instances are -unaffected (their queue items were already enqueued). +**Data migration.** None. No table shapes change, and already-enqueued work items +are left in place. One caveat: the `df.wait_for_schedule` fix changes the recorded +history shape for a wait-in-loop orchestration, so such in-flight instances may need +to be restarted after upgrade if they replay across the change. **Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if left in place, are simply unused. diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md index e4dd1eb..ac95791 100644 --- a/docs/upgrade-testing.md +++ b/docs/upgrade-testing.md @@ -221,6 +221,7 @@ what the upgrade script handles, and any backward compatibility considerations. - **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`. - **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output. - **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers. +- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the history shape for a wait-in-loop instance that replays across the upgrade; customers should drain or restart such instances if they encounter nondeterministic replay. #### Rename `df.wait_for_completion()` to `df.await_instance()` - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working. diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql index 9308bc1..f5f49f5 100644 --- a/sql/pg_durable--0.2.3--0.2.4.sql +++ b/sql/pg_durable--0.2.3--0.2.4.sql @@ -322,6 +322,7 @@ AS $fn$ DECLARE sch text := df.duroxide_schema(); owner_oid oid; + root_exists boolean; BEGIN SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id; IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN @@ -329,6 +330,18 @@ BEGIN USING ERRCODE = 'insufficient_privilege'; END IF; + -- Duroxide does not buffer external events until an orchestration has a + -- pending subscription. If the root runtime row is not materialized yet, a + -- signal would be accepted but dropped before the workflow can observe it. + EXECUTE pg_catalog.format( + 'SELECT EXISTS (SELECT 1 FROM %I.instances WHERE instance_id = $1)', sch) + INTO root_exists + USING p_instance_id; + IF NOT root_exists THEN + RAISE EXCEPTION 'pg_durable: instance % is not ready to receive signals', p_instance_id + USING ERRCODE = 'object_not_in_prerequisite_state'; + END IF; + -- Raise the event for the target instance and every RUNNING descendant -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema. diff --git a/src/lib.rs b/src/lib.rs index 1143da7..d97fdff 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -668,6 +668,7 @@ AS $fn$ DECLARE sch text := df.duroxide_schema(); owner_oid oid; + root_exists boolean; BEGIN SELECT i.submitted_by::oid INTO owner_oid FROM df.instances i WHERE i.id = p_instance_id; IF owner_oid IS NULL OR NOT pg_catalog.pg_has_role(session_user, owner_oid, 'MEMBER') THEN @@ -675,6 +676,18 @@ BEGIN USING ERRCODE = 'insufficient_privilege'; END IF; + -- Duroxide does not buffer external events until an orchestration has a + -- pending subscription. If the root runtime row is not materialized yet, a + -- signal would be accepted but dropped before the workflow can observe it. + EXECUTE pg_catalog.format( + 'SELECT EXISTS (SELECT 1 FROM %I.instances WHERE instance_id = $1)', sch) + INTO root_exists + USING p_instance_id; + IF NOT root_exists THEN + RAISE EXCEPTION 'pg_durable: instance % is not ready to receive signals', p_instance_id + USING ERRCODE = 'object_not_in_prerequisite_state'; + END IF; + -- Raise the event for the target instance and every RUNNING descendant -- (a sub-orchestration — JOIN/RACE branch or loop generation — may be the one -- waiting on the signal), mirroring the out-of-band fan-out. %1$I = schema. diff --git a/tests/e2e/sql/24_atomic_rollback.sql b/tests/e2e/sql/24_atomic_rollback.sql index b2d8cc5..d9de3d4 100644 --- a/tests/e2e/sql/24_atomic_rollback.sql +++ b/tests/e2e/sql/24_atomic_rollback.sql @@ -171,4 +171,37 @@ DROP TABLE _t_signal; RESET SESSION AUTHORIZATION; +-- =========================================================================== +-- Scenario 4: a signal sent before runtime materialization is rejected +-- =========================================================================== + +SET SESSION AUTHORIZATION df_e2e_user; + +DO $$ +DECLARE inst_id TEXT; +BEGIN + BEGIN + inst_id := df.start( + 'SELECT 1' ~> (df.wait_for_signal('go') |=> 'sig') ~> 'SELECT 1', + 'atomic-signal-before-runtime-ready' + ); + + -- The worker cannot have materialized the root _duroxide.instances row + -- yet (we are still inside the same backend statement/transaction). A + -- signal here would be accepted by the runtime but skipped before the + -- workflow subscribes, so df.signal must fail instead of returning OK. + PERFORM df.signal(inst_id, 'go', '{}'); + RAISE EXCEPTION 'TEST FAILED: df.signal accepted instance % before runtime materialization', inst_id; + EXCEPTION + WHEN object_not_in_prerequisite_state THEN + IF SQLERRM LIKE '%not ready to receive signals%' THEN + RAISE NOTICE 'PASSED [signal_before_runtime_ready_rejected]: %', SQLERRM; + ELSE + RAISE EXCEPTION 'TEST FAILED: unexpected not-ready signal error: %', SQLERRM; + END IF; + END; +END $$; + +RESET SESSION AUTHORIZATION; + SELECT 'TEST PASSED: atomic rollback (start/cancel/signal)' AS result; From 00108fd8aebaf046f392e8178c5eebe60475fd6d Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 18:17:55 +0000 Subject: [PATCH 16/20] docs: flag wait_for_schedule replay-sequence upgrade caveat The wait_for_schedule loop fix records an utc_now event before the timer, which changes the durable replay event sequence for in-flight scheduled-loop instances. Document the drain/restart requirement explicitly in the spec and upgrade notes. --- docs/spec-atomic-start.md | 17 ++++++++++++----- docs/upgrade-testing.md | 2 +- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index f901c7a..983a203 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -321,8 +321,11 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a - **Cron busy-loop (medium).** `df.wait_for_schedule` baked a fixed delay at build time that the loop replayed every generation. Fixed so the wait node recomputes the delay each generation from the orchestration's recorded clock (deterministic on - replay). Note: this changes the recorded history shape, so a wait-in-loop instance - in flight across the upgrade may need a restart. + replay). **Upgrade caveat:** this changes the recorded replay event sequence + (`utc_now` is recorded before the timer). A `df.loop(df.wait_for_schedule(...))` + instance that is already waiting when the binary changes may replay expecting the + old sequence and fail as nondeterministic; drain or restart those scheduled + instances during upgrade. - **Self-healing (medium).** The reconciler is re-checked from the steady-state poll loop, not only once per worker epoch, so a cancelled one comes back within the poll interval. @@ -351,9 +354,13 @@ out-of-band path. This keeps the binary compatible with every prior schema while the upgrade rolls out. **Data migration.** None. No table shapes change, and already-enqueued work items -are left in place. One caveat: the `df.wait_for_schedule` fix changes the recorded -history shape for a wait-in-loop orchestration, so such in-flight instances may need -to be restarted after upgrade if they replay across the change. +are left in place. + +**In-flight scheduled loops.** `df.wait_for_schedule` now records an `utc_now` event +before the timer so a loop computes the next cron tick each generation. That changes +the recorded replay event sequence for existing wait-in-loop instances. Drain or +restart any in-flight `df.loop(df.wait_for_schedule(...))` instances during upgrade; +otherwise they may fail on replay as nondeterministic. **Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if left in place, are simply unused. diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md index ac95791..b7f8672 100644 --- a/docs/upgrade-testing.md +++ b/docs/upgrade-testing.md @@ -221,7 +221,7 @@ what the upgrade script handles, and any backward compatibility considerations. - **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`. - **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output. - **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers. -- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the history shape for a wait-in-loop instance that replays across the upgrade; customers should drain or restart such instances if they encounter nondeterministic replay. +- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the recorded replay event sequence (`utc_now` is recorded before the timer). Existing `df.loop(df.wait_for_schedule(...))` instances that are already waiting when the binary changes should be drained or restarted during upgrade; otherwise they may replay expecting the old sequence and fail as nondeterministic. #### Rename `df.wait_for_completion()` to `df.await_instance()` - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working. From 084562038805bd8a4b6d6e38e38960aaceeb70cf Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 18:43:35 +0000 Subject: [PATCH 17/20] Address correctness review: backfill wrapper grants and broaden schedule caveat - During 0.2.3->0.2.4 upgrade, backfill EXECUTE on the new private atomic enqueue wrappers to roles that already had explicit USAGE on schema df, preserving grant option. Without this, existing df users could lose df.start()/df.cancel()/df.signal() after ALTER EXTENSION UPDATE because the new binary would choose the atomic path and then fail on wrapper EXECUTE. - Extend scripts/test-upgrade.sh with a B2 assertion that a pre-upgrade df user retains wrapper privileges after upgrade. - Broaden the wait_for_schedule upgrade caveat from looped schedules to any in-flight WAIT_SCHEDULE node, because the recorded replay event sequence now includes utc_now before the timer. --- docs/spec-atomic-start.md | 18 +++++++++--------- docs/upgrade-testing.md | 6 +++--- scripts/test-upgrade.sh | 17 +++++++++++++++++ sql/pg_durable--0.2.3--0.2.4.sql | 27 +++++++++++++++++++++++++++ 4 files changed, 56 insertions(+), 12 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index 983a203..e10cbe7 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -322,10 +322,10 @@ but it *is* a change from "fires the moment the statement runs." `df.cancel()` a time that the loop replayed every generation. Fixed so the wait node recomputes the delay each generation from the orchestration's recorded clock (deterministic on replay). **Upgrade caveat:** this changes the recorded replay event sequence - (`utc_now` is recorded before the timer). A `df.loop(df.wait_for_schedule(...))` - instance that is already waiting when the binary changes may replay expecting the - old sequence and fail as nondeterministic; drain or restart those scheduled - instances during upgrade. + (`utc_now` is recorded before the timer). Any instance already waiting in a + `WAIT_SCHEDULE` node when the binary changes may replay expecting the old sequence + and fail as nondeterministic; drain or restart those scheduled instances during + upgrade. - **Self-healing (medium).** The reconciler is re-checked from the steady-state poll loop, not only once per worker epoch, so a cancelled one comes back within the poll interval. @@ -356,11 +356,11 @@ the upgrade rolls out. **Data migration.** None. No table shapes change, and already-enqueued work items are left in place. -**In-flight scheduled loops.** `df.wait_for_schedule` now records an `utc_now` event -before the timer so a loop computes the next cron tick each generation. That changes -the recorded replay event sequence for existing wait-in-loop instances. Drain or -restart any in-flight `df.loop(df.wait_for_schedule(...))` instances during upgrade; -otherwise they may fail on replay as nondeterministic. +**In-flight scheduled waits.** `df.wait_for_schedule` now records an `utc_now` event +before the timer. That changes the recorded replay event sequence for existing +instances already waiting in any `WAIT_SCHEDULE` node. Drain or restart those +in-flight scheduled waits during upgrade; otherwise they may fail on replay as +nondeterministic. **Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if left in place, are simply unused. diff --git a/docs/upgrade-testing.md b/docs/upgrade-testing.md index b7f8672..03d5ca1 100644 --- a/docs/upgrade-testing.md +++ b/docs/upgrade-testing.md @@ -215,13 +215,13 @@ what the upgrade script handles, and any backward compatibility considerations. - **Scenario B1/B2 considerations:** No schema/data migration and no new objects. The replaced bodies work against the existing schema and change no privileges already granted. #### Atomic in-transaction enqueue wrappers + `df.reconcile()` -- **DDL change (df schema):** Adds three private `SECURITY DEFINER` wrappers: `df._enqueue_orchestrator_start(text, text, text)`, `df._enqueue_orchestrator_cancel(text, text)`, and `df._enqueue_orchestrator_signal(text, text, text)`. Fresh installs and the upgrade script create the same functions and `REVOKE EXECUTE ... FROM PUBLIC`; `df.grant_usage()` grants them explicitly to df users because `df.start()` / `df.cancel()` / `df.signal()` call them via SPI as the caller. +- **DDL change (df schema):** Adds three private `SECURITY DEFINER` wrappers: `df._enqueue_orchestrator_start(text, text, text)`, `df._enqueue_orchestrator_cancel(text, text)`, and `df._enqueue_orchestrator_signal(text, text, text)`. Fresh installs and the upgrade script create the same functions and `REVOKE EXECUTE ... FROM PUBLIC`; `df.grant_usage()` grants them explicitly to df users because `df.start()` / `df.cancel()` / `df.signal()` call them via SPI as the caller. The upgrade script also backfills these wrapper grants to roles that already had explicit `USAGE` on schema `df` before `ALTER EXTENSION UPDATE`, preserving grant option where present. - **DDL change (df schema):** Adds admin-only `df.reconcile(integer)` (`SECURITY DEFINER`, `REVOKE EXECUTE ... FROM PUBLIC`) to delete orphaned duroxide instance subtrees and mark stuck `df.instances` rows failed. The background worker starts it through a built-in durable loop as the dedicated `df_reconciler` role. - **Provider coupling:** These wrappers and `df.reconcile()` call the duroxide-pg SQL surface directly (`enqueue_orchestrator_work`, `delete_instances_atomic`, and `_duroxide` tables). This is intentional: only direct SQL via SPI can share the caller's transaction. Non-PG providers (or older schemas without the wrappers) use the legacy out-of-band fallback. -- **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler and updates `grant_usage()`/`revoke_usage()` in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`. +- **Scenario A considerations:** Fresh-install and upgrade schemas must expose the same new `df` functions and grants. The upgrade script creates the wrappers/reconciler, updates `grant_usage()`/`revoke_usage()`, and backfills existing df users in the same release, so upgraded users can call `df.start()` / `df.cancel()` / `df.signal()` immediately after `ALTER EXTENSION UPDATE`. - **Scenario B1 considerations:** The new `.so` remains compatible with pre-upgrade 0.2.3 schemas because `df.start()` / `df.cancel()` / `df.signal()` use the in-transaction path only when both the duroxide-pg provider SQL function and the `df._enqueue_orchestrator_*` wrappers exist; otherwise they log and fall back to the old out-of-band client path. This fallback is not client-visible as a SQL `WARNING`, so it does not contaminate scripts that capture `SELECT df.start(...)` output. - **Scenario B2 considerations:** No data migration. Existing instances and queued work keep their original behavior; the new atomic semantics apply to calls made after the schema has the wrappers. -- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` inside loops to recompute from the orchestration's recorded clock each generation. That fixes a busy-loop bug, but changes the recorded replay event sequence (`utc_now` is recorded before the timer). Existing `df.loop(df.wait_for_schedule(...))` instances that are already waiting when the binary changes should be drained or restarted during upgrade; otherwise they may replay expecting the old sequence and fail as nondeterministic. +- **In-flight schedule caveat:** This release also changes `df.wait_for_schedule` to compute from the orchestration's recorded clock. That fixes a loop busy-loop bug, but changes the recorded replay event sequence (`utc_now` is recorded before the timer). Any in-flight instance already waiting in a `WAIT_SCHEDULE` node when the binary changes should be drained or restarted during upgrade; otherwise it may replay expecting the old sequence and fail as nondeterministic. #### Rename `df.wait_for_completion()` to `df.await_instance()` - **DDL change (df schema):** Adds `df.await_instance(text, integer)` as the canonical C binding for the helper formerly exposed as `df.wait_for_completion(text, integer)`. The old SQL function remains present and the new `.so` continues exporting `wait_for_completion_wrapper` as a shim, so existing customer scripts keep working. diff --git a/scripts/test-upgrade.sh b/scripts/test-upgrade.sh index 1778543..bb0cfeb 100755 --- a/scripts/test-upgrade.sh +++ b/scripts/test-upgrade.sh @@ -914,6 +914,7 @@ echo "" B2_PRE_INSTANCE_ID="" B2_INFLIGHT_INSTANCE_ID="" B2_POST_INSTANCE_ID="" +B2_PRE_GRANTED_ROLE="durable_b2_pre_grant" test_b2_data_survives_upgrade() { # Step 1: Install previous version and create test data @@ -924,6 +925,14 @@ test_b2_data_survives_upgrade() { assert_sql_equals "SELECT df.clearvars();" "OK" || return 1 assert_sql_equals "SELECT df.setvar('b2_key', 'b2_value');" "OK" || return 1 + # Role granted under the previous schema. The upgrade must backfill EXECUTE + # on new private wrappers so this existing df user can keep calling + # df.start()/df.cancel()/df.signal() without re-running df.grant_usage(). + run_sql_capture "DROP OWNED BY ${B2_PRE_GRANTED_ROLE};" >/dev/null 2>&1 || true + run_sql_capture "DROP ROLE IF EXISTS ${B2_PRE_GRANTED_ROLE};" >/dev/null 2>&1 || true + run_sql_capture "CREATE ROLE ${B2_PRE_GRANTED_ROLE} LOGIN;" >/dev/null || return 1 + run_sql_capture "SELECT df.grant_usage('${B2_PRE_GRANTED_ROLE}');" >/dev/null || return 1 + B2_PRE_INSTANCE_ID=$(run_sql_capture "SELECT df.start('INSERT INTO test_upgrade_b2_log (kind, msg) VALUES (''pre'', ''{b2_key}'') RETURNING msg', 'b2-pre-upgrade');") || return 1 B2_INFLIGHT_INSTANCE_ID=$(run_sql_capture "SELECT df.start(df.sleep(2) ~> 'SELECT ''b2-running'' AS value', 'b2-inflight');") || return 1 @@ -968,6 +977,12 @@ test_b2_new_data_after_upgrade() { assert_sql_equals "SELECT msg FROM test_upgrade_b2_log WHERE kind = 'post' ORDER BY id DESC LIMIT 1;" "new_value" } +test_b2_existing_grants_after_upgrade() { + assert_sql_equals "SELECT has_function_privilege('${B2_PRE_GRANTED_ROLE}', 'df._enqueue_orchestrator_start(text, text, text)', 'EXECUTE');" "t" && + assert_sql_equals "SELECT has_function_privilege('${B2_PRE_GRANTED_ROLE}', 'df._enqueue_orchestrator_cancel(text, text)', 'EXECUTE');" "t" && + assert_sql_equals "SELECT has_function_privilege('${B2_PRE_GRANTED_ROLE}', 'df._enqueue_orchestrator_signal(text, text, text)', 'EXECUTE');" "t" +} + test_b2_grant_usage_after_upgrade() { # Regression guard for #110: after ALTER EXTENSION UPDATE, df.debug_connection() # must be gone from the catalog. Scenario A only compares function name/args/ @@ -996,6 +1011,7 @@ test_b2_grant_usage_after_upgrade() { # Clean up the probe role. run_sql_capture "DROP OWNED BY ${probe_role}; DROP ROLE IF EXISTS ${probe_role};" >/dev/null 2>&1 || true + run_sql_capture "DROP OWNED BY ${B2_PRE_GRANTED_ROLE}; DROP ROLE IF EXISTS ${B2_PRE_GRANTED_ROLE};" >/dev/null 2>&1 || true } if [ "$HAS_COMPAT_PREV" = true ]; then @@ -1003,6 +1019,7 @@ if [ "$HAS_COMPAT_PREV" = true ]; then run_test "B2: Pre-upgrade instance remains queryable" test_b2_pre_upgrade_instance_after_upgrade run_test "B2: In-flight work completes after upgrade" test_b2_inflight_work_after_upgrade run_test "B2: New data and execution after upgrade" test_b2_new_data_after_upgrade + run_test "B2: Existing df users retain wrapper privileges after upgrade" test_b2_existing_grants_after_upgrade run_test "B2: df.grant_usage() works and df.debug_connection() is gone after upgrade" test_b2_grant_usage_after_upgrade fi diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql index f5f49f5..e6fc0fa 100644 --- a/sql/pg_durable--0.2.3--0.2.4.sql +++ b/sql/pg_durable--0.2.3--0.2.4.sql @@ -372,6 +372,33 @@ $fn$; REVOKE EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) FROM PUBLIC; +-- Backfill wrapper EXECUTE to roles that already had df usage before ALTER +-- EXTENSION UPDATE. New calls to df.grant_usage() grant these wrappers via the +-- function body above, but existing users would otherwise lose df.start() / +-- df.cancel() / df.signal() when the new .so chooses the atomic path. +DO $$ +DECLARE + r RECORD; + grant_opt TEXT; +BEGIN + FOR r IN + SELECT + pg_catalog.quote_ident(pg_catalog.pg_get_userbyid(a.grantee)) AS grantee, + pg_catalog.bool_or(a.is_grantable) AS with_grant_option + FROM pg_catalog.pg_namespace n + CROSS JOIN LATERAL pg_catalog.aclexplode(n.nspacl) AS a + WHERE n.nspname OPERATOR(pg_catalog.=) 'df' + AND a.privilege_type OPERATOR(pg_catalog.=) 'USAGE' + AND a.grantee OPERATOR(pg_catalog.<>) 0 -- skip PUBLIC + GROUP BY a.grantee + LOOP + grant_opt := CASE WHEN r.with_grant_option THEN ' WITH GRANT OPTION' ELSE '' END; + EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_start(text, text, text) TO %s', r.grantee) OPERATOR(pg_catalog.||) grant_opt; + EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_cancel(text, text) TO %s', r.grantee) OPERATOR(pg_catalog.||) grant_opt; + EXECUTE pg_catalog.format('GRANT EXECUTE ON FUNCTION df._enqueue_orchestrator_signal(text, text, text) TO %s', r.grantee) OPERATOR(pg_catalog.||) grant_opt; + END LOOP; +END $$; + CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60) RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint) LANGUAGE plpgsql From 458a422ea1397d6eac65a123252006b5346ff35b Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 19:00:16 +0000 Subject: [PATCH 18/20] Fix pgspot search_path finding for SECURITY DEFINER functions Use SET search_path = pg_catalog, pg_temp for the new SECURITY DEFINER wrappers and df.reconcile() in both fresh-install SQL and the 0.2.3->0.2.4 upgrade script. This satisfies the pgspot PS004 gate while preserving the fully qualified SQL references inside the functions. --- sql/pg_durable--0.2.3--0.2.4.sql | 8 ++++---- src/lib.rs | 8 ++++---- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql index e6fc0fa..49f7d7c 100644 --- a/sql/pg_durable--0.2.3--0.2.4.sql +++ b/sql/pg_durable--0.2.3--0.2.4.sql @@ -224,7 +224,7 @@ CREATE FUNCTION df._enqueue_orchestrator_start( RETURNS void LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); @@ -291,7 +291,7 @@ CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason tex RETURNS void LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); @@ -317,7 +317,7 @@ CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text, RETURNS void LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); @@ -403,7 +403,7 @@ CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60) RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint) LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); diff --git a/src/lib.rs b/src/lib.rs index d97fdff..dd7f7ac 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -547,7 +547,7 @@ CREATE FUNCTION df._enqueue_orchestrator_start( RETURNS void LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); @@ -637,7 +637,7 @@ CREATE FUNCTION df._enqueue_orchestrator_cancel(p_instance_id text, p_reason tex RETURNS void LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); @@ -663,7 +663,7 @@ CREATE FUNCTION df._enqueue_orchestrator_signal(p_instance_id text, p_name text, RETURNS void LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); @@ -751,7 +751,7 @@ CREATE FUNCTION df.reconcile(p_grace_seconds integer DEFAULT 60) RETURNS TABLE(duroxide_orphans_deleted bigint, stuck_instances_failed bigint) LANGUAGE plpgsql SECURITY DEFINER -SET search_path = pg_catalog +SET search_path = pg_catalog, pg_temp AS $fn$ DECLARE sch text := df.duroxide_schema(); From a38eb9f8aee33e072052c4635ac4aab93345307e Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 22:36:51 +0000 Subject: [PATCH 19/20] docs: shorten atomic consistency spec to high-level design Trim implementation details that are visible in the diff and keep the spec focused on the problem, chosen prevent/repair design, provider coupling, behavior changes, upgrade caveats, validation, and remaining risks. --- docs/spec-atomic-start.md | 547 ++++++++++---------------------------- 1 file changed, 146 insertions(+), 401 deletions(-) diff --git a/docs/spec-atomic-start.md b/docs/spec-atomic-start.md index e10cbe7..683ecf1 100644 --- a/docs/spec-atomic-start.md +++ b/docs/spec-atomic-start.md @@ -1,408 +1,153 @@ # Keeping the control plane and the runtime consistent -**Status:** Implemented and validated. Atomic `df.start`/`df.cancel`/`df.signal`, -plus a reconciler that repairs leftover drift. +**Status:** Implemented and validated. Atomic `df.start` / `df.cancel` / +`df.signal`, plus a reconciler that repairs leftover drift. **Author:** pg_durable team **Date:** June 2026 -## The problem +## Problem -A durable function lives in two stores that are written separately: - -- the **pg_durable control plane** — `df.nodes` and `df.instances`, written on the - caller's transaction; and -- the **duroxide runtime** — `_duroxide.orchestrator_queue`, where the orchestration - is enqueued. - -Originally `df.start()` wrote the first on the caller's transaction but enqueued the -second out-of-band, on a separate connection pool that committed on its own. The two -were never atomic, so they could disagree. +A durable function has state in two places: -The common failure was an **orphaned orchestration**: a `df.start()` whose -surrounding transaction rolled back (or whose batch aborted on a primary-key -collision) left a row in `_duroxide` with no matching `df.*` rows. The background -worker then kept trying to load a graph that did not exist, failed after a 5-second -timeout, and the orphan stayed behind. Under load these orphans accumulated and -saturated the worker's two threads. - -We observed this directly: a `df.start()` inside a rolled-back transaction left -`_duroxide.instances` row `48819f12` with zero rows in `df.instances`, surviving the -rollback. - -## What we did - -Two complementary changes: - -1. **Prevent** the drift at the source. `df.start()`, `df.cancel()`, and - `df.signal()` now enqueue their runtime work on the **caller's transaction**, so - the control-plane writes and the runtime enqueue commit or roll back together. -2. **Repair** what prevention cannot cover. A reconciler periodically deletes - leftover orphans and marks stuck instances failed — catching drift from crashes - mid-execution and from the legacy fallback path. - -Both are pg_durable-only. No `duroxide` / `duroxide-pg` or `Cargo.*` changes. - -## Designs we considered - -Four approaches were weighed for keeping the two stores in agreement. - -**Option 1 — One store.** Drop `df.nodes` / `df.instances` entirely and keep all -state in `_duroxide`, exposing what pg_durable needs through views. -*Strongest guarantee* (only one store can't diverge) but the most disruptive: it -moves the `submitted_by` security boundary, rebuilds per-user row security over -runtime-owned tables, re-exposes every read path, and needs a destructive migration. -A possible long-term direction, not this change. - -**Option 2 — Run our writes inside duroxide's transaction.** Extend duroxide-pg so -the enqueue also performs the `df.*` writes, on duroxide's worker-pool connection. -This makes the two stores atomic — but on the *worker's* connection, so `df.start()` -no longer joins the *caller's* transaction and the `df.*` writes run as the pool -role, breaking identity capture and row security. Strictly worse than Option 3. - -**Option 3 — Run duroxide's enqueue inside the caller's transaction.** Keep both -stores; call the runtime's enqueue over SPI on the caller's transaction, next to the -`df.*` writes. Everything commits or rolls back together, identity capture and the -read paths are untouched, and the 5-second load race disappears. The only cost is a -`SECURITY DEFINER` wrapper, because the runtime queue is writable by its owner only. -**Chosen as the primary fix.** - -**Option 4 — Tolerate drift, repair it later.** Accept a brief inconsistency window -and add a reconciler that finds and fixes mismatches. Adds no guarantee on its own, -but it is the only option that also catches drift from crashes *mid-execution* — -something no start-time atomicity can prevent. **Chosen as a backstop.** - -| Option | Atomic with `df.*` | Atomic with caller's tx | Removes 5 s race | Fixes crash-time drift | Effort | -|--------|:--:|:--:|:--:|:--:|:--:| -| 1 — one store | n/a | no | yes | yes | Large | -| 2 — our writes in duroxide's tx | yes | **no** | yes | no | Medium | -| 3 — enqueue in caller's tx | yes | **yes** | **yes** | no | Small | -| 4 — async reconciler | repairs | — | no | **yes** | Small | - -We implemented **Option 3 (prevent) plus Option 4 (repair)**. Options 1 and 2 were -rejected: Option 1 is out of proportion to the problem, and Option 2 is a weaker -version of Option 3. - -## How a start is written - -`df.start()` (`src/dsl.rs`) runs these steps on the caller's transaction: - -1. Pick an instance id (`short_id()` — eight hex characters). -2. Validate (database exists, caller can log in, superuser policy). -3. Insert the graph rows into `df.nodes`. -4. Insert the instance row into `df.instances`. -5. Snapshot the caller's `df.vars`. -6. Enqueue the orchestration. - -The fix is small because the start is **a single enqueue**: one -`StartOrchestration` row in `_duroxide.orchestrator_queue`. The worker builds the -`_duroxide.instances` and history rows later. And because duroxide-pg exposes every -operation as a SQL function in the `_duroxide` schema, that enqueue can run over SPI -on the caller's transaction — no separate pool needed. - -### What each store owns - -The fix must not move any of these boundaries. - -| Data | Source of truth | Notes | -|------|-----------------|-------| -| Graph (`df.nodes`) | `df.*` | The worker loads it; duroxide does not store it. | -| Execution history | `_duroxide.history` | Runtime-internal; transactional per turn. | -| Control plane (`root_node`, `submitted_by`, `database`, `label`) | `df.instances` | The worker needs it to load and authenticate. | -| `df.instances.status` | best-effort mirror | Written by the worker; self-heals, not atomic with the runtime. | -| `df.nodes.status/result/error` | best-effort mirror | Same as above. | - -`submitted_by` is a **security boundary**: it is captured as the calling role at -start time and must stay unforgeable. The worker re-checks the superuser policy when -it loads the graph. - -### Where the two stores were written together - -Auditing every writer, the cross-store paths are bounded. All three now enqueue on -the caller's transaction: - -| Caller | `df.*` write | `_duroxide` write | On caller's tx now? | -|--------|------|------|:--:| -| `df.start()` | insert `df.nodes` + `df.instances` | enqueue `StartOrchestration` | yes | -| `df.cancel()` | update `df.instances.status` | enqueue `CancelInstance` | yes | -| `df.signal()` | none (only a row-security read) | enqueue `ExternalRaised` (+ fan-out) | yes | - -- **`df.cancel()`** was a genuine dual-write: it enqueued the cancel, then updated - the status mirror. The status update is only an optimistic hint — the worker - re-applies the authoritative `cancelled` status when it processes the cancel — so - any disagreement was already transient. Both now commit together. -- **`df.signal()`** writes only `_duroxide`, so it never created cross-store drift. - Moving it onto the caller's transaction simply gives it the same rule as the - others: *if my transaction rolls back, the signal is not delivered.* - -The worker also keeps the `df.*` status columns in sync as activities run. Those -writes are eventually consistent and self-healing by design; the reconciler cleans -up any lasting drift. - -## The mechanism - -Step 6 of `df.start()` is now an SPI call on the caller's transaction, right after -the `df.instances` insert: - -```text -INSERT INTO df.nodes ... -- step 3 -INSERT INTO df.instances ... -- step 4 -SELECT df._enqueue_orchestrator_start($id, $name, $input); -- step 6 (SPI) -``` - -Because the enqueue is an ordinary SPI statement, it shares the caller's -transaction: a rollback drops the queue row along with the `df.*` rows, and a commit -makes them visible together. `df.cancel()` and `df.signal()` work the same way -through their own wrappers. - -### Why this reaches into duroxide-pg directly (and needs the PG provider) - -Everywhere else, pg_durable talks to the runtime through duroxide's Rust -provider/client API — the out-of-band start/cancel/signal fallback (`src/client.rs`) -and the monitoring/explain reads (`src/monitoring.rs`, `src/explain.rs`) all go -through `Client`/`Provider` methods, which are provider-agnostic at the API level. -The in-transaction enqueue and `df.reconcile()` are the **first places pg_durable -calls duroxide-pg's SQL surface directly**: `enqueue_orchestrator_work` and -`delete_instances_atomic`, plus reads of `_duroxide.orchestrator_queue`, -`instances`, and `executions`. They therefore depend on duroxide-pg's table shapes -and function signatures, not just on the runtime existing. - -This is deliberate, and it is the whole point of the change: only direct SQL (via -SPI) can run inside the **caller's** transaction. The Rust client uses a separate -connection pool, so it can never be atomic with the backend's SPI work. The trade is -a deeper coupling to duroxide-pg in exchange for atomicity. - -It would not work for a non-PostgreSQL provider — there would be no SQL functions to -call — but pg_durable already assumes a duroxide-pg provider in a known schema: it -places its own `_worker_ready` table inside `_duroxide`, resolves the schema via -`df.duroxide_schema()`, and relies on duroxide-pg applying its migrations at startup. -In the pg_durable context — a PostgreSQL extension whose runtime state lives in the -same database — there is no real use case for another provider. The probe-and- -fallback (below) keeps any non-PG provider working, non-atomically, rather than -broken. - -### Why a privileged wrapper is needed - -`_duroxide.orchestrator_queue` is writable by its owner only (the worker role), so a -normal caller cannot insert into it. The enqueue therefore runs through a -`SECURITY DEFINER` wrapper in the `df` schema, owned by the extension owner. Each -wrapper pins `search_path` and schema-qualifies every reference (per CVE-2018-1058, -enforced by `scripts/pgspot-gate.sh`). - -The wrappers resolve the runtime schema at call time via `df.duroxide_schema()` -(`_duroxide` on current installs, legacy `duroxide` on older ones) and quote it -with `%I`. They are revoked from `PUBLIC` and granted through `df.grant_usage()`. - -Because a `SECURITY DEFINER` function runs as its owner, the wrappers cannot trust -the caller. Two safeguards make them safe to grant to every user: - -- **The work item is built inside the wrapper**, from validated arguments — never - passed in. A caller cannot choose the work-item type or smuggle in a foreign - target. `df.start` only accepts the public root graph-executor orchestration and - requires the input JSON's `instance_id` to match the target id; `df.cancel` builds - `CancelInstance`; `df.signal` builds `ExternalRaised`. All three use - `json_build_object`, matching duroxide's wire format. -- **The caller is authorized before any enqueue**, by two different rules: - - *Start* targets a brand-new instance, so it authorizes by *state*: it permits - the enqueue only for a `pending` `df.instances` row that has no queue entry and - no runtime instance yet. Under the atomic path a committed instance always has - its queue row in the same transaction, so this state is reachable only for the - row the caller just inserted — never someone else's instance. - - *Cancel* and *signal* target an already-committed instance, so they authorize by - *ownership*: `pg_has_role(session_user, , 'MEMBER')`. - `session_user` is the real authenticated role (it cannot be spoofed inside a - `SECURITY DEFINER` function), and checking membership rather than an exact name - means a role that owns the instance through `SET ROLE` still qualifies. A - non-member is rejected before anything is enqueued — the same gate `df.cancel` / - `df.signal` already enforce through row security. - -### Signal fan-out - -A signal must reach the root instance **and** every running sub-orchestration, -because a child (a `JOIN`/`RACE` branch or a loop generation) may be the one waiting -on it. Duroxide does not buffer external events until an orchestration has a pending -subscription, so `df.signal` first requires the root runtime row to exist; a signal -sent in the same transaction as `df.start()` is rejected instead of returning `OK` -and being skipped before the workflow can observe it. Once the root is materialized, -the wrapper walks the instance tree (`_duroxide.instances.parent_instance_id`) and -enqueues one event for the root plus each running descendant — all on the caller's -transaction, so the whole fan-out is atomic. - -### Worker readiness and ordering - -- The existing readiness gate still applies: if the worker has not initialized the - runtime schema yet, `df.start()` fails clearly (and atomically — the `df.*` - inserts roll back). -- The enqueue is the last write in `df.start()`. A primary-key collision on - `df.nodes` / `df.instances` still aborts before it, now cleanly inside one - transaction. - -### The 5-second load race is gone (on the atomic path) - -The worker's `load-function-graph` activity used to wait up to five seconds for the -`df.*` rows to appear, because it could dequeue a start before they committed. On -the atomic path the queue row becomes visible only *after* those rows commit in the -same transaction, so that wait can't trigger. The retry is kept for the -non-atomic fallback path, where it remains a no-op on the atomic path. - -## The reconciler - -`df.reconcile(p_grace_seconds integer DEFAULT 60)` is a `SECURITY DEFINER`, -admin-only function that repairs leftover drift: - -- It deletes orphaned **root** runtime instances — those with no parent, no matching - `df.instances` row, and older than the grace window — gathering each orphan's full - subtree so the delete is accepted. Sub-orchestrations (branches and loop - generations) have no `df.instances` row of their own and are excluded by the root - filter, so they are never collected on their own. -- It marks `df.instances` rows that are stuck non-terminal — no live runtime - instance and no queued start — as `failed`. - -Each pass is wrapped so a single failure never aborts the reconcile or stops the -loop. - -The background worker runs the reconciler as a **built-in durable cron loop**, -keeping exactly one instance per cluster. It reads the schedule, connects as the -`df_reconciler` role (`SET ROLE`), and starts: - -```sql --- $1 = the pg_durable.reconciler_cron value, $2 = 'df_reconciler' -SELECT df.start( - df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule($1))), - $2); -``` - -`worker::ensure_reconciler` starts it, skips if one is already pending or running, -and restarts it if it died. It is driven by the `pg_durable.reconciler_cron` setting -(default `*/5 * * * *`; empty disables it) and submitted by a dedicated -**non-superuser** role, `df_reconciler` (so `submitted_by` is `df_reconciler`). A -non-superuser identity keeps it clear of the superuser-instance guard and limits what -the instance can do to "run the reconciler" even if it were somehow forged. - -## Behavior change for users - -`df.start()` now takes part in the caller's transaction: - -```sql -BEGIN; -SELECT df.start('SELECT 1'); -- enqueued on THIS transaction -ROLLBACK; -- fully undone; nothing runs -``` - -Previously the `df.*` rows rolled back but the orchestration still ran. The new -behavior is the least surprising one and is the deliberate decision of this spec, -but it *is* a change from "fires the moment the statement runs." `df.cancel()` and -`df.signal()` gain the same property. This should be called out in `USER_GUIDE.md`. - -## Implementation notes - -- The atomic path requires the duroxide-pg provider (see *Why this reaches into - duroxide-pg directly* above). `df.start` / `df.cancel` / `df.signal` each probe for - its SQL surface (`enqueue_orchestrator_work` in the resolved schema); when it is - absent — a different provider, or a schema predating the wrappers — they log and - fall back to the old out-of-band path, so the change never breaks another provider - or contaminates `SELECT df.start(...)` output with client-visible warnings. -- `visible_at = now()` (transaction time) is enough for immediate starts. -- The work items are byte-compatible with duroxide's `WorkItem` JSON, so the worker - behaves exactly as before. - -### Hardening applied during review - -- **Cross-tenant enqueue (high).** The start wrapper first accepted an opaque work - item and was granted to everyone, which would let a caller forge a cancel or - signal against another instance. Fixed by building the work item inside the - wrapper and authorizing only a brand-new instance. -- **Reconciler deleting live work (high).** The runtime refuses to delete a parent - without its children, so deleting only orphan roots failed on any parallel-workflow - orphan. Fixed by gathering the full subtree and isolating each pass so one failure - can't stop the loop. -- **Suppressing the reconciler (high).** Its single-instance check first keyed on the - user-writable label, so a user could park a same-labelled instance to block it. - Fixed by keying on the unforgeable `df_reconciler` submitter. -- **Cron busy-loop (medium).** `df.wait_for_schedule` baked a fixed delay at build - time that the loop replayed every generation. Fixed so the wait node recomputes the - delay each generation from the orchestration's recorded clock (deterministic on - replay). **Upgrade caveat:** this changes the recorded replay event sequence - (`utc_now` is recorded before the timer). Any instance already waiting in a - `WAIT_SCHEDULE` node when the binary changes may replay expecting the old sequence - and fail as nondeterministic; drain or restart those scheduled instances during - upgrade. -- **Self-healing (medium).** The reconciler is re-checked from the steady-state poll - loop, not only once per worker epoch, so a cancelled one comes back within the poll - interval. -- **Silent fallback (medium).** `df.start` now logs when it uses the non-atomic - fallback. -- **Kept `LOGIN` on `df_reconciler`** (debated): the worker runs the reconcile node - by connecting *as* the role, like every other durable-function role, so `NOLOGIN` - would break it. - -> **Local-dev note:** `DROP EXTENSION pg_durable CASCADE` can leave the worker-owned -> `_duroxide` schema half-dropped (its migration row remains but functions are gone), -> which shows up as flaky `JOIN` behavior and a missing `_worker_ready`. Recover with -> `DROP SCHEMA _duroxide CASCADE` and a restart so the worker re-applies its -> migrations. - -## Upgrade and migration - -**New binary against an older schema.** The wrappers live in the `df` schema, so -they ship in the extension's install SQL and an upgrade script -(`sql/pg_durable----.sql`, `CREATE FUNCTION` + `GRANT`), with a -"Version-Specific Changes" entry in `docs/upgrade-testing.md`. A new binary running -against an older `df` schema that predates the wrappers must still work: each caller -uses the atomic path only when both the duroxide-pg SQL enqueue function **and** the -`df._enqueue_orchestrator_*` wrappers exist; otherwise it falls back to the old -out-of-band path. This keeps the binary compatible with every prior schema while -the upgrade rolls out. - -**Data migration.** None. No table shapes change, and already-enqueued work items -are left in place. - -**In-flight scheduled waits.** `df.wait_for_schedule` now records an `utc_now` event -before the timer. That changes the recorded replay event sequence for existing -instances already waiting in any `WAIT_SCHEDULE` node. Drain or restart those -in-flight scheduled waits during upgrade; otherwise they may fail on replay as -nondeterministic. - -**Rollback.** Reverting the binary restores the out-of-band enqueue; the wrappers, if -left in place, are simply unused. - -**Future direction.** The privileged enqueue could move into duroxide-pg, owned by -the runtime schema owner — cleaner layering, at the cost of a coordinated -duroxide-pg release. The `df`-schema wrapper avoids that coordination and is what -ships today. - -## Validation (PostgreSQL 17) - -Prevent: - -| Check | Result | -|-------|:--:| -| Committed `df.start()` runs to completion | pass | -| Rolled-back `df.start()` leaves no `df.*` and no `_duroxide` orphan | pass | -| Rolled-back `df.cancel()` leaves the instance running and rolls back the enqueue; committed cancel takes effect | pass | -| Rolled-back `df.signal()` rolls back the fan-out enqueue; committed signal delivers (incl. fan-out into a sub-orchestration) | pass | -| A caller cannot enqueue a start, cancel, or signal against a foreign or arbitrary instance; cancel/signal succeed for the owning role (incl. via `SET ROLE`) | pass | -| Schema resolved dynamically; atomicity holds | pass | -| E2E `07_signals`, `23_signal_in_race` (fan-out), `22_cancel_status_consistency` | 3/3 | -| Broader E2E subset (core, conditionals, loops, variables, join, race) | pass | - -Repair: - -| Check | Result | -|-------|:--:| -| `df.reconcile()` deletes an orphan root with children, leaves sub-orchestrations and live instances intact | pass | -| `df.reconcile()` marks a stuck instance failed | pass | -| Reconciler auto-starts as non-superuser `df_reconciler`, stays a single instance | pass | -| Reconciler fires on the cron schedule without spinning, and restarts after being cancelled | pass | -| E2E subset unaffected with the reconciler running | pass | -| `cargo fmt --check` clean; no new clippy warnings | pass | - -## Open questions - -- **Enqueue placement** — keep the `df`-schema wrapper, or move it into duroxide-pg - (cleaner layering, but a coordinated release). The wrapper ships today; the move is - a future option. -- **Contract stability** — pg_durable now depends on the enqueue function signature - and the `WorkItem` JSON shape. Mitigated by the version-locked - duroxide / duroxide-pg pair. -- **Reconciler policy** — the grace window and cron cadence still want tuning, and - `df_reconciler` is created by the worker rather than provisioned externally. +- **pg_durable control plane** — `df.nodes` and `df.instances`, written on the + caller's PostgreSQL transaction. +- **duroxide runtime** — `_duroxide` queue/history/instance state, previously + written out-of-band through a separate connection. + +Those writes were not atomic. The visible failure was a rolled-back `df.start()`: +`df.*` rows rolled back, but the duroxide enqueue survived. The worker then retried +an orchestration whose graph no longer existed, waited up to 5 seconds for rows that +would never appear, and left an orphan behind. We reproduced this with rolled-back +starts leaving `_duroxide.instances` rows that had no matching `df.instances` row. + +`df.cancel()` and `df.signal()` had the same transaction-boundary problem for their +runtime enqueue: a rollback of the caller's transaction did not roll back the runtime +work. + +## Decision + +Implement **prevent + repair**: + +1. **Prevent:** enqueue `df.start()`, `df.cancel()`, and `df.signal()` runtime work + inside the caller's transaction via SPI. The control-plane writes and runtime + enqueue now commit or roll back together. +2. **Repair:** run a lightweight reconciler that removes leftover runtime orphans + and marks stuck control-plane rows failed. This catches legacy/fallback drift and + crash-time drift that transaction-local enqueue cannot prevent. + +## Alternatives considered + +| Option | Summary | Decision | +|---|---|---| +| Single source of truth in `_duroxide` | Move all state into the runtime schema and rebuild pg_durable reads/security over it. | Too large and migration-heavy for this fix. | +| Run pg_durable writes inside duroxide's transaction | Let duroxide-pg perform the `df.*` writes on its worker-pool connection. | Atomic, but not with the caller's transaction; breaks identity capture and row security expectations. | +| Run duroxide enqueue inside the caller's transaction | Keep both stores and use SPI to enqueue runtime work next to `df.*` writes. | Chosen primary fix. | +| Async reconciler only | Tolerate drift and repair it later. | Useful backstop, but insufficient alone. | + +## Design overview + +`df.start()` still creates the graph and instance rows in `df.*`. The final enqueue +step now happens through SQL on the same transaction, so a rollback undoes both the +control-plane rows and the runtime queue row. `df.cancel()` and `df.signal()` follow +the same transaction rule. + +The runtime queue is not writable by ordinary users, so the enqueue goes through +private `SECURITY DEFINER` wrappers in the `df` schema. These wrappers are granted +through `df.grant_usage()`, build the runtime work items themselves, and perform +their own authorization checks before writing to `_duroxide`. + +The start wrapper is intentionally **not** a general-purpose privileged runtime +entrypoint. It only starts the root function-graph orchestration and validates that +the input targets the same instance id. Cancel/signal wrappers authorize against the +instance owner. + +### Direct duroxide-pg coupling + +This is the abstraction break in the design. Most pg_durable code talks to the +runtime through duroxide's Rust provider/client API. That API uses a separate +connection pool, so it cannot share the caller's backend transaction. + +To get caller-transaction atomicity, this PR calls the duroxide-pg SQL surface +directly (`enqueue_orchestrator_work`, `delete_instances_atomic`, and selected +runtime tables). That only works with a PostgreSQL-backed provider. In practice, +pg_durable is itself a PostgreSQL extension whose runtime state lives in the same +database, so a non-PG provider is not a meaningful deployment target; still, the +code probes for the SQL surface and falls back to the old out-of-band path when it +is absent. + +### Signals + +`df.signal()` fans out to the root instance and any running sub-orchestrations, +because a child branch may be the one waiting on the signal. + +Duroxide does not buffer external events until an orchestration is ready to receive +them. Therefore a signal sent before the root runtime row exists is rejected instead +of returning `OK` and being silently skipped. Once the runtime row exists, the +signal enqueue is atomic with the caller's transaction. + +### Reconciler + +`df.reconcile()` is an admin-only backstop. It: + +- deletes orphaned runtime **root** instances whose full subtree has no matching + `df.instances` row; and +- marks stuck `df.instances` rows failed when there is no live runtime instance and + no queued start. + +The background worker keeps one reconciler durable loop running per cluster on +`pg_durable.reconciler_cron` (default `*/5 * * * *`; empty disables it), submitted +by the dedicated non-superuser role `df_reconciler`. + +## Behavior changes + +- `df.start()`, `df.cancel()`, and `df.signal()` now participate in the caller's + transaction. For example, `BEGIN; SELECT df.start(...); ROLLBACK;` no longer + starts the workflow on the atomic path. +- If the duroxide-pg SQL surface or the `df` wrappers are missing, pg_durable logs + and falls back to the previous non-atomic client path. The fallback is not emitted + as a client-visible SQL `WARNING`, so scripts that capture `SELECT df.start(...)` + output remain compatible. +- `df.wait_for_schedule` now records an `utc_now` event before the timer so repeated + schedule waits compute the next cron tick each generation. This fixes a loop + busy-loop bug, but changes the recorded replay event sequence. + +## Upgrade and compatibility + +The wrappers and `df.reconcile()` are `df`-schema objects, so they are included in +both fresh-install SQL and the `0.2.3 -> 0.2.4` upgrade script. The upgrade script +also updates `df.grant_usage()` / `df.revoke_usage()` and backfills wrapper EXECUTE +privileges to existing roles that already had explicit `USAGE` on schema `df`. + +Binary backward compatibility is preserved for old schemas that have not yet run +`ALTER EXTENSION UPDATE`: the new binary uses the atomic path only when both the +duroxide-pg enqueue function and the `df._enqueue_orchestrator_*` wrappers exist; +otherwise it falls back to the old out-of-band client path. + +There is no table data migration. + +**Upgrade caveat:** drain or restart any in-flight instance already waiting in a +`WAIT_SCHEDULE` node during upgrade. Those histories may replay expecting the old +sequence and fail as nondeterministic after the `utc_now` change. + +## Validation + +Automated coverage added in this PR: + +- `24_atomic_rollback` — rolled-back start/cancel/signal leave no runtime effect; + signaling before runtime materialization is rejected. +- `25_enqueue_wrapper_authz` — wrapper authorization and start-wrapper hardening. +- `26_reconcile_orphan_gc` — orphan root with child sub-orchestrations is collected + as a full subtree; healthy instances remain untouched. +- `scripts/test-upgrade.sh` — schema equivalence, binary compatibility, data + compatibility, and existing-role wrapper-grant backfill. + +Manual/targeted validation included signal fan-out, cancel consistency, reconciler +liveness, cron scheduling, formatting, clippy, pgspot, and upgrade tests. + +## Remaining risks / follow-up + +- The direct duroxide-pg SQL dependency is intentional but should remain small and + well documented. Moving the privileged enqueue surface into duroxide-pg would be a + cleaner long-term boundary. +- Reconciler grace/cadence and role provisioning may need tuning after operational + experience. From 73a8cfe5792a10e1b994a31d2cd9a88ec8e072fd Mon Sep 17 00:00:00 2001 From: tjgreen42 <1738591+tjgreen42@users.noreply.github.com> Date: Fri, 19 Jun 2026 23:22:10 +0000 Subject: [PATCH 20/20] docs: clarify start wrapper authorization comment The SECURITY DEFINER start wrapper allows same-owner pending instances that have not yet been enqueued; that state can be created directly by a df user, not only by df.start in the same transaction. Clarify the comment and point to the actual safety properties: fixed root orchestration, matching input instance id, and no foreign already-started target. --- sql/pg_durable--0.2.3--0.2.4.sql | 13 +++++++------ src/lib.rs | 13 +++++++------ 2 files changed, 14 insertions(+), 12 deletions(-) diff --git a/sql/pg_durable--0.2.3--0.2.4.sql b/sql/pg_durable--0.2.3--0.2.4.sql index 49f7d7c..947c9b7 100644 --- a/sql/pg_durable--0.2.3--0.2.4.sql +++ b/sql/pg_durable--0.2.3--0.2.4.sql @@ -248,12 +248,13 @@ BEGIN -- Authorization. This runs as the (privileged) definer, so it must not -- trust the caller to only target their own instance. Permit the enqueue -- only for a brand-new, not-yet-started instance: a 'pending' df.instances - -- row with no orchestrator-queue entry and no duroxide instance. Under the - -- atomic-start path a committed df.instances row always has its queue row in - -- the same transaction, so this state is reachable only for the instance the - -- caller (df.start) just inserted in the current transaction — never another - -- user's committed instance. This is what stops a caller from forging work - -- against a foreign instance. + -- row with no orchestrator-queue entry and no duroxide instance. A df user + -- can create such a pending instance directly (not only through df.start), + -- so this is a same-owner/not-yet-started check rather than proof that the + -- row was inserted in this transaction. The wrapper is safe because it also + -- fixes the orchestration to the root graph executor and validates the input + -- instance id, so callers cannot start internal orchestrations or target + -- someone else's already-started instance. EXECUTE pg_catalog.format( 'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') ' ' OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) ' diff --git a/src/lib.rs b/src/lib.rs index dd7f7ac..2debab9 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -571,12 +571,13 @@ BEGIN -- Authorization. This runs as the (privileged) definer, so it must not -- trust the caller to only target their own instance. Permit the enqueue -- only for a brand-new, not-yet-started instance: a 'pending' df.instances - -- row with no orchestrator-queue entry and no duroxide instance. Under the - -- atomic-start path a committed df.instances row always has its queue row in - -- the same transaction, so this state is reachable only for the instance the - -- caller (df.start) just inserted in the current transaction — never another - -- user's committed instance. This is what stops a caller from forging work - -- against a foreign instance. + -- row with no orchestrator-queue entry and no duroxide instance. A df user + -- can create such a pending instance directly (not only through df.start), + -- so this is a same-owner/not-yet-started check rather than proof that the + -- row was inserted in this transaction. The wrapper is safe because it also + -- fixes the orchestration to the root graph executor and validates the input + -- instance id, so callers cannot start internal orchestrations or target + -- someone else's already-started instance. EXECUTE pg_catalog.format( 'SELECT NOT EXISTS (SELECT 1 FROM df.instances i WHERE i.id = $1 AND i.status = ''pending'') ' ' OR EXISTS (SELECT 1 FROM %I.orchestrator_queue q WHERE q.instance_id = $1) '