From 77021f232665fe796050d79644a5e185e050adae Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 30 May 2026 08:42:56 +0000 Subject: [PATCH 1/8] docs: add durable-execution feasibility study Evaluate whether PgQue should extend into a durable-workflow engine (DBOS/absurd-style) on Postgres, and the adoption odds if so. Synthesizes deep research on DBOS, absurd, Temporal, Restate, Rivet, and Gadget Silo, grounded against SPECx 2.3 positioning and the PgQ engine constraints. Key finding: the durable layer needs SKIP-LOCKED claim/lease semantics, a second concurrency model beside PgQ rotation, so the zero-bloat differentiator does not transfer. Recommends a thin transactional-durable-enqueue + experimental checkpointed-steps path rather than a head-on Temporal/DBOS competitor. --- blueprints/DURABLE_EXECUTION_FEASIBILITY.md | 313 ++++++++++++++++++++ 1 file changed, 313 insertions(+) create mode 100644 blueprints/DURABLE_EXECUTION_FEASIBILITY.md diff --git a/blueprints/DURABLE_EXECUTION_FEASIBILITY.md b/blueprints/DURABLE_EXECUTION_FEASIBILITY.md new file mode 100644 index 00000000..9b76267f --- /dev/null +++ b/blueprints/DURABLE_EXECUTION_FEASIBILITY.md @@ -0,0 +1,313 @@ +# Durable Execution on PgQue — Feasibility & Adoption Study + +- **Status:** Brainstorm / decision input (not approved scope) +- **Date:** 2026-05-30 +- **Question:** Should PgQue extend beyond a queue into a **durable-workflow / + durable-execution engine on Postgres**, the way DBOS and absurd have? What are + the realistic chances of adoption success if we go that route? +- **Companion reading:** `blueprints/SPECx.md` §2.3 (workflow engines are + deliberately treated as a *different category* today), `CLAUDE.md` Key Design + Rules #2 (the PgQ engine is sacred) and #3 (modern API must reduce cleanly to + PgQ primitives). + +This study was produced after a deep review of the Hacker News thread +["Building durable workflows on Postgres"](https://news.ycombinator.com/item?id=48313530) +and a parallel investigation of the six systems that thread orbits around: +**DBOS, absurd, Temporal, Restate, Rivet, and Gadget's Silo**. + +--- + +## 1. Verdict up front + +**Do not pivot PgQue into a Temporal/DBOS-style durable-execution platform.** +Going head-to-head as a general "durable workflows on Postgres" engine is a +**late, undifferentiated, SDK-heavy bet** in a category that already has a +well-distributed Postgres-native incumbent (absurd) and a celebrity-founder +incumbent (DBOS). PgQue's signature advantage — zero-bloat snapshot+TRUNCATE +rotation — **does not transfer** to the workflow layer, which needs per-run +exclusive claiming, not batch rotation. + +**Do pursue the adjacent, well-leveraged slice:** *transactional durable +queues + checkpointed steps*, shipped as an **optional, experimental +`pgque-api` layer** that reduces to PgQ primitives plus one new +claim/lease table. This stays inside PgQue's identity ("the best zero-bloat +Postgres queue, now with durable steps"), exploits the one capability DBOS +under-documents (atomic enqueue inside the caller's own transaction), and +defers the expensive part (multi-language deterministic-replay runtimes) until +demand is proven. + +Net: **moderate-to-low feasibility for a standalone workflow-engine play; +moderate-to-high feasibility for a focused "durable steps" extension of the +queue.** The recommendation is the latter, phased and explicitly experimental. + +--- + +## 2. Is the category real? (Yes.) + +Durable execution is a funded, growing category, and the "just Postgres" +variant specifically is where the energy is: + +| System | Model | License | Stars (May 2026) | Funding / backing | +|---|---|---|---|---| +| **Temporal** | Separate Go cluster, event-sourced replay, Cassandra/MySQL/PG | MIT | ~20.6k | ~$350M total, $1.72B valuation | +| **DBOS** | Embedded library, Postgres system-DB, checkpoint+replay | MIT (Transact) / proprietary (Conductor) | ~3.5k across 4 SDKs | $8.5M seed; Stonebraker + Zaharia | +| **absurd** | Single SQL file + thin SDK, SKIP-LOCKED claim/lease, checkpoint-replay | Apache-2.0 | ~1.95k | Armin Ronacher / Earendil | +| **Restate** | Self-contained Rust binary, RocksDB+log, journaling-replay | BSL 1.1 → Apache | ~3.9k | $7M seed (Redpoint); ex-Flink team | +| **Rivet** | Actor platform, Postgres/RocksDB/FoundationDB | Apache-2.0 | ~5.6k | YC W23 | +| **Silo (Gadget)** | Rust broker on SlateDB/object storage | MIT | ~31 (prototype) | Internal Gadget project | + +Signals worth internalizing: + +- **The market keeps asking "why not just Postgres?"** Restate (own RocksDB + store, BSL) took repeated HN criticism on exactly this. Rivet hedged back + toward Postgres as a self-host backend. That recurring question *is* PgQue's + thesis — but absurd already answered it first in the workflow space. +- **Licensing trust is a real axis.** Restate's BSL drew loud "open source is + misleading" criticism; Rivet's Apache-2.0 drew none. PgQue (Apache-2.0, + literally "your own Postgres") inherits maximum trust by default. +- **The wedge against Temporal is operational, not technical.** Every + competitor's pitch is the same: *don't run a second distributed system; reuse + the database you already operate.* The complaints about Temporal are + determinism learning curve, immutable shard-count decisions, and the + Cassandra/Elasticsearch operational floor — not correctness. +- **AI agents are the current demand driver.** DBOS, absurd, Restate, and + Temporal are all repositioning durable execution as the substrate for + long-running LLM agent loops. absurd's canonical example is an agent loop; + Temporal's Series C narrative is "durable AI workloads." + +So the category is real and PgQue's "Postgres-native, OSI-licensed, no new +infra" framing is genuinely well-aligned with where the market is pulling. The +problem is not the thesis — it is **who already occupies the exact niche**. + +--- + +## 3. The technical crux: a concurrency-model mismatch + +This is the single most important finding of the study. + +Every durable-execution engine examined (DBOS, absurd, Silo, and Temporal's +matching service) claims work units with **`SELECT … FOR UPDATE SKIP LOCKED` + +a time-limited lease**: one run is claimed exclusively by one worker, the lease +auto-extends while the worker checkpoints, and if the worker dies the lease +expires and another worker steals the run. Sleep/await is modelled as +`available_at` re-scheduling on a per-run row. + +**PgQue does not work this way and must not be made to.** PgQ's engine is +*snapshot + TRUNCATE batch rotation*: consumers read all events committed +between two ticks using `pg_snapshot` visibility, and old event tables are +rotated and TRUNCATEd wholesale. There is: + +- no per-item exclusive claim (consumption is cooperative, batch-oriented, + at-least-once); +- no lease / steal-on-crash for an individual message; +- no per-row `available_at` rescheduling; +- and — critically — **a hard conflict with long-lived state**: a workflow that + `sleep`s for a week or `awaitEvent`s indefinitely must persist its run and + checkpoints far longer than any rotating event window survives. An open batch + already blocks rotation; a long-running workflow on the rotation tables would + be pathological. + +`CLAUDE.md` Rule #2 ("the PgQ engine is sacred") forecloses retrofitting +rotation to behave like a claim/lease queue. So a durable layer on PgQue would +require a **new SKIP-LOCKED claim/lease table living *beside* the PgQ engine**, +not on top of it. That means **PgQue would carry two concurrency models at +once** — the rotation engine for streaming/CDC/fan-out, and a claim/lease engine +for durable runs. + +This is the crux of the whole decision: + +> PgQue's marketing identity is "the one Postgres queue that never bloats +> because it uses rotation instead of SKIP LOCKED + DELETE." A durable-execution +> layer is, by necessity, **a SKIP-LOCKED + UPDATE engine** — exactly the +> mechanism PgQue defines itself against. The zero-bloat differentiator does not +> apply to the workflow tables; they will accumulate dead tuples and need VACUUM +> like everyone else's (mitigated by partition-detach, as absurd does). + +The latency objection, by contrast, **is no longer real**: PgQue 0.2.0 ticks at +a 100 ms cadence (`ticker_loop()` runs `pgque.ticker()` every +`tick_period_ms`, default 100 ms, committing between iterations), so end-to-end +delivery is sub-second — squarely in the same range as absurd's and DBOS's +worker polling loops. Latency is not what would hold a durable layer back. + +--- + +## 4. What is reusable vs. net-new + +**Reusable / aligned (low cost):** + +- **Single-file, anti-extension, managed-PG install.** absurd's `absurd.sql` + philosophy is identical to PgQue's `\i pgque.sql`. Validated approach. +- **pg_cron for maintenance.** absurd uses pg_cron for partition + provisioning/cleanup/detach; PgQue already uses pg_cron for the ticker and + `maint()`. Same muscle. +- **The data model** (tasks / runs / steps / checkpoints / events / waits) is + clean and orthogonal to how raw events are queued underneath. PgQue could + adopt absurd's schema shape almost verbatim as a reference. +- **The no-determinism, checkpoint-replay, task-level-retry design.** This is + the most reusable idea and the one that *does* reduce cleanly to primitives + (Rule #3): a `pgque.step()` checkpoint table + replay-on-retry needs no change + to the PgQ engine. It also sidesteps Temporal's biggest adoption tax (the + determinism paradigm and "deploy correct code, break all running workflows"). +- **SQL-native observability comes free.** Workflows-as-rows means `psql` + inspection, `list_workflows`-style queries, and forking — all trivial in a + system that *is* SQL. DBOS markets this heavily; PgQue gets it for free. + +**Net-new (the real cost):** + +1. **The deterministic/checkpoint replay *runtime*, per language.** This is + SDK-side logic — step memoization, resume-from-checkpoint, crash recovery + coordination across executors. The DBOS study's headline finding: + **~80% of DBOS's engineering lives in the multi-language SDK + replay runtime, + not in SQL.** PgQ's PL/pgSQL gives you the substrate, not this. +2. **Claim / lease / watchdog machinery.** Claim expiry, lease extension on + checkpoint write, watchdog termination of broken workers — net-new, and + exactly the part absurd had to harden over five months in production. +3. **Long-lived-state lifecycle.** Non-rotated run/checkpoint tables with TTL / + partition-detach cleanup, decoupled from event rotation. +4. **Event caching / first-write-wins** races (await/emit), needing a + uniquely-indexed events table carefully ordered against wait registrations. +5. **N synchronized SDKs.** DBOS maintains four (Python, TS mature; Go, Java + still 0.x). Keeping each in lockstep with engine semantics is the dominant + ongoing cost — well beyond the "~1,500 lines" budgeted for pgque-api in + SPECx. Durable workflows are a different order of magnitude. + +--- + +## 5. Adoption analysis — can PgQue win here? + +### Where PgQue *could* win + +- **"It's literally just your Postgres."** No new datastore (vs Restate's + RocksDB, Silo's SlateDB, Temporal's Cassandra), Apache-2.0 (vs Restate's BSL), + managed-PG compatible. The strongest anti-lock-in story in the field. +- **True transactional exactly-once for in-DB effects.** Because PgQue *is* + SQL, enqueue is just an `INSERT` in the caller's transaction — so "commit your + business write and the workflow enqueue atomically" is a first-class, + demonstrable feature. **DBOS markets a version of this but the study could not + find it explicitly documented** — there is an opening to own it cleanly. +- **Proven-engine credibility.** Silo openly calls itself a prototype; absurd is + "an experiment in durability"; Restate has no rigorous benchmarks. PgQue's + PgQ lineage (15+ years, Skype/Microsoft scale) is the inverse story. +- **Concurrency/rate-limiting as native primitives.** Silo's standout feature + (per-key concurrency queues, floating limits) is something Postgres does + atomically and well — a credible differentiator if pursued. + +### Where PgQue would *struggle* + +- **absurd already owns the exact niche.** "Single SQL file, Postgres-only, + self-hostable, thin SDK, no determinism" is *precisely* the slot a PgQue + durable layer would target — and absurd has ~1.95k stars, Apache-2.0, + production hardening, a Rust port (TensorZero), and **Armin Ronacher's + distribution**. PgQue would arrive second with a near-identical pitch. +- **The differentiator doesn't transfer.** Zero-bloat rotation — the entire + reason to choose PgQue over pgmq — is irrelevant to the claim/lease workflow + tables. On the workflow layer PgQue is just another SKIP-LOCKED engine. +- **Two concurrency models = a muddier story.** "We never use SKIP LOCKED" and + "our workflow engine uses SKIP LOCKED" coexisting invites the exact + "isn't this just a complicated queue with state?" critique DBOS already takes + on HN. +- **SDK breadth is years of work.** Competing on the workflow programming model + means carrying replay runtimes in multiple languages — the opposite of + PgQue's "language-agnostic, the SQL API *is* the product" advantage. +- **Single-Postgres ceiling.** DBOS guides ~a few thousand state + transitions/sec on one Postgres; honest, but it concedes hyperscale to + Temporal. Fine for a queue; a constraint to state loudly for workflows. + +### Honest read on success probability + +A **standalone "PgQue Workflows" engine** competing with DBOS/absurd head-on: +**low chance of breakout adoption.** Late entry, transferable differentiator +absent, high and ongoing SDK cost, against a founder-distribution incumbent +(absurd) and a credential incumbent (DBOS). + +A **thin "durable steps + transactional enqueue" extension** of the existing +queue: **moderate-to-good chance of being genuinely useful and adopted by +PgQue's own users** — because it deepens the queue they already chose rather +than asking them to evaluate PgQue as a workflow platform. It rides existing +adoption instead of opening a new, contested front. + +--- + +## 6. Strategic options + +**Tier 0 — Stay the course (no change).** Keep SPECx §2.3's stance: PgQue is a +queue; if you need step-by-step workflows, use Temporal/Restate/absurd. Lowest +risk. Forgoes the category's momentum. + +**Tier 1 — Own "transactional durable enqueue" (recommended, low risk).** +No workflow engine. Lean into what PgQue already does better than DBOS can +document: enqueue inside the caller's transaction, exactly-once consumption +patterns, idempotency helpers, and the supporting docs/examples. Pure +extension of the queue identity. Mostly documentation + small helpers + +TDD-tested examples. Reduces trivially to PgQ primitives. + +**Tier 2 — Experimental absurd-style "durable steps" (optional, medium risk).** +Add a `sql/experimental/durable.sql` layer: +- a **new claim/lease run table** beside the PgQ engine (not on rotation), +- `tasks / runs / steps / checkpoints / events / waits` modelled on absurd, +- checkpoint-replay with **task-level retry, no determinism requirement**, +- pg_cron partition-detach cleanup for long-lived state, +- **exactly one** reference SDK (Python *or* TypeScript — whichever the user + base skews to), explicitly experimental. +Gated behind the `blueprints/PHASES.md` promotion rule. Honest "experimental, +single-Postgres, up to a few thousand runs/sec" labelling. This is the only +tier that touches the workflow space, and it does so additively and reversibly. + +**Tier 3 — Full multi-language deterministic-replay workflow platform +(not recommended).** Competing with Temporal/DBOS on their terms. Highest cost, +weakest differentiation, contradicts PgQue's language-agnostic identity. + +--- + +## 7. Recommendation + +1. **Adopt Tier 1 now.** It is low-cost, on-identity, and claims a feature DBOS + leaves under-specified. It also makes the eventual Tier 2 story coherent. +2. **Prototype Tier 2 as an explicit experiment**, behind `sql/experimental/`, + only if there is real pull from PgQue users (especially AI-agent use cases). + Treat absurd's schema and the no-determinism checkpoint-replay model as the + reference design. Build the claim/lease table as a *separate* mechanism and + document plainly that this layer is a SKIP-LOCKED engine — do not pretend it + inherits the zero-bloat property. +3. **Do not pursue Tier 3.** Concede hyperscale and full polyglot workflow + orchestration to Temporal/DBOS; that honesty is itself credibility. +4. **Keep the queue the headline.** PgQue's winning, defensible story remains + "the zero-bloat, managed-PG-compatible, language-agnostic Postgres queue." + Durable steps are a *feature of that queue*, never a repositioning of it. + +--- + +## 8. Open questions for the maintainers + +- Is there demonstrated demand from PgQue's users for durable steps, or is this + driven by category FOMO? (Tier 2 should wait for the former.) +- If Tier 2 proceeds, which single SDK first — Python (AI-agent gravity) or + TypeScript (absurd's primary)? +- Are we comfortable carrying two concurrency models in one project, and + messaging that without diluting the zero-bloat story? +- Does the transactional-enqueue feature (Tier 1) warrant a dedicated + benchmark + example against DBOS's `DBOSClient` enqueue path? + +--- + +## Appendix — per-system one-liners + +- **DBOS** — embedded library, Postgres `dbos` system schema + (`workflow_status`, `operation_outputs`, …), 1 write/step + 2/workflow, + SKIP-LOCKED queue dequeue, ~40k workflows/sec on one Postgres, replay runtime + is ~80% of the work, Conductor (ops) is the proprietary money-maker. +- **absurd** — single `absurd.sql`, per-queue `t_/r_/c_/e_/w_/i_` tables, + SKIP-LOCKED claim-with-lease, task-level retry, no determinism, pg_cron + partition detach, thin TS/Python SDKs, Rust port (TensorZero). +- **Temporal** — separate cluster (Frontend/History/Matching/Worker), + event-sourced deterministic replay, Cassandra at scale, immutable + `numHistoryShards`, 7 SDKs, determinism is the adoption tax. +- **Restate** — single Rust binary, log + RocksDB, virtual objects, durable + promises, BSL license drew "not really open source" criticism. +- **Rivet** — Apache-2.0 actor platform (Durable-Objects-style), Postgres / + RocksDB / FoundationDB, broader than workflows, no multi-actor atomic txns. +- **Silo** — Gadget's Rust broker on SlateDB/object storage, durable *job + queue* (not workflow engine), single-shard-per-tenant (~4k jobs/sec cap), + first-class concurrency + rate limiting, self-described prototype. + + From edbaf71d0111931ad336af623e3c949f42626be8 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 30 May 2026 09:08:59 +0000 Subject: [PATCH 2/8] docs: reframe durable-execution study around event sourcing Earlier draft concluded the zero-bloat differentiator does not transfer to a workflow layer, assuming a mutable workflow_status row updated per step (the DBOS/absurd strategy). That was wrong. Model workflow state transitions as appended events over the rotating log (continuation-passing): each step enqueues its successor instead of mutating a row. Transitions become appends, not UPDATEs, so zero-bloat carries through. Exactly-once handoff falls out of insert_event + finish_batch in one transaction; sleep/timers use the rotating send_at from PR #237; exclusivity is structural via cooperative consumers; the only mutable state is a current-state projection bounded by concurrency. Verdict flips from 'do not compete' to 'compete on a substrate SKIP-LOCKED systems cannot match for high-throughput durable workflows'. Remaining real risk: awaitEvent/join semantics. --- blueprints/DURABLE_EXECUTION_FEASIBILITY.md | 504 +++++++++++--------- 1 file changed, 286 insertions(+), 218 deletions(-) diff --git a/blueprints/DURABLE_EXECUTION_FEASIBILITY.md b/blueprints/DURABLE_EXECUTION_FEASIBILITY.md index 9b76267f..dcb0ce92 100644 --- a/blueprints/DURABLE_EXECUTION_FEASIBILITY.md +++ b/blueprints/DURABLE_EXECUTION_FEASIBILITY.md @@ -5,8 +5,9 @@ - **Question:** Should PgQue extend beyond a queue into a **durable-workflow / durable-execution engine on Postgres**, the way DBOS and absurd have? What are the realistic chances of adoption success if we go that route? -- **Companion reading:** `blueprints/SPECx.md` §2.3 (workflow engines are - deliberately treated as a *different category* today), `CLAUDE.md` Key Design +- **Companion reading:** `blueprints/SPECx.md` §2.3 (workflow engines treated as + a separate category today), `blueprints/COOPERATIVE_CONSUMERS.md` (0.2, + experimental), PR #237 (rotating zero-bloat `send_at`), `CLAUDE.md` Key Design Rules #2 (the PgQ engine is sacred) and #3 (modern API must reduce cleanly to PgQ primitives). @@ -15,30 +16,58 @@ This study was produced after a deep review of the Hacker News thread and a parallel investigation of the six systems that thread orbits around: **DBOS, absurd, Temporal, Restate, Rivet, and Gadget's Silo**. +> **Revision note (2026-05-30):** An earlier draft of this study concluded that +> PgQue's zero-bloat differentiator "does not transfer" to a workflow layer, +> because durable engines need `SELECT … FOR UPDATE SKIP LOCKED` claim/lease +> semantics that conflict with PgQ's rotation model. **That conclusion was +> wrong.** It assumed the DBOS/absurd implementation strategy (a mutable +> `workflow_status` row updated per step). If instead workflow state transitions +> are modelled as **appended events over the rotating log** — i.e. durable +> execution as event sourcing — the rotation model is not an obstacle but an +> *advantage*. This revision rebuilds the analysis around that architecture. + --- ## 1. Verdict up front -**Do not pivot PgQue into a Temporal/DBOS-style durable-execution platform.** -Going head-to-head as a general "durable workflows on Postgres" engine is a -**late, undifferentiated, SDK-heavy bet** in a category that already has a -well-distributed Postgres-native incumbent (absurd) and a celebrity-founder -incumbent (DBOS). PgQue's signature advantage — zero-bloat snapshot+TRUNCATE -rotation — **does not transfer** to the workflow layer, which needs per-run -exclusive claiming, not batch rotation. - -**Do pursue the adjacent, well-leveraged slice:** *transactional durable -queues + checkpointed steps*, shipped as an **optional, experimental -`pgque-api` layer** that reduces to PgQ primitives plus one new -claim/lease table. This stays inside PgQue's identity ("the best zero-bloat -Postgres queue, now with durable steps"), exploits the one capability DBOS -under-documents (atomic enqueue inside the caller's own transaction), and -defers the expensive part (multi-language deterministic-replay runtimes) until -demand is proven. - -Net: **moderate-to-low feasibility for a standalone workflow-engine play; -moderate-to-high feasibility for a focused "durable steps" extension of the -queue.** The recommendation is the latter, phased and explicitly experimental. +**Durable execution is feasible on PgQue's model, and for the workloads that +dominate today's demand it can win *because of* the model, not despite it.** + +The key realisation: durable execution *is* event sourcing (this is literally +how Temporal's event-history-and-replay works), and PgQ is already an +append-only event log with snapshot-batched consumption and TRUNCATE rotation. +The mistake is to copy DBOS/absurd's storage strategy — a mutable +`workflow_status` row that gets `UPDATE`d on every step — because *that* is what +bloats, and it is exactly the pattern PgQue exists to avoid. The right strategy +is to model each workflow as a **stream of state-transition events**: process a +step, then **enqueue the next state as a new message** rather than mutating a +row. The workflow is always either (a) one in-flight message, (b) a *scheduled* +message awaiting a wake time, or (c) terminal. It never holds a batch open +across a wait, so it never blocks rotation, and every state transition is an +**append**, not an `UPDATE` — so the zero-bloat property carries straight +through to the workflow layer. + +**What this means for strategy:** the well-leveraged bet is no longer "stay out +of the workflow category." It is to build an **event-sourced durable-execution +layer that is rotation-native**, shipped as an optional, experimental +`pgque-api` layer that reduces to PgQ primitives plus a small bounded +current-state projection. This stays inside PgQue's identity ("the zero-bloat +Postgres queue") and turns the engine into a genuine competitive moat for +high-throughput, fan-out-heavy, short-step durable workflows — precisely the +AI-agent-loop and event-processing workloads the whole category is currently +chasing. + +**Honest scoping:** the part to defer is *not* the engine — it is the +"write-it-as-ordinary-linear-code" developer experience (Temporal/DBOS magic +checkpointing). PgQue's natural programming model is a message-driven state +machine (closer to AWS Step Functions / actors). That is a DX difference, not a +capability gap, and it is recoverable later with an SDK that compiles linear +code into re-enqueued continuations. + +Net: **moderate-to-high feasibility**, with a real and defensible +differentiator — conditional on solving one genuinely hard piece +(`awaitEvent`/join semantics) and accepting a state-machine programming model +first, linear-code DX later. --- @@ -61,241 +90,281 @@ Signals worth internalizing: - **The market keeps asking "why not just Postgres?"** Restate (own RocksDB store, BSL) took repeated HN criticism on exactly this. Rivet hedged back toward Postgres as a self-host backend. That recurring question *is* PgQue's - thesis — but absurd already answered it first in the workflow space. + thesis. - **Licensing trust is a real axis.** Restate's BSL drew loud "open source is misleading" criticism; Rivet's Apache-2.0 drew none. PgQue (Apache-2.0, literally "your own Postgres") inherits maximum trust by default. - **The wedge against Temporal is operational, not technical.** Every competitor's pitch is the same: *don't run a second distributed system; reuse - the database you already operate.* The complaints about Temporal are + the database you already operate.* The complaints about Temporal are the determinism learning curve, immutable shard-count decisions, and the Cassandra/Elasticsearch operational floor — not correctness. -- **AI agents are the current demand driver.** DBOS, absurd, Restate, and - Temporal are all repositioning durable execution as the substrate for - long-running LLM agent loops. absurd's canonical example is an agent loop; - Temporal's Series C narrative is "durable AI workloads." +- **AI agents are the current demand driver**, and they are + **high-volume, short-step, fan-out-heavy** — the exact workload shape where + append+rotate beats update+vacuum (see §5). -So the category is real and PgQue's "Postgres-native, OSI-licensed, no new -infra" framing is genuinely well-aligned with where the market is pulling. The -problem is not the thesis — it is **who already occupies the exact niche**. +So the category is real, PgQue's "Postgres-native, OSI-licensed, no new infra" +framing is well-aligned with where the market is pulling, *and* — once the +event-sourced architecture is adopted — PgQue's engine is a substrate advantage +rather than the liability the first draft assumed. --- -## 3. The technical crux: a concurrency-model mismatch - -This is the single most important finding of the study. - -Every durable-execution engine examined (DBOS, absurd, Silo, and Temporal's -matching service) claims work units with **`SELECT … FOR UPDATE SKIP LOCKED` + -a time-limited lease**: one run is claimed exclusively by one worker, the lease -auto-extends while the worker checkpoints, and if the worker dies the lease -expires and another worker steals the run. Sleep/await is modelled as -`available_at` re-scheduling on a per-run row. - -**PgQue does not work this way and must not be made to.** PgQ's engine is -*snapshot + TRUNCATE batch rotation*: consumers read all events committed -between two ticks using `pg_snapshot` visibility, and old event tables are -rotated and TRUNCATEd wholesale. There is: - -- no per-item exclusive claim (consumption is cooperative, batch-oriented, - at-least-once); -- no lease / steal-on-crash for an individual message; -- no per-row `available_at` rescheduling; -- and — critically — **a hard conflict with long-lived state**: a workflow that - `sleep`s for a week or `awaitEvent`s indefinitely must persist its run and - checkpoints far longer than any rotating event window survives. An open batch - already blocks rotation; a long-running workflow on the rotation tables would - be pathological. - -`CLAUDE.md` Rule #2 ("the PgQ engine is sacred") forecloses retrofitting -rotation to behave like a claim/lease queue. So a durable layer on PgQue would -require a **new SKIP-LOCKED claim/lease table living *beside* the PgQ engine**, -not on top of it. That means **PgQue would carry two concurrency models at -once** — the rotation engine for streaming/CDC/fan-out, and a claim/lease engine -for durable runs. - -This is the crux of the whole decision: - -> PgQue's marketing identity is "the one Postgres queue that never bloats -> because it uses rotation instead of SKIP LOCKED + DELETE." A durable-execution -> layer is, by necessity, **a SKIP-LOCKED + UPDATE engine** — exactly the -> mechanism PgQue defines itself against. The zero-bloat differentiator does not -> apply to the workflow tables; they will accumulate dead tuples and need VACUUM -> like everyone else's (mitigated by partition-detach, as absurd does). - -The latency objection, by contrast, **is no longer real**: PgQue 0.2.0 ticks at -a 100 ms cadence (`ticker_loop()` runs `pgque.ticker()` every -`tick_period_ms`, default 100 ms, committing between iterations), so end-to-end -delivery is sub-second — squarely in the same range as absurd's and DBOS's -worker polling loops. Latency is not what would hold a durable layer back. +## 3. The architecture: durable execution as event sourcing + +### 3.1 The core pattern — continuation-passing over the log + +A workflow is a state machine. Each step is a short, independently-triggered +handler that does its work and **enqueues its successor state as a new event**: + +``` +msg {wf: 42, state: charge} → charge_card(); enqueue {wf:42, state: ship}; ack +msg {wf: 42, state: ship} → create_shipment(); enqueue {wf:42, state: notify}; ack +msg {wf: 42, state: notify} → notify(); ack -- terminal +``` + +There is no long-running function held in a worker process, so there is nothing +to "replay" (contrast §3.4). State lives in the event chain (and, for large +state, in a side row keyed by workflow id; for small state, in the payload +itself — continuation-passing). Every transition is an **append**. No +`UPDATE`, no per-step dead tuple, no VACUUM dependence on the hot path. + +### 3.2 Exactly-once handoff between steps (PgQue is *stronger* here) + +`pgque.insert_event()` (enqueue) and `pgque.finish_batch()` (ack) both run in +the consumer's own transaction, so a step's effect, its successor enqueue, and +the batch ack are **one atomic commit**: + +```sql +begin; + -- step's own DB side effects (idempotent or in-txn) + perform pgque.insert_event(queue, next_state); -- enqueue successor + perform pgque.ack(batch_id); -- finish_batch +commit; +``` + +Commit → successor durably enqueued *and* batch finished atomically. Crash +before commit → txn aborts, no successor exists, the step redelivers cleanly via +PgQ's normal at-least-once redelivery. This is **exactly-once handoff** — the +capability DBOS markets as "piggyback the checkpoint in the transaction," except +here it is literally just SQL in the caller's transaction, and PgQue can +*demonstrate* it where DBOS documents it only indirectly. + +### 3.3 The five durable-execution requirements, on this model + +1. **Exclusive ownership — structural, not lease-based.** One logical consumer + + cooperative subconsumers (`COOPERATIVE_CONSUMERS.md`, shipping experimental + in 0.2). Invariant: **one live message per workflow** (each step enqueues + exactly one successor). Each message goes to exactly one subconsumer, so only + one worker touches a given workflow at any instant. absurd/DBOS need + claim-with-lease + steal-on-crash; PgQue gets exclusivity for free from the + single-live-continuation invariant, with the cooperative `dead_interval` + takeover already designed for the worker-died-mid-batch case. + +2. **Mutable run state — re-enqueue, don't update.** A transition appends a new + event carrying the new state. For small state it rides in the payload and + there is no long-lived table at all. + +3. **Long-lived persistence — PR #237 is the foundation.** A step that sleeps a + week acks immediately and enqueues a *scheduled* continuation: + `sleep("7 days")` = `send_at(continuation, now()+7d)`. PR #237 makes + `send_at` itself **TRUNCATE-rotated and zero-bloat**. A long sleep costs one + row in a rotating delayed table — never an open batch, never a vacuum + problem. (This dissolves the old "open batch blocks rotation" objection: the + workflow does not hold a batch across the wait.) + +4. **Per-row scheduling — half solved, half genuinely hard.** Timers/sleep: + solved by PR #237's rotating `send_at`. Waking on an **external event** + (`awaitEvent`) with a timeout is the real new design work — a small "waiting" + registry keyed by `(workflow_id, event_name)`, an `emit` path that injects + the continuation, and a maint sweep for timeouts. Low-volume (bounded by + in-flight waiters, not throughput), tractable, but it must be designed + carefully (first-write-wins event caching to avoid emit/await races — absurd's + `e_`/`w_` table pair is a good reference shape). + +5. **Checkpoint replay — not needed (see §3.4).** + +### 3.4 Why "checkpoint replay" is unnecessary here + +In Temporal/DBOS/absurd a workflow is one linear function run in one process; to +survive a crash, each step's result is saved, and on restart the function is +**re-run from the top** with completed steps short-circuited from the saved log +("replay"). That requires a long-lived run owned by a worker for its whole life. + +The continuation-passing model **eliminates the concept**: there is no +long-running function to resume, so nothing to replay. Recovery is just +redelivery of the single in-flight step, and correctness comes from the +exactly-once handoff in §3.2 plus per-step idempotency keyed by +`(workflow_id, step_seq)` (a unique index that prevents double-advance). This is +strictly simpler than replay and native to a queue. + +### 3.5 The one piece of mutable state — bounded by concurrency, not throughput + +For observability, addressing, cancellation, and joins, keep a **current-state +projection**: one row per *live* workflow, replaced as it advances, deleted on +completion. This is the only mutable table, and it is bounded by **in-flight +concurrency, not total throughput** — a million finished runs leave zero rows. +VACUUM load scales with concurrency (fine), not with step volume. + +**This split is the whole trick:** hot, high-churn step transitions → rotating +append-only log (zero bloat); cold, low-volume current-state index → tiny +mutable table (negligible bloat). DBOS/absurd put the high-churn part on the +mutable table and inherit the bloat wall; PgQue keeps the high-churn part on the +log. --- ## 4. What is reusable vs. net-new -**Reusable / aligned (low cost):** - -- **Single-file, anti-extension, managed-PG install.** absurd's `absurd.sql` - philosophy is identical to PgQue's `\i pgque.sql`. Validated approach. -- **pg_cron for maintenance.** absurd uses pg_cron for partition - provisioning/cleanup/detach; PgQue already uses pg_cron for the ticker and - `maint()`. Same muscle. -- **The data model** (tasks / runs / steps / checkpoints / events / waits) is - clean and orthogonal to how raw events are queued underneath. PgQue could - adopt absurd's schema shape almost verbatim as a reference. -- **The no-determinism, checkpoint-replay, task-level-retry design.** This is - the most reusable idea and the one that *does* reduce cleanly to primitives - (Rule #3): a `pgque.step()` checkpoint table + replay-on-retry needs no change - to the PgQ engine. It also sidesteps Temporal's biggest adoption tax (the - determinism paradigm and "deploy correct code, break all running workflows"). -- **SQL-native observability comes free.** Workflows-as-rows means `psql` - inspection, `list_workflows`-style queries, and forking — all trivial in a - system that *is* SQL. DBOS markets this heavily; PgQue gets it for free. - -**Net-new (the real cost):** - -1. **The deterministic/checkpoint replay *runtime*, per language.** This is - SDK-side logic — step memoization, resume-from-checkpoint, crash recovery - coordination across executors. The DBOS study's headline finding: - **~80% of DBOS's engineering lives in the multi-language SDK + replay runtime, - not in SQL.** PgQ's PL/pgSQL gives you the substrate, not this. -2. **Claim / lease / watchdog machinery.** Claim expiry, lease extension on - checkpoint write, watchdog termination of broken workers — net-new, and - exactly the part absurd had to harden over five months in production. -3. **Long-lived-state lifecycle.** Non-rotated run/checkpoint tables with TTL / - partition-detach cleanup, decoupled from event rotation. -4. **Event caching / first-write-wins** races (await/emit), needing a - uniquely-indexed events table carefully ordered against wait registrations. -5. **N synchronized SDKs.** DBOS maintains four (Python, TS mature; Go, Java - still 0.x). Keeping each in lockstep with engine semantics is the dominant - ongoing cost — well beyond the "~1,500 lines" budgeted for pgque-api in - SPECx. Durable workflows are a different order of magnitude. +**Reusable / already in flight (low cost):** + +- **Single-file, anti-extension, managed-PG install** — identical to absurd's + validated `absurd.sql` philosophy and PgQue's `\i pgque.sql`. +- **Cooperative consumers** (0.2) — gives parallel execution + structural + per-workflow exclusivity. +- **Rotating `send_at`** (PR #237) — gives zero-bloat timers/sleep. +- **Transactional `insert_event` + `finish_batch`** — gives exactly-once handoff + with no new primitive. +- **`jsontriga`** — CDC-triggered workflow starts, native to the engine. +- **SQL-native observability** — workflows-as-rows/events are `psql`-inspectable. + +**Net-new (the real cost, in order of difficulty):** + +1. **`awaitEvent` / wait registry + emit path** (§3.3.4) — the genuinely hard + design: race-free event caching, timeout sweep, join/fan-in semantics. +2. **Fan-out / join primitive** — a step that spawns N children (distinct child + workflow ids, each independently single-live) and a parent that awaits all N + (a counter in the projection, or children emit completion events the parent + awaits). +3. **Current-state projection + step idempotency index** (§3.5) — small, but + needs careful transition logic so advance is exactly-once. +4. **A reference SDK** — *one* language first (Python for AI-agent gravity, or + TypeScript for absurd-parity), exposing the state-machine API. Linear-code + DX (compiling an `async` function with `await` points into re-enqueued + continuations) is a *later* library project, not an engine requirement. + +Critically, **none of these touch the PgQ engine** (Rule #2) and all reduce to +PgQ primitives + a couple of small side tables (Rule #3). The expensive +multi-language deterministic-replay runtime that dominates DBOS's effort +**does not exist in this model** — that cost simply isn't incurred. --- -## 5. Adoption analysis — can PgQue win here? - -### Where PgQue *could* win - -- **"It's literally just your Postgres."** No new datastore (vs Restate's - RocksDB, Silo's SlateDB, Temporal's Cassandra), Apache-2.0 (vs Restate's BSL), - managed-PG compatible. The strongest anti-lock-in story in the field. -- **True transactional exactly-once for in-DB effects.** Because PgQue *is* - SQL, enqueue is just an `INSERT` in the caller's transaction — so "commit your - business write and the workflow enqueue atomically" is a first-class, - demonstrable feature. **DBOS markets a version of this but the study could not - find it explicitly documented** — there is an opening to own it cleanly. -- **Proven-engine credibility.** Silo openly calls itself a prototype; absurd is - "an experiment in durability"; Restate has no rigorous benchmarks. PgQue's - PgQ lineage (15+ years, Skype/Microsoft scale) is the inverse story. -- **Concurrency/rate-limiting as native primitives.** Silo's standout feature - (per-key concurrency queues, floating limits) is something Postgres does - atomically and well — a credible differentiator if pursued. - -### Where PgQue would *struggle* - -- **absurd already owns the exact niche.** "Single SQL file, Postgres-only, - self-hostable, thin SDK, no determinism" is *precisely* the slot a PgQue - durable layer would target — and absurd has ~1.95k stars, Apache-2.0, - production hardening, a Rust port (TensorZero), and **Armin Ronacher's - distribution**. PgQue would arrive second with a near-identical pitch. -- **The differentiator doesn't transfer.** Zero-bloat rotation — the entire - reason to choose PgQue over pgmq — is irrelevant to the claim/lease workflow - tables. On the workflow layer PgQue is just another SKIP-LOCKED engine. -- **Two concurrency models = a muddier story.** "We never use SKIP LOCKED" and - "our workflow engine uses SKIP LOCKED" coexisting invites the exact - "isn't this just a complicated queue with state?" critique DBOS already takes - on HN. -- **SDK breadth is years of work.** Competing on the workflow programming model - means carrying replay runtimes in multiple languages — the opposite of - PgQue's "language-agnostic, the SQL API *is* the product" advantage. -- **Single-Postgres ceiling.** DBOS guides ~a few thousand state - transitions/sec on one Postgres; honest, but it concedes hyperscale to - Temporal. Fine for a queue; a constraint to state loudly for workflows. - -### Honest read on success probability - -A **standalone "PgQue Workflows" engine** competing with DBOS/absurd head-on: -**low chance of breakout adoption.** Late entry, transferable differentiator -absent, high and ongoing SDK cost, against a founder-distribution incumbent -(absurd) and a credential incumbent (DBOS). - -A **thin "durable steps + transactional enqueue" extension** of the existing -queue: **moderate-to-good chance of being genuinely useful and adopted by -PgQue's own users** — because it deepens the queue they already chose rather -than asking them to evaluate PgQue as a workflow platform. It rides existing -adoption instead of opening a new, contested front. +## 5. Adoption analysis — where PgQue wins, and the tradeoffs + +### Where PgQue wins *because of* the model + +1. **Zero-bloat at high step-throughput — the differentiator now transfers.** + DBOS/absurd do `UPDATE workflow_status` + `INSERT operation_outputs` per + step: mutable-row churn → the exact bloat wall PgQue exists to defeat. In the + event-sourced model every transition is an append to the rotating log; a + million agent iterations leave zero dead tuples on the hot path. For the + AI-agent-loop-at-scale workload everyone is chasing, **append+rotate + structurally beats update+vacuum.** This is the headline. +2. **Native fan-out + batch step execution.** PgQ hands a *batch* of many + workflows' step-events at once, snapshot-isolated — advance thousands of + workflows in one transaction. DBOS is 1-write-per-step over per-row + `SKIP LOCKED`. PgQue amortizes where they pay per item. +3. **Transactional exactly-once handoff** (§3.2) — stronger than at-least-once + competitors, and just SQL. +4. **"It's literally just your Postgres"** — no new datastore (vs Restate's + RocksDB, Silo's SlateDB, Temporal's Cassandra), Apache-2.0 (vs Restate's + BSL), managed-PG compatible. Strongest anti-lock-in story in the field. +5. **Proven-engine credibility** — PgQ's 15+ years vs absurd's "an experiment in + durability" and Silo's self-described prototype. + +### The honest tradeoffs + +- **State-machine programming model, not magic linear code.** Workflows are + expressed as message-driven steps (Step-Functions/actor style), not Temporal's + "write normal code, we checkpoint it invisibly." A *DX* difference, recoverable + later via a continuation-compiling SDK. +- **`awaitEvent`/join is real new design** (§4.1–4.2), the main engineering risk. +- **Single-Postgres ceiling** — honest "up to a few thousand workflow + transitions/sec per database" framing, as DBOS does; concede hyperscale to + Temporal. +- **absurd already has distribution** in the "Postgres-only durable workflows" + framing. PgQue's counter is not "also Postgres" but "**zero-bloat at + throughput absurd's mutable-row design can't sustain**" — a concrete, + benchmarkable claim, not a me-too. + +### Success probability (revised) + +- A **rotation-native, event-sourced durable-execution layer** marketed on + *zero-bloat high-throughput durable workflows*: **moderate-to-high** — it is + differentiated, reduces to existing primitives, and rides existing adoption. +- A **DBOS/absurd clone** (mutable status row + multi-language replay runtime): + **low** — late, undifferentiated, and it would forfeit the one advantage. + +The strategic point: don't enter the category the way the incumbents built it. +Enter it the way only PgQue *can* build it. --- ## 6. Strategic options -**Tier 0 — Stay the course (no change).** Keep SPECx §2.3's stance: PgQue is a -queue; if you need step-by-step workflows, use Temporal/Restate/absurd. Lowest -risk. Forgoes the category's momentum. - -**Tier 1 — Own "transactional durable enqueue" (recommended, low risk).** -No workflow engine. Lean into what PgQue already does better than DBOS can -document: enqueue inside the caller's transaction, exactly-once consumption -patterns, idempotency helpers, and the supporting docs/examples. Pure -extension of the queue identity. Mostly documentation + small helpers + -TDD-tested examples. Reduces trivially to PgQ primitives. - -**Tier 2 — Experimental absurd-style "durable steps" (optional, medium risk).** -Add a `sql/experimental/durable.sql` layer: -- a **new claim/lease run table** beside the PgQ engine (not on rotation), -- `tasks / runs / steps / checkpoints / events / waits` modelled on absurd, -- checkpoint-replay with **task-level retry, no determinism requirement**, -- pg_cron partition-detach cleanup for long-lived state, -- **exactly one** reference SDK (Python *or* TypeScript — whichever the user - base skews to), explicitly experimental. -Gated behind the `blueprints/PHASES.md` promotion rule. Honest "experimental, -single-Postgres, up to a few thousand runs/sec" labelling. This is the only -tier that touches the workflow space, and it does so additively and reversibly. - -**Tier 3 — Full multi-language deterministic-replay workflow platform -(not recommended).** Competing with Temporal/DBOS on their terms. Highest cost, -weakest differentiation, contradicts PgQue's language-agnostic identity. +**Tier 0 — Stay a pure queue.** Lowest risk; forgoes a genuine, defensible +differentiator that the engine uniquely enables. + +**Tier 1 — Own "transactional durable enqueue" now (low risk).** Document and +helper-ize the exactly-once handoff (§3.2) and idempotent-step patterns. Pure +extension of the queue identity; mostly docs + small helpers + TDD examples. +Also the foundation the durable layer builds on. + +**Tier 2 — Event-sourced durable steps (recommended, medium risk).** A +`sql/experimental/durable.sql` layer: +- continuation-passing steps over the rotating log (no mutable status row on the + hot path), +- transactional handoff (`insert_event` + `finish_batch` in one txn), +- rotating `send_at` (PR #237) for sleep/timers, +- a bounded current-state projection + `(workflow_id, step_seq)` idempotency + index, +- the `awaitEvent`/emit registry + fan-out/join primitive (the hard part — + design and TDD this first), +- exactly **one** reference SDK, explicitly experimental, gated behind the + `PHASES.md` promotion rule. +Marketed on the zero-bloat-at-throughput advantage, not "also Postgres." + +**Tier 3 — Multi-language deterministic-replay platform (not recommended).** +Competing with Temporal/DBOS on their terms and their costs, forfeiting the +model's advantage. --- ## 7. Recommendation -1. **Adopt Tier 1 now.** It is low-cost, on-identity, and claims a feature DBOS - leaves under-specified. It also makes the eventual Tier 2 story coherent. -2. **Prototype Tier 2 as an explicit experiment**, behind `sql/experimental/`, - only if there is real pull from PgQue users (especially AI-agent use cases). - Treat absurd's schema and the no-determinism checkpoint-replay model as the - reference design. Build the claim/lease table as a *separate* mechanism and - document plainly that this layer is a SKIP-LOCKED engine — do not pretend it - inherits the zero-bloat property. -3. **Do not pursue Tier 3.** Concede hyperscale and full polyglot workflow - orchestration to Temporal/DBOS; that honesty is itself credibility. -4. **Keep the queue the headline.** PgQue's winning, defensible story remains - "the zero-bloat, managed-PG-compatible, language-agnostic Postgres queue." - Durable steps are a *feature of that queue*, never a repositioning of it. +1. **Adopt Tier 1 now** — low-cost, on-identity, and the foundation for Tier 2. +2. **Prototype Tier 2 as an explicit experiment**, leading with the + `awaitEvent`/join design (the one place the model has real new risk) and a + throughput+bloat benchmark vs a mutable-status-row baseline (absurd/DBOS + shape) — that benchmark *is* the marketing. +3. **Do not pursue Tier 3.** +4. **Keep the queue the headline; durable steps are a feature of the engine's + zero-bloat append-and-rotate design**, which is exactly why PgQue can offer + them where SKIP-LOCKED systems hit a wall. --- ## 8. Open questions for the maintainers -- Is there demonstrated demand from PgQue's users for durable steps, or is this - driven by category FOMO? (Tier 2 should wait for the former.) -- If Tier 2 proceeds, which single SDK first — Python (AI-agent gravity) or - TypeScript (absurd's primary)? -- Are we comfortable carrying two concurrency models in one project, and - messaging that without diluting the zero-bloat story? -- Does the transactional-enqueue feature (Tier 1) warrant a dedicated - benchmark + example against DBOS's `DBOSClient` enqueue path? +- Is there demonstrated user pull for durable steps (especially AI-agent + use cases), or is this category FOMO? Tier 2 should follow real pull. +- Reference SDK first: Python (AI-agent gravity) or TypeScript (absurd-parity)? +- `awaitEvent` semantics: race-free emit/await caching, timeout handling, and + fan-in/join — design and TDD before any code. +- Does the throughput+bloat benchmark vs a mutable-status-row baseline hold up + on server hardware? (If yes, it is the whole pitch.) +- Are we comfortable shipping a state-machine programming model first, with + linear-code DX as a later SDK project? --- ## Appendix — per-system one-liners - **DBOS** — embedded library, Postgres `dbos` system schema - (`workflow_status`, `operation_outputs`, …), 1 write/step + 2/workflow, - SKIP-LOCKED queue dequeue, ~40k workflows/sec on one Postgres, replay runtime - is ~80% of the work, Conductor (ops) is the proprietary money-maker. + (`workflow_status`, `operation_outputs`, …), **mutable status row updated per + step**, SKIP-LOCKED queue dequeue, ~40k workflows/sec on one Postgres, replay + runtime is ~80% of the work, Conductor (ops) is the proprietary money-maker. - **absurd** — single `absurd.sql`, per-queue `t_/r_/c_/e_/w_/i_` tables, SKIP-LOCKED claim-with-lease, task-level retry, no determinism, pg_cron partition detach, thin TS/Python SDKs, Rust port (TensorZero). @@ -310,4 +379,3 @@ weakest differentiation, contradicts PgQue's language-agnostic identity. queue* (not workflow engine), single-shard-per-tenant (~4k jobs/sec cap), first-class concurrency + rate limiting, self-described prototype. - From 84aefdf572d06619cc8c130fb3b4f1e289a52320 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 30 May 2026 09:43:09 +0000 Subject: [PATCH 3/8] spec(workflows): add durable-execution spec v0.1 + HTML brief Event-sourced durable-execution layer authored with samospec (all-Claude panel). Ships SPEC.md, self-contained HTML brief (BRIEF.html/index.html), and auxiliary artifacts under blueprints/workflows/. .nojekyll added for GitHub Pages. --- .nojekyll | 0 blueprints/workflows/BRIEF.html | 501 +++++++++++++++++++++++++ blueprints/workflows/README.md | 18 + blueprints/workflows/SPEC.md | 305 +++++++++++++++ blueprints/workflows/TLDR.md | 23 ++ blueprints/workflows/architecture.json | 5 + blueprints/workflows/changelog.md | 6 + blueprints/workflows/decisions.md | 3 + blueprints/workflows/index.html | 501 +++++++++++++++++++++++++ 9 files changed, 1362 insertions(+) create mode 100644 .nojekyll create mode 100644 blueprints/workflows/BRIEF.html create mode 100644 blueprints/workflows/README.md create mode 100644 blueprints/workflows/SPEC.md create mode 100644 blueprints/workflows/TLDR.md create mode 100644 blueprints/workflows/architecture.json create mode 100644 blueprints/workflows/changelog.md create mode 100644 blueprints/workflows/decisions.md create mode 100644 blueprints/workflows/index.html diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/blueprints/workflows/BRIEF.html b/blueprints/workflows/BRIEF.html new file mode 100644 index 00000000..27bd9e18 --- /dev/null +++ b/blueprints/workflows/BRIEF.html @@ -0,0 +1,501 @@ + + + + + +Brief — PgQue Durable Workflows — SPEC v0.1 (v0.1) + + + + +
+
+
+workflows +vv0.1 +2026-05-30 +
+
+brief + + + + + +
+
+
+
+
+

Brief — derivative summary

+

PgQue Durable Workflows — SPEC v0.1

+

+workflows · + Version v0.1 · + Published · + canonical SPEC.md → +

+

+Summary, not the spec. Skim this for shape, architecture, scope, risks, decisions, and open questions in 5–10 minutes; consult SPEC.md for the full text. +

+
+ +
+

01Goal

+

> Status: experimental, ships as optional sql/experimental/durable.sql gated by the project promotion rule. One reference SDK. Engine layer is sacred and untouched.

+
+
+

021. Goal & why it's needed

+

Goal. Provide a durable-execution / durable-workflow layer for PgQue that models each workflow as an append-only stream of state-transition events running over PgQ's existing snapshot + TRUNCATE rotation engine, so durable workflows inherit PgQ's zero-bloat property instead of fighting it.

+

Why this exists. Every Postgres-native durable-execution engine in the category (DBOS, absurd, and the long tail of SELECT … FOR UPDATE SKIP LOCKED + DELETE queues) shares one structural liability: they model a workflow as a mutable workflow_status row that is UPDATEd on every step. At the throughput the category is actually chasing — AI agent loops doing millions of cheap iterations — that per-step UPDATE churns dead tuples until the workload hits a VACUUM wall, and throughput degrades. PgQ already solved exactly this problem for queues with snapshot-batch isolation + wholesale TRUNCATE rotation: zero dead-tuple bloat under sustained load. The insight this spec operationalizes is that durable execution is event sourcing, PgQ is already an append-only event log, and therefore a workflow can be modeled as a stream of appended transitions (continuation-passing) rather than a mutated row. The zero-bloat property then carries through, for free, from the queue layer to the workflow layer.

+

This exists because no one else can credibly claim "durable workflows with a flat dead-tuple curve at agent-loop throughput, on just your managed Postgres, no separate datastore." That is the entire pitch, and it is only reachable by building on top of the rotation engine rather than re-introducing a mutable-status model beside it.

+
+
+

032. Scope & resolved interview decisions

+

The interview answers were all delegated to the lead ("decide for me"). Resolved:

+

| Question | Decision (v0.1) | |---|---| | Primary users | Backend engineers running long-lived or high-iteration orchestration (AI agent loops, multi-step business processes, fan-out jobs) on managed Postgres who refuse a second datastore and refuse a VACUUM wall. | | Core job | Advance a workflow from one step to the next with exactly-once handoff and at-least-once step execution, never losing or silently duplicating a workflow's progress — on a hot path that appends and rotates rather than updates. | | Durability / recovery guarantee | At-least-once step execution + exactly-once handoff between steps; per-step idempotency keyed on (workflow_id, step_seq). On crash, exactly the single in-flight step redelivers (PgQ's existing redelivery); there is no long function to replay. | | Success metric | A throughput-and-bloat benchmark vs a mutable-status-row baseline (DBOS/absurd shape) on server hardware: flat dead-tuple count + sustained throughput on the append+rotate hot path where the baseline degrades. | | Out of scope for v0.1 | Cancellation / orphan-join propagation; linear-code (async/await-compiled) DX; N synchronized SDKs; hyperscale (>~ a few thousand transitions/sec/db); deterministic replay. |

+

---

+
+
+

043. User stories

+

Each story is persona + action + outcome and is directly exercised as a manual acceptance test (§6.4).

+

---

+
    +
  • Agent-loop builder (zero-bloat at iteration scale). As a backend engineer running an AI agent that loops thousands of times per run, I define each iteration as a step that processes and enqueues its successor, so that a million iterations complete with a flat dead-tuple count on the hot tables and no VACUUM-driven throughput cliff — verifiable with pg_stat_user_tables.n_dead_tup staying flat through the run.
  • +
+
+
+

054. Architecture

+

<!-- architecture:begin -->

+

<!-- architecture:end -->

+

The durable layer only calls the five PgQ primitives + send_at. It adds no modification to rotation/tick/batch logic and introduces no second concurrency model.

+
    +
  • Workflow — a logical state machine identified by workflow_id. At any instant it is in exactly one of three conditions: (a) one in-flight message (a step-event sitting in a PgQ batch being processed), (b) scheduled (a send_at continuation awaiting a wake time, or a registered wait awaiting an event), or (c) terminal. The single-live-continuation invariant — each processed step enqueues exactly one successor — is what makes exclusivity structural rather than lease-based.
  • +
  • Step-event — the message on the PgQ queue. Payload carries: workflow_id, step_seq (monotonic progress anchor), step_name/state tag, and small continuation state (continuation-passing). Large state is the user's responsibility to hold in their own tables, addressed by workflow_id.
  • +
  • Transition — process a step → emit successor as a new append. Never an UPDATE of a status row.
  • +
  • Coordination side tables (the only mutable state; see §5.5) — wf_wait, wf_event_cache, wf_join, wf_join_done, wf_dedup, wf_live (current-state projection). Their churn is bounded by concurrency and coordination-point count, not total step volume (the honest correction, §5.6).
  • +
+
(architecture not yet specified)
+
┌──────────────────────────────────────────────────────────┐
+│  Reference SDK (Python, v0.1)                            │  ← thin client
+│  defineWorkflow / step / sleep / awaitEvent / emit /     │
+│  spawn / awaitAll  →  plain SQL function calls           │
+├──────────────────────────────────────────────────────────┤
+│  Durable layer  (sql/experimental/durable.sql)          │  ← THIS SPEC
+│  • dispatch loop over PgQ batches                        │
+│  • coordination side tables (waits, joins, dedup, proj.) │
+│  • SECURITY DEFINER fns, search_path=pgque,pg_catalog    │
+├──────────────────────────────────────────────────────────┤
+│  PgQ engine  (SACRED — UNMODIFIED)                       │  ← do not touch
+│  insert_event · next_batch · get_batch_events ·          │
+│  finish_batch · event_retry · send_at (PR #237) ·        │
+│  ticks · snapshot rotation · TRUNCATE · cooperative      │
+│  consumers (dead_interval takeover)                      │
+└──────────────────────────────────────────────────────────┘
+
+Subsections (3) +
  1. 4.1 Layering (the sacred boundary)
  2. +
  3. 4.2 Key abstractions
  4. +
  5. 4.3 Concurrency / ownership model
+
+
+
+

065. Implementation details

+

The foundational guarantee. insert_event() (enqueue successor) and finish_batch() (ack) run in the consumer's own transaction, so a step's side effects, its successor enqueue, and its batch ack are one atomic commit:

+

No subtransactions are used on this path (hard constraint).

+

A batch may contain step-events for thousands of distinct workflows; they advance in one transaction (native fan-out + batch step execution). A step that fails transiently calls pgque.event_retry() for that single event rather than aborting the whole batch where the engine allows per-event retry; a poisoned step lands in PgQ's existing DLQ after max retries.

+
    +
  • Commit ⇒ successor durably enqueued AND batch finished, atomically ⇒ exactly-once handoff.
  • +
  • Crash before commit ⇒ txn aborts ⇒ no successor, no dedup marker, batch not finished ⇒ the step redelivers cleanly.
  • +
+
begin;
+  -- 1. step's own DB side effects (idempotent or naturally in-txn)
+  -- 2. record per-step dedup marker (workflow_id, step_seq)   [if first delivery]
+  perform pgque.insert_event(queue, next_state);   -- enqueue exactly one successor
+  perform pgque.finish_batch(batch_id);            -- ack this step
+commit;
+
loop:
+  batch_id := pgque.next_batch(queue, consumer)        -- snapshot-bounded
+  if batch_id is null: sleep to next tick; continue
+  events  := pgque.get_batch_events(batch_id)          -- many workflows at once
+  begin
+    for each event in events:                          -- batch step execution
+        advance_one(event)                             -- §5.3, appends successor(s)
+    pgque.finish_batch(batch_id)
+  commit
+
+Subsections (8) +
  1. 5.1 The hot path: one transition = append + ack, atomically
  2. +
  3. 5.2 Dispatch loop
  4. +
  5. 5.3 The five durable-execution requirements, mapped
  6. +
  7. 5.4 Per-step idempotency
  8. +
  9. 5.5 Coordination side tables
  10. +
  11. 5.6 The honest zero-bloat claim (stated precisely, never overstated)
  12. +
  13. 5.7 `awaitEvent` / `emit` — the ~20% with real risk (designed and TDD'd first)
  14. +
  15. 5.8 fan-out / join (spawn + `awaitAll`)
+
+
+
+

076. Tests plan

+

Red/green TDD for ALL new code. Every function below is written test-first: a failing test asserting the behavior, then the implementation that makes it pass. CI rejects any new SQL function or SDK method without a preceding failing-then-passing test in the same change.

+

These are the pieces explicitly required to be designed and TDD'd first, before any happy-path step plumbing:

+

Each of the five user stories has a runnable scenario script the reviewer executes by hand against a managed-PG-like instance.

+
    +
  • Exactly-once handoff (§5.1) — red test: kill the txn between insert_event and commit, assert no successor + clean redelivery; assert no double-handoff on commit.
  • +
  • Per-step idempotency (§5.4) — red test: deliver the same (workflow_id, step_seq) twice, assert exactly one successor + one side effect.
  • +
  • awaitEvent / emit race matrix (§5.7) — one red test per row of the race table (emit-before-await, interleave, double-resume, stale cache, await redelivery, timeout). Single-resume token proven by concurrent emit+sweep test.
  • +
  • fan-out / join (§5.8) — red tests: race-free join-total recording; idempotent completed-set under duplicated completion; exactly-once parent resume; per-child failure surfaced in the result array.
  • +
+
+Subsections (5) +
  1. 6.1 Hard repo rule
  2. +
  3. 6.2 Built test-first, in this order (highest risk first)
  4. +
  5. 6.3 CI test suites
  6. +
  7. 6.4 Manual acceptance (maps 1:1 to §3 user stories)
  8. +
  9. 6.5 Success-criterion benchmark (the entire pitch)
+
+
+
+

087. Team (veteran experts to hire)

+

Veteran "Durable Workflow Engineer" (accepted).

+

---

+
    +
  • Veteran PostgreSQL internals / MVCC engineer (1) — owns the snapshot/visibility reasoning, xid8/pg_snapshot, rotation interaction, no-subtransaction guarantee.
  • +
  • Veteran durable-execution / distributed-systems engineer (1) — owns the await/emit and fan-out/join race designs and the single-resume-token correctness proofs.
  • +
  • Veteran PL/pgSQL + SQL test engineer (pgTAP) (1) — owns the red/green TDD harness, concurrency/property tests, crash-recovery injection.
  • +
  • Veteran SDK / developer-experience engineer (Python) (1) — owns the one reference SDK and the thin-client surface.
  • +
  • Veteran performance / benchmarking engineer (1) — owns the throughput-and-bloat benchmark harness and the published curves.
  • +
  • Veteran technical writer / DX reviewer (0.5, shared) — owns the experimental-feature docs and the honest-claim framing.
  • +
+
+Subsections (1) +
  1. 7.1 Persona for this spec round
+
+
+
+

098. Implementation plan (sprints, parallelization, ordering)

+

Sprint 0 — Foundations & harness (1 wk).

+

Sprint 1 — Exactly-once core (1.5 wk). (highest risk first, per constraint)

+

Sprint 2 — Coordination primitives (2 wk). Two independent tracks in parallel:

+
    +
  • Test engineer: stand up pgTAP red/green harness, CI matrix (PG 14–18), engine-sacredness diff-guard. (blocks everyone — must land first.)
  • +
  • PG-internals engineer: spike next_batch/get_batch_events/finish_batch reduction; confirm send_at (PR #237) shape.
  • +
  • Parallel: SDK engineer scaffolds the thin Python client surface against stub SQL signatures.
  • +
+
+
+

109. Topic-specific: API surface (reference SDK, Python v0.1)

+

Every SDK call compiles to one of the five PgQ primitives + a coordination-table touch. No async/await-compiled linear-code DX in v0.1 (deferred, §10).

+

---

+
wf = defineWorkflow("order_fulfillment")
+
+@wf.step("charge")
+def charge(ctx, state):
+    ctx.side_effect(...)              # user's own idempotent/in-txn write
+    return ctx.goto("await_ship", state)        # append successor
+
+@wf.step("await_ship")
+def await_ship(ctx, state):
+    return ctx.await_event("shipped", timeout="24h",
+                           on_event="notify", on_timeout="escalate")
+
+@wf.step("fan")
+def fan(ctx, state):
+    return ctx.spawn([...N children...], join="collect")   # awaitAll → result array
+
+# external producer, anywhere:
+emit(workflow_id, "shipped", payload)
+
+
+

1110. Non-goals / disclaimers (honored strictly — not reintroduced anywhere above)

+

---

+
    +
  • NOT a Temporal/DBOS-style deterministic-replay engine. No workflow-determinism requirement, no replay of a long linear function, no per-step workflow_status UPDATE churn.
  • +
  • NOT a multi-language deterministic-replay runtime in v1. No N synchronized SDKs — one reference SDK.
  • +
  • NOT a separate server, daemon, or external datastore. No Cassandra, RocksDB, FoundationDB, or Redis.
  • +
  • NOT targeting hyperscale (>~ a few thousand workflow transitions/sec per database) — conceded to Temporal honestly.
  • +
  • NOT changing the sacred PgQ engine, and NOT introducing a second SELECT … FOR UPDATE SKIP LOCKED claim/lease concurrency model as the primary mechanism — exclusivity comes from the single-live-continuation invariant over the existing rotation engine.
  • +
  • Cancellation / orphan-join propagation is deferred to a follow-up, not in v0.1.
  • +
  • Linear-code (async/await-compiled) DX is an explicit later SDK project, not an engine requirement.
  • +
+
+
+

1211. Embedded Changelog

+
    +
  • v0.1 (2026-05-30) — Initial spec scaffold fleshed into full structure. Resolved all five delegated interview questions (primary users, core job, durability model, success metric, out-of-scope). Added Goal-&-why framing, 5 user stories, layered architecture with the sacred-engine boundary, hot-path/coordination implementation detail incl. the honest zero-bloat correction, await/emit + fan-out/join race designs, red/green TDD-first ordering, team roster, 5-sprint plan with parallel tracks, SDK surface, and strict non-goals. No reviewer findings yet (first authoring round).
  • +
+
+
+

+Generated by samospec brief workflows on . + 1 review round — + lead —, + reviewers — + + —. +

+

Re-run after each samospec publish to refresh. Canonical document: SPEC.md.

+
+
+ + + diff --git a/blueprints/workflows/README.md b/blueprints/workflows/README.md new file mode 100644 index 00000000..1c973a74 --- /dev/null +++ b/blueprints/workflows/README.md @@ -0,0 +1,18 @@ +# PgQue Durable Workflows — spec (experimental, samospec-authored) + +This directory holds the versioned specification for the proposed +**event-sourced durable-execution layer** on PgQue (see +`../DURABLE_EXECUTION_FEASIBILITY.md` for the strategy this spec realizes). + +- `SPEC.md` — the spec (current version in its header). +- `BRIEF.html` / `index.html` — self-contained HTML brief (derivative of + SPEC.md). `index.html` is the GitHub Pages entry point. +- `TLDR.md`, `decisions.md`, `changelog.md`, `architecture.json` — auxiliary + artifacts. + +Authored and iterated with [samospec](https://github.com/NikolayS/samospec) +running an all-Claude review panel (lead + two reviewer personas). Each +version is committed and the brief republished. + +Status: **experimental** — ships as optional `sql/experimental/durable.sql` +gated by the project promotion rule. diff --git a/blueprints/workflows/SPEC.md b/blueprints/workflows/SPEC.md new file mode 100644 index 00000000..7b250499 --- /dev/null +++ b/blueprints/workflows/SPEC.md @@ -0,0 +1,305 @@ +# PgQue Durable Workflows — SPEC v0.1 + +> Status: **experimental**, ships as optional `sql/experimental/durable.sql` gated by the project promotion rule. One reference SDK. Engine layer is sacred and untouched. + +--- + +## 1. Goal & why it's needed + +**Goal.** Provide a durable-execution / durable-workflow layer for PgQue that models each workflow as an **append-only stream of state-transition events** running over PgQ's existing snapshot + TRUNCATE rotation engine, so durable workflows inherit PgQ's zero-bloat property instead of fighting it. + +**Why this exists.** Every Postgres-native durable-execution engine in the category (DBOS, absurd, and the long tail of `SELECT … FOR UPDATE SKIP LOCKED` + `DELETE` queues) shares one structural liability: they model a workflow as a **mutable `workflow_status` row that is `UPDATE`d on every step**. At the throughput the category is actually chasing — AI agent loops doing millions of cheap iterations — that per-step `UPDATE` churns dead tuples until the workload hits a VACUUM wall, and throughput degrades. PgQ already solved exactly this problem for *queues* with snapshot-batch isolation + wholesale `TRUNCATE` rotation: zero dead-tuple bloat under sustained load. The insight this spec operationalizes is that **durable execution is event sourcing**, PgQ is **already an append-only event log**, and therefore a workflow can be modeled as a stream of appended transitions (continuation-passing) rather than a mutated row. The zero-bloat property then carries through, *for free*, from the queue layer to the workflow layer. + +This exists because no one else can credibly claim "durable workflows with a flat dead-tuple curve at agent-loop throughput, on just your managed Postgres, no separate datastore." That is the entire pitch, and it is only reachable by building **on top of** the rotation engine rather than re-introducing a mutable-status model beside it. + +**What it is NOT** (honored strictly throughout — see §10): not a Temporal/DBOS deterministic-replay engine; not a multi-language replay runtime; not a separate server/daemon/datastore; not a hyperscale engine; not a `FOR UPDATE SKIP LOCKED` claim/lease model; cancellation/orphan-join propagation is deferred. + +--- + +## 2. Scope & resolved interview decisions + +The interview answers were all delegated to the lead ("decide for me"). Resolved: + +| Question | Decision (v0.1) | +|---|---| +| **Primary users** | Backend engineers running long-lived or high-iteration orchestration (AI agent loops, multi-step business processes, fan-out jobs) **on managed Postgres** who refuse a second datastore and refuse a VACUUM wall. | +| **Core job** | Advance a workflow from one step to the next with **exactly-once handoff** and **at-least-once step execution**, never losing or silently duplicating a workflow's progress — on a hot path that appends and rotates rather than updates. | +| **Durability / recovery guarantee** | At-least-once step execution + exactly-once handoff between steps; per-step idempotency keyed on `(workflow_id, step_seq)`. On crash, exactly the single in-flight step redelivers (PgQ's existing redelivery); there is no long function to replay. | +| **Success metric** | A throughput-and-bloat benchmark vs a mutable-status-row baseline (DBOS/absurd shape) on server hardware: **flat dead-tuple count + sustained throughput** on the append+rotate hot path where the baseline degrades. | +| **Out of scope for v0.1** | Cancellation / orphan-join propagation; linear-code (`async/await`-compiled) DX; N synchronized SDKs; hyperscale (>~ a few thousand transitions/sec/db); deterministic replay. | + +--- + +## 3. User stories + +Each story is persona + action + outcome and is directly exercised as a manual acceptance test (§6.4). + +1. **Agent-loop builder (zero-bloat at iteration scale).** *As* a backend engineer running an AI agent that loops thousands of times per run, *I* define each iteration as a step that processes and enqueues its successor, *so that* a million iterations complete with a **flat dead-tuple count** on the hot tables and no VACUUM-driven throughput cliff — verifiable with `pg_stat_user_tables.n_dead_tup` staying flat through the run. + +2. **Long-sleep orchestrator (durable timers).** *As* an engineer modeling a "wait 7 days, then send a reminder" process, *I* call `sleep('7 days')` inside a step, *so that* the workflow durably resumes after the wait **without holding any batch open** and **without a per-workflow polling row** — the sleep is one row in a TRUNCATE-rotated delayed-delivery table. + +3. **Human-in-the-loop integrator (await external event).** *As* an engineer building an approval flow, *I* call `awaitEvent('approval', timeout => '24h')` and have another part of my system call `emit(workflow_id, 'approval', payload)`, *so that* the workflow resumes **exactly once** on the event — robust against emit-before-await, await/emit interleave, and emit-racing-the-timeout — or resumes on the timeout branch if the deadline passes first. + +4. **Fan-out batch processor (spawn + join).** *As* an engineer processing a parent job that splits into N independent children, *I* spawn N child workflows and `awaitAll`, *so that* the parent resumes **exactly once** when all N complete, with a **per-child result array** (success/failure each), even under redelivery of any child's completion. + +5. **Exactly-once integrator (transactional handoff).** *As* an engineer whose step writes a row to *my own* business table and then advances the workflow, *I* run my side effect, the successor enqueue, and the batch ack in **one transaction**, *so that* a crash either commits all three or none — no successor without the side effect, no side effect without the ack, no duplicate handoff. + +--- + +## 4. Architecture + + + +```text +(architecture not yet specified) +``` + + + +### 4.1 Layering (the sacred boundary) + +``` +┌──────────────────────────────────────────────────────────┐ +│ Reference SDK (Python, v0.1) │ ← thin client +│ defineWorkflow / step / sleep / awaitEvent / emit / │ +│ spawn / awaitAll → plain SQL function calls │ +├──────────────────────────────────────────────────────────┤ +│ Durable layer (sql/experimental/durable.sql) │ ← THIS SPEC +│ • dispatch loop over PgQ batches │ +│ • coordination side tables (waits, joins, dedup, proj.) │ +│ • SECURITY DEFINER fns, search_path=pgque,pg_catalog │ +├──────────────────────────────────────────────────────────┤ +│ PgQ engine (SACRED — UNMODIFIED) │ ← do not touch +│ insert_event · next_batch · get_batch_events · │ +│ finish_batch · event_retry · send_at (PR #237) · │ +│ ticks · snapshot rotation · TRUNCATE · cooperative │ +│ consumers (dead_interval takeover) │ +└──────────────────────────────────────────────────────────┘ +``` + +The durable layer **only calls** the five PgQ primitives + `send_at`. It adds **no** modification to rotation/tick/batch logic and introduces **no** second concurrency model. + +### 4.2 Key abstractions + +- **Workflow** — a logical state machine identified by `workflow_id`. At any instant it is in exactly one of three conditions: **(a)** one *in-flight* message (a step-event sitting in a PgQ batch being processed), **(b)** *scheduled* (a `send_at` continuation awaiting a wake time, or a registered wait awaiting an event), or **(c)** *terminal*. The **single-live-continuation invariant** — each processed step enqueues *exactly one* successor — is what makes exclusivity structural rather than lease-based. +- **Step-event** — the message on the PgQ queue. Payload carries: `workflow_id`, `step_seq` (monotonic progress anchor), `step_name`/state tag, and small continuation state (continuation-passing). Large state is the user's responsibility to hold in their own tables, addressed by `workflow_id`. +- **Transition** — process a step → emit successor as a *new append*. Never an `UPDATE` of a status row. +- **Coordination side tables** (the only mutable state; see §5.5) — `wf_wait`, `wf_event_cache`, `wf_join`, `wf_join_done`, `wf_dedup`, `wf_live` (current-state projection). Their churn is bounded by **concurrency and coordination-point count, not total step volume** (the honest correction, §5.6). + +### 4.3 Concurrency / ownership model + +One **logical consumer** with cooperative **subconsumers** splitting batches (PgQ 0.2 feature). Because exactly one live message exists per workflow, only one subconsumer ever touches a given workflow at a given instant — exclusivity is an emergent property of the invariant, requiring **no claim/lease/steal machinery**. Worker death mid-batch is covered by PgQ's existing cooperative `dead_interval` takeover: the unfinished batch is reassigned and the in-flight step redelivers (at-least-once), made safe by per-step idempotency (§5.4). + +--- + +## 5. Implementation details + +### 5.1 The hot path: one transition = append + ack, atomically + +The foundational guarantee. `insert_event()` (enqueue successor) and `finish_batch()` (ack) run in the **consumer's own transaction**, so a step's side effects, its successor enqueue, and its batch ack are **one atomic commit**: + +``` +begin; + -- 1. step's own DB side effects (idempotent or naturally in-txn) + -- 2. record per-step dedup marker (workflow_id, step_seq) [if first delivery] + perform pgque.insert_event(queue, next_state); -- enqueue exactly one successor + perform pgque.finish_batch(batch_id); -- ack this step +commit; +``` + +- **Commit** ⇒ successor durably enqueued **AND** batch finished, atomically ⇒ exactly-once handoff. +- **Crash before commit** ⇒ txn aborts ⇒ no successor, no dedup marker, batch not finished ⇒ the step redelivers cleanly. + +No subtransactions are used on this path (hard constraint). + +### 5.2 Dispatch loop + +``` +loop: + batch_id := pgque.next_batch(queue, consumer) -- snapshot-bounded + if batch_id is null: sleep to next tick; continue + events := pgque.get_batch_events(batch_id) -- many workflows at once + begin + for each event in events: -- batch step execution + advance_one(event) -- §5.3, appends successor(s) + pgque.finish_batch(batch_id) + commit +``` + +A batch may contain step-events for **thousands of distinct workflows**; they advance in **one transaction** (native fan-out + batch step execution). A step that fails transiently calls `pgque.event_retry()` for that single event rather than aborting the whole batch where the engine allows per-event retry; a poisoned step lands in PgQ's existing DLQ after max retries. + +### 5.3 The five durable-execution requirements, mapped + +1. **Exclusive ownership — structural.** Single-live-continuation invariant + cooperative `dead_interval` takeover. No lease. +2. **Mutable run state — re-enqueue, don't update.** Each transition appends a new event carrying new state; small state rides the payload; no long-lived per-run row on the hot path. +3. **Long-lived persistence — rotating `send_at`.** `sleep('7d')` = `send_at(continuation, now()+7d)`; the step acks immediately; the sleep is one row in a TRUNCATE-rotated delayed table — zero-bloat, never an open batch. +4. **Per-row scheduling.** Timers via rotating `send_at`. **`awaitEvent` with timeout** is the genuinely hard new piece (§5.7). +5. **Checkpoint replay — not needed.** No long-running function to resume. Recovery = PgQ's at-least-once redelivery of the single in-flight step. Correctness = exactly-once handoff (§5.1) + per-step idempotency (§5.4). + +### 5.4 Per-step idempotency + +Every step is keyed `(workflow_id, step_seq)`. On (re)delivery a step first checks/inserts a dedup marker; the marker insert and the successor enqueue commit together (§5.1). A redelivered step whose successor already committed is a no-op (marker present) and simply re-acks. The dedup store is **append-based and short-horizon (rotating)** so it does not itself become a bloat source (§5.6). + +### 5.5 Coordination side tables + +| Table | Role | Churn driver | Lifecycle | +|---|---|---|---| +| `wf_live` | current-state projection: one row per **live** workflow (observability + addressing) | concurrency (live count) | inserted at start, deleted at terminal | +| `wf_wait` | registered event waits, single-resume token | open awaits | `DELETE … RETURNING` on resume/timeout | +| `wf_event_cache` | first-write-wins cache for emit-before-await | emit/await coordination points | TTL-swept | +| `wf_join` | join row: parent + total N, single-resume token | spawn points | deleted when parent resumes | +| `wf_join_done` | idempotent completed-set `(parent, child_idx)` | child completions | dropped with the join | +| `wf_dedup` | per-step `(workflow_id, step_seq)` markers | redelivery horizon | rotating / short-horizon | + +All are small relative to total step volume. `wf_live`, `wf_wait`, `wf_join` are deleted on resolution (row-count bounded by concurrency); `wf_event_cache` and `wf_dedup` are TTL/rotation-bounded. + +### 5.6 The honest zero-bloat claim (stated precisely, never overstated) + +Zero-bloat holds on the **hot step-transition path** (appends + rotation). The coordination tables churn proportionally to **concurrency and coordination-point count, not to total step volume**. `wf_live` holds one row per *live* workflow (deleted on completion), so its row-count is bounded by concurrency; per-step dedup is append-based and short-horizon (rotating). The precise marketed claim is therefore: **zero-bloat hot path, concurrency-bounded coordination churn** — still dramatically better than per-step status-row churn. The benchmark (§6.5) measures and publishes both curves; we never claim "zero dead tuples anywhere." + +### 5.7 `awaitEvent` / `emit` — the ~20% with real risk (designed and TDD'd first) + +Wait registry keyed `(workflow_id, event_name)`, event names **correlation-scoped** to prevent cross-talk. Race table: + +- **emit-before-await** → `emit` writes `wf_event_cache` **first-write-wins**; a later `awaitEvent` finds the cached event and resumes immediately (no wait row created). +- **await/emit interleave** → both serialize on a **per-key advisory/row lock** so exactly one of {register-wait, consume-cache} wins deterministically. +- **double-resume (emit racing the timeout sweep)** → the wait row is a **single-resume token** resolved by `DELETE … RETURNING` in the **same txn** as the continuation enqueue. Whoever deletes the row first (emit or sweep) resumes; the loser sees zero rows and does nothing. +- **stale / cross-talk cached events** → correlation-scoped names + cache TTL. +- **redelivery of the await step itself** → idempotent registration on `(workflow_id, step_seq)`, with the projection's `step_seq` as the progress anchor; re-registering is a no-op. +- **timeout** → a maintenance sweep (optional pg_cron) injects the timeout-branch continuation via the same single-resume `DELETE … RETURNING` path. + +### 5.8 fan-out / join (spawn + `awaitAll`) + +- Spawn N children with **distinct child workflow ids**; **record the join total `N` atomically with the spawn** — tick visibility makes this race-free for free (children become visible only at the next tick boundary, after the join row is committed). +- Count completions with an **idempotent completed-set** `(parent, child_idx)` — redelivery-safe. +- Resume the parent **exactly once** via the `wf_join` row as a deletable single-resume token (last child to flip the count to N deletes the join and enqueues the parent continuation, in one txn). +- **Explicit per-child failure semantics**: the parent receives a **result array**, one entry per child (success value or failure marker). A failed child does not block the join; it reports failure in its slot. +- **Cancellation / orphan handling is explicitly deferred** (§10). + +### 5.9 Constraints honored + +Reduces cleanly to `insert_event`, `next_batch`, `get_batch_events`, `finish_batch`, `event_retry` (+ `send_at`) plus the small side tables of §5.5. Single-file, no C extension, no `shared_preload_libraries`, no restart; managed-PG compatible; optional pg_cron for ticker + maint sweeps. PostgreSQL 14–18; `pg_snapshot`/`xid8`. All SECURITY DEFINER functions pin `search_path = pgque, pg_catalog`. No subtransactions in hot paths. Ships as optional experimental `sql/experimental/durable.sql` gated by the promotion rule. + +--- + +## 6. Tests plan + +### 6.1 Hard repo rule + +**Red/green TDD for ALL new code.** Every function below is written test-first: a failing test asserting the behavior, then the implementation that makes it pass. CI rejects any new SQL function or SDK method without a preceding failing-then-passing test in the same change. + +### 6.2 Built test-first, in this order (highest risk first) + +These are the pieces explicitly required to be designed and TDD'd **first**, before any happy-path step plumbing: + +1. **Exactly-once handoff** (§5.1) — red test: kill the txn between `insert_event` and `commit`, assert no successor + clean redelivery; assert no double-handoff on commit. +2. **Per-step idempotency** (§5.4) — red test: deliver the same `(workflow_id, step_seq)` twice, assert exactly one successor + one side effect. +3. **`awaitEvent` / `emit` race matrix** (§5.7) — one red test per row of the race table (emit-before-await, interleave, double-resume, stale cache, await redelivery, timeout). Single-resume token proven by concurrent emit+sweep test. +4. **fan-out / join** (§5.8) — red tests: race-free join-total recording; idempotent completed-set under duplicated completion; exactly-once parent resume; per-child failure surfaced in the result array. + +### 6.3 CI test suites + +- **Unit (pgTAP/SQL):** each durable function in isolation, all six coordination tables' invariants, `search_path` pinning assertion, "no subtransaction in hot path" lint. +- **Concurrency/property tests:** randomized interleavings of emit/await/timeout and spawn/complete under multiple subconsumers; assert exactly-once resume + no orphaned waits/joins. +- **Crash-recovery tests:** inject worker death mid-batch, assert `dead_interval` takeover + single redelivery + idempotent no-op. +- **Matrix:** PostgreSQL 14, 15, 16, 17, 18. +- **Engine-sacredness guard:** CI diff-check that no file under the PgQ engine path is modified by this change. + +### 6.4 Manual acceptance (maps 1:1 to §3 user stories) + +Each of the five user stories has a runnable scenario script the reviewer executes by hand against a managed-PG-like instance. + +### 6.5 Success-criterion benchmark (the entire pitch) + +Throughput-and-bloat benchmark vs a mutable-status-row baseline (DBOS/absurd shape) on server hardware. Publishes two curves over a long sustained run: **`n_dead_tup`** (flat for PgQue hot path; rising for baseline) and **sustained transitions/sec** (stable for PgQue; degrading for baseline at the VACUUM wall). Also publishes the **coordination-table churn** curve to substantiate the honest claim of §5.6. This benchmark is a CI-runnable harness, not a one-off. + +--- + +## 7. Team (veteran experts to hire) + +- **Veteran PostgreSQL internals / MVCC engineer (1)** — owns the snapshot/visibility reasoning, `xid8`/`pg_snapshot`, rotation interaction, no-subtransaction guarantee. +- **Veteran durable-execution / distributed-systems engineer (1)** — owns the await/emit and fan-out/join race designs and the single-resume-token correctness proofs. +- **Veteran PL/pgSQL + SQL test engineer (pgTAP) (1)** — owns the red/green TDD harness, concurrency/property tests, crash-recovery injection. +- **Veteran SDK / developer-experience engineer (Python) (1)** — owns the one reference SDK and the thin-client surface. +- **Veteran performance / benchmarking engineer (1)** — owns the throughput-and-bloat benchmark harness and the published curves. +- **Veteran technical writer / DX reviewer (0.5, shared)** — owns the experimental-feature docs and the honest-claim framing. + +### 7.1 Persona for this spec round + +Veteran **"Durable Workflow Engineer"** (accepted). + +--- + +## 8. Implementation plan (sprints, parallelization, ordering) + +**Sprint 0 — Foundations & harness (1 wk).** +- Test engineer: stand up pgTAP red/green harness, CI matrix (PG 14–18), engine-sacredness diff-guard. *(blocks everyone — must land first.)* +- PG-internals engineer: spike `next_batch`/`get_batch_events`/`finish_batch` reduction; confirm `send_at` (PR #237) shape. +- *Parallel:* SDK engineer scaffolds the thin Python client surface against stub SQL signatures. + +**Sprint 1 — Exactly-once core (1.5 wk).** *(highest risk first, per constraint)* +- PG-internals + distributed-systems engineers (pair): exactly-once handoff (§5.1) and per-step idempotency (§5.4), test-first. +- Test engineer: crash-recovery injection + `dead_interval` takeover tests. +- *Gate:* no further work merges until §5.1/§5.4 tests are green. + +**Sprint 2 — Coordination primitives (2 wk).** *Two independent tracks in parallel:* +- **Track A** (distributed-systems engineer): `awaitEvent`/`emit` race matrix (§5.7) — wait registry, first-write-wins cache, single-resume token, timeout sweep. +- **Track B** (PG-internals engineer): fan-out/join (§5.8) — join-total atomicity, idempotent completed-set, exactly-once parent resume, per-child result array. +- Test engineer rotates across both tracks writing the red tests ahead of each piece. + +**Sprint 3 — SDK + projection + dispatch (1.5 wk).** +- SDK engineer: finalize `defineWorkflow/step/sleep/awaitEvent/emit/spawn/awaitAll` over the now-stable SQL. +- PG-internals engineer: `wf_live` projection + dispatch loop (§5.2), `sleep` via rotating `send_at`. +- *Parallel:* benchmarking engineer builds the baseline (DBOS/absurd-shape) rig. + +**Sprint 4 — Benchmark, hardening, docs (1.5 wk).** +- Benchmarking engineer: run the success-criterion benchmark (§6.5), publish dead-tuple + throughput + coordination-churn curves. +- Whole team: concurrency/property-test hardening, `search_path` audit, no-subtransaction lint pass. +- Writer: experimental docs with the honest-claim framing (§5.6); promotion-rule checklist. + +**Critical path:** Sprint 0 harness → Sprint 1 exactly-once gate → Sprint 2 Track A & Track B (parallel) → Sprint 3 → Sprint 4 benchmark. SDK and benchmark-rig work parallelize off the critical path from Sprint 0/3 respectively. + +--- + +## 9. Topic-specific: API surface (reference SDK, Python v0.1) + +```python +wf = defineWorkflow("order_fulfillment") + +@wf.step("charge") +def charge(ctx, state): + ctx.side_effect(...) # user's own idempotent/in-txn write + return ctx.goto("await_ship", state) # append successor + +@wf.step("await_ship") +def await_ship(ctx, state): + return ctx.await_event("shipped", timeout="24h", + on_event="notify", on_timeout="escalate") + +@wf.step("fan") +def fan(ctx, state): + return ctx.spawn([...N children...], join="collect") # awaitAll → result array + +# external producer, anywhere: +emit(workflow_id, "shipped", payload) +``` + +Every SDK call compiles to one of the five PgQ primitives + a coordination-table touch. **No** `async/await`-compiled linear-code DX in v0.1 (deferred, §10). + +--- + +## 10. Non-goals / disclaimers (honored strictly — not reintroduced anywhere above) + +- **NOT** a Temporal/DBOS-style deterministic-replay engine. No workflow-determinism requirement, no replay of a long linear function, no per-step `workflow_status` UPDATE churn. +- **NOT** a multi-language deterministic-replay runtime in v1. No N synchronized SDKs — one reference SDK. +- **NOT** a separate server, daemon, or external datastore. No Cassandra, RocksDB, FoundationDB, or Redis. +- **NOT** targeting hyperscale (>~ a few thousand workflow transitions/sec per database) — conceded to Temporal honestly. +- **NOT** changing the sacred PgQ engine, and **NOT** introducing a second `SELECT … FOR UPDATE SKIP LOCKED` claim/lease concurrency model as the primary mechanism — exclusivity comes from the single-live-continuation invariant over the existing rotation engine. +- **Cancellation / orphan-join propagation is deferred** to a follow-up, not in v0.1. +- Linear-code (`async/await`-compiled) DX is an explicit **later** SDK project, not an engine requirement. + +--- + +## 11. Embedded Changelog + +- **v0.1** (2026-05-30) — Initial spec scaffold fleshed into full structure. Resolved all five delegated interview questions (primary users, core job, durability model, success metric, out-of-scope). Added Goal-&-why framing, 5 user stories, layered architecture with the sacred-engine boundary, hot-path/coordination implementation detail incl. the honest zero-bloat correction, await/emit + fan-out/join race designs, red/green TDD-first ordering, team roster, 5-sprint plan with parallel tracks, SDK surface, and strict non-goals. No reviewer findings yet (first authoring round). diff --git a/blueprints/workflows/TLDR.md b/blueprints/workflows/TLDR.md new file mode 100644 index 00000000..7ab568eb --- /dev/null +++ b/blueprints/workflows/TLDR.md @@ -0,0 +1,23 @@ +# TL;DR + +## Goal + +> Status: **experimental**, ships as optional `sql/experimental/durable.sql` gated by the project promotion rule. One reference SDK. Engine layer is sacred and untouched. + +## Scope summary + +- 1. Goal & why it's needed +- 2. Scope & resolved interview decisions +- 3. User stories +- 4. Architecture +- 5. Implementation details +- 6. Tests plan +- 7. Team (veteran experts to hire) +- 8. Implementation plan (sprints, parallelization, ordering) +- 9. Topic-specific: API surface (reference SDK, Python v0.1) +- 10. Non-goals / disclaimers (honored strictly — not reintroduced anywhere above) +- 11. Embedded Changelog + +## Next action + +`samospec resume workflows` diff --git a/blueprints/workflows/architecture.json b/blueprints/workflows/architecture.json new file mode 100644 index 00000000..f46cd630 --- /dev/null +++ b/blueprints/workflows/architecture.json @@ -0,0 +1,5 @@ +{ + "version": "1", + "nodes": [], + "edges": [] +} diff --git a/blueprints/workflows/changelog.md b/blueprints/workflows/changelog.md new file mode 100644 index 00000000..3978dadb --- /dev/null +++ b/blueprints/workflows/changelog.md @@ -0,0 +1,6 @@ +# changelog + +## v0.1 — 2026-05-30T09:34:38.223Z + +- Initial draft authored by the lead. +- Persona: Veteran "Durable Workflow Engineer" expert diff --git a/blueprints/workflows/decisions.md b/blueprints/workflows/decisions.md new file mode 100644 index 00000000..88773af1 --- /dev/null +++ b/blueprints/workflows/decisions.md @@ -0,0 +1,3 @@ +# decisions + +- No review-loop decisions yet. diff --git a/blueprints/workflows/index.html b/blueprints/workflows/index.html new file mode 100644 index 00000000..27bd9e18 --- /dev/null +++ b/blueprints/workflows/index.html @@ -0,0 +1,501 @@ + + + + + +Brief — PgQue Durable Workflows — SPEC v0.1 (v0.1) + + + + +
+
+
+workflows +vv0.1 +2026-05-30 +
+
+brief + + + + + +
+
+
+
+
+

Brief — derivative summary

+

PgQue Durable Workflows — SPEC v0.1

+

+workflows · + Version v0.1 · + Published · + canonical SPEC.md → +

+

+Summary, not the spec. Skim this for shape, architecture, scope, risks, decisions, and open questions in 5–10 minutes; consult SPEC.md for the full text. +

+
+ +
+

01Goal

+

> Status: experimental, ships as optional sql/experimental/durable.sql gated by the project promotion rule. One reference SDK. Engine layer is sacred and untouched.

+
+
+

021. Goal & why it's needed

+

Goal. Provide a durable-execution / durable-workflow layer for PgQue that models each workflow as an append-only stream of state-transition events running over PgQ's existing snapshot + TRUNCATE rotation engine, so durable workflows inherit PgQ's zero-bloat property instead of fighting it.

+

Why this exists. Every Postgres-native durable-execution engine in the category (DBOS, absurd, and the long tail of SELECT … FOR UPDATE SKIP LOCKED + DELETE queues) shares one structural liability: they model a workflow as a mutable workflow_status row that is UPDATEd on every step. At the throughput the category is actually chasing — AI agent loops doing millions of cheap iterations — that per-step UPDATE churns dead tuples until the workload hits a VACUUM wall, and throughput degrades. PgQ already solved exactly this problem for queues with snapshot-batch isolation + wholesale TRUNCATE rotation: zero dead-tuple bloat under sustained load. The insight this spec operationalizes is that durable execution is event sourcing, PgQ is already an append-only event log, and therefore a workflow can be modeled as a stream of appended transitions (continuation-passing) rather than a mutated row. The zero-bloat property then carries through, for free, from the queue layer to the workflow layer.

+

This exists because no one else can credibly claim "durable workflows with a flat dead-tuple curve at agent-loop throughput, on just your managed Postgres, no separate datastore." That is the entire pitch, and it is only reachable by building on top of the rotation engine rather than re-introducing a mutable-status model beside it.

+
+
+

032. Scope & resolved interview decisions

+

The interview answers were all delegated to the lead ("decide for me"). Resolved:

+

| Question | Decision (v0.1) | |---|---| | Primary users | Backend engineers running long-lived or high-iteration orchestration (AI agent loops, multi-step business processes, fan-out jobs) on managed Postgres who refuse a second datastore and refuse a VACUUM wall. | | Core job | Advance a workflow from one step to the next with exactly-once handoff and at-least-once step execution, never losing or silently duplicating a workflow's progress — on a hot path that appends and rotates rather than updates. | | Durability / recovery guarantee | At-least-once step execution + exactly-once handoff between steps; per-step idempotency keyed on (workflow_id, step_seq). On crash, exactly the single in-flight step redelivers (PgQ's existing redelivery); there is no long function to replay. | | Success metric | A throughput-and-bloat benchmark vs a mutable-status-row baseline (DBOS/absurd shape) on server hardware: flat dead-tuple count + sustained throughput on the append+rotate hot path where the baseline degrades. | | Out of scope for v0.1 | Cancellation / orphan-join propagation; linear-code (async/await-compiled) DX; N synchronized SDKs; hyperscale (>~ a few thousand transitions/sec/db); deterministic replay. |

+

---

+
+
+

043. User stories

+

Each story is persona + action + outcome and is directly exercised as a manual acceptance test (§6.4).

+

---

+
    +
  • Agent-loop builder (zero-bloat at iteration scale). As a backend engineer running an AI agent that loops thousands of times per run, I define each iteration as a step that processes and enqueues its successor, so that a million iterations complete with a flat dead-tuple count on the hot tables and no VACUUM-driven throughput cliff — verifiable with pg_stat_user_tables.n_dead_tup staying flat through the run.
  • +
+
+
+

054. Architecture

+

<!-- architecture:begin -->

+

<!-- architecture:end -->

+

The durable layer only calls the five PgQ primitives + send_at. It adds no modification to rotation/tick/batch logic and introduces no second concurrency model.

+
    +
  • Workflow — a logical state machine identified by workflow_id. At any instant it is in exactly one of three conditions: (a) one in-flight message (a step-event sitting in a PgQ batch being processed), (b) scheduled (a send_at continuation awaiting a wake time, or a registered wait awaiting an event), or (c) terminal. The single-live-continuation invariant — each processed step enqueues exactly one successor — is what makes exclusivity structural rather than lease-based.
  • +
  • Step-event — the message on the PgQ queue. Payload carries: workflow_id, step_seq (monotonic progress anchor), step_name/state tag, and small continuation state (continuation-passing). Large state is the user's responsibility to hold in their own tables, addressed by workflow_id.
  • +
  • Transition — process a step → emit successor as a new append. Never an UPDATE of a status row.
  • +
  • Coordination side tables (the only mutable state; see §5.5) — wf_wait, wf_event_cache, wf_join, wf_join_done, wf_dedup, wf_live (current-state projection). Their churn is bounded by concurrency and coordination-point count, not total step volume (the honest correction, §5.6).
  • +
+
(architecture not yet specified)
+
┌──────────────────────────────────────────────────────────┐
+│  Reference SDK (Python, v0.1)                            │  ← thin client
+│  defineWorkflow / step / sleep / awaitEvent / emit /     │
+│  spawn / awaitAll  →  plain SQL function calls           │
+├──────────────────────────────────────────────────────────┤
+│  Durable layer  (sql/experimental/durable.sql)          │  ← THIS SPEC
+│  • dispatch loop over PgQ batches                        │
+│  • coordination side tables (waits, joins, dedup, proj.) │
+│  • SECURITY DEFINER fns, search_path=pgque,pg_catalog    │
+├──────────────────────────────────────────────────────────┤
+│  PgQ engine  (SACRED — UNMODIFIED)                       │  ← do not touch
+│  insert_event · next_batch · get_batch_events ·          │
+│  finish_batch · event_retry · send_at (PR #237) ·        │
+│  ticks · snapshot rotation · TRUNCATE · cooperative      │
+│  consumers (dead_interval takeover)                      │
+└──────────────────────────────────────────────────────────┘
+
+Subsections (3) +
  1. 4.1 Layering (the sacred boundary)
  2. +
  3. 4.2 Key abstractions
  4. +
  5. 4.3 Concurrency / ownership model
+
+
+
+

065. Implementation details

+

The foundational guarantee. insert_event() (enqueue successor) and finish_batch() (ack) run in the consumer's own transaction, so a step's side effects, its successor enqueue, and its batch ack are one atomic commit:

+

No subtransactions are used on this path (hard constraint).

+

A batch may contain step-events for thousands of distinct workflows; they advance in one transaction (native fan-out + batch step execution). A step that fails transiently calls pgque.event_retry() for that single event rather than aborting the whole batch where the engine allows per-event retry; a poisoned step lands in PgQ's existing DLQ after max retries.

+
    +
  • Commit ⇒ successor durably enqueued AND batch finished, atomically ⇒ exactly-once handoff.
  • +
  • Crash before commit ⇒ txn aborts ⇒ no successor, no dedup marker, batch not finished ⇒ the step redelivers cleanly.
  • +
+
begin;
+  -- 1. step's own DB side effects (idempotent or naturally in-txn)
+  -- 2. record per-step dedup marker (workflow_id, step_seq)   [if first delivery]
+  perform pgque.insert_event(queue, next_state);   -- enqueue exactly one successor
+  perform pgque.finish_batch(batch_id);            -- ack this step
+commit;
+
loop:
+  batch_id := pgque.next_batch(queue, consumer)        -- snapshot-bounded
+  if batch_id is null: sleep to next tick; continue
+  events  := pgque.get_batch_events(batch_id)          -- many workflows at once
+  begin
+    for each event in events:                          -- batch step execution
+        advance_one(event)                             -- §5.3, appends successor(s)
+    pgque.finish_batch(batch_id)
+  commit
+
+Subsections (8) +
  1. 5.1 The hot path: one transition = append + ack, atomically
  2. +
  3. 5.2 Dispatch loop
  4. +
  5. 5.3 The five durable-execution requirements, mapped
  6. +
  7. 5.4 Per-step idempotency
  8. +
  9. 5.5 Coordination side tables
  10. +
  11. 5.6 The honest zero-bloat claim (stated precisely, never overstated)
  12. +
  13. 5.7 `awaitEvent` / `emit` — the ~20% with real risk (designed and TDD'd first)
  14. +
  15. 5.8 fan-out / join (spawn + `awaitAll`)
+
+
+
+

076. Tests plan

+

Red/green TDD for ALL new code. Every function below is written test-first: a failing test asserting the behavior, then the implementation that makes it pass. CI rejects any new SQL function or SDK method without a preceding failing-then-passing test in the same change.

+

These are the pieces explicitly required to be designed and TDD'd first, before any happy-path step plumbing:

+

Each of the five user stories has a runnable scenario script the reviewer executes by hand against a managed-PG-like instance.

+
    +
  • Exactly-once handoff (§5.1) — red test: kill the txn between insert_event and commit, assert no successor + clean redelivery; assert no double-handoff on commit.
  • +
  • Per-step idempotency (§5.4) — red test: deliver the same (workflow_id, step_seq) twice, assert exactly one successor + one side effect.
  • +
  • awaitEvent / emit race matrix (§5.7) — one red test per row of the race table (emit-before-await, interleave, double-resume, stale cache, await redelivery, timeout). Single-resume token proven by concurrent emit+sweep test.
  • +
  • fan-out / join (§5.8) — red tests: race-free join-total recording; idempotent completed-set under duplicated completion; exactly-once parent resume; per-child failure surfaced in the result array.
  • +
+
+Subsections (5) +
  1. 6.1 Hard repo rule
  2. +
  3. 6.2 Built test-first, in this order (highest risk first)
  4. +
  5. 6.3 CI test suites
  6. +
  7. 6.4 Manual acceptance (maps 1:1 to §3 user stories)
  8. +
  9. 6.5 Success-criterion benchmark (the entire pitch)
+
+
+
+

087. Team (veteran experts to hire)

+

Veteran "Durable Workflow Engineer" (accepted).

+

---

+
    +
  • Veteran PostgreSQL internals / MVCC engineer (1) — owns the snapshot/visibility reasoning, xid8/pg_snapshot, rotation interaction, no-subtransaction guarantee.
  • +
  • Veteran durable-execution / distributed-systems engineer (1) — owns the await/emit and fan-out/join race designs and the single-resume-token correctness proofs.
  • +
  • Veteran PL/pgSQL + SQL test engineer (pgTAP) (1) — owns the red/green TDD harness, concurrency/property tests, crash-recovery injection.
  • +
  • Veteran SDK / developer-experience engineer (Python) (1) — owns the one reference SDK and the thin-client surface.
  • +
  • Veteran performance / benchmarking engineer (1) — owns the throughput-and-bloat benchmark harness and the published curves.
  • +
  • Veteran technical writer / DX reviewer (0.5, shared) — owns the experimental-feature docs and the honest-claim framing.
  • +
+
+Subsections (1) +
  1. 7.1 Persona for this spec round
+
+
+
+

098. Implementation plan (sprints, parallelization, ordering)

+

Sprint 0 — Foundations & harness (1 wk).

+

Sprint 1 — Exactly-once core (1.5 wk). (highest risk first, per constraint)

+

Sprint 2 — Coordination primitives (2 wk). Two independent tracks in parallel:

+
    +
  • Test engineer: stand up pgTAP red/green harness, CI matrix (PG 14–18), engine-sacredness diff-guard. (blocks everyone — must land first.)
  • +
  • PG-internals engineer: spike next_batch/get_batch_events/finish_batch reduction; confirm send_at (PR #237) shape.
  • +
  • Parallel: SDK engineer scaffolds the thin Python client surface against stub SQL signatures.
  • +
+
+
+

109. Topic-specific: API surface (reference SDK, Python v0.1)

+

Every SDK call compiles to one of the five PgQ primitives + a coordination-table touch. No async/await-compiled linear-code DX in v0.1 (deferred, §10).

+

---

+
wf = defineWorkflow("order_fulfillment")
+
+@wf.step("charge")
+def charge(ctx, state):
+    ctx.side_effect(...)              # user's own idempotent/in-txn write
+    return ctx.goto("await_ship", state)        # append successor
+
+@wf.step("await_ship")
+def await_ship(ctx, state):
+    return ctx.await_event("shipped", timeout="24h",
+                           on_event="notify", on_timeout="escalate")
+
+@wf.step("fan")
+def fan(ctx, state):
+    return ctx.spawn([...N children...], join="collect")   # awaitAll → result array
+
+# external producer, anywhere:
+emit(workflow_id, "shipped", payload)
+
+
+

1110. Non-goals / disclaimers (honored strictly — not reintroduced anywhere above)

+

---

+
    +
  • NOT a Temporal/DBOS-style deterministic-replay engine. No workflow-determinism requirement, no replay of a long linear function, no per-step workflow_status UPDATE churn.
  • +
  • NOT a multi-language deterministic-replay runtime in v1. No N synchronized SDKs — one reference SDK.
  • +
  • NOT a separate server, daemon, or external datastore. No Cassandra, RocksDB, FoundationDB, or Redis.
  • +
  • NOT targeting hyperscale (>~ a few thousand workflow transitions/sec per database) — conceded to Temporal honestly.
  • +
  • NOT changing the sacred PgQ engine, and NOT introducing a second SELECT … FOR UPDATE SKIP LOCKED claim/lease concurrency model as the primary mechanism — exclusivity comes from the single-live-continuation invariant over the existing rotation engine.
  • +
  • Cancellation / orphan-join propagation is deferred to a follow-up, not in v0.1.
  • +
  • Linear-code (async/await-compiled) DX is an explicit later SDK project, not an engine requirement.
  • +
+
+
+

1211. Embedded Changelog

+
    +
  • v0.1 (2026-05-30) — Initial spec scaffold fleshed into full structure. Resolved all five delegated interview questions (primary users, core job, durability model, success metric, out-of-scope). Added Goal-&-why framing, 5 user stories, layered architecture with the sacred-engine boundary, hot-path/coordination implementation detail incl. the honest zero-bloat correction, await/emit + fan-out/join race designs, red/green TDD-first ordering, team roster, 5-sprint plan with parallel tracks, SDK surface, and strict non-goals. No reviewer findings yet (first authoring round).
  • +
+
+
+

+Generated by samospec brief workflows on . + 1 review round — + lead —, + reviewers — + + —. +

+

Re-run after each samospec publish to refresh. Canonical document: SPEC.md.

+
+
+ + + From aafaf7bc7cd5fce92ed0e879719f702825406651 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 30 May 2026 09:50:39 +0000 Subject: [PATCH 4/8] spec(workflows): iterate to v0.2.0 + refresh brief --- blueprints/workflows/BRIEF.html | 4 ++-- blueprints/workflows/TLDR.md | 8 +++++--- blueprints/workflows/changelog.md | 3 +++ blueprints/workflows/decisions.md | 16 ++++++++++++++++ blueprints/workflows/index.html | 4 ++-- 5 files changed, 28 insertions(+), 7 deletions(-) diff --git a/blueprints/workflows/BRIEF.html b/blueprints/workflows/BRIEF.html index 27bd9e18..c24cf7ef 100644 --- a/blueprints/workflows/BRIEF.html +++ b/blueprints/workflows/BRIEF.html @@ -475,8 +475,8 @@

1211. Embedded Changelog