From 44c171e634738534615a95caa27ce8b27c0c3613 Mon Sep 17 00:00:00 2001
From: Pino de Candia <32303022+pinodeca@users.noreply.github.com>
Date: Tue, 23 Jun 2026 16:24:17 +0000
Subject: [PATCH 1/2] docs: add node & instance state-transition model proposal

Adds docs/node-state-model.md, a design proposal consolidating the
node-status work for issues #240 (skipped downstream nodes) and #171
(race-loser nodes left running/pending). Defines instance and node
lifecycle states, legal transitions, a coarse status set plus nullable
status_reason, loop iteration-scoping/reset semantics, and prior-art
alignment with Airflow/Temporal/BPMN/Step Functions.

The #240 skipped-node work shipped earlier in PR #249; this PR (#263)
defines the consolidated model and will add the remaining implementation
(cancelled, status_reason, race-loser reconciliation, loop reset) after
review.

Design only; implementation to follow on this PR.
---
 docs/node-state-model.md | 416 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 416 insertions(+)
 create mode 100644 docs/node-state-model.md

diff --git a/docs/node-state-model.md b/docs/node-state-model.md
new file mode 100644
index 0000000..61a738c
--- /dev/null
+++ b/docs/node-state-model.md
@@ -0,0 +1,416 @@
+# Node & Instance State Model
+
+Status: **Design proposal** (consolidates issues [#240] and [#171]).
+
+This document defines the lifecycle states that `pg_durable` workflow **instances**
+and **nodes** move through, the legal transitions between them, and the reason a
+node ends in a non-executed terminal state. It exists so that operators inspecting
+`df.instances` / `df.nodes` (and the `df.status()` / `df.instance_nodes` views) can
+unambiguously tell *what happened* to every step of a run.
+
+## Proposal summary
+
+This is the proposed transition graph for node states (states in **bold** are
+terminal):
+
+```mermaid
+stateDiagram-v2
+    [*] --> pending
+    pending --> running: scheduled
+    pending --> skipped: upstream_failed / branch_not_taken
+    pending --> cancelled: race_lost (never started)
+    running --> completed
+    running --> failed
+    running --> cancelled: race_lost (abandoned)
+    completed --> pending: loop boundary reset
+    failed --> pending: loop boundary reset
+    skipped --> pending: loop boundary reset
+    cancelled --> pending: loop boundary reset
+
+    classDef terminal font-weight:bold;
+    class completed,failed,skipped,cancelled terminal
+```
+
+**Loop reset:** a `LOOP` body node has one row but runs once per iteration, so at each `continue_as_new` boundary the loop-body subgraph is reset to `pending`, preventing stale terminal marks from leaking into the next iteration (see [Loops and node-state reuse](#loops-and-node-state-reuse-iteration-scoping)).
+
+**Key open decision** — how `df.nodes` projects looped nodes:
+
+- **Latest-only row + loop reset** (this proposal): one mutable row per node; simplest, but no per-iteration history in SQL.
+- **Per-execution rows** (`df.node_executions` keyed by `(node_id, iteration)`, à la Airflow): full per-iteration history and no reset needed, at the cost of a heavier schema.
+
+---
+
+## Reference model
+
+We align with **Apache Airflow's `TaskInstance` state model**, which is the closest
+well-known analog: a DAG of steps with branching, short-circuiting, and parallelism.
+The decisions below also stay consistent with **Temporal** (durable-execution
+semantics, since `pg_durable` runs on duroxide) at the instance level, and borrow
+**BPMN 2.0's** notion of a *withdrawn* activity for race-losers.
+
+| Concept here | Airflow | Temporal | BPMN 2.0 | AWS Step Functions |
+|---|---|---|---|---|
+| Step ran OK | `success` | `Completed` | `Completed` | `SUCCEEDED` |
+| Step errored | `failed` | `Failed` | `Failed` | `FAILED` |
+| Not run — upstream failed | `upstream_failed` | (n/a) | `Terminated` | `ABORTED` |
+| Not run — branch not taken | `skipped` | (n/a) | `Withdrawn` | (n/a) |
+| Abandoned — lost a race | (cancelled) | `Canceled` | `Withdrawn` | `ABORTED` |
+| Run still in flight, not reached | `scheduled`/`none` | `Scheduled` | `Ready` | `RUNNING` |
+
+Citing this lets us say: *"node states follow Airflow's task-instance model, with
+BPMN 'withdrawn' semantics for race-losers and Temporal alignment at the instance
+level."*
+
+The crucial lesson from Airflow: **"not executed" is more than one state.** Airflow
+deliberately separates `skipped` (a branch deliberately not taken) from
+`upstream_failed` (blocked by a failure). We adopt that distinction — but capture
+the nuance in a **reason** rather than an ever-growing status enum (see below).
+
+## Design decision: coarse status + reason
+
+We keep a **small, stable terminal status set** and add a nullable **`status_reason`**
+that explains *why* a node reached a non-obvious terminal state.
+
+Rationale:
+
+- `pg_durable` is a PostgreSQL extension with a strict schema-upgrade contract (see
+  [docs/upgrade-testing.md](upgrade-testing.md)). Every new status value is a
+  `CHECK` constraint migration plus a binary backward-compatibility concern. Adding
+  statuses should be rare.
+- A reason column captures arbitrarily fine nuance (`upstream_failed`,
+  `branch_not_taken`, `race_lost`, `loop_not_entered`, …) without schema churn, and
+  is exactly the "note or reason" operators want when debugging.
+- It keeps dashboards simple: filter on a handful of statuses, drill into the reason.
+
+### Node status set
+
+| Status | Terminal | Meaning |
+|---|---|---|
+| `pending` | no | Materialized by `df.start()`, not yet started. |
+| `running` | no | Currently executing. |
+| `completed` | yes | Finished successfully (may carry `result`). |
+| `failed` | yes | Raised an error (carries error in `result`). |
+| `skipped` | yes | Eligible work that was never started because the run already terminated, or a branch the run chose not to take. **(#240, new in 0.2.4)** |
+| `cancelled` | yes | Work that had started (`running`) or was queued and was abandoned because its enclosing scope (a decided `RACE`) no longer needs it. **(#171, proposed)** |
+
+> Current state of the code: `pending`, `running`, `completed`, `failed`, and
+> `skipped` exist; `skipped` was added in 0.2.4 for issue #240 (PR #249).
+> `cancelled` for **nodes** is proposed here for #171 (the `df.instances` table
+> already has a `cancelled` status, but `df.nodes` does not yet).
+
+### `status_reason` vocabulary (proposed)
+
+`status_reason` is `NULL` for the self-explanatory terminal states (`completed`,
+`failed` carrying their own error). It is populated for `skipped` / `cancelled`:
+
+| Reason | Applies to | Set when |
+|---|---|---|
+| `upstream_failed` | `skipped` | A prior/sibling step failed and the run is terminating ([#240]). |
+| `branch_not_taken` | `skipped` | An `IF`/`LOOP` condition chose another path, so this branch never runs. |
+| `race_lost` | `cancelled` | This node is in the losing branch of a resolved `RACE` ([#171]). |
+| `scope_cancelled` | `cancelled` | A reserved, more general form for future cancellation scopes. |
+
+## Instance state model
+
+Instance statuses are **unchanged** by this work (it is a non-goal to alter instance
+vocabulary). They are: `pending`, `running`, `completed`, `failed`, `cancelled`
+(see `instances_status_chk` in [src/lib.rs](../src/lib.rs)).
+
+```mermaid
+stateDiagram-v2
+    [*] --> pending
+    pending --> running: worker picks up
+    running --> completed: root node completed
+    running --> failed: root node / orchestration failed
+    running --> cancelled: instance cancelled
+    pending --> cancelled: cancelled before start
+    completed --> [*]
+    failed --> [*]
+    cancelled --> [*]
+```
+
+A node-level failure drives the instance to `failed`; node-level `skipped` /
+`cancelled` reconciliation happens *underneath* a `failed` (or `completed`, for race
+winners) instance and does not change the instance status.
+
+## Node state model
+
+```mermaid
+stateDiagram-v2
+    [*] --> pending
+    pending --> running: scheduled by orchestration
+    pending --> skipped: upstream_failed / branch_not_taken
+    pending --> cancelled: race_lost (never started)
+    running --> completed
+    running --> failed
+    running --> cancelled: race_lost (abandoned mid-flight)
+    completed --> pending: loop iteration boundary (reset)
+    failed --> pending: loop iteration boundary (reset)
+    skipped --> pending: loop iteration boundary (reset)
+    cancelled --> pending: loop iteration boundary (reset)
+    completed --> [*]
+    failed --> [*]
+    skipped --> [*]
+    cancelled --> [*]
+```
+
+Invariants:
+
+- `result` may only be non-NULL when status is `completed` or `failed`
+  (enforced today by `nodes_result_status_chk`). `skipped` and `cancelled` carry no
+  `result`; the explanation lives in `status_reason`.
+- **Within a single loop iteration (or a non-looping run), terminal is final:** a
+  reconciliation pass never moves an already-terminal node to a different terminal
+  state. The *only* terminal → non-terminal transition is the explicit
+  **loop iteration-boundary reset** (terminal → `pending`), which clears the body
+  subgraph between iterations (see [Loops and node-state reuse](#loops-and-node-state-reuse-iteration-scoping)).
+  Outside a loop, that reset never fires, so terminal states are truly final.
+
+## Per-construct outcomes
+
+How each construct's children may end. "Children" = the nodes referenced by
+`left_node` / `right_node` (and `extra_nodes` for `JOIN`/`RACE` of arity > 2).
+
+| Construct | On success | On failure of a child | Notes |
+|---|---|---|---|
+| `SQL` / `HTTP` / `SLEEP` / `WAIT_SCHEDULE` / `SIGNAL` | self `completed` | self `failed` | Leaf nodes; no children. |
+| `THEN` (sequence) | both children `completed` | failing child `failed`; **later children `skipped` (`upstream_failed`)** | The #240 case. Right branch is never reached after a left failure. |
+| `IF` | taken branch runs; **untaken branch `skipped` (`branch_not_taken`)** | condition failure ⇒ IF `failed`, **both branches `skipped` (`upstream_failed`)** | Today the untaken branch is left `pending`; it should be `skipped/branch_not_taken`. |
+| `LOOP` | body runs each iteration; on exit, condition path resolved | body/condition failure ⇒ LOOP `failed`, unreached body nodes `skipped` | A loop that never enters its body leaves the body `skipped/branch_not_taken`. |
+| `BREAK` | self `completed` (control-flow value) | n/a | Unwinds to enclosing loop; not a failure. |
+| `JOIN` (all branches) | all branches `completed` | any branch fails ⇒ JOIN `failed`; other branches run to their natural end | Fan-out/fan-in; siblings are *not* cancelled today. |
+| `RACE` (first wins) | winner `completed`; **loser branch `cancelled` (`race_lost`)** | if the *winner* errors, RACE `failed` | The #171 case. Today the loser's started nodes stay `running` and unreached nodes stay `pending`. |
+
+## Loops and node-state reuse (iteration scoping)
+
+This is the subtlest part of the model, and the one the single-status-per-row
+shape gets wrong if treated naively.
+
+`df.start()` materializes **one** `df.nodes` row per node, with a stable `id`. A
+`LOOP` re-executes the *same* subgraph (same node ids) every iteration via
+`continue_as_new`. Therefore a node inside a loop body is executed **once per
+iteration**, but it has only **one** status row. That row can only ever show the
+*latest* iteration's outcome — it is iteration-scoped state, not run-scoped state.
+
+### Current behavior (the gap)
+
+There is no reset logic today. Within an iteration, a visited node is re-marked
+`running` on entry (overwriting its prior-iteration terminal status), then
+`completed`/`failed` on exit. That means:
+
+- Nodes that **are** re-visited every iteration self-correct (they get re-marked).
+- Nodes that are **not** visited this iteration keep their **stale** status from a
+  previous iteration. The textbook case is a `RACE` (or `IF`) inside a `LOOP`: in
+  iteration 1 the left branch wins and the right (loser) nodes are marked terminal;
+  in iteration 2 the right branch wins — but the left branch's nodes still carry
+  iteration-1 statuses, and the iteration-1 loser marks were never cleared.
+
+So any per-iteration reconciliation (race-loser cancellation, branch-not-taken
+skip) is only correct *for the current iteration* and must not leak across the
+`continue_as_new` boundary.
+
+### Rule: reset the loop-body subgraph at each iteration boundary
+
+The two halves of the user-identified distinction become explicit rules:
+
+1. **Within one execution (same iteration):** losing-`RACE` nodes and untaken
+   branches transition to their terminal reconciliation state
+   (`cancelled/race_lost`, `skipped/branch_not_taken`) as described above. These
+   marks describe *this* iteration only.
+2. **At the iteration boundary (`continue_as_new` fires):** every node in the loop
+   body subgraph is reset to `pending` (clearing `status`, `result`, and
+   `status_reason`) before the next iteration runs, so the new iteration starts from
+   a clean slate and no stale terminal status survives.
+
+Concretely, reset happens at the **start of each loop body execution** (covering
+iteration 1 too, harmlessly, since those nodes are already `pending`), scoped to the
+loop-body subtree — a recursive walk over `left_node`/`right_node` from the loop's
+`left_node` (body) root. This is the same subtree-scoping primitive used for
+`RACE`-loser cancellation, just applied to "the whole body" instead of "the losing
+branch."
+
+```mermaid
+sequenceDiagram
+    participant L as LOOP node
+    participant B as Body subgraph
+    Note over L,B: iteration N
+    L->>B: reset body subgraph → pending (clear result/reason)
+    B->>B: run body; RACE/IF reconcile losers/untaken → terminal (this iteration)
+    L->>L: evaluate condition
+    alt continue
+        L->>L: continue_as_new (iteration N+1)
+        Note over L,B: iteration N+1 — body reset again, prior marks cleared
+    else exit
+        Note over L,B: last iteration's marks are the final observable state
+    end
+```
+
+When the loop finally exits (via `df.break()` or a false condition), the **last**
+iteration's node states are preserved as the final, correct observable result —
+nothing resets them after the loop ends.
+
+### Determinism
+
+The reset is performed by a **scheduled activity** (a set-based `UPDATE` over the
+body subtree), so the orchestration stays deterministic. Within a single execution,
+replay reads the reset's result from history. Across `continue_as_new`, the next
+iteration is a fresh execution whose history records the reset once. Both are
+replay-safe. As with all reconciliation activities, the reset must no-op gracefully
+against schemas that predate the relevant columns (`status_reason`).
+
+### Why not per-iteration history in `df.nodes`
+
+Resetting means `df.nodes` shows only the *current/last* iteration — it deliberately
+does **not** retain a per-iteration audit trail. That is the right trade-off for the
+operational question these tables answer ("what is the run doing / where did it
+stop"). Full per-iteration history already lives in the duroxide event history; if a
+first-class per-execution projection is ever wanted, model node *executions*
+separately (e.g. a `df.node_executions` table keyed by `(node_id, iteration)`) and
+keep `df.nodes` as the static definition + latest status. That is explicitly out of
+scope here and noted as a future option.
+
+### Prior art: how Airflow and Temporal handle loop/iteration state
+
+Both reference systems avoid the stale-iteration problem the same way — by giving
+each iteration its **own identity** rather than overwriting one mutable status. The
+contrast is instructive because it frames the projection decision below.
+
+**Apache Airflow.** A DAG is *acyclic* — there are no in-graph loops. "Iteration" is
+modeled by minting new identities, never by overwriting:
+
+- TaskInstance identity is `(dag_id, task_id, run_id, map_index)`. The repeated-run
+  unit is the **DagRun** (one per schedule/trigger); each gets a fresh set of
+  TaskInstance rows keyed by `run_id`, so re-running a task is a *different row*.
+- **Dynamic Task Mapping** (`map_index`) models for-each/fan-out inside one run:
+  a task expands at runtime into N mapped TaskInstances, each its own row. Fan-out is
+  more rows, not mutated state.
+- Even retries are non-destructive: a TaskInstance tracks `try_number` and recent
+  versions persist prior attempts in a `TaskInstanceHistory` table.
+- True while-loops aren't first-class; you self-trigger a new DagRun (a new
+  `run_id`).
+
+→ Airflow's projection is *per-iteration by construction*; "loop body state for
+iteration 3" is just `WHERE run_id = …` / `map_index = …`.
+
+**Temporal** (same durable-execution family as duroxide). Loops are ordinary code
+loops; state is an **append-only Event History**, not a mutable table:
+
+- Each activity invocation emits distinct events (`ActivityTaskScheduled` →
+  `Started` → `Completed/Failed`) with unique, increasing `eventId`s. Looping five
+  times yields five event triples — nothing is overwritten; the activity's identity
+  is its sequence number.
+- **`ContinueAsNew`** (the direct analog of duroxide's `continue_as_new` and the
+  pg_durable loop boundary) closes the current execution and starts a fresh one with
+  a new `RunId` and empty history, carrying state forward as input. Segments chain
+  via `continuedExecutionRunId`/`firstExecutionRunId`, preserving a per-segment audit
+  trail.
+- Parallel/raced branches are **child workflows** with their own execution, history,
+  and terminal status. A race-loser is explicitly **`Canceled`** (a terminal event),
+  never a lingering `running`.
+
+→ Temporal's "node state" is an immutable log keyed by `eventId`; loops add events,
+`ContinueAsNew` segments runs by `RunId`, branch state is a child's terminal status.
+
+**Implication for pg_durable.** The difference is that `df.nodes.status` is a single
+**mutable** row updated in place — a shape neither system uses. Crucially, pg_durable
+*already has* the Temporal-style answer underneath: duroxide keeps an append-only
+event history and segments loops via `continue_as_new`. `df.nodes` is just a
+denormalized SQL **projection** on top of that log for operator queries. So the real
+fork is the projection shape, not the state tracking:
+
+| Projection option | Analog | Buys | Cost |
+|---|---|---|---|
+| **Latest-only row + reset on `continue_as_new`** (this proposal) | neither — pragmatic simplification | Simple "what is the run doing now" view | No per-iteration history in SQL (still in duroxide log) |
+| **Per-execution rows** — `df.node_executions` keyed by `(node_id, iteration)` | Airflow's `(task_id, run_id, map_index)` | Full per-iteration history in SQL; *eliminates* the stale-state bug class | New table, more writes, larger schema change |
+| **Expose duroxide history directly** | Temporal event history | No duplication; single source of truth | Not relational/queryable like `df.nodes` |
+
+The proposal takes the **pragmatic middle**: keep the single-row projection, accept
+"latest iteration only," and delegate per-iteration history to duroxide's log. The
+**per-execution-rows** direction is the more future-proof, Airflow-style design — it
+makes the iteration-boundary reset *unnecessary* (nothing is overwritten, so nothing
+goes stale) and expresses "race-loser in iteration 2" as just another execution row —
+at the cost of a heavier schema. This is the key long-term design fork to decide
+before committing to the reset mechanism.
+
+## How this resolves the two issues
+
+### #240 — downstream steps after a failure
+
+When a run terminates with a node-level failure, remaining un-started nodes should
+become `skipped` with reason `upstream_failed`. The `skipped` status (#240, shipped
+in PR #249) already implements the set-based sweep (`pending → skipped` guarded by
+"an instance node is `failed`"). Under this model that sweep should additionally
+stamp `status_reason = 'upstream_failed'`.
+
+Open refinement: the coarse sweep also catches `IF`/`LOOP` branches that were
+*deliberately* not taken. With the reason column those should ideally be classified
+`branch_not_taken` at the point the branch decision is made (in `execute_if_node` /
+`execute_loop_node`), rather than lumped as `upstream_failed` by the terminal sweep.
+
+### #171 — race-loser nodes left running/pending
+
+`RACE` runs each branch as a **sub-orchestration** (`schedule_sub_orchestration`) and
+`select2` keeps only the winner. The loser's sub-orchestration nodes are never moved
+to a terminal state. The fix is a **reconciliation at race resolution**: from the
+losing branch's root node, mark every descendant still in `pending`/`running` as
+`cancelled` with reason `race_lost`. This is a distinct code path from #240 because:
+
+- it triggers on race resolution, not on terminal instance failure;
+- it must handle `running → cancelled`, not just `pending → skipped`;
+- it needs subtree scoping (a recursive walk over `left_node`/`right_node` from the
+  losing root), not an instance-wide sweep.
+
+The same subtree-cancellation primitive generalizes to any future explicit
+cancellation scope (`scope_cancelled`).
+
+When the `RACE` is **inside a `LOOP`**, the loser marks apply only to the current
+iteration and are wiped by the iteration-boundary reset before the next iteration
+re-runs the race (see [Loops and node-state reuse](#loops-and-node-state-reuse-iteration-scoping)).
+Only the final iteration's loser marks survive as the observable result.
+
+## Schema & implementation implications
+
+- **Status set:** `skipped` is already added (done in PR #249 for #240); add
+  `cancelled` to `nodes_status_chk` in both the install DDL
+  ([src/lib.rs](../src/lib.rs)) and a new upgrade script, per the upgrade contract.
+- **Reason column:** add `status_reason TEXT` to `df.nodes` (nullable, optionally a
+  `CHECK` against the reason vocabulary). New column ⇒ install DDL + upgrade DDL;
+  binary backward-compat is straightforward (new `.so` only writes it when the
+  column exists, mirroring the existing schema-probe pattern in
+  `mark_pending_nodes_skipped`).
+- **Reconciliation points:**
+  - terminal failure sweep → `skipped/upstream_failed` (#240),
+  - `IF`/`LOOP` branch decision → `skipped/branch_not_taken`,
+  - `RACE` resolution → subtree `cancelled/race_lost` (#171).
+- **Backward compatibility:** every reconciliation must no-op gracefully on schemas
+  that predate the relevant status/column (probe first, as the current activity
+  already does for `skipped`).
+
+## Backward-compatibility note (history determinism)
+
+Reconciliation work is performed by scheduling **activities** from the orchestration,
+which keeps orchestration code deterministic. Adding such a scheduled activity to an
+existing failure/resolution path changes the recorded history for any orchestration
+that crosses that path *during* a binary upgrade — a narrow window, but worth calling
+out when each reconciliation is introduced.
+
+## Open questions
+
+1. Should a `RACE` loser that had already `completed` a node keep that node
+   `completed`, or also be rolled to `cancelled`? (Proposed: keep terminal nodes as-is;
+   only `pending`/`running` loser nodes become `cancelled`.)
+2. Do we expose `status_reason` in `df.status()` / `df.instance_nodes` output and the
+   `USER_GUIDE.md` legend? (Proposed: yes, as an optional column + a legend row.)
+3. Should `JOIN` cancel still-running siblings when one branch fails, or let them
+   finish? (Today they finish; Step Functions would abort them. Out of scope here but
+   the `cancelled/scope_cancelled` primitive would cover it.)
+4. Loop-body reset granularity: reset the **entire** body subgraph at each iteration
+   start (proposed — simplest and always-correct), or reset lazily/only the nodes
+   that won't be re-visited? The eager whole-subtree reset is preferred because the
+   set of re-visited nodes is not known ahead of a `RACE`/`IF` decision.
+5. Do we want a first-class per-iteration projection (`df.node_executions`) for audit,
+   or is "latest iteration in `df.nodes` + duroxide history" sufficient? (Proposed:
+   the latter, for now.)
+
+[#240]: https://github.com/microsoft/pg_durable/issues/240
+[#171]: https://github.com/microsoft/pg_durable/issues/171

From b54774e2c590e74e6b4cec1120fa4c215581a365 Mon Sep 17 00:00:00 2001
From: Pino de Candia <32303022+pinodeca@users.noreply.github.com>
Date: Tue, 23 Jun 2026 16:27:48 +0000
Subject: [PATCH 2/2] docs: fix mermaid sequence-diagram parse error (semicolon
 treated as separator)

---
 docs/node-state-model.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/node-state-model.md b/docs/node-state-model.md
index 61a738c..4726db6 100644
--- a/docs/node-state-model.md
+++ b/docs/node-state-model.md
@@ -235,7 +235,7 @@ sequenceDiagram
     participant B as Body subgraph
     Note over L,B: iteration N
     L->>B: reset body subgraph → pending (clear result/reason)
-    B->>B: run body; RACE/IF reconcile losers/untaken → terminal (this iteration)
+    B->>B: run body, RACE/IF reconcile losers/untaken → terminal (this iteration)
     L->>L: evaluate condition
     alt continue
         L->>L: continue_as_new (iteration N+1)