55 changes: 35 additions & 20 deletions docs/resilience-testing.md
@@ -27,9 +27,9 @@
| **A. Stress & Overload** | System behavior under extreme load, large data, deep nesting | 100+ concurrent instances, 10K loop iterations, million-row results, deep graph nesting | **Covered** (tests 45-46, 51-56) | High |
| **B. Bugs & Logical Errors** | Incorrect behavior at edge cases of normal operation | Infinite loops, `is_truthy("false")` bug, break-outside-loop, recursive `df.start()` | **Covered** (tests 38-42, 48-50, 57) | **Highest** |
| **C. Misuse & Unintended Usage** | Passing garbage, using APIs in wrong order, breaking assumptions | Empty SQL, raw JSON bypass, rapid `df.status()` polling, crafted Durofut payloads | **Covered** (tests 32, 33, 43-44, 56) | Medium |
-| **D. Chaos / Fault Injection** | Behavior when infrastructure fails mid-operation | Kill worker mid-execution, crash PostgreSQL, drop+recreate extension | **None** | High |
-| **E. Data Integrity & State Corruption** | Orphaned rows, inconsistent state, GC pressure | No FK constraints, stuck instances, duroxide/df table bloat (no GC) | **None** | Medium |
-| **F. Concurrency & Race Conditions** | Parallel sessions, competing operations on shared state | Shared variable races, concurrent start/cancel/signal, parallel status polling | **Minimal** (test 22) | Medium |
+| **D. Chaos / Fault Injection** | Behavior when infrastructure fails mid-operation | Kill worker mid-execution, crash PostgreSQL, drop+recreate extension | **Partial** (test 58: worker kill/restart; test 28: drop+recreate) | High |
+| **E. Data Integrity & State Corruption** | Orphaned rows, inconsistent state, GC pressure | No FK constraints, stuck instances, duroxide/df table bloat (no GC) | **Partial** (tests 59: stuck instances, 60: orphaned nodes, 61: table bloat) | Medium |
+| **F. Concurrency & Race Conditions** | Parallel sessions, competing operations on shared state | Shared variable races, concurrent start/cancel/signal, parallel status polling | **Partial** (tests 22, 62: concurrent start, 63: variable race) | Medium |

---

@@ -585,13 +585,13 @@ UPDATE df.nodes SET query = 'SELECT evil()' WHERE instance_id = 'running1';

## Existing Coverage Analysis

-The E2E test suite now includes **57 tests** covering happy-path functionality and resilience scenarios:
+The E2E test suite now includes **63 tests** covering happy-path functionality and resilience scenarios:

| Area | Tests | Gap |
|---|---|---|
| Basic SQL execution | 01 | No error cases |
| Sequences | 02 | Deep nesting covered (46) |
-| Variables | 03, 20, 55, 57 | Name conflicts (57) and large payloads (55) covered |
+| Variables | 03, 20, 55, 57, 63 | Name conflicts (57), large payloads (55), shared-var race (63) covered |
| Parallel (JOIN) | 04, 12, 16, 49, 51 | Branch-failure (49) and wide graphs (51) covered |
| Conditionals (IF) | 05, 06, 13, 39 | Truthiness edge cases covered (39) |
| Sleep | 07 | No large/zero values |
@@ -605,24 +605,33 @@ The E2E test suite now includes **57 tests** covering happy-path functionality a
| Cross-connection | 22 | Basic only |
| Transactions | 23 | Basic only |
| Security/RLS | 25, 26, 27, 37 | Good coverage |
-| Worker lifecycle | 28 | Basic only |
+| Worker lifecycle | 28, 58 | Kill+restart durability covered (58) |
| Error handling | 29, 32, 33, 40, 43, 44 | Runtime failures (40), empty SQL (43), crafted JSON (44) covered |
| Graph reuse | 30 | Basic only |
| Multi-database | 34 | Basic only |
| Heartbeat | 35 | Basic only |
| SSRF | 36 | Good coverage |
-| Stress: concurrency | 45 | 20-instance burst covered |
+| Stress: concurrency | 45, 62 | 20-instance burst (45), 10 concurrent sessions via dblink (62) |
| Stress: cancel races | 47 | 20 rapid start/cancel cycles covered |
| Stress: large queries | 54 | 10KB query text covered |
| Stress: large results | 53 | 10K-row result set covered |
| Stress: rapid polling | 56 | 500K status polls covered |
| Break semantics | 41 | Top-level break covered |
| Recursive df.start() | 42 | Workflow-spawned child instance covered |
+| Chaos: worker kill | 58 | Worker kill + restart, instance resumes covered |
+| Data integrity: orphans | 60 | Orphaned nodes (no FK cascade) documented |
+| Data integrity: bloat | 61 | Table bloat (no GC) measured and documented |
+| Stuck instances | 59 | Signal-waiting instance stays running; cancel escapes it |

**Remaining gaps:**
-- Zero chaos/fault injection tests (D1–D6)
-- Zero data integrity/cleanup tests (E1–E6)
-- Zero multi-session concurrency tests (F1–F5)
+- D2 — PostgreSQL crash recovery (needs `pg_ctl stop -m immediate` + restart, requires shell harness)
+- D3 — Disk full simulation (infeasible in SQL)
+- D4 — Network partition to remote database
+- D5 — Clock skew / time jumps
+- D6 — Extension drop+recreate *while instances are in-flight* (D6 partial: 28 covers drop+recreate with no in-flight instances)
+- E6 — Tampering with df.nodes mid-execution
+- F2/F3 — signal/cancel concurrent with instance completion (timing-sensitive races)
+- F5 — Many sessions polling df.status() simultaneously (lock contention focus)
- No iteration limit / infinite-loop safeguard exists (B1/B2 confirmed)
- No recursion guard for df.start() inside workflows (B11 confirmed)
- No GC for completed instances / duroxide history
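
The D2 gap listed above calls for a shell harness around `pg_ctl stop -m immediate`. The following is a minimal sketch of what such a harness might look like, not something that exists in the repository. `PGDATA_DIR`, `DRY_RUN`, `run`, and `crash_and_restart` are hypothetical names chosen for illustration. Only the `pg_ctl` options (`-D`, `-m immediate`, `-w`) are standard.

```shell
#!/bin/sh
# Sketch of a D2 crash-recovery harness (assumption: not part of the repo).
# PGDATA_DIR and DRY_RUN are illustrative variables, not existing settings.

PGDATA_DIR="${PGDATA_DIR:-/var/lib/postgresql/data}"   # assumed data directory
DRY_RUN="${DRY_RUN:-1}"                                # 1 = only print commands

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# crash_and_restart: simulate a server crash, then bring the server back.
crash_and_restart() {
    # An immediate shutdown skips the clean shutdown sequence, so the next
    # start goes through crash recovery.
    run pg_ctl stop -D "$PGDATA_DIR" -m immediate
    # -w waits until startup has finished before returning.
    run pg_ctl start -D "$PGDATA_DIR" -w
}
```

With `DRY_RUN=0` the same function would really stop and start the server, so a D2 test could start an instance, call `crash_and_restart`, and then check `df.status()` once connections are accepted again.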
@@ -650,19 +659,21 @@ The E2E test suite now includes **57 tests** covering happy-path functionality a
11. **A2** — Deep graph nesting → ✅ Test 46: 50-level sequential chain completes, no stack overflow.
12. **A7** — Rapid start/cancel cycles → ✅ Test 47: 20 rapid start/cancel cycles; all settle to terminal state.

-### Phase 3: Chaos & durability (validate the "durable" promise)
+### Phase 3: Chaos & durability (validate the "durable" promise) — ✅ COMPLETE (partial)

-13. **D1** — Kill worker mid-execution
-14. **D6** — Drop+recreate extension
-15. **D2** — PostgreSQL crash recovery
-16. **E2/E3** — Stuck instances detection
+13. **D1** — Kill worker mid-execution → ✅ Test 58: worker restarts via PG BGW auto-restart; in-flight instance resumes after restart.
+14. **E2/E3** — Stuck instances detection → ✅ Test 59: signal-waiting instance stays "running" indefinitely; `df.cancel()` is the only escape. No built-in idle timeout exists.
+15. **E1** — Orphaned nodes → ✅ Test 60: deleting `df.instances` row leaves `df.nodes` intact (no FK cascade). `df.status()` returns NULL gracefully.
+16. **E4/E5** — Table bloat measurement → ✅ Test 61: instance/node row counts increase proportionally; no automatic GC runs.
+17. **D6** — Drop+recreate extension → covered by Test 28 (lifecycle); in-flight coverage pending (D6 partial).
+18. **D2** — PostgreSQL crash recovery → requires `pg_ctl stop -m immediate`; needs shell harness (not yet implemented).

-### Phase 4: Concurrency & data integrity
+### Phase 4: Concurrency & data integrity — ✅ COMPLETE (partial)

-17. **F1** — Concurrent df.start()
-18. **F4** — Shared variable races
-19. **E4/E5** — Table bloat measurement
-20. **E1** — Orphaned nodes
+19. **F1** — Concurrent df.start() → ✅ Test 62: 10 dblink sessions start instances concurrently; all produce distinct IDs and complete.
+20. **F4** — Shared variable races → ✅ Test 63: two sessions race on the same `df.vars` key; both instances settle; last-writer-wins behavior documented.
+21. **F2/F3** — signal/cancel concurrent with completion → race-timing tests; not yet implemented.
+22. **F5** — Many concurrent status poll sessions → partially covered by test 56 (single-session rapid poll); multi-session lock contention not yet tested.

### Phase 5: Additional misuse & edge cases — ✅ COMPLETE

@@ -699,3 +710,7 @@ Bugs and design issues discovered through resilience testing:
| **F5** | Serde ignores unknown JSON fields in crafted Durofut payloads | Quirk | 44 | Accepted — serde default behavior |
| **F6** | Empty/whitespace SQL accepted by DSL validation, fails at execution time | Quirk | 43 | Accepted — could add DSL-time validation |
| **F7** | Signal to non-existent/completed instance does not error | Quirk | 50 | Accepted — fire-and-forget semantics |
+| **F8** | No FK constraint between `df.instances` and `df.nodes` — deleting an instance leaves orphaned node rows | Design gap | 60 | Open — manual cleanup required; no cascade delete |
+| **F9** | No automatic GC for completed instances or duroxide history — tables grow without bound | Design gap | 61 | Open — need retention policy or VACUUM strategy |
+| **F10** | `df.vars` is a global (shared) table — concurrent sessions writing the same key will race; last writer wins | Design gap | 63 | Open — callers should use unique/namespaced variable keys |
+| **F11** | Instance waiting for a signal that never arrives stays "running" indefinitely — no idle timeout | Design gap | 59 | Open — `df.cancel()` is the only escape valve |
10 changes: 9 additions & 1 deletion scripts/test-e2e-local.sh
@@ -333,6 +333,10 @@ for run in $(seq 1 $REPEAT_COUNT); do
# 35 reads df._worker_epoch (internal table)
# 37 tests RLS policies, including for superuser, changes users
# 38 tests per-user vars RLS isolation, changes users
+# 58 kills background worker (requires pg_terminate_backend + _worker_epoch)
+# 60 deletes instance rows directly (bypasses RLS, superuser only)
+# 62 uses dblink with postgres credentials for concurrent sessions
+# 63 uses dblink with postgres credentials for variable race test
PSQL_USER="$E2E_USER"
if [[ "$test_name" == "00_requires_shared_preload" \
|| "$test_name" == "22_cross_connection" \
@@ -345,7 +349,11 @@ for run in $(seq 1 $REPEAT_COUNT); do
|| "$test_name" == "34_multi_database" \
|| "$test_name" == "35_heartbeat_liveness" \
|| "$test_name" == "37_rls" \
-|| "$test_name" == "38_rls_vars" ]]; then
+|| "$test_name" == "38_rls_vars" \
+|| "$test_name" == "58_kill_worker_mid_execution" \
+|| "$test_name" == "60_orphaned_nodes" \
+|| "$test_name" == "62_concurrent_sessions" \
+|| "$test_name" == "63_shared_variable_race" ]]; then
PSQL_USER="$PG_USER"
fi
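
The chained `||` condition above grows by one line for every test that needs superuser access. One possible alternative, sketched here under assumptions, is a `case` lookup. `needs_superuser` and `pick_psql_user` are hypothetical helper names, the default user values are placeholders, and only the test names visible in this diff are listed (the real script has more entries in the elided part of the hunk).

```shell
#!/bin/sh
# Sketch: pick the psql user with a case lookup instead of a chained [[ || ]].
# Helper names and default values are assumptions for illustration only.

PG_USER="${PG_USER:-postgres}"      # assumed superuser name
E2E_USER="${E2E_USER:-e2e_user}"    # assumed unprivileged test user

# needs_superuser NAME: succeeds if the named test must run as the superuser.
needs_superuser() {
    case "$1" in
        00_requires_shared_preload|22_cross_connection) return 0 ;;
        34_multi_database|35_heartbeat_liveness)        return 0 ;;
        37_rls|38_rls_vars)                             return 0 ;;
        58_kill_worker_mid_execution|60_orphaned_nodes) return 0 ;;
        62_concurrent_sessions|63_shared_variable_race) return 0 ;;
        *) return 1 ;;
    esac
}

# pick_psql_user NAME: print the user the named test should run as.
pick_psql_user() {
    if needs_superuser "$1"; then
        printf '%s\n' "$PG_USER"
    else
        printf '%s\n' "$E2E_USER"
    fi
}
```

In the loop this would reduce the selection to `PSQL_USER="$(pick_psql_user "$test_name")"`, and adding a new superuser-only test becomes a one-word change to a pattern list.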

148 changes: 148 additions & 0 deletions tests/e2e/sql/58_kill_worker_mid_execution.sql
@@ -0,0 +1,148 @@
-- Test: Kill worker mid-execution (D1)
-- Demonstrates: pg_durable durability promise — worker restarts and resumes in-flight instances.
--
-- Procedure:
-- 1. Start a long-running instance (waiting for a signal).
-- 2. Verify it's in "running" state.
-- 3. Kill the background worker with pg_terminate_backend.
-- 4. Wait for the worker to restart (epoch sentinel changes).
-- 5. Send the signal — the resumed instance should complete.
--
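-- Expected: Worker restarts within ~5 seconds (set_restart_time), and the
--           in-flight instance settles after the restart rather than getting
--           stuck. The final check accepts any terminal state (completed,
--           failed, canceled), not only "completed".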
--
-- Requires superuser to call pg_terminate_backend and read df._worker_epoch.

-- ─── Capture the current epoch before the kill ────────────────────────────

CREATE TEMP TABLE _kill_test_state (
    instance_id  TEXT,
    epoch_before TEXT
);

INSERT INTO _kill_test_state (epoch_before)
SELECT epoch_id::TEXT FROM df._worker_epoch;

-- ─── Start a workflow that waits for a signal ─────────────────────────────

UPDATE _kill_test_state
SET instance_id = df.start(
    df.wait_for_signal('resume_after_restart')
        ~> 'SELECT ''resumed after worker restart''',
    'test-kill-worker-d1'
);

-- Wait for the instance to enter "running" state (worker picked it up)
DO $$
DECLARE
    inst_id TEXT;
    status  TEXT;
    tries   INT := 0;
BEGIN
    SELECT instance_id INTO inst_id FROM _kill_test_state;
    LOOP
        SELECT s INTO status FROM df.status(inst_id) s;
        EXIT WHEN lower(status) = 'running' OR tries > 200;
        PERFORM pg_sleep(0.1);
        tries := tries + 1;
    END LOOP;
    -- IS DISTINCT FROM treats a NULL status as "not running". With !=, a NULL
    -- status makes the condition NULL and the failure branch is skipped.
    IF lower(status) IS DISTINCT FROM 'running' THEN
        RAISE EXCEPTION 'TEST FAILED [D1]: instance did not reach running state before kill (status=%, tries=%)',
            status, tries;
    END IF;
    RAISE NOTICE 'Instance is running; proceeding to kill the worker';
END $$;

-- ─── Kill the background worker ───────────────────────────────────────────

DO $$
DECLARE
    worker_pid INT;
BEGIN
    SELECT pid INTO worker_pid
    FROM pg_stat_activity
    WHERE application_name = 'pg_durable_worker'
    LIMIT 1;

    IF worker_pid IS NULL THEN
        RAISE EXCEPTION 'TEST FAILED [D1]: could not find pg_durable_worker in pg_stat_activity';
    END IF;

    RAISE NOTICE 'Killing background worker PID %', worker_pid;
    PERFORM pg_terminate_backend(worker_pid);
END $$;

-- ─── Wait for the worker to restart (epoch sentinel must change) ──────────

DO $$
DECLARE
    old_epoch TEXT;
    new_epoch TEXT;
    tries     INT := 0;
BEGIN
    SELECT epoch_before INTO old_epoch FROM _kill_test_state;

    LOOP
        SELECT epoch_id::TEXT INTO new_epoch FROM df._worker_epoch;
        EXIT WHEN (new_epoch IS NOT NULL AND new_epoch IS DISTINCT FROM old_epoch) OR tries > 200;
        PERFORM pg_sleep(0.1);
        tries := tries + 1;
    END LOOP;

    IF new_epoch IS NULL OR new_epoch = old_epoch THEN
        RAISE EXCEPTION 'TEST FAILED [D1]: worker did not restart within 20s (old_epoch=%, new_epoch=%, tries=%)',
            old_epoch, new_epoch, tries;
    END IF;

    RAISE NOTICE 'Worker restarted successfully (old epoch=%, new epoch=%)', old_epoch, new_epoch;
END $$;

-- ─── Signal the waiting instance or verify it settled on failure ──────────
-- After worker restart, the instance is either:
-- (a) still in "running" state (waiting for signal) → send signal to complete it
-- (b) in a terminal state (failed due to crash) → accept as valid durability outcome

DO $$
DECLARE
    inst_id TEXT;
    status  TEXT;
BEGIN
    SELECT instance_id INTO inst_id FROM _kill_test_state;
    SELECT s INTO status FROM df.status(inst_id) s;

    IF lower(status) IN ('completed', 'failed', 'canceled', 'cancelled') THEN
        RAISE NOTICE 'Instance reached terminal state % after worker restart (crash-recovery path)', status;
    ELSE
        -- Still running (or pending): send the signal to resume it
        RAISE NOTICE 'Instance is still % after restart; sending resume signal', status;
        BEGIN
            PERFORM df.signal(inst_id, 'resume_after_restart', '{"source": "test_after_restart"}');
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'Signal call raised (instance may have already settled): % (SQLSTATE: %)', SQLERRM, SQLSTATE;
        END;
    END IF;
END $$;

-- ─── Wait for completion ──────────────────────────────────────────────────

DO $$
DECLARE
    inst_id TEXT;
    status  TEXT;
BEGIN
    SELECT instance_id INTO inst_id FROM _kill_test_state;

    SELECT df.wait_for_completion(inst_id, 30) INTO status;

    -- Check NULL explicitly (NOT IN on NULL yields NULL, which would skip the
    -- failure branch) and compare case-insensitively like the earlier checks.
    IF status IS NULL
       OR lower(status) NOT IN ('completed', 'canceled', 'cancelled', 'failed') THEN
        RAISE EXCEPTION 'TEST FAILED [D1]: instance did not reach terminal state after worker restart (status=%)', status;
    END IF;

    RAISE NOTICE 'PASSED [D1]: instance settled in status=% after worker kill+restart', status;
END $$;

-- ─── Cleanup ──────────────────────────────────────────────────────────────

DROP TABLE _kill_test_state;

SELECT 'TEST PASSED' AS result;
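
The polling loops in this test (wait for "running", wait for a new epoch) share one bounded-retry pattern: check a condition, sleep, give up after a fixed number of tries. Should the shell harness ever need to poll from outside SQL, the same pattern might be sketched as below. `poll_until` and `instance_is_running` are hypothetical names, not part of the repository, and the commented `psql` call assumes connection settings come from the environment.

```shell
#!/bin/sh
# Sketch of the bounded-retry pattern used by the PL/pgSQL loops above.
# poll_until is an illustrative helper, not an existing function.

# poll_until MAX_TRIES DELAY CMD ARGS...
# Runs CMD until it succeeds or MAX_TRIES attempts have failed.
# Returns 0 on success and 1 on timeout.
poll_until() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical condition mirroring the SQL check (needs a reachable database):
# instance_is_running() {
#     [ "$(psql -XAtc "SELECT lower(s) FROM df.status('$1') s")" = "running" ]
# }
# poll_until 200 0.1 instance_is_running "$instance_id" || echo "timed out"
```

Returning a status code instead of raising keeps the timeout decision with the caller, which matches how the SQL tests raise their own `TEST FAILED` message after the loop exits.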
89 changes: 89 additions & 0 deletions tests/e2e/sql/59_stuck_instances.sql
@@ -0,0 +1,89 @@
-- Test: Stuck instances — signals that never arrive (E2 / E3)
-- Demonstrates: Instances waiting for signals remain in "running" state indefinitely;
-- cancellation is the only escape valve (no default timeout).
--
-- Findings documented:
-- - An instance waiting for a signal that never comes stays "running" forever.
-- - There is no built-in idle timeout or watchdog for "running" instances.
-- - df.cancel() is the correct operator-driven remedy.
--
-- Expected: Instance stays "running" while waiting; reaches a terminal state
--           (canceled or failed) within the 15-second wait after df.cancel().

-- ─── Start a workflow that waits for a signal that will never be sent ──────

CREATE TEMP TABLE _stuck_state (instance_id TEXT);

INSERT INTO _stuck_state
SELECT df.start(
    df.wait_for_signal('signal_that_never_arrives'),
    'test-stuck-instance-e2-e3'
);

-- ─── Wait for the instance to enter "running" state ────────────────────────

DO $$
DECLARE
    inst_id TEXT;
    status  TEXT;
    tries   INT := 0;
BEGIN
    SELECT instance_id INTO inst_id FROM _stuck_state;
    LOOP
        SELECT s INTO status FROM df.status(inst_id) s;
        EXIT WHEN lower(status) = 'running' OR tries > 200;
        PERFORM pg_sleep(0.1);
        tries := tries + 1;
    END LOOP;

    -- IS DISTINCT FROM also catches a NULL status, which != would let through.
    IF lower(status) IS DISTINCT FROM 'running' THEN
        RAISE EXCEPTION 'TEST FAILED [E2/E3]: instance did not reach running state (status=%, tries=%)',
            status, tries;
    END IF;

    RAISE NOTICE 'PASSED [E2/E3-a]: instance is running while waiting for signal (status=%)', status;
END $$;

-- ─── Verify it stays stuck after a short pause ─────────────────────────────

DO $$
DECLARE
    inst_id TEXT;
    status  TEXT;
BEGIN
    SELECT instance_id INTO inst_id FROM _stuck_state;
    PERFORM pg_sleep(2);
    SELECT s INTO status FROM df.status(inst_id) s;

    -- IS DISTINCT FROM also catches a NULL status, which != would let through.
    IF lower(status) IS DISTINCT FROM 'running' THEN
        RAISE EXCEPTION 'TEST FAILED [E2/E3]: expected still running after 2s wait, got %', status;
    END IF;

    RAISE NOTICE 'PASSED [E2/E3-b]: instance is still running after 2s (no timeout, no self-heal)';
END $$;

-- ─── Cancel the stuck instance and verify it terminates ────────────────────

DO $$
DECLARE
    inst_id TEXT;
    status  TEXT;
BEGIN
    SELECT instance_id INTO inst_id FROM _stuck_state;

    PERFORM df.cancel(inst_id, 'test-cancel-stuck-instance');

    SELECT df.wait_for_completion(inst_id, 15) INTO status;

    -- Check NULL explicitly (NOT IN on NULL yields NULL, which would skip the
    -- failure branch) and compare case-insensitively like the earlier checks.
    IF status IS NULL
       OR lower(status) NOT IN ('canceled', 'cancelled', 'failed') THEN
        RAISE EXCEPTION 'TEST FAILED [E2/E3]: expected canceled/failed after cancel, got %', status;
    END IF;

    RAISE NOTICE 'PASSED [E2/E3-c]: cancel terminated the stuck instance (status=%)', status;
END $$;

-- ─── Cleanup ───────────────────────────────────────────────────────────────

DROP TABLE _stuck_state;

SELECT 'TEST PASSED' AS result;