Add resilience testing plan and 26 E2E tests #52

Closed
pinodeca wants to merge 3 commits into main from pinodeca/breakit

pinodeca commented Mar 11, 2026

Resilience Testing Plan + 26 E2E Tests for pg_durable

Comprehensive plan to systematically break pg_durable the way a real user might — finding resource limits, edge-case bugs, and failure modes not covered by existing E2E tests.

What's in this PR

Test plan: docs/resilience-testing.md — six testing categories with prioritized phased rollout.

26 new E2E tests (tests 38–63) covering Phases 1–5:

| Test | Plan item | What it tests |
| --- | --- | --- |
| 38 | B1/B2 | Infinite loop cancellation |
| 39 | B3 | Truthiness edge cases (NULL, 0, "false", "no", etc.) |
| 40 | B5/B6 | Empty/DML result $var substitution (bug found) |
| 41 | B10 | df.break() outside a loop |
| 42 | B11 | Recursive df.start() inside workflow (no guard found) |
| 43 | C1 | Empty/whitespace/invalid SQL |
| 44 | C7 | Manually crafted JSON bypassing DSL |
| 45 | A1 | 20 concurrent instances burst |
| 46 | A2 | 50-level deep graph nesting |
| 47 | A7 | 20 rapid start/cancel cycles |
| 48 | B8 | RACE where both branches fail |
| 49 | B9 | JOIN where one branch fails |
| 50 | B12/B13 | Signal edge cases (non-existent, completed, duplicate) |
| 51 | A3 | 9-branch wide parallel graph (3×3 join3) |
| 52 | A4 | 100-iteration loop history |
| 53 | A5 | 10K-row result set |
| 54 | A6 | ~10KB query text |
| 55 | A8 | 5KB variable payload |
| 56 | C5 | 500K rapid df.status() polls |
| 57 | B4/B14 | Variable name conflicts and result shadowing |
| 58 | D1 | pg_terminate_backend on BGW → PG auto-restarts in ≤5s; in-flight instance reaches terminal state |
| 59 | E2/E3 | Signal-waiting instance stays running indefinitely: no idle timeout; df.cancel() is the only exit |
| 60 | E1 | Deleting df.instances row leaves df.nodes rows intact (no FK cascade); df.status() returns gracefully |
| 61 | E4/E5 | Instance and node row counts grow proportionally after 10 runs; confirms zero automatic GC |
| 62 | F1 | 10 independent dblink connections each call df.start(); all produce distinct IDs and complete |
| 63 | F4 | Variable snapshot-at-start semantics verified; cross-session last-writer-wins on df.vars demonstrated |
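
The two behaviours named for test 63 (variables captured when an instance starts, and last-writer-wins on the shared store) can be illustrated with a small in-memory model. This is only a sketch: `VarStore` and `start_instance` are hypothetical names invented for illustration, and nothing here reflects how pg_durable actually stores or reads df.vars.

```python
class VarStore:
    """Hypothetical stand-in for a per-owner variable table such as df.vars."""

    def __init__(self):
        self._vars = {}

    def set(self, key, value):
        # No locking or versioning: a later write silently replaces an earlier one.
        self._vars[key] = value

    def snapshot(self):
        # Return a copy so later writes do not leak into earlier snapshots.
        return dict(self._vars)


def start_instance(store):
    """Model snapshot-at-start: the instance keeps the values seen at start time."""
    return {"vars": store.snapshot()}


store = VarStore()
store.set("target", "session_a")      # session A writes the key
instance = start_instance(store)      # instance captures its variables here
store.set("target", "session_b")      # session B overwrites the same key

print(instance["vars"]["target"])     # value the running instance still sees
print(store.snapshot()["target"])     # value a newly started instance would see
```

In this model the running instance keeps `session_a` while the store holds `session_b`, which is the combination the test description reports.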

Test runner improvements:

  • --keep-going / -k flag for test-e2e-local.sh to continue past failures and print a summary at the end.
  • Tests 58, 60, 62, 63 added to the superuser (postgres) override list — they require pg_terminate_backend, direct RLS-bypassing table access, or dblink with postgres credentials.
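
The keep-going behaviour described above follows a common runner pattern: stop at the first failure by default, or collect failures and report them at the end. The sketch below shows that pattern in Python for illustration only; it is not the code in test-e2e-local.sh, and `run_suite`, `summarize`, and the sample tests are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


def run_suite(tests: Dict[str, Callable[[], None]],
              keep_going: bool = False) -> Tuple[List[str], List[str]]:
    """Run tests in insertion order and return (passed, failed) test names."""
    passed: List[str] = []
    failed: List[str] = []
    for name, test in tests.items():
        try:
            test()
            passed.append(name)
        except Exception:
            failed.append(name)
            if not keep_going:
                break  # default behaviour: stop at the first failure
    return passed, failed


def summarize(passed: List[str], failed: List[str]) -> str:
    """Build the one-line summary printed at the end of a run."""
    line = f"{len(passed)} passed, {len(failed)} failed"
    if failed:
        line += ": " + ", ".join(failed)
    return line


def ok() -> None:
    pass


def boom() -> None:
    raise RuntimeError("simulated test failure")


suite = {"38": ok, "39": boom, "40": ok}
print(summarize(*run_suite(suite)))                   # 1 passed, 1 failed: 39
print(summarize(*run_suite(suite, keep_going=True)))  # 2 passed, 1 failed: 39
```

Without the flag, test 40 never runs because the loop stops at test 39; with it, every test runs and the failure is only reported in the summary.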

Key findings

| ID | Finding | Severity | Test |
| --- | --- | --- | --- |
| F1 | $var substitution of empty/0-row results produces unquoted JSON → syntax error | Bug | 40 |
| F2 | No iteration limit on df.loop(): infinite loops run forever | Design gap | 38 |
| F3 | No recursion guard on df.start(): can spawn unbounded child instances | Design gap | 42 |
| F4 | df.break() outside loop returns break sentinel as result (not an error) | Quirk | 41 |
| F5 | Serde ignores unknown JSON fields in crafted Durofut payloads | Quirk | 44 |
| F6 | Empty/whitespace SQL accepted by DSL, fails at execution time | Quirk | 43 |
| F7 | Signal to non-existent/completed instance silently succeeds | Quirk | 50 |
| F8 | No FK between df.instances / df.nodes: orphaned nodes accumulate silently | Design gap | 60 |
| F9 | No GC for completed instances or duroxide history: unbounded table growth | Design gap | 61 |
| F10 | df.vars is per-owner-global: concurrent sessions race on the same key (last writer wins) | Design gap | 63 |
| F11 | No idle timeout for signal-waiting instances: df.cancel() is the only escape | Design gap | 59 |
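
Finding F1 describes a class of bug that is easy to reproduce in miniature: a result value is spliced into SQL text as raw JSON, so a zero-row result leaves an unquoted token behind. The sketch below is a guess at the general shape of the problem, not pg_durable's substitution code; the `$prev` template, both function names, and the choice to model an empty result as `[]` are assumptions made for illustration.

```python
import json


def substitute_naive(template: str, name: str, value) -> str:
    """Splice the JSON text in unquoted, as in the failure mode F1 describes."""
    return template.replace(f"${name}", json.dumps(value))


def substitute_quoted(template: str, name: str, value) -> str:
    """Render the value as a single-quoted SQL string literal holding JSON."""
    if value is None:
        literal = "NULL"
    else:
        # Double any single quotes so the JSON text stays inside one SQL literal.
        literal = "'" + json.dumps(value).replace("'", "''") + "'"
    return template.replace(f"${name}", literal)


template = "SELECT $prev::jsonb"   # hypothetical query that uses a prior result
empty_result = []                  # assumed shape of a 0-row result

print(substitute_naive(template, "prev", empty_result))   # SELECT []::jsonb
print(substitute_quoted(template, "prev", empty_result))  # SELECT '[]'::jsonb
print(substitute_quoted(template, "prev", None))          # SELECT NULL::jsonb
```

The first statement is not valid SQL, while the other two parse. Where the host API allows it, passing the value as a bound parameter instead of editing SQL text would avoid the quoting question entirely.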

Remaining gaps (not addressed here)

  • Plan items D2 (PostgreSQL crash recovery), F2/F3 (signal/cancel vs. completion races), and F5 (concurrent df.status() lock contention) require a shell-level or timing-sensitive test harness beyond what SQL-only E2E tests support. These are test plan item IDs, not the finding IDs listed under Key findings.
  • D3 (Disk full simulation — infeasible in SQL)
  • D4 (Network partition to remote database)
  • D5 (Clock skew / time jumps)
  • D6 (Extension drop+recreate while instances are in-flight — partial: test 28 covers drop+recreate with no in-flight instances)
  • E6 (Tampering with df.nodes mid-execution)

@pinodeca

@copilot go ahead and implement the resilience testing plan proposed in this PR.

Copilot AI commented Mar 13, 2026

@pinodeca I've opened a new pull request, #56, to work on those changes. Once the pull request is ready, I'll request review from you.

@pinodeca

@copilot #56 implemented some of the resilience test plan but wasn't tested and then was merged into branch pinodeca/breakit, but this PR's description wasn't updated. Can you update it?

Regardless, continue implementing the resilience test plan. You may want to check the status of #56 - that's where we left off.

Copilot AI commented Mar 14, 2026

@pinodeca I've opened a new pull request, #62, to work on those changes. Once the pull request is ready, I'll request review from you.

pinodeca marked this pull request as draft March 14, 2026 15:06
pinodeca force-pushed the pinodeca/breakit branch 3 times, most recently from a620f3e to c5507a0 March 14, 2026 21:03
pinodeca changed the title from "Add resilience testing plan" to "Add resilience testing plan and 20 E2E tests" Mar 14, 2026
Comprehensive plan to stress-test, chaos-test, and find edge-case bugs
in pg_durable. Covers six testing categories:

- Stress & Overload (concurrent instances, deep nesting, large results)
- Bugs & Logical Errors (infinite loops, truthiness edge cases, recursive start)
- Misuse & Unintended Usage (empty SQL, raw JSON, rapid polling)
- Chaos / Fault Injection (kill worker, crash PG, drop+recreate extension)
- Data Integrity & State Corruption (orphaned nodes, stuck instances, bloat)
- Concurrency & Race Conditions (shared vars, concurrent start/cancel/signal)

Includes existing coverage gap analysis and prioritized phased rollout.
@pinodeca

@copilot continue executing the plan in docs/resilience-testing.md

Copilot AI commented Mar 14, 2026

@pinodeca I've opened a new pull request, #66, to work on those changes. Once the pull request is ready, I'll request review from you.

…ncurrency"

Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
pinodeca changed the title from "Add resilience testing plan and 20 E2E tests" to "Add resilience testing plan and 26 E2E tests" Mar 14, 2026
pinodeca closed this Apr 28, 2026
pinodeca deleted the pinodeca/breakit branch April 28, 2026 17:47