zodan failover-slot support on PG17+ and Phase 9 sub_create stabilization (#461)
Conversation
Phase 9 created the new→existing subscriptions already enabled and returned without confirming that any peer had begun applying. Fix: mirror the sync handshake from Phases 6/8.
📝 Walkthrough

Probes the remote server_version_num to choose the correct pg_create_logical_replication_slot signature, adds a clarifying comment in the ZODAN subscription loop, preserves and restores the caller's memory context when copying ErrorData in subscriber-init error handling, and adds a regression check that a standby in recovery remains queryable with Spock loaded.
Codacy: Up to standards ✅

| Metric | Results |
|---|---|
| Issues | 🟢 |
| Duplication | 0 |
🧹 Nitpick comments (1)
samples/Z0DAN/zodan.sql (1)
1780-1811: 💤 Low value | Sync handshake correctly mirrors the Phase 6/8 pattern.

The added `sync_event` emit on the new node followed by `wait_for_sync_event` on each subscriber `rec.dsn` is the right shape: provider-side emit, subscriber-side wait. Together with `enabled := true` on `create_sub` above, this guarantees Phase 9 cannot return until each existing node's apply worker has actually progressed past a known LSN from the new node. The defensive `sync_ok IS NULL OR sync_ok::boolean IS NOT TRUE` check and the `EXCEPTION WHEN OTHERS` wrappers are consistent with the rest of the file.

One small consideration (non-blocking): since every iteration uses the same provider (the new node) for its respective subscription, you could emit `spock.sync_event()` on the new node a single time after the `create_sub` loop and then run only the per-subscriber `wait_for_sync_event` calls in the loop. That trims O(N) WAL sync events on the new node down to O(1) and still gives each subscriber a reachable target. The current per-iteration form has the upside of failing faster on a specific subscriber, so this is a tradeoff rather than a defect.

Note on timeout naming: the variable `timeout_ms` is misleading. `wait_for_sync_event()` compares the timeout parameter directly to `EXTRACT(EPOCH FROM ...)`, which returns elapsed time in seconds, so the value 180, paired with `-- 3 minutes` in comments across the file (lines 1749, 1591, 1677, 1903, 2227), is correctly interpreted as 180 seconds, not 180 milliseconds. The handshake will not time out prematurely. Still, the naming should be clarified: consider renaming `timeout_ms` to `timeout_s` or `timeout` to avoid future confusion, as this pre-existing pattern is used elsewhere in the file.

♻️ Optional refactor: emit sync_event once outside the loop
```diff
+    -- Emit one sync_event on the new node after all subscriptions are created;
+    -- each subscriber will wait for the same target LSN.
+    BEGIN
+        remotesql := 'SELECT spock.sync_event();';
+        SELECT * FROM dblink(new_node_dsn, remotesql) AS t(lsn pg_lsn) INTO sync_lsn;
+        RAISE NOTICE '    OK: %', rpad('Triggered sync_event on new node ' || new_node_name || ' (LSN: ' || sync_lsn || ')...', 120, ' ');
+    EXCEPTION
+        WHEN OTHERS THEN
+            RAISE EXCEPTION '    ✗ %', rpad('Triggering sync_event on new node ' || new_node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
+    END;
+
+    FOR rec IN SELECT * FROM temp_spock_nodes WHERE node_name != new_node_name LOOP
+        BEGIN
+            remotesql := format('CALL spock.wait_for_sync_event(true, %L, %L::pg_lsn, %s, true);',
+                                new_node_name, sync_lsn, timeout_ms);
+            SELECT * INTO sync_ok FROM dblink(rec.dsn, remotesql) AS t(result text);
+            IF sync_ok IS NULL OR sync_ok::boolean IS NOT TRUE THEN
+                RAISE EXCEPTION 'wait_for_sync_event timed out for % on node %', new_node_name, rec.node_name;
+            END IF;
+        EXCEPTION
+            WHEN OTHERS THEN
+                RAISE EXCEPTION '    ✗ %', rpad('Unable to wait for sync event from ' || new_node_name || ' on node ' || rec.node_name || ' (error: ' || SQLERRM || ')', 120, ' ');
+        END;
+    END LOOP;
```

🤖 Prompt for AI Agents
```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@samples/Z0DAN/zodan.sql` around lines 1780-1811: The timeout variable name
is misleading: wait_for_sync_event() expects seconds but the code uses the
identifier timeout_ms; rename the identifier (e.g. timeout_ms -> timeout_s or
timeout) everywhere it is declared and referenced (including where format()
builds the CALL for spock.wait_for_sync_event and any comments near
create_sub/Phase 9) so the name matches units, update the in-line comments that
currently say "-- 3 minutes" to reference seconds if needed, and run a
project-wide search/replace for timeout_ms to keep all callers/uses (e.g., the
format(...) call that builds the CALL to spock.wait_for_sync_event and any
other functions referencing timeout_ms) consistent.
```
`create_replication_slot` now detects the remote PG version and passes `failover := true` on PG17+, matching `spock_sync.c`.
Also revert the Phase 9 sync_event handshake; it deadlocks against cswp.
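For illustration, here is a minimal PL/pgSQL sketch of that version probe, assuming a dblink-style remote call as used elsewhere in zodan.sql; the DSN, slot name, and the `DO` wrapper are hypothetical placeholders, not the actual `create_replication_slot` body.

```sql
-- Sketch only: probe the remote server version over dblink (the extension
-- zodan.sql already relies on) and pick the matching
-- pg_create_logical_replication_slot() signature.
-- node_dsn and slot_name are illustrative placeholders.
DO $$
DECLARE
    node_dsn   text := 'host=localhost dbname=postgres';  -- placeholder DSN
    slot_name  text := 'spk_example_slot';                -- placeholder name
    remote_ver int;
    remotesql  text;
BEGIN
    SELECT v INTO remote_ver
      FROM dblink(node_dsn, 'SHOW server_version_num') AS t(v int);

    IF remote_ver >= 170000 THEN
        -- PG17+ signature: (name, plugin, temporary, twophase, failover)
        remotesql := format(
            'SELECT slot_name, lsn FROM pg_create_logical_replication_slot(%L, ''spock_output'', false, false, true)',
            slot_name);
    ELSE
        remotesql := format(
            'SELECT slot_name, lsn FROM pg_create_logical_replication_slot(%L, ''spock_output'')',
            slot_name);
    END IF;

    PERFORM * FROM dblink(node_dsn, remotesql) AS t(slot_name name, lsn pg_lsn);
END;
$$;
```

The `170000` cutoff corresponds to PG17, the first release in which `pg_create_logical_replication_slot()` accepts the fifth `failover` argument.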
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
samples/Z0DAN/zodan.sql (1)
1416-1431: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Phase 3 slot creation bypasses the version check: PG17+ "other" nodes get slots without `failover=true`.

`create_disable_subscriptions_and_slots` (Phase 3) builds the `pg_create_logical_replication_slot` call directly at line 1420 using the unconditional 2-argument form, bypassing the version-aware `spock.create_replication_slot` procedure. On PG17+ "other" nodes this creates slots without `failover=true`, inconsistent with what `create_replication_slot` now does in Phase 4. All slots in a ZODAN-built cluster should have the same failover posture.

🐛 Proposed fix: add a version check consistent with `create_replication_slot`
```diff
-            remotesql := format('SELECT slot_name, lsn FROM pg_create_logical_replication_slot(%L, ''spock_output'');', slot_name);
+            DECLARE
+                _remote_ver int;
+            BEGIN
+                SELECT v INTO _remote_ver
+                  FROM dblink(rec.dsn, 'SHOW server_version_num') AS t(v int);
+
+                -- Branch inside this block so _remote_ver remains in scope.
+                IF _remote_ver >= 170000 THEN
+                    remotesql := format(
+                        'SELECT slot_name, lsn FROM pg_create_logical_replication_slot(%L, ''spock_output'', false, false, true)',
+                        slot_name);
+                ELSE
+                    remotesql := format(
+                        'SELECT slot_name, lsn FROM pg_create_logical_replication_slot(%L, ''spock_output'')',
+                        slot_name);
+                END IF;
+            END;
```

Alternatively, refactor Phase 3 to call `spock.create_replication_slot` directly (the same helper already used in Phase 4 via `create_replication_slots`) to avoid this duplication entirely.

🤖 Prompt for AI Agents
```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@samples/Z0DAN/zodan.sql` around lines 1416-1431: Phase 3's
create_disable_subscriptions_and_slots currently builds a raw
pg_create_logical_replication_slot call (using spock.spock_gen_slot_name and
remotesql) which bypasses the PG-version-aware behavior and omits failover=true
on PG17+ nodes; change it to honor the same version check used by
spock.create_replication_slot (or simply call
spock.create_replication_slot/create_replication_slots instead of building the
two-argument pg_create_logical_replication_slot SQL) so slots created in Phase 3
include failover=true when appropriate; update the code in
create_disable_subscriptions_and_slots to either branch on the server version
(and use the correct 5-argument form with failover=true for PG17+) or delegate
to the existing spock.create_replication_slot helper to ensure consistent
failover posture across phases.
```
🧹 Nitpick comments (1)
src/spock_sync.c (1)
1498-1516: ⚡ Quick win

`FlushErrorState()` + `PG_RE_THROW()` should be replaced with `ReThrowError(edata)`.

After `FlushErrorState()` (line 1498) the error stack is empty (`errordata_stack_depth == -1`). The subsequent `PG_RE_THROW()` (line 1516) longjmps to the outer handler, which will find no valid error entry and fall back to a generic `ERRCODE_INTERNAL_ERROR` with an empty message; the original error details are silently lost even though they were captured in `edata`. The correct PostgreSQL idiom after `CopyErrorData()` + `FlushErrorState()` is `ReThrowError(edata)`, which pushes `edata` back onto the error stack before re-throwing. This is a pre-existing issue, but since this PR now holds `edata` in scope, it is a low-effort fix.

♻️ Proposed fix
```diff
-        FreeErrorData(edata);
         MemoryContextSwitchTo(savecxt);
-        PG_RE_THROW();
+        ReThrowError(edata);   /* push edata back onto the stack, then longjmp */
```

`FreeErrorData(edata)` has to go as well: `ReThrowError` reads from `edata`, so it must still be valid at the point of the re-throw.

Note: `ReThrowError` calls `FlushErrorState` internally, so you can also drop the explicit `FlushErrorState()` call on line 1498 and keep both `edata` and the error state in sync.

🤖 Prompt for AI Agents
```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/spock_sync.c` around lines 1498-1516: The error handler currently calls
FlushErrorState() then PG_RE_THROW(), which clears the error stack and causes
the rethrow to lose the original edata; replace this sequence by calling
ReThrowError(edata) (or remove the explicit FlushErrorState() and call
ReThrowError(edata)) so that the saved edata is pushed back onto the error stack
before re-throwing; update the block around edata handling in spock_sync.c to
call ReThrowError(edata) instead of FlushErrorState() + PG_RE_THROW(), keeping
FreeErrorData(edata) / MemoryContextSwitchTo(savecxt) semantics as appropriate.
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:

In `@tests/tap/t/018_failover_slots.pl`:
- Around line 449-455: The test currently only checks $write_rc from
  system(...) and discards stderr, so change the invocation to capture command
  output (both stdout and stderr) instead of using system; for example run the
  psql command with backticks or qx(...) and assign output to a variable like
  $write_out and set $write_rc from $?. Then keep the existing non-zero
  assertion on $write_rc and add an assertion that $write_out matches a
  read-only error (e.g. /read[- ]?only|cannot execute INSERT in a read-only
  transaction/i) so the test verifies the failure reason for the "Write against
  read-only standby is rejected (read-only enforced)" case.
```
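A minimal Perl sketch of the capture-and-assert pattern this prompt describes; the connection string, table name, and exact psql invocation are hypothetical placeholders, since the test's real command line is not shown here.

```perl
use strict;
use warnings;
use Test::More;

# Sketch only: run the write through qx() so stdout and stderr are both
# captured, unlike system(), which discards them. The connection string and
# SQL below are illustrative placeholders, not the test's real values.
my $standby_connstr = 'port=5433 dbname=postgres';    # placeholder
my $write_out = qx{psql "$standby_connstr" -c "INSERT INTO t VALUES (1);" 2>&1};
my $write_rc  = $?;

# Keep the existing non-zero exit assertion ...
isnt($write_rc, 0,
    'Write against read-only standby is rejected (read-only enforced)');

# ... and additionally assert on the failure reason, per the review comment.
like($write_out,
    qr/read[- ]?only|cannot execute INSERT in a read-only transaction/i,
    'failure is due to read-only enforcement, not an unrelated error');

done_testing();
```

`qx()` sets `$?` just like `system()`, so the existing exit-code check carries over unchanged; the only behavioral difference is that the command's output is now available for the regex assertion.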