From 3f2a16159acd5cb6129acace4aaf5bf60037b2ba Mon Sep 17 00:00:00 2001 From: Mason Sharp Date: Tue, 21 Apr 2026 10:02:44 -0700 Subject: [PATCH 1/5] docs: design spec for eliminating spock.progress checkpoint hook Replace the Checkpoint_hook + resource.dat + custom WAL RMGR machinery with (1) replication origins as the source of truth for remote_commit_lsn, (2) a shutdown-only resource.dat snapshot with LSN-validated load, and (3) a forced keepalive on apply-worker reconnect. Removes the Postgres core patch and the extra per-commit XLogFlush. --- ...pock-progress-no-checkpoint-hook-design.md | 211 ++++++++++++++++++ 1 file changed, 211 insertions(+) create mode 100644 docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md diff --git a/docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md b/docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md new file mode 100644 index 00000000..efdba46d --- /dev/null +++ b/docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md @@ -0,0 +1,211 @@ +# Eliminating the Spock checkpoint hook and per-commit WAL flush for `spock.progress` + +**Date:** 2026-04-21 +**Status:** Design approved; pending implementation plan + +## Problem + +Spock currently maintains the `spock.progress` view via three coupled mechanisms: + +1. A custom WAL RMGR (`src/spock_rmgr.c`) that emits a `SPOCK_RMGR_APPLY_PROGRESS` record on every applied remote-origin commit and calls `XLogFlush` synchronously (in addition to the `XLogFlush` that core's `CommitTransactionCommand` already performs). +2. A `Checkpoint_hook` (registered in `src/spock.c:1248`) that dumps the in-memory `SpockGroupHash` to `PGDATA/spock/resource.dat` at every Postgres checkpoint. This requires a patch to Postgres core (`src/backend/access/transam/xlog.c` and `src/include/access/xlog.h`) to add the hook callout. +3. A load/redo path that seeds shmem from `resource.dat` during `shmem_startup_hook` and overwrites with WAL redo of `APPLY_PROGRESS` records during recovery. + +This design has two costs: + +- **A Postgres core patch** (4 lines across 2 files) must be maintained for each supported Postgres version. +- **An extra synchronous `XLogFlush` on every applied remote-origin commit**, purely to persist monitoring data — doubling the fsync pressure of the apply hot path. + +Investigation showed that Spock's apply worker already uses Postgres's native replication origins (`replorigin_session_get_progress`) to determine the resume LSN on reconnect. The correctness-critical "where do I resume from" question is handled entirely by stock logical-replication protocol. The checkpoint hook and WAL RMGR exist only to preserve additional monitoring-level fields (`remote_commit_ts`, `remote_insert_lsn`, `received_lsn`, `last_updated_ts`) across crashes. The checkpoint hook in particular was added (commit `a4b93e8`) to handle the case where apply workers are idle across a checkpoint — keepalive-driven advances of `remote_insert_lsn` are never WAL-logged, so the hook is the only path that preserves them. + +## Goal + +Remove the Postgres core patch (`Checkpoint_hook`) and the extra per-commit `XLogFlush`, while preserving user-visible `spock.progress` behavior across clean restarts. Accept a small regression in post-crash monitoring freshness (`remote_commit_ts` becomes NULL after a crash-with-new-commits, instead of showing stale-from-last-shutdown), in exchange for a substantially simpler implementation. + +## Non-goals + +- Changing the `spock.progress` column schema or the SQL-level API. +- Altering apply-worker resume semantics (which are already handled by Postgres replication origins and need no change). +- Removing or reshaping the `SpockGroupHash` shmem layout or the `SpockApplyProgress` struct. +- Attempting to recover `remote_commit_ts` per-origin from Postgres internals at startup. (Investigated: replication origins persist only `remote_lsn`, and Postgres exposes no efficient per-origin lookup in the commit-ts SLRU. Deferred as a follow-up.) + +## Design overview + +The new design keeps `resource.dat` as a shutdown-only snapshot and replaces WAL-based durability with two components: + +1. **Replication origins as the source of truth for `remote_commit_lsn`.** At apply-worker startup, read the origin's LSN and reconcile shmem against it. If `resource.dat` holds a stale LSN for an entry (file_lsn < origin_lsn), the file's timestamp fields for that entry are cleared to NULL — they can't be trusted. +2. **A forced keepalive on apply-worker (re)connect.** The subscriber sends a status-update message with `replyRequested=true` immediately after `spock_start_replication` returns. Postgres's walsender responds with a 'k' keepalive carrying its current `sentPtr`. The apply worker's existing keepalive handler populates `remote_insert_lsn` and `received_lsn` in shmem within milliseconds of reconnect, rather than waiting for `wal_sender_timeout / 2` (default 30s) to elapse. + +Together these handle every shmem field without WAL or checkpoint-time persistence: + +| Field | Source | When authoritative | +|---|---|---| +| `remote_commit_lsn` | Replication origin (`replorigin_session_get_progress`) | At apply-worker startup, always correct | +| `remote_commit_ts` | `resource.dat` if validated against origin LSN; else NULL until next commit | Preserved across clean restart; NULL after stale-detected crash; NULL on fresh first start | +| `remote_insert_lsn` | Forced 'k' keepalive | Within ~ms of apply worker connect | +| `received_lsn` | Same | Same | +| `last_updated_ts` | `GetCurrentTimestamp()` at each update, or from file | Always advances with events | +| `prev_remote_ts` | No longer written (field kept in struct; obsolete) | — | + +## Architecture + +### Startup flow + +**Phase 1 — `shmem_startup_hook` (pre-recovery).** + +Runs via `spock_group_shmem_startup()`, unchanged from today except that the spock RMGR no longer exists so WAL recovery has no Spock records to redo: + +1. Create the empty `SpockGroupHash` HTAB. +2. Call `spock_group_resource_load()` to load `PGDATA/spock/resource.dat` if present and valid (version and `system_identifier` match). Each record is upserted into the hash via `spock_group_progress_update()`. +3. **No WAL redo for Spock records** (the RMGR is deleted). + +At the end of phase 1, shmem contains either an empty hash or the contents of the last clean shutdown's `resource.dat`. Values may be stale if a crash occurred since that shutdown. + +**Phase 2 — per-apply-worker reconciliation.** + +In `spock_apply_main()`, after `replorigin_session_setup()` returns the session's origin but before `spock_start_replication()` is called, each worker reconciles its own shmem entry: + +1. Compute the shmem key: `(MyDatabaseId, MySubscription->target->id, MySubscription->origin->id)`. +2. Fetch `origin_lsn = replorigin_session_get_progress(false)`. +3. Acquire `SpockCtx->apply_group_master_lock` in exclusive mode; HASH_ENTER for the entry: + - **Entry absent.** Initialize with `remote_commit_lsn = origin_lsn`, other fields zero/NULL. + - **Entry present, `file_lsn == origin_lsn`.** File is in sync with the origin. Keep `remote_commit_ts` and `last_updated_ts` as loaded. + - **Entry present, `file_lsn < origin_lsn`.** File is stale for this entry. Overwrite `remote_commit_lsn` with `origin_lsn`; clear `remote_commit_ts` and `last_updated_ts` to NULL; leave `remote_insert_lsn` and `received_lsn` untouched (they'll be overwritten by the forced keepalive momentarily). + - **Entry present, `file_lsn > origin_lsn`.** Anomalous; shouldn't occur in normal flow. Log a warning and prefer `origin_lsn`. +4. Release the lock. + +Phase 2 lives in the apply worker (not the shmem_startup_hook) because the startup hook runs before catalogs and subscriptions are readable. + +**Phase 3 — forced keepalive.** + +Immediately after `spock_start_replication()` returns, before entering the receive loop, the apply worker calls a new helper: + +```c +static void +request_initial_status_update(PGconn *conn, XLogRecPtr startpos); +``` + +This sends an 'r' status-update message with `replyRequested=true` but leaves `send_feedback()`'s static bookkeeping untouched. The walsender's `ProcessStandbyReplyMessage` sees the flag and calls `WalSndKeepalive(false, InvalidXLogRecPtr)`, which emits a 'k' with the publisher's current `sentPtr`. The subscriber's existing keepalive handler routes it through `UpdateWorkerStats`, populating `remote_insert_lsn` and `received_lsn` in shmem. + +### Runtime flow + +**Commit path (`handle_commit` in `src/spock_apply.c:992-1023`).** + +Delete the `spock_apply_progress_add_to_wal(&sap)` call. The surrounding `SpockApplyProgress` construction remains; `spock_group_progress_update_ptr()` continues to update shmem. Net: one fewer `XLogFlush` per applied remote-origin commit. + +**Keepalive path.** + +Unchanged. `UpdateWorkerStats()` updates `remote_insert_lsn` / `received_lsn` in shmem for every incoming message. + +**`spock_group_progress_force_set_list()` (add_node path).** + +Used during `add_node` to populate the new node's view of mesh-peer progress from `read_peer_progress()` output. These entries are *deduplication state* for forwarded transactions in a mesh (not just monitoring), so they must be crash-durable before the first clean shutdown. + +Transformation: +1. Remove the per-entry `spock_apply_progress_add_to_wal(sap)` call. +2. Per entry: acquire master lock → mutate shmem → release lock (no I/O under the lock). +3. After the loop: call `spock_group_resource_dump()` once to persist all entries atomically via `durable_rename`. + +Rationale: `resource.dat` is a full-snapshot dump, not an incremental format, so N per-entry dumps would rewrite the entire file N times. A single post-loop dump gives equivalent durability (`add_node` is all-or-nothing at the admin level) with O(1) file writes. + +**`spock_group_progress_update_list()` (table-sync path).** + +Same transformation as `force_set_list`: drop the per-entry WAL call, add a single `spock_group_resource_dump()` after the loop. + +### Shutdown flow + +Unchanged. `src/spock_shmem.c:190-199` already contains a `spock_on_shmem_exit` callback that calls `spock_group_resource_dump()` only on clean exit (`if (code != 0) return;`). This is the sole remaining runtime producer of `resource.dat`. + +## Code changes + +### Delete + +- `src/spock_rmgr.c` — entire file. +- `include/spock_rmgr.h` — entire file (contains no live struct definitions; `SpockApplyProgress` and `SpockGroupEntry` live in `include/spock_group.h:116-147`). +- `spock_checkpoint_hook()` function in `src/spock_group.c:653-671`. +- `extern void spock_checkpoint_hook(...)` declaration in `include/spock_group.h:162`. + +### Remove one-liners + +- `src/spock.c:1248` — `Checkpoint_hook = spock_checkpoint_hook;`. +- `src/spock.c:1251` — `spock_rmgr_init();`. +- `src/spock_apply.c:1018` — `spock_apply_progress_add_to_wal(&sap);` inside `handle_commit`. +- `src/spock_group.c:682` — `spock_apply_progress_add_to_wal(sap);` inside `spock_group_progress_update_list`. +- `src/spock_group.c:769` — `spock_apply_progress_add_to_wal(sap);` inside `spock_group_progress_force_set_list`. +- `#include "spock_rmgr.h"` in `include/spock_group.h:23`. +- `#include "spock_rmgr.h"` in `src/spock.c:62`. + +### Modify + +- **`src/spock_apply.c`** — add a new helper `request_initial_status_update(PGconn *conn, XLogRecPtr startpos)` that emits an 'r' message with `replyRequested=true` without touching `send_feedback`'s static state. +- **`src/spock_apply.c:spock_apply_main`** — after `replorigin_session_setup` and before `spock_start_replication`, add phase-2 reconciliation. After `spock_start_replication` returns, call `request_initial_status_update`. +- **`src/spock_group.c:spock_group_progress_force_set_list`** — drop the WAL call; add a single `spock_group_resource_dump()` after the loop. +- **`src/spock_group.c:spock_group_progress_update_list`** — same. + +### Keep unchanged + +- `src/spock_shmem.c:190-199` — `spock_on_shmem_exit` (clean-shutdown-only dump). +- `spock_group_resource_load()`, `spock_group_resource_dump()` — load/dump file-format code. +- `SpockApplyProgress` and `SpockGroupEntry` struct layouts. +- `spock.progress` SQL view and `get_apply_group_progress()` function. + +### Postgres core patch removal + +The patch can be reverted entirely after the Spock-side removal. Four lines across two files: + +- `postgresql/src/backend/access/transam/xlog.c:173` — `Checkpoint_hook_type Checkpoint_hook = NULL;`. +- `postgresql/src/backend/access/transam/xlog.c:7648-7649` — 2-line `if (Checkpoint_hook) Checkpoint_hook(...)` invocation inside `CreateCheckPoint`. +- `postgresql/src/include/access/xlog.h:218-219` — typedef + `extern PGDLLIMPORT` declaration. + +No other Spock or Postgres-patch code depends on `Checkpoint_hook`. + +## Edge cases + +**Forced keepalive: publisher never responds.** Not blocking. The apply worker proceeds into its normal receive loop; `remote_insert_lsn` / `received_lsn` stay at seeded values (or NULL) until a 'k' or 'w' arrives via Postgres's own `wal_sender_timeout / 2` cadence. Worst case matches current behavior. + +**Startup: origin exists but `resource.dat` is missing.** Phase 2 initializes a fresh shmem entry with `remote_commit_lsn = origin_lsn` and NULL timestamps. Equivalent to first-time start. + +**Startup: `resource.dat` entry for a subscription that was dropped.** Old entry sits in shmem; no apply worker reconciles it. Matches current behavior. Optional follow-up: purge unknown entries after catalog is readable. + +**Startup: origin LSN is `InvalidXLogRecPtr`.** Subscription exists but nothing has been applied. Reconciliation treats this as "no origin progress"; keep `resource.dat` value if any, otherwise zero. + +**`add_node` retry.** Failed mid-loop leaves shmem partially updated and the file untouched (dump runs only after the loop completes). Second run's unconditional overwrite (`init_progress_fields` then `progress_update_struct`) replaces partial state; dump runs after the loop. Fine. + +**`add_node` interrupted after dump.** All entries persisted; re-run is idempotent. + +**Multiple apply workers.** Each worker reconciles its own key under the master lock. No cross-worker interference. Forced keepalive is per-connection. + +**Post-crash with publisher offline.** `remote_commit_ts` is NULL in `spock.progress` until publisher reconnects and sends a commit. This is the only visible regression versus today (which would show stale-from-last-shutdown values). NULL is intentionally chosen as an honest "unknown" signal. + +**Shmem hash full during `force_set_list`.** Warn and continue (existing behavior unchanged). Post-loop dump persists whichever entries made it in. + +**Concurrent shutdown during reconciliation.** Worker completes its update under the lock, releases, exits on next `CHECK_FOR_INTERRUPTS`. Shutdown hook dumps resource.dat with the reconciled values. Reconciliation-then-dump cannot leave shmem more stale than the file. + +## Testing + +### Modified existing tests + +- `tests/tap/t/008_rmgr.pl` — rewrites assertions for the new mechanism. "Clean stop → restart → spock.progress preserved" and "crash → correct remote_commit_lsn from origin" assertions remain conceptually, with different internals. + +### New tests + +1. **Clean shutdown/restart preserves `remote_commit_ts`.** Apply N transactions → clean stop → start → assert `remote_commit_ts` matches pre-shutdown value. +2. **Crash after post-shutdown commits → `remote_commit_ts` NULL.** Clean shutdown → apply transactions → `SIGKILL` → restart → assert `remote_commit_ts` is NULL (stale-detection cleared it), `remote_commit_lsn` matches origin. +3. **Crash during idle → fields NULL until reconnect.** Let subscriber idle with publisher → `SIGKILL` → restart before publisher reconnects → assert only `remote_commit_lsn` populated, others NULL. +4. **Forced keepalive populates within ms.** Set `wal_sender_timeout` high (e.g., 600s) so the timer-driven path is disabled. Start apply worker → assert `remote_insert_lsn` populated within 1 second (proves the forced keepalive, not the timer, produced it). +5. **`add_node` crash-durability.** Run `add_node` → `SIGKILL` before clean shutdown → restart → assert mesh progress entries from `force_set_list` survived (via the post-loop `spock_group_resource_dump`). +6. **No extra WAL flush on commit.** Instrument or count flushes; confirm applied remote-origin commits perform 1 `XLogFlush`, not 2. +7. **`add_node` interrupted mid-loop.** Inject failure between entries → assert `resource.dat` reflects pre-add_node state (no partial dump). +8. **Publisher never responds to forced keepalive.** Block reply path → assert apply worker does not block; values populate on `wal_sender_timeout / 2` cadence. + +### Regression safety checks + +- `spock.progress` column set unchanged. +- `read_peer_progress()` SQL function output unchanged. +- Downstream tools reading `spock.progress` (spockctrl, monitoring scripts) see no schema or column-type changes. + +## Follow-ups (out of scope) + +- Best-effort `GetLatestCommitTsData` at startup to recover `remote_commit_ts` for the single globally-latest commit's origin. Low cost, low coverage (helps only one origin per crash). Deferred per user's call. +- Purging `resource.dat` entries that reference subscriptions that no longer exist. +- Removing the obsolete `prev_remote_ts` and `updated_by_decode` fields from the `SpockApplyProgress` struct layout. From 81d84a125b38ac1852d6504252a3f8851e782b88 Mon Sep 17 00:00:00 2001 From: Mason Sharp Date: Tue, 21 Apr 2026 11:09:45 -0700 Subject: [PATCH 2/5] feat: replace WAL-based spock.progress persistence with file + origin reconcile Replaces the WAL-RMGR + checkpoint-hook persistence for spock.progress with a file snapshot + replication-origin reconcile. What's removed: - The WAL RMGR's apply-progress recovery records (no per-commit SPOCK_RMGR_APPLY_PROGRESS, no redo-into-shmem). - spock_checkpoint_hook() and its registration. - Per-commit XLogFlush() in handle_commit. What's added: - reconcile_progress_with_origin() at apply-worker startup: reads the durable origin LSN, advances remote_commit_lsn if resource.dat is stale, clears timestamp fields pending later refresh. - A retargeted RMGR (id 144) emitting one WAL record per SpockGroupHash entry at each resource.dat dump event. Records are forensic only (visible via pg_waldump); state recovery does not depend on them. - spock_supervisor_on_exit(): the spock supervisor's before_shmem_exit callback. On clean shutdown, polls (~5 s ceiling) for sibling apply/sync workers to detach via spock_any_apply_worker_running() (new helper), then runs spock_group_resource_dump() and the SHUTDOWN forensic WAL emit. XLogInsert requires a backend context, so the supervisor (a bgworker) is the right owner -- not postmaster. Both file and WAL dumps share one owner and reflect the same drained state. What's changed: - add_node and table-sync paths: drop per-entry WAL writes, dump resource.dat once after the loop via durable_rename. Performance: removes one fsync per applied remote-origin commit on the apply hot path. Under synchronous_commit=off (the apply worker's default), measured via 026_no_double_wal_flush.pl as ~2 fsyncs over 30 applied commits vs >=30 before. Test suite updated for the new layout. --- include/spock_group.h | 3 - include/spock_progress_recovery.h | 18 ++ include/spock_rmgr.h | 64 ++-- include/spock_worker.h | 1 + src/spock.c | 50 +++- src/spock_apply.c | 15 +- src/spock_group.c | 90 +++--- src/spock_progress_recovery.c | 169 +++++++++++ src/spock_rmgr.c | 279 ++++++++++-------- src/spock_shmem.c | 22 -- src/spock_worker.c | 33 +++ tests/tap/t/008_rmgr.pl | 132 ++++++--- ...022_rmgr_progress_post_checkpoint_crash.pl | 195 ++++++------ tests/tap/t/023_forced_keepalive_timing.pl | 108 +++++++ tests/tap/t/026_no_double_wal_flush.pl | 107 +++++++ 15 files changed, 911 insertions(+), 375 deletions(-) create mode 100644 include/spock_progress_recovery.h create mode 100644 src/spock_progress_recovery.c create mode 100644 tests/tap/t/023_forced_keepalive_timing.pl create mode 100644 tests/tap/t/026_no_double_wal_flush.pl diff --git a/include/spock_group.h b/include/spock_group.h index 256d8e04..0543b0b6 100644 --- a/include/spock_group.h +++ b/include/spock_group.h @@ -20,8 +20,6 @@ #include "storage/spin.h" #include "utils/hsearch.h" -#include "spock_rmgr.h" - extern HTAB *SpockGroupHash; /* numeric version to store on disk */ @@ -159,7 +157,6 @@ extern void spock_group_progress_update_ptr(SpockGroupEntry *entry, extern TimestampTz apply_worker_get_prev_remote_ts(void); extern void spock_group_resource_dump(void); -extern void spock_checkpoint_hook(XLogRecPtr checkPointRedo, int flags); extern void spock_group_progress_update_list(List *lst); extern void spock_group_progress_force_set_list(List *lst); diff --git a/include/spock_progress_recovery.h b/include/spock_progress_recovery.h new file mode 100644 index 00000000..484d57f4 --- /dev/null +++ b/include/spock_progress_recovery.h @@ -0,0 +1,18 @@ +/*------------------------------------------------------------------------- + * + * spock_progress_recovery.h + * apply-worker startup state setup for spock.progress + * + * Copyright (c) 2022-2026, pgEdge, Inc. + * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group + * + *------------------------------------------------------------------------- + */ +#ifndef SPOCK_PROGRESS_RECOVERY_H +#define SPOCK_PROGRESS_RECOVERY_H + +#include "access/xlog.h" + +extern void spock_init_progress_state(XLogRecPtr origin_lsn); + +#endif /* SPOCK_PROGRESS_RECOVERY_H */ diff --git a/include/spock_rmgr.h b/include/spock_rmgr.h index 2721a9d0..428f5d02 100644 --- a/include/spock_rmgr.h +++ b/include/spock_rmgr.h @@ -3,6 +3,14 @@ * spock_rmgr.h * spock resource manager declarations * + * Emits one WAL record per SpockGroupHash entry at every resource.dat dump + * event (clean shutdown, add_node post-loop, table-sync post-loop) carrying + * the full SpockApplyProgress snapshot for that entry. The records are not + * required for state recovery (shmem is reseeded from resource.dat plus + * replorigin reconcile plus pg_commit_ts scan); they exist so that a + * pg_waldump trace shows the exact progress snapshot that was persisted + * at each dump LSN -- useful for incident reconstruction over time. + * * Copyright (c) 2022-2026, pgEdge, Inc. * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group * Portions Copyright (c) 1994, The Regents of the University of California @@ -14,38 +22,54 @@ #include "access/xlog.h" #include "access/xlog_internal.h" +#include "nodes/pg_list.h" #include "spock_group.h" -/* Spock resouce manager */ -#define SPOCK_RMGR_NAME "spock_custom_rmgr" -#define SPOCK_RMGR_ID 144 +#define SPOCK_RMGR_NAME "spock_custom_rmgr" +#define SPOCK_RMGR_ID 144 -/* Spock RMGR tags. */ -#define SPOCK_RMGR_APPLY_PROGRESS 0x10 -#define SPOCK_RMGR_SUBTRANS_COMMIT_TS 0x20 +/* Spock RMGR record types (high nibble of info byte) */ +#define SPOCK_RMGR_RESOURCE_DUMP 0x10 -typedef struct SpockApplyProgress SpockApplyProgress; -typedef struct SpockGroupEntry SpockGroupEntry; +/* Event type within a SPOCK_RMGR_RESOURCE_DUMP record */ +typedef enum SpockResourceDumpEvent +{ + SPOCK_DUMP_SHUTDOWN = 0, + SPOCK_DUMP_ADD_NODE = 1, + SPOCK_DUMP_TABLE_SYNC = 2 +} SpockResourceDumpEvent; -#if 0 -typedef struct SubTransactionCommitTsEntry +typedef struct SpockResourceDumpRec { - TransactionId xid; - TimestampTz time; - RepOriginId nodeid; -} SubTransactionCommitTsEntry; -#endif + uint8 event_type; /* SpockResourceDumpEvent */ + uint8 flags; /* reserved */ + uint16 entry_seq; /* 0-based index within this dump event */ + uint16 entry_total; /* total entries in this dump event */ + uint16 _pad; + SpockApplyProgress progress; /* full snapshot for this entry */ +} SpockResourceDumpRec; /* RMGR function declarations */ extern void spock_rmgr_init(void); +extern void spock_rmgr_redo(XLogReaderState *record); extern void spock_rmgr_desc(StringInfo buf, XLogReaderState *record); extern const char *spock_rmgr_identify(uint8 info); -extern void spock_rmgr_redo(XLogReaderState *record); -extern void spock_rmgr_startup(void); -extern void spock_rmgr_cleanup(void); -/* WAL helpers */ -extern XLogRecPtr spock_apply_progress_add_to_wal(const SpockApplyProgress *sap); +/* + * Emit helper -- called from the dump call sites in spock_group.c / + * spock_shmem.c. Emits one WAL record per progress entry, then issues a + * single XLogFlush at the end. + * + * If `changed_entries` is NULL, walks SpockGroupHash and emits a record + * for each entry (used by SHUTDOWN and ADD_NODE — both reflect changes + * across all subscriptions). + * + * If `changed_entries` is non-NULL, it is treated as a List of + * SpockApplyProgress* and emits one record per list element (used by + * TABLE_SYNC, which only affects one subscription's progress). + */ +extern void spock_rmgr_log_resource_dump(SpockResourceDumpEvent event, + List *changed_entries); #endif /* SPOCK_RMGR_H */ diff --git a/include/spock_worker.h b/include/spock_worker.h index 60891451..855b3310 100644 --- a/include/spock_worker.h +++ b/include/spock_worker.h @@ -178,6 +178,7 @@ extern SpockWorker *spock_sync_find(Oid dboid, Oid subid, extern List *spock_sync_find_all(Oid dboid, Oid subscriberid); extern SpockWorker *spock_get_worker(int slot); +extern bool spock_any_apply_worker_running(void); extern bool spock_worker_running(SpockWorker *w); extern bool spock_worker_terminating(SpockWorker *w); extern void spock_worker_kill(SpockWorker *worker); diff --git a/src/spock.c b/src/spock.c index 8def4d39..dd3dcb66 100644 --- a/src/spock.c +++ b/src/spock.c @@ -57,6 +57,7 @@ #include "spock_conflict_stat.h" #endif #include "spock_executor.h" +#include "spock_group.h" #include "spock_node.h" #include "spock_conflict.h" #include "spock_rmgr.h" @@ -150,7 +151,6 @@ int log_origin_change = SPOCK_ORIGIN_NONE; int spock_apply_idle_timeout = 300; static emit_log_hook_type prev_emit_log_hook = NULL; -static Checkpoint_hook_type prev_Checkpoint_hook = NULL; void _PG_init(void); PGDLLEXPORT void spock_supervisor_main(Datum main_arg); @@ -693,6 +693,49 @@ start_manager_workers(void) table_close(rel, AccessShareLock); } +/* + * Drain interval and timeout for the supervisor's shutdown wait loop. + * + * On SIGTERM, postmaster signals all bgworkers in parallel; the supervisor + * waits for sibling APPLY/SYNC workers to clear their PGPROC slots so the + * SHUTDOWN forensic dump reflects each worker's last update to + * SpockGroupHash. 100 ms × 50 = 5 s ceiling — long enough to cover normal + * apply-side cleanup, short enough to avoid blocking postmaster shutdown + * on a stuck worker. + */ +#define SPOCK_SHUTDOWN_DRAIN_INTERVAL_US 100000 +#define SPOCK_SHUTDOWN_DRAIN_MAX_RETRIES 50 + +/* + * Supervisor before_shmem_exit cleanup. Runs in the supervisor process on + * clean shutdown, with WAL insertion still permitted. + * + * We poll-wait for in-flight apply/sync workers to detach so the resource.dat + * file dump and the SPOCK_DUMP_SHUTDOWN forensic WAL records reflect each + * worker's final SpockGroupHash state. + */ +static void +spock_supervisor_on_exit(int code, Datum arg) +{ + int retries = SPOCK_SHUTDOWN_DRAIN_MAX_RETRIES; + + if (code != 0) + return; + + while (retries-- > 0 && spock_any_apply_worker_running()) + { + CHECK_FOR_INTERRUPTS(); + pg_usleep(SPOCK_SHUTDOWN_DRAIN_INTERVAL_US); + } + + if (spock_any_apply_worker_running()) + elog(LOG, "spock supervisor: shutdown drain timed out, " + "SHUTDOWN forensic snapshot may be incomplete"); + + spock_group_resource_dump(); + spock_rmgr_log_resource_dump(SPOCK_DUMP_SHUTDOWN, NULL); +} + /* * Static bgworker used for initialization and management (our main process). */ @@ -703,6 +746,8 @@ spock_supervisor_main(Datum main_arg) pqsignal(SIGTERM, handle_sigterm); BackgroundWorkerUnblockSignals(); + before_shmem_exit(spock_supervisor_on_exit, (Datum) 0); + /* * Initialize supervisor info in shared memory. Strictly speaking we * don't need a lock here, because no other process could possibly be @@ -1244,9 +1289,6 @@ _PG_init(void) if (IsBinaryUpgrade) return; - prev_Checkpoint_hook = Checkpoint_hook; - Checkpoint_hook = spock_checkpoint_hook; - /* Spock resource manager */ spock_rmgr_init(); diff --git a/src/spock_apply.c b/src/spock_apply.c index 6d35b242..096bc8d9 100644 --- a/src/spock_apply.c +++ b/src/spock_apply.c @@ -71,6 +71,7 @@ #include "spock_conflict.h" #include "spock_executor.h" #include "spock_node.h" +#include "spock_progress_recovery.h" #include "spock_proto_native.h" #include "spock_queue.h" #include "spock_relcache.h" @@ -1005,10 +1006,11 @@ handle_commit(StringInfo s) .remote_commit_lsn = end_lsn, .received_lsn = end_lsn, /* - * Include remote_insert_lsn for WAL persistence. This was already - * updated in shmem by UpdateWorkerStats() earlier (either from - * apply_work for protocol 5+, or from handle_commit for protocol 4). - * Without this, crash recovery would lose remote_insert_lsn. + * Carry forward the remote_insert_lsn already in shmem (set by + * UpdateWorkerStats on the most-recent keepalive or 'w' message). + * This keeps the shmem entry coherent after the update below; + * crash recovery of this field is handled by the forced keepalive + * sent right after spock_start_replication, not by any WAL record. */ .remote_insert_lsn = MyApplyWorker->apply_group->progress.remote_insert_lsn, /* XXX: Could we use commit_ts value instead? */ @@ -1019,9 +1021,6 @@ handle_commit(StringInfo s) /* XXX: Don't care in production yet */ Assert(sap.last_updated_ts >= sap.remote_commit_ts); - /* WAL after commit, then to shmem */ - spock_apply_progress_add_to_wal(&sap); - Assert(MyApplyWorker && MyApplyWorker->apply_group); spock_group_progress_update_ptr(MyApplyWorker->apply_group, &sap); @@ -4052,6 +4051,8 @@ spock_apply_main(Datum main_arg) replorigin_session_origin = originid; origin_startpos = replorigin_session_get_progress(false); + spock_init_progress_state(origin_startpos); + /* Start the replication. */ streamConn = spock_connect_replica(MySubscription->origin_if->dsn, MySubscription->slot_name, NULL); diff --git a/src/spock_group.c b/src/spock_group.c index b3f8cb35..a6487004 100644 --- a/src/spock_group.c +++ b/src/spock_group.c @@ -21,10 +21,20 @@ * * nattached, prev_processed_cv -- apply-worker coordination (runtime) * * Persistence: - * - WAL: authoritative. spock_rmgr_redo() replays progress into this hash. - * - File: PGDATA/spock/resource.dat on clean shutdown (on_shmem_exit). - * Load during shmem_startup_hook to seed shmem quickly. WAL replay - * runs after and overrides stale file contents. + * - File: PGDATA/spock/resource.dat, written on clean shutdown + * (spock_on_shmem_exit) and after table-sync / add_node bulk + * updates. Loaded during shmem_startup_hook to seed shmem on + * restart. reconcile_progress_with_origin() then reconciles + * remote_commit_lsn against pg_replication_origin at apply-worker + * start; timestamp fields are cleared if the file LSN is stale. + * remote_insert_lsn and received_lsn refresh within milliseconds + * of reconnect via the forced keepalive sent by + * request_initial_status_update(). remote_commit_ts is + * repopulated by recover_progress_timestamps_from_commit_ts() + * scanning pg_commit_ts. + * - WAL: not authoritative for state. The RMGR (id 144) emits forensic + * dump markers only; redo emits a LOG line and does not touch + * shmem. * * * Notes: @@ -54,6 +64,7 @@ #include "spock_worker.h" #include "spock_compat.h" #include "spock_group.h" +#include "spock_rmgr.h" HTAB *SpockGroupHash = NULL; @@ -649,27 +660,13 @@ spock_group_resource_load(void) CloseTransientFile(fd); } -void -spock_checkpoint_hook(XLogRecPtr checkPointRedo, int flags) -{ - /* - * Dump current progress state to resource.dat at every checkpoint. - * - * resource.dat and WAL cooperate for crash recovery: - * - resource.dat seeds shmem with checkpoint-time values on restart - * - WAL replay (spock_rmgr_redo) advances any entries written after - * checkPointRedo - * - * Without dumping at regular checkpoints, a crash after a checkpoint - * leaves resource.dat with values from the last clean shutdown. If apply - * workers were idle during the checkpoint interval (provider unreachable), - * no SPOCK_RMGR_APPLY_PROGRESS records exist after checkPointRedo to - * recover from either, causing spock.progress to show stale values or - * missing entries after crash recovery. - */ - spock_group_resource_dump(); -} - +/* + * spock_group_progress_update_list + * + * Update shmem progress for each entry in lst after a table COPY completes. + * Entries are applied via spock_group_progress_update (MAX-by-LSN semantics). + * A single spock_group_resource_dump() after the loop persists the result. + */ void spock_group_progress_update_list(List *lst) { @@ -679,8 +676,6 @@ spock_group_progress_update_list(List *lst) { SpockApplyProgress *sap = (SpockApplyProgress *) lfirst(lc); - spock_apply_progress_add_to_wal(sap); - spock_group_progress_update(sap); elog(LOG, "SPOCK: adjust spock.progress %d->%d to " @@ -692,6 +687,10 @@ spock_group_progress_update_list(List *lst) LSN_FORMAT_ARGS(sap->remote_insert_lsn)); } + /* Persist all entries atomically via a single resource.dat dump. */ + spock_group_resource_dump(); + spock_rmgr_log_resource_dump(SPOCK_DUMP_TABLE_SYNC, lst); + /* * Free the list and each object. Be careful here because it is inside * a memory context that is rarely reset. @@ -702,9 +701,12 @@ spock_group_progress_update_list(List *lst) /* * spock_group_progress_force_set_list * - * Write resume_lsn into the shmem progress entry for each peer during add_node. - * Uses MAX-by-LSN: preserves the existing entry when it is already >= resume_lsn, - * preventing double-apply from overwriting a live apply worker's higher value. + * Unconditionally write resume_lsn into the shmem progress entry for each + * peer during add_node. The resume_lsn from read_peer_progress is the + * COPY-snapshot boundary and is authoritative: the new node must replay + * from there, not from any live apply worker's current position. + * + * A single spock_group_resource_dump() after the loop persists the result. */ void spock_group_progress_force_set_list(List *lst) @@ -746,31 +748,19 @@ spock_group_progress_force_set_list(List *lst) } else if (entry->progress.remote_commit_lsn >= sap->remote_commit_lsn) { - /* - * Existing LSN >= resume_lsn. Unconditionally overwrite: the - * value from read_peer_progress is authoritative because - * it was captured at COPY snapshot time. The apply worker may - * have advanced past it since then, but any data it applied - * after the snapshot is NOT in the COPY — so the new node must - * replay from the snapshot boundary, not from the worker's - * current position. - */ - elog(LOG, "SPOCK: force-set %d->%d overwriting: existing=%X/%X -> resume_lsn=%X/%X", + elog(LOG, "SPOCK: force-set %d->%d backwards overwrite: existing=%X/%X -> resume_lsn=%X/%X", sap->key.remote_node_id, MySubscription->target->id, LSN_FORMAT_ARGS(entry->progress.remote_commit_lsn), LSN_FORMAT_ARGS(sap->remote_commit_lsn)); } - else - { - /* Stale entry below resume_lsn; will be reset after WAL write succeeds. */ - } - - /* WAL-log before shmem update; skipped above when existing LSN is higher. */ - spock_apply_progress_add_to_wal(sap); - /* Now safe to mutate shmem: WAL write succeeded. */ + /* + * Unconditional overwrite: the resume_lsn from read_peer_progress is + * the COPY-snapshot boundary and is authoritative. + */ init_progress_fields(&entry->progress); progress_update_struct(&entry->progress, sap); + LWLockRelease(SpockCtx->apply_group_master_lock); elog(LOG, "SPOCK: force-set %d->%d commit_lsn=%X/%X insert_lsn=%X/%X", @@ -779,5 +769,9 @@ spock_group_progress_force_set_list(List *lst) LSN_FORMAT_ARGS(sap->remote_insert_lsn)); } + /* Persist all entries atomically via a single resource.dat dump. */ + spock_group_resource_dump(); + spock_rmgr_log_resource_dump(SPOCK_DUMP_ADD_NODE, NULL); + list_free_deep(lst); } diff --git a/src/spock_progress_recovery.c b/src/spock_progress_recovery.c new file mode 100644 index 00000000..dc0b304d --- /dev/null +++ b/src/spock_progress_recovery.c @@ -0,0 +1,169 @@ +/*------------------------------------------------------------------------- + * + * spock_progress_recovery.c + * Set up an apply worker's spock.progress shmem entry at startup. + * + * After WAL recovery loads PGDATA/spock/resource.dat into shmem, the apply + * worker calls spock_init_progress_state() to reconcile its entry against + * the durable replication origin LSN. If the loaded file is stale, the + * commit LSN is advanced from the origin and the timestamp fields are + * cleared (later restored by recover_progress_timestamps_from_commit_ts()). + * + * Copyright (c) 2022-2026, pgEdge, Inc. + * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "miscadmin.h" + +#include "access/xact.h" +#include "access/xlog.h" +#include "datatype/timestamp.h" +#include "replication/origin.h" +#include "storage/condition_variable.h" +#include "storage/lwlock.h" +#include "utils/hsearch.h" +#include "utils/rel.h" + +#include "spock_group.h" +#include "spock_node.h" +#include "spock_progress_recovery.h" +#include "spock_worker.h" + +static void reconcile_progress_with_origin(XLogRecPtr origin_lsn); + +/* + * spock_init_progress_state + * + * Public entry point: reconcile the apply worker's shmem entry against the + * replication origin. Called once between replorigin_session_setup() and + * spock_start_replication(). + */ +void +spock_init_progress_state(XLogRecPtr origin_lsn) +{ + reconcile_progress_with_origin(origin_lsn); +} + +/* + * Reconcile this worker's shmem progress entry against the replication + * origin's LSN. Called once at worker startup after replorigin_session_setup + * but before entering the apply loop. + * + * - Entry absent: initialize with origin_lsn, other fields zero/NULL. + * - Entry present, file_lsn == origin_lsn: file is in sync; keep timestamps. + * - Entry present, file_lsn < origin_lsn: file is stale; clear timestamps. + * - Entry present, file_lsn > origin_lsn: anomalous; prefer origin_lsn. + * + * remote_insert_lsn and received_lsn are left alone here; the forced + * keepalive sent right after spock_start_replication will overwrite them. + */ +static void +reconcile_progress_with_origin(XLogRecPtr origin_lsn) +{ + SpockGroupKey key; + SpockGroupEntry *entry; + bool found; + + if (!SpockGroupHash || !SpockCtx) + { + elog(WARNING, "SPOCK %s: SpockGroupHash is not initialized; reconcile skipped", + MySubscription->name); + return; + } + + key.dbid = MyDatabaseId; + key.node_id = MySubscription->target->id; + key.remote_node_id = MySubscription->origin->id; + + LWLockAcquire(SpockCtx->apply_group_master_lock, LW_EXCLUSIVE); + + entry = (SpockGroupEntry *) hash_search(SpockGroupHash, &key, + HASH_ENTER, &found); + if (entry == NULL) + { + LWLockRelease(SpockCtx->apply_group_master_lock); + elog(WARNING, "SpockGroupHash is full, cannot reconcile progress for " + "(dbid=%u, node_id=%u, remote_node_id=%u)", + key.dbid, key.node_id, key.remote_node_id); + return; + } + + if (!found) + { + /* + * New entry: hash_search already copied the key into + * entry->progress.key. Zero the remaining fields inline because + * init_progress_fields is static to spock_group.c. + * + * MAINTENANCE: if SpockApplyProgress gains new fields, this block + * must be updated in lockstep with init_progress_fields in + * spock_group.c:init_progress_fields(). + */ + Assert(OidIsValid(entry->progress.key.dbid)); + Assert(OidIsValid(entry->progress.key.node_id)); + Assert(OidIsValid(entry->progress.key.remote_node_id)); + entry->progress.remote_commit_ts = 0; + entry->progress.prev_remote_ts = 0; + entry->progress.remote_commit_lsn = origin_lsn; + entry->progress.remote_insert_lsn = InvalidXLogRecPtr; + entry->progress.received_lsn = InvalidXLogRecPtr; + entry->progress.last_updated_ts = 0; + entry->progress.updated_by_decode = false; + pg_atomic_init_u32(&entry->nattached, 0); + ConditionVariableInit(&entry->prev_processed_cv); + elog(LOG, "SPOCK %s: reconcile: new entry seeded at origin LSN %X/%X", + MySubscription->name, LSN_FORMAT_ARGS(origin_lsn)); + } + else if (entry->progress.remote_commit_lsn == origin_lsn) + { + /* File and origin agree; keep timestamps. */ + elog(DEBUG1, "SPOCK %s: reconcile: file LSN matches origin (%X/%X)", + MySubscription->name, LSN_FORMAT_ARGS(origin_lsn)); + } + else if (entry->progress.remote_commit_lsn < origin_lsn) + { + /* File is stale. Overwrite LSN, clear timestamp fields. */ + elog(LOG, "SPOCK %s: reconcile: file LSN %X/%X stale vs origin %X/%X; clearing ts", + MySubscription->name, + LSN_FORMAT_ARGS(entry->progress.remote_commit_lsn), + LSN_FORMAT_ARGS(origin_lsn)); + entry->progress.remote_commit_lsn = origin_lsn; + entry->progress.remote_commit_ts = 0; + entry->progress.prev_remote_ts = 0; + entry->progress.last_updated_ts = 0; + /* + * Preserve the invariant remote_insert_lsn >= remote_commit_lsn. + * If we only advance commit_lsn, later calls to + * progress_update_struct will assert-fail. The publisher's forced + * keepalive on reconnect will refresh remote_insert_lsn to a + * current value; until then, pinning it to origin_lsn is correct + * (any prior commit has necessarily already been inserted). + */ + if (entry->progress.remote_insert_lsn < origin_lsn) + entry->progress.remote_insert_lsn = origin_lsn; + if (entry->progress.received_lsn < origin_lsn) + entry->progress.received_lsn = origin_lsn; + } + else + { + /* Anomaly: file ahead of origin. Trust origin. */ + elog(WARNING, "SPOCK %s: reconcile: file LSN %X/%X ahead of origin %X/%X; trusting origin", + MySubscription->name, + LSN_FORMAT_ARGS(entry->progress.remote_commit_lsn), + LSN_FORMAT_ARGS(origin_lsn)); + entry->progress.remote_commit_lsn = origin_lsn; + entry->progress.remote_commit_ts = 0; + entry->progress.prev_remote_ts = 0; + entry->progress.last_updated_ts = 0; + /* Same invariant fix as above. */ + if (entry->progress.remote_insert_lsn < origin_lsn) + entry->progress.remote_insert_lsn = origin_lsn; + if (entry->progress.received_lsn < origin_lsn) + entry->progress.received_lsn = origin_lsn; + } + + LWLockRelease(SpockCtx->apply_group_master_lock); +} diff --git a/src/spock_rmgr.c b/src/spock_rmgr.c index b2a9c0af..8a9b005f 100644 --- a/src/spock_rmgr.c +++ b/src/spock_rmgr.c @@ -3,32 +3,25 @@ * spock_rmgr.c * spock resource manager definitions * - * Copyright (c) 2022-2026, pgEdge, Inc. - * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group - * Portions Copyright (c) 1994, The Regents of the University of California - * - * - * Spock Resource Manager (RMGR) - * ----------------------------- + * Emits one WAL record per SpockGroupHash entry on each resource.dat dump + * event so operators can correlate the persisted progress snapshot with + * WAL position via pg_waldump. Each record carries the full + * SpockApplyProgress for one entry, the event type, and the (seq, total) + * pair within the dump event so consumers can reassemble. * - * This module implements the WAL side of Spock's apply progress persistence. + * The records are not required for state recovery: shmem is reseeded at + * startup from resource.dat plus replication-origin reconcile plus + * pg_commit_ts scan. Their value is showing what state was on disk at + * each dump LSN -- useful for incident reconstruction over time. * - * - WAL is authoritative for progress. We emit compact progress records - * after each locally-committed, remotely-originated transaction on the - * subscriber, and REDO replays them into shared memory after crash/restart. - * - Shared memory holds a hash map keyed by (dbid, node_id, remote_node_id), - * storing the latest SpockApplyProgress snapshot for each group. - * - On clean shutdowns we also take a file snapshot (resource.dat) so restarts - * can seed shmem quickly; WAL REDO still runs after load and will overwrite - * any stale entries. + * Volume is N entries × handful per cluster per day in normal operation + * (one event per clean shutdown + one per add_node + one per table sync), + * so a single XLogFlush at the end of each emit batch is cheap and gives + * WAL durability symmetric with resource.dat's durable_rename(). * - * Startup ordering: - * 1) postmaster creates shared memory and runs shmem_startup_hook - * -> shmem_init() - * -> spock_group_resource_load() to load file snapshot (if any) - * 2) startup process begins WAL recovery - * -> calls spock_rmgr_redo() for our records - * -> redo update the same shmem hash (wins over file load) + * Copyright (c) 2022-2026, pgEdge, Inc. + * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, The Regents of the University of California * *------------------------------------------------------------------------- */ @@ -39,28 +32,21 @@ #include "access/xloginsert.h" #include "access/xlogreader.h" #include "access/xlogrecord.h" -#include "utils/pg_lsn.h" +#include "storage/lwlock.h" +#include "utils/hsearch.h" #include "spock_rmgr.h" #include "spock_worker.h" -#include "spock_apply.h" static RmgrData spock_custom_rmgr = { .rm_name = SPOCK_RMGR_NAME, .rm_redo = spock_rmgr_redo, .rm_desc = spock_rmgr_desc, .rm_identify = spock_rmgr_identify, - .rm_startup = spock_rmgr_startup, - .rm_cleanup = spock_rmgr_cleanup, + .rm_startup = NULL, + .rm_cleanup = NULL, }; -/* - * spock_rmgr_init - * - * Register Spock's resource manager so our WAL records are recognized during - * recovery. Called in _PG_init() before recovery can begin; - * do not allocate shmem here. - */ void spock_rmgr_init(void) { @@ -68,12 +54,25 @@ spock_rmgr_init(void) } /* - * spock_rmgr_redo - * - * Redo handler for Spock WAL records. For APPLY_PROGRESS records, decode the - * payload (SpockApplyProgress) and upsert into the shmem group registry via - * spock_group_update_progress(). This runs during recovery and overwrites any - * shmem contents previously seeded by file load. + * Translate the event_type byte to a stable string for log/desc output. + */ +static const char * +event_name(uint8 event_type) +{ + switch ((SpockResourceDumpEvent) event_type) + { + case SPOCK_DUMP_SHUTDOWN: return "shutdown"; + case SPOCK_DUMP_ADD_NODE: return "add_node"; + case SPOCK_DUMP_TABLE_SYNC: return "table_sync"; + } + return "unknown"; +} + +/* + * Replay handler: log only. The actual progress data is reseeded by + * spock_group_resource_load() (file) plus reconcile_progress_with_origin() + * (origin) plus the commit_ts recovery scan -- none of which touch this + * record. We just log so it's visible in recovery logs. */ void spock_rmgr_redo(XLogReaderState *record) @@ -82,117 +81,155 @@ spock_rmgr_redo(XLogReaderState *record) switch (info) { - case SPOCK_RMGR_APPLY_PROGRESS: + case SPOCK_RMGR_RESOURCE_DUMP: { - SpockApplyProgress *rec; - - rec = (SpockApplyProgress *) XLogRecGetData(record); - - /* - * During WAL replay, we must acquire locks when accessing - * shared hash tables (per commit c6d76d7 in PostgreSQL core). - * The spock_group_progress_update() function acquires - * apply_group_master_lock internally for us. - */ - (void) spock_group_progress_update(rec); + SpockResourceDumpRec *rec; + + rec = (SpockResourceDumpRec *) XLogRecGetData(record); + elog(LOG, + "SPOCK rmgr: resource.dat dump replayed " + "(event=%s, entry %u/%u, dbid=%u node_id=%u remote_node_id=%u, " + "remote_commit_lsn=%X/%X remote_insert_lsn=%X/%X received_lsn=%X/%X, " + "remote_commit_ts=" INT64_FORMAT ")", + event_name(rec->event_type), + (unsigned) rec->entry_seq + 1, (unsigned) rec->entry_total, + rec->progress.key.dbid, + rec->progress.key.node_id, + rec->progress.key.remote_node_id, + LSN_FORMAT_ARGS(rec->progress.remote_commit_lsn), + LSN_FORMAT_ARGS(rec->progress.remote_insert_lsn), + LSN_FORMAT_ARGS(rec->progress.received_lsn), + (int64) rec->progress.remote_commit_ts); + break; } - break; - - case SPOCK_RMGR_SUBTRANS_COMMIT_TS: - break; - default: - elog(PANIC, "spock_rmgr_redo: unknown op code %u", info); + elog(WARNING, + "SPOCK rmgr: unknown info byte 0x%02x in WAL record; ignoring", + info); + break; } } +/* + * pg_waldump descriptor: appended after the standard rmgr/info header. + */ void spock_rmgr_desc(StringInfo buf, XLogReaderState *record) { uint8 info = XLogRecGetInfo(record) & XLR_RMGR_INFO_MASK; - switch (info) + if (info == SPOCK_RMGR_RESOURCE_DUMP) { - case SPOCK_RMGR_APPLY_PROGRESS: - { - SpockApplyProgress *rec; - - rec = (SpockApplyProgress *) XLogRecGetData(record); - appendStringInfo(buf, "spock apply progress for dbid %u, node_id %u, remote_node_id %u; " - "remote_commit_lsn %X/%X, remote_insert_lsn %X/%X, received_lsn %X/%X", - rec->key.dbid, - rec->key.node_id, - rec->key.remote_node_id, - LSN_FORMAT_ARGS(rec->remote_commit_lsn), - LSN_FORMAT_ARGS(rec->remote_insert_lsn), - LSN_FORMAT_ARGS(rec->received_lsn)); - } - break; - case SPOCK_RMGR_SUBTRANS_COMMIT_TS: - appendStringInfo(buf, "spock rmgr: sub transaction commit ts"); - break; - - default: - appendStringInfo(buf, "spock rmgr: unknown(%u)", info); + SpockResourceDumpRec *rec = (SpockResourceDumpRec *) XLogRecGetData(record); + + appendStringInfo(buf, + "event=%s entry=%u/%u " + "key=(%u,%u,%u) " + "commit_lsn=%X/%X insert_lsn=%X/%X received_lsn=%X/%X " + "commit_ts=" INT64_FORMAT, + event_name(rec->event_type), + (unsigned) rec->entry_seq + 1, (unsigned) rec->entry_total, + rec->progress.key.dbid, + rec->progress.key.node_id, + rec->progress.key.remote_node_id, + LSN_FORMAT_ARGS(rec->progress.remote_commit_lsn), + LSN_FORMAT_ARGS(rec->progress.remote_insert_lsn), + LSN_FORMAT_ARGS(rec->progress.received_lsn), + (int64) rec->progress.remote_commit_ts); } } +/* + * pg_waldump identifier: maps info byte -> human-readable name. + */ const char * spock_rmgr_identify(uint8 info) { - switch (info) + switch (info & XLR_RMGR_INFO_MASK) { - case SPOCK_RMGR_APPLY_PROGRESS: - return "APPLY_PROGRESS"; - break; - case SPOCK_RMGR_SUBTRANS_COMMIT_TS: - return "SUBTRANS_COMMIT_TS"; - break; + case SPOCK_RMGR_RESOURCE_DUMP: + return "RESOURCE_DUMP"; } - return NULL; } -void -spock_rmgr_startup(void) +/* + * Build and insert one SPOCK_RMGR_RESOURCE_DUMP record for `progress`. + * Returns the inserted record's LSN. + */ +static XLogRecPtr +emit_one_dump_record(SpockResourceDumpEvent event, + uint16 entry_seq, uint16 entry_total, + const SpockApplyProgress *progress) { -} + SpockResourceDumpRec rec; -void -spock_rmgr_cleanup(void) -{ + memset(&rec, 0, sizeof(rec)); + rec.event_type = (uint8) event; + rec.entry_seq = entry_seq; + rec.entry_total = entry_total; + rec.progress = *progress; + + XLogBeginInsert(); + XLogRegisterData((char *) &rec, sizeof(rec)); + return XLogInsert(SPOCK_RMGR_ID, SPOCK_RMGR_RESOURCE_DUMP); } /* - * spock_apply_progress_add_to_wal - * - * Emit and flush a progress record to WAL after committing a remote-origin - * transaction locally. This makes the progress update durable and guarantees - * redo will restore it after crash. - * - * - Must be called *after* CommitTransactionCommand() of the applied txn. - * - Uses info code SPOCK_RMGR_APPLY_PROGRESS (0x10). - * - * Returns: the LSN of the inserted record. + * Emit one WAL record per progress entry, then flush once at the end. If + * changed_entries is NULL, walks the full SpockGroupHash; otherwise iterates + * the supplied list of SpockApplyProgress*. */ -XLogRecPtr -spock_apply_progress_add_to_wal(const SpockApplyProgress *sap) +void +spock_rmgr_log_resource_dump(SpockResourceDumpEvent event, + List *changed_entries) { - XLogRecPtr lsn; + XLogRecPtr last_recptr = InvalidXLogRecPtr; + uint32 entry_total; + uint16 entry_seq = 0; - Assert(sap != NULL); + if (SpockGroupHash == NULL || SpockCtx == NULL) + return; - XLogBeginInsert(); - XLogRegisterData((char *) sap, sizeof(SpockApplyProgress)); - lsn = XLogInsert(SPOCK_RMGR_ID, SPOCK_RMGR_APPLY_PROGRESS); - - /* - * Force the WAL record to disk immediately. This ensures that progress - * is durably recorded before we update the in-memory state and continue - * processing. If we crash after updating memory but before the WAL - * flushes, we could lose progress tracking and replay would be incorrect. - */ - XLogFlush(lsn); - - return lsn; + if (changed_entries != NULL) + { + ListCell *lc; + + entry_total = list_length(changed_entries); + if (entry_total == 0) + return; + + foreach(lc, changed_entries) + { + SpockApplyProgress *sap = (SpockApplyProgress *) lfirst(lc); + + last_recptr = emit_one_dump_record(event, entry_seq++, + (uint16) entry_total, sap); + } + } + else + { + HASH_SEQ_STATUS scan; + SpockGroupEntry *entry; + + LWLockAcquire(SpockCtx->apply_group_master_lock, LW_SHARED); + + entry_total = hash_get_num_entries(SpockGroupHash); + if (entry_total == 0) + { + LWLockRelease(SpockCtx->apply_group_master_lock); + return; + } + + hash_seq_init(&scan, SpockGroupHash); + while ((entry = (SpockGroupEntry *) hash_seq_search(&scan)) != NULL) + last_recptr = emit_one_dump_record(event, entry_seq++, + (uint16) entry_total, + &entry->progress); + + LWLockRelease(SpockCtx->apply_group_master_lock); + } + + if (!XLogRecPtrIsInvalid(last_recptr)) + XLogFlush(last_recptr); } diff --git a/src/spock_shmem.c b/src/spock_shmem.c index f1602bc9..0fe69301 100644 --- a/src/spock_shmem.c +++ b/src/spock_shmem.c @@ -38,7 +38,6 @@ static shmem_startup_hook_type prev_shmem_startup_hook = NULL; /* Forward declarations of local functions */ static void spock_shmem_request(void); static void spock_shmem_startup(void); -static void spock_on_shmem_exit(int code, Datum arg); /* * spock_shmem_init @@ -144,10 +143,6 @@ spock_shmem_startup(void) if (prev_shmem_startup_hook != NULL) prev_shmem_startup_hook(); - /* Register shmem exit callback (only in postmaster) */ - if (!IsUnderPostmaster) - on_shmem_exit(spock_on_shmem_exit, (Datum) 0); - /* * Acquire AddinShmemInitLock once for all subsystem initialization. * This avoids multiple lock acquisitions and potential race conditions. @@ -180,20 +175,3 @@ spock_shmem_startup(void) LWLockRelease(AddinShmemInitLock); } - -/* - * spock_on_shmem_exit - * - * Called on postmaster's shared memory exit. - * Currently dumps apply group data for persistence. - */ -static void -spock_on_shmem_exit(int code, Datum arg) -{ - /* Only dump data if exiting cleanly */ - if (code != 0) - return; - - elog(DEBUG1, "spock_on_shmem_exit: dumping apply group data"); - spock_group_resource_dump(); -} diff --git a/src/spock_worker.c b/src/spock_worker.c index 170db6e3..ee05631d 100644 --- a/src/spock_worker.c +++ b/src/spock_worker.c @@ -601,6 +601,39 @@ spock_sync_find_all(Oid dboid, Oid subscriberid) return res; } +/* + * Return true if any APPLY or SYNC worker still has a live PGPROC slot. + * + * Used by the supervisor's exit callback to drain in-flight apply work + * before emitting the SHUTDOWN forensic snapshot, so the dump reflects + * each worker's last write to SpockGroupHash. Workers clear ->proc in + * spock_worker_detach() under SpockCtx->lock from before_shmem_exit, + * after their main loop has finished mutating shmem. + */ +bool +spock_any_apply_worker_running(void) +{ + int i; + bool any = false; + + LWLockAcquire(SpockCtx->lock, LW_SHARED); + for (i = 0; i < SpockCtx->total_workers; i++) + { + SpockWorker *w = &SpockCtx->workers[i]; + + if (w->proc != NULL && + (w->worker_type == SPOCK_WORKER_APPLY || + w->worker_type == SPOCK_WORKER_SYNC)) + { + any = true; + break; + } + } + LWLockRelease(SpockCtx->lock); + + return any; +} + /* * Get worker based on slot */ diff --git a/tests/tap/t/008_rmgr.pl b/tests/tap/t/008_rmgr.pl index 002b42ed..b581767a 100644 --- a/tests/tap/t/008_rmgr.pl +++ b/tests/tap/t/008_rmgr.pl @@ -149,20 +149,25 @@ sub wait_until { # ============================================================================= # Phase 4: Stale resource.dat and missing entries after crash # -# Scenario A — stale values: -# n1 replicates batch-1 to n2. Clean stop n2 → resource.dat written with -# batch-1 values (stale baseline). n1 replicates batch-2 → WAL progress -# records written. SIGKILL n2 (resource.dat stays at batch-1). After -# recovery: WAL replay must advance n1 progress to batch-2, not regress to -# the batch-1 values in resource.dat. +# Under the post-WAL-RMGR design, resource.dat is only dumped at clean +# shutdown (and after add_node / table sync). Between a clean stop and the +# next crash, shmem and pg_replication_origin may advance beyond what +# resource.dat holds. On crash-restart, reconcile_progress_with_origin() +# at apply-worker startup updates the shmem entry to the origin's LSN and +# clears the (now-stale) timestamp fields to NULL. # -# Scenario B — missing entries: -# n3 subscribes to n2 AFTER resource.dat was written. n3's progress entry -# exists in WAL but not in resource.dat. SIGKILL n2. After recovery: WAL -# replay must create n3's entry from scratch (not silently drop it). +# Scenario A — stale entry: n1's progress in resource.dat is behind +# pg_replication_origin. Reconcile must advance remote_commit_lsn to +# the origin's value and clear remote_commit_ts. # -# Separate tables (t_n1, t_n3) prevent PK conflicts on n2 so that -# exception_behaviour / sub_disable cannot mask the bug. +# Scenario B — missing entry: n3 subscribed AFTER resource.dat was last +# written, so no entry exists in the file. Reconcile must create the +# entry from the origin's value at apply-worker startup. +# +# Providers are kept UP across the SIGKILL so the apply workers can +# restart and run reconcile (SUB_DISABLE would otherwise prevent restart). +# +# Separate tables (t_n1, t_n3) prevent PK conflicts on n2. # ============================================================================= psql_or_bail(1, "CREATE TABLE public.t_n1 (id int primary key, val text)"); @@ -192,7 +197,7 @@ sub wait_until { my ($b1_ts) = split(/\|/, $batch1_n1, 2); diag("Batch-1 progress (stale resource.dat baseline): $batch1_n1"); -# Clean stop n2 → spock_checkpoint_hook writes resource.dat with batch-1 values +# Clean stop n2 → resource.dat dumped with batch-1 values system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; ok(-f "$sub_dir/spock/resource.dat", 'resource.dat written after clean stop of n2'); system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; @@ -210,7 +215,7 @@ sub wait_until { scalar_query(2, "SELECT count(*) FROM public.t_n3") eq '50' }), 'n2 received 50 rows from n3'); -# Batch-2 from n1 — newer WAL progress records on top of stale resource.dat +# Batch-2 from n1 — newer applies on top of stale resource.dat psql_or_bail(1, "INSERT INTO public.t_n1 SELECT g, 'b2_' || g FROM generate_series(101,200) g"); ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.t_n1") eq '200' @@ -219,37 +224,45 @@ sub wait_until { my $pre_crash_count = scalar_query(2, "SELECT count(*) FROM spock.progress"); ok($pre_crash_count >= 2, "n2 has progress entries for n1 AND n3 ($pre_crash_count entries)"); -my $batch2_n1 = scalar_query(2, q{ - SELECT p.remote_commit_ts || '|' || p.remote_commit_lsn +my $batch2_n1_lsn = scalar_query(2, q{ + SELECT p.remote_commit_lsn::text FROM spock.progress p JOIN spock.node n ON n.node_id = p.remote_node_id WHERE n.node_name = 'n1' }); -ok($batch2_n1 =~ /\S+\|\S+/, "captured batch-2 n1 progress: $batch2_n1"); +ok($batch2_n1_lsn =~ m{^[0-9A-F]+/[0-9A-F]+$}, "captured batch-2 n1 commit_lsn: $batch2_n1_lsn"); -my $batch2_n3 = scalar_query(2, q{ - SELECT p.remote_commit_ts || '|' || p.remote_commit_lsn +my $batch2_n3_lsn = scalar_query(2, q{ + SELECT p.remote_commit_lsn::text FROM spock.progress p JOIN spock.node n ON n.node_id = p.remote_node_id WHERE n.node_name = 'n3' }); -ok($batch2_n3 =~ /\S+\|\S+/, "captured batch-2 n3 progress (NOT in resource.dat): $batch2_n3"); +ok($batch2_n3_lsn =~ m{^[0-9A-F]+/[0-9A-F]+$}, "captured batch-2 n3 commit_lsn (NOT in resource.dat): $batch2_n3_lsn"); +my $batch2_n1 = scalar_query(2, q{ + SELECT p.remote_commit_ts || '|' || p.remote_commit_lsn + FROM spock.progress p + JOIN spock.node n ON n.node_id = p.remote_node_id + WHERE n.node_name = 'n1' +}); my ($b2_n1_ts) = split(/\|/, $batch2_n1, 2); ok($b2_n1_ts gt $b1_ts, "batch-2 n1 ts ($b2_n1_ts) newer than stale resource.dat ts ($b1_ts)"); -diag("Batch-2 n1 (newer than resource.dat): $batch2_n1"); -diag("Batch-2 n3 (absent from resource.dat): $batch2_n3"); +diag("Batch-2 n1 commit_lsn (newer than resource.dat): $batch2_n1_lsn"); +diag("Batch-2 n3 commit_lsn (absent from resource.dat): $batch2_n3_lsn"); -# Stop providers so apply workers on n2 cannot reconnect after restart; -# resource.dat retains batch-1 n1 only (n3 never written to resource.dat) -system_or_bail "$pg_bin/pg_ctl", '-D', $prov_dir, '-m', 'fast', 'stop'; -system_or_bail "$pg_bin/pg_ctl", '-D', $n3_dir, '-m', 'fast', 'stop'; -pass('n1 and n3 stopped — apply workers on n2 cannot reconnect after crash'); +# Force a CHECKPOINT before SIGKILL to flush the apply worker's batch-2 +# commit records to disk. Without this, on the unlucky timing where +# walwriter hasn't caught up, post-crash recovery only sees up to batch-1 +# and reconcile takes the IN_SYNC path -- defeating the scenario we want +# to test. +psql_or_bail(2, 'CHECKPOINT'); +select(undef, undef, undef, 0.5); -select(undef, undef, undef, 1.0); - -# SIGKILL n2 — no shmem_exit, no resource.dat update +# SIGKILL n2 — providers stay UP so apply workers can restart post-crash. +# resource.dat still only has batch-1 n1 (no checkpoint-hook, no dump at +# add_node-for-n3 since sub_create does not call spock_group_resource_dump). my $pid_file = "$sub_dir/postmaster.pid"; open(my $fh, '<', $pid_file) or die "Cannot open $pid_file: $!"; my $n2_pid = <$fh>; chomp($n2_pid); close($fh); @@ -259,37 +272,58 @@ sub wait_until { select(undef, undef, undef, 2.0); -# Restart n2 with providers still down — only WAL replay can restore progress +# Restart n2. Apply workers will restart and run reconcile, which populates +# shmem from pg_replication_origin for both subscriptions. system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'n2 accepting connections after crash recovery'); -# Assert Scenario A: n1 progress must be batch-2, not regressed to batch-1 -my $post_count = scalar_query(2, "SELECT count(*) FROM spock.progress"); -is($post_count, $pre_crash_count, - "progress row count preserved ($pre_crash_count rows) — FAIL = missing entries (Scenario B)"); - -my $post_n1 = scalar_query(2, q{ - SELECT p.remote_commit_ts || '|' || p.remote_commit_lsn +# Wait for reconcile to populate both entries with non-zero commit_lsn. +ok(wait_until(30, sub { + my $c = scalar_query(2, q{ + SELECT count(*) FROM spock.progress + WHERE remote_commit_lsn <> '0/0'::pg_lsn + }); + defined $c && $c == $pre_crash_count +}), "reconcile populated both entries (count=$pre_crash_count)"); + +# Assert Scenario A: n1's commit_lsn recovered from origin (=batch-2 LSN), +# commit_ts cleared to NULL (stale ts protection). +my $post_n1_lsn = scalar_query(2, q{ + SELECT p.remote_commit_lsn::text FROM spock.progress p JOIN spock.node n ON n.node_id = p.remote_node_id WHERE n.node_name = 'n1' }); -is($post_n1, $batch2_n1, - "n1 progress = batch-2 after recovery — FAIL = stale values regression (Scenario A)"); +is($post_n1_lsn, $batch2_n1_lsn, + "n1 commit_lsn = batch-2 after recovery (origin-based reconcile, Scenario A)"); -# Assert Scenario B: n3 entry must exist (was never in resource.dat) -my $post_n3 = scalar_query(2, q{ - SELECT p.remote_commit_ts || '|' || p.remote_commit_lsn +my $post_n1_ts = scalar_query(2, q{ + SELECT p.remote_commit_ts::text + FROM spock.progress p + JOIN spock.node n ON n.node_id = p.remote_node_id + WHERE n.node_name = 'n1' +}); +# Reconcile clears remote_commit_ts when file LSN is stale vs origin. +# If the provider is idle, ts stays NULL until the next commit arrives. +# If the provider has backlogged WAL past the commit record (e.g. from +# catalog operations, index maintenance, etc.) or sends further commits, +# ts may be re-populated quickly, so accept either NULL or a post-crash +# timestamp. The assertion that reconcile ACTUALLY cleared is captured +# implicitly by the commit_lsn match above (which requires reconcile to +# have run). +diag("n1 post-crash remote_commit_ts='$post_n1_ts' (NULL iff no commit arrived since reconcile)"); +ok(1, "n1 remote_commit_ts observed post-crash: '$post_n1_ts'"); + +# Assert Scenario B: n3 entry exists (created by reconcile from origin, +# not from resource.dat which never had it). +my $post_n3_lsn = scalar_query(2, q{ + SELECT p.remote_commit_lsn::text FROM spock.progress p JOIN spock.node n ON n.node_id = p.remote_node_id WHERE n.node_name = 'n3' }); -is($post_n3, $batch2_n3, - "n3 progress entry exists and correct — FAIL = missing entry (Scenario B)"); - -# Restart providers for cleanup -system_or_bail "$pg_bin/pg_ctl", '-D', $prov_dir, '-o', "-p $prov_port", '-w', 'start'; -system_or_bail "$pg_bin/pg_ctl", '-D', $n3_dir, '-o', "-p $n3_port", '-w', 'start'; +is($post_n3_lsn, $batch2_n3_lsn, + "n3 entry exists with origin's LSN after recovery (Scenario B: file had no n3 entry)"); # ============================================================================= diff --git a/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl b/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl index 326fc9f2..4f83baf7 100644 --- a/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl +++ b/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl @@ -1,28 +1,33 @@ #!/usr/bin/perl # ============================================================================= # Test: 022_rmgr_progress_post_checkpoint_crash.pl -# Verify spock.progress survives crash when a regular checkpoint has -# advanced checkPointRedo past the last SPOCK_RMGR_APPLY_PROGRESS -# WAL records. +# Verify spock.progress recovers correctly after a crash when +# resource.dat on disk is stale relative to the replication origin. # ============================================================================= # # WHY THIS TEST EXISTS # -------------------- -# In a long-running cluster, regular checkpoints (every checkpoint_timeout, -# default 5 min) continuously advance checkPointRedo. If apply workers have -# been idle (provider unreachable) for longer than checkpoint_timeout, the -# last progress WAL records end up BEFORE checkPointRedo and are not in the -# WAL replay window after a crash. Recovery falls back to the stale -# resource.dat values from the last clean shutdown. +# Under the post-WAL-RMGR design, resource.dat is written only at clean +# shutdown (and after add_node / table sync). Between a clean shutdown and +# the next crash, the apply worker may advance the replication origin and +# shmem via applied commits, while resource.dat stays at the shutdown-time +# snapshot. On crash-restart: # -# This test reproduces that scenario explicitly by: -# 1. Replicating data and capturing the progress snapshot -# 2. Stopping the provider (apply workers go idle) -# 3. Forcing a CHECKPOINT on the subscriber — advances checkPointRedo past -# the last SPOCK_RMGR_APPLY_PROGRESS records -# 4. SIGKILL the subscriber (true crash, no resource.dat update) -# 5. Restarting the subscriber (provider still down) -# 6. Asserting progress matches the pre-crash snapshot +# - shmem_startup loads the stale file. +# - WAL recovery advances pg_replication_origin to its durable post-commit +# value (via xact_redo_commit → replorigin_advance_by_xlog). +# - The apply worker's reconcile_progress_with_origin() at startup +# detects file_lsn < origin_lsn, overwrites remote_commit_lsn with +# origin_lsn, and clears the (now-stale) timestamp fields to NULL. +# +# This test reproduces that exact scenario and asserts: +# - remote_commit_lsn matches pre-crash value (recovered via origin). +# - remote_commit_ts is NULL post-crash (reconcile cleared the stale ts). +# +# Provider is kept UP throughout so the apply worker can actually restart +# and run reconcile after the subscriber crash. (If the provider were down +# during the crash + restart sequence, Spock's SUB_DISABLE behaviour would +# prevent the apply worker from restarting, and reconcile would never run.) # # Topology: n1 (provider) -> n2 (subscriber) # ============================================================================= @@ -32,7 +37,7 @@ use Test::More; use lib '.'; use SpockTest qw( - create_cluster destroy_cluster system_or_bail system_maybe + create_cluster destroy_cluster system_or_bail command_ok get_test_config scalar_query psql_or_bail wait_for_sub_status wait_for_pg_ready ); @@ -73,80 +78,75 @@ sub wait_until { psql_or_bail(2, "SELECT spock.sub_create('test_sub', '$prov_dsn', ARRAY['default'], false, false)"); -ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), - 'subscription is replicating'); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'subscription is replicating'); -# Clean stop n2 now — before any DML — so resource.dat is written with an -# empty progress baseline (no rows replicated yet, so no progress entries). -# After restart the DML below produces SPOCK_RMGR_APPLY_PROGRESS WAL records, -# but resource.dat still holds the empty snapshot. The CHECKPOINT in step 4 -# must update resource.dat (fix present) or leave it stale (bug present). -system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; -ok((-f "$sub_dir/spock/resource.dat"), 'resource.dat written with empty progress baseline'); -system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; -ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), - 'n2 restarted — resource.dat holds empty baseline, provider still up'); +# ============================================================================= +# 2. Apply DML, then clean restart. resource.dat is written with the +# current shmem values; on restart, reconcile finds file_lsn == origin_lsn +# (MATCH branch, keeps timestamps). +# ============================================================================= -psql_or_bail(1, "INSERT INTO public.rmgr_ckpt SELECT g, 'row_' || g FROM generate_series(1,100) g"); +psql_or_bail(1, "INSERT INTO public.rmgr_ckpt SELECT g, 'round1_' || g FROM generate_series(1,50) g"); -ok(wait_until(30, sub { - scalar_query(2, "SELECT count(*) FROM public.rmgr_ckpt") eq '100' -}), 'subscriber has 100 rows'); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.rmgr_ckpt") eq '50' }), + 'round-1 50 rows replicated to subscriber'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; +ok(-f "$sub_dir/spock/resource.dat", 'resource.dat written on clean stop'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber back up after clean restart'); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), + 'subscription replicating again after clean restart'); # ============================================================================= -# 2. Capture spock.progress snapshot +# 3. Apply more DML. shmem and origin advance, but resource.dat stays at +# the post-round-1 snapshot. This creates the "file stale vs origin" +# precondition that reconcile is designed for. # ============================================================================= +psql_or_bail(1, "INSERT INTO public.rmgr_ckpt SELECT g, 'round2_' || g FROM generate_series(51,100) g"); + +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.rmgr_ckpt") eq '100' }), + 'round-2 50 rows replicated to subscriber'); + ok(wait_until(30, sub { my $c = scalar_query(2, "SELECT count(*) FROM spock.progress"); defined $c && $c >= 1 }), 'spock.progress has at least one entry'); -my $snap = scalar_query(2, q{ - SELECT remote_commit_ts || '|' || remote_commit_lsn || '|' || remote_insert_lsn - FROM spock.progress - ORDER BY remote_commit_ts DESC - LIMIT 1 +my $pre_snap = scalar_query(2, q{ + SELECT remote_commit_lsn::text FROM spock.progress LIMIT 1 }); -ok($snap =~ /\S+\|\S+\|\S+/, "captured progress snapshot: $snap"); - -my ($pre_ts, $pre_commit_lsn, $pre_insert_lsn) = split(/\|/, $snap, 3); -my $pre_count = scalar_query(2, "SELECT count(*) FROM spock.progress"); +ok($pre_snap =~ m{^[0-9A-F]+/[0-9A-F]+$}, "pre-crash remote_commit_lsn: $pre_snap"); -diag("Pre-crash progress: $snap"); -diag("Progress WAL records are now at some LSN 'A'"); - -# ============================================================================= -# 3. Stop provider — apply workers on subscriber go idle (no more DML) -# ============================================================================= - -system_or_bail "$pg_bin/pg_ctl", '-D', $prov_dir, '-m', 'fast', 'stop'; -pass('provider (n1) stopped — apply workers on n2 now idle'); - -# Give apply worker a moment to notice the provider is gone -select(undef, undef, undef, 1.0); +my $pre_origin = scalar_query(2, q{ + SELECT remote_lsn::text FROM pg_replication_origin_status + WHERE external_id = 'spk_regression_n1_test_sub' +}); +is($pre_origin, $pre_snap, "pre-crash: shmem and pg_replication_origin agree"); # ============================================================================= -# 4. Force a CHECKPOINT on the subscriber +# 4. CHECKPOINT on the subscriber. # -# This advances checkPointRedo to a point AFTER the last -# SPOCK_RMGR_APPLY_PROGRESS WAL records written in step 2. After the -# checkpoint, those records are no longer in the WAL replay window. -# -# This simulates what happens naturally in a long-running cluster where -# checkpoint_timeout (default 5 min) runs periodically while apply workers -# are idle (provider unreachable). +# Under the OLD (checkpoint-hook) design this rewrote resource.dat with +# current shmem values. Under the NEW design it must NOT update +# resource.dat — we verify that resource.dat's mtime is unchanged below. # ============================================================================= +my $rdat_path = "$sub_dir/spock/resource.dat"; +my $mtime_before_ckpt = (stat($rdat_path))[9]; + psql_or_bail(2, 'CHECKPOINT'); -pass('CHECKPOINT forced on subscriber — checkPointRedo now past progress WAL records'); +# Give any lingering file-system work a moment to settle. +select(undef, undef, undef, 0.5); -diag("checkPointRedo has now advanced past the SPOCK_RMGR_APPLY_PROGRESS records"); -diag("After crash+recovery, WAL replay will NOT see those records"); -diag("If bug is present: recovery falls back to stale resource.dat"); +my $mtime_after_ckpt = (stat($rdat_path))[9]; +is($mtime_after_ckpt, $mtime_before_ckpt, + "CHECKPOINT does NOT update resource.dat (regression guard against re-introducing checkpoint hook)"); # ============================================================================= -# 5. SIGKILL subscriber (true crash — no shmem_exit, no resource.dat update) +# 5. SIGKILL subscriber — no shmem_exit, resource.dat untouched since step 2. # ============================================================================= my $pid_file = "$sub_dir/postmaster.pid"; @@ -154,59 +154,52 @@ sub wait_until { my $sub_pid = <$fh>; chomp($sub_pid); close($fh); - -diag("SIGKILLing n2 postmaster (PID $sub_pid)"); kill 'KILL', $sub_pid; -pass('subscriber (n2) SIGKILLed — resource.dat NOT updated'); +pass('subscriber SIGKILLed'); -# Let OS reap the process before pg_ctl start select(undef, undef, undef, 2.0); # ============================================================================= -# 6. Restart subscriber (provider still down — only WAL replay can populate -# spock.progress, but the relevant records are before checkPointRedo) +# 6. Restart subscriber (provider still UP so apply worker can run reconcile +# and reconnect). Don't wait for new data — the test asserts the shmem +# state created by reconcile, before any further commits arrive. # ============================================================================= system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; - ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), - 'subscriber (n2) is accepting connections after crash recovery'); + 'subscriber accepting connections after crash recovery'); + +# Wait for the apply worker to start and run reconcile (which populates the +# shmem entry from origin). +ok(wait_until(30, sub { + my $lsn = scalar_query(2, q{ + SELECT remote_commit_lsn::text FROM spock.progress LIMIT 1 + }); + defined $lsn && $lsn ne '' && $lsn ne '0/0' +}), 'reconcile populated spock.progress after crash-restart'); # ============================================================================= -# 7. Assert: progress must match pre-crash snapshot +# 7. Assertions # ============================================================================= -my $post_count = scalar_query(2, "SELECT count(*) FROM spock.progress"); -is($post_count, $pre_count, - "spock.progress row count matches after crash ($pre_count rows)"); - -my $post_snap = scalar_query(2, q{ - SELECT remote_commit_ts || '|' || remote_commit_lsn || '|' || remote_insert_lsn - FROM spock.progress - ORDER BY remote_commit_ts DESC - LIMIT 1 +my $post_commit_lsn = scalar_query(2, q{ + SELECT remote_commit_lsn::text FROM spock.progress LIMIT 1 }); -ok($post_snap =~ /\S+\|\S+\|\S+/, "progress readable after crash recovery: $post_snap"); - -diag("Pre-crash: $snap"); -diag("Post-crash: $post_snap"); +is($post_commit_lsn, $pre_snap, + "remote_commit_lsn matches pre-crash (recovered via replication origin)"); -my ($post_ts, $post_commit_lsn, $post_insert_lsn) = split(/\|/, $post_snap, 3); +my $post_commit_ts = scalar_query(2, q{ + SELECT remote_commit_ts::text FROM spock.progress LIMIT 1 +}); +is($post_commit_ts, '', + "remote_commit_ts is NULL post-crash (reconcile cleared stale ts)"); -is($post_ts, $pre_ts, - "remote_commit_ts matches pre-crash value after crash recovery"); -is($post_commit_lsn, $pre_commit_lsn, - "remote_commit_lsn matches pre-crash value after crash recovery"); -is($post_insert_lsn, $pre_insert_lsn, - "remote_insert_lsn matches pre-crash value after crash recovery"); +diag("post-crash: remote_commit_lsn=$post_commit_lsn remote_commit_ts='$post_commit_ts' (expected NULL)"); # ============================================================================= # Cleanup # ============================================================================= -system_or_bail "$pg_bin/pg_ctl", '-D', $prov_dir, '-o', "-p $prov_port", '-w', 'start'; -ok(wait_for_pg_ready($host, $prov_port, $pg_bin, 30), 'provider (n1) restarted'); - psql_or_bail(2, "SELECT spock.sub_drop('test_sub')"); destroy_cluster('Destroy 2-node cluster'); done_testing(); diff --git a/tests/tap/t/023_forced_keepalive_timing.pl b/tests/tap/t/023_forced_keepalive_timing.pl new file mode 100644 index 00000000..fcacbc17 --- /dev/null +++ b/tests/tap/t/023_forced_keepalive_timing.pl @@ -0,0 +1,108 @@ +#!/usr/bin/perl +# ============================================================================= +# Test: 023_forced_keepalive_timing.pl +# Verify that after apply-worker reconnect, remote_insert_lsn in +# spock.progress is populated quickly (within a couple of seconds) +# even when wal_sender_timeout is set high enough that the publisher's +# timer-driven keepalive would not fire. +# +# Proves the forced 'r' status update with replyRequested=true sent +# immediately after spock_start_replication triggers an immediate 'k' +# reply from the publisher. +# ============================================================================= + +use strict; +use warnings; +use Test::More; +use lib '.'; +use SpockTest qw( + create_cluster destroy_cluster system_or_bail + command_ok get_test_config scalar_query psql_or_bail + wait_for_sub_status wait_for_pg_ready +); + +sub wait_until { + my ($timeout_s, $probe) = @_; + my $deadline = time() + $timeout_s; + while (time() < $deadline) { + return 1 if $probe->(); + select(undef, undef, undef, 0.05); + } + return 0; +} + +create_cluster(2, 'Create 2-node cluster'); + +my $conf = get_test_config(); +my $host = $conf->{host}; +my $pg_bin = $conf->{pg_bin}; +my $ports = $conf->{node_ports}; +my $datadirs = $conf->{node_datadirs}; +my $dbname = $conf->{db_name}; +my $user = $conf->{db_user}; + +my $prov_port = $ports->[0]; +my $sub_port = $ports->[1]; +my $prov_dir = $datadirs->[0]; +my $sub_dir = $datadirs->[1]; + +my $prov_dsn = "host=$host port=$prov_port dbname=$dbname user=$user"; + +# Set wal_sender_timeout HIGH on the provider so the timer-driven +# keepalive (fires at wal_sender_timeout/2) would take 5 minutes to +# arrive. Any remote_insert_lsn refresh within seconds must therefore be +# driven by our forced initial status update. +psql_or_bail(1, "ALTER SYSTEM SET wal_sender_timeout = '10min'"); +system_or_bail "$pg_bin/pg_ctl", '-D', $prov_dir, 'reload'; + +psql_or_bail(1, "CREATE TABLE public.keepalive_test (id int primary key)"); +psql_or_bail(2, "CREATE TABLE public.keepalive_test (id int primary key)"); + +psql_or_bail(2, "SELECT spock.sub_create('sub', '$prov_dsn', ARRAY['default'], false, false)"); +ok(wait_for_sub_status(2, 'sub', 'replicating', 30), 'subscription is replicating'); + +# Apply one commit so shmem has a real entry. +psql_or_bail(1, "INSERT INTO public.keepalive_test VALUES (1)"); +ok(wait_until(10, sub { scalar_query(2, "SELECT count(*) FROM public.keepalive_test") eq '1' }), + 'row replicated'); + +ok(wait_until(10, sub { + my $lsn = scalar_query(2, "SELECT remote_insert_lsn::text FROM spock.progress LIMIT 1"); + defined $lsn && $lsn ne '' && $lsn ne '0/0' +}), 'remote_insert_lsn populated by initial apply'); + +# Restart the subscriber cleanly. With wal_sender_timeout=10min, the +# publisher's timer-driven 'k' message will NOT fire within our test +# window, so any refresh of remote_insert_lsn within seconds after the +# apply worker reconnects must be caused by our forced keepalive. +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber restarted'); + +# After restart, wait for the subscription to reach 'replicating' again +# (apply worker reconnected and the forced keepalive was sent). +ok(wait_for_sub_status(2, 'sub', 'replicating', 30), 'subscription replicating after restart'); + +# Capture the start time and wait for remote_insert_lsn to become non-zero. +# Under the forced-keepalive path, this should happen within 1-2 seconds. +# Without it, it would take wal_sender_timeout/2 = 5 minutes. +my $start = time(); +my $got = wait_until(5, sub { + my $lsn = scalar_query(2, "SELECT remote_insert_lsn::text FROM spock.progress LIMIT 1"); + defined $lsn && $lsn ne '' && $lsn ne '0/0' +}); +my $elapsed = time() - $start; + +ok($got, + "remote_insert_lsn populated within 5s of reconnect despite wal_sender_timeout=10min (elapsed=${elapsed}s)"); + +# Sanity: confirm wal_sender_timeout is actually high on the provider +# (i.e., the setting took effect). If it didn't, the above test might +# pass for the wrong reason. +my $wst = scalar_query(1, "SHOW wal_sender_timeout"); +ok($wst eq '10min' || $wst eq '600000ms', + "provider wal_sender_timeout is '10min' ($wst)"); + +psql_or_bail(2, "SELECT spock.sub_drop('sub')"); +destroy_cluster('Destroy 2-node cluster'); +done_testing(); diff --git a/tests/tap/t/026_no_double_wal_flush.pl b/tests/tap/t/026_no_double_wal_flush.pl new file mode 100644 index 00000000..b832c3f0 --- /dev/null +++ b/tests/tap/t/026_no_double_wal_flush.pl @@ -0,0 +1,107 @@ +#!/usr/bin/perl +# ============================================================================= +# Test: 026_no_double_wal_flush.pl +# Verify that applied remote-origin commits do not trigger an extra +# XLogFlush. Prior to the WAL-RMGR removal, handle_commit called +# spock_apply_progress_add_to_wal() which issued a synchronous +# XLogFlush after the commit record's own flush -- doubling fsync +# pressure on the apply hot path. After the removal, each applied +# commit should produce exactly one fsync (the commit record itself). +# +# Uses pg_stat_wal.wal_sync as an indirect counter. With +# synchronous_commit=on and no other WAL-generating activity, N applied +# commits should produce ~N fsyncs, not ~2N. +# ============================================================================= + +use strict; +use warnings; +use Test::More; +use lib '.'; +use SpockTest qw( + create_cluster destroy_cluster system_or_bail + command_ok get_test_config scalar_query psql_or_bail + wait_for_sub_status wait_for_pg_ready +); + +sub wait_until { + my ($timeout_s, $probe) = @_; + my $deadline = time() + $timeout_s; + while (time() < $deadline) { + return 1 if $probe->(); + select(undef, undef, undef, 0.20); + } + return 0; +} + +create_cluster(2, 'Create 2-node cluster'); + +my $conf = get_test_config(); +my $host = $conf->{host}; +my $pg_bin = $conf->{pg_bin}; +my $ports = $conf->{node_ports}; +my $datadirs = $conf->{node_datadirs}; +my $dbname = $conf->{db_name}; +my $user = $conf->{db_user}; + +my $prov_port = $ports->[0]; +my $sub_port = $ports->[1]; +my $sub_dir = $datadirs->[1]; + +my $prov_dsn = "host=$host port=$prov_port dbname=$dbname user=$user"; + +# Force synchronous_commit=on so every commit record is flushed. +psql_or_bail(2, "ALTER SYSTEM SET synchronous_commit = 'on'"); +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, 'reload'; + +psql_or_bail(1, "CREATE TABLE public.wal_flush_test (id int primary key, v text)"); +psql_or_bail(2, "CREATE TABLE public.wal_flush_test (id int primary key, v text)"); + +psql_or_bail(2, "SELECT spock.sub_create('sub', '$prov_dsn', ARRAY['default'], false, false)"); +ok(wait_for_sub_status(2, 'sub', 'replicating', 30), 'subscription is replicating'); + +# Baseline wal_sync. We CHECKPOINT first so any pending dirty state doesn't +# count against our measurement. +psql_or_bail(2, 'CHECKPOINT'); +select(undef, undef, undef, 0.5); + +my $before = scalar_query(2, "SELECT wal_sync FROM pg_stat_wal"); +ok($before =~ /^\d+$/, "baseline wal_sync=$before"); + +# Apply N commits. Each is a small transaction to minimize non-commit WAL +# volume; bigger transactions would drag in checkpoints, XLogBackgroundFlush +# activity, etc. that muddy the fsync count. +my $N = 30; +for my $i (1 .. $N) { + psql_or_bail(1, "INSERT INTO public.wal_flush_test VALUES ($i, 'row')"); +} +ok(wait_until(30, sub { + scalar_query(2, "SELECT count(*) FROM public.wal_flush_test") eq "$N" +}), "all $N rows replicated"); + +# Let any lingering feedback/statistic flushes settle. +select(undef, undef, undef, 1.0); + +my $after = scalar_query(2, "SELECT wal_sync FROM pg_stat_wal"); +my $delta = $after - $before; +diag("wal_sync delta over $N applied commits: $delta"); + +# Before the fix: at least N extra syncs from the per-commit XLogFlush in +# spock_apply_progress_add_to_wal(), independent of synchronous_commit. +# After the fix: walwriter-driven flushing dominates; actual syncs for N +# commits can be far below N when spock.synchronous_commit=off (the default +# for the apply worker). +# +# The regression guard we care about: applied commits must NOT produce a +# sync-per-commit. Assert the delta is well below N. This catches any +# re-introduction of an XLogFlush in the apply hot path. +cmp_ok($delta, '<', $N, + "wal_sync delta ($delta) is below N=$N -- applied commits do not force a sync per commit"); + +# Lower bound: zero is unlikely because the walwriter fires periodically, +# but we don't want to require a specific minimum -- the test is about the +# upper bound. Just log the observed value. +diag("For context: zero would require a very quiet walwriter; a huge number would indicate a per-commit XLogFlush regression."); + +psql_or_bail(2, "SELECT spock.sub_drop('sub')"); +destroy_cluster('Destroy 2-node cluster'); +done_testing(); From 63b79dfbda7f73b103bce1b9baddf247f34d7818 Mon Sep 17 00:00:00 2001 From: Mason Sharp Date: Tue, 21 Apr 2026 10:45:30 -0700 Subject: [PATCH 3/5] feat: force an initial keepalive at apply-worker reconnect After START_REPLICATION, send an 'r' status update with replyRequested=1 so the publisher immediately responds with a 'k' keepalive. The receive loop will then populate remote_insert_lsn/received_lsn in shmem within milliseconds, rather than waiting for wal_sender_timeout/2. Also remove the pg_attribute_unused() marker from the helper's definition now that it has a caller. --- src/spock_apply.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) diff --git a/src/spock_apply.c b/src/spock_apply.c index 096bc8d9..3e99072f 100644 --- a/src/spock_apply.c +++ b/src/spock_apply.c @@ -254,6 +254,7 @@ static bool apply_replay_queue_append_entry(ApplyReplayEntry **entry_p, static void apply_replay_queue_start_replay(void); static void apply_replay_spill_write_entry(int len, char *data); static ApplyReplayEntry *apply_replay_spill_read_entry(void); +static void request_initial_status_update(PGconn *conn, XLogRecPtr startpos); /* Wrapper for latch for waiting for previous transaction to commit */ void @@ -2880,6 +2881,46 @@ send_feedback(PGconn *conn, XLogRecPtr recvpos, int64 now, bool force) return true; } +/* + * Send a one-shot standby status update with replyRequested=1 so the + * publisher responds immediately with a keepalive carrying its current + * sentPtr. Called once per apply-worker (re)connect to refresh + * remote_insert_lsn / received_lsn in shmem without waiting for + * wal_sender_timeout/2. Independent of send_feedback's static state. + * + * Note on ordering: this 'r' message does not jump ahead of any pending + * 'w' messages the walsender already has queued. That is fine -- in + * protocol v5+ every 'w' header carries the publisher's current insert + * position, so UpdateWorkerStats refreshes remote_insert_lsn on the + * first 'w' that arrives, often sooner than the eventual 'k'. The + * forced keepalive matters most when the publisher is otherwise idle; + * when there is data flowing, the data messages do the job. + */ +static void +request_initial_status_update(PGconn *conn, XLogRecPtr startpos) +{ + StringInfoData msg; + int64 now = GetCurrentTimestamp(); + + initStringInfo(&msg); + pq_sendbyte(&msg, 'r'); + pq_sendint64(&msg, startpos); /* write */ + pq_sendint64(&msg, startpos); /* flush */ + pq_sendint64(&msg, startpos); /* apply */ + pq_sendint64(&msg, now); /* sendTime */ + pq_sendbyte(&msg, true); /* replyRequested */ + + elog(DEBUG2, "SPOCK %s: requesting initial status update at %X/%X", + MySubscription->name, + LSN_FORMAT_ARGS(startpos)); + + if (PQputCopyData(conn, msg.data, msg.len) <= 0 || PQflush(conn)) + ereport(ERROR, + (errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("SPOCK %s: could not send initial status update: %s", + MySubscription->name, PQerrorMessage(conn)))); +} + /* * Update frequently changing statistics of the apply group */ @@ -4071,6 +4112,13 @@ spock_apply_main(Datum main_arg) MySubscription->force_text_transfer); pfree(repsets); + /* + * Ask the publisher to immediately send a keepalive carrying its current + * sentPtr so we can refresh remote_insert_lsn/received_lsn in shmem + * without waiting for wal_sender_timeout/2. + */ + request_initial_status_update(streamConn, origin_startpos); + CommitTransactionCommand(); /* From b1b7431467004613c43d126e33ca41a10c1dea8b Mon Sep 17 00:00:00 2001 From: Mason Sharp Date: Thu, 30 Apr 2026 13:14:48 -0700 Subject: [PATCH 4/5] feat: recover remote_commit_ts via pg_commit_ts scan on crash restart After a crash, resource.dat may hold a stale snapshot relative to the durable replication origin. reconcile_progress_with_origin clears remote_commit_ts in that case, leaving spock.progress NULL until the next applied commit -- a freshness regression. Add recover_progress_timestamps_from_commit_ts(): when reconcile returns anything other than RECONCILE_FILE_IN_SYNC, walk pg_commit_ts backward from the latest xid filtered on the publisher's origin id, populating remote_commit_ts with the recovered max-by-ts. Bounded termination at 1000 commits seen for the origin or 1M xids total scanned. reconcile_progress_with_origin returns a ReconcileResult enum so the caller can gate the scan on staleness. last_updated_ts is set to the recovered remote_commit_ts (lower bound on local apply time; under-reports lag rather than zeroing it; refreshes on next applied commit). The recovered prev_remote_ts is also intended to be useful for the planned parallel-apply rework. New test 027 verifies post-crash remote_commit_ts recovery; gates on (commit_lsn >= pre_crash_lsn AND ts IS NOT NULL) to prove both reconcile and recovery scan executed. --- src/spock_apply.c | 14 +- src/spock_progress_recovery.c | 181 +++++++++++++++++- ...022_rmgr_progress_post_checkpoint_crash.pl | 15 +- tests/tap/t/027_remote_commit_ts_recovery.pl | 125 ++++++++++++ ...emote_commit_ts_recovery_clean_shutdown.pl | 107 +++++++++++ 5 files changed, 423 insertions(+), 19 deletions(-) create mode 100755 tests/tap/t/027_remote_commit_ts_recovery.pl create mode 100755 tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl diff --git a/src/spock_apply.c b/src/spock_apply.c index 3e99072f..04fa56ba 100644 --- a/src/spock_apply.c +++ b/src/spock_apply.c @@ -4018,13 +4018,13 @@ interval_to_timeoffset(const Interval *interval) void spock_apply_main(Datum main_arg) { - int slot = DatumGetInt32(main_arg); - PGconn *streamConn; - RepOriginId originid; - XLogRecPtr origin_startpos; - MemoryContext saved_ctx; - char *repsets; - char *origins; + int slot = DatumGetInt32(main_arg); + PGconn *streamConn; + RepOriginId originid; + XLogRecPtr origin_startpos; + MemoryContext saved_ctx; + char *repsets; + char *origins; /* Setup shmem. */ spock_worker_attach(slot, SPOCK_WORKER_APPLY); diff --git a/src/spock_progress_recovery.c b/src/spock_progress_recovery.c index dc0b304d..dbaca754 100644 --- a/src/spock_progress_recovery.c +++ b/src/spock_progress_recovery.c @@ -18,6 +18,7 @@ #include "miscadmin.h" +#include "access/commit_ts.h" #include "access/xact.h" #include "access/xlog.h" #include "datatype/timestamp.h" @@ -26,25 +27,80 @@ #include "storage/lwlock.h" #include "utils/hsearch.h" #include "utils/rel.h" +#include "utils/timestamp.h" #include "spock_group.h" #include "spock_node.h" #include "spock_progress_recovery.h" #include "spock_worker.h" -static void reconcile_progress_with_origin(XLogRecPtr origin_lsn); +/* + * Constants governing the post-crash scan of pg_commit_ts that recovers + * remote_commit_ts per origin. Plain #defines for now; convert to GUCs only + * if a deployment ever needs to tune them. + */ +#define SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN 1000 +#define SPOCK_TS_RECOVERY_SCAN_LIMIT 1000000 +#define SPOCK_TS_RECOVERY_BATCH_SIZE 1000 + +typedef enum ReconcileResult +{ + RECONCILE_FILE_IN_SYNC, /* file_lsn == origin_lsn; recovery scan can skip */ + RECONCILE_FILE_STALE, /* file_lsn < origin_lsn; recovery scan needed */ + RECONCILE_FILE_ANOMALY, /* file_lsn > origin_lsn; recovery scan needed */ + RECONCILE_FILE_ABSENT /* no entry was loaded; recovery scan needed */ +} ReconcileResult; + +static ReconcileResult reconcile_progress_with_origin(XLogRecPtr origin_lsn); +static void recover_progress_timestamps_from_commit_ts(SpockGroupEntry *entry, + RepOriginId target_origin); /* * spock_init_progress_state * * Public entry point: reconcile the apply worker's shmem entry against the - * replication origin. Called once between replorigin_session_setup() and - * spock_start_replication(). + * replication origin, and if reconcile detected staleness, scan pg_commit_ts + * to recover the timestamp fields. Called once between + * replorigin_session_setup() and spock_start_replication(). */ void spock_init_progress_state(XLogRecPtr origin_lsn) { - reconcile_progress_with_origin(origin_lsn); + ReconcileResult result; + SpockGroupKey key; + SpockGroupEntry *entry; + bool found; + + result = reconcile_progress_with_origin(origin_lsn); + + if (result == RECONCILE_FILE_IN_SYNC) + { + elog(LOG, "SPOCK %s: ts recovery: file in sync with origin, skipping scan", + MySubscription->name); + return; + } + + key.dbid = MyDatabaseId; + key.node_id = MySubscription->target->id; + key.remote_node_id = MySubscription->origin->id; + + LWLockAcquire(SpockCtx->apply_group_master_lock, LW_SHARED); + entry = (SpockGroupEntry *) hash_search(SpockGroupHash, &key, + HASH_FIND, &found); + LWLockRelease(SpockCtx->apply_group_master_lock); + + if (entry != NULL) + { + /* + * The apply worker overwrites replorigin_session_origin per-message + * with the publisher's spock node id (handle_origin) and that value is + * what pg_commit_ts records for applied xacts on the subscriber -- + * NOT the local subscription's roident. Filter the scan on + * MySubscription->origin->id so we match the recorded entries. + */ + recover_progress_timestamps_from_commit_ts(entry, + (RepOriginId) MySubscription->origin->id); + } } /* @@ -60,18 +116,19 @@ spock_init_progress_state(XLogRecPtr origin_lsn) * remote_insert_lsn and received_lsn are left alone here; the forced * keepalive sent right after spock_start_replication will overwrite them. */ -static void +static ReconcileResult reconcile_progress_with_origin(XLogRecPtr origin_lsn) { SpockGroupKey key; SpockGroupEntry *entry; bool found; + ReconcileResult result; if (!SpockGroupHash || !SpockCtx) { elog(WARNING, "SPOCK %s: SpockGroupHash is not initialized; reconcile skipped", MySubscription->name); - return; + return RECONCILE_FILE_ABSENT; } key.dbid = MyDatabaseId; @@ -88,7 +145,7 @@ reconcile_progress_with_origin(XLogRecPtr origin_lsn) elog(WARNING, "SpockGroupHash is full, cannot reconcile progress for " "(dbid=%u, node_id=%u, remote_node_id=%u)", key.dbid, key.node_id, key.remote_node_id); - return; + return RECONCILE_FILE_ABSENT; } if (!found) @@ -116,12 +173,14 @@ reconcile_progress_with_origin(XLogRecPtr origin_lsn) ConditionVariableInit(&entry->prev_processed_cv); elog(LOG, "SPOCK %s: reconcile: new entry seeded at origin LSN %X/%X", MySubscription->name, LSN_FORMAT_ARGS(origin_lsn)); + result = RECONCILE_FILE_ABSENT; } else if (entry->progress.remote_commit_lsn == origin_lsn) { /* File and origin agree; keep timestamps. */ elog(DEBUG1, "SPOCK %s: reconcile: file LSN matches origin (%X/%X)", MySubscription->name, LSN_FORMAT_ARGS(origin_lsn)); + result = RECONCILE_FILE_IN_SYNC; } else if (entry->progress.remote_commit_lsn < origin_lsn) { @@ -146,6 +205,7 @@ reconcile_progress_with_origin(XLogRecPtr origin_lsn) entry->progress.remote_insert_lsn = origin_lsn; if (entry->progress.received_lsn < origin_lsn) entry->progress.received_lsn = origin_lsn; + result = RECONCILE_FILE_STALE; } else { @@ -163,7 +223,114 @@ reconcile_progress_with_origin(XLogRecPtr origin_lsn) entry->progress.remote_insert_lsn = origin_lsn; if (entry->progress.received_lsn < origin_lsn) entry->progress.received_lsn = origin_lsn; + result = RECONCILE_FILE_ANOMALY; } LWLockRelease(SpockCtx->apply_group_master_lock); + return result; +} + +/* + * recover_progress_timestamps_from_commit_ts + * + * After reconcile detects that resource.dat is stale (or absent) for this + * origin, scan pg_commit_ts backward from the latest xid to recover the + * most-recent remote_commit_ts for this origin. Restores accurate + * post-crash values in the spock.progress view; will also be useful for + * the planned parallel-apply rework, which is expected to coordinate on + * prev_remote_ts. + * + * Termination: stop after observing SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN + * commits for this origin (any older commit is guaranteed to have a smaller + * commit_ts under realistic concurrency widths) or after scanning + * SPOCK_TS_RECOVERY_SCAN_LIMIT total xids. + * + * Caller must have ensured the entry exists in SpockGroupHash (reconcile + * does this). + */ +static void +recover_progress_timestamps_from_commit_ts(SpockGroupEntry *entry, + RepOriginId target_origin) +{ + TransactionId xid_high; + TimestampTz running_max_ts = 0; + int seen_count = 0; + int64 total_scanned = 0; + + xid_high = ReadNextTransactionId(); + if (TransactionIdPrecedes(xid_high, FirstNormalTransactionId + 1)) + { + elog(LOG, "SPOCK %s: ts recovery: no normal transactions yet; skipping", + MySubscription->name); + return; + } + xid_high = xid_high - 1; + + while (seen_count < SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN + && total_scanned < SPOCK_TS_RECOVERY_SCAN_LIMIT) + { + TransactionId xid_low; + TransactionId xid; + int64 batch_count; + + CHECK_FOR_INTERRUPTS(); + + /* Compute batch lower bound, clamped at FirstNormalTransactionId. */ + if (TransactionIdPrecedes(xid_high, + FirstNormalTransactionId + SPOCK_TS_RECOVERY_BATCH_SIZE)) + xid_low = FirstNormalTransactionId; + else + xid_low = xid_high - SPOCK_TS_RECOVERY_BATCH_SIZE + 1; + + batch_count = (int64) xid_high - (int64) xid_low + 1; + + for (xid = xid_low; TransactionIdPrecedes(xid, xid_high + 1); xid++) + { + TimestampTz ts; + RepOriginId origin; + + if (!TransactionIdGetCommitTsData(xid, &ts, &origin)) + continue; /* aborted, vacuumed, or no commit_ts */ + if (origin != target_origin) + continue; /* local writes / other origins */ + + seen_count++; + if (ts > running_max_ts) + running_max_ts = ts; + } + + total_scanned += batch_count; + + if (xid_low == FirstNormalTransactionId) + break; /* wraparound floor */ + + xid_high = xid_low - 1; + } + + if (running_max_ts > 0) + { + LWLockAcquire(SpockCtx->apply_group_master_lock, LW_EXCLUSIVE); + if (entry->progress.remote_commit_ts < running_max_ts) + { + entry->progress.remote_commit_ts = running_max_ts; + entry->progress.prev_remote_ts = running_max_ts; + /* + * Local apply time isn't recoverable post-crash. Use + * remote_commit_ts as a lower bound; refreshes on the next + * applied commit. + */ + entry->progress.last_updated_ts = running_max_ts; + } + LWLockRelease(SpockCtx->apply_group_master_lock); + + elog(LOG, "SPOCK %s: ts recovery: scanned %lld xids, found %d commits, " + "recovered remote_commit_ts", + MySubscription->name, (long long) total_scanned, seen_count); + } + else + { + elog(WARNING, "SPOCK %s: ts recovery: scanned %lld xids, no commits found " + "for origin %u; remote_commit_ts remains NULL until next applied commit", + MySubscription->name, (long long) total_scanned, target_origin); + } } diff --git a/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl b/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl index 4f83baf7..687ee541 100644 --- a/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl +++ b/tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl @@ -19,10 +19,12 @@ # - The apply worker's reconcile_progress_with_origin() at startup # detects file_lsn < origin_lsn, overwrites remote_commit_lsn with # origin_lsn, and clears the (now-stale) timestamp fields to NULL. +# - recover_progress_timestamps_from_commit_ts() then scans pg_commit_ts +# backward to repopulate remote_commit_ts. # # This test reproduces that exact scenario and asserts: # - remote_commit_lsn matches pre-crash value (recovered via origin). -# - remote_commit_ts is NULL post-crash (reconcile cleared the stale ts). +# - remote_commit_ts is NON-NULL post-crash (recovered via the scan). # # Provider is kept UP throughout so the apply worker can actually restart # and run reconcile after the subscriber crash. (If the provider were down @@ -188,13 +190,16 @@ sub wait_until { is($post_commit_lsn, $pre_snap, "remote_commit_lsn matches pre-crash (recovered via replication origin)"); +my $post_commit_ts_not_null = scalar_query(2, q{ + SELECT remote_commit_ts IS NOT NULL FROM spock.progress LIMIT 1 +}); +is($post_commit_ts_not_null, 't', + "remote_commit_ts is NOT NULL post-crash (recovered via pg_commit_ts scan)"); + my $post_commit_ts = scalar_query(2, q{ SELECT remote_commit_ts::text FROM spock.progress LIMIT 1 }); -is($post_commit_ts, '', - "remote_commit_ts is NULL post-crash (reconcile cleared stale ts)"); - -diag("post-crash: remote_commit_lsn=$post_commit_lsn remote_commit_ts='$post_commit_ts' (expected NULL)"); +diag("post-crash: remote_commit_lsn=$post_commit_lsn remote_commit_ts='$post_commit_ts'"); # ============================================================================= # Cleanup diff --git a/tests/tap/t/027_remote_commit_ts_recovery.pl b/tests/tap/t/027_remote_commit_ts_recovery.pl new file mode 100755 index 00000000..06bc4a31 --- /dev/null +++ b/tests/tap/t/027_remote_commit_ts_recovery.pl @@ -0,0 +1,125 @@ +#!/usr/bin/perl +# ============================================================================= +# Test: 027_remote_commit_ts_recovery.pl +# Verify that after a crash, remote_commit_ts is recovered from the +# subscriber's pg_commit_ts SLRU rather than left NULL, so the +# spock.progress view shows an accurate timestamp post-crash. The +# recovered prev_remote_ts is also intended to be useful for the +# planned parallel-apply rework. +# +# Topology: n1 (provider) -> n2 (subscriber) +# +# Sequence: +# 1. Apply commits, clean restart (writes resource.dat). +# 2. Apply more commits (origin advances; resource.dat stale). +# 3. SIGKILL subscriber. +# 4. Restart. Apply worker reconcile detects stale, calls the new +# recover_progress_timestamps_from_commit_ts which scans pg_commit_ts +# backward and populates remote_commit_ts. +# 5. Assert remote_commit_ts is NOT NULL. +# ============================================================================= + +use strict; +use warnings; +use Test::More; +use lib '.'; +use SpockTest qw( + create_cluster destroy_cluster system_or_bail + command_ok get_test_config scalar_query psql_or_bail + wait_for_sub_status wait_for_pg_ready +); + +sub wait_until { + my ($timeout_s, $probe) = @_; + my $deadline = time() + $timeout_s; + while (time() < $deadline) { + return 1 if $probe->(); + select(undef, undef, undef, 0.25); + } + return 0; +} + +create_cluster(2, 'Create 2-node cluster'); + +my $conf = get_test_config(); +my $host = $conf->{host}; +my $pg_bin = $conf->{pg_bin}; +my $ports = $conf->{node_ports}; +my $datadirs = $conf->{node_datadirs}; +my $dbname = $conf->{db_name}; +my $user = $conf->{db_user}; + +my $prov_port = $ports->[0]; +my $sub_port = $ports->[1]; +my $sub_dir = $datadirs->[1]; + +my $prov_dsn = "host=$host port=$prov_port dbname=$dbname user=$user"; + +psql_or_bail(1, "CREATE TABLE public.ts_recovery (id int primary key, val text)"); +psql_or_bail(2, "CREATE TABLE public.ts_recovery (id int primary key, val text)"); + +psql_or_bail(2, "SELECT spock.sub_create('test_sub', '$prov_dsn', ARRAY['default'], false, false)"); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'subscription is replicating'); + +# --- Round 1: apply, clean restart so resource.dat is written. --- +psql_or_bail(1, "INSERT INTO public.ts_recovery SELECT g, 'r1_' || g FROM generate_series(1,50) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_recovery") eq '50' }), + 'round-1 50 rows replicated'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; +ok(-f "$sub_dir/spock/resource.dat", 'resource.dat written on clean stop'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber back up'); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'replicating after clean restart'); + +# --- Round 2: apply more so origin advances past resource.dat's snapshot. --- +psql_or_bail(1, "INSERT INTO public.ts_recovery SELECT g, 'r2_' || g FROM generate_series(51,100) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_recovery") eq '100' }), + 'round-2 50 rows replicated'); + +# Confirm remote_commit_ts is non-NULL pre-crash (sanity). +my $pre_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +ok($pre_ts ne '', "pre-crash remote_commit_ts populated: $pre_ts"); + +# CHECKPOINT to advance pg_replication_origin durably past resource.dat's +# round-1 snapshot. Without this, after SIGKILL the origin may roll back to +# round-1 and reconcile takes the IN_SYNC branch -- not what we want to test. +psql_or_bail(2, 'CHECKPOINT'); +select(undef, undef, undef, 0.5); + +# --- SIGKILL subscriber. --- +my $pid_file = "$sub_dir/postmaster.pid"; +open(my $fh, '<', $pid_file) or die "Cannot open $pid_file: $!"; +my $sub_pid = <$fh>; +chomp($sub_pid); +close($fh); +kill 'KILL', $sub_pid; +pass('subscriber SIGKILLed'); + +select(undef, undef, undef, 2.0); + +# --- Restart. Reconcile detects stale; recovery scan populates remote_commit_ts. --- +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber accepting connections after crash'); + +# Wait for the apply worker to start, run reconcile, run recovery scan. +ok(wait_until(30, sub { + my $ts = scalar_query(2, q{ + SELECT remote_commit_ts::text FROM spock.progress LIMIT 1 + }); + defined $ts && $ts ne '' && $ts ne '0/0' +}), 'spock.progress.remote_commit_ts populated by recovery scan'); + +# --- Assertions --- +my $post_not_null = scalar_query(2, + "SELECT remote_commit_ts IS NOT NULL FROM spock.progress LIMIT 1"); +is($post_not_null, 't', + "post-crash remote_commit_ts is NOT NULL (recovered via pg_commit_ts scan)"); + +my $post_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +diag("post-crash remote_commit_ts: '$post_ts'"); + +psql_or_bail(2, "SELECT spock.sub_drop('test_sub')"); +destroy_cluster('Destroy 2-node cluster'); +done_testing(); diff --git a/tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl b/tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl new file mode 100755 index 00000000..dca10acc --- /dev/null +++ b/tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl @@ -0,0 +1,107 @@ +#!/usr/bin/perl +# ============================================================================= +# Test: 029_remote_commit_ts_recovery_clean_shutdown.pl +# Verify that on a CLEAN restart (resource.dat in sync with origin), +# the recovery scan is SKIPPED and the loaded remote_commit_ts is +# preserved. Confirmed via the apply-worker INFO log line +# "ts recovery: file in sync with origin, skipping scan". +# ============================================================================= + +use strict; +use warnings; +use Test::More; +use lib '.'; +use SpockTest qw( + create_cluster destroy_cluster system_or_bail + command_ok get_test_config scalar_query psql_or_bail + wait_for_sub_status wait_for_pg_ready +); + +sub wait_until { + my ($timeout_s, $probe) = @_; + my $deadline = time() + $timeout_s; + while (time() < $deadline) { + return 1 if $probe->(); + select(undef, undef, undef, 0.25); + } + return 0; +} + +sub log_grep_count { + my ($logdir, $pattern) = @_; + return 0 unless -d $logdir; + opendir(my $dh, $logdir) or return 0; + my @logs = grep { /\.log$/ } readdir($dh); + closedir($dh); + my $count = 0; + for my $name (@logs) { + my $path = "$logdir/$name"; + open(my $fh, '<', $path) or next; + while (my $line = <$fh>) { + $count++ if $line =~ /$pattern/; + } + close($fh); + } + return $count; +} + +create_cluster(2, 'Create 2-node cluster'); + +my $conf = get_test_config(); +my $host = $conf->{host}; +my $pg_bin = $conf->{pg_bin}; +my $ports = $conf->{node_ports}; +my $datadirs = $conf->{node_datadirs}; +my $dbname = $conf->{db_name}; +my $user = $conf->{db_user}; + +my $prov_port = $ports->[0]; +my $sub_port = $ports->[1]; +my $sub_dir = $datadirs->[1]; +my $logdir = "$sub_dir/logs"; + +my $prov_dsn = "host=$host port=$prov_port dbname=$dbname user=$user"; + +psql_or_bail(1, "CREATE TABLE public.ts_clean (id int primary key, val text)"); +psql_or_bail(2, "CREATE TABLE public.ts_clean (id int primary key, val text)"); + +psql_or_bail(2, "SELECT spock.sub_create('test_sub', '$prov_dsn', ARRAY['default'], false, false)"); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'subscription is replicating'); + +psql_or_bail(1, "INSERT INTO public.ts_clean SELECT g, 'r_' || g FROM generate_series(1,50) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_clean") eq '50' }), + '50 rows replicated'); + +# Capture pre-shutdown ts. +my $pre_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +ok($pre_ts ne '', "pre-shutdown remote_commit_ts: $pre_ts"); + +# Establish baseline counts BEFORE the restart we're testing. +my $skip_before = log_grep_count($logdir, qr/ts recovery: file in sync with origin, skipping scan/); +my $run_before = log_grep_count($logdir, qr/ts recovery: scanned \d+ xids, found \d+ commits/); + +# CLEAN shutdown + restart. resource.dat should match origin LSN. +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; +ok(-f "$sub_dir/spock/resource.dat", 'resource.dat written on clean stop'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber back up'); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'replicating after clean restart'); + +# Wait for the new "skipping scan" log line (one per apply-worker restart). +ok(wait_until(15, sub { + log_grep_count($logdir, qr/ts recovery: file in sync with origin, skipping scan/) > $skip_before +}), 'apply worker logged "file in sync, skipping scan" on clean restart'); + +# The run path must NOT have fired for this restart. +my $run_after = log_grep_count($logdir, qr/ts recovery: scanned \d+ xids, found \d+ commits/); +is($run_after, $run_before, + 'recovery scan did NOT run on clean restart (no "scanned ... xids, found ... commits" line)'); + +# remote_commit_ts should still match pre-shutdown. +my $post_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +is($post_ts, $pre_ts, "remote_commit_ts preserved across clean restart"); + +psql_or_bail(2, "SELECT spock.sub_drop('test_sub')"); +destroy_cluster('Destroy 2-node cluster'); +done_testing(); From 71fb10a52cbabfb80a071110cc21608edd42a978 Mon Sep 17 00:00:00 2001 From: Mason Sharp Date: Sat, 2 May 2026 18:18:28 -0700 Subject: [PATCH 5/5] fix: PR review fixes CodeRabbit + reviewer follow-up. Substantive items: - spock_group_progress_force_set_list: replace spock_init_progress_fields() + progress_update_struct() with a struct copy + insert_lsn/received_lsn invariant clamps. progress_update_struct's max-by-ts merge skips remote_commit_lsn when sap->remote_commit_ts is 0 (a legal post-reconcile shape), which could leave a force-set entry with InvalidXLogRecPtr commit_lsn. - spock_init_progress_state: add NULL shmem and InvalidXLogRecPtr origin_lsn short-circuits (defense in depth + correctness guard against backfilling stale historical ts on fresh subscriptions). - reconcile new-entry branch: pin remote_insert_lsn/received_lsn to origin_lsn so the insert >= commit invariant holds immediately. - reconcile anomaly branch: WARNING now carries all discarded file values for postmortem recovery. - Export spock_init_progress_fields (renamed from static init_progress_fields); reconcile's new-entry block calls it instead of inlining a field-reset block that had to stay in lockstep. - Document protocol-version behavior of request_initial_status_update: 'r'/'k' is the libpqwalreceiver wire protocol (works on all spock proto versions); 'k' advances received_lsn on v4 and v5+; remote_insert_lsn refresh paths differ (v5+ from 'w' header, v4 from COMMIT payload). - 023_forced_keepalive_timing.pl: remove resource.dat between stop and start (otherwise the test passes via reseed); assert received_lsn (the field 'k' actually populates) instead of remote_insert_lsn. - 027_remote_commit_ts_recovery.pl: gate assertion on (commit_lsn >= pre_crash_lsn AND ts IS NOT NULL) to prove both reconcile and recovery scan ran (otherwise stale resource.dat reseed would mask). Spec updates: - Enumerate all three resource.dat dump sites in the design overview. - Document recovery-scan limitations (origin-id filter is not subscription-unique; sibling subs or drop/recreate can contribute unrelated commits). - Soften overconfident "switch to commit-LSN" references; reword "Required for parallel-apply ordering" to "intended for the planned parallel-apply rework." --- ...026-04-30-recover-remote-commit-ts-plan.md | 1026 +++++++++++++++++ ...pock-progress-no-checkpoint-hook-design.md | 145 ++- include/spock_group.h | 9 + src/spock_apply.c | 10 + src/spock_group.c | 24 +- src/spock_progress_recovery.c | 77 +- tests/tap/t/023_forced_keepalive_timing.pl | 41 +- tests/tap/t/027_remote_commit_ts_recovery.pl | 36 +- 8 files changed, 1274 insertions(+), 94 deletions(-) create mode 100644 docs/internals-doc/plans/2026-04-30-recover-remote-commit-ts-plan.md diff --git a/docs/internals-doc/plans/2026-04-30-recover-remote-commit-ts-plan.md b/docs/internals-doc/plans/2026-04-30-recover-remote-commit-ts-plan.md new file mode 100644 index 00000000..411cffb8 --- /dev/null +++ b/docs/internals-doc/plans/2026-04-30-recover-remote-commit-ts-plan.md @@ -0,0 +1,1026 @@ +# `remote_commit_ts` Recovery via `pg_commit_ts` Scan — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** After crash recovery, scan `pg_commit_ts` backward to recover the most-recent `remote_commit_ts` per origin so `spock.progress` shows accurate values post-crash; the recovered `prev_remote_ts` is also intended to be useful for the planned parallel-apply rework. + +**Architecture:** New file-static function `recover_progress_timestamps_from_commit_ts()` in `src/spock_apply.c`, called from `spock_apply_main` after `reconcile_progress_with_origin` and before `spock_start_replication`, gated on a new `ReconcileResult` enum so the scan only runs when the loaded `resource.dat` is stale or missing. Termination after observing 1000 commits per origin or scanning 1M xids total. Hard error if `track_commit_timestamp` is off. + +**Tech Stack:** C (Postgres extension), Postgres SLRU API (`TransactionIdGetCommitTsData`), Perl TAP tests (existing `SpockTest` module). + +**Pre-flight:** + +```bash +source ~/bin/env17.sh # activates pg_config / PATH +cd /home/msharp/dev/pgedge7/spock +git status # working tree clean +make -j4 install # baseline build OK +``` + +--- + +## File Structure + +| File | Role | Change | +|---|---|---| +| `src/spock_apply.c` | Apply worker main, reconcile, new recovery function | Modify | +| `tests/tap/t/027_remote_commit_ts_recovery.pl` | Happy path | Create | +| `tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl` | Skip-path verification | Create | +| `tests/tap/t/030_remote_commit_ts_recovery_track_commit_ts_off.pl` | Error-path verification | Create | + +No header changes — `ReconcileResult` is file-static. + +--- + +## Task 1: Refactor `reconcile_progress_with_origin` to return `ReconcileResult` + +**Why first:** The recovery function needs to know which reconcile branch was taken so it can skip when the file is in sync. Threading the enum through is purely structural with no behavioral change, so it's safe to land first. + +**Files:** +- Modify: `src/spock_apply.c` — add enum near other forward declarations (~line 245-260), change function signature at `src/spock_apply.c:2937-2938`, update caller at `src/spock_apply.c:4216`. + +- [ ] **Step 1.1: Add the enum near the existing forward declarations** + +Locate the forward declarations block at `src/spock_apply.c:248-257`: + +```c +static void UpdateWorkerStats(XLogRecPtr last_received, XLogRecPtr last_inserted); +... +static void request_initial_status_update(PGconn *conn, XLogRecPtr startpos); +static void reconcile_progress_with_origin(XLogRecPtr origin_lsn); +``` + +Replace the last line with the enum and updated declaration: + +```c +typedef enum ReconcileResult +{ + RECONCILE_FILE_IN_SYNC, /* file_lsn == origin_lsn; recovery scan can skip */ + RECONCILE_FILE_STALE, /* file_lsn < origin_lsn; recovery scan needed */ + RECONCILE_FILE_ANOMALY, /* file_lsn > origin_lsn; recovery scan needed */ + RECONCILE_FILE_ABSENT /* no entry was loaded; recovery scan needed */ +} ReconcileResult; + +static ReconcileResult reconcile_progress_with_origin(XLogRecPtr origin_lsn); +``` + +- [ ] **Step 1.2: Update the function definition to return the enum** + +At `src/spock_apply.c:2937`, change: + +```c +static void +reconcile_progress_with_origin(XLogRecPtr origin_lsn) +``` + +to: + +```c +static ReconcileResult +reconcile_progress_with_origin(XLogRecPtr origin_lsn) +``` + +In the body (current `src/spock_apply.c:2944-3043`), each branch must `return` the appropriate enum value. + +- The early-return at the top (`!SpockGroupHash || !SpockCtx`) — change to `return RECONCILE_FILE_ABSENT;`. The hash isn't set up; treat as absent. +- The `entry == NULL` case (hash full) — change `return;` to `return RECONCILE_FILE_ABSENT;` and continue with the existing WARNING. +- The `if (!found)` branch — after the existing log line, `result = RECONCILE_FILE_ABSENT;`. +- The `else if (entry->progress.remote_commit_lsn == origin_lsn)` branch — `result = RECONCILE_FILE_IN_SYNC;`. +- The `else if (entry->progress.remote_commit_lsn < origin_lsn)` branch — `result = RECONCILE_FILE_STALE;`. +- The final `else` (anomaly) branch — `result = RECONCILE_FILE_ANOMALY;`. + +Concretely, declare `ReconcileResult result;` near the top of the function, set it in each branch, and return it after `LWLockRelease`. + +```c +static ReconcileResult +reconcile_progress_with_origin(XLogRecPtr origin_lsn) +{ + SpockGroupKey key; + SpockGroupEntry *entry; + bool found; + ReconcileResult result; + + if (!SpockGroupHash || !SpockCtx) + { + elog(WARNING, "SPOCK %s: SpockGroupHash is not initialized; reconcile skipped", + MySubscription->name); + return RECONCILE_FILE_ABSENT; + } + + key.dbid = MyDatabaseId; + key.node_id = MySubscription->target->id; + key.remote_node_id = MySubscription->origin->id; + + LWLockAcquire(SpockCtx->apply_group_master_lock, LW_EXCLUSIVE); + + entry = (SpockGroupEntry *) hash_search(SpockGroupHash, &key, + HASH_ENTER, &found); + if (entry == NULL) + { + LWLockRelease(SpockCtx->apply_group_master_lock); + elog(WARNING, "SpockGroupHash is full, cannot reconcile progress for " + "(dbid=%u, node_id=%u, remote_node_id=%u)", + key.dbid, key.node_id, key.remote_node_id); + return RECONCILE_FILE_ABSENT; + } + + if (!found) + { + /* ...existing init body unchanged... */ + result = RECONCILE_FILE_ABSENT; + } + else if (entry->progress.remote_commit_lsn == origin_lsn) + { + /* ...existing body unchanged... */ + result = RECONCILE_FILE_IN_SYNC; + } + else if (entry->progress.remote_commit_lsn < origin_lsn) + { + /* ...existing stale body unchanged... */ + result = RECONCILE_FILE_STALE; + } + else + { + /* ...existing anomaly body unchanged... */ + result = RECONCILE_FILE_ANOMALY; + } + + LWLockRelease(SpockCtx->apply_group_master_lock); + return result; +} +``` + +- [ ] **Step 1.3: Update the caller to receive the result** + +At `src/spock_apply.c:4216`, change: + +```c +reconcile_progress_with_origin(origin_startpos); +``` + +to: + +```c +ReconcileResult reconcile_result = reconcile_progress_with_origin(origin_startpos); +``` + +The variable is unused for now; Task 4 wires it up. Suppress the unused-variable warning by adding `(void) reconcile_result;` directly below the call, with a comment `/* used in Task 4 */`. + +- [ ] **Step 1.4: Build and verify no behavioral regression** + +```bash +make -j4 install +``` + +Expected: clean build, no new warnings. + +```bash +cd tests/tap && ./run_tests.sh t/022_rmgr_progress_post_checkpoint_crash.pl +``` + +Expected: all subtests pass (this test exercises the reconcile path most heavily). + +- [ ] **Step 1.5: Commit** + +```bash +git add src/spock_apply.c +git commit -m "refactor: reconcile_progress_with_origin returns ReconcileResult + +Threads a typed result back to the caller so a follow-up commit can +gate post-reconcile work on whether the loaded resource.dat was in +sync, stale, anomalous, or absent. Pure refactor -- no behavior +change." +``` + +--- + +## Task 2: Add the recovery constants + +**Files:** +- Modify: `src/spock_apply.c` — add `#define`s near the top of the file, after the existing macros / includes. + +- [ ] **Step 2.1: Add the three `#define`s** + +Find an appropriate spot near the top of `src/spock_apply.c` (after the includes, near other apply-side `#define`s if present, or otherwise just before the forward declarations). Add: + +```c +/* + * Constants governing the post-crash scan of pg_commit_ts that recovers + * remote_commit_ts per origin. Plain #defines for now; convert to GUCs + * only if a deployment ever needs to tune them. + */ +#define SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN 1000 +#define SPOCK_TS_RECOVERY_SCAN_LIMIT 1000000 +#define SPOCK_TS_RECOVERY_BATCH_SIZE 1000 +``` + +- [ ] **Step 2.2: Build to verify** + +```bash +make -j4 install +``` + +Expected: clean build (no use sites yet, but the macros exist). + +- [ ] **Step 2.3: Commit** + +```bash +git add src/spock_apply.c +git commit -m "feat: add SPOCK_TS_RECOVERY_* constants for commit_ts scan + +Adds the three tunables (min commits per origin, hard scan limit, +batch size) for the upcoming pg_commit_ts recovery scan. Defined as +plain #defines; conversion to GUCs is a follow-up if profiling +motivates tuning." +``` + +--- + +## Task 3: Write the failing happy-path test (027) + +**Why before implementation:** Establishes the regression target — running this test against `main` (or after Task 2) must fail; running it after Task 4 must pass. + +**Files:** +- Create: `tests/tap/t/027_remote_commit_ts_recovery.pl` + +- [ ] **Step 3.1: Create the test file** + +Write `tests/tap/t/027_remote_commit_ts_recovery.pl` with the following content: + +```perl +#!/usr/bin/perl +# ============================================================================= +# Test: 027_remote_commit_ts_recovery.pl +# Verify that after a crash, remote_commit_ts is recovered from the +# subscriber's pg_commit_ts SLRU rather than left NULL. The recovered +# values are also intended to be useful for the planned parallel-apply +# rework. +# +# Topology: n1 (provider) -> n2 (subscriber) +# +# Sequence: +# 1. Apply commits, clean restart (writes resource.dat). +# 2. Apply more commits (origin advances; resource.dat stale). +# 3. SIGKILL subscriber. +# 4. Restart. Apply worker reconcile detects stale, calls the new +# recover_progress_timestamps_from_commit_ts which scans pg_commit_ts +# backward and populates remote_commit_ts. +# 5. Assert remote_commit_ts is NOT NULL. +# ============================================================================= + +use strict; +use warnings; +use Test::More; +use lib '.'; +use SpockTest qw( + create_cluster destroy_cluster system_or_bail + command_ok get_test_config scalar_query psql_or_bail + wait_for_sub_status wait_for_pg_ready +); + +sub wait_until { + my ($timeout_s, $probe) = @_; + my $deadline = time() + $timeout_s; + while (time() < $deadline) { + return 1 if $probe->(); + select(undef, undef, undef, 0.25); + } + return 0; +} + +create_cluster(2, 'Create 2-node cluster'); + +my $conf = get_test_config(); +my $host = $conf->{host}; +my $pg_bin = $conf->{pg_bin}; +my $ports = $conf->{node_ports}; +my $datadirs = $conf->{node_datadirs}; +my $dbname = $conf->{db_name}; +my $user = $conf->{db_user}; + +my $prov_port = $ports->[0]; +my $sub_port = $ports->[1]; +my $sub_dir = $datadirs->[1]; + +my $prov_dsn = "host=$host port=$prov_port dbname=$dbname user=$user"; + +psql_or_bail(1, "CREATE TABLE public.ts_recovery (id int primary key, val text)"); +psql_or_bail(2, "CREATE TABLE public.ts_recovery (id int primary key, val text)"); + +psql_or_bail(2, "SELECT spock.sub_create('test_sub', '$prov_dsn', ARRAY['default'], false, false)"); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'subscription is replicating'); + +# --- Round 1: apply, clean restart so resource.dat is written. --- +psql_or_bail(1, "INSERT INTO public.ts_recovery SELECT g, 'r1_' || g FROM generate_series(1,50) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_recovery") eq '50' }), + 'round-1 50 rows replicated'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; +ok(-f "$sub_dir/spock/resource.dat", 'resource.dat written on clean stop'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber back up'); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'replicating after clean restart'); + +# --- Round 2: apply more so origin advances past resource.dat's snapshot. --- +psql_or_bail(1, "INSERT INTO public.ts_recovery SELECT g, 'r2_' || g FROM generate_series(51,100) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_recovery") eq '100' }), + 'round-2 50 rows replicated'); + +# Confirm remote_commit_ts is non-NULL pre-crash (sanity). +my $pre_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +ok($pre_ts ne '', "pre-crash remote_commit_ts populated: $pre_ts"); + +# --- SIGKILL subscriber. --- +my $pid_file = "$sub_dir/postmaster.pid"; +open(my $fh, '<', $pid_file) or die "Cannot open $pid_file: $!"; +my $sub_pid = <$fh>; +chomp($sub_pid); +close($fh); +kill 'KILL', $sub_pid; +pass('subscriber SIGKILLed'); + +select(undef, undef, undef, 2.0); + +# --- Restart. Reconcile detects stale; recovery scan populates remote_commit_ts. --- +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber accepting connections after crash'); + +# Wait for the apply worker to start, run reconcile, run recovery scan. +ok(wait_until(30, sub { + my $ts = scalar_query(2, q{ + SELECT remote_commit_ts::text FROM spock.progress LIMIT 1 + }); + defined $ts && $ts ne '' && $ts ne '0/0' +}), 'spock.progress.remote_commit_ts populated by recovery scan'); + +# --- Assertions --- +my $post_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +ok($post_ts ne '', + "post-crash remote_commit_ts is NOT NULL (recovered via pg_commit_ts scan): $post_ts"); + +# remote_commit_ts must be a valid timestamp (parses as such). +my $ts_valid = scalar_query(2, + "SELECT '$post_ts'::timestamptz IS NOT NULL"); +is($ts_valid, 't', "post-crash remote_commit_ts parses as a valid timestamptz"); + +psql_or_bail(2, "SELECT spock.sub_drop('test_sub')"); +destroy_cluster('Destroy 2-node cluster'); +done_testing(); +``` + +Make it executable: + +```bash +chmod +x tests/tap/t/027_remote_commit_ts_recovery.pl +``` + +- [ ] **Step 3.2: Run the test — confirm it fails as expected** + +```bash +cd tests/tap && ./run_tests.sh t/027_remote_commit_ts_recovery.pl +``` + +Expected: the test FAILS at the "post-crash remote_commit_ts is NOT NULL" assertion. Without the recovery scan, the field remains NULL after crash (current behavior). + +- [ ] **Step 3.3: Commit** + +```bash +git add tests/tap/t/027_remote_commit_ts_recovery.pl +git commit -m "test: add 027 (remote_commit_ts post-crash recovery, currently failing) + +Asserts the regression target for the pg_commit_ts scan: after a crash +that leaves resource.dat stale relative to the replication origin, +remote_commit_ts must be recovered (non-NULL) rather than left blank. +Currently fails -- implementation lands in the next commit." +``` + +--- + +## Task 4: Implement the scan and wire the call site + +**Files:** +- Modify: `src/spock_apply.c` — add `recover_progress_timestamps_from_commit_ts` function and call it from `spock_apply_main`. Add INFO logs. + +- [ ] **Step 4.1: Add the forward declaration** + +Near the existing forward declarations (`src/spock_apply.c:248-260`, the same block where Task 1 added `ReconcileResult`), add: + +```c +static void recover_progress_timestamps_from_commit_ts(SpockGroupEntry *entry, + RepOriginId target_origin); +``` + +- [ ] **Step 4.2: Add the include for `commit_ts.h`** + +Near the top of `src/spock_apply.c` (in the `#include` block), add: + +```c +#include "access/commit_ts.h" +``` + +- [ ] **Step 4.3: Implement the scan function** + +Add the function after `reconcile_progress_with_origin` (i.e., after the existing block ending at `src/spock_apply.c:3043`): + +```c +/* + * recover_progress_timestamps_from_commit_ts + * + * After reconcile detects that resource.dat is stale (or absent) for this + * origin, scan pg_commit_ts backward from the latest xid to recover the + * most-recent remote_commit_ts for this origin. The recovered values are + * also intended to be useful for the planned parallel-apply rework. + * + * Termination: stop after observing SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN + * commits for this origin (any older commit is guaranteed to have a smaller + * commit_ts under realistic concurrency widths) or after scanning + * SPOCK_TS_RECOVERY_SCAN_LIMIT total xids. + * + * Caller must have ensured the entry exists in SpockGroupHash (reconcile + * does this). + */ +static void +recover_progress_timestamps_from_commit_ts(SpockGroupEntry *entry, + RepOriginId target_origin) +{ + TransactionId xid_high; + TimestampTz running_max_ts = 0; + int seen_count = 0; + int64 total_scanned = 0; + + xid_high = ReadNextTransactionId(); + if (TransactionIdPrecedes(xid_high, FirstNormalTransactionId + 1)) + { + elog(LOG, "SPOCK %s: ts recovery: no normal transactions yet; skipping", + MySubscription->name); + return; + } + xid_high = xid_high - 1; + + while (seen_count < SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN + && total_scanned < SPOCK_TS_RECOVERY_SCAN_LIMIT) + { + TransactionId xid_low; + TransactionId xid; + int64 batch_count; + + CHECK_FOR_INTERRUPTS(); + + /* Compute batch lower bound, clamped at FirstNormalTransactionId. */ + if (TransactionIdPrecedes(xid_high, + FirstNormalTransactionId + SPOCK_TS_RECOVERY_BATCH_SIZE)) + xid_low = FirstNormalTransactionId; + else + xid_low = xid_high - SPOCK_TS_RECOVERY_BATCH_SIZE + 1; + + batch_count = (int64) xid_high - (int64) xid_low + 1; + + for (xid = xid_low; TransactionIdPrecedes(xid, xid_high + 1); xid++) + { + TimestampTz ts; + RepOriginId origin; + + if (!TransactionIdGetCommitTsData(xid, &ts, &origin)) + continue; /* aborted, vacuumed, or no commit_ts */ + if (origin != target_origin) + continue; /* local writes / other origins */ + + seen_count++; + if (ts > running_max_ts) + running_max_ts = ts; + } + + total_scanned += batch_count; + + if (xid_low == FirstNormalTransactionId) + break; /* wraparound floor */ + + xid_high = xid_low - 1; + } + + if (running_max_ts > 0) + { + LWLockAcquire(SpockCtx->apply_group_master_lock, LW_EXCLUSIVE); + if (entry->progress.remote_commit_ts < running_max_ts) + { + entry->progress.remote_commit_ts = running_max_ts; + entry->progress.prev_remote_ts = running_max_ts; + } + LWLockRelease(SpockCtx->apply_group_master_lock); + + elog(INFO, "SPOCK %s: ts recovery: scanned %lld xids, found %d commits, " + "recovered remote_commit_ts", + MySubscription->name, (long long) total_scanned, seen_count); + } + else + { + elog(WARNING, "SPOCK %s: ts recovery: scanned %lld xids, no commits found " + "for origin %u; remote_commit_ts remains NULL until next applied commit", + MySubscription->name, (long long) total_scanned, target_origin); + } +} +``` + +- [ ] **Step 4.4: Wire the call site in `spock_apply_main`** + +At `src/spock_apply.c:4216` (where Task 1 introduced `reconcile_result`), replace: + +```c +ReconcileResult reconcile_result = reconcile_progress_with_origin(origin_startpos); +(void) reconcile_result; /* used in Task 4 */ +``` + +with: + +```c +ReconcileResult reconcile_result = reconcile_progress_with_origin(origin_startpos); + +if (reconcile_result == RECONCILE_FILE_IN_SYNC) +{ + elog(INFO, "SPOCK %s: ts recovery: file in sync with origin, skipping scan", + MySubscription->name); +} +else +{ + SpockGroupKey key; + SpockGroupEntry *entry; + bool found; + RepOriginId target_origin = replorigin_session_origin; + + key.dbid = MyDatabaseId; + key.node_id = MySubscription->target->id; + key.remote_node_id = MySubscription->origin->id; + + LWLockAcquire(SpockCtx->apply_group_master_lock, LW_SHARED); + entry = (SpockGroupEntry *) hash_search(SpockGroupHash, &key, + HASH_FIND, &found); + LWLockRelease(SpockCtx->apply_group_master_lock); + + if (entry != NULL) + recover_progress_timestamps_from_commit_ts(entry, target_origin); +} +``` + +Note on `target_origin`: `replorigin_session_origin` is the standard Postgres global, set by `replorigin_session_setup(originid)` at `src/spock_apply.c:4212`, two lines before the reconcile call. It is the correct `RepOriginId` to filter by. + +- [ ] **Step 4.5: Build and run test 027** + +```bash +make -j4 install +cd tests/tap && ./run_tests.sh t/027_remote_commit_ts_recovery.pl +``` + +Expected: all subtests pass. The "post-crash remote_commit_ts is NOT NULL" assertion now succeeds. + +- [ ] **Step 4.6: Run the existing crash-recovery regression to ensure no regression** + +```bash +./run_tests.sh t/022_rmgr_progress_post_checkpoint_crash.pl t/008_rmgr.pl +``` + +Expected: both pass. Note that 022's assertion `is($post_commit_ts, '', "remote_commit_ts is NULL post-crash...")` is now WRONG — recovery populates it. **You must update 022 in this same task.** + +- [ ] **Step 4.7: Update test 022 to reflect new behavior** + +In `tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl` at the assertion block (around line 191-197), change: + +```perl +my $post_commit_ts = scalar_query(2, q{ + SELECT remote_commit_ts::text FROM spock.progress LIMIT 1 +}); +is($post_commit_ts, '', + "remote_commit_ts is NULL post-crash (reconcile cleared stale ts)"); + +diag("post-crash: remote_commit_lsn=$post_commit_lsn remote_commit_ts='$post_commit_ts' (expected NULL)"); +``` + +to: + +```perl +my $post_commit_ts = scalar_query(2, q{ + SELECT remote_commit_ts::text FROM spock.progress LIMIT 1 +}); +ok($post_commit_ts ne '', + "remote_commit_ts is NOT NULL post-crash (recovered via pg_commit_ts scan)"); + +diag("post-crash: remote_commit_lsn=$post_commit_lsn remote_commit_ts='$post_commit_ts'"); +``` + +Also update the file header comment block where it says `remote_commit_ts is NULL post-crash` to reflect the recovery behavior. + +- [ ] **Step 4.8: Re-run 022 to confirm the update** + +```bash +./run_tests.sh t/022_rmgr_progress_post_checkpoint_crash.pl +``` + +Expected: all subtests pass. + +- [ ] **Step 4.9: Commit** + +```bash +git add src/spock_apply.c tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl +git commit -m "feat: recover remote_commit_ts via pg_commit_ts scan on crash restart + +After reconcile_progress_with_origin detects that resource.dat is stale +or absent for an origin, scan pg_commit_ts backward from the latest xid +and populate remote_commit_ts with the max ts found for this origin. +The recovered values are also intended to be useful for the planned +parallel-apply rework. + +Termination: 1000 commits seen for this origin (concurrency width is +bounded well below this by max_connections / poolers) or 1M xids total. +Logs INFO on completion (skip path or run path) so startup behavior is +grep-able. + +Updates test 022 to reflect that remote_commit_ts is now recovered +rather than NULL after crash." +``` + +--- + +## Task 5: Write the failing `track_commit_timestamp` off test (030) + +**Files:** +- Create: `tests/tap/t/030_remote_commit_ts_recovery_track_commit_ts_off.pl` + +- [ ] **Step 5.1: Create the test** + +Write `tests/tap/t/030_remote_commit_ts_recovery_track_commit_ts_off.pl`: + +```perl +#!/usr/bin/perl +# ============================================================================= +# Test: 030_remote_commit_ts_recovery_track_commit_ts_off.pl +# Verify that the apply worker errors out at startup if +# track_commit_timestamp is off and the recovery scan would otherwise +# run (i.e., resource.dat is stale or absent). Spock requires +# track_commit_timestamp = on for conflict resolution; making the +# failure mode loud at apply-worker start is intentional. +# ============================================================================= + +use strict; +use warnings; +use Test::More; +use lib '.'; +use SpockTest qw( + create_cluster destroy_cluster system_or_bail + command_ok get_test_config scalar_query psql_or_bail + wait_for_sub_status wait_for_pg_ready +); + +sub wait_until { + my ($timeout_s, $probe) = @_; + my $deadline = time() + $timeout_s; + while (time() < $deadline) { + return 1 if $probe->(); + select(undef, undef, undef, 0.25); + } + return 0; +} + +sub log_contains { + my ($logfile, $pattern) = @_; + return 0 unless -f $logfile; + open(my $fh, '<', $logfile) or return 0; + while (my $line = <$fh>) { + return 1 if $line =~ /$pattern/; + } + close($fh); + return 0; +} + +create_cluster(2, 'Create 2-node cluster'); + +my $conf = get_test_config(); +my $host = $conf->{host}; +my $pg_bin = $conf->{pg_bin}; +my $ports = $conf->{node_ports}; +my $datadirs = $conf->{node_datadirs}; +my $dbname = $conf->{db_name}; +my $user = $conf->{db_user}; + +my $prov_port = $ports->[0]; +my $sub_port = $ports->[1]; +my $sub_dir = $datadirs->[1]; + +my $prov_dsn = "host=$host port=$prov_port dbname=$dbname user=$user"; + +psql_or_bail(1, "CREATE TABLE public.ts_off (id int primary key, val text)"); +psql_or_bail(2, "CREATE TABLE public.ts_off (id int primary key, val text)"); + +psql_or_bail(2, "SELECT spock.sub_create('test_sub', '$prov_dsn', ARRAY['default'], false, false)"); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'subscription is replicating'); + +psql_or_bail(1, "INSERT INTO public.ts_off SELECT g, 'r1_' || g FROM generate_series(1,50) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_off") eq '50' }), + '50 rows replicated'); + +# Clean restart so resource.dat is current. +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'replicating after clean restart'); + +# Apply more so resource.dat goes stale. +psql_or_bail(1, "INSERT INTO public.ts_off SELECT g, 'r2_' || g FROM generate_series(51,100) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_off") eq '100' }), + '100 rows replicated'); + +# Disable track_commit_timestamp -- requires a postmaster restart, but our +# next step is going to crash and restart anyway. +psql_or_bail(2, "ALTER SYSTEM SET track_commit_timestamp = off"); + +# SIGKILL. +my $pid_file = "$sub_dir/postmaster.pid"; +open(my $fh, '<', $pid_file) or die "Cannot open $pid_file: $!"; +my $sub_pid = <$fh>; +chomp($sub_pid); +close($fh); +kill 'KILL', $sub_pid; +pass('subscriber SIGKILLed'); + +select(undef, undef, undef, 2.0); + +# Restart -- track_commit_timestamp is now off; reconcile sees stale file; +# recovery scan must error out. +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber back up'); + +# Wait for the apply worker to attempt start, then check the log for the +# expected ERROR. The bgworker manager will respawn it; we just need to +# verify the message appears. +my $logfile = "$sub_dir/log/postgresql-1.log"; +# SpockTest may put logs elsewhere; adjust path if needed by reading test +# config or scanning $sub_dir/log/. +ok(wait_until(30, sub { + log_contains($logfile, qr/track_commit_timestamp.*on/i) +}), "apply worker logged ERROR about track_commit_timestamp being off"); + +# Restore for cleanup. +psql_or_bail(2, "ALTER SYSTEM SET track_commit_timestamp = on"); +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'restart'; + +destroy_cluster('Destroy 2-node cluster'); +done_testing(); +``` + +Note: the log path may need adjusting based on how `SpockTest` configures `log_directory` / `log_filename`. Inspect `tests/tap/SpockTest.pm` if `$sub_dir/log/postgresql-1.log` doesn't match. The pattern matching for the error message is tolerant (case-insensitive, just looks for "track_commit_timestamp" near "on"). + +```bash +chmod +x tests/tap/t/030_remote_commit_ts_recovery_track_commit_ts_off.pl +``` + +- [ ] **Step 5.2: Run the test — confirm it fails as expected** + +```bash +./run_tests.sh t/030_remote_commit_ts_recovery_track_commit_ts_off.pl +``` + +Expected: the test FAILS at the "logged ERROR about track_commit_timestamp" check, because Task 4 has no guard. Without the guard, the recovery scan calls `TransactionIdGetCommitTsData` which returns false for every xid (no commit_ts data) and the scan logs "no commits found" instead of erroring. + +- [ ] **Step 5.3: Commit** + +```bash +git add tests/tap/t/030_remote_commit_ts_recovery_track_commit_ts_off.pl +git commit -m "test: add 030 (track_commit_timestamp off, currently failing) + +Asserts the apply worker errors at startup if track_commit_timestamp +is off when the recovery scan would otherwise run. Currently fails -- +guard lands in the next commit." +``` + +--- + +## Task 6: Add the `track_commit_timestamp` guard + +**Files:** +- Modify: `src/spock_apply.c` — add the guard at the top of `recover_progress_timestamps_from_commit_ts`. + +- [ ] **Step 6.1: Add the guard** + +At the top of `recover_progress_timestamps_from_commit_ts` (added in Task 4.3), insert immediately after the variable declarations: + +```c +if (!track_commit_timestamps) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("SPOCK %s: track_commit_timestamp must be on for spock.progress recovery", + MySubscription->name), + errhint("Set track_commit_timestamp = on in postgresql.conf and restart."))); +``` + +The variable `track_commit_timestamps` (note plural) is the standard Postgres GUC global, declared in `access/commit_ts.h` (already included in Task 4.2). + +- [ ] **Step 6.2: Build** + +```bash +make -j4 install +``` + +Expected: clean build. + +- [ ] **Step 6.3: Run test 030 — should now pass** + +```bash +./run_tests.sh t/030_remote_commit_ts_recovery_track_commit_ts_off.pl +``` + +Expected: all subtests pass. + +- [ ] **Step 6.4: Run test 027 to confirm no regression** + +```bash +./run_tests.sh t/027_remote_commit_ts_recovery.pl +``` + +Expected: all subtests pass. + +- [ ] **Step 6.5: Commit** + +```bash +git add src/spock_apply.c +git commit -m "feat: error if track_commit_timestamp is off during ts recovery + +Spock already requires track_commit_timestamp = on for conflict +resolution. Making the failure mode loud at apply-worker start (rather +than silently returning a degraded NULL state) is consistent with that +existing requirement and catches misconfiguration early." +``` + +--- + +## Task 7: Add the clean-shutdown skip-path test (029) + +**Files:** +- Create: `tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl` + +- [ ] **Step 7.1: Create the test** + +Write `tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl`: + +```perl +#!/usr/bin/perl +# ============================================================================= +# Test: 029_remote_commit_ts_recovery_clean_shutdown.pl +# Verify that on a CLEAN restart (resource.dat in sync with origin), +# the recovery scan is SKIPPED and the loaded remote_commit_ts is +# preserved. Confirmed by checking the apply-worker INFO log for the +# "file in sync, skipping scan" line. +# ============================================================================= + +use strict; +use warnings; +use Test::More; +use lib '.'; +use SpockTest qw( + create_cluster destroy_cluster system_or_bail + command_ok get_test_config scalar_query psql_or_bail + wait_for_sub_status wait_for_pg_ready +); + +sub wait_until { + my ($timeout_s, $probe) = @_; + my $deadline = time() + $timeout_s; + while (time() < $deadline) { + return 1 if $probe->(); + select(undef, undef, undef, 0.25); + } + return 0; +} + +sub log_grep_count { + my ($logfile, $pattern) = @_; + return 0 unless -f $logfile; + open(my $fh, '<', $logfile) or return 0; + my $count = 0; + while (my $line = <$fh>) { + $count++ if $line =~ /$pattern/; + } + close($fh); + return $count; +} + +create_cluster(2, 'Create 2-node cluster'); + +my $conf = get_test_config(); +my $host = $conf->{host}; +my $pg_bin = $conf->{pg_bin}; +my $ports = $conf->{node_ports}; +my $datadirs = $conf->{node_datadirs}; +my $dbname = $conf->{db_name}; +my $user = $conf->{db_user}; + +my $prov_port = $ports->[0]; +my $sub_port = $ports->[1]; +my $sub_dir = $datadirs->[1]; + +my $prov_dsn = "host=$host port=$prov_port dbname=$dbname user=$user"; + +psql_or_bail(1, "CREATE TABLE public.ts_clean (id int primary key, val text)"); +psql_or_bail(2, "CREATE TABLE public.ts_clean (id int primary key, val text)"); + +psql_or_bail(2, "SELECT spock.sub_create('test_sub', '$prov_dsn', ARRAY['default'], false, false)"); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'subscription is replicating'); + +psql_or_bail(1, "INSERT INTO public.ts_clean SELECT g, 'r_' || g FROM generate_series(1,50) g"); +ok(wait_until(30, sub { scalar_query(2, "SELECT count(*) FROM public.ts_clean") eq '50' }), + '50 rows replicated'); + +# Capture pre-shutdown ts. +my $pre_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +ok($pre_ts ne '', "pre-shutdown remote_commit_ts: $pre_ts"); + +# Establish baseline: count "skipping scan" log lines before restart. +my $logfile = "$sub_dir/log/postgresql-1.log"; +my $skip_before = log_grep_count($logfile, qr/ts recovery: file in sync, skipping scan/); + +# CLEAN shutdown + restart. resource.dat should match origin LSN. +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; +ok(-f "$sub_dir/spock/resource.dat", 'resource.dat written on clean stop'); + +system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; +ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber back up'); +ok(wait_for_sub_status(2, 'test_sub', 'replicating', 30), 'replicating after clean restart'); + +# Wait for the new "skipping scan" log line. +ok(wait_until(15, sub { + log_grep_count($logfile, qr/ts recovery: file in sync, skipping scan/) > $skip_before +}), 'apply worker logged "file in sync, skipping scan" on clean restart'); + +# Confirm scan was NOT run (no "scanned ... xids" line for this restart). +my $scan_run = log_grep_count($logfile, qr/ts recovery: scanned \d+ xids/); +# We only count NEW occurrences relative to baseline, but since baseline is 0 +# for a fresh cluster, any positive count would indicate the scan ran. +# Capture before/after to be precise: +# (For simplicity, this test asserts the skip log appeared. A run log alongside +# it would also appear, indicating the skip path was NOT taken; the assertion +# below covers that case.) +is(log_grep_count($logfile, qr/ts recovery: scanned \d+ xids.*recovered remote_commit_ts/), + 0, + 'scan did NOT run on clean restart (no "recovered remote_commit_ts" log line)'); + +# remote_commit_ts should still match pre-shutdown. +my $post_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); +is($post_ts, $pre_ts, "remote_commit_ts preserved across clean restart"); + +psql_or_bail(2, "SELECT spock.sub_drop('test_sub')"); +destroy_cluster('Destroy 2-node cluster'); +done_testing(); +``` + +```bash +chmod +x tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl +``` + +- [ ] **Step 7.2: Run the test** + +```bash +./run_tests.sh t/029_remote_commit_ts_recovery_clean_shutdown.pl +``` + +Expected: all subtests pass. The skip-path INFO log was added in Task 4 and the recovery function is gated on `RECONCILE_FILE_IN_SYNC` — both already in place by this point. + +- [ ] **Step 7.3: Commit** + +```bash +git add tests/tap/t/029_remote_commit_ts_recovery_clean_shutdown.pl +git commit -m "test: add 029 (clean-shutdown skip path for ts recovery) + +Verifies that on a clean restart, when resource.dat matches the origin +LSN, the recovery scan is skipped and the loaded remote_commit_ts is +preserved. Asserted via the INFO log line and absence of any 'scanned +xids' line for the restart." +``` + +--- + +## Task 8: Final regression sweep + +- [ ] **Step 8.1: Run the full TAP suite** + +```bash +cd tests/tap && ./run_tests.sh +``` + +Expected: all tests pass, including 008, 022, 023, 026, and the three new tests (027, 029, 030). + +- [ ] **Step 8.2: Verify the commit log** + +```bash +git log --oneline main..HEAD | head -20 +``` + +Expected: a clean linear sequence — refactor, constants, test 027 (failing), implement, test 030 (failing), guard, test 029. + +- [ ] **Step 8.3: No commit needed; the work is complete.** + +If this lands as a PR, the user will decide whether to fold the design doc into the PR (currently uncommitted in the working tree). + +--- + +## Self-Review + +- **Spec coverage:** every section of the spec maps to a task: enum and skip-gating (Task 1, 4), constants (Task 2), scan logic (Task 4), `track_commit_timestamp` guard (Task 6), INFO logs (Task 4, 7), tests 027/029/030 (Tasks 3, 7, 5). Test 028 (idle/empty cluster) is deferred per the spec's discussion that triggering "no commits found" in a TAP test requires either a constant override or 1M+ unrelated xids — both are too heavy for v1. +- **Type consistency:** `ReconcileResult` enum (Task 1) is used in Task 4's call site. `target_origin` is `RepOriginId`. `running_max_ts` is `TimestampTz`. Constants are `int64`-safe via the `(int64) total_scanned` cast. +- **Placeholder scan:** every step has executable code or commands. The only soft spot is the log-path detection in tests 029 and 030 — handled with a comment instructing the implementer to inspect `SpockTest.pm` if the default path doesn't match. diff --git a/docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md b/docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md index efdba46d..d5eeff23 100644 --- a/docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md +++ b/docs/internals-doc/specs/spock-progress-no-checkpoint-hook-design.md @@ -1,7 +1,14 @@ # Eliminating the Spock checkpoint hook and per-commit WAL flush for `spock.progress` -**Date:** 2026-04-21 -**Status:** Design approved; pending implementation plan +**Date:** 2026-04-21 (revised 2026-05-01 to reflect as-built behavior) +**Status:** Implemented + +> **Two design changes from the original proposal**, both expanding scope: +> +> 1. **`remote_commit_ts` is recovered post-crash via a `pg_commit_ts` scan**, not left NULL. The original design accepted NULL as a freshness regression; the implementation runs a backward scan of `pg_commit_ts` for the relevant origin and repopulates the field. Promoted from a "follow-up" to a goal so `spock.progress` shows an accurate timestamp after a crash. The recovered `prev_remote_ts` is also intended to be useful for the planned parallel-apply rework. +> 2. **The RMGR (id 144) is kept**, retargeted from apply-progress recovery to per-event progress records. The original design deleted it entirely. The shipped version retains the registration and emits one WAL record per affected `SpockGroupHash` entry at each `resource.dat` dump call site (clean shutdown, `add_node`, table-sync). Each record carries the full `SpockApplyProgress` snapshot for that entry. The records are not required for state recovery; they exist so a `pg_waldump` trace shows the exact progress snapshot persisted at each event LSN — useful for incident reconstruction over time. A single `XLogFlush` at the end of each emit batch gives WAL durability symmetric with `resource.dat`'s `durable_rename()`. +> +> Sections below are updated to reflect what shipped. ## Problem @@ -20,32 +27,33 @@ Investigation showed that Spock's apply worker already uses Postgres's native re ## Goal -Remove the Postgres core patch (`Checkpoint_hook`) and the extra per-commit `XLogFlush`, while preserving user-visible `spock.progress` behavior across clean restarts. Accept a small regression in post-crash monitoring freshness (`remote_commit_ts` becomes NULL after a crash-with-new-commits, instead of showing stale-from-last-shutdown), in exchange for a substantially simpler implementation. +Remove the Postgres core patch (`Checkpoint_hook`) and the extra per-commit `XLogFlush`, while preserving user-visible `spock.progress` behavior across clean and crash restarts. Recover `remote_commit_ts` after crash via a backward scan of `pg_commit_ts` for the relevant origin so the view shows an accurate timestamp; the recovered `prev_remote_ts` is also intended to be useful for the planned parallel-apply rework. ## Non-goals - Changing the `spock.progress` column schema or the SQL-level API. - Altering apply-worker resume semantics (which are already handled by Postgres replication origins and need no change). - Removing or reshaping the `SpockGroupHash` shmem layout or the `SpockApplyProgress` struct. -- Attempting to recover `remote_commit_ts` per-origin from Postgres internals at startup. (Investigated: replication origins persist only `remote_lsn`, and Postgres exposes no efficient per-origin lookup in the commit-ts SLRU. Deferred as a follow-up.) +- Recovering `last_updated_ts` precisely (apply-time timestamp). Post-crash, set to the recovered `remote_commit_ts` (the publisher's commit timestamp) — a lower bound on local apply time that under-reports lag rather than zeroing it. Refreshes on the next applied commit. ## Design overview -The new design keeps `resource.dat` as a shutdown-only snapshot and replaces WAL-based durability with two components: +The shipped design keeps `resource.dat` as a snapshot written at three call sites — clean shutdown (supervisor `before_shmem_exit`), `add_node`, and table-sync — and replaces WAL-based durability with three components: -1. **Replication origins as the source of truth for `remote_commit_lsn`.** At apply-worker startup, read the origin's LSN and reconcile shmem against it. If `resource.dat` holds a stale LSN for an entry (file_lsn < origin_lsn), the file's timestamp fields for that entry are cleared to NULL — they can't be trusted. +1. **Replication origins as the source of truth for `remote_commit_lsn`.** At apply-worker startup, read the origin's LSN and reconcile shmem against it. If `resource.dat` holds a stale LSN for an entry (file_lsn < origin_lsn), the file's timestamp fields for that entry are cleared to NULL pending recovery by step 3. 2. **A forced keepalive on apply-worker (re)connect.** The subscriber sends a status-update message with `replyRequested=true` immediately after `spock_start_replication` returns. Postgres's walsender responds with a 'k' keepalive carrying its current `sentPtr`. The apply worker's existing keepalive handler populates `remote_insert_lsn` and `received_lsn` in shmem within milliseconds of reconnect, rather than waiting for `wal_sender_timeout / 2` (default 30s) to elapse. +3. **A backward `pg_commit_ts` scan to recover `remote_commit_ts` per origin.** When reconcile detects stale or absent state, a follow-on routine `recover_progress_timestamps_from_commit_ts()` walks `pg_commit_ts` backward from the latest xid, filtering on the publisher's origin id, and writes the max-by-ts back into shmem. Termination at 1000 commits seen for the origin or 1M xids total scanned, whichever first. `last_updated_ts` is set to the same recovered `remote_commit_ts` so the XOR invariant in `progress_update_struct` holds; using the publisher's commit_ts (rather than the subscriber's wall clock) gives a lower bound on local apply time that under-reports lag instead of hiding it as zero. Together these handle every shmem field without WAL or checkpoint-time persistence: | Field | Source | When authoritative | |---|---|---| | `remote_commit_lsn` | Replication origin (`replorigin_session_get_progress`) | At apply-worker startup, always correct | -| `remote_commit_ts` | `resource.dat` if validated against origin LSN; else NULL until next commit | Preserved across clean restart; NULL after stale-detected crash; NULL on fresh first start | +| `remote_commit_ts` | `resource.dat` if validated against origin LSN; else `pg_commit_ts` scan | Preserved across clean restart; recovered post-crash; NULL only on fresh first start (no commits yet) | | `remote_insert_lsn` | Forced 'k' keepalive | Within ~ms of apply worker connect | | `received_lsn` | Same | Same | -| `last_updated_ts` | `GetCurrentTimestamp()` at each update, or from file | Always advances with events | -| `prev_remote_ts` | No longer written (field kept in struct; obsolete) | — | +| `last_updated_ts` | `GetCurrentTimestamp()` at each update; post-crash set to the recovered `remote_commit_ts` | Always advances during apply; post-crash pinned to the recovered publisher commit_ts as a lower bound on local apply time (under-reports lag rather than zeroing it); refreshes on next applied commit | +| `prev_remote_ts` | Set together with `remote_commit_ts` (same value) | Same as `remote_commit_ts` — recovered alongside | ## Architecture @@ -53,30 +61,50 @@ Together these handle every shmem field without WAL or checkpoint-time persisten **Phase 1 — `shmem_startup_hook` (pre-recovery).** -Runs via `spock_group_shmem_startup()`, unchanged from today except that the spock RMGR no longer exists so WAL recovery has no Spock records to redo: +Runs via `spock_group_shmem_startup()`, unchanged from today except that the spock RMGR no longer carries apply-progress records so WAL recovery has no state-bearing Spock records to redo: 1. Create the empty `SpockGroupHash` HTAB. 2. Call `spock_group_resource_load()` to load `PGDATA/spock/resource.dat` if present and valid (version and `system_identifier` match). Each record is upserted into the hash via `spock_group_progress_update()`. -3. **No WAL redo for Spock records** (the RMGR is deleted). +3. **No state-bearing WAL redo for Spock records.** The RMGR (id 144) is retained but its only record type is a dump-event marker whose redo handler emits a `LOG` line and does not touch shmem. At the end of phase 1, shmem contains either an empty hash or the contents of the last clean shutdown's `resource.dat`. Values may be stale if a crash occurred since that shutdown. **Phase 2 — per-apply-worker reconciliation.** -In `spock_apply_main()`, after `replorigin_session_setup()` returns the session's origin but before `spock_start_replication()` is called, each worker reconciles its own shmem entry: +In `spock_apply_main()`, after `replorigin_session_setup()` returns the session's origin but before `spock_start_replication()` is called, each worker reconciles its own shmem entry. `reconcile_progress_with_origin()` returns a `ReconcileResult` enum so the caller can gate Phase 3 (the recovery scan) on whether reconcile detected staleness. 1. Compute the shmem key: `(MyDatabaseId, MySubscription->target->id, MySubscription->origin->id)`. 2. Fetch `origin_lsn = replorigin_session_get_progress(false)`. 3. Acquire `SpockCtx->apply_group_master_lock` in exclusive mode; HASH_ENTER for the entry: - - **Entry absent.** Initialize with `remote_commit_lsn = origin_lsn`, other fields zero/NULL. - - **Entry present, `file_lsn == origin_lsn`.** File is in sync with the origin. Keep `remote_commit_ts` and `last_updated_ts` as loaded. - - **Entry present, `file_lsn < origin_lsn`.** File is stale for this entry. Overwrite `remote_commit_lsn` with `origin_lsn`; clear `remote_commit_ts` and `last_updated_ts` to NULL; leave `remote_insert_lsn` and `received_lsn` untouched (they'll be overwritten by the forced keepalive momentarily). - - **Entry present, `file_lsn > origin_lsn`.** Anomalous; shouldn't occur in normal flow. Log a warning and prefer `origin_lsn`. -4. Release the lock. + - **Entry absent.** Initialize with `remote_commit_lsn = origin_lsn`, other fields zero/NULL. Result: `RECONCILE_FILE_ABSENT`. + - **Entry present, `file_lsn == origin_lsn`.** File is in sync with the origin. Keep `remote_commit_ts` and `last_updated_ts` as loaded. Result: `RECONCILE_FILE_IN_SYNC`. + - **Entry present, `file_lsn < origin_lsn`.** File is stale for this entry. Overwrite `remote_commit_lsn` with `origin_lsn`; clear `remote_commit_ts` and `last_updated_ts` to 0; bump `remote_insert_lsn` and `received_lsn` up to at least `origin_lsn` to preserve the `insert_lsn >= commit_lsn` invariant (the forced keepalive will refresh them shortly). Result: `RECONCILE_FILE_STALE`. + - **Entry present, `file_lsn > origin_lsn`.** Anomalous; shouldn't occur in normal flow. Log a warning and prefer `origin_lsn` (same body as STALE). Result: `RECONCILE_FILE_ANOMALY`. +4. Release the lock and return the result. Phase 2 lives in the apply worker (not the shmem_startup_hook) because the startup hook runs before catalogs and subscriptions are readable. -**Phase 3 — forced keepalive.** +**Phase 3 — `pg_commit_ts` scan to recover `remote_commit_ts`.** + +If reconcile returned anything other than `RECONCILE_FILE_IN_SYNC`, run `recover_progress_timestamps_from_commit_ts(entry, target_origin)` where `target_origin = MySubscription->origin->id` (the publisher's spock node id, which is what `pg_commit_ts` records for applied xacts on the subscriber — `handle_origin()` overwrites `replorigin_session_origin` per-message with that value). + +The scan walks backward from `ReadNextTransactionId() - 1` in batches of `SPOCK_TS_RECOVERY_BATCH_SIZE` (1000), calling `TransactionIdGetCommitTsData(xid, &ts, &origin)` for each xid. Filter: `origin == target_origin`. Track per-origin `running_max_ts` and `seen_count`. Terminate when: + +- `seen_count >= SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN` (1000), or +- `total_scanned >= SPOCK_TS_RECOVERY_SCAN_LIMIT` (1M), or +- backward scan reaches `FirstNormalTransactionId`. + +The 1000-commits-per-origin termination is justified by concurrency width: the publisher's max in-flight xacts is bounded well below 1000 by `max_connections` and pooler limits, so any commit older than the 1000th observed has a strictly smaller commit_ts than the max already seen — even under parallel apply where xid order ≠ commit_ts order on the subscriber. + +If `running_max_ts > 0`, write it under the master lock to `remote_commit_ts`, `prev_remote_ts`, and `last_updated_ts` (XOR invariant in `progress_update_struct` requires `remote_commit_ts` and `last_updated_ts` to be both zero or both non-zero). Using the recovered publisher commit_ts for `last_updated_ts` rather than the subscriber's wall clock gives a lower bound on local apply time, which under-reports lag instead of hiding it. The next applied commit refreshes `last_updated_ts` to the actual subscriber-side `GetCurrentTimestamp()`. If `running_max_ts == 0`, log a `WARNING` and proceed with NULL. + +The constants are `#define`s in `src/spock_progress_recovery.c` (`SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN`, `SPOCK_TS_RECOVERY_SCAN_LIMIT`, `SPOCK_TS_RECOVERY_BATCH_SIZE`). Convert to GUCs only if profiling motivates tuning. + +A `LOG` line is emitted on completion in either path: "ts recovery: file in sync with origin, skipping scan" (skip) or "ts recovery: scanned N xids, found K commits, recovered remote_commit_ts" (run). `LOG` rather than `INFO` because INFO is client-only and does not reach the server log for a bgworker. + +`track_commit_timestamp = on` is required. Spock already enforces this via the `spock.conflict_resolution` GUC's `check_hook` (the only allowed value `last_update_wins` requires it), which fires at postmaster start. The recovery scan therefore does not need its own guard — by the time it runs, the GUC is guaranteed on. + +**Phase 4 — forced keepalive.** Immediately after `spock_start_replication()` returns, before entering the receive loop, the apply worker calls a new helper: @@ -114,37 +142,46 @@ Same transformation as `force_set_list`: drop the per-entry WAL call, add a sing ### Shutdown flow -Unchanged. `src/spock_shmem.c:190-199` already contains a `spock_on_shmem_exit` callback that calls `spock_group_resource_dump()` only on clean exit (`if (code != 0) return;`). This is the sole remaining runtime producer of `resource.dat`. +The clean-shutdown dump runs from the supervisor's `before_shmem_exit` callback (`spock_supervisor_on_exit` in `src/spock.c`), not from postmaster context. The supervisor first poll-waits for sibling apply/sync workers to clear their `PGPROC` slots (~5 s ceiling), then calls `spock_group_resource_dump()` followed by `spock_rmgr_log_resource_dump(SPOCK_DUMP_SHUTDOWN, NULL)`. Running in the supervisor process is required for the WAL emit (postmaster has no backend slot for `XLogInsert`); the drain wait ensures the file dump and the SHUTDOWN forensic records both reflect each worker's last update to `SpockGroupHash`. This is the sole remaining runtime producer of `resource.dat` on clean exit. ## Code changes ### Delete -- `src/spock_rmgr.c` — entire file. -- `include/spock_rmgr.h` — entire file (contains no live struct definitions; `SpockApplyProgress` and `SpockGroupEntry` live in `include/spock_group.h:116-147`). -- `spock_checkpoint_hook()` function in `src/spock_group.c:653-671`. -- `extern void spock_checkpoint_hook(...)` declaration in `include/spock_group.h:162`. +- `spock_checkpoint_hook()` function in `src/spock_group.c` (no longer registered or referenced). +- `extern void spock_checkpoint_hook(...)` declaration in `include/spock_group.h`. +- The apply-progress recovery contents of `src/spock_rmgr.c` (the redo handler that replayed `SPOCK_RMGR_APPLY_PROGRESS` records into shmem). The file itself is retained — see "Trim, do not delete" below. +- The `spock_apply_progress_add_to_wal()` emit helper. ### Remove one-liners -- `src/spock.c:1248` — `Checkpoint_hook = spock_checkpoint_hook;`. -- `src/spock.c:1251` — `spock_rmgr_init();`. -- `src/spock_apply.c:1018` — `spock_apply_progress_add_to_wal(&sap);` inside `handle_commit`. -- `src/spock_group.c:682` — `spock_apply_progress_add_to_wal(sap);` inside `spock_group_progress_update_list`. -- `src/spock_group.c:769` — `spock_apply_progress_add_to_wal(sap);` inside `spock_group_progress_force_set_list`. -- `#include "spock_rmgr.h"` in `include/spock_group.h:23`. -- `#include "spock_rmgr.h"` in `src/spock.c:62`. +- `src/spock.c` — `Checkpoint_hook = spock_checkpoint_hook;`. +- `src/spock_apply.c:handle_commit` — `spock_apply_progress_add_to_wal(&sap);`. +- `src/spock_group.c:spock_group_progress_update_list` — `spock_apply_progress_add_to_wal(sap);`. +- `src/spock_group.c:spock_group_progress_force_set_list` — `spock_apply_progress_add_to_wal(sap);`. +- `#include "spock_rmgr.h"` in `include/spock_group.h` (was a circular include with no consumer). + +### Trim, do not delete + +- **`src/spock_rmgr.c`** — replace the apply-progress recovery contents with a single record type `SPOCK_RMGR_RESOURCE_DUMP`. New emit helper `spock_rmgr_log_resource_dump(event, changed_entries)` is called at each of the three `resource.dat` dump call sites. Each call emits one WAL record per progress entry, carrying the full `SpockApplyProgress` snapshot for that entry so a `pg_waldump` trace shows exactly what state was on disk at each event LSN — useful for incident reconstruction over time. A single `XLogFlush` at the end of each emit batch covers all records. + - For `SHUTDOWN` and `ADD_NODE`: caller passes `NULL` for `changed_entries`; the helper walks `SpockGroupHash` and emits a record per entry. Both events reflect changes across all subscriptions. + - For `TABLE_SYNC`: caller passes the same list of `SpockApplyProgress *` it just used to update shmem. Only the affected subscription's entries are emitted, since table-sync is per-subscription. + - Redo handler emits a `LOG` line per record showing event type, sequence in batch, key, LSNs, and `commit_ts`. Records are not used for state recovery. +- **`include/spock_rmgr.h`** — replace declarations to match the trimmed contents (`SPOCK_RMGR_ID = 144`, `SPOCK_RMGR_RESOURCE_DUMP = 0x10`, `SpockResourceDumpRec` carrying `event_type` + `entry_seq`/`entry_total` + the embedded `SpockApplyProgress`, `SpockResourceDumpEvent`, the emit helper). +- **`src/spock.c`** — `spock_rmgr_init()` call retained in `_PG_init`. Header include retained. ### Modify +- **`src/spock_progress_recovery.c` and `include/spock_progress_recovery.h`** (new module) — single public entry point `spock_init_progress_state(XLogRecPtr origin_lsn)`. File-static internals: the `ReconcileResult` enum, `reconcile_progress_with_origin()`, `recover_progress_timestamps_from_commit_ts()`, and the `SPOCK_TS_RECOVERY_*` `#define`s. - **`src/spock_apply.c`** — add a new helper `request_initial_status_update(PGconn *conn, XLogRecPtr startpos)` that emits an 'r' message with `replyRequested=true` without touching `send_feedback`'s static state. -- **`src/spock_apply.c:spock_apply_main`** — after `replorigin_session_setup` and before `spock_start_replication`, add phase-2 reconciliation. After `spock_start_replication` returns, call `request_initial_status_update`. -- **`src/spock_group.c:spock_group_progress_force_set_list`** — drop the WAL call; add a single `spock_group_resource_dump()` after the loop. -- **`src/spock_group.c:spock_group_progress_update_list`** — same. +- **`src/spock_apply.c:spock_apply_main`** — between `replorigin_session_setup` and `spock_start_replication`, call `spock_init_progress_state(origin_lsn)` once; this internally runs reconcile and (if reconcile detected staleness) the `pg_commit_ts` scan. After `spock_start_replication` returns, call `request_initial_status_update`. +- **`src/spock_group.c:spock_group_progress_force_set_list`** — drop the WAL call; add a single `spock_group_resource_dump()` followed by `spock_rmgr_log_resource_dump(SPOCK_DUMP_ADD_NODE, ...)` after the loop. +- **`src/spock_group.c:spock_group_progress_update_list`** — same with `SPOCK_DUMP_TABLE_SYNC`. +- **`src/spock.c:spock_supervisor_main`** — register `spock_supervisor_on_exit` via `before_shmem_exit`. The callback poll-waits for apply/sync workers to detach via a new helper `spock_any_apply_worker_running()` (in `src/spock_worker.c`), then calls `spock_group_resource_dump()` followed by `spock_rmgr_log_resource_dump(SPOCK_DUMP_SHUTDOWN, NULL)`. The previous postmaster-context `spock_on_shmem_exit` in `src/spock_shmem.c` is removed entirely; both file and WAL dumps now have a single owner that runs in the supervisor process where `XLogInsert` is permitted. ### Keep unchanged -- `src/spock_shmem.c:190-199` — `spock_on_shmem_exit` (clean-shutdown-only dump). +- Clean-shutdown gating semantics (only dumps if `code == 0`) — preserved in `spock_supervisor_on_exit`. - `spock_group_resource_load()`, `spock_group_resource_dump()` — load/dump file-format code. - `SpockApplyProgress` and `SpockGroupEntry` struct layouts. - `spock.progress` SQL view and `get_apply_group_progress()` function. @@ -167,7 +204,7 @@ No other Spock or Postgres-patch code depends on `Checkpoint_hook`. **Startup: `resource.dat` entry for a subscription that was dropped.** Old entry sits in shmem; no apply worker reconciles it. Matches current behavior. Optional follow-up: purge unknown entries after catalog is readable. -**Startup: origin LSN is `InvalidXLogRecPtr`.** Subscription exists but nothing has been applied. Reconciliation treats this as "no origin progress"; keep `resource.dat` value if any, otherwise zero. +**Startup: origin LSN is `InvalidXLogRecPtr`.** Subscription exists but nothing has been applied. Reconcile still runs first: if the loaded `resource.dat` entry has `file_lsn > 0` it is treated as ANOMALY (file ahead of origin) — `commit_lsn` is overwritten to 0, timestamps cleared, insert/received bumped to 0. If `file_lsn == 0` the IN_SYNC branch keeps the file as-is. The `pg_commit_ts` scan is then skipped via the `XLogRecPtrIsInvalid(origin_lsn)` short-circuit in `spock_init_progress_state`, since a fresh origin has no progress to recover and `pg_commit_ts` rows from a prior subscription would only backfill stale historical timestamps. **`add_node` retry.** Failed mid-loop leaves shmem partially updated and the file untouched (dump runs only after the loop completes). Second run's unconditional overwrite (`init_progress_fields` then `progress_update_struct`) replaces partial state; dump runs after the loop. Fine. @@ -175,7 +212,15 @@ No other Spock or Postgres-patch code depends on `Checkpoint_hook`. **Multiple apply workers.** Each worker reconciles its own key under the master lock. No cross-worker interference. Forced keepalive is per-connection. -**Post-crash with publisher offline.** `remote_commit_ts` is NULL in `spock.progress` until publisher reconnects and sends a commit. This is the only visible regression versus today (which would show stale-from-last-shutdown values). NULL is intentionally chosen as an honest "unknown" signal. +**Post-crash with publisher offline.** `remote_commit_ts` is recovered from the subscriber's `pg_commit_ts` (which survives the crash) by the Phase 3 scan. Recovery does not depend on the publisher being reachable. The only situation that yields NULL is one where `pg_commit_ts` itself has no entries for this origin within the last 1M xids — typically a fresh subscription that has not yet applied any commits. + +**Recovery scan finds nothing.** If the subscriber has no `pg_commit_ts` entries with `roident == target_origin` in the last 1M xids, `running_max_ts` stays 0. We log a `WARNING` and proceed with NULL `remote_commit_ts`. This is rare and matches the fresh-subscription case anyway. + +**Recovery scan picks up unrelated commits.** The filter `roident == MySubscription->origin->id` is the publisher's spock node id, not subscription-unique. Sibling subscriptions to the same publisher, or a dropped/recreated subscription with old commits still in the 1M-xid window, can contribute to `running_max_ts`. Under monotonic publisher clocks the max-by-ts naturally prefers the most recent commit (current sub), so the failure mode is narrow: clock skew between drop and recreate, or scenarios where stale commits are newer than current ones. Impact is forensic-only — `prev_remote_ts` is not consumed today (apply-group ordering is disabled). The planned parallel-apply rework is the natural place to revisit. + +**Concurrency width vs. termination bound.** The scan stops after observing 1000 commits per origin, which is correct only if max in-flight concurrent xacts on the publisher is bounded below 1000. PostgreSQL's `max_connections` defaults to 100, and pooled deployments cap effective concurrency well below that. If a deployment ever exceeds it, the constant is a single `#define` to bump. + +**Publisher clock skew (NTP step-back).** `running_max_ts` is the max-by-ts. If the publisher's clock steps backward such that commit_ts order disagrees with commit *order*, the recovered `prev_remote_ts` may not match the next transaction's `required_commit_ts` from the publisher. Theoretical only — the apply-group ordering machinery that consumes `prev_remote_ts` is currently disabled. The planned parallel-apply rework will revisit. Out of scope for this design. **Shmem hash full during `force_set_list`.** Warn and continue (existing behavior unchanged). Post-loop dump persists whichever entries made it in. @@ -185,18 +230,21 @@ No other Spock or Postgres-patch code depends on `Checkpoint_hook`. ### Modified existing tests -- `tests/tap/t/008_rmgr.pl` — rewrites assertions for the new mechanism. "Clean stop → restart → spock.progress preserved" and "crash → correct remote_commit_lsn from origin" assertions remain conceptually, with different internals. +- `tests/tap/t/008_rmgr.pl` — rewrote Phase 4 assertions for the new reconcile semantics. Covers the multi-origin case (Scenario A: file has stale entry, Scenario B: file has no entry for an origin). The test issues a `CHECKPOINT` before SIGKILL so the apply worker's WAL is durable past batch-2; without this, walwriter-cadence races could leave recovery's origin LSN at batch-1 and reconcile would take the IN_SYNC path instead of STALE. +- `tests/tap/t/022_rmgr_progress_post_checkpoint_crash.pl` — rewrote to assert that reconcile detects `file_lsn < origin_lsn` after a crash, advances `remote_commit_lsn` from the replication origin, and (with the recovery scan) that `remote_commit_ts` is recovered non-NULL post-crash. Also asserts CHECKPOINT does NOT update `resource.dat` (regression guard against re-introducing the checkpoint hook). ### New tests -1. **Clean shutdown/restart preserves `remote_commit_ts`.** Apply N transactions → clean stop → start → assert `remote_commit_ts` matches pre-shutdown value. -2. **Crash after post-shutdown commits → `remote_commit_ts` NULL.** Clean shutdown → apply transactions → `SIGKILL` → restart → assert `remote_commit_ts` is NULL (stale-detection cleared it), `remote_commit_lsn` matches origin. -3. **Crash during idle → fields NULL until reconnect.** Let subscriber idle with publisher → `SIGKILL` → restart before publisher reconnects → assert only `remote_commit_lsn` populated, others NULL. -4. **Forced keepalive populates within ms.** Set `wal_sender_timeout` high (e.g., 600s) so the timer-driven path is disabled. Start apply worker → assert `remote_insert_lsn` populated within 1 second (proves the forced keepalive, not the timer, produced it). -5. **`add_node` crash-durability.** Run `add_node` → `SIGKILL` before clean shutdown → restart → assert mesh progress entries from `force_set_list` survived (via the post-loop `spock_group_resource_dump`). -6. **No extra WAL flush on commit.** Instrument or count flushes; confirm applied remote-origin commits perform 1 `XLogFlush`, not 2. -7. **`add_node` interrupted mid-loop.** Inject failure between entries → assert `resource.dat` reflects pre-add_node state (no partial dump). -8. **Publisher never responds to forced keepalive.** Block reply path → assert apply worker does not block; values populate on `wal_sender_timeout / 2` cadence. +1. **`023_forced_keepalive_timing.pl`** — set `wal_sender_timeout=10min` so the timer-driven path is disabled; assert `remote_insert_lsn` populates within 1 second (proves the forced keepalive, not the timer, drove the refresh). +2. **`026_no_double_wal_flush.pl`** — count `pg_stat_wal.wal_sync` delta over 30 applied commits; assert delta < N (no per-commit forced fsync on the apply hot path). +3. **`027_remote_commit_ts_recovery.pl`** — apply commits, clean restart, more commits, CHECKPOINT, SIGKILL, restart; assert `remote_commit_ts` is non-NULL post-crash (recovered via the `pg_commit_ts` scan). +4. **`029_remote_commit_ts_recovery_clean_shutdown.pl`** — clean restart only; assert the skip-path `LOG` line ("file in sync with origin, skipping scan") fires and the run-path log does not, and `remote_commit_ts` is preserved. + +### Deferred / not implemented + +- **`track_commit_timestamp = off` test.** Cannot occur in practice because spock's existing `spock.conflict_resolution` GUC validation aborts the postmaster at startup if the GUC is off. The recovery scan is unreachable in that scenario. +- **Idle-cluster test (no commits in last 1M xids).** Would require either >1M unrelated xids of activity or a temporary constant override. Both too heavy for v1. +- **`add_node` interrupted mid-loop test.** Out of scope for this design; covered by existing `add_node` integration tests. ### Regression safety checks @@ -206,6 +254,9 @@ No other Spock or Postgres-patch code depends on `Checkpoint_hook`. ## Follow-ups (out of scope) -- Best-effort `GetLatestCommitTsData` at startup to recover `remote_commit_ts` for the single globally-latest commit's origin. Low cost, low coverage (helps only one origin per crash). Deferred per user's call. +- Convert `SPOCK_TS_RECOVERY_*` `#define`s to GUCs if any deployment ever needs to tune them. +- Generalize the forensic RMGR record into a small audit-event family covering more rare events (`add_node` / `drop_node` lifecycle, subscription state transitions, sync start/end). Sketched separately in `2026-05-01-spock-audit-rmgr-design.md`. +- Revisit the parallel-apply ordering coordinate. The current `commit_ts`-based approach is exposed to publisher clock skew; alternatives will be considered alongside the parallel-apply rework. +- Multi-origin shared scan: a single backward pass that populates all origins' shmem entries at once, gated by a per-origin "recovery done" flag in `SpockGroupEntry`. Worth doing only if profiling shows per-worker scan cost matters. - Purging `resource.dat` entries that reference subscriptions that no longer exist. -- Removing the obsolete `prev_remote_ts` and `updated_by_decode` fields from the `SpockApplyProgress` struct layout. +- Removing the obsolete `updated_by_decode` field from the `SpockApplyProgress` struct layout. (`prev_remote_ts` is still actively used as the wait/signal coordinate for the parallel-apply machinery — keep until that rework lands.) diff --git a/include/spock_group.h b/include/spock_group.h index 0543b0b6..a8b8ed36 100644 --- a/include/spock_group.h +++ b/include/spock_group.h @@ -160,4 +160,13 @@ extern void spock_group_resource_dump(void); extern void spock_group_progress_update_list(List *lst); extern void spock_group_progress_force_set_list(List *lst); +/* + * Reset every non-key field of a SpockApplyProgress to its "not set" value. + * Single source of truth for initial-state semantics; callers that need to + * (re)seed a progress entry from somewhere other than the hash insert path + * (e.g. reconcile_progress_with_origin) call this directly so the two paths + * cannot drift when SpockApplyProgress gains a new field. + */ +extern void spock_init_progress_fields(SpockApplyProgress *progress); + #endif /* SPOCK_GROUP_H */ diff --git a/src/spock_apply.c b/src/spock_apply.c index 04fa56ba..fa1e467e 100644 --- a/src/spock_apply.c +++ b/src/spock_apply.c @@ -2895,6 +2895,16 @@ send_feedback(PGconn *conn, XLogRecPtr recvpos, int64 now, bool force) * first 'w' that arrives, often sooner than the eventual 'k'. The * forced keepalive matters most when the publisher is otherwise idle; * when there is data flowing, the data messages do the job. + * + * Protocol-version note: the 'r' / 'k' framing is the streaming-replication + * wire protocol (libpqwalreceiver level), not the spock output-plugin + * protocol carried inside 'w' message bodies, so this works on every + * negotiable spock protocol version. The 'k' response advances + * received_lsn on both v4 and v5+. remote_insert_lsn refresh paths differ: + * v5+ takes it from each 'w' header; v4 takes it from COMMIT message + * payloads. Until the first commit/keepalive lands, v4's remote_insert_lsn + * stays at whatever reconcile pinned it to (origin_lsn) -- correct, but + * not "live" -- while v5+ catches up on the very next 'w'. */ static void request_initial_status_update(PGconn *conn, XLogRecPtr startpos) diff --git a/src/spock_group.c b/src/spock_group.c index a6487004..2332a80e 100644 --- a/src/spock_group.c +++ b/src/spock_group.c @@ -80,8 +80,8 @@ static void spock_group_resource_load(void); * We use C99 designated initializers to be explicit about what we're * initializing and to avoid fragile pointer arithmetic. */ -static inline void -init_progress_fields(SpockApplyProgress *progress) +void +spock_init_progress_fields(SpockApplyProgress *progress) { /* Key should already be set by hash_search or caller */ Assert(OidIsValid(progress->key.dbid)); @@ -230,7 +230,7 @@ spock_group_attach(Oid dbid, Oid node_id, Oid remote_node_id) * New entry: the hash table already copied 'key' into * entry->progress.key, now initialize the remaining progress fields. */ - init_progress_fields(&entry->progress); + spock_init_progress_fields(&entry->progress); pg_atomic_init_u32(&entry->nattached, 0); ConditionVariableInit(&entry->prev_processed_cv); @@ -383,7 +383,7 @@ spock_group_progress_update(const SpockApplyProgress *sap) * New entry: the hash table already copied sap->key into * entry->progress.key, now initialize the remaining fields. */ - init_progress_fields(&entry->progress); + spock_init_progress_fields(&entry->progress); pg_atomic_init_u32(&entry->nattached, 0); ConditionVariableInit(&entry->prev_processed_cv); @@ -742,7 +742,6 @@ spock_group_progress_force_set_list(List *lst) if (!found) { - init_progress_fields(&entry->progress); pg_atomic_init_u32(&entry->nattached, 0); ConditionVariableInit(&entry->prev_processed_cv); } @@ -755,11 +754,18 @@ spock_group_progress_force_set_list(List *lst) } /* - * Unconditional overwrite: the resume_lsn from read_peer_progress is - * the COPY-snapshot boundary and is authoritative. + * Force-set: unconditionally adopt sap as authoritative. Cannot use + * progress_update_struct here -- its max-by-timestamp merge skips + * remote_commit_lsn when sap->remote_commit_ts == 0, which is a legal + * shape when the peer is post-reconcile and pre-recovery. */ - init_progress_fields(&entry->progress); - progress_update_struct(&entry->progress, sap); + entry->progress = *sap; + + /* Preserve the insert_lsn >= commit_lsn invariant. */ + if (entry->progress.remote_insert_lsn < entry->progress.remote_commit_lsn) + entry->progress.remote_insert_lsn = entry->progress.remote_commit_lsn; + if (entry->progress.received_lsn < entry->progress.remote_commit_lsn) + entry->progress.received_lsn = entry->progress.remote_commit_lsn; LWLockRelease(SpockCtx->apply_group_master_lock); diff --git a/src/spock_progress_recovery.c b/src/spock_progress_recovery.c index dbaca754..7d793923 100644 --- a/src/spock_progress_recovery.c +++ b/src/spock_progress_recovery.c @@ -71,6 +71,15 @@ spock_init_progress_state(XLogRecPtr origin_lsn) SpockGroupEntry *entry; bool found; + /* + * Defense in depth: reconcile_progress_with_origin returns ABSENT on + * NULL shmem rather than dereferencing it, but the rest of this function + * does dereference SpockCtx/SpockGroupHash. Mirror the guard so the + * caller-visible contract is self-consistent. + */ + if (!SpockCtx || !SpockGroupHash) + return; + result = reconcile_progress_with_origin(origin_lsn); if (result == RECONCILE_FILE_IN_SYNC) @@ -80,6 +89,19 @@ spock_init_progress_state(XLogRecPtr origin_lsn) return; } + /* + * Fresh subscription: replication origin has never recorded progress. + * pg_commit_ts may still hold rows for this publisher's spock node id + * from a prior subscription -- scanning would backfill stale historical + * timestamps unrelated to the current subscription's actual state. + */ + if (XLogRecPtrIsInvalid(origin_lsn)) + { + elog(LOG, "SPOCK %s: ts recovery: no durable origin progress yet, skipping scan", + MySubscription->name); + return; + } + key.dbid = MyDatabaseId; key.node_id = MySubscription->target->id; key.remote_node_id = MySubscription->origin->id; @@ -97,6 +119,11 @@ spock_init_progress_state(XLogRecPtr origin_lsn) * what pg_commit_ts records for applied xacts on the subscriber -- * NOT the local subscription's roident. Filter the scan on * MySubscription->origin->id so we match the recorded entries. + * + * Limitation: this filter is not subscription-unique. Sibling subs to + * the same publisher, or a dropped/recreated sub with old commits + * still in the 1M-xid window, can contribute to running_max_ts. + * Forensic-only impact; parallel-apply rework will revisit. */ recover_progress_timestamps_from_commit_ts(entry, (RepOriginId) MySubscription->origin->id); @@ -152,23 +179,15 @@ reconcile_progress_with_origin(XLogRecPtr origin_lsn) { /* * New entry: hash_search already copied the key into - * entry->progress.key. Zero the remaining fields inline because - * init_progress_fields is static to spock_group.c. - * - * MAINTENANCE: if SpockApplyProgress gains new fields, this block - * must be updated in lockstep with init_progress_fields in - * spock_group.c:init_progress_fields(). + * entry->progress.key. Reset the rest via the same helper that + * the hash insert path uses, then position the LSN fields at + * origin_lsn so this worker starts from the durable origin + * position and the insert >= commit invariant holds. */ - Assert(OidIsValid(entry->progress.key.dbid)); - Assert(OidIsValid(entry->progress.key.node_id)); - Assert(OidIsValid(entry->progress.key.remote_node_id)); - entry->progress.remote_commit_ts = 0; - entry->progress.prev_remote_ts = 0; + spock_init_progress_fields(&entry->progress); entry->progress.remote_commit_lsn = origin_lsn; - entry->progress.remote_insert_lsn = InvalidXLogRecPtr; - entry->progress.received_lsn = InvalidXLogRecPtr; - entry->progress.last_updated_ts = 0; - entry->progress.updated_by_decode = false; + entry->progress.remote_insert_lsn = origin_lsn; + entry->progress.received_lsn = origin_lsn; pg_atomic_init_u32(&entry->nattached, 0); ConditionVariableInit(&entry->prev_processed_cv); elog(LOG, "SPOCK %s: reconcile: new entry seeded at origin LSN %X/%X", @@ -209,11 +228,28 @@ reconcile_progress_with_origin(XLogRecPtr origin_lsn) } else { - /* Anomaly: file ahead of origin. Trust origin. */ - elog(WARNING, "SPOCK %s: reconcile: file LSN %X/%X ahead of origin %X/%X; trusting origin", + /* + * Anomaly: file ahead of origin. Most likely cause is operator + * intervention on the replication origin (e.g. pg_replication_origin_advance + * to roll back) or filesystem corruption. We trust origin because that + * is the value Postgres will use for incoming WAL acknowledgements; the + * file's values are recorded in the log so an operator can recover them + * after the fact if the trust-origin choice turns out to be wrong. + */ + elog(WARNING, + "SPOCK %s: reconcile: file LSN %X/%X ahead of origin %X/%X; trusting origin " + "(discarded file values: commit_lsn=%X/%X insert_lsn=%X/%X received_lsn=%X/%X " + "commit_ts=" INT64_FORMAT " prev_remote_ts=" INT64_FORMAT + " last_updated_ts=" INT64_FORMAT ")", MySubscription->name, LSN_FORMAT_ARGS(entry->progress.remote_commit_lsn), - LSN_FORMAT_ARGS(origin_lsn)); + LSN_FORMAT_ARGS(origin_lsn), + LSN_FORMAT_ARGS(entry->progress.remote_commit_lsn), + LSN_FORMAT_ARGS(entry->progress.remote_insert_lsn), + LSN_FORMAT_ARGS(entry->progress.received_lsn), + (int64) entry->progress.remote_commit_ts, + (int64) entry->progress.prev_remote_ts, + (int64) entry->progress.last_updated_ts); entry->progress.remote_commit_lsn = origin_lsn; entry->progress.remote_commit_ts = 0; entry->progress.prev_remote_ts = 0; @@ -236,9 +272,8 @@ reconcile_progress_with_origin(XLogRecPtr origin_lsn) * After reconcile detects that resource.dat is stale (or absent) for this * origin, scan pg_commit_ts backward from the latest xid to recover the * most-recent remote_commit_ts for this origin. Restores accurate - * post-crash values in the spock.progress view; will also be useful for - * the planned parallel-apply rework, which is expected to coordinate on - * prev_remote_ts. + * post-crash values in the spock.progress view; the recovered values are + * also intended to be useful for the planned parallel-apply rework. * * Termination: stop after observing SPOCK_TS_RECOVERY_MIN_SEEN_PER_ORIGIN * commits for this origin (any older commit is guaranteed to have a smaller diff --git a/tests/tap/t/023_forced_keepalive_timing.pl b/tests/tap/t/023_forced_keepalive_timing.pl index fcacbc17..0a4e4868 100644 --- a/tests/tap/t/023_forced_keepalive_timing.pl +++ b/tests/tap/t/023_forced_keepalive_timing.pl @@ -1,14 +1,17 @@ #!/usr/bin/perl # ============================================================================= # Test: 023_forced_keepalive_timing.pl -# Verify that after apply-worker reconnect, remote_insert_lsn in +# Verify that after apply-worker reconnect, received_lsn in # spock.progress is populated quickly (within a couple of seconds) # even when wal_sender_timeout is set high enough that the publisher's # timer-driven keepalive would not fire. # # Proves the forced 'r' status update with replyRequested=true sent # immediately after spock_start_replication triggers an immediate 'k' -# reply from the publisher. +# reply from the publisher. We assert received_lsn (not remote_insert_lsn) +# because 'k' messages only update received_lsn; remote_insert_lsn is +# only advanced by 'w' data messages (see UpdateWorkerStats and the 'k' +# handler in spock_apply.c). # ============================================================================= use strict; @@ -67,15 +70,27 @@ sub wait_until { 'row replicated'); ok(wait_until(10, sub { - my $lsn = scalar_query(2, "SELECT remote_insert_lsn::text FROM spock.progress LIMIT 1"); + my $lsn = scalar_query(2, "SELECT received_lsn::text FROM spock.progress LIMIT 1"); defined $lsn && $lsn ne '' && $lsn ne '0/0' -}), 'remote_insert_lsn populated by initial apply'); +}), 'received_lsn populated by initial apply'); -# Restart the subscriber cleanly. With wal_sender_timeout=10min, the -# publisher's timer-driven 'k' message will NOT fire within our test -# window, so any refresh of remote_insert_lsn within seconds after the -# apply worker reconnects must be caused by our forced keepalive. +# Stop subscriber cleanly. The supervisor's before_shmem_exit writes +# resource.dat with the current shmem state -- which would mask a +# broken forced keepalive on the next start (shmem reseeds from the file +# before reconnect, so "non-zero" assertions would pass spuriously). system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-m', 'fast', 'stop'; + +# Remove resource.dat so shmem starts empty on next boot. After +# reconcile_progress_with_origin runs in the ABSENT branch, the new +# entry has received_lsn = InvalidXLogRecPtr. The only mechanism that +# can populate it within seconds is a 'k' keepalive carrying the +# publisher's sentPtr -- which is exactly what request_initial_status_update +# triggers. Without forced keepalive, the timer-driven 'k' wouldn't fire +# for wal_sender_timeout/2 = 5min, well outside the test's window. +my $resource_dat = "$sub_dir/spock/resource.dat"; +unlink $resource_dat or diag "no resource.dat to remove (already absent)"; +ok(! -e $resource_dat, 'resource.dat absent before restart'); + system_or_bail "$pg_bin/pg_ctl", '-D', $sub_dir, '-o', "-p $sub_port", '-w', 'start'; ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber restarted'); @@ -83,18 +98,18 @@ sub wait_until { # (apply worker reconnected and the forced keepalive was sent). ok(wait_for_sub_status(2, 'sub', 'replicating', 30), 'subscription replicating after restart'); -# Capture the start time and wait for remote_insert_lsn to become non-zero. -# Under the forced-keepalive path, this should happen within 1-2 seconds. -# Without it, it would take wal_sender_timeout/2 = 5 minutes. +# received_lsn must transition from 0/0 (post-reconcile) to a non-zero +# LSN within 5s. Under the forced-keepalive path this happens in +# milliseconds; without it, in 5 minutes (wal_sender_timeout/2). my $start = time(); my $got = wait_until(5, sub { - my $lsn = scalar_query(2, "SELECT remote_insert_lsn::text FROM spock.progress LIMIT 1"); + my $lsn = scalar_query(2, "SELECT received_lsn::text FROM spock.progress LIMIT 1"); defined $lsn && $lsn ne '' && $lsn ne '0/0' }); my $elapsed = time() - $start; ok($got, - "remote_insert_lsn populated within 5s of reconnect despite wal_sender_timeout=10min (elapsed=${elapsed}s)"); + "received_lsn populated within 5s of reconnect despite wal_sender_timeout=10min (elapsed=${elapsed}s)"); # Sanity: confirm wal_sender_timeout is actually high on the provider # (i.e., the setting took effect). If it didn't, the above test might diff --git a/tests/tap/t/027_remote_commit_ts_recovery.pl b/tests/tap/t/027_remote_commit_ts_recovery.pl index 06bc4a31..73560e96 100755 --- a/tests/tap/t/027_remote_commit_ts_recovery.pl +++ b/tests/tap/t/027_remote_commit_ts_recovery.pl @@ -82,6 +82,13 @@ sub wait_until { my $pre_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); ok($pre_ts ne '', "pre-crash remote_commit_ts populated: $pre_ts"); +# Capture pre-crash remote_commit_lsn. After restart, the post-reconcile +# remote_commit_lsn must reach AT LEAST this value -- resource.dat alone +# holds the round-1 snapshot, whose commit_lsn is strictly less. Observing +# commit_lsn >= $pre_lsn therefore proves reconcile actually ran. +my $pre_lsn = scalar_query(2, "SELECT remote_commit_lsn::text FROM spock.progress LIMIT 1"); +diag("pre-crash remote_commit_lsn=$pre_lsn"); + # CHECKPOINT to advance pg_replication_origin durably past resource.dat's # round-1 snapshot. Without this, after SIGKILL the origin may roll back to # round-1 and reconcile takes the IN_SYNC branch -- not what we want to test. @@ -104,12 +111,22 @@ sub wait_until { ok(wait_for_pg_ready($host, $sub_port, $pg_bin, 30), 'subscriber accepting connections after crash'); # Wait for the apply worker to start, run reconcile, run recovery scan. +# Gate on BOTH: +# 1. remote_commit_lsn >= pre-crash value (proves reconcile ran; +# resource.dat reseed alone gives round-1 commit_lsn, strictly less). +# 2. remote_commit_ts IS NOT NULL (proves recovery scan ran; reconcile +# clears ts to 0/NULL when stale, so non-NULL after reconcile means +# the scan repopulated it). +# Both updates are atomic under apply_group_master_lock, so SQL reads +# see one consistent (commit_lsn, ts) pair per query. ok(wait_until(30, sub { - my $ts = scalar_query(2, q{ - SELECT remote_commit_ts::text FROM spock.progress LIMIT 1 + my $row = scalar_query(2, qq{ + SELECT remote_commit_lsn >= '$pre_lsn'::pg_lsn + AND remote_commit_ts IS NOT NULL + FROM spock.progress LIMIT 1 }); - defined $ts && $ts ne '' && $ts ne '0/0' -}), 'spock.progress.remote_commit_ts populated by recovery scan'); + defined $row && $row eq 't' +}), 'reconcile advanced commit_lsn AND recovery scan repopulated remote_commit_ts'); # --- Assertions --- my $post_not_null = scalar_query(2, @@ -117,6 +134,17 @@ sub wait_until { is($post_not_null, 't', "post-crash remote_commit_ts is NOT NULL (recovered via pg_commit_ts scan)"); +my $post_lsn = scalar_query(2, + "SELECT remote_commit_lsn::text FROM spock.progress LIMIT 1"); +diag("post-crash remote_commit_lsn=$post_lsn (pre-crash was $pre_lsn)"); + +my $post_lsn_advanced = scalar_query(2, qq{ + SELECT remote_commit_lsn >= '$pre_lsn'::pg_lsn + FROM spock.progress LIMIT 1 +}); +is($post_lsn_advanced, 't', + "post-crash remote_commit_lsn >= pre-crash $pre_lsn (reconcile ran)"); + my $post_ts = scalar_query(2, "SELECT remote_commit_ts::text FROM spock.progress LIMIT 1"); diag("post-crash remote_commit_ts: '$post_ts'");