Add citus_block_writes_for_backup UDFs for LTR backup#8543
Draft
codeforall wants to merge 1 commit intomainfrom
Draft
Add citus_block_writes_for_backup UDFs for LTR backup#8543codeforall wants to merge 1 commit intomainfrom
codeforall wants to merge 1 commit intomainfrom
Conversation
Citus uses Two-Phase Commit (2PC) for distributed write transactions.
When taking cluster-wide backups, independent per-node snapshots can
capture mid-flight 2PC transactions in inconsistent states — e.g. a
coordinator with the commit record but a worker missing the prepared
transaction — leading to irrecoverable data inconsistency on restore.
To solve this, we temporarily block distributed 2PC commit decisions
and schema/topology changes across the entire cluster while disk
snapshots are taken. Acquiring ExclusiveLock on pg_dist_transaction
guarantees that all in-flight 2PC transactions have fully completed
(including COMMIT PREPARED on every worker) before the lock is granted,
and no new commit decisions can begin while it is held. This creates a
clean partition: everything before the lock is fully committed on all
nodes; everything after hasn't started. Read queries and single-shard
writes are never blocked.
New UDFs:
citus_block_writes_for_backup(timeout_ms int DEFAULT 300000)
Spawns a dedicated background worker that acquires ExclusiveLock on
pg_dist_transaction, pg_dist_partition, and pg_dist_node on the
coordinator, and sends LOCK TABLE ... IN EXCLUSIVE MODE on
pg_dist_transaction and pg_dist_partition to all metadata workers.
Returns true when all locks are held cluster-wide.
citus_unblock_writes_for_backup()
Signals the background worker to release all locks and exit. Can be
called from any session. Returns true on success.
citus_backup_block_status()
Returns a single row with the current block state, worker PID,
requestor PID, block start time, timeout, and locked node count.
Architecture:
A dedicated background worker holds a single transaction whose
lifetime equals the lock-holding period. This ensures locks survive
the caller's session disconnect. All termination paths (explicit
release, timeout expiry, SIGTERM, error, postmaster death) end the
transaction, guaranteeing locks cannot leak.
The worker uses PostgreSQL's WaitEventSet API to simultaneously
monitor its latch (for release/SIGTERM), the postmaster, and every
remote connection socket. If any worker-node connection drops, the
failure is detected immediately via PQconsumeInput/PQstatus and the
block is released with an error.
Shared memory (BackupBlockControlData, protected by an LWLock)
coordinates state between the block/unblock/status UDFs and the
background worker. Stale state from a crashed worker is auto-cleaned
by the status UDF via kill(pid, 0) liveness checks.
Remote lock acquisition is bounded by SET LOCAL statement_timeout on
each worker connection, preventing indefinite hangs if a remote node
is unresponsive during the LOCK TABLE phase.
The BLOCK_DISTRIBUTED_WRITES_COMMAND macro (shared with
citus_create_restore_point) is defined in backup_block.h to avoid
duplication.
1d3f68b to
d685dea
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8543 +/- ##
==========================================
- Coverage 88.88% 88.38% -0.50%
==========================================
Files 286 287 +1
Lines 63763 64115 +352
Branches 8017 8056 +39
==========================================
- Hits 56678 56671 -7
- Misses 4776 5100 +324
- Partials 2309 2344 +35 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Citus uses Two-Phase Commit (2PC) for distributed write transactions. When taking cluster-wide backups, independent per-node snapshots can capture mid-flight 2PC transactions in inconsistent states — e.g. a coordinator with the commit record but a worker missing the prepared transaction — leading to irrecoverable data inconsistency on restore.
To solve this, we temporarily block distributed 2PC commit decisions and schema/topology changes across the entire cluster while disk snapshots are taken. Acquiring ExclusiveLock on pg_dist_transaction guarantees that all in-flight 2PC transactions have fully completed (including COMMIT PREPARED on every worker) before the lock is granted, and no new commit decisions can begin while it is held. This creates a clean partition: everything before the lock is fully committed on all nodes; everything after hasn't started. Read queries and single-shard writes are never blocked.
New UDFs:
citus_block_writes_for_backup(timeout_ms int DEFAULT 300000)
Spawns a dedicated background worker that acquires ExclusiveLock on
pg_dist_transaction, pg_dist_partition, and pg_dist_node on the
coordinator, and sends LOCK TABLE ... IN EXCLUSIVE MODE on
pg_dist_transaction and pg_dist_partition to all metadata workers.
Returns true when all locks are held cluster-wide.
citus_unblock_writes_for_backup()
Signals the background worker to release all locks and exit. Can be
called from any session. Returns true on success.
citus_backup_block_status()
Returns a single row with the current block state, worker PID,
requestor PID, block start time, timeout, and locked node count.
Architecture:
A dedicated background worker holds a single transaction whose
lifetime equals the lock-holding period. This ensures locks survive
the caller's session disconnect. All termination paths (explicit
release, timeout expiry, SIGTERM, error, postmaster death) end the
transaction, guaranteeing locks cannot leak.
The worker uses PostgreSQL's WaitEventSet API to simultaneously
monitor its latch (for release/SIGTERM), the postmaster, and every
remote connection socket. If any worker-node connection drops, the
failure is detected immediately via PQconsumeInput/PQstatus and the
block is released with an error.
Shared memory (BackupBlockControlData, protected by an LWLock)
coordinates state between the block/unblock/status UDFs and the
background worker. Stale state from a crashed worker is auto-cleaned
by the status UDF via kill(pid, 0) liveness checks.
Remote lock acquisition is bounded by SET LOCAL statement_timeout on
each worker connection, preventing indefinite hangs if a remote node
is unresponsive during the LOCK TABLE phase.
The BLOCK_DISTRIBUTED_WRITES_COMMAND macro (shared with
citus_create_restore_point) is defined in backup_block.h to avoid
duplication.
DESCRIPTION: PR description that will go into the change log, up to 78 characters