
Add citus_block_writes_for_backup UDFs for LTR backup #8543

Draft
codeforall wants to merge 1 commit into main from muusama/ltr_block

Conversation

@codeforall
Contributor

Citus uses Two-Phase Commit (2PC) for distributed write transactions. When taking cluster-wide backups, independent per-node snapshots can capture mid-flight 2PC transactions in inconsistent states — e.g. a coordinator with the commit record but a worker missing the prepared transaction — leading to irrecoverable data inconsistency on restore.

To solve this, we temporarily block distributed 2PC commit decisions and schema/topology changes across the entire cluster while disk snapshots are taken. Acquiring ExclusiveLock on pg_dist_transaction guarantees that all in-flight 2PC transactions have fully completed (including COMMIT PREPARED on every worker) before the lock is granted, and no new commit decisions can begin while it is held. This creates a clean partition: everything before the lock is fully committed on all nodes; everything after hasn't started. Read queries and single-shard writes are never blocked.

New UDFs:

citus_block_writes_for_backup(timeout_ms int DEFAULT 300000)
  Spawns a dedicated background worker that acquires ExclusiveLock on
  pg_dist_transaction, pg_dist_partition, and pg_dist_node on the
  coordinator, and sends LOCK TABLE ... IN EXCLUSIVE MODE on
  pg_dist_transaction and pg_dist_partition to all metadata workers.
  Returns true when all locks are held cluster-wide.

citus_unblock_writes_for_backup()
  Signals the background worker to release all locks and exit. Can be
  called from any session. Returns true on success.

citus_backup_block_status()
  Returns a single row with the current block state, worker PID,
  requestor PID, block start time, timeout, and locked node count.

Architecture:

A dedicated background worker holds a single transaction whose
lifetime equals the lock-holding period. This ensures locks survive
the caller's session disconnect. All termination paths (explicit
release, timeout expiry, SIGTERM, error, postmaster death) end the
transaction, guaranteeing locks cannot leak.

The worker uses PostgreSQL's WaitEventSet API to simultaneously
monitor its latch (for release/SIGTERM), the postmaster, and every
remote connection socket. If any worker-node connection drops, the
failure is detected immediately via PQconsumeInput/PQstatus and the
block is released with an error.

Shared memory (BackupBlockControlData, protected by an LWLock)
coordinates state between the block/unblock/status UDFs and the
background worker. Stale state from a crashed worker is auto-cleaned
by the status UDF via kill(pid, 0) liveness checks.

Remote lock acquisition is bounded by SET LOCAL statement_timeout on
each worker connection, preventing indefinite hangs if a remote node
is unresponsive during the LOCK TABLE phase.

The BLOCK_DISTRIBUTED_WRITES_COMMAND macro (shared with
citus_create_restore_point) is defined in backup_block.h to avoid
duplication.

@codecov

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 9.06433% with 311 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.38%. Comparing base (42edfe8) to head (d685dea).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8543      +/-   ##
==========================================
- Coverage   88.88%   88.38%   -0.50%     
==========================================
  Files         286      287       +1     
  Lines       63763    64115     +352     
  Branches     8017     8056      +39     
==========================================
- Hits        56678    56671       -7     
- Misses       4776     5100     +324     
- Partials     2309     2344      +35     