
feat: zero-downtime node addition via populate pipeline#52

Merged
mmols merged 6 commits into main from feat/PLAT-528/add-node-zero-downtime
Apr 16, 2026

Conversation

Member

@mmols mmols commented Apr 15, 2026

Summary

Enables zero-downtime node addition when bootstrapping new nodes with Spock. Writes continue uninterrupted on existing nodes while the new node syncs via Spock logical replication. This replaces the previous behavior, which required writes to be stopped during the helm upgrade.

The implementation is modeled on the control-plane's populate pattern, adapted to the Helm chart's single-pass declarative reconciliation approach.

What changed

Populate pipeline (internal/spock/)

For a 3-node cluster where n1 is the source, n2 is a peer, and n3 is the new node, the populate pipeline:

  1. Pre-creates replication slots on peers (n2) for the new node (n3)
  2. Creates disabled subscriptions (n2→n3) on the new node so Spock metadata populates
  3. Checkpoints peers via sync events — confirms the source (n1) has all peer (n2) data
  4. Creates the source→new subscription (n1→n3) with full sync
  5. Waits for sync completion via sync event on the new node
  6. Reads commit timestamps from the lag tracker on the new node
  7. Advances peer slots past already-synced data
  8. Creates and enables the end-state subscriptions to complete the full mesh
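Because the chain is strictly ordered, the reconciler only needs a topological sort over the step dependencies. Below is a minimal, self-contained Go sketch of that ordering; the step names are hypothetical and the graph is linearized (the chart's real resources also wire node and subscription dependencies):

```go
package main

import "fmt"

// Hypothetical populate steps for source n1, peer n2, new node n3,
// mirroring the eight-step sequence described above.
var deps = map[string][]string{
	"slot_create_n2_n3":     {},
	"disabled_sub_n2_n3":    {"slot_create_n2_n3"},
	"peer_sync_event_n2":    {"disabled_sub_n2_n3"},
	"wait_sync_n1_from_n2":  {"peer_sync_event_n2"},
	"full_sync_sub_n1_n3":   {"wait_sync_n1_from_n2"},
	"wait_sync_n3_from_n1":  {"full_sync_sub_n1_n3"},
	"lag_tracker_cts_n3":    {"wait_sync_n3_from_n1"},
	"slot_advance_n2_n3":    {"lag_tracker_cts_n3"},
	"enable_end_state_subs": {"slot_advance_n2_n3"},
}

// topoSort returns the steps in dependency order (Kahn's algorithm).
func topoSort(deps map[string][]string) []string {
	indeg := map[string]int{}
	rev := map[string][]string{}
	for n, ds := range deps {
		if _, ok := indeg[n]; !ok {
			indeg[n] = 0
		}
		for _, d := range ds {
			indeg[n]++
			rev[d] = append(rev[d], n)
		}
	}
	var queue, order []string
	for n, d := range indeg {
		if d == 0 {
			queue = append(queue, n)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range rev[n] {
			indeg[m]--
			if indeg[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	return order
}

func main() {
	for i, step := range topoSort(deps) {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

In the chart itself, each step is an ephemeral resource whose Dependencies() returns the identifiers of the steps it must follow, so the planner derives this order rather than hard-coding it.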

Six new ephemeral resource types orchestrate this:

| Resource | Runs on | Purpose |
| --- | --- | --- |
| ReplicationSlotCreate | peer | Pre-create logical replication slot for new node |
| DisabledSubscription | new node | Register peer→new subscription in Spock metadata so lag tracker populates |
| SyncEvent | provider | Insert WAL bookmark via spock.sync_event() |
| WaitForSyncEvent | subscriber | Poll until WAL bookmark received |
| LagTrackerCommitTimestamp | new node | Read commit timestamp from spock.lag_tracker |
| ReplicationSlotAdvanceFromCTS | peer | Convert commit timestamp to LSN and advance slot past synced data |

ComputeDesired in desired.go detects new nodes (bootstrap.mode=spock) and emits the populate chain with correct dependency ordering. When no new nodes exist, the output is identical to before.

Resource engine (internal/resource/)

Added Update() to the Resource interface and ActionUpdate / NeedsUpdate to the planner. This allows the end-state Subscription to detect a disabled subscription (created by populate) and enable it without delete+recreate.
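The action-selection rule this adds can be sketched as follows; the names mirror the description above, but the bodies are illustrative, not the chart's actual planner code:

```go
package main

import "fmt"

// Status is what a Resource reports after Refresh; NeedsUpdate is the
// field added in this PR.
type Status struct {
	Exists      bool
	NeedsUpdate bool
	Reason      string
}

type Action int

const (
	ActionNone Action = iota
	ActionCreate
	ActionUpdate
)

// planAction sketches the new rule: a resource that exists but reports
// NeedsUpdate is updated in place instead of deleted and recreated.
func planAction(s Status) Action {
	switch {
	case !s.Exists:
		return ActionCreate
	case s.NeedsUpdate:
		return ActionUpdate
	default:
		return ActionNone
	}
}

func main() {
	// A subscription left disabled by the populate pipeline refreshes to
	// Exists=true + NeedsUpdate=true, so the planner enables it in place.
	disabled := Status{Exists: true, NeedsUpdate: true, Reason: "subscription is disabled"}
	fmt.Println(planAction(disabled) == ActionUpdate) // prints true
}
```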

Configuration

New pgEdge.initSpockJobConfig.timeout value (default: 2 hours). Applied as activeDeadlineSeconds on the Kubernetes Job and as a Go context timeout with a 30-second grace period.

Docs

  • Updated feature descriptions in README and mkdocs index
  • Removed "stop writes" requirement from limitations and adding_nodes docs
  • Added timeout configuration guidance and recovery instructions for failed adds

Test plan

  • Unit tests: resource identifiers, dependencies, ephemeral status, ComputeDesired graph for 2-node add, 3-node add, and no-new-nodes steady state
  • TestNodesAddNodeZeroDowntime: adds n3 while background writers insert on n1 and n2, verifies data convergence, bidirectional replication, and full mesh
  • Existing tests unaffected: TestNodesRemoveNode, TestDistributedInstall, recovery tests all pass
  • Run full integration suite (make test-integration-kind)

Enables adding nodes while writes continue on existing nodes by
implementing a populate pattern in the single-pass resource engine.

Five new ephemeral resource types orchestrate the populate sequence:
ReplicationSlotCreate, SyncEvent, WaitForSyncEvent,
LagTrackerCommitTimestamp, and ReplicationSlotAdvanceFromCTS.
A DisabledSubscription resource pre-registers peer subscriptions
so the lag tracker populates during sync.

Adds Update() to the resource engine interface so the end-state
reconciler can enable previously disabled subscriptions.

Includes configurable job timeout, integration test with background
writers on both nodes, and a reusable waitForReplication test helper
using sync events.

coderabbitai Bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87284b24-ce5c-48fe-b47e-516ba8ee2adf

📥 Commits

Reviewing files that changed from the base of the PR and between 598da86 and 0e68856.

📒 Files selected for processing (1)
  • test/integration/nodes_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/integration/nodes_test.go

📝 Walkthrough

Walkthrough

Adds zero-downtime pgEdge Spock node-add support: new ephemeral populate and sync resources, update-capable resource lifecycle, optional init-spock execution timeout (Job + env + binary), documentation and values updates, and expanded unit and integration tests for live-writer node addition.

Changes

**Configuration & Docs** (`values.yaml`, `.gitignore`, `README.md`, `docs/index.md`, `docs/configuration.md`, `docs/limitations.md`, `docs/usage/adding_nodes.md`): Added `pgEdge.initSpockJobConfig.timeout` (default 7200s), documented zero-downtime node addition and that writes may continue during upgrades, added `init-spock` to `.gitignore`, and removed prior Helm-upgrade write-stop limitations.

**Init-spock Job / Timeout** (`templates/init-spock-job.yaml`, `cmd/init-spock/main.go`): Wired Job `activeDeadlineSeconds` and container env `INIT_SPOCK_TIMEOUT` to the Helm value; the binary applies an optional context timeout (with 30s grace) when `INIT_SPOCK_TIMEOUT` is set to a positive integer.

**Resource Framework** (`internal/resource/resource.go`, `internal/resource/plan.go`, `internal/resource/executor.go`, `internal/resource/resource_test.go`): Added `Status.NeedsUpdate`, a new `Resource.Update(ctx)` method, and `ActionUpdate`; Plan now emits update phases (ordered before creates), Execute handles updates, and tests/mocks were updated to exercise update behavior.

**Spock Naming & No-op Updates** (`internal/spock/names.go`, `internal/spock/node.go`, `internal/spock/replication_slot.go`, `internal/spock/user.go`): Introduced `spockSlotName` / `spockSubName` helpers and added no-op `Update` implementations on existing spock resource types.

**Populate & Sync Resources** (`internal/spock/replication_slot_create.go`, `internal/spock/disabled_subscription.go`, `internal/spock/sync_event.go`, `internal/spock/wait_for_sync_event.go`, `internal/spock/lag_tracker_commit_ts.go`, `internal/spock/replication_slot_advance.go`): Added ephemeral populate resources to create replication slots, register disabled subscriptions, emit sync events, wait for sync, read lag-tracker commit timestamps, and advance slots from CTS; most reset `Exists=false` on Refresh and implement Create logic.

**Subscription & Desired Graph** (`internal/spock/subscription.go`, `internal/spock/desired.go`): Subscriptions now carry `extraDeps`, use the naming helpers, and detect/enable disabled subscriptions (Refresh sets `NeedsUpdate`, Update enables); `ComputeDesired` emits populate resource chains and adjusts subscription sync/dependency wiring for new nodes.

**Tests** (`internal/spock/spock_test.go`, `test/integration/nodes_test.go`): Expanded unit tests for new resources and dependency semantics; added integration test `TestNodesAddNodeZeroDowntime` and helpers to validate replication convergence during live-writer node addition.

Poem

🐇 I nibble bits and chase the LSN trail,
I plant slots, emit syncs, and wag my tail,
New nodes hop in while writers keep pace,
A timeout watches the init-spock race,
Hoppity hop—replication finds its place 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.82%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: zero-downtime node addition via populate pipeline' clearly and concisely summarizes the main change—enabling zero-downtime node addition through a new populate pipeline, which is the primary objective of this PR. |
| Description check | ✅ Passed | The description comprehensively details what changed, explaining the populate pipeline design, new resource types, resource engine updates, configuration additions, documentation changes, and test coverage—all directly related to the changeset. |




codacy-production Bot commented Apr 15, 2026

Up to standards ✅

🟢 Issues: 3 new (3 medium)

| Category | Results |
| --- | --- |
| Complexity | 3 medium |

🟢 Metrics: 81 complexity · 32 duplication

| Metric | Results |
| --- | --- |
| Complexity | 81 |
| Duplication | 32 |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
.gitignore (1)

4-4: Narrow the ignore pattern to avoid hiding future source files.

init-spock (without a leading slash) can match any path component named init-spock. If the intent is only to ignore a repo-root build artifact, use a root-scoped pattern.

Suggested diff
-init-spock
+/init-spock
internal/spock/spock_test.go (1)

132-133: Prefer resource type constant over raw string in test setup.

On Line 132, use ResourceTypeWaitForSyncEvent instead of "spock.wait_for_sync_event" to keep tests resilient to type-name changes.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/spock/desired.go`:
- Around line 119-127: The peer sync bookmark (peerSyncEvt created by
NewSyncEvent) is only gated on slotCreate.Identifier(), allowing it to run
before the DisabledSubscription created by NewDisabledSubscription is
registered; update the dependency list so peerSyncEvt also waits for the
disabled subscription to be created by including disabledSub.Identifier()
(alongside slotCreate.Identifier()) when constructing NewSyncEvent so the
DisabledSubscription.Create() runs before the sync bookmark.

In `@internal/spock/spock_test.go`:
- Around line 437-451: The test currently uses silent type-assertion patterns
like "if sub, ok := subN1N2.(*Subscription); ok { ... }" which hide failures
when the resource is missing or wrong type; change each such assertion (e.g. for
variables subN1N2, subN2N1 and others at the referenced locations) to fail fast
by checking the ok boolean and calling t.Fatalf (or t.Fatalf-equivalent) when ok
is false, then proceed to access sub.sync and assert its value; ensure you
reference the same types/identifiers (resource.Identifier,
ResourceTypeSubscription, Subscription) so the test errors immediately if the
cast fails.

In `@internal/spock/sync_event.go`:
- Around line 42-47: The Dependencies() method builds the subscription
dependency ID manually which fails to normalize hyphens; instead instantiate or
call the Subscription.Identifier logic to produce the normalized ID and use that
for the ResourceTypeSubscription entry. Update SyncEvent.Dependencies() to
construct the subscription identifier via the Subscription.Identifier (or the
shared helper that performs the '-'→'_' normalization) using r.providerName and
r.subscriberName, and include that normalized ID in the deps slice before
appending r.extraDeps.

In `@internal/spock/wait_for_sync_event.go`:
- Around line 72-77: The subscription name used for status polling is built from
raw node names (subName) and can contain dashes which don't match the
underscore-based naming elsewhere; update the construction of subName in the
wait_for_sync_event logic to normalize providerName and subscriberName (e.g.,
replace '-' with '_' or otherwise map to the canonical subscription naming)
before formatting into sub_%s_%s so the QueryRow lookup for status uses the same
normalized subscription name.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a8fedd7b-e94c-4ead-9f30-738976cbb548

📥 Commits

Reviewing files that changed from the base of the PR and between 97fcb9a and 41812bd.

📒 Files selected for processing (26)
  • .gitignore
  • README.md
  • cmd/init-spock/main.go
  • docs/configuration.md
  • docs/index.md
  • docs/limitations.md
  • docs/usage/adding_nodes.md
  • internal/resource/executor.go
  • internal/resource/plan.go
  • internal/resource/resource.go
  • internal/resource/resource_test.go
  • internal/spock/desired.go
  • internal/spock/disabled_subscription.go
  • internal/spock/lag_tracker_commit_ts.go
  • internal/spock/node.go
  • internal/spock/replication_slot.go
  • internal/spock/replication_slot_advance.go
  • internal/spock/replication_slot_create.go
  • internal/spock/spock_test.go
  • internal/spock/subscription.go
  • internal/spock/sync_event.go
  • internal/spock/user.go
  • internal/spock/wait_for_sync_event.go
  • templates/init-spock-job.yaml
  • test/integration/nodes_test.go
  • values.yaml

Comment thread internal/spock/desired.go
Comment thread internal/spock/spock_test.go
Comment thread internal/spock/sync_event.go
Comment thread internal/spock/wait_for_sync_event.go Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/spock/spock_test.go (2)

129-133: Prefer resource type constants over string literals in tests.

Using ResourceTypeWaitForSyncEvent here avoids drift and catches renames at compile time.

🔧 Proposed change
-	extra := resource.Identifier{Type: "spock.wait_for_sync_event", ID: "n1_n2"}
+	extra := resource.Identifier{Type: ResourceTypeWaitForSyncEvent, ID: "n1_n2"}

242-250: Strengthen dependency tests to assert exact identifiers, not only counts.

Count-only checks can pass even when dependency wiring is wrong.

🔍 Example hardening
 func TestSyncEventDependencies(t *testing.T) {
 	extra := resource.Identifier{Type: ResourceTypeReplicationSlotCreate, ID: "spk_app_n2_sub_n2_n3"}
 	r := NewSyncEvent("n2", "n1", nil, extra)
 	deps := r.Dependencies()
-	// Should depend on node + subscription + extra
-	if len(deps) != 3 {
+	if len(deps) != 3 {
 		t.Fatalf("expected 3 deps, got %d: %v", len(deps), deps)
 	}
+	expected := map[resource.Identifier]bool{
+		{Type: ResourceTypeNode, ID: "n2"}:                 true,
+		{Type: ResourceTypeSubscription, ID: "sub_n2_n1"}:  true,
+		extra:                                               true,
+	}
+	for _, d := range deps {
+		delete(expected, d)
+	}
+	if len(expected) != 0 {
+		t.Fatalf("missing expected deps: %v (got %v)", expected, deps)
+	}
 }

Also applies to: 309-316

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/spock/wait_for_sync_event.go`:
- Around line 59-63: The Create method on WaitForSyncEvent dereferences r.conn
without checking for nil, which can panic; add a nil-check at the start of
WaitForSyncEvent.Create and return a descriptive error (e.g., "connection not
initialized for %s→%s") when r.conn == nil, and apply the same nil-guard before
any other uses of r.conn in the method (also cover the subsequent blocks that
use r.conn around the later sections referenced in the review) so all
dereferences safely return an error instead of panicking.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5e64e566-0882-45b0-a494-467643445e63

📥 Commits

Reviewing files that changed from the base of the PR and between 41812bd and 59cd68e.

📒 Files selected for processing (3)
  • internal/spock/spock_test.go
  • internal/spock/sync_event.go
  • internal/spock/wait_for_sync_event.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/spock/sync_event.go

Comment thread internal/spock/wait_for_sync_event.go
@mmols mmols requested a review from rshoemaker April 15, 2026 21:26
… generation

The slot name format (spk_{db}_{provider}_sub_{provider}_{subscriber})
and subscription name format (sub_{src}_{dst}) were duplicated across
7 files with inline strings.ReplaceAll+Sprintf calls. A missing dash
replacement in two of those copies was already caught and fixed in
59cd68e — this consolidates all call sites into spockSlotName() and
spockSubName() helpers to prevent recurrence.
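Given the formats named above, the consolidated helpers presumably look something like the following sketch (signatures and the `normalize` helper are assumptions; the real helpers live in internal/spock/names.go):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize maps node/db names to identifier-safe form; the missing dash
// replacement mentioned above is exactly this step.
func normalize(s string) string {
	return strings.ReplaceAll(s, "-", "_")
}

// spockSubName builds sub_{src}_{dst}.
func spockSubName(src, dst string) string {
	return fmt.Sprintf("sub_%s_%s", normalize(src), normalize(dst))
}

// spockSlotName builds spk_{db}_{provider}_sub_{provider}_{subscriber}.
func spockSlotName(db, provider, subscriber string) string {
	return fmt.Sprintf("spk_%s_%s_sub_%s_%s",
		normalize(db), normalize(provider), normalize(provider), normalize(subscriber))
}

func main() {
	fmt.Println(spockSlotName("app", "n2", "n3")) // prints spk_app_n2_sub_n2_n3
	fmt.Println(spockSubName("n-2", "n3"))        // prints sub_n_2_n3
}
```

Centralizing the normalization means a node name containing a dash produces the same slot and subscription names at every call site.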

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/spock/replication_slot_create.go (1)

54-77: Consider handling duplicate slot errors for robustness.

The check-then-create pattern has a small race window. If another process creates the slot concurrently, pg_create_logical_replication_slot will fail with a duplicate error. Handling this error (like DisabledSubscription does with pgCodeDuplicateObject) would make the resource more resilient.

♻️ Optional: handle duplicate error
+import (
+	"errors"
+	"github.com/jackc/pgx/v5/pgconn"
+)

 func (r *ReplicationSlotCreate) Create(ctx context.Context) error {
 	var exists bool
 	err := r.conn.QueryRow(ctx,
 		"SELECT EXISTS(SELECT 1 FROM pg_replication_slots WHERE slot_name = $1)",
 		r.slotName(),
 	).Scan(&exists)
 	if err != nil {
 		return fmt.Errorf("check slot %s: %w", r.slotName(), err)
 	}
 	if exists {
 		slog.Info("replication slot already exists", "slot", r.slotName())
 		return nil
 	}

 	_, err = r.conn.Exec(ctx,
 		"SELECT pg_create_logical_replication_slot($1, 'spock_output')",
 		r.slotName(),
 	)
 	if err != nil {
+		var pgErr *pgconn.PgError
+		if errors.As(err, &pgErr) && pgErr.Code == pgCodeDuplicateObject {
+			slog.Info("replication slot already exists (race)", "slot", r.slotName())
+			return nil
+		}
 		return fmt.Errorf("create replication slot %s: %w", r.slotName(), err)
 	}
 	slog.Info("created replication slot", "slot", r.slotName())
 	return nil
 }
internal/spock/subscription.go (1)

63-94: Consider consolidating the two queries in Refresh().

The subscription existence and enabled status could be fetched in a single query to reduce round trips.

♻️ Optional consolidation
 func (s *Subscription) Refresh(ctx context.Context) error {
-	var subExists bool
-	err := s.conn.QueryRow(ctx,
-		"SELECT EXISTS(SELECT 1 FROM spock.subscription WHERE sub_name = $1)",
+	var isEnabled *bool
+	err := s.conn.QueryRow(ctx,
+		"SELECT sub_enabled FROM spock.subscription WHERE sub_name = $1",
 		s.subName(),
-	).Scan(&subExists)
+	).Scan(&isEnabled)
 	if err != nil {
+		if err == pgx.ErrNoRows {
+			s.status = resource.Status{Exists: false}
+			return nil
+		}
 		return fmt.Errorf("inspect subscription %s: %w", s.subName(), err)
 	}
 
-	if !subExists {
-		s.status = resource.Status{Exists: false}
-		return nil
-	}
-
-	// A disabled subscription that should be enabled triggers an update.
-	var isEnabled bool
-	err = s.conn.QueryRow(ctx,
-		"SELECT sub_enabled FROM spock.subscription WHERE sub_name = $1",
-		s.subName(),
-	).Scan(&isEnabled)
-	if err != nil {
-		return fmt.Errorf("check subscription enabled %s: %w", s.subName(), err)
-	}
-	if !isEnabled {
+	if !*isEnabled {
 		s.status = resource.Status{Exists: true, NeedsUpdate: true, Reason: "subscription is disabled"}
 		return nil
 	}
 
 	s.status = resource.Status{Exists: true}
 	return nil
 }
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/spock/disabled_subscription.go`:
- Around line 93-99: Replace the hardcoded Postgres error code string "42710"
with the existing pgCodeDuplicateObject constant defined in
internal/spock/node.go to improve consistency; in the error handling inside the
function that creates the disabled subscription (the block using errors.As into
var pgErr *pgconn.PgError and s.subName()), change the comparison pgErr.Code ==
"42710" to use pgCodeDuplicateObject so the duplicate-object check uses the
shared constant.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f51ed7c8-b32c-4b8d-841a-1f4aecf75453

📥 Commits

Reviewing files that changed from the base of the PR and between 59cd68e and f4111f7.

📒 Files selected for processing (8)
  • internal/spock/disabled_subscription.go
  • internal/spock/names.go
  • internal/spock/replication_slot.go
  • internal/spock/replication_slot_advance.go
  • internal/spock/replication_slot_create.go
  • internal/spock/subscription.go
  • internal/spock/sync_event.go
  • internal/spock/wait_for_sync_event.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/spock/replication_slot.go
  • internal/spock/wait_for_sync_event.go

Comment thread internal/spock/disabled_subscription.go
Contributor

@rshoemaker rshoemaker left a comment


I refactored some naming funcs - it seemed easier than dropping comments everywhere.

Claude had one HIGH priority issue:

HIGH — Update-before-Create phase ordering breaks retry

  1. If populate fails mid-way, retry will enable the disabled subscription before the slot advance runs

The planner runs phases in order: deletes → updates → creates. This is fine for clean runs, but breaks an important retry case:

Scenario: First init-spock run for add-node succeeds in creating the disabled peer→new subscription but crashes before running ReplicationSlotAdvanceFromCTS. Database state:

  • spk_app_n2_sub_n2_n3 slot exists (not advanced)
  • sub_n2_n3 subscription exists, disabled

On retry:

  1. Subscription.Refresh sees sub_n2_n3 exists + sub_enabled=false → sets NeedsUpdate (subscription.go:94-97)
  2. Ephemeral populate resources (ReplicationSlotCreate, DisabledSubscription, SyncEvent, WaitForSyncEvent, LagTrackerCommitTimestamp, ReplicationSlotAdvanceFromCTS) all Refresh → Exists=false → will be Create
  3. Plan: Updates phase fires first → sub_enable(sub_n2_n3) runs
  4. Subscription becomes active, starts consuming from the un-advanced slot (from creation point)
  5. Then Creates phase runs; ReplicationSlotAdvanceFromCTS.Create checks active_pid IS NOT NULL, sees it's active, and skips the advance (replication_slot_advance.go:95-97)

Result: peer→new subscription replays all WAL from slot creation instead of advancing past commit_ts. Spock's origin-awareness may dedupe the commits since they'd also arrive via source→new, but this isn't guaranteed and at
minimum causes significant wasted work.

The Update's Dependencies() include ReplicationSlotAdvanceFromCTS (via Subscription.Dependencies), but topoSort only considers dependencies within the same event set. Updates and creates are separate sets, so the update runs
without waiting.

Fixes to consider:

  • Merge updates into the creates topo-sort (single DAG, resource knows its own action)
  • Or: Subscription.Update checks if the slot has been advanced, refuses to enable if not
  • Or: Subscription.Refresh doesn't set NeedsUpdate for subscriptions whose new-node populate chain hasn't completed (needs context passthrough)

The cleanest fix is probably #1 — a unified DAG across all action types. Updates aren't conceptually different from creates in terms of dependency ordering.
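Option #2 above could be sketched as a pure-logic pre-enable guard. This is not the chart's actual code; the type and field names are hypothetical, standing in for queries against `pg_replication_slots` (`active_pid`, confirmed flush position vs. the CTS-derived LSN):

```go
// Sketch of fix #2 from the review: before enabling a peer→new
// subscription, refuse if its replication slot has not been advanced
// past the already-synced commit timestamp. Names are illustrative.
package main

import "fmt"

// SlotStatus is a hypothetical summary of the peer slot's state.
type SlotStatus struct {
	Active       bool // active_pid IS NOT NULL (a walsender is consuming it)
	AdvancedPast bool // confirmed flush position is at/past the commit_ts LSN
}

// canEnable reports whether it is safe to enable the subscription,
// with a reason when it is not.
func canEnable(s SlotStatus) (bool, string) {
	switch {
	case s.Active:
		return false, "slot already consumed by an active walsender"
	case !s.AdvancedPast:
		return false, "slot not yet advanced past synced commit_ts"
	default:
		return true, ""
	}
}

func main() {
	// The retry scenario: slot exists but was never advanced.
	ok, reason := canEnable(SlotStatus{Active: false, AdvancedPast: false})
	fmt.Println(ok, reason)
}
```

With a guard like this, the Updates-phase `sub_enable` would fail fast instead of silently replaying WAL from the slot's creation point.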

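The unified-DAG fix (#1) amounts to topologically sorting the whole plan in one pass, with each item carrying its own action, instead of running deletes → updates → creates as separate sets. A minimal sketch using Kahn's algorithm; the `PlanItem` shape is illustrative, not the chart's actual planner API:

```go
// Sketch of fix #1 from the review: one dependency DAG across all
// action types, so an Update on a subscription cannot run before the
// Create of the slot-advance resource it depends on.
package main

import "fmt"

type Action int

const (
	ActionDelete Action = iota
	ActionUpdate
	ActionCreate
)

// PlanItem is one planned operation; it knows its own action and its
// dependencies by resource ID.
type PlanItem struct {
	ID        string
	Action    Action
	DependsOn []string
}

// topoSort orders all plan items together (Kahn's algorithm),
// regardless of whether they are deletes, updates, or creates.
func topoSort(items []PlanItem) ([]PlanItem, error) {
	byID := make(map[string]PlanItem, len(items))
	indegree := make(map[string]int, len(items))
	dependents := make(map[string][]string)
	for _, it := range items {
		byID[it.ID] = it
		indegree[it.ID] += 0 // ensure an entry exists
	}
	for _, it := range items {
		for _, dep := range it.DependsOn {
			if _, ok := byID[dep]; !ok {
				continue // dependency not part of this plan
			}
			indegree[it.ID]++
			dependents[dep] = append(dependents[dep], it.ID)
		}
	}
	var queue []string
	for id, d := range indegree {
		if d == 0 {
			queue = append(queue, id)
		}
	}
	var ordered []PlanItem
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		ordered = append(ordered, byID[id])
		for _, next := range dependents[id] {
			indegree[next]--
			if indegree[next] == 0 {
				queue = append(queue, next)
			}
		}
	}
	if len(ordered) != len(items) {
		return nil, fmt.Errorf("dependency cycle in plan")
	}
	return ordered, nil
}

func main() {
	// The retry scenario: enabling sub_n2_n3 (an Update) depends on
	// the slot advance (a Create) having run first.
	plan := []PlanItem{
		{ID: "sub_n2_n3/enable", Action: ActionUpdate, DependsOn: []string{"slot_advance_n2_n3"}},
		{ID: "slot_advance_n2_n3", Action: ActionCreate},
	}
	ordered, err := topoSort(plan)
	if err != nil {
		panic(err)
	}
	for _, it := range ordered {
		fmt.Println(it.ID)
	}
}
```

Because both items live in the same DAG, the slot advance is always ordered before the enable, which is exactly what the phase split breaks today.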

@mmols
Member Author

mmols commented Apr 16, 2026

> Scenario: First init-spock run for add-node succeeds in creating the disabled peer→new subscription but crashes before running ReplicationSlotAdvanceFromCTS. Database state:
>
> • `spk_app_n2_sub_n2_n3` slot exists (not advanced)
> • `sub_n2_n3` subscription exists, disabled
>
> On retry:

I think this feedback is based on the idea that you would re-run the job and expect it to automatically recover. docs/usage/adding_nodes.md has a snippet on this scenario which is still applicable with the zero-downtime change:

> **Recovering from a failed add**
>
> If the init-spock job fails while adding a node (e.g. due to a crash, timeout, or connectivity issue), the new node may be left in a partially configured state. To retry, first remove the node from the `nodes` list in `values.yaml` and run `helm upgrade` to clean up, then re-add the node and upgrade again.
>
> Do not retry the add without first removing the node, as the reconciler may encounter stale replication state from the failed attempt.

In practice, it's difficult to detect this partial failure reliably - the job could have failed at many different points, with data potentially partially moved to the new node (or even still moving). The requirement to remove and retry does put some burden on the user, but it's more likely to succeed reliably on retry than adding detection / cleanup code would be.
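For illustration, the remove-then-re-add flow looks roughly like this in `values.yaml` terms. The field names below are hypothetical, not necessarily the chart's actual schema; consult the chart's values reference for the real shape:

```yaml
# Step 1 (cleanup): remove the failed node (n3) from the nodes list,
# then run `helm upgrade` so the reconciler tears down its state.
nodes:
  - name: n1
  - name: n2

# Step 2 (retry): re-add n3 with its bootstrap block and run
# `helm upgrade` again:
# nodes:
#   - name: n1
#   - name: n2
#   - name: n3
#     bootstrap:        # hypothetical field names
#       sourceNode: n1
```

The key point is that the two upgrades are separate: the first must complete the teardown before the re-add attempts a fresh populate run.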


@coderabbitai (bot) left a comment


🧹 Nitpick comments (4)
docs/usage/adding_nodes.md (4)

56-56: LGTM: Zero-downtime claim accurately reflects the populate pipeline implementation.

The statement that writes can continue uninterrupted aligns with the PR's core objective of enabling zero-downtime node addition via the populate pipeline's ephemeral orchestration resources.

Optional: Consider adding guidance on verifying ongoing writes

For operators who want confidence during the sync, you could optionally add a sentence suggesting they monitor query activity or transaction rates on existing nodes to confirm write availability is maintained throughout the process.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/usage/adding_nodes.md` at line 56, Add an optional sentence to
docs/usage/adding_nodes.md (near the paragraph describing the init-spock job and
zero-downtime populate pipeline) advising operators how to verify ongoing writes
during the sync—for example, suggest monitoring query activity or transaction
rates on existing nodes (or checking replica lag/active transactions) so they
can confidently confirm writes remain available while the populate pipeline and
Spock subscriptions (sync_data/sync_structure) run.

60-60: Consider adding a configuration example for the timeout.

The timeout configuration path and default value are clearly documented, but showing a YAML snippet would improve usability.

Example configuration snippet

You could add something like:

You can configure the timeout via `pgEdge.initSpockJobConfig.timeout` (default: 7200 seconds / 2 hours):

    ```yaml
    pgEdge:
      initSpockJobConfig:
        timeout: 10800  # 3 hours for large databases
    ```

If the job fails or times out, see [Recovering from a failed add](#recovering-from-a-failed-add).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/usage/adding_nodes.md` at line 60, Add a small YAML example showing how
to override the documented timeout for clarity: include a snippet demonstrating
pgEdge.initSpockJobConfig.timeout (e.g., set to 10800 for 3 hours) placed
immediately after the sentence that mentions the default, and format it as a
fenced YAML block so readers can copy/paste; keep the existing link to
"Recovering from a failed add" after the snippet.

114-118: Consider adding diagnostic guidance for failed adds.

The recovery procedure (remove, clean up, re-add) is clear and aligns with the design choice to prefer manual cleanup over automatic retry. However, operators would benefit from guidance on:

  • How to detect that an add has failed (specific error messages, job status checks)
  • What logs to examine (kubectl logs on the init-spock job, PostgreSQL logs, etc.)
  • How to verify the cleanup was successful before retrying
Suggested additions

You could add a sentence like:

If the init-spock job fails while adding a node (e.g. due to a crash, timeout, or connectivity issue), check the job status with `kubectl get jobs` and examine logs with `kubectl logs job/init-spock`. The new node may be left in a partially configured state. To retry, first remove the node from the `nodes` list in `values.yaml` and run `helm upgrade` to clean up, then re-add the node and upgrade again.

Alternatively, consider linking to a troubleshooting guide with more detailed diagnostic steps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/usage/adding_nodes.md` around lines 114 - 118, Add a short diagnostics
paragraph after the "Recovering from a failed add" section instructing operators
how to detect and investigate failures: advise checking the init-spock job
status with kubectl get jobs and pod/job details, inspect logs with kubectl logs
job/init-spock and PostgreSQL container logs, look for common error patterns
(timeouts, connectivity, replication errors), and verify cleanup before retrying
by confirming the node is removed from values.yaml, that no init-spock pods/jobs
are running for that node, and that the cluster reports expected membership;
optionally mention linking to a more detailed troubleshooting guide for deeper
investigation.

62-64: Enhance the bootstrap removal warning with verification guidance.

The warning correctly identifies that leaving the bootstrap block will cause issues, but it would be more actionable if it explained:

  1. How to verify the add was successful before removing the block
  2. What specific problems "interfere with active replication" means (e.g., errors, conflicts, duplicate operations)
Suggested enhancement

Consider expanding the warning to something like:

!!! warning

    After verifying the new node is replicating successfully (check that `helm get values` shows the node and `kubectl logs` for the init-spock job shows completion without errors), remove the `bootstrap` block from the new node's configuration. 
    
    If left in place, subsequent `helm upgrade` runs will re-execute the populate pipeline, which can cause replication conflicts or errors as the pipeline attempts to re-create resources that already exist.

This gives operators a clear verification step and explains the specific failure mode.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/usage/adding_nodes.md` around lines 62 - 64, Update the bootstrap
removal warning to include verification steps and concrete failure modes:
instruct operators to verify the add succeeded by checking that the new node
appears in helm get values and that the init-spock job completed without errors
via kubectl logs, then remove the bootstrap block; also explain that leaving the
bootstrap block causes subsequent helm upgrade runs to re-run the populate
pipeline and can produce replication conflicts, duplicate resource-creation
errors, or job failures when the pipeline tries to recreate existing resources.


📥 Commits

Reviewing files that changed from the base of the PR and between f4111f7 and 598da86.

📒 Files selected for processing (3)
  • docs/limitations.md
  • docs/usage/adding_nodes.md
  • internal/spock/disabled_subscription.go
💤 Files with no reviewable changes (1)
  • docs/limitations.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/spock/disabled_subscription.go

@mmols mmols merged commit 7a11168 into main Apr 16, 2026
5 checks passed
@mmols mmols deleted the feat/PLAT-528/add-node-zero-downtime branch April 16, 2026 16:54
