feat: zero-downtime node addition via populate pipeline #52
Conversation
Enables adding nodes while writes continue on existing nodes by implementing a populate pattern in the single-pass resource engine. Five new ephemeral resource types orchestrate the populate sequence: ReplicationSlotCreate, SyncEvent, WaitForSyncEvent, LagTrackerCommitTimestamp, and ReplicationSlotAdvanceFromCTS. A DisabledSubscription resource pre-registers peer subscriptions so the lag tracker populates during sync. Adds Update() to the resource engine interface so the end-state reconciler can enable previously disabled subscriptions. Includes configurable job timeout, integration test with background writers on both nodes, and a reusable waitForReplication test helper using sync events.
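For orientation, here is a rough sketch of the resource contract this description refers to. It is assembled from details quoted in the reviews below (the `Create`/`Refresh`/`Dependencies` methods, the new `Update`, and the `Status` fields `Exists`/`NeedsUpdate`/`Reason`), so the actual signatures in `internal/resource/resource.go` may differ.

```go
package resource

import "context"

// Identifier names a resource by type and ID,
// e.g. {Type: "spock.subscription", ID: "sub_n1_n2"}.
type Identifier struct {
	Type string
	ID   string
}

// Status is what Refresh reports back to the planner.
type Status struct {
	Exists      bool
	NeedsUpdate bool   // lets the planner schedule an update instead of delete+recreate
	Reason      string // e.g. "subscription is disabled"
}

// Resource is the single-pass reconciliation contract. Update is the method
// this PR adds so the end-state Subscription can enable a subscription that
// the populate pipeline created in a disabled state.
type Resource interface {
	Identifier() Identifier
	Dependencies() []Identifier
	Refresh(ctx context.Context) error
	Create(ctx context.Context) error
	Update(ctx context.Context) error
}
```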
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.

No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds zero-downtime pgEdge Spock node-add support: new ephemeral populate and sync resources, update-capable resource lifecycle, optional init-spock execution timeout (Job + env + binary), documentation and values updates, and expanded unit and integration tests for live-writer node addition.
🚥 Pre-merge checks: ✅ 2 passed · ❌ 1 failed (1 warning)
Up to standards ✅

🟢 Issues

| Category | Results |
|---|---|
| Complexity | 3 medium |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 81 |
| Duplication | 32 |
Actionable comments posted: 4
🧹 Nitpick comments (2)
.gitignore (1)
4-4: Narrow the ignore pattern to avoid hiding future source files.
`init-spock` (without a leading slash) can match any path component named `init-spock`. If the intent is only to ignore a repo-root build artifact, use a root-scoped pattern.

Suggested diff:

    -init-spock
    +/init-spock

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore at line 4, The .gitignore entry "init-spock" is too broad and may ignore any path component named that; update the pattern to the repo-root scoped form by replacing "init-spock" with "/init-spock" in the .gitignore so only the root-level build artifact is ignored (refer to the "init-spock" line in .gitignore when making the change).

internal/spock/spock_test.go (1)
132-133: Prefer resource type constant over raw string in test setup.

On line 132, use `ResourceTypeWaitForSyncEvent` instead of `"spock.wait_for_sync_event"` to keep tests resilient to type-name changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/spock/spock_test.go` around lines 132 - 133, Replace the hard-coded resource type string in the test setup with the exported constant: change the Identifier construction that uses Type: "spock.wait_for_sync_event" to use ResourceTypeWaitForSyncEvent; update the extra variable (resource.Identifier) used in the call to NewSubscription so the test references ResourceTypeWaitForSyncEvent instead of the raw string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/spock/desired.go`:
- Around line 119-127: The peer sync bookmark (peerSyncEvt created by
NewSyncEvent) is only gated on slotCreate.Identifier(), allowing it to run
before the DisabledSubscription created by NewDisabledSubscription is
registered; update the dependency list so peerSyncEvt also waits for the
disabled subscription to be created by including disabledSub.Identifier()
(alongside slotCreate.Identifier()) when constructing NewSyncEvent so the
DisabledSubscription.Create() runs before the sync bookmark.
In `@internal/spock/spock_test.go`:
- Around line 437-451: The test currently uses silent type-assertion patterns
like "if sub, ok := subN1N2.(*Subscription); ok { ... }" which hide failures
when the resource is missing or wrong type; change each such assertion (e.g. for
variables subN1N2, subN2N1 and others at the referenced locations) to fail fast
by checking the ok boolean and calling t.Fatalf (or t.Fatalf-equivalent) when ok
is false, then proceed to access sub.sync and assert its value; ensure you
reference the same types/identifiers (resource.Identifier,
ResourceTypeSubscription, Subscription) so the test errors immediately if the
cast fails.
In `@internal/spock/sync_event.go`:
- Around line 42-47: The Dependencies() method builds the subscription
dependency ID manually which fails to normalize hyphens; instead instantiate or
call the Subscription.Identifier logic to produce the normalized ID and use that
for the ResourceTypeSubscription entry. Update SyncEvent.Dependencies() to
construct the subscription identifier via the Subscription.Identifier (or the
shared helper that performs the '-'→'_' normalization) using r.providerName and
r.subscriberName, and include that normalized ID in the deps slice before
appending r.extraDeps.
In `@internal/spock/wait_for_sync_event.go`:
- Around line 72-77: The subscription name used for status polling is built from
raw node names (subName) and can contain dashes which don't match the
underscore-based naming elsewhere; update the construction of subName in the
wait_for_sync_event logic to normalize providerName and subscriberName (e.g.,
replace '-' with '_' or otherwise map to the canonical subscription naming)
before formatting into sub_%s_%s so the QueryRow lookup for status uses the same
normalized subscription name.
---
Nitpick comments:
In @.gitignore:
- Line 4: The .gitignore entry "init-spock" is too broad and may ignore any path
component named that; update the pattern to the repo-root scoped form by
replacing "init-spock" with "/init-spock" in the .gitignore so only the
root-level build artifact is ignored (refer to the "init-spock" line in
.gitignore when making the change).
In `@internal/spock/spock_test.go`:
- Around line 132-133: Replace the hard-coded resource type string in the test
setup with the exported constant: change the Identifier construction that uses
Type: "spock.wait_for_sync_event" to use ResourceTypeWaitForSyncEvent; update
the extra variable (resource.Identifier) used in the call to NewSubscription so
the test references ResourceTypeWaitForSyncEvent instead of the raw string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a8fedd7b-e94c-4ead-9f30-738976cbb548
📒 Files selected for processing (26)
- .gitignore
- README.md
- cmd/init-spock/main.go
- docs/configuration.md
- docs/index.md
- docs/limitations.md
- docs/usage/adding_nodes.md
- internal/resource/executor.go
- internal/resource/plan.go
- internal/resource/resource.go
- internal/resource/resource_test.go
- internal/spock/desired.go
- internal/spock/disabled_subscription.go
- internal/spock/lag_tracker_commit_ts.go
- internal/spock/node.go
- internal/spock/replication_slot.go
- internal/spock/replication_slot_advance.go
- internal/spock/replication_slot_create.go
- internal/spock/spock_test.go
- internal/spock/subscription.go
- internal/spock/sync_event.go
- internal/spock/user.go
- internal/spock/wait_for_sync_event.go
- templates/init-spock-job.yaml
- test/integration/nodes_test.go
- values.yaml
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/spock/spock_test.go (2)
129-133: Prefer resource type constants over string literals in tests.

Using `ResourceTypeWaitForSyncEvent` here avoids drift and catches renames at compile time.

🔧 Proposed change

    - extra := resource.Identifier{Type: "spock.wait_for_sync_event", ID: "n1_n2"}
    + extra := resource.Identifier{Type: ResourceTypeWaitForSyncEvent, ID: "n1_n2"}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/spock/spock_test.go` around lines 129 - 133, Replace the hard-coded resource type string in TestSubscriptionExtraDeps with the exported constant ResourceTypeWaitForSyncEvent to avoid drift; update the creation of the extra resource (resource.Identifier{Type: ..., ID: ...}) used in NewSubscription so Type uses ResourceTypeWaitForSyncEvent instead of "spock.wait_for_sync_event" (refer to TestSubscriptionExtraDeps, resource.Identifier and NewSubscription).
242-250: Strengthen dependency tests to assert exact identifiers, not only counts.

Count-only checks can pass even when dependency wiring is wrong.
🔍 Example hardening
    func TestSyncEventDependencies(t *testing.T) {
        extra := resource.Identifier{Type: ResourceTypeReplicationSlotCreate, ID: "spk_app_n2_sub_n2_n3"}
        r := NewSyncEvent("n2", "n1", nil, extra)
        deps := r.Dependencies()
    -   // Should depend on node + subscription + extra
    -   if len(deps) != 3 {
    +   if len(deps) != 3 {
            t.Fatalf("expected 3 deps, got %d: %v", len(deps), deps)
        }
    +   expected := map[resource.Identifier]bool{
    +       {Type: ResourceTypeNode, ID: "n2"}:                true,
    +       {Type: ResourceTypeSubscription, ID: "sub_n2_n1"}: true,
    +       extra: true,
    +   }
    +   for _, d := range deps {
    +       delete(expected, d)
    +   }
    +   if len(expected) != 0 {
    +       t.Fatalf("missing expected deps: %v (got %v)", expected, deps)
    +   }
    }

Also applies to: 309-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/spock/spock_test.go` around lines 242 - 250, TestSyncEventDependencies only asserts the dependency count; update it to assert the exact resource.Identifier values returned by r := NewSyncEvent("n2", "n1", nil, extra) via r.Dependencies() instead of just len(deps). Replace the count-only check with explicit expectations for the node identifier, the subscription identifier, and the extra identifier (compare resource.Identifier fields or use reflect.DeepEqual on a canonical/sorted slice or a set membership check) so the test fails if any identifier is missing or incorrect; apply the same stronger assertions pattern to the other similar test that validates SyncEvent dependencies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/spock/wait_for_sync_event.go`:
- Around line 59-63: The Create method on WaitForSyncEvent dereferences r.conn
without checking for nil, which can panic; add a nil-check at the start of
WaitForSyncEvent.Create and return a descriptive error (e.g., "connection not
initialized for %s→%s") when r.conn == nil, and apply the same nil-guard before
any other uses of r.conn in the method (also cover the subsequent blocks that
use r.conn around the later sections referenced in the review) so all
dereferences safely return an error instead of panicking.
---
Nitpick comments:
In `@internal/spock/spock_test.go`:
- Around line 129-133: Replace the hard-coded resource type string in
TestSubscriptionExtraDeps with the exported constant
ResourceTypeWaitForSyncEvent to avoid drift; update the creation of the extra
resource (resource.Identifier{Type: ..., ID: ...}) used in NewSubscription so
Type uses ResourceTypeWaitForSyncEvent instead of "spock.wait_for_sync_event"
(refer to TestSubscriptionExtraDeps, resource.Identifier and NewSubscription).
- Around line 242-250: TestSyncEventDependencies only asserts the dependency
count; update it to assert the exact resource.Identifier values returned by r :=
NewSyncEvent("n2", "n1", nil, extra) via r.Dependencies() instead of just
len(deps). Replace the count-only check with explicit expectations for the node
identifier, the subscription identifier, and the extra identifier (compare
resource.Identifier fields or use reflect.DeepEqual on a canonical/sorted slice
or a set membership check) so the test fails if any identifier is missing or
incorrect; apply the same stronger assertions pattern to the other similar test
that validates SyncEvent dependencies.
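A minimal sketch of the nil-connection guard requested for `wait_for_sync_event.go` above; the struct and field names mirror those mentioned in the comment but are reproduced here only for illustration, and the real `Create` does more than this.

```go
package spock

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
)

// waitForSyncEventSketch mirrors the fields the review comment refers to.
type waitForSyncEventSketch struct {
	conn           *pgx.Conn
	providerName   string
	subscriberName string
}

// Create returns a descriptive error instead of panicking when the connection
// was never initialized; the same guard applies before any later use of r.conn.
func (r *waitForSyncEventSketch) Create(ctx context.Context) error {
	if r.conn == nil {
		return fmt.Errorf("connection not initialized for %s→%s", r.providerName, r.subscriberName)
	}
	// ... the real resource polls the subscription status via r.conn here ...
	return nil
}
```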
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5e64e566-0882-45b0-a494-467643445e63
📒 Files selected for processing (3)
- internal/spock/spock_test.go
- internal/spock/sync_event.go
- internal/spock/wait_for_sync_event.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/spock/sync_event.go
… generation
The slot name format (spk_{db}_{provider}_sub_{provider}_{subscriber})
and subscription name format (sub_{src}_{dst}) were duplicated across
7 files with inline strings.ReplaceAll+Sprintf calls. A missing dash
replacement in two of those copies was already caught and fixed in
59cd68e — this consolidates all call sites into spockSlotName() and
spockSubName() helpers to prevent recurrence.
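A sketch of what the consolidated helpers could look like, based only on the formats named in this commit message; the real signatures (for example, whether the database name is stored on the resource or passed in) may differ.

```go
package spock

import (
	"fmt"
	"strings"
)

// normalized mirrors the '-' → '_' replacement that was previously inlined
// at each call site, so hyphenated node names produce valid Spock object names.
func normalized(name string) string {
	return strings.ReplaceAll(name, "-", "_")
}

// spockSubName builds the subscription name: sub_{src}_{dst}.
func spockSubName(src, dst string) string {
	return fmt.Sprintf("sub_%s_%s", normalized(src), normalized(dst))
}

// spockSlotName builds the replication slot name:
// spk_{db}_{provider}_sub_{provider}_{subscriber},
// e.g. spockSlotName("app", "n2", "n3") == "spk_app_n2_sub_n2_n3".
func spockSlotName(db, provider, subscriber string) string {
	return fmt.Sprintf("spk_%s_%s_%s",
		normalized(db), normalized(provider), spockSubName(provider, subscriber))
}
```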
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/spock/replication_slot_create.go (1)
54-77: Consider handling duplicate slot errors for robustness.

The check-then-create pattern has a small race window. If another process creates the slot concurrently, `pg_create_logical_replication_slot` will fail with a duplicate error. Handling this error (like `DisabledSubscription` does with `pgCodeDuplicateObject`) would make the resource more resilient.

♻️ Optional: handle duplicate error

    +import (
    +    "errors"
    +    "github.com/jackc/pgx/v5/pgconn"
    +)

    func (r *ReplicationSlotCreate) Create(ctx context.Context) error {
        var exists bool
        err := r.conn.QueryRow(ctx,
            "SELECT EXISTS(SELECT 1 FROM pg_replication_slots WHERE slot_name = $1)",
            r.slotName(),
        ).Scan(&exists)
        if err != nil {
            return fmt.Errorf("check slot %s: %w", r.slotName(), err)
        }
        if exists {
            slog.Info("replication slot already exists", "slot", r.slotName())
            return nil
        }
        _, err = r.conn.Exec(ctx,
            "SELECT pg_create_logical_replication_slot($1, 'spock_output')",
            r.slotName(),
        )
        if err != nil {
    +       var pgErr *pgconn.PgError
    +       if errors.As(err, &pgErr) && pgErr.Code == pgCodeDuplicateObject {
    +           slog.Info("replication slot already exists (race)", "slot", r.slotName())
    +           return nil
    +       }
            return fmt.Errorf("create replication slot %s: %w", r.slotName(), err)
        }
        slog.Info("created replication slot", "slot", r.slotName())
        return nil
    }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/spock/replication_slot_create.go` around lines 54 - 77, The Create method in ReplicationSlotCreate uses a check-then-create pattern which can race; modify the Exec error handling for the pg_create_logical_replication_slot call so that if the returned error is a duplicate-object error (compare against the same pgCodeDuplicateObject logic used by DisabledSubscription) you treat it as non-fatal: log/info that the slot already exists and return nil; for any other error continue to wrap and return as currently done. Target the Create function and the Exec call invoking "pg_create_logical_replication_slot" and reuse the duplicate error classification from DisabledSubscription.

internal/spock/subscription.go (1)
63-94: Consider consolidating the two queries in `Refresh()`.

The subscription existence and enabled status could be fetched in a single query to reduce round trips.

♻️ Optional consolidation

    func (s *Subscription) Refresh(ctx context.Context) error {
    -   var subExists bool
    -   err := s.conn.QueryRow(ctx,
    -       "SELECT EXISTS(SELECT 1 FROM spock.subscription WHERE sub_name = $1)",
    +   var isEnabled *bool
    +   err := s.conn.QueryRow(ctx,
    +       "SELECT sub_enabled FROM spock.subscription WHERE sub_name = $1",
            s.subName(),
    -   ).Scan(&subExists)
    +   ).Scan(&isEnabled)
        if err != nil {
    +       if err == pgx.ErrNoRows {
    +           s.status = resource.Status{Exists: false}
    +           return nil
    +       }
            return fmt.Errorf("inspect subscription %s: %w", s.subName(), err)
        }
    -   if !subExists {
    -       s.status = resource.Status{Exists: false}
    -       return nil
    -   }
    -
    -   // A disabled subscription that should be enabled triggers an update.
    -   var isEnabled bool
    -   err = s.conn.QueryRow(ctx,
    -       "SELECT sub_enabled FROM spock.subscription WHERE sub_name = $1",
    -       s.subName(),
    -   ).Scan(&isEnabled)
    -   if err != nil {
    -       return fmt.Errorf("check subscription enabled %s: %w", s.subName(), err)
    -   }
    -   if !isEnabled {
    +   if !*isEnabled {
            s.status = resource.Status{Exists: true, NeedsUpdate: true, Reason: "subscription is disabled"}
            return nil
        }
        s.status = resource.Status{Exists: true}
        return nil
    }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/spock/subscription.go` around lines 63 - 94, Consolidate the two DB round-trips in Refresh by querying the enabled state once: replace the two QueryRow calls with a single QueryRow that selects sub_enabled FROM spock.subscription WHERE sub_name = $1, Scan the result into a nullable boolean (e.g., sql.NullBool or *bool) and handle the "no rows" case to set s.status = resource.Status{Exists:false}; if a row exists use the scanned boolean to set s.status = resource.Status{Exists:true, NeedsUpdate:true, Reason:"subscription is disabled"} when false, otherwise set Exists:true; keep using s.subName(), s.conn.QueryRow, and s.status/resource.Status names when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/spock/disabled_subscription.go`:
- Around line 93-99: Replace the hardcoded Postgres error code string "42710"
with the existing pgCodeDuplicateObject constant defined in
internal/spock/node.go to improve consistency; in the error handling inside the
function that creates the disabled subscription (the block using errors.As into
var pgErr *pgconn.PgError and s.subName()), change the comparison pgErr.Code ==
"42710" to use pgCodeDuplicateObject so the duplicate-object check uses the
shared constant.
---
Nitpick comments:
In `@internal/spock/replication_slot_create.go`:
- Around line 54-77: The Create method in ReplicationSlotCreate uses a
check-then-create pattern which can race; modify the Exec error handling for the
pg_create_logical_replication_slot call so that if the returned error is a
duplicate-object error (compare against the same pgCodeDuplicateObject logic
used by DisabledSubscription) you treat it as non-fatal: log/info that the slot
already exists and return nil; for any other error continue to wrap and return
as currently done. Target the Create function and the Exec call invoking
"pg_create_logical_replication_slot" and reuse the duplicate error
classification from DisabledSubscription.
In `@internal/spock/subscription.go`:
- Around line 63-94: Consolidate the two DB round-trips in Refresh by querying
the enabled state once: replace the two QueryRow calls with a single QueryRow
that selects sub_enabled FROM spock.subscription WHERE sub_name = $1, Scan the
result into a nullable boolean (e.g., sql.NullBool or *bool) and handle the "no
rows" case to set s.status = resource.Status{Exists:false}; if a row exists use
the scanned boolean to set s.status = resource.Status{Exists:true,
NeedsUpdate:true, Reason:"subscription is disabled"} when false, otherwise set
Exists:true; keep using s.subName(), s.conn.QueryRow, and
s.status/resource.Status names when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f51ed7c8-b32c-4b8d-841a-1f4aecf75453
📒 Files selected for processing (8)
- internal/spock/disabled_subscription.go
- internal/spock/names.go
- internal/spock/replication_slot.go
- internal/spock/replication_slot_advance.go
- internal/spock/replication_slot_create.go
- internal/spock/subscription.go
- internal/spock/sync_event.go
- internal/spock/wait_for_sync_event.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/spock/replication_slot.go
- internal/spock/wait_for_sync_event.go
rshoemaker
left a comment
I refactored some naming funcs - it seemed easier than dropping comments everywhere.
Claude had one HIGH priority issue:
`
HIGH — Update-before-Create phase ordering breaks retry
- If populate fails mid-way, retry will enable the disabled subscription before the slot advance runs
The planner runs phases in order: deletes → updates → creates. This is fine for clean runs, but breaks an important retry case:
Scenario: First init-spock run for add-node succeeds in creating the disabled peer→new subscription but crashes before running ReplicationSlotAdvanceFromCTS. Database state:
- spk_app_n2_sub_n2_n3 slot exists (not advanced)
- sub_n2_n3 subscription exists, disabled
On retry:
- Subscription.Refresh sees sub_n2_n3 exists + sub_enabled=false → sets NeedsUpdate (subscription.go:94-97)
- Ephemeral populate resources (ReplicationSlotCreate, DisabledSubscription, SyncEvent, WaitForSyncEvent, LagTrackerCommitTimestamp, ReplicationSlotAdvanceFromCTS) all Refresh → Exists=false → will be Create
- Plan: Updates phase fires first → sub_enable(sub_n2_n3) runs
- Subscription becomes active, starts consuming from the un-advanced slot (from creation point)
- Then Creates phase runs; ReplicationSlotAdvanceFromCTS.Create checks active_pid IS NOT NULL, sees it's active, and skips the advance (replication_slot_advance.go:95-97)
Result: peer→new subscription replays all WAL from slot creation instead of advancing past commit_ts. Spock's origin-awareness may dedupe the commits since they'd also arrive via source→new, but this isn't guaranteed and at
minimum causes significant wasted work.
The Update's Dependencies() include ReplicationSlotAdvanceFromCTS (via Subscription.Dependencies), but topoSort only considers dependencies within the same event set. Updates and creates are separate sets, so the update runs
without waiting.
Fixes to consider:
- Merge updates into the creates topo-sort (single DAG, resource knows its own action)
- Or: Subscription.Update checks if the slot has been advanced, refuses to enable if not
- Or: Subscription.Refresh doesn't set NeedsUpdate for subscriptions whose new-node populate chain hasn't completed (needs context passthrough)
The cleanest fix is probably #1 — a unified DAG across all action types. Updates aren't conceptually different from creates in terms of dependency ordering.
`
I think this feedback is based on the idea that you would re-run the job and expect it to automatically recover.

In practice, it's difficult to detect this partial failure very well - the job could have failed at many different points, with data potentially partially moved to the new node (or even still moving). The requirement to remove and retry does put some burden on the user, but it's more likely to reliably succeed on retry compared to adding detection / cleanup code.
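To make the first suggested fix concrete, here is a self-contained sketch of ordering updates and creates in one dependency walk, so the retry scenario above would run the slot advance before enabling the subscription. The planner types and names below are illustrative, not the chart's actual `internal/resource` code.

```go
package main

import "fmt"

type action int

const (
	actionCreate action = iota
	actionUpdate
)

// step is a pending change with its dependencies, regardless of action type.
type step struct {
	id     string
	action action
	deps   []string
}

// order walks all pending steps (creates and updates together) in dependency
// order, so an update can never run before a create it depends on.
func order(steps []step) ([]step, error) {
	byID := make(map[string]step, len(steps))
	for _, s := range steps {
		byID[s.id] = s
	}
	state := map[string]int{} // 0 unvisited, 1 visiting, 2 done
	var out []step
	var visit func(id string) error
	visit = func(id string) error {
		switch state[id] {
		case 2:
			return nil
		case 1:
			return fmt.Errorf("dependency cycle at %s", id)
		}
		state[id] = 1
		if s, ok := byID[id]; ok { // deps outside the pending set are already satisfied
			for _, d := range s.deps {
				if err := visit(d); err != nil {
					return err
				}
			}
			out = append(out, s)
		}
		state[id] = 2
		return nil
	}
	for _, s := range steps {
		if err := visit(s.id); err != nil {
			return nil, err
		}
	}
	return out, nil
}

func main() {
	// The retry case from the review: enabling sub_n2_n3 (an update) depends
	// on the slot advance (a create); a unified ordering runs the create first.
	steps := []step{
		{id: "enable sub_n2_n3", action: actionUpdate, deps: []string{"advance slot spk_app_n2_sub_n2_n3"}},
		{id: "advance slot spk_app_n2_sub_n2_n3", action: actionCreate},
	}
	ordered, err := order(steps)
	if err != nil {
		panic(err)
	}
	for _, s := range ordered {
		fmt.Println(s.id)
	}
}
```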
🧹 Nitpick comments (4)
docs/usage/adding_nodes.md (4)
56-56: LGTM: Zero-downtime claim accurately reflects the populate pipeline implementation.

The statement that writes can continue uninterrupted aligns with the PR's core objective of enabling zero-downtime node addition via the populate pipeline's ephemeral orchestration resources.
Optional: Consider adding guidance on verifying ongoing writes
For operators who want confidence during the sync, you could optionally add a sentence suggesting they monitor query activity or transaction rates on existing nodes to confirm write availability is maintained throughout the process.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/usage/adding_nodes.md` at line 56, Add an optional sentence to docs/usage/adding_nodes.md (near the paragraph describing the init-spock job and zero-downtime populate pipeline) advising operators how to verify ongoing writes during the sync—for example, suggest monitoring query activity or transaction rates on existing nodes (or checking replica lag/active transactions) so they can confidently confirm writes remain available while the populate pipeline and Spock subscriptions (sync_data/sync_structure) run.
60-60: Consider adding a configuration example for the timeout.

The timeout configuration path and default value are clearly documented, but showing a YAML snippet would improve usability.
Example configuration snippet
You could add something like:
You can configure the timeout via `pgEdge.initSpockJobConfig.timeout` (default: 7200 seconds / 2 hours):

```yaml
pgEdge:
  initSpockJobConfig:
    timeout: 10800 # 3 hours for large databases
```

If the job fails or times out, see [Recovering from a failed add](#recovering-from-a-failed-add).
Verify each finding against the current code and only fix it if needed. In `@docs/usage/adding_nodes.md` at line 60, Add a small YAML example showing how to override the documented timeout for clarity: include a snippet demonstrating pgEdge.initSpockJobConfig.timeout (e.g., set to 10800 for 3 hours) placed immediately after the sentence that mentions the default, and format it as a fenced YAML block so readers can copy/paste; keep the existing link to "Recovering from a failed add" after the snippet.
114-118: Consider adding diagnostic guidance for failed adds.

The recovery procedure (remove, clean up, re-add) is clear and aligns with the design choice to prefer manual cleanup over automatic retry. However, operators would benefit from guidance on:
- How to detect that an add has failed (specific error messages, job status checks)
- What logs to examine (`kubectl logs` on the init-spock job, PostgreSQL logs, etc.)
- How to verify the cleanup was successful before retrying
Suggested additions
You could add a sentence like:
If the init-spock job fails while adding a node (e.g. due to a crash, timeout, or connectivity issue), check the job status with `kubectl get jobs` and examine logs with `kubectl logs job/init-spock`. The new node may be left in a partially configured state. To retry, first remove the node from the `nodes` list in `values.yaml` and run `helm upgrade` to clean up, then re-add the node and upgrade again.

Alternatively, consider linking to a troubleshooting guide with more detailed diagnostic steps.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/usage/adding_nodes.md` around lines 114 - 118, Add a short diagnostics paragraph after the "Recovering from a failed add" section instructing operators how to detect and investigate failures: advise checking the init-spock job status with kubectl get jobs and pod/job details, inspect logs with kubectl logs job/init-spock and PostgreSQL container logs, look for common error patterns (timeouts, connectivity, replication errors), and verify cleanup before retrying by confirming the node is removed from values.yaml, that no init-spock pods/jobs are running for that node, and that the cluster reports expected membership; optionally mention linking to a more detailed troubleshooting guide for deeper investigation.
62-64: Enhance the bootstrap removal warning with verification guidance.

The warning correctly identifies that leaving the bootstrap block will cause issues, but it would be more actionable if it explained:
- How to verify the add was successful before removing the block
- What specific problems "interfere with active replication" means (e.g., errors, conflicts, duplicate operations)
Suggested enhancement
Consider expanding the warning to something like:
!!! warning
    After verifying the new node is replicating successfully (check that `helm get values` shows the node and `kubectl logs` for the init-spock job shows completion without errors), remove the `bootstrap` block from the new node's configuration. If left in place, subsequent `helm upgrade` runs will re-execute the populate pipeline, which can cause replication conflicts or errors as the pipeline attempts to re-create resources that already exist.

This gives operators a clear verification step and explains the specific failure mode.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/usage/adding_nodes.md` around lines 62 - 64, Update the bootstrap removal warning to include verification steps and concrete failure modes: instruct operators to verify the add succeeded by checking that the new node appears in helm get values and that the init-spock job completed without errors via kubectl logs, then remove the bootstrap block; also explain that leaving the bootstrap block causes subsequent helm upgrade runs to re-run the populate pipeline and can produce replication conflicts, duplicate resource-creation errors, or job failures when the pipeline tries to recreate existing resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/usage/adding_nodes.md`:
- Line 56: Add an optional sentence to docs/usage/adding_nodes.md (near the
paragraph describing the init-spock job and zero-downtime populate pipeline)
advising operators how to verify ongoing writes during the sync—for example,
suggest monitoring query activity or transaction rates on existing nodes (or
checking replica lag/active transactions) so they can confidently confirm writes
remain available while the populate pipeline and Spock subscriptions
(sync_data/sync_structure) run.
- Line 60: Add a small YAML example showing how to override the documented
timeout for clarity: include a snippet demonstrating
pgEdge.initSpockJobConfig.timeout (e.g., set to 10800 for 3 hours) placed
immediately after the sentence that mentions the default, and format it as a
fenced YAML block so readers can copy/paste; keep the existing link to
"Recovering from a failed add" after the snippet.
- Around line 114-118: Add a short diagnostics paragraph after the "Recovering
from a failed add" section instructing operators how to detect and investigate
failures: advise checking the init-spock job status with kubectl get jobs and
pod/job details, inspect logs with kubectl logs job/init-spock and PostgreSQL
container logs, look for common error patterns (timeouts, connectivity,
replication errors), and verify cleanup before retrying by confirming the node
is removed from values.yaml, that no init-spock pods/jobs are running for that
node, and that the cluster reports expected membership; optionally mention
linking to a more detailed troubleshooting guide for deeper investigation.
- Around line 62-64: Update the bootstrap removal warning to include
verification steps and concrete failure modes: instruct operators to verify the
add succeeded by checking that the new node appears in helm get values and that
the init-spock job completed without errors via kubectl logs, then remove the
bootstrap block; also explain that leaving the bootstrap block causes subsequent
helm upgrade runs to re-run the populate pipeline and can produce replication
conflicts, duplicate resource-creation errors, or job failures when the pipeline
tries to recreate existing resources.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 24c0ee33-bf7e-4a85-9364-c59068e0096a
📒 Files selected for processing (3)
- docs/limitations.md
- docs/usage/adding_nodes.md
- internal/spock/disabled_subscription.go
💤 Files with no reviewable changes (1)
- docs/limitations.md
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/spock/disabled_subscription.go
Summary
Enables zero-downtime node addition when using spock to bootstrap new nodes. Writes can continue uninterrupted on existing nodes while a new node syncs via Spock logical replication. This replaces the previous behavior, which required writes to be stopped during `helm upgrade`.

The implementation is modeled after the control-plane's populate pattern, adapted for the Helm chart's single-pass declarative reconciliation approach.
What changed
Populate pipeline (`internal/spock/`)

For a 3-node cluster where n1 is the source, n2 is a peer, and n3 is the new node, the populate pipeline is orchestrated by six new ephemeral resource types:

- ReplicationSlotCreate
- DisabledSubscription
- SyncEvent (`spock.sync_event()`)
- WaitForSyncEvent
- LagTrackerCommitTimestamp (`spock.lag_tracker`)
- ReplicationSlotAdvanceFromCTS

`ComputeDesired` in `desired.go` detects new nodes (`bootstrap.mode=spock`) and emits the populate chain with correct dependency ordering. When no new nodes exist, output is identical to before.

Resource engine (`internal/resource/`)

Added `Update()` to the `Resource` interface and `ActionUpdate`/`NeedsUpdate` to the planner. This allows the end-state `Subscription` to detect a disabled subscription (created by populate) and enable it without delete+recreate.
Configuration

New `pgEdge.initSpockJobConfig.timeout` value (default: 2 hours). Applied as `activeDeadlineSeconds` on the Kubernetes Job and as a Go context timeout with a 30-second grace period.
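As a rough illustration of the grace-period idea (not the actual `cmd/init-spock/main.go` code), the binary's context deadline can be set slightly shorter than the Job's `activeDeadlineSeconds`, so it can log a clean error before Kubernetes kills the pod. The environment variable name below is a placeholder.

```go
package main

import (
	"context"
	"log"
	"os"
	"strconv"
	"time"
)

func main() {
	// Placeholder variable name; the chart wires the real value from
	// pgEdge.initSpockJobConfig.timeout into the Job spec and the binary.
	secs, err := strconv.Atoi(os.Getenv("INIT_SPOCK_TIMEOUT_SECONDS"))
	if err != nil || secs <= 0 {
		secs = 7200 // default: 2 hours
	}

	// Leave a 30-second grace period before the Job's activeDeadlineSeconds.
	timeout := time.Duration(secs)*time.Second - 30*time.Second
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := run(ctx); err != nil {
		log.Fatalf("init-spock failed: %v", err)
	}
}

// run stands in for the reconciliation entry point.
func run(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(time.Second):
		return nil
	}
}
```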
Docs

Test plan

- `TestNodesAddNodeZeroDowntime`: adds n3 while background writers insert on n1 and n2, verifies data convergence, bidirectional replication, and full mesh
- `TestNodesRemoveNode`, `TestDistributedInstall`, and recovery tests all pass (`make test-integration-kind`)