K8SPG-1012 stanza timing fix#1575
Conversation
Add three test suites covering the supported migration paths from Crunchy Data PGO v5 to the Percona PostgreSQL Operator: - migration-from-crunchy-backup-restore: seeds a Percona cluster from a Crunchy pgBackRest backup via spec.dataSource, then validates a full restore - migration-from-crunchy-standby: promotes a Percona standby cluster that streams from a live Crunchy primary, verifying cutover, WAL replay stabilisation, and post-promotion backup/restore - migration-from-crunchy-pv: reuses the Crunchy primary's data PV directly Also exports CPGO_VERSION in vars.sh so all three tests can pin the Crunchy operator version consistently.
During a dataSource bootstrap restore, postgres promotes from TL1 to TL2 and immediately passes 00000002.history to archive_command. pgBackRest's async archiver silently drops the push (error 103) when archive.info does not yet exist — and postgres never retries. Without 00000002.history in the archive, pg_rewind on replicas fails with "could not find common ancestor" after any subsequent PITR restore.
Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.0.0-20221128193559-754e69321358 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/commits/v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
| // Re-push any timeline history files stranded by the async-archiver race: | ||
| // postgres archives 00000002.history during bootstrap promotion before the | ||
| // stanza exists; pgBackRest drops it silently (error 103) and postgres | ||
| // never retries. Without it pg_rewind fails on replicas after PITR. | ||
| log := logging.FromContext(ctx) | ||
| historyOut, historyErr := pgbackrest.Executor(exec).ArchivePushHistoryFiles(ctx) | ||
| if historyErr != nil { | ||
| r.Recorder.Event(postgresCluster, corev1.EventTypeWarning, | ||
| "ArchivePushHistoryFilesFailed", historyErr.Error()) | ||
| log.Error(historyErr, "timeline history file recovery failed", | ||
| "pod", writableInstanceName, "output", historyOut) | ||
| } else if historyOut != "" { | ||
| log.Info("timeline history file recovery", "output", historyOut) | ||
| } |
There was a problem hiding this comment.
i understand we need to do this after stanza is created but i wonder if we can do this in the caller of this function reconcilePGBackRest after line 1642 and if configHashMismatch is false
There was a problem hiding this comment.
@egegunes Moving it to the caller isn't straightforward. reconcileStanzaCreate returns (false, nil) in three distinct cases:
Stanza just created successfully ← the only case where we want ArchivePushHistoryFiles
Stanza already created (stanzasCreated == true) ← should not run it again
Cluster not yet writable (!clusterWritable) ← exec would target the wrong pod / fail
So checking !configHashMismatch && err == nil at the caller doesn't let us distinguish case 1 from cases 2 and 3. We'd be exec-ing into a pod on every reconcile, including when the cluster isn't writable yet.
To make it work at the caller level we'd need a third return value (e.g. stanzaJustCreated bool) and also either return writableInstanceName or re-discover the writable instance in the caller, since the exec closure is built around it.
Given that reconcileStanzaCreate already has the writable instance name in scope and the call sits naturally right after successful stanza creation but up to you.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses a pgBackRest/PostgreSQL bootstrap timing race where timeline history files (e.g. 00000002.history) can be dropped by the async archiver before archive.info exists, causing later pg_rewind failures on replicas after subsequent PITR restores.
Changes:
- Add a synchronous recovery step to (re)push any
*.historyfiles frompg_walimmediately afterstanza-create. - Extend E2E coverage for Crunchy→Percona migration with an additional backup + second PITR restore and data verification.
- Update PR E2E suite list and adjust one upgrade-minor cleanup step.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| percona/controller/pgcluster/owner_ref_migration.go | Whitespace-only adjustment near RBAC comment. |
| percona/certmanager/certmanager.go | Whitespace-only adjustment in certificate apply logic. |
| internal/pgbackrest/pgbackrest.go | Adds ArchivePushHistoryFiles helper to synchronously archive timeline history files. |
| internal/controller/postgrescluster/pgbackrest.go | Calls history recovery right after successful stanza creation and logs output. |
| internal/controller/postgrescluster/instance.go | Comment punctuation fix. |
| e2e-tests/tests/upgrade-minor/99-remove-cluster-gracefully.yaml | Adds explicit finalizer removal + deletion for upgrade-minor PostgresCluster during cleanup. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/10-write-more-data.yaml | Writes a second batch of rows before the second restore scenario. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/10-assert.yaml | Asserts the “second batch written” marker ConfigMap exists. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/11-backup.yaml | Adds a second full backup CR for the second PITR scenario. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/11-assert.yaml | Asserts the second backup job/CR succeeds. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/12-restore.yaml | Performs a second PITR restore using a captured time target. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/12-assert.yaml | Asserts restore success and cluster returns to ready with 3 replicas. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/13-verify-after-restore.yaml | Verifies row presence/absence after the second PITR restore. |
| e2e-tests/tests/migration-from-crunchy-backup-restore/13-assert.yaml | Re-checks the post-restore data captured in the ConfigMap. |
| e2e-tests/run-pr.csv | Adds additional migration suites to the PR E2E run list. |
| config/crd/bases/upstream.pgv2.percona.com_postgresclusters.yaml | CRD file is modified but currently contains unresolved merge-conflict markers (must be fixed). |
| combined += "\nstderr: " + s | ||
| } | ||
| if err != nil { | ||
| return combined, errors.Wrap(err, combined) |
| kubectl -n ${NAMESPACE} patch postgrescluster upgrade-minor --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]' | ||
| kubectl -n ${NAMESPACE} delete postgrescluster upgrade-minor || true |
| push_failed=0 | ||
| while IFS= read -r f; do | ||
| if pgbackrest --stanza=db --log-level-console=info archive-push --no-archive-async "${f}"; then | ||
| echo "pushed ${f}" | ||
| else | ||
| echo "FAILED to push ${f}" >&2 | ||
| push_failed=1 | ||
| fi | ||
| done < "${tmplist}" | ||
| rm -f "${tmplist}" | ||
|
|
||
| [ "${push_failed}" -eq 0 ] || exit 1 | ||
| ` | ||
| err := exec(ctx, nil, &stdout, &stderr, "bash", "-ceu", "--", script) | ||
| combined := stdout.String() |
| migration-from-crunchy-standby | ||
| migration-from-crunchy-pv | ||
| migration-from-crunchy-backup-restore |
commit: 26ce453 |
CHANGE DESCRIPTION
Problem:
During a dataSource bootstrap restore, postgres promotes from TL1 to TL2
and immediately passes 00000002.history to archive_command. pgBackRest's
async archiver silently drops the push (error 103) when archive.info does
not yet exist and postgres never retries. Without 00000002.history in
the archive, pg_rewind on replicas fails with "could not find common
ancestor" after any subsequent PITR restore.
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
Config/Logging/Testability