Skip to content

K8SPG-1012 stanza timing fix#1575

Open
hors wants to merge 47 commits into
mainfrom
stanza-timing-fix
Open

K8SPG-1012 stanza timing fix#1575
hors wants to merge 47 commits into
mainfrom
stanza-timing-fix

Conversation

@hors

@hors hors commented Apr 28, 2026

Copy link
Copy Markdown
Collaborator

CHANGE DESCRIPTION

Problem:

During a dataSource bootstrap restore, postgres promotes from TL1 to TL2
and immediately passes 00000002.history to archive_command. pgBackRest's
async archiver silently drops the push (error 103) when archive.info does
not yet exist and postgres never retries. Without 00000002.history in
the archive, pg_rewind on replicas fails with "could not find common
ancestor" after any subsequent PITR restore.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

egegunes and others added 30 commits April 20, 2026 16:58
Add three test suites covering the supported migration paths from
Crunchy Data PGO v5 to the Percona PostgreSQL Operator:

- migration-from-crunchy-backup-restore: seeds a Percona cluster from a
  Crunchy pgBackRest backup via spec.dataSource, then validates a full
  restore

- migration-from-crunchy-standby: promotes a Percona standby cluster that
  streams from a live Crunchy primary, verifying cutover, WAL replay
  stabilisation, and post-promotion backup/restore

- migration-from-crunchy-pv: reuses the Crunchy primary's data PV directly

Also exports CPGO_VERSION in vars.sh so all three tests can pin the Crunchy
operator version consistently.
During a dataSource bootstrap restore, postgres promotes from TL1 to TL2
and immediately passes 00000002.history to archive_command.  pgBackRest's
async archiver silently drops the push (error 103) when archive.info does
not yet exist — and postgres never retries.  Without 00000002.history in
the archive, pg_rewind on replicas fails with "could not find common
ancestor" after any subsequent PITR restore.
Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.0.0-20221128193559-754e69321358 to 0.1.1.
- [Release notes](https://github.com/Azure/go-ntlmssp/releases)
- [Commits](https://github.com/Azure/go-ntlmssp/commits/v0.1.1)

---
updated-dependencies:
- dependency-name: github.com/Azure/go-ntlmssp
  dependency-version: 0.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Comment on lines +3030 to +3043
// Re-push any timeline history files stranded by the async-archiver race:
// postgres archives 00000002.history during bootstrap promotion before the
// stanza exists; pgBackRest drops it silently (error 103) and postgres
// never retries. Without it pg_rewind fails on replicas after PITR.
log := logging.FromContext(ctx)
historyOut, historyErr := pgbackrest.Executor(exec).ArchivePushHistoryFiles(ctx)
if historyErr != nil {
r.Recorder.Event(postgresCluster, corev1.EventTypeWarning,
"ArchivePushHistoryFilesFailed", historyErr.Error())
log.Error(historyErr, "timeline history file recovery failed",
"pod", writableInstanceName, "output", historyOut)
} else if historyOut != "" {
log.Info("timeline history file recovery", "output", historyOut)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand we need to do this after stanza is created but i wonder if we can do this in the caller of this function reconcilePGBackRest after line 1642 and if configHashMismatch is false

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@egegunes Moving it to the caller isn't straightforward. reconcileStanzaCreate returns (false, nil) in three distinct cases:

Stanza just created successfully ← the only case where we want ArchivePushHistoryFiles
Stanza already created (stanzasCreated == true) ← should not run it again
Cluster not yet writable (!clusterWritable) ← exec would target the wrong pod / fail
So checking !configHashMismatch && err == nil at the caller doesn't let us distinguish case 1 from cases 2 and 3. We'd be exec-ing into a pod on every reconcile, including when the cluster isn't writable yet.

To make it work at the caller level we'd need a third return value (e.g. stanzaJustCreated bool) and also either return writableInstanceName or re-discover the writable instance in the caller, since the exec closure is built around it.

Given that reconcileStanzaCreate already has the writable instance name in scope and the call sits naturally right after successful stanza creation but up to you.

Base automatically changed from crd-rename to main May 6, 2026 17:45
Copilot AI review requested due to automatic review settings June 11, 2026 14:01
@hors hors requested a review from DhruthiKV as a code owner June 11, 2026 14:01
hors and others added 3 commits June 11, 2026 17:02
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a pgBackRest/PostgreSQL bootstrap timing race where timeline history files (e.g. 00000002.history) can be dropped by the async archiver before archive.info exists, causing later pg_rewind failures on replicas after subsequent PITR restores.

Changes:

  • Add a synchronous recovery step to (re)push any *.history files from pg_wal immediately after stanza-create.
  • Extend E2E coverage for Crunchy→Percona migration with an additional backup + second PITR restore and data verification.
  • Update PR E2E suite list and adjust one upgrade-minor cleanup step.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
percona/controller/pgcluster/owner_ref_migration.go Whitespace-only adjustment near RBAC comment.
percona/certmanager/certmanager.go Whitespace-only adjustment in certificate apply logic.
internal/pgbackrest/pgbackrest.go Adds ArchivePushHistoryFiles helper to synchronously archive timeline history files.
internal/controller/postgrescluster/pgbackrest.go Calls history recovery right after successful stanza creation and logs output.
internal/controller/postgrescluster/instance.go Comment punctuation fix.
e2e-tests/tests/upgrade-minor/99-remove-cluster-gracefully.yaml Adds explicit finalizer removal + deletion for upgrade-minor PostgresCluster during cleanup.
e2e-tests/tests/migration-from-crunchy-backup-restore/10-write-more-data.yaml Writes a second batch of rows before the second restore scenario.
e2e-tests/tests/migration-from-crunchy-backup-restore/10-assert.yaml Asserts the “second batch written” marker ConfigMap exists.
e2e-tests/tests/migration-from-crunchy-backup-restore/11-backup.yaml Adds a second full backup CR for the second PITR scenario.
e2e-tests/tests/migration-from-crunchy-backup-restore/11-assert.yaml Asserts the second backup job/CR succeeds.
e2e-tests/tests/migration-from-crunchy-backup-restore/12-restore.yaml Performs a second PITR restore using a captured time target.
e2e-tests/tests/migration-from-crunchy-backup-restore/12-assert.yaml Asserts restore success and cluster returns to ready with 3 replicas.
e2e-tests/tests/migration-from-crunchy-backup-restore/13-verify-after-restore.yaml Verifies row presence/absence after the second PITR restore.
e2e-tests/tests/migration-from-crunchy-backup-restore/13-assert.yaml Re-checks the post-restore data captured in the ConfigMap.
e2e-tests/run-pr.csv Adds additional migration suites to the PR E2E run list.
config/crd/bases/upstream.pgv2.percona.com_postgresclusters.yaml CRD file is modified but currently contains unresolved merge-conflict markers (must be fixed).

combined += "\nstderr: " + s
}
if err != nil {
return combined, errors.Wrap(err, combined)
Comment on lines +15 to +16
kubectl -n ${NAMESPACE} patch postgrescluster upgrade-minor --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
kubectl -n ${NAMESPACE} delete postgrescluster upgrade-minor || true
Copilot AI review requested due to automatic review settings June 11, 2026 17:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment on lines +146 to +160
push_failed=0
while IFS= read -r f; do
if pgbackrest --stanza=db --log-level-console=info archive-push --no-archive-async "${f}"; then
echo "pushed ${f}"
else
echo "FAILED to push ${f}" >&2
push_failed=1
fi
done < "${tmplist}"
rm -f "${tmplist}"

[ "${push_failed}" -eq 0 ] || exit 1
`
err := exec(ctx, nil, &stdout, &stderr, "bash", "-ceu", "--", script)
combined := stdout.String()
Comment thread e2e-tests/run-pr.csv
Comment on lines +36 to +38
migration-from-crunchy-standby
migration-from-crunchy-pv
migration-from-crunchy-backup-restore
@JNKPercona

Copy link
Copy Markdown
Collaborator
Test Name Result Time
backup-enable-disable passed 00:12:53
builtin-extensions passed 00:00:00
cert-manager-tls passed 00:00:00
custom-envs passed 00:00:00
custom-tls passed 00:00:00
database-init-sql passed 00:00:00
demand-backup passed 00:00:00
demand-backup-offline-snapshot passed 00:00:00
dynamic-configuration passed 00:00:00
finalizers passed 00:00:00
init-deploy passed 00:00:00
huge-pages passed 00:00:00
major-upgrade-14-to-15 passed 00:00:00
major-upgrade-15-to-16 passed 00:00:00
major-upgrade-16-to-17 passed 00:00:00
major-upgrade-17-to-18 passed 00:00:00
ldap passed 00:00:00
ldap-tls passed 00:00:00
monitoring passed 00:00:00
monitoring-pmm3 passed 00:00:00
one-pod passed 00:00:00
operator-self-healing passed 00:00:00
pitr passed 00:00:00
scaling passed 00:00:00
scheduled-backup passed 00:00:00
self-healing passed 00:00:00
sidecars passed 00:00:00
standby-pgbackrest passed 00:00:00
standby-streaming passed 00:00:00
start-from-backup passed 00:00:00
tablespaces passed 00:00:00
telemetry-transfer passed 00:00:00
upgrade-consistency passed 00:00:00
upgrade-minor passed 00:00:00
users passed 00:00:00
migration-from-crunchy-standby passed 00:00:00
migration-from-crunchy-pv passed 00:00:00
migration-from-crunchy-backup-restore passed 00:00:00
Summary Value
Tests Run 38/38
Job Duration 00:27:45
Total Test Time 00:12:53

commit: 26ce453
image: perconalab/percona-postgresql-operator:PR-1575-26ce453f2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants