K8SPG-1012 stanza timing fix by hors · Pull Request #1575 · percona/percona-postgresql-operator

hors · 2026-04-28T21:35:23Z

CHANGE DESCRIPTION

Problem:

During a dataSource bootstrap restore, postgres promotes from TL1 to TL2
and immediately passes 00000002.history to archive_command. pgBackRest's
async archiver silently drops the push (error 103) when archive.info does
not yet exist and postgres never retries. Without 00000002.history in
the archive, pg_rewind on replicas fails with "could not find common
ancestor" after any subsequent PITR restore.

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Are all needed new/changed options added to the Helm Chart?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PG version?
Does the change support oldest and newest supported Kubernetes version?

Add three test suites covering the supported migration paths from Crunchy Data PGO v5 to the Percona PostgreSQL Operator: - migration-from-crunchy-backup-restore: seeds a Percona cluster from a Crunchy pgBackRest backup via spec.dataSource, then validates a full restore - migration-from-crunchy-standby: promotes a Percona standby cluster that streams from a live Crunchy primary, verifying cutover, WAL replay stabilisation, and post-promotion backup/restore - migration-from-crunchy-pv: reuses the Crunchy primary's data PV directly Also exports CPGO_VERSION in vars.sh so all three tests can pin the Crunchy operator version consistently.

During a dataSource bootstrap restore, postgres promotes from TL1 to TL2 and immediately passes 00000002.history to archive_command. pgBackRest's async archiver silently drops the push (error 103) when archive.info does not yet exist — and postgres never retries. Without 00000002.history in the archive, pg_rewind on replicas fails with "could not find common ancestor" after any subsequent PITR restore.

Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.0.0-20221128193559-754e69321358 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/commits/v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>

egegunes · 2026-04-30T06:27:17Z

+	// Re-push any timeline history files stranded by the async-archiver race:
+	// postgres archives 00000002.history during bootstrap promotion before the
+	// stanza exists; pgBackRest drops it silently (error 103) and postgres
+	// never retries.  Without it pg_rewind fails on replicas after PITR.
+	log := logging.FromContext(ctx)
+	historyOut, historyErr := pgbackrest.Executor(exec).ArchivePushHistoryFiles(ctx)
+	if historyErr != nil {
+		r.Recorder.Event(postgresCluster, corev1.EventTypeWarning,
+			"ArchivePushHistoryFilesFailed", historyErr.Error())
+		log.Error(historyErr, "timeline history file recovery failed",
+			"pod", writableInstanceName, "output", historyOut)
+	} else if historyOut != "" {
+		log.Info("timeline history file recovery", "output", historyOut)
+	}


i understand we need to do this after stanza is created but i wonder if we can do this in the caller of this function reconcilePGBackRest after line 1642 and if configHashMismatch is false

@egegunes Moving it to the caller isn't straightforward. reconcileStanzaCreate returns (false, nil) in three distinct cases:

Stanza just created successfully ← the only case where we want ArchivePushHistoryFiles
Stanza already created (stanzasCreated == true) ← should not run it again
Cluster not yet writable (!clusterWritable) ← exec would target the wrong pod / fail
So checking !configHashMismatch && err == nil at the caller doesn't let us distinguish case 1 from cases 2 and 3. We'd be exec-ing into a pod on every reconcile, including when the cluster isn't writable yet.

To make it work at the caller level we'd need a third return value (e.g. stanzaJustCreated bool) and also either return writableInstanceName or re-discover the writable instance in the caller, since the exec closure is built around it.

Given that reconcileStanzaCreate already has the writable instance name in scope and the call sits naturally right after successful stanza creation but up to you.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Copilot

Pull request overview

This PR addresses a pgBackRest/PostgreSQL bootstrap timing race where timeline history files (e.g. 00000002.history) can be dropped by the async archiver before archive.info exists, causing later pg_rewind failures on replicas after subsequent PITR restores.

Changes:

Add a synchronous recovery step to (re)push any *.history files from pg_wal immediately after stanza-create.
Extend E2E coverage for Crunchy→Percona migration with an additional backup + second PITR restore and data verification.
Update PR E2E suite list and adjust one upgrade-minor cleanup step.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
percona/controller/pgcluster/owner_ref_migration.go	Whitespace-only adjustment near RBAC comment.
percona/certmanager/certmanager.go	Whitespace-only adjustment in certificate apply logic.
internal/pgbackrest/pgbackrest.go	Adds `ArchivePushHistoryFiles` helper to synchronously archive timeline history files.
internal/controller/postgrescluster/pgbackrest.go	Calls history recovery right after successful stanza creation and logs output.
internal/controller/postgrescluster/instance.go	Comment punctuation fix.
e2e-tests/tests/upgrade-minor/99-remove-cluster-gracefully.yaml	Adds explicit finalizer removal + deletion for `upgrade-minor` PostgresCluster during cleanup.
e2e-tests/tests/migration-from-crunchy-backup-restore/10-write-more-data.yaml	Writes a second batch of rows before the second restore scenario.
e2e-tests/tests/migration-from-crunchy-backup-restore/10-assert.yaml	Asserts the “second batch written” marker ConfigMap exists.
e2e-tests/tests/migration-from-crunchy-backup-restore/11-backup.yaml	Adds a second full backup CR for the second PITR scenario.
e2e-tests/tests/migration-from-crunchy-backup-restore/11-assert.yaml	Asserts the second backup job/CR succeeds.
e2e-tests/tests/migration-from-crunchy-backup-restore/12-restore.yaml	Performs a second PITR restore using a captured time target.
e2e-tests/tests/migration-from-crunchy-backup-restore/12-assert.yaml	Asserts restore success and cluster returns to ready with 3 replicas.
e2e-tests/tests/migration-from-crunchy-backup-restore/13-verify-after-restore.yaml	Verifies row presence/absence after the second PITR restore.
e2e-tests/tests/migration-from-crunchy-backup-restore/13-assert.yaml	Re-checks the post-restore data captured in the ConfigMap.
e2e-tests/run-pr.csv	Adds additional migration suites to the PR E2E run list.
config/crd/bases/upstream.pgv2.percona.com_postgresclusters.yaml	CRD file is modified but currently contains unresolved merge-conflict markers (must be fixed).

+		combined += "\nstderr: " + s
+	}
+	if err != nil {
+		return combined, errors.Wrap(err, combined)


+      kubectl -n ${NAMESPACE} patch postgrescluster upgrade-minor --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
+      kubectl -n ${NAMESPACE} delete postgrescluster upgrade-minor || true


Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

+push_failed=0
+while IFS= read -r f; do
+    if pgbackrest --stanza=db --log-level-console=info archive-push --no-archive-async "${f}"; then
+        echo "pushed ${f}"
+    else
+        echo "FAILED to push ${f}" >&2
+        push_failed=1
+    fi
+done < "${tmplist}"
+rm -f "${tmplist}"
+
+[ "${push_failed}" -eq 0 ] || exit 1
+`
+	err := exec(ctx, nil, &stdout, &stderr, "bash", "-ceu", "--", script)
+	combined := stdout.String()


+migration-from-crunchy-standby
+migration-from-crunchy-pv
+migration-from-crunchy-backup-restore


JNKPercona · 2026-06-12T10:53:43Z

Test Name	Result	Time
backup-enable-disable	passed	00:12:53
builtin-extensions	passed	00:00:00
cert-manager-tls	passed	00:00:00
custom-envs	passed	00:00:00
custom-tls	passed	00:00:00
database-init-sql	passed	00:00:00
demand-backup	passed	00:00:00
demand-backup-offline-snapshot	passed	00:00:00
dynamic-configuration	passed	00:00:00
finalizers	passed	00:00:00
init-deploy	passed	00:00:00
huge-pages	passed	00:00:00
major-upgrade-14-to-15	passed	00:00:00
major-upgrade-15-to-16	passed	00:00:00
major-upgrade-16-to-17	passed	00:00:00
major-upgrade-17-to-18	passed	00:00:00
ldap	passed	00:00:00
ldap-tls	passed	00:00:00
monitoring	passed	00:00:00
monitoring-pmm3	passed	00:00:00
one-pod	passed	00:00:00
operator-self-healing	passed	00:00:00
pitr	passed	00:00:00
scaling	passed	00:00:00
scheduled-backup	passed	00:00:00
self-healing	passed	00:00:00
sidecars	passed	00:00:00
standby-pgbackrest	passed	00:00:00
standby-streaming	passed	00:00:00
start-from-backup	passed	00:00:00
tablespaces	passed	00:00:00
telemetry-transfer	passed	00:00:00
upgrade-consistency	passed	00:00:00
upgrade-minor	passed	00:00:00
users	passed	00:00:00
migration-from-crunchy-standby	passed	00:00:00
migration-from-crunchy-pv	passed	00:00:00
migration-from-crunchy-backup-restore	passed	00:00:00

Summary	Value
Tests Run	38/38
Job Duration	00:27:45
Total Test Time	00:12:53

commit: 26ce453
image: perconalab/percona-postgresql-operator:PR-1575-26ce453f2

egegunes and others added 30 commits April 20, 2026 16:58

rename upstream api package

8d39567

fix old api references

bb21125

rename CRDs

3ade8de

fix api version

c4baaf7

remove testing dir

ac6f625

fix rbacs

3955649

add migration controller

5cee9a4

update manifests

5e81eac

fix unit tests

92f6a31

fix upgrade-minor

8f121cc

Merge branch 'main' into crd-rename

bb84b04

Merge branch 'main' into crd-rename

50cd1dd

fix typo

b1b4cfd

Merge branch 'main' into crd-rename

2c70a21

address mayank's comments

855095d

remove references to testing/

87e562b

use require to assert owner reference

0aefa6e

fix issues due to review comments

5986a05

address review comments

66a29e0

Merge branch 'main' into crd-rename

497af8d

fix timeouts

ad66ef7

fix test

706bb71

K8SPG-1000 Add extensions major upgrade tests

28c6eb5

consider in docker.io prefix in get_container_image function

bf40ab8

address mayank's comments

6816797

remove references to testing/

13bebd2

use require to assert owner reference

a854cac

hors marked this pull request as ready for review April 29, 2026 10:58

hors requested review from egegunes, eleo007, gkech, jvpasinatto, mayankshah1607, nmarukovich, oksana-grishchenko, pooknull and valmiranogueira as code owners April 29, 2026 10:58

egegunes reviewed Apr 30, 2026

View reviewed changes

egegunes and others added 4 commits April 30, 2026 14:23

remove migration controller

f1ede33

fix updating owner references of issuer & certificates

f15c14a

add status condition for migration status

b71131c

Merge branch 'crd-rename' into stanza-timing-fix

783a445

Base automatically changed from crd-rename to main May 6, 2026 17:45

Merge remote-tracking branch 'origin/main' into stanza-timing-fix

81b3319

Copilot AI review requested due to automatic review settings June 11, 2026 14:01

hors requested a review from DhruthiKV as a code owner June 11, 2026 14:01

Copilot started reviewing on behalf of hors June 11, 2026 14:01 View session

hors and others added 3 commits June 11, 2026 17:02

fix file

27557c3

Apply suggestion from @github-actions[bot]

5557553

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Apply suggestion from @github-actions[bot]

5191d59

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Copilot AI reviewed Jun 11, 2026

View reviewed changes

fix test upgrade-minor

26ce453

Copilot AI review requested due to automatic review settings June 11, 2026 17:55

Copilot started reviewing on behalf of hors June 11, 2026 17:55 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

egegunes approved these changes Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

K8SPG-1012 stanza timing fix#1575

K8SPG-1012 stanza timing fix#1575
hors wants to merge 47 commits into
mainfrom
stanza-timing-fix

hors commented Apr 28, 2026 •

edited

Loading

Uh oh!

egegunes Apr 30, 2026

Uh oh!

hors Jun 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

JNKPercona commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

		kubectl -n ${NAMESPACE} patch postgrescluster upgrade-minor --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
		kubectl -n ${NAMESPACE} delete postgrescluster upgrade-minor \|\| true

Conversation

hors commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CHANGE DESCRIPTION

CHECKLIST

Uh oh!

egegunes Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

hors Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

JNKPercona commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

hors commented Apr 28, 2026 •

edited

Loading