Fix galera bootstrap deadlock when all pods are killed simultaneously by lmiccini · Pull Request #499 · openstack-k8s-operators/mariadb-operator

lmiccini · 2026-07-04T08:31:10Z

Three issues combined to delay galera cluster recovery by 12-54+ minutes when all pods were killed at once.

Status update race: injectGcommURI() stored gcomm state in Status.Attributes. The deferred PatchInstance (JSON merge patch) wrote this back, overwriting pod-pushed attributes (including ContainerIDs) with stale data from the start of the reconcile.

Fix: move gcomm injection tracking to an in-memory map (gcommState) on the reconciler. The operator no longer modifies Status.Attributes, so pod-pushed data is preserved.
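
A minimal sketch of what in-memory injection tracking could look like, assuming a mutex-guarded map keyed by pod name. The type `gcommTracker` and its method names are hypothetical stand-ins for `gcommState` and the converted reconciler methods; the semantics shown (bootstrap counts as "in progress" once any pod has received a URI) are an assumption, not the operator's actual code.

```go
package main

import "sync"

// gcommTracker is a hypothetical stand-in for the reconciler's in-memory
// gcommState: it maps a pod name to the gcomm URI the operator injected.
type gcommTracker struct {
	mu    sync.Mutex
	state map[string]string
}

func newGcommTracker() *gcommTracker {
	return &gcommTracker{state: make(map[string]string)}
}

// recordInjection notes that uri was pushed to pod. Nothing is written to
// the resource status, so attributes pushed by pods are never touched.
func (t *gcommTracker) recordInjection(pod, uri string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.state[pod] = uri
}

// bootstrapInProgress reports whether any pod has been handed a gcomm URI
// (assumed semantics for this sketch).
func (t *gcommTracker) bootstrapInProgress() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.state) > 0
}

// podsWaitingForGcomm returns, in input order, the pods that have no
// recorded injection yet.
func (t *gcommTracker) podsWaitingForGcomm(pods []string) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var waiting []string
	for _, pod := range pods {
		if _, injected := t.state[pod]; !injected {
			waiting = append(waiting, pod)
		}
	}
	return waiting
}
```

Because the tracker lives only in process memory, the deferred status patch has nothing operator-owned to write back. The trade-off of any in-memory map is that its contents are lost if the operator process restarts.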

Unnecessary ContainerID check in findBestCandidate(): the function required all replicas to have attributes with ContainerIDs (CIDs) matching the currently running containers. However, pods restart faster than the reconcile cycle, so the CIDs never match. The check is unnecessary during bootstrap recovery because no pod has started mysqld (pods are blocked waiting for gcomm_uri), so the seqno on the persistent volume cannot change between container restarts.

Fix: remove the CID comparison from findBestCandidate(). Only require that all replicas have pushed attributes with valid seqnos. Log CID mismatches for observability without blocking the decision.
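
The relaxed selection rule can be sketched as follows. This is an illustration, not the operator's implementation: `podAttrs`, `pickBestCandidate`, and the `logf` callback are hypothetical names, and treating only non-negative integers as "valid" seqnos is an assumption of this sketch.

```go
package main

import "strconv"

// podAttrs is a hypothetical, simplified view of what a pod pushes.
type podAttrs struct {
	Seqno       string // last committed seqno reported by the pod
	ContainerID string // container that reported it
}

// pickBestCandidate returns the pod with the highest seqno. It refuses to
// decide (ok == false) unless every pod has pushed a valid seqno.
// A CID mismatch is only logged; it never blocks the decision.
func pickBestCandidate(
	pods []string,
	attrs map[string]podAttrs,
	runningCID map[string]string,
	logf func(format string, args ...interface{}),
) (best string, ok bool) {
	bestSeq := int64(-1)
	for _, pod := range pods {
		a, found := attrs[pod]
		if !found {
			return "", false // this replica has not reported yet
		}
		seq, err := strconv.ParseInt(a.Seqno, 10, 64)
		if err != nil || seq < 0 {
			return "", false // assumption: negative or garbled seqno is invalid
		}
		if a.ContainerID != runningCID[pod] {
			logf("pod %s: reported CID %q differs from running CID %q",
				pod, a.ContainerID, runningCID[pod])
		}
		if seq > bestSeq {
			best, bestSeq = pod, seq
		}
	}
	return best, best != ""
}
```

With strict `>` comparison, ties go to the pod that appears first in `pods`, which keeps the choice deterministic across reconciles.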

Spurious joiner push: when all pods were killed, the StatefulSet's AvailableReplicas could briefly remain > 0 (stale status), so Bootstrapped was still true. The operator then pushed joiner gcomm URIs to the pods, making them start mysqld against dead peers and wasting a restart cycle.

Fix: skip joiner gcomm push when Bootstrapped is set but no pods are actually Ready.
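
The guard described above reduces to a small predicate. In this sketch, `clusterView` and `shouldPushJoinerURI` are hypothetical names, and the two fields are an assumed summary of what the reconciler observes.

```go
package main

// clusterView is a hypothetical summary of the state seen in one reconcile.
type clusterView struct {
	Bootstrapped bool // may be stale: derived from AvailableReplicas > 0
	ReadyPods    int  // pods whose Ready condition is true right now
}

// shouldPushJoinerURI allows a joiner gcomm URI push only when the cluster
// is marked bootstrapped AND at least one pod is actually Ready to join.
func shouldPushJoinerURI(v clusterView) bool {
	return v.Bootstrapped && v.ReadyPods > 0
}
```

For example, `shouldPushJoinerURI(clusterView{Bootstrapped: true, ReadyPods: 0})` is false, which covers the stale-status window after every pod has been killed.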

Also add a periodic 10s requeue when not bootstrapped, ensuring the reconciler retries even when no external events trigger a reconcile.
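
A sketch of the periodic retry, kept free of external dependencies. The local `result` type only mirrors the `RequeueAfter` field of controller-runtime's `reconcile.Result`; `nextResult` and `bootstrapRetryInterval` are hypothetical names introduced for illustration.

```go
package main

import "time"

// bootstrapRetryInterval is the 10s retry period mentioned above.
const bootstrapRetryInterval = 10 * time.Second

// result is a local stand-in mirroring the RequeueAfter field of
// controller-runtime's reconcile.Result.
type result struct {
	RequeueAfter time.Duration
}

// nextResult asks for another reconcile after the retry interval while the
// cluster is not bootstrapped; otherwise it waits for external events.
func nextResult(bootstrapped bool) result {
	if !bootstrapped {
		return result{RequeueAfter: bootstrapRetryInterval}
	}
	return result{}
}
```

Returning a non-zero `RequeueAfter` means the reconciler is retried even if no pod or StatefulSet event arrives in the meantime.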

Functions converted to GaleraReconciler methods:

injectGcommURI: tracks injection in gcommState instead of attr.Gcomm
isBootstrapInProgress: checks gcommState instead of attr.Gcomm
getPodsWaitingForGcomm: checks gcommState instead of attr.Gcomm

Generated-By: claude-opus-4-6

Jira: https://redhat.atlassian.net/browse/OSPRH-32408

Signed-off-by: Luca Miccini <lmiccini@redhat.com>

openshift-ci · 2026-07-04T08:31:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmiccini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [lmiccini]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

lmiccini requested a review from dciabrin July 4, 2026 08:31

openshift-ci Bot requested a review from dprince July 4, 2026 08:31

openshift-ci Bot added the approved label Jul 4, 2026

lmiccini wants to merge 1 commit into openstack-k8s-operators:main from lmiccini:galera-bootstrap-fallback-timeout
