Fix galera bootstrap deadlock when all pods are killed simultaneously #499

Open
lmiccini wants to merge 1 commit into openstack-k8s-operators:main from lmiccini:galera-bootstrap-fallback-timeout
@lmiccini lmiccini commented Jul 4, 2026

Contributor

Three issues combined to delay galera cluster recovery by 12-54+ minutes when all pods were killed at once.

  1. Status update race: injectGcommURI() stored gcomm state in Status.Attributes. The deferred PatchInstance (JSON merge patch) wrote this back, overwriting pod-pushed attributes (including ContainerIDs) with stale data from the start of the reconcile.

    Fix: move gcomm injection tracking to an in-memory map (gcommState) on the reconciler. The operator no longer modifies Status.Attributes, so pod-pushed data is preserved.

  2. Unnecessary ContainerID check in findBestCandidate(): the function required all replicas to have attributes with ContainerIDs (CIDs) matching the currently running containers. Because pods restart faster than the reconcile cycle, the recorded CIDs never match the running ones. The check is unnecessary during bootstrap recovery: no pod has started mysqld (pods are blocked waiting for gcomm_uri), so the seqno on the persistent volume cannot change between container restarts.

    Fix: remove the CID comparison from findBestCandidate(). Only require that all replicas have pushed attributes with valid seqnos. Log CID mismatches for observability without blocking the decision.

  3. Spurious joiner push: when all pods were killed, the StatefulSet's AvailableReplicas could remain > 0 briefly (stale status), causing Bootstrapped to be true. The operator pushed joiner gcomm URIs to pods, making them start mysqld against dead peers and wasting a restart cycle.

    Fix: skip joiner gcomm push when Bootstrapped is set but no pods are actually Ready.
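
Fixes 2 and 3 above can be sketched roughly as follows. This is an illustration under assumptions, not the operator's actual code: the `podAttrs` type and its fields, the convention that a negative seqno means "not valid", the tie-breaking rule, and the `shouldPushJoinerURI` helper are all hypothetical names and choices made for this example.

```go
package main

import (
	"fmt"
	"log"
)

// podAttrs is a hypothetical stand-in for the attributes a pod pushes to
// the operator. The real type in the operator may look different.
type podAttrs struct {
	Seqno       int64  // seqno read from the persistent volume; negative is treated as invalid here (assumption)
	ContainerID string // ID of the container that pushed these attributes
}

// findBestCandidate returns the pod with the highest seqno. It reports false
// when any replica has not pushed attributes with a valid seqno. A ContainerID
// mismatch is only logged; it does not block the decision.
// running maps a pod name to the ID of its currently running container.
func findBestCandidate(pods []string, attrs map[string]podAttrs, running map[string]string) (string, bool) {
	best, bestSeqno := "", int64(-1)
	for _, pod := range pods {
		a, ok := attrs[pod]
		if !ok || a.Seqno < 0 {
			return "", false // wait until every replica has reported
		}
		if cid, ok := running[pod]; ok && cid != a.ContainerID {
			log.Printf("pod %s: attributes pushed by container %s, now running %s", pod, a.ContainerID, cid)
		}
		// Strict comparison: on a tie, the earliest pod in the list wins (assumption).
		if a.Seqno > bestSeqno {
			best, bestSeqno = pod, a.Seqno
		}
	}
	return best, best != ""
}

// shouldPushJoinerURI is a hypothetical guard for fix 3: joiners only make
// sense when the cluster is marked bootstrapped and at least one pod is Ready.
func shouldPushJoinerURI(bootstrapped bool, readyPods int) bool {
	return bootstrapped && readyPods > 0
}

func main() {
	pods := []string{"galera-0", "galera-1", "galera-2"}
	attrs := map[string]podAttrs{
		"galera-0": {Seqno: 10, ContainerID: "old-a"},
		"galera-1": {Seqno: 42, ContainerID: "old-b"},
		"galera-2": {Seqno: 7, ContainerID: "c"},
	}
	running := map[string]string{"galera-0": "new-a", "galera-1": "new-b", "galera-2": "c"}

	best, ok := findBestCandidate(pods, attrs, running)
	fmt.Println(best, ok)
	fmt.Println(shouldPushJoinerURI(true, 0))
}
```

In this sketch, stale status (Bootstrapped set, zero Ready pods) makes `shouldPushJoinerURI` return false, so the reconciler would fall through to bootstrap recovery instead of starting mysqld against dead peers.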

Also add a periodic 10s requeue when not bootstrapped, ensuring the reconciler retries even when no external events trigger a reconcile.

Functions converted to GaleraReconciler methods:

  • injectGcommURI: tracks injection in gcommState instead of attr.Gcomm
  • isBootstrapInProgress: checks gcommState instead of attr.Gcomm
  • getPodsWaitingForGcomm: checks gcommState instead of attr.Gcomm
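
The shape of these methods might look something like the sketch below. It is not the code from this PR: `reconcilerSketch`, the map's key and value types, and the exact semantics of each method are assumptions chosen for illustration. The only points taken from the description are that the state lives in an in-memory map named gcommState on the reconciler and that the three functions consult it instead of attr.Gcomm.

```go
package main

import (
	"fmt"
	"sync"
)

// reconcilerSketch is a hypothetical stand-in for GaleraReconciler,
// reduced to the in-memory gcomm tracking state.
type reconcilerSketch struct {
	mu         sync.Mutex
	gcommState map[string]string // pod name -> gcomm URI injected into it (assumed layout)
}

func newReconcilerSketch() *reconcilerSketch {
	return &reconcilerSketch{gcommState: make(map[string]string)}
}

// injectGcommURI records the injection in memory only. Nothing is written
// to Status.Attributes, so pod-pushed attributes are never overwritten.
// The actual delivery of the URI to the pod is omitted from this sketch.
func (r *reconcilerSketch) injectGcommURI(pod, uri string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.gcommState[pod] = uri
}

// isBootstrapInProgress reports whether any pod has been given a gcomm URI
// (assumed semantics).
func (r *reconcilerSketch) isBootstrapInProgress() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.gcommState) > 0
}

// getPodsWaitingForGcomm returns, in input order, the pods that have no
// recorded injection yet.
func (r *reconcilerSketch) getPodsWaitingForGcomm(pods []string) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var waiting []string
	for _, pod := range pods {
		if _, ok := r.gcommState[pod]; !ok {
			waiting = append(waiting, pod)
		}
	}
	return waiting
}

func main() {
	r := newReconcilerSketch()
	pods := []string{"galera-0", "galera-1", "galera-2"}
	fmt.Println(r.isBootstrapInProgress(), r.getPodsWaitingForGcomm(pods))

	r.injectGcommURI("galera-1", "gcomm://")
	fmt.Println(r.isBootstrapInProgress(), r.getPodsWaitingForGcomm(pods))
}
```

One trade-off of any in-memory map is that its contents are lost when the operator process restarts, so a design along these lines has to tolerate rebuilding that state from scratch.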

Generated-By: claude-opus-4-6

Jira: https://redhat.atlassian.net/browse/OSPRH-32408

Signed-off-by: Luca Miccini <lmiccini@redhat.com>
@lmiccini lmiccini requested a review from dciabrin July 4, 2026 08:31
@openshift-ci openshift-ci Bot requested a review from dprince July 4, 2026 08:31
openshift-ci Bot commented Jul 4, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmiccini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label Jul 4, 2026