operator: keep rolling restart to one broker pod per reconcile by david-yu · Pull Request #1537 · redpanda-data/redpanda-operator

david-yu · 2026-05-20T17:10:46Z

Summary

Two latent paths in the rolling restart loop could delete more than one
broker pod in a single reconcile. Both are now closed.

brokerMap parsing. Brokers were keyed by
strings.Split(broker.InternalRPCAddress, \".\")[0]. That produces
the wrong key when the admin API returns the address as host:port
(advertised RPC port suffix) or as a bare IP — the lookup
brokerMap[pod.GetName()] then misses, and the pod is silently
treated as orphan. Now we strip the port via net.SplitHostPort and
key by both the first DNS label and the raw host.
"Pod not in brokerMap" branch. The controller deleted the pod
and set continueExecution=true, so every unmapped pod in rollSet
was deleted in the same reconcile pass. Combined with the
parsing bug above (or any transient cause that empties AllNodes —
rolling controller leadership, brief admin-API 503s, NodeID flip
after PVC reset), this is the only single-reconcile path that
produces multiple brokers literally in `Terminating` at the same
time. Now the branch deletes one pod and requeues, so each pod's
state is re-evaluated against fresh broker-map data before the next
one is touched.

Added `TestHasRecentlyReplacedPods` to pin the existing safety-gate
contract — Redpanda's cluster health view lags pod state during a
roll, and without the gate two pods can end up unavailable
concurrently. (Pre-existing on main; introduced by #1446. Not the
same path as the parallel-deletion fix above — see Scope below.)

Scope

This PR addresses the only code path in the single-cluster controller
that can delete multiple broker pods in one reconcile: the
`!inBrokerMap` branch, which previously kept iterating after each
delete. The fix is unambiguously correctness:

Parsing change — `net.SplitHostPort` + dual-key (first label
and raw host) means a freshly-restarted broker advertising
`pod.svc:33145` is no longer misclassified as orphan, even if some
broker still advertises the older format. Strict superset of the
previous keyspace; no regression for FQDN-only deployments.
One-at-a-time — the orphan branch is itself low-risk per pod,
but acting on every orphan in a single reconcile means a transient
admin-API state (rolling leadership, mid-restart `AllNodes` window,
address-format mismatch) tears down the whole pool. After this PR
we delete one and requeue, so the next pod is judged against an
updated map.

Probe-scheduling lag between cluster-`IsHealthy=true` and pod
`PodReady=True` is a separate concern handled by
`HasRecentlyReplacedPods` (already on `main`). The unit test added
here pins that contract but is not the customer-facing fix.

Follow-ups (not in this PR)

ENG-222 (per-broker prestart-risk API). The proper long-term fix
replaces the cluster-`IsHealthy` heuristic with the per-broker risks
endpoint introduced in core 25.1
(`rf1_offline`, `full_acks_produce_unavailable`, `unavailable`,
`acks1_data_loss`). The rpadmin client bindings are in
common-go#170;
the operator-side wiring follows once that lands.
Backport to `release/v26.1.x`. That branch is missing
`HasRecentlyReplacedPods` entirely (added in Segment acceptance tests and add upgrade/rolling-restart test #1446). The smaller
cherry-pick I'd propose is the gate + the two fixes in this PR.
Production code applies cleanly to v25.2.x / v25.3.x / v26.1.x; the
new unit test only compiles on v26.1.x because v25.x lacks the
`MulticlusterStatefulSet` type used in the table. Will open the
v26.1.x backport once this lands.

Test plan

`go test ./operator/internal/lifecycle/... -count=1` — including
the new `TestHasRecentlyReplacedPods`
`go vet ./operator/internal/controller/redpanda/...
./operator/internal/lifecycle/...`
`golangci-lint run --fast-only ./internal/controller/redpanda/...
./internal/lifecycle/...` — 0 issues

🤖 Generated with Claude Code

The rolling restart loop in the single-cluster Redpanda controller had two latent ways to delete more than one broker pod in a single reconcile: - When a pod's name didn't appear in brokerMap, the loop deleted it and set continueExecution=true, meaning every unmapped pod in rollSet got deleted in the same pass. A transient admin-API blip, address-format mismatch, or ghost broker ID could classify multiple healthy pods as orphan and tear them all down at once. Now the branch deletes one and requeues so we re-evaluate against fresh state before touching the next pod. - The brokerMap parser keyed brokers by strings.Split(InternalRPCAddress, ".")[0], which silently produced the wrong key when the admin API returned the address as "host:port" or as a bare IP. Brokers in that state were treated as orphans by the lookup above. Now the parser strips the port via net.SplitHostPort and keys by both the first DNS label and the raw host, so a freshly-restarted broker advertising "pod.svc:33145" or an IP is still recognized. Added a TestHasRecentlyReplacedPods unit test pinning the existing safety-gate contract that defers a roll while a just-replaced pod is still coming up — Redpanda's cluster health view lags pod state, and without the gate two pods can end up Terminating concurrently. ENG-222 (per-broker prestart-risk API) remains the proper long-term fix; it needs the new endpoint exposed in common-go/rpadmin first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu · 2026-05-20T19:00:16Z

Will close for now as needs more discussion.

Incorporates the two correctness fixes from the (now-closed) #1537 into the ENG-222 wiring stack: - brokerMap parsing now strips a "host:port" suffix via net.SplitHostPort and keys by both the first DNS label and the raw host. Previously strings.Split(InternalRPCAddress, ".")[0] silently mis-keyed when the admin API returned the address as "host:port" or a bare IP, causing the roll loop's lookup to miss and treat the broker as orphan. - The "pod not in brokerMap" branch now deletes one pod and requeues instead of deleting + continuing. Combined with the parsing bug above (or any transient cause that empties AllNodes — rolling controller leadership, brief admin-API 503s, NodeID flip after PVC reset), the previous behavior was the only single-reconcile path that could produce multiple brokers literally Terminating at the same time. Also pins the existing HasRecentlyReplacedPods gate's contract with a new TestHasRecentlyReplacedPods unit test in pool_test.go covering the no-op, all-old-revision, all-Ready-new-revision, just-replaced-not-yet- Ready, terminating-pod, and stuck-in-Pending cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu force-pushed the dyu/operator-rolling-restart-one-pod-at-a-time branch from 9d5784c to 0b325e8 Compare May 20, 2026 17:20

david-yu mentioned this pull request May 20, 2026

rpadmin: add PreRestartProbe and PostRestartProbe redpanda-data/common-go#170

Merged

4 tasks

david-yu marked this pull request as ready for review May 20, 2026 17:36

david-yu requested review from RafalKorepta, andrewstucki, chrisseto, gene-redpanda and hidalgopl as code owners May 20, 2026 17:36

david-yu added backport v25.2.x v25.3.x v26.1.x labels May 20, 2026

david-yu closed this May 20, 2026

david-yu mentioned this pull request May 21, 2026

operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness #1547

Draft

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

operator: keep rolling restart to one broker pod per reconcile#1537

operator: keep rolling restart to one broker pod per reconcile#1537
david-yu wants to merge 1 commit into
mainfrom
dyu/operator-rolling-restart-one-pod-at-a-time

david-yu commented May 20, 2026 •

edited

Loading

Uh oh!

david-yu commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

david-yu commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope

Follow-ups (not in this PR)

Test plan

Uh oh!

david-yu commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

david-yu commented May 20, 2026 •

edited

Loading