operator: keep rolling restart to one broker pod per reconcile#1537
Closed
david-yu wants to merge 1 commit into
Closed
operator: keep rolling restart to one broker pod per reconcile#1537david-yu wants to merge 1 commit into
david-yu wants to merge 1 commit into
Conversation
The rolling restart loop in the single-cluster Redpanda controller had
two latent ways to delete more than one broker pod in a single
reconcile:
- When a pod's name didn't appear in brokerMap, the loop deleted it
and set continueExecution=true, meaning every unmapped pod in
rollSet got deleted in the same pass. A transient admin-API blip,
address-format mismatch, or ghost broker ID could classify multiple
healthy pods as orphan and tear them all down at once. Now the
branch deletes one and requeues so we re-evaluate against fresh
state before touching the next pod.
- The brokerMap parser keyed brokers by
strings.Split(InternalRPCAddress, ".")[0], which silently produced
the wrong key when the admin API returned the address as
"host:port" or as a bare IP. Brokers in that state were treated as
orphans by the lookup above. Now the parser strips the port via
net.SplitHostPort and keys by both the first DNS label and the raw
host, so a freshly-restarted broker advertising "pod.svc:33145" or
an IP is still recognized.
Added a TestHasRecentlyReplacedPods unit test pinning the existing
safety-gate contract that defers a roll while a just-replaced pod is
still coming up — Redpanda's cluster health view lags pod state, and
without the gate two pods can end up Terminating concurrently.
ENG-222 (per-broker prestart-risk API) remains the proper long-term
fix; it needs the new endpoint exposed in common-go/rpadmin first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9d5784c to
0b325e8
Compare
4 tasks
Contributor
Author
|
Will close for now as needs more discussion. |
6 tasks
david-yu
added a commit
that referenced
this pull request
May 21, 2026
Incorporates the two correctness fixes from the (now-closed) #1537 into the ENG-222 wiring stack: - brokerMap parsing now strips a "host:port" suffix via net.SplitHostPort and keys by both the first DNS label and the raw host. Previously strings.Split(InternalRPCAddress, ".")[0] silently mis-keyed when the admin API returned the address as "host:port" or a bare IP, causing the roll loop's lookup to miss and treat the broker as orphan. - The "pod not in brokerMap" branch now deletes one pod and requeues instead of deleting + continuing. Combined with the parsing bug above (or any transient cause that empties AllNodes — rolling controller leadership, brief admin-API 503s, NodeID flip after PVC reset), the previous behavior was the only single-reconcile path that could produce multiple brokers literally Terminating at the same time. Also pins the existing HasRecentlyReplacedPods gate's contract with a new TestHasRecentlyReplacedPods unit test in pool_test.go covering the no-op, all-old-revision, all-Ready-new-revision, just-replaced-not-yet- Ready, terminating-pod, and stuck-in-Pending cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two latent paths in the rolling restart loop could delete more than one
broker pod in a single reconcile. Both are now closed.
brokerMap parsing. Brokers were keyed by
strings.Split(broker.InternalRPCAddress, \".\")[0]. That producesthe wrong key when the admin API returns the address as
host:port(advertised RPC port suffix) or as a bare IP — the lookup
brokerMap[pod.GetName()]then misses, and the pod is silentlytreated as orphan. Now we strip the port via
net.SplitHostPortandkey by both the first DNS label and the raw host.
"Pod not in brokerMap" branch. The controller deleted the pod
and set
continueExecution=true, so every unmapped pod inrollSetwas deleted in the same reconcile pass. Combined with the
parsing bug above (or any transient cause that empties
AllNodes—rolling controller leadership, brief admin-API 503s, NodeID flip
after PVC reset), this is the only single-reconcile path that
produces multiple brokers literally in `Terminating` at the same
time. Now the branch deletes one pod and requeues, so each pod's
state is re-evaluated against fresh broker-map data before the next
one is touched.
Added `TestHasRecentlyReplacedPods` to pin the existing safety-gate
contract — Redpanda's cluster health view lags pod state during a
roll, and without the gate two pods can end up unavailable
concurrently. (Pre-existing on
main; introduced by #1446. Not thesame path as the parallel-deletion fix above — see Scope below.)
Scope
This PR addresses the only code path in the single-cluster controller
that can delete multiple broker pods in one reconcile: the
`!inBrokerMap` branch, which previously kept iterating after each
delete. The fix is unambiguously correctness:
and raw host) means a freshly-restarted broker advertising
`pod.svc:33145` is no longer misclassified as orphan, even if some
broker still advertises the older format. Strict superset of the
previous keyspace; no regression for FQDN-only deployments.
but acting on every orphan in a single reconcile means a transient
admin-API state (rolling leadership, mid-restart `AllNodes` window,
address-format mismatch) tears down the whole pool. After this PR
we delete one and requeue, so the next pod is judged against an
updated map.
Probe-scheduling lag between cluster-`IsHealthy=true` and pod
`PodReady=True` is a separate concern handled by
`HasRecentlyReplacedPods` (already on `main`). The unit test added
here pins that contract but is not the customer-facing fix.
Follow-ups (not in this PR)
ENG-222 (per-broker prestart-risk API). The proper long-term fix
replaces the cluster-`IsHealthy` heuristic with the per-broker risks
endpoint introduced in core 25.1
(`rf1_offline`, `full_acks_produce_unavailable`, `unavailable`,
`acks1_data_loss`). The rpadmin client bindings are in
common-go#170;
the operator-side wiring follows once that lands.
Backport to `release/v26.1.x`. That branch is missing
`HasRecentlyReplacedPods` entirely (added in Segment acceptance tests and add upgrade/rolling-restart test #1446). The smaller
cherry-pick I'd propose is the gate + the two fixes in this PR.
Production code applies cleanly to v25.2.x / v25.3.x / v26.1.x; the
new unit test only compiles on v26.1.x because v25.x lacks the
`MulticlusterStatefulSet` type used in the table. Will open the
v26.1.x backport once this lands.
Test plan
the new `TestHasRecentlyReplacedPods`
./operator/internal/lifecycle/...`
./internal/lifecycle/...` — 0 issues
🤖 Generated with Claude Code