Skip to content

operator: keep rolling restart to one broker pod per reconcile#1537

Closed
david-yu wants to merge 1 commit into
mainfrom
dyu/operator-rolling-restart-one-pod-at-a-time
Closed

operator: keep rolling restart to one broker pod per reconcile#1537
david-yu wants to merge 1 commit into
mainfrom
dyu/operator-rolling-restart-one-pod-at-a-time

Conversation

@david-yu
Copy link
Copy Markdown
Contributor

@david-yu david-yu commented May 20, 2026

Summary

Two latent paths in the rolling restart loop could delete more than one
broker pod in a single reconcile. Both are now closed.

  • brokerMap parsing. Brokers were keyed by
    strings.Split(broker.InternalRPCAddress, \".\")[0]. That produces
    the wrong key when the admin API returns the address as host:port
    (advertised RPC port suffix) or as a bare IP — the lookup
    brokerMap[pod.GetName()] then misses, and the pod is silently
    treated as orphan. Now we strip the port via net.SplitHostPort and
    key by both the first DNS label and the raw host.

  • "Pod not in brokerMap" branch. The controller deleted the pod
    and set continueExecution=true, so every unmapped pod in rollSet
    was deleted in the same reconcile pass. Combined with the
    parsing bug above (or any transient cause that empties AllNodes
    rolling controller leadership, brief admin-API 503s, NodeID flip
    after PVC reset), this is the only single-reconcile path that
    produces multiple brokers literally in `Terminating` at the same
    time. Now the branch deletes one pod and requeues, so each pod's
    state is re-evaluated against fresh broker-map data before the next
    one is touched.

Added `TestHasRecentlyReplacedPods` to pin the existing safety-gate
contract — Redpanda's cluster health view lags pod state during a
roll, and without the gate two pods can end up unavailable
concurrently. (Pre-existing on main; introduced by #1446. Not the
same path as the parallel-deletion fix above — see Scope below.)

Scope

This PR addresses the only code path in the single-cluster controller
that can delete multiple broker pods in one reconcile: the
`!inBrokerMap` branch, which previously kept iterating after each
delete. The fix is unambiguously correctness:

  1. Parsing change — `net.SplitHostPort` + dual-key (first label
    and raw host) means a freshly-restarted broker advertising
    `pod.svc:33145` is no longer misclassified as orphan, even if some
    broker still advertises the older format. Strict superset of the
    previous keyspace; no regression for FQDN-only deployments.
  2. One-at-a-time — the orphan branch is itself low-risk per pod,
    but acting on every orphan in a single reconcile means a transient
    admin-API state (rolling leadership, mid-restart `AllNodes` window,
    address-format mismatch) tears down the whole pool. After this PR
    we delete one and requeue, so the next pod is judged against an
    updated map.

Probe-scheduling lag between cluster-`IsHealthy=true` and pod
`PodReady=True` is a separate concern handled by
`HasRecentlyReplacedPods` (already on `main`). The unit test added
here pins that contract but is not the customer-facing fix.

Follow-ups (not in this PR)

  • ENG-222 (per-broker prestart-risk API). The proper long-term fix
    replaces the cluster-`IsHealthy` heuristic with the per-broker risks
    endpoint introduced in core 25.1
    (`rf1_offline`, `full_acks_produce_unavailable`, `unavailable`,
    `acks1_data_loss`). The rpadmin client bindings are in
    common-go#170;
    the operator-side wiring follows once that lands.

  • Backport to `release/v26.1.x`. That branch is missing
    `HasRecentlyReplacedPods` entirely (added in Segment acceptance tests and add upgrade/rolling-restart test #1446). The smaller
    cherry-pick I'd propose is the gate + the two fixes in this PR.
    Production code applies cleanly to v25.2.x / v25.3.x / v26.1.x; the
    new unit test only compiles on v26.1.x because v25.x lacks the
    `MulticlusterStatefulSet` type used in the table. Will open the
    v26.1.x backport once this lands.

Test plan

  • `go test ./operator/internal/lifecycle/... -count=1` — including
    the new `TestHasRecentlyReplacedPods`
  • `go vet ./operator/internal/controller/redpanda/...
    ./operator/internal/lifecycle/...`
  • `golangci-lint run --fast-only ./internal/controller/redpanda/...
    ./internal/lifecycle/...` — 0 issues

🤖 Generated with Claude Code

The rolling restart loop in the single-cluster Redpanda controller had
two latent ways to delete more than one broker pod in a single
reconcile:

  - When a pod's name didn't appear in brokerMap, the loop deleted it
    and set continueExecution=true, meaning every unmapped pod in
    rollSet got deleted in the same pass. A transient admin-API blip,
    address-format mismatch, or ghost broker ID could classify multiple
    healthy pods as orphan and tear them all down at once. Now the
    branch deletes one and requeues so we re-evaluate against fresh
    state before touching the next pod.

  - The brokerMap parser keyed brokers by
    strings.Split(InternalRPCAddress, ".")[0], which silently produced
    the wrong key when the admin API returned the address as
    "host:port" or as a bare IP. Brokers in that state were treated as
    orphans by the lookup above. Now the parser strips the port via
    net.SplitHostPort and keys by both the first DNS label and the raw
    host, so a freshly-restarted broker advertising "pod.svc:33145" or
    an IP is still recognized.

Added a TestHasRecentlyReplacedPods unit test pinning the existing
safety-gate contract that defers a roll while a just-replaced pod is
still coming up — Redpanda's cluster health view lags pod state, and
without the gate two pods can end up Terminating concurrently.

ENG-222 (per-broker prestart-risk API) remains the proper long-term
fix; it needs the new endpoint exposed in common-go/rpadmin first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu
Copy link
Copy Markdown
Contributor Author

Will close for now as needs more discussion.

@david-yu david-yu closed this May 20, 2026
david-yu added a commit that referenced this pull request May 21, 2026
Incorporates the two correctness fixes from the (now-closed) #1537 into
the ENG-222 wiring stack:

  - brokerMap parsing now strips a "host:port" suffix via net.SplitHostPort
    and keys by both the first DNS label and the raw host. Previously
    strings.Split(InternalRPCAddress, ".")[0] silently mis-keyed when the
    admin API returned the address as "host:port" or a bare IP, causing
    the roll loop's lookup to miss and treat the broker as orphan.

  - The "pod not in brokerMap" branch now deletes one pod and requeues
    instead of deleting + continuing. Combined with the parsing bug above
    (or any transient cause that empties AllNodes — rolling controller
    leadership, brief admin-API 503s, NodeID flip after PVC reset), the
    previous behavior was the only single-reconcile path that could
    produce multiple brokers literally Terminating at the same time.

Also pins the existing HasRecentlyReplacedPods gate's contract with a
new TestHasRecentlyReplacedPods unit test in pool_test.go covering the
no-op, all-old-revision, all-Ready-new-revision, just-replaced-not-yet-
Ready, terminating-pod, and stuck-in-Pending cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant