OCPBUGS-76530: Fix intermittent etcd peer communication failures#8479
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Caution Review failedAn error occurred during the review process. Please try again later. WalkthroughThe pull request updates ReconcileEtcdPeerSecret to add two DNS SANs to the etcd peer certificate: Suggested Reviewers
🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8479 +/- ##
==========================================
+ Coverage 40.41% 41.94% +1.52%
==========================================
Files 755 769 +14
Lines 93235 96729 +3494
==========================================
+ Hits 37679 40570 +2891
- Misses 52854 53352 +498
- Partials 2702 2807 +105
... and 121 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go`:
- Around line 50-51: The code currently places the IP addresses "127.0.0.1" and
"::1" in the dnsNames slice; move them into an ips slice and pass that as the
ips parameter to reconcileSignedCertWithKeysAndAddresses (leave only DNS entries
in dnsNames). Update the call site that constructs the cert (reference
reconcileSignedCertWithKeysAndAddresses and the dnsNames variable) to pass ips
instead of embedding IPs in DNS, and modify the test in etcd_test.go to assert
cert.IPAddresses contains "127.0.0.1" and "::1" rather than checking
cert.DNSNames for those values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 07563a75-d22c-4fe1-a6eb-ec3094b49e1e
📒 Files selected for processing (2)
control-plane-operator/controllers/hostedcontrolplane/pki/etcd.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go
| "127.0.0.1", | ||
| "::1", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check the signature of reconcileSignedCertWithKeysAndAddresses and verify
# whether IP addresses should be in a separate parameter.
# Find the function definition
ast-grep --pattern $'func reconcileSignedCertWithKeysAndAddresses($$$) {
$$$
}'
# Also search for how IPAddresses field is populated in certificate generation
rg -n -A5 -B5 'IPAddresses.*=' --type=goRepository: openshift/hypershift
Length of output: 18256
🏁 Script executed:
# First, examine the etcd.go file around the lines in question
sed -n '40,65p' control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
# Also look for the function being called and its implementation
rg -n "reconcileSignedCertWithKeysAndAddresses" --type=go control-plane-operator/controllers/hostedcontrolplane/pki/Repository: openshift/hypershift
Length of output: 2936
🏁 Script executed:
# Look for etcd_test.go
fd -name "etcd_test.go" control-plane-operator/controllers/hostedcontrolplane/pki/
# If found, check what it asserts about the certificate
if [ -f control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go ]; then
cat control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go
fiRepository: openshift/hypershift
Length of output: 2462
Move IP addresses from dnsNames to ips parameter to comply with x509 certificate standards.
The IP addresses 127.0.0.1 and ::1 are included in the dnsNames slice, but x509 certificates have separate DNSNames and IPAddresses fields. The reconcileSignedCertWithKeysAndAddresses function explicitly accepts both parameters. Pass the IPs as the ips parameter instead of including them in dnsNames:
dnsNames := []string{
fmt.Sprintf("*.etcd-discovery.%s.svc", secret.Namespace),
fmt.Sprintf("*.etcd-discovery.%s.svc.cluster.local", secret.Namespace),
fmt.Sprintf("*.etcd-client.%s.svc", secret.Namespace),
fmt.Sprintf("*.etcd-client.%s.svc.cluster.local", secret.Namespace),
}
ips := []string{
"127.0.0.1",
"::1",
}
return reconcileSignedCertWithKeysAndAddresses(secret, ca, ownerRef, "etcd-discovery", []string{"kubernetes"}, X509UsageClientServerAuth, EtcdPeerCrtKey, EtcdPeerKeyKey, "", dnsNames, ips, "")Update the corresponding test assertion in etcd_test.go to check cert.IPAddresses instead of cert.DNSNames for the IP values.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go` around
lines 50 - 51, The code currently places the IP addresses "127.0.0.1" and "::1"
in the dnsNames slice; move them into an ips slice and pass that as the ips
parameter to reconcileSignedCertWithKeysAndAddresses (leave only DNS entries in
dnsNames). Update the call site that constructs the cert (reference
reconcileSignedCertWithKeysAndAddresses and the dnsNames variable) to pass ips
instead of embedding IPs in DNS, and modify the test in etcd_test.go to assert
cert.IPAddresses contains "127.0.0.1" and "::1" rather than checking
cert.DNSNames for those values.
There was a problem hiding this comment.
Valid observation, but this predates this PR, the IPs were already in dnsNames before this change. Moving them to ips would be a separate refactor with its own cert regeneration implications.
There was a problem hiding this comment.
@vsolanki12, that's a fair point — the 127.0.0.1/::1 placement in dnsNames is pre-existing technical debt outside the scope of this fix, and bundling another cert regeneration here would be unnecessarily disruptive.
Would you like me to open a follow-up GitHub issue to track the refactor of moving those IPs into the proper ips parameter of reconcileSignedCertWithKeysAndAddresses? That way it doesn't get lost.
✏️ Learnings added
Learnt from: vsolanki12
Repo: openshift/hypershift PR: 8479
File: control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go:50-51
Timestamp: 2026-05-12T06:51:42.347Z
Learning: In `control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go`, the IP addresses `127.0.0.1` and `::1` are intentionally placed in the `dnsNames` slice (rather than the `ips` parameter of `reconcileSignedCertWithKeysAndAddresses`) as pre-existing behavior that predates PR `#8479`. Moving them to the correct `ips` parameter is a known technical debt but is intentionally deferred to avoid triggering an additional etcd peer certificate regeneration cycle.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
@vsolanki12 mind opening a jira card to track IPs in dnsNames technical debit?
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
7d8ea25 to
26c7fc3
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go`:
- Around line 57-58: The etcd peer cert is currently placing "127.0.0.1" and
"::1" into DNSNames; move those into the IPAddresses parameter when calling
reconcileSignedCertWithKeysAndAddresses by removing them from dnsNames and
passing them as the ips argument (use []string{"127.0.0.1", "::1"}) in the
reconcileSignedCertWithKeysAndAddresses call for "etcd-discovery"; then adjust
the test in etcd_test.go to assert on cert.IPAddresses (using net.ParseIP for
each IP) instead of cert.DNSNames so the test verifies IPs are present in the
certificate's IPAddresses field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 340ac5a3-0b9f-4714-aa45-843df895f434
📒 Files selected for processing (2)
control-plane-operator/controllers/hostedcontrolplane/pki/etcd.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
| "127.0.0.1", | ||
| "::1", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check how ReconcileEtcdPeerSecret handles IP addresses in certificate generation
# Search for the implementation of ReconcileEtcdPeerSecret
ast-grep --pattern $'func ReconcileEtcdPeerSecret($$$) $$$'
# Search for certificate generation calls that might include IPs
rg -A 10 -B 5 "127\.0\.0\.1|::1" --type go --glob '**/pki/etcd.go'
# Look for DNSNames and IPAddresses field assignments
rg -A 5 "DNSNames|IPAddresses" --type go --glob '**/pki/etcd.go'Repository: openshift/hypershift
Length of output: 2726
🏁 Script executed:
# Find the reconcileSignedCertWithKeysAndAddresses function implementation
ast-grep --pattern $'func reconcileSignedCertWithKeysAndAddresses($$$) $$$'
# Also search for where DNSNames is actually set in the certificate
rg -B 5 -A 15 "DNSNames\s*=" --type go --glob '**/pki/*.go' | head -100
# Look for IPAddresses assignment in certificate generation
rg -B 5 -A 5 "IPAddresses\s*=" --type go --glob '**/pki/*.go'Repository: openshift/hypershift
Length of output: 1123
🏁 Script executed:
# Find where certs.ReconcileSignedCert is imported from
rg "import.*certs" --type go --glob '**/pki/cert.go' -A 20 | head -40
# Find the certs package and ReconcileSignedCert implementation
find . -name "*.go" -type f -exec grep -l "func ReconcileSignedCert" {} \;
# Search for how it populates certificate fields
rg -B 5 -A 20 "func ReconcileSignedCert" --type goRepository: openshift/hypershift
Length of output: 1324
🏁 Script executed:
# Get the full ReconcileSignedCert implementation
sed -n '1,300p' support/certs/tls.go | tail -200Repository: openshift/hypershift
Length of output: 6056
🏁 Script executed:
# Find where the CSR is generated with dnsNames and ips in ReconcileSignedCert
sed -n '/^func ReconcileSignedCert/,/^func [A-Z]/p' support/certs/tls.go | head -100Repository: openshift/hypershift
Length of output: 2748
🏁 Script executed:
# Get the test file context around lines 57-58
sed -n '35,75p' control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.goRepository: openshift/hypershift
Length of output: 1305
Fix IP address handling in etcd peer certificate generation.
Lines 50–51 in etcd.go incorrectly add IP addresses ("127.0.0.1", "::1") to the dnsNames list. Per X.509 standards, IP addresses must be in the IPAddresses field, not DNSNames. The function signature supports a separate ips parameter for this purpose.
Fix the implementation by passing IPs via the dedicated parameter:
return reconcileSignedCertWithKeysAndAddresses(secret, ca, ownerRef, "etcd-discovery", []string{"kubernetes"}, X509UsageClientServerAuth, EtcdPeerCrtKey, EtcdPeerKeyKey, "", dnsNames, []string{"127.0.0.1", "::1"}, "")Then update the test assertion to check the correct certificate field:
g.Expect(cert.IPAddresses).To(ContainElements(
net.ParseIP("127.0.0.1"),
net.ParseIP("::1"),
))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go`
around lines 57 - 58, The etcd peer cert is currently placing "127.0.0.1" and
"::1" into DNSNames; move those into the IPAddresses parameter when calling
reconcileSignedCertWithKeysAndAddresses by removing them from dnsNames and
passing them as the ips argument (use []string{"127.0.0.1", "::1"}) in the
reconcileSignedCertWithKeysAndAddresses call for "etcd-discovery"; then adjust
the test in etcd_test.go to assert on cert.IPAddresses (using net.ParseIP for
each IP) instead of cert.DNSNames so the test verifies IPs are present in the
certificate's IPAddresses field.
|
/retest hypershift-operator-main |
|
/retest |
2 similar comments
|
/retest |
|
/retest |
|
/retest |
|
/test e2e-azure-v2-self-managed |
|
/retest |
26c7fc3 to
4b1d450
Compare
|
/jira refresh |
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
/retest |
|
/retest |
| fmt.Sprintf("*.etcd-discovery.%s.svc.cluster.local", secret.Namespace), | ||
| // etcd-client headless service shares pod IPs with etcd-discovery; reverse DNS may resolve | ||
| // to etcd-client, so peer certs must include both service SANs to pass TLS verification. | ||
| fmt.Sprintf("*.etcd-client.%s.svc", secret.Namespace), |
There was a problem hiding this comment.
If we are switching the etcd-client service to be clusterIP instead of headless, then there should no longer be ambiguity with the reverse dns lookup and this change is not needed.
There was a problem hiding this comment.
Agreed, removed the etcd-client SANs. Now first commit only adds unit tests for the peer certificate with a comment explaining why etcd-client SANs aren't needed ClusterIP service doesn't create per-pod PTR records. The second commit keeps the ClusterIP service change.
2511ea2 to
aa6a397
Compare
|
@vsolanki12 may you update the PR description mentioning the clusterIP? |
| // TODO(OCPBUGS-86648): assert on cert.IPAddresses instead once IPs are moved out of dnsNames. | ||
| "127.0.0.1", | ||
| "::1", | ||
| )) |
There was a problem hiding this comment.
NIT: g.Expect(cert.DNSNames).ToNot(ContainElement(ContainSubstring("etcd-client")))
There was a problem hiding this comment.
Thank you @sdminonne , I have updated both the suggestions.
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: geliu2016. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Add unit tests for ReconcileEtcdPeerSecret to verify the peer certificate includes correct etcd-discovery SANs and key usages. Include a comment explaining why etcd-client SANs are not needed: the etcd-client service uses ClusterIP (not headless), so it does not create per-pod PTR records that would cause reverse DNS ambiguity during peer TLS verification. Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
Both etcd-client and etcd-discovery headless services select the same etcd pods, causing dual PTR records per pod IP. Converting etcd-client to a regular ClusterIP service eliminates the duplicate PTR records at the root cause, complementing the SAN-based fix. Co-Authored-By: Cesar Wong <cewong@redhat.com> Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
aa6a397 to
9a5a0f8
Compare
|
/retest |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: csrwng, vsolanki12 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
|
Scheduling tests matching the |
|
/retest |
|
Now I have all the data needed. Here is the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryAll 11 e2e test failures are pre-existing flakes unrelated to the PR changes. The PR modifies only etcd PKI code ( Root CauseThree independent pre-existing flakes, none related to PR #8479's etcd changes: Root Cause 1 — Karpenter Kubelet TLS Error (4 leaf failures → 6 with cascading parents): Root Cause 2 — kubelet-config-verifier DaemonSet Timeout (2 leaf failures → 5 with cascading parents): Root Cause 3 — NodeClaim Count Mismatch After Drift (1 leaf failure → 2 with cascading parent): Konflux Recommendations
Evidence
|
|
/retest |
|
/verified by @vsolanki12 Before fix: After fix: |
|
@vsolanki12: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@vsolanki12: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@vsolanki12: Jira Issue Verification Checks: Jira Issue OCPBUGS-76530 Jira Issue OCPBUGS-76530 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in release 5.0.0-0.nightly-2026-06-26-152608 |
What this PR does / why we need it:
Both
etcd-discoveryandetcd-clientwere headless services selecting the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included*.etcd-discoverywildcard SANs, so when Go'sgetnameinfo()non-deterministically returned theetcd-clientPTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).This PR switches the
etcd-clientservice from headless (clusterIP: None) to a regular ClusterIP service, which eliminates the per-pod PTR records that caused the reverse DNS ambiguity. With this change, onlyetcd-discoverycreates per-pod PTR records, so peer TLS verification consistently resolves to the correct SAN.Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-76530
Special notes for your reviewer:
etcd-clientservice is switched from headless to ClusterIP — this removes per-pod PTR records and resolves the dual-PTR ambiguity127.0.0.1,::1) are currently passed indnsNamesinstead ofipsChecklist:
Authored by Vimal Solanki