Skip to content

OCPBUGS-76530: Fix intermittent etcd peer communication failures#8479

Merged
openshift-merge-bot[bot] merged 2 commits into
openshift:mainfrom
vsolanki12:OCPBUGS-76530-etcd-peer-cert-san
Jun 26, 2026
Merged

OCPBUGS-76530: Fix intermittent etcd peer communication failures#8479
openshift-merge-bot[bot] merged 2 commits into
openshift:mainfrom
vsolanki12:OCPBUGS-76530-etcd-peer-cert-san

Conversation

@vsolanki12

@vsolanki12 vsolanki12 commented May 11, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Both etcd-discovery and etcd-client were headless services selecting the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included *.etcd-discovery wildcard SANs, so when Go's getnameinfo() non-deterministically returned the etcd-client PTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).

This PR switches the etcd-client service from headless (clusterIP: None) to a regular ClusterIP service, which eliminates the per-pod PTR records that caused the reverse DNS ambiguity. With this change, only etcd-discovery creates per-pod PTR records, so peer TLS verification consistently resolves to the correct SAN.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-76530

Special notes for your reviewer:

  • The etcd-client service is switched from headless to ClusterIP — this removes per-pod PTR records and resolves the dual-PTR ambiguity
  • etcd-client SANs are intentionally not added to the peer cert since they are no longer needed with the ClusterIP change
  • A unit test explicitly asserts that etcd-client SANs are absent from the peer certificate
  • The ClusterIP change was originally authored by @csrwng in NO-JIRA: fix(cpo): switch etcd-client to ClusterIP service #8612 and incorporated here
  • This is a HyperShift-specific issue — standalone OCP etcd uses static pods with node IPs, no dual headless services
  • Tech debt tracked in OCPBUGS-86648: IPs (127.0.0.1, ::1) are currently passed in dnsNames instead of ips

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Authored by Vimal Solanki

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 11, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 11, 2026
@openshift-ci

openshift-ci Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

Both etcd-discovery and etcd-client headless services select the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included *.etcd-discovery wildcard SANs, so when Go's getnameinfo() non-deterministically returned the etcd-client PTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).

This PR adds *.etcd-client.{ns}.svc and *.etcd-client.{ns}.svc.cluster.local SANs to the etcd peer certificate, ensuring both possible PTR lookup results match.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-76530

Special notes for your reviewer:

  • The fix triggers a one-time etcd peer certificate regeneration on first CPO reconciliation after deploy
  • The config hash annotation change causes a rolling restart of etcd pods (safe in HA — quorum maintained)
  • This is a HyperShift-specific issue — standalone OCP etcd uses static pods with node IPs, no dual headless services
  • The server cert (ReconcileEtcdServerSecret) already covers both services; the peer cert was the only gap

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Authored by Vimal Solanki

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

The pull request updates ReconcileEtcdPeerSecret to add two DNS SANs to the etcd peer certificate: *.etcd-client.<namespace>.svc and *.etcd-client.<namespace>.svc.cluster.local. It also adds a unit test TestReconcileEtcdPeerSecret which generates a CA, invokes the reconciler on an empty peer secret, and asserts the resulting certificate contains the expected DNSNames (including discovery and client wildcards plus 127.0.0.1 and ::1) and both client and server extended key usages.

Suggested Reviewers

  • rtheis
  • csrwng
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main change: adding DNS SANs to the etcd peer certificate to fix intermittent peer communication failures caused by non-deterministic PTR record resolution.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR does not use Ginkgo testing framework. It uses standard Go testing with t.Run() and gomega matchers. The test names are static, descriptive, and deterministic with no dynamic data.
Test Structure And Quality ✅ Passed The custom check requires Ginkgo test structure review. However, the test uses standard Go testing with t.Run() subtests, not Ginkgo. The check is not applicable to this test pattern.
Microshift Test Compatibility ✅ Passed The PR adds a standard Go unit test, not Ginkgo e2e tests. The MicroShift check applies only to Ginkgo e2e tests. Not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. Only a standard Go unit test (TestReconcileEtcdPeerSecret) that tests internal certificate generation without topology assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies certificate generation only, adding DNS SANs to etcd peer certs. No scheduling constraints introduced; not applicable to topology check.
Ote Binary Stdout Contract ✅ Passed PR is not an OTE binary test. Changes are to operational code (etcd.go certificate logic) and standard Go unit tests. No process-level code modified. No stdout violations detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This is a standard Go unit test, not a Ginkgo e2e test. The custom check only applies to "new Ginkgo e2e tests" (with It(), Describe(), Context(), When() markers). This test does not use Ginkgo.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 11, 2026
@codecov

codecov Bot commented May 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.94%. Comparing base (7625684) to head (9a5a0f8).
⚠️ Report is 398 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8479      +/-   ##
==========================================
+ Coverage   40.41%   41.94%   +1.52%     
==========================================
  Files         755      769      +14     
  Lines       93235    96729    +3494     
==========================================
+ Hits        37679    40570    +2891     
- Misses      52854    53352     +498     
- Partials     2702     2807     +105     
Files with missing lines Coverage Δ
...perator/controllers/hostedcontrolplane/pki/etcd.go 44.44% <100.00%> (+44.44%) ⬆️

... and 121 files with indirect coverage changes

Flag Coverage Δ
cmd-support 35.46% <ø> (+1.02%) ⬆️
cpo-hostedcontrolplane 44.90% <100.00%> (+3.13%) ⬆️
cpo-other 44.70% <ø> (+4.38%) ⬆️
hypershift-operator 50.17% <ø> (-0.56%) ⬇️
other 31.69% <ø> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go`:
- Around line 50-51: The code currently places the IP addresses "127.0.0.1" and
"::1" in the dnsNames slice; move them into an ips slice and pass that as the
ips parameter to reconcileSignedCertWithKeysAndAddresses (leave only DNS entries
in dnsNames). Update the call site that constructs the cert (reference
reconcileSignedCertWithKeysAndAddresses and the dnsNames variable) to pass ips
instead of embedding IPs in DNS, and modify the test in etcd_test.go to assert
cert.IPAddresses contains "127.0.0.1" and "::1" rather than checking
cert.DNSNames for those values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 07563a75-d22c-4fe1-a6eb-ec3094b49e1e

📥 Commits

Reviewing files that changed from the base of the PR and between e66488b and 7d8ea25.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go

Comment on lines 50 to 51
"127.0.0.1",
"::1",

@coderabbitai coderabbitai Bot May 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check the signature of reconcileSignedCertWithKeysAndAddresses and verify 
# whether IP addresses should be in a separate parameter.

# Find the function definition
ast-grep --pattern $'func reconcileSignedCertWithKeysAndAddresses($$$) {
  $$$
}'

# Also search for how IPAddresses field is populated in certificate generation
rg -n -A5 -B5 'IPAddresses.*=' --type=go

Repository: openshift/hypershift

Length of output: 18256


🏁 Script executed:

# First, examine the etcd.go file around the lines in question
sed -n '40,65p' control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go

# Also look for the function being called and its implementation
rg -n "reconcileSignedCertWithKeysAndAddresses" --type=go control-plane-operator/controllers/hostedcontrolplane/pki/

Repository: openshift/hypershift

Length of output: 2936


🏁 Script executed:

# Look for etcd_test.go
fd -name "etcd_test.go" control-plane-operator/controllers/hostedcontrolplane/pki/

# If found, check what it asserts about the certificate
if [ -f control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go ]; then
  cat control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go
fi

Repository: openshift/hypershift

Length of output: 2462


Move IP addresses from dnsNames to ips parameter to comply with x509 certificate standards.

The IP addresses 127.0.0.1 and ::1 are included in the dnsNames slice, but x509 certificates have separate DNSNames and IPAddresses fields. The reconcileSignedCertWithKeysAndAddresses function explicitly accepts both parameters. Pass the IPs as the ips parameter instead of including them in dnsNames:

dnsNames := []string{
    fmt.Sprintf("*.etcd-discovery.%s.svc", secret.Namespace),
    fmt.Sprintf("*.etcd-discovery.%s.svc.cluster.local", secret.Namespace),
    fmt.Sprintf("*.etcd-client.%s.svc", secret.Namespace),
    fmt.Sprintf("*.etcd-client.%s.svc.cluster.local", secret.Namespace),
}
ips := []string{
    "127.0.0.1",
    "::1",
}

return reconcileSignedCertWithKeysAndAddresses(secret, ca, ownerRef, "etcd-discovery", []string{"kubernetes"}, X509UsageClientServerAuth, EtcdPeerCrtKey, EtcdPeerKeyKey, "", dnsNames, ips, "")

Update the corresponding test assertion in etcd_test.go to check cert.IPAddresses instead of cert.DNSNames for the IP values.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go` around
lines 50 - 51, The code currently places the IP addresses "127.0.0.1" and "::1"
in the dnsNames slice; move them into an ips slice and pass that as the ips
parameter to reconcileSignedCertWithKeysAndAddresses (leave only DNS entries in
dnsNames). Update the call site that constructs the cert (reference
reconcileSignedCertWithKeysAndAddresses and the dnsNames variable) to pass ips
instead of embedding IPs in DNS, and modify the test in etcd_test.go to assert
cert.IPAddresses contains "127.0.0.1" and "::1" rather than checking
cert.DNSNames for those values.

@vsolanki12 vsolanki12 May 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation, but this predates this PR, the IPs were already in dnsNames before this change. Moving them to ips would be a separate refactor with its own cert regeneration implications.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsolanki12, that's a fair point — the 127.0.0.1/::1 placement in dnsNames is pre-existing technical debt outside the scope of this fix, and bundling another cert regeneration here would be unnecessarily disruptive.

Would you like me to open a follow-up GitHub issue to track the refactor of moving those IPs into the proper ips parameter of reconcileSignedCertWithKeysAndAddresses? That way it doesn't get lost.


✏️ Learnings added
Learnt from: vsolanki12
Repo: openshift/hypershift PR: 8479
File: control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go:50-51
Timestamp: 2026-05-12T06:51:42.347Z
Learning: In `control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go`, the IP addresses `127.0.0.1` and `::1` are intentionally placed in the `dnsNames` slice (rather than the `ips` parameter of `reconcileSignedCertWithKeysAndAddresses`) as pre-existing behavior that predates PR `#8479`. Moving them to the correct `ips` parameter is a known technical debt but is intentionally deferred to avoid triggering an additional etcd peer certificate regeneration cycle.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsolanki12 mind opening a jira card to track IPs in dnsNames technical debit?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from 7d8ea25 to 26c7fc3 Compare May 12, 2026 06:55

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go`:
- Around line 57-58: The etcd peer cert is currently placing "127.0.0.1" and
"::1" into DNSNames; move those into the IPAddresses parameter when calling
reconcileSignedCertWithKeysAndAddresses by removing them from dnsNames and
passing them as the ips argument (use []string{"127.0.0.1", "::1"}) in the
reconcileSignedCertWithKeysAndAddresses call for "etcd-discovery"; then adjust
the test in etcd_test.go to assert on cert.IPAddresses (using net.ParseIP for
each IP) instead of cert.DNSNames so the test verifies IPs are present in the
certificate's IPAddresses field.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 340ac5a3-0b9f-4714-aa45-843df895f434

📥 Commits

Reviewing files that changed from the base of the PR and between 7d8ea25 and 26c7fc3.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/pki/etcd.go

Comment on lines +57 to +58
"127.0.0.1",
"::1",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check how ReconcileEtcdPeerSecret handles IP addresses in certificate generation

# Search for the implementation of ReconcileEtcdPeerSecret
ast-grep --pattern $'func ReconcileEtcdPeerSecret($$$) $$$'

# Search for certificate generation calls that might include IPs
rg -A 10 -B 5 "127\.0\.0\.1|::1" --type go --glob '**/pki/etcd.go'

# Look for DNSNames and IPAddresses field assignments
rg -A 5 "DNSNames|IPAddresses" --type go --glob '**/pki/etcd.go'

Repository: openshift/hypershift

Length of output: 2726


🏁 Script executed:

# Find the reconcileSignedCertWithKeysAndAddresses function implementation
ast-grep --pattern $'func reconcileSignedCertWithKeysAndAddresses($$$) $$$'

# Also search for where DNSNames is actually set in the certificate
rg -B 5 -A 15 "DNSNames\s*=" --type go --glob '**/pki/*.go' | head -100

# Look for IPAddresses assignment in certificate generation
rg -B 5 -A 5 "IPAddresses\s*=" --type go --glob '**/pki/*.go'

Repository: openshift/hypershift

Length of output: 1123


🏁 Script executed:

# Find where certs.ReconcileSignedCert is imported from
rg "import.*certs" --type go --glob '**/pki/cert.go' -A 20 | head -40

# Find the certs package and ReconcileSignedCert implementation
find . -name "*.go" -type f -exec grep -l "func ReconcileSignedCert" {} \;

# Search for how it populates certificate fields
rg -B 5 -A 20 "func ReconcileSignedCert" --type go

Repository: openshift/hypershift

Length of output: 1324


🏁 Script executed:

# Get the full ReconcileSignedCert implementation
sed -n '1,300p' support/certs/tls.go | tail -200

Repository: openshift/hypershift

Length of output: 6056


🏁 Script executed:

# Find where the CSR is generated with dnsNames and ips in ReconcileSignedCert
sed -n '/^func ReconcileSignedCert/,/^func [A-Z]/p' support/certs/tls.go | head -100

Repository: openshift/hypershift

Length of output: 2748


🏁 Script executed:

# Get the test file context around lines 57-58
sed -n '35,75p' control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go

Repository: openshift/hypershift

Length of output: 1305


Fix IP address handling in etcd peer certificate generation.

Lines 50–51 in etcd.go incorrectly add IP addresses ("127.0.0.1", "::1") to the dnsNames list. Per X.509 standards, IP addresses must be in the IPAddresses field, not DNSNames. The function signature supports a separate ips parameter for this purpose.

Fix the implementation by passing IPs via the dedicated parameter:

return reconcileSignedCertWithKeysAndAddresses(secret, ca, ownerRef, "etcd-discovery", []string{"kubernetes"}, X509UsageClientServerAuth, EtcdPeerCrtKey, EtcdPeerKeyKey, "", dnsNames, []string{"127.0.0.1", "::1"}, "")

Then update the test assertion to check the correct certificate field:

g.Expect(cert.IPAddresses).To(ContainElements(
    net.ParseIP("127.0.0.1"),
    net.ParseIP("::1"),
))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/pki/etcd_test.go`
around lines 57 - 58, The etcd peer cert is currently placing "127.0.0.1" and
"::1" into DNSNames; move those into the IPAddresses parameter when calling
reconcileSignedCertWithKeysAndAddresses by removing them from dnsNames and
passing them as the ips argument (use []string{"127.0.0.1", "::1"}) in the
reconcileSignedCertWithKeysAndAddresses call for "etcd-discovery"; then adjust
the test in etcd_test.go to assert on cert.IPAddresses (using net.ParseIP for
each IP) instead of cert.DNSNames so the test verifies IPs are present in the
certificate's IPAddresses field.

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest hypershift-operator-main

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

2 similar comments
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12 vsolanki12 marked this pull request as ready for review May 13, 2026 05:51
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2026
@openshift-ci openshift-ci Bot requested review from enxebre and sdminonne May 13, 2026 05:51
@sdminonne

Copy link
Copy Markdown
Contributor

/retest

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-v2-self-managed

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from 26c7fc3 to 4b1d450 Compare May 14, 2026 15:37
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 15, 2026
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

fmt.Sprintf("*.etcd-discovery.%s.svc.cluster.local", secret.Namespace),
// etcd-client headless service shares pod IPs with etcd-discovery; reverse DNS may resolve
// to etcd-client, so peer certs must include both service SANs to pass TLS verification.
fmt.Sprintf("*.etcd-client.%s.svc", secret.Namespace),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are switching the etcd-client service to be clusterIP instead of headless, then there should no longer be ambiguity with the reverse dns lookup and this change is not needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, removed the etcd-client SANs. Now first commit only adds unit tests for the peer certificate with a comment explaining why etcd-client SANs aren't needed ClusterIP service doesn't create per-pod PTR records. The second commit keeps the ClusterIP service change.

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from 2511ea2 to aa6a397 Compare June 18, 2026 06:40
@sdminonne

Copy link
Copy Markdown
Contributor

@vsolanki12 may you update the PR description mentioning the clusterIP?

// TODO(OCPBUGS-86648): assert on cert.IPAddresses instead once IPs are moved out of dnsNames.
"127.0.0.1",
"::1",
))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: g.Expect(cert.DNSNames).ToNot(ContainElement(ContainSubstring("etcd-client")))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sdminonne , I have updated both the suggestions.

@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

Details

In response to this:

What this PR does / why we need it:

Both etcd-discovery and etcd-client were headless services selecting the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included *.etcd-discovery wildcard SANs, so when Go's getnameinfo() non-deterministically returned the etcd-client PTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).

This PR switches the etcd-client service from headless (clusterIP: None) to a regular ClusterIP service, which eliminates the per-pod PTR records that caused the reverse DNS ambiguity. With this change, only etcd-discovery creates per-pod PTR records, so peer TLS verification consistently resolves to the correct SAN.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-76530

Special notes for your reviewer:

  • The etcd-client service is switched from headless to ClusterIP — this removes per-pod PTR records and resolves the dual-PTR ambiguity
  • etcd-client SANs are intentionally not added to the peer cert since they are no longer needed with the ClusterIP change
  • A unit test explicitly asserts that etcd-client SANs are absent from the peer certificate
  • The ClusterIP change was originally authored by @csrwng in NO-JIRA: fix(cpo): switch etcd-client to ClusterIP service #8612 and incorporated here
  • This is a HyperShift-specific issue — standalone OCP etcd uses static pods with node IPs, no dual headless services
  • Tech debt tracked in OCPBUGS-86648: IPs (127.0.0.1, ::1) are currently passed in dnsNames instead of ips

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Authored by Vimal Solanki

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: geliu2016.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@vsolanki12: This pull request references Jira Issue OCPBUGS-76530, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

What this PR does / why we need it:

Both etcd-discovery and etcd-client were headless services selecting the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included *.etcd-discovery wildcard SANs, so when Go's getnameinfo() non-deterministically returned the etcd-client PTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).

This PR switches the etcd-client service from headless (clusterIP: None) to a regular ClusterIP service, which eliminates the per-pod PTR records that caused the reverse DNS ambiguity. With this change, only etcd-discovery creates per-pod PTR records, so peer TLS verification consistently resolves to the correct SAN.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-76530

Special notes for your reviewer:

  • The etcd-client service is switched from headless to ClusterIP — this removes per-pod PTR records and resolves the dual-PTR ambiguity
  • etcd-client SANs are intentionally not added to the peer cert since they are no longer needed with the ClusterIP change
  • A unit test explicitly asserts that etcd-client SANs are absent from the peer certificate
  • The ClusterIP change was originally authored by @csrwng in NO-JIRA: fix(cpo): switch etcd-client to ClusterIP service #8612 and incorporated here
  • This is a HyperShift-specific issue — standalone OCP etcd uses static pods with node IPs, no dual headless services
  • Tech debt tracked in OCPBUGS-86648: IPs (127.0.0.1, ::1) are currently passed in dnsNames instead of ips

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Authored by Vimal Solanki

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

vsolanki12 and others added 2 commits June 24, 2026 14:29
Add unit tests for ReconcileEtcdPeerSecret to verify the peer
certificate includes correct etcd-discovery SANs and key usages.

Include a comment explaining why etcd-client SANs are not needed:
the etcd-client service uses ClusterIP (not headless), so it does
not create per-pod PTR records that would cause reverse DNS ambiguity
during peer TLS verification.

Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
Both etcd-client and etcd-discovery headless services select the same
etcd pods, causing dual PTR records per pod IP. Converting etcd-client
to a regular ClusterIP service eliminates the duplicate PTR records at
the root cause, complementing the SAN-based fix.

Co-Authored-By: Cesar Wong <cewong@redhat.com>
Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
@vsolanki12 vsolanki12 force-pushed the OCPBUGS-76530-etcd-peer-cert-san branch from aa6a397 to 9a5a0f8 Compare June 24, 2026 09:00
@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@csrwng

csrwng commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

/approve

@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, vsolanki12

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2026
@sdminonne

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@hypershift-jira-solve-ci

Copy link
Copy Markdown

Now I have all the data needed. Here is the final report:

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2070007124049530880
  • Target: e2e-aws
  • PR: OCPBUGS-76530: Fix intermittent etcd peer communication failures #8479 (OCPBUGS-76530: Fix intermittent etcd peer communication failures)
  • Base SHA: e771dc6
  • Test Run Start: 2026-06-25T04:52:11Z
  • Failed Tests: 11 (of 575 total, 30 skipped) — 3 distinct root causes
  • Konflux Check: enterprise-contract-mce-50 / hypershift-cli-mce-50 — integration test failure (Konflux infrastructure, unrelated to PR)

Test Failure Analysis

Error

1) TestKarpenter/.../OpenshiftEC2NodeClass_Kubelet_propagation:
   Get "https://10.0.134.217:10250/containerLogs/kube-system/kubelet-config-checker/checker": remote error: tls: internal error

2) TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_..._propagate_to_guest_without_rollout:
   failed to wait for DaemonSet kubelet-config-verifier to be ready: context deadline exceeded

3) TestKarpenterUpgradeControlPlane/Main:
   expected 1 NodeClaims, got 2

Summary

All 11 e2e test failures are pre-existing flakes unrelated to the PR changes. The PR modifies only etcd PKI code (pki/etcd.go, pki/etcd_test.go) and the etcd client service manifest (service.yaml), changing it from headless to ClusterIP. None of the 3 failing test areas (Karpenter kubelet config propagation, GlobalPullSecret DaemonSet readiness, KarpenterUpgradeControlPlane NodeClaim drift) touch etcd PKI or service configuration. The Karpenter TLS error was independently confirmed and fixed by PR #8805 (merged the same day, after this test run). The Konflux enterprise-contract-mce-50 failure is an integration test infrastructure issue unrelated to code changes. A /retest should clear these failures.

Root Cause

Three independent pre-existing flakes, none related to PR #8479's etcd changes:

Root Cause 1 — Karpenter Kubelet TLS Error (4 leaf failures → 6 with cascading parents):
The TestKarpenter/Main/Parallel_provisioning_tests/OpenshiftEC2NodeClass_Kubelet_propagation test provisions a Karpenter node with custom kubelet config and then calls GetLogs to verify the kubelet config via a kubelet-config-checker pod. The proxied log request to the kubelet's serving endpoint (https://10.0.134.217:10250/containerLogs/...) fails with remote error: tls: internal error because the kubelet serving certificate on freshly provisioned Karpenter nodes may not be ready immediately after the node reports Ready. This is a known race condition that was independently identified and fixed by PR #8805 (OCPBUGS-91656, merged 2026-06-25T07:40:18Z) by wrapping the GetLogs call in g.Eventually with a 2-minute retry timeout. PR #8479's e2e-aws run started at 04:52 UTC — nearly 3 hours before the fix merged — so the base SHA e771dc6 did not include the fix.

Root Cause 2 — kubelet-config-verifier DaemonSet Timeout (2 leaf failures → 5 with cascading parents):
In TestCreateCluster/Main/EnsureGlobalPullSecret, the first subtest (When_management-cluster_...propagate_to_guest_without_rollout) creates a kubelet-config-verifier DaemonSet to verify pull secret propagation. The DaemonSet gets stuck at 2/3 pods ready for the entire 20-minute timeout (likely one node was slow to schedule or the DaemonSet pod was evicted). When the subtest times out with context deadline exceeded, it does not clean up the DaemonSet. The subsequent subtest (Check_if_the_config.json_is_correct_in_all_of_the_nodes) then tries to create the same DaemonSet and immediately fails with a 409 AlreadyExists error. This is a test infrastructure bug — the DaemonSet creation should use CreateOrUpdate or delete the DaemonSet before recreating it. This has no relation to etcd peer cert SANs.

Root Cause 3 — NodeClaim Count Mismatch After Drift (1 leaf failure → 2 with cascading parent):
TestKarpenterUpgradeControlPlane/Main upgrades the control plane, which triggers Karpenter node drift. After the upgrade, the test waits for NodeClaims to converge to 1 (the replacement node), but finds 2 NodeClaims — the old drifted NodeClaim has not been fully cleaned up within the 5-minute timeout. This is a timing-sensitive Karpenter drift reconciliation issue, not related to etcd changes.

Konflux enterprise-contract-mce-50 Failure:
The Konflux enterprise contract validation (enterprise-contract-mce-50-qgqnt) is a post-build integration test that validates container image signing and provenance policies for the hypershift-cli-mce-50 snapshot. The build step (hypershift-cli-mce-50-on-pull-request) passed successfully. The enterprise contract validation failure is a Konflux infrastructure issue — this check shows as "skipping" on all other current open PRs (#8826, #8823, #8827), indicating it only triggers under certain snapshot conditions and its failure is not caused by code changes.

Recommendations
  1. Retest the Prow job: Run /retest e2e-aws — the Karpenter TLS fix (PR OCPBUGS-91656: fix(test): add retry logic to GetLogs in Karpenter kubelet propagatio… #8805) is now merged into main and will be included in the new test run's base SHA, which should eliminate Root Cause 1 (6 of 11 failures).

  2. Root Cause 2 is a known test bug: The kubelet-config-verifier DaemonSet cleanup issue in globalps.go should be tracked. The test should use CreateOrUpdate instead of Create for the DaemonSet, or ensure cleanup in the preceding subtest's defer. This is independent of this PR.

  3. Root Cause 3 is a timing flake: The TestKarpenterUpgradeControlPlane NodeClaim drift reconciliation timeout (5 minutes) may be too tight for some environments. Consider extending the timeout or adding retry logic.

  4. Konflux enterprise-contract: No action needed — this is a Konflux infrastructure issue that resolves on retest or is handled by the Konflux team. The build step passed, confirming the code compiles and produces valid images.

  5. No code changes needed in this PR: The etcd service and PKI changes are correctly scoped and all unit tests pass. The e2e failures are in completely unrelated test areas (Karpenter, GlobalPullSecret).

Evidence
Evidence Detail
PR #8479 changed files pki/etcd.go, pki/etcd_test.go, 5 fixture YAMLs, etcd/service.yaml — all etcd-specific, no test framework changes
Failed test area Karpenter kubelet config, GlobalPullSecret DaemonSet, Karpenter upgrade drift — none touch etcd
TLS error fix PR #8805 (OCPBUGS-91656) merged 2026-06-25T07:40:18Z, adds retry to GetLogs for kubelet TLS readiness
e2e-aws start time 2026-06-25T04:52:11Z — ~3 hours before TLS fix merged
PR #8805 e2e-aws result PASSED — confirms the fix resolves the Karpenter TLS failures
Specific TLS error Get "https://10.0.134.217:10250/containerLogs/kube-system/kubelet-config-checker/checker": remote error: tls: internal error
DaemonSet timeout kubelet-config-verifier stuck at 2/3 pods ready for 20+ minutes, then context deadline exceeded
DaemonSet conflict Subsequent subtest fails with daemonsets.apps "kubelet-config-verifier" already exists (HTTP 409)
NodeClaim mismatch expected 1 NodeClaims, got 2 after Karpenter drift reconciliation timeout (5m)
Other e2e jobs on PR e2e-v2-aws ✅, e2e-v2-gke ✅, e2e-aks ✅, e2e-azure-v2-self-managed ✅, e2e-kubevirt-aws-ovn-reduced ✅ — all passed
Konflux build step hypershift-cli-mce-50-on-pull-request: PASSED — image built successfully
Konflux enterprise-contract enterprise-contract-mce-50-qgqnt: FAILED — integration test for snapshot validation, unrelated to code
Konflux on other PRs Shows "skipping" on PRs #8826, #8823, #8827 — not consistently triggered
Test results 575 tests total, 30 skipped, 11 failures (all pre-existing flakes)

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/retest

@vsolanki12

Copy link
Copy Markdown
Contributor Author

/verified by @vsolanki12

Before fix:

$ oc get svc -n clusters-dhruvhcp -l app=etcd
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
etcd-client   ClusterIP   None         <none>        2379/TCP,2381/TCP   3h13m

Confirm duplicate PTR records
$ POD_IP=$(oc get pod etcd-0 -n clusters-dhruvhcp -o jsonpath='{.status.podIP}')
$ oc exec -n clusters-dhruvhcp etcd-0 -c etcd -- nslookup $POD_IP
15.2.130.10.in-addr.arpa    name = etcd-0.etcd-discovery.clusters-dhruvhcp.svc.cluster.local.
15.2.130.10.in-addr.arpa    name = 10-130-2-15.etcd-client.clusters-dhruvhcp.svc.cluster.local.

Confirm peer cert missing etcd-client SANs
$ oc get secret etcd-peer-tls -n clusters-dhruvhcp \
    -o jsonpath='{.data.peer\.crt}' | base64 -d | \
    openssl x509 -noout -text | grep -A 5 "Subject Alternative Name"
            X509v3 Subject Alternative Name:
                DNS:*.etcd-discovery.clusters-dhruvhcp.svc,
                DNS:*.etcd-discovery.clusters-dhruvhcp.svc.cluster.local,
                DNS:127.0.0.1, DNS:::1
No *.etcd-client SANs present

After fix:

etcd-client service is ClusterIP (not headless)
$ oc get service etcd-client -n clusters-etcd-test -o jsonpath='{.spec.clusterIP}' && echo
172.xxx.xxx.xxx <-- ClusterIP assigned 

Peer cert does NOT include etcd-client SANs
$ oc get secret etcd-peer-tls -n clusters-etcd-test \
    -o jsonpath='{.data.peer\.crt}' | base64 -d | \
    openssl x509 -noout -text | grep DNS
            DNS:*.etcd-discovery.clusters-etcd-test.svc,
            DNS:*.etcd-discovery.clusters-etcd-test.svc.cluster.local,
            DNS:127.0.0.1, DNS:::1
Only *.etcd-discovery SANs present, no etcd-client SANs 

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 26, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: This PR has been marked as verified by @vsolanki12.

Details

In response to this:

/verified by @vsolanki12

Before fix:

$ oc get svc -n clusters-dhruvhcp -l app=etcd
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
etcd-client   ClusterIP   None         <none>        2379/TCP,2381/TCP   3h13m

Confirm duplicate PTR records
$ POD_IP=$(oc get pod etcd-0 -n clusters-dhruvhcp -o jsonpath='{.status.podIP}')
$ oc exec -n clusters-dhruvhcp etcd-0 -c etcd -- nslookup $POD_IP
15.2.130.10.in-addr.arpa    name = etcd-0.etcd-discovery.clusters-dhruvhcp.svc.cluster.local.
15.2.130.10.in-addr.arpa    name = 10-130-2-15.etcd-client.clusters-dhruvhcp.svc.cluster.local.

Confirm peer cert missing etcd-client SANs
$ oc get secret etcd-peer-tls -n clusters-dhruvhcp \
   -o jsonpath='{.data.peer\.crt}' | base64 -d | \
   openssl x509 -noout -text | grep -A 5 "Subject Alternative Name"
           X509v3 Subject Alternative Name:
               DNS:*.etcd-discovery.clusters-dhruvhcp.svc,
               DNS:*.etcd-discovery.clusters-dhruvhcp.svc.cluster.local,
               DNS:127.0.0.1, DNS:::1
No *.etcd-client SANs present

After fix:

etcd-client service is ClusterIP (not headless)
$ oc get service etcd-client -n clusters-etcd-test -o jsonpath='{.spec.clusterIP}' && echo
172.xxx.xxx.xxx <-- ClusterIP assigned 

Peer cert does NOT include etcd-client SANs
$ oc get secret etcd-peer-tls -n clusters-etcd-test \
   -o jsonpath='{.data.peer\.crt}' | base64 -d | \
   openssl x509 -noout -text | grep DNS
           DNS:*.etcd-discovery.clusters-etcd-test.svc,
           DNS:*.etcd-discovery.clusters-etcd-test.svc.cluster.local,
           DNS:127.0.0.1, DNS:::1
Only *.etcd-discovery SANs present, no etcd-client SANs 

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@vsolanki12: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit d25b272 into openshift:main Jun 26, 2026
41 checks passed
@openshift-ci-robot

Copy link
Copy Markdown

@vsolanki12: Jira Issue Verification Checks: Jira Issue OCPBUGS-76530
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-76530 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

What this PR does / why we need it:

Both etcd-discovery and etcd-client were headless services selecting the same etcd pods (app: etcd), causing CoreDNS to register two PTR records per pod IP. The etcd peer certificate only included *.etcd-discovery wildcard SANs, so when Go's getnameinfo() non-deterministically returned the etcd-client PTR record during peer TLS verification, the SAN check failed and peer connections were rejected (~50% of attempts).

This PR switches the etcd-client service from headless (clusterIP: None) to a regular ClusterIP service, which eliminates the per-pod PTR records that caused the reverse DNS ambiguity. With this change, only etcd-discovery creates per-pod PTR records, so peer TLS verification consistently resolves to the correct SAN.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-76530

Special notes for your reviewer:

  • The etcd-client service is switched from headless to ClusterIP — this removes per-pod PTR records and resolves the dual-PTR ambiguity
  • etcd-client SANs are intentionally not added to the peer cert since they are no longer needed with the ClusterIP change
  • A unit test explicitly asserts that etcd-client SANs are absent from the peer certificate
  • The ClusterIP change was originally authored by @csrwng in NO-JIRA: fix(cpo): switch etcd-client to ClusterIP service #8612 and incorporated here
  • This is a HyperShift-specific issue — standalone OCP etcd uses static pods with node IPs, no dual headless services
  • Tech debt tracked in OCPBUGS-86648: IPs (127.0.0.1, ::1) are currently passed in dnsNames instead of ips

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Authored by Vimal Solanki

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot

Copy link
Copy Markdown
Contributor

Fix included in release 5.0.0-0.nightly-2026-06-26-152608

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants