Skip to content

CNTRLPLANE-3624: e2e v2 test for control-plane-pki-operator TLS configuration#8793

Open
kaleemsiddiqu wants to merge 1 commit into
openshift:mainfrom
kaleemsiddiqu:pkioperator-tls-profile
Open

CNTRLPLANE-3624: e2e v2 test for control-plane-pki-operator TLS configuration#8793
kaleemsiddiqu wants to merge 1 commit into
openshift:mainfrom
kaleemsiddiqu:pkioperator-tls-profile

Conversation

@kaleemsiddiqu

@kaleemsiddiqu kaleemsiddiqu commented Jun 22, 2026

Copy link
Copy Markdown

CNTRLPLANE-3624: Add end-to-end v2 test to verify control-plane-pki-operator TLS configuration behavior.

What this PR does / why we need it:

e2e test for #8768

Summary by CodeRabbit

  • Tests
    • Added an end-to-end test suite that validates control-plane-pki-operator TLS behavior when the HostedCluster API server TLSSecurityProfile changes, including checks for required minTLSVersion values and connectivity success or failure for TLS 1.2 vs TLS 1.3 across default/intermediate and Modern scenarios.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

A new e2e test file is added under test/e2e/v2/tests/ that registers and implements VerifyPKIOperatorTLSConfigTest. The test validates that the control-plane-pki-operator-config ConfigMap on the management cluster correctly reflects HostedCluster TLS security profile changes (minTLSVersion: VersionTLS12 for default/intermediate vs VersionTLS13 for Modern), and that the PKI operator pod enforces those versions. The suite captures the original TLS profile in BeforeAll, progresses through multiple test phases (default profile verification, transition to Modern, Modern enforcement validation, downgrade, and post-downgrade verification), and restores the original state in AfterAll using conflict-tolerant polling.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning Test uses bare 'localhost' in openssl commands (6 occurrences) without IPv6-aware fallback or IP family detection, which may fail in IPv6-only disconnected CI environments. Add IPv6 detection or use explicit IP addresses (127.0.0.1 for IPv4, ::1 for IPv6), or implement IP family-aware connection logic using cluster's IP family configuration.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All 9 test titles in the new test file are stable, deterministic, and use only static string literals. No dynamic values like pod names, timestamps, UUIDs, namespaces, or IP addresses are present i...
Test Structure And Quality ✅ Passed The test demonstrates strong quality on all checked dimensions: single responsibility (each It tests one behavior), proper setup/cleanup using BeforeAll/AfterAll with conflict-tolerant polling, app...
Topology-Aware Scheduling Compatibility ✅ Passed This PR adds only an end-to-end test file, not deployment manifests, operator code, or controllers. The custom check specifically applies only when such production code is added or modified, making...
No-Weak-Crypto ✅ Passed The PR adds only test code with no weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), no custom crypto implementations, and no non-constant-time secret comparisons.
Container-Privileges ✅ Passed PR adds only test code with no Kubernetes manifests or privileged container configurations. No privileged: true, hostPID/hostNetwork/hostIPC, SYS_ADMIN, or allowPrivilegeEscalation settings present.
No-Sensitive-Data-In-Logs ✅ Passed Test contains no logging of passwords, tokens, API keys, PII, session IDs, or sensitive data. Only infrastructure identifiers (pod names, namespaces) and generic TLS configuration status messages a...
Title check ✅ Passed The title clearly identifies the new e2e v2 test and the control-plane-pki-operator TLS configuration focus.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from enxebre and jparrill June 22, 2026 12:44
@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jun 22, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
test/e2e/v2/tests/control_plane_pki_operator_test.go (2)

246-247: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Replace hardcoded sleep with Eventually polling.

A hardcoded 60-second sleep is fragile — the pod might need more or less time to restart. Per v2 framework conventions, use Eventually for state changes that occur over time.

Consider polling until the pod has restarted (e.g., by checking that the pod's creation timestamp is newer than when the ConfigMap was updated, or by verifying the pod is Running and ready).

♻️ Suggested approach
// Wait for PKI operator pod to restart with new TLS config
Eventually(func(g Gomega) {
    pkiPodList := &corev1.PodList{}
    g.Expect(mgmtClient.List(tc.Context, pkiPodList,
        crclient.InNamespace(tc.ControlPlaneNamespace),
        crclient.MatchingLabels{"app": "control-plane-pki-operator"},
    )).To(Succeed())
    g.Expect(pkiPodList.Items).NotTo(BeEmpty())
    
    for _, pod := range pkiPodList.Items {
        g.Expect(pod.Status.Phase).To(Equal(corev1.PodRunning))
        // Optionally check Ready condition
        for _, cond := range pod.Status.Conditions {
            if cond.Type == corev1.PodReady {
                g.Expect(cond.Status).To(Equal(corev1.ConditionTrue))
            }
        }
    }
}, 2*time.Minute, 5*time.Second).Should(Succeed(), "PKI operator pod should be running and ready")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/control_plane_pki_operator_test.go` around lines 246 - 247,
Replace the hardcoded time.Sleep call with an Eventually polling loop that waits
for the PKI operator pod to actually restart and become ready. Use the
mgmtClient to list pods in the ControlPlaneNamespace with the label selector
matching the control-plane-pki-operator, then poll until the pod's Phase is
Running and its Ready condition is True. Set the timeout to around 2 minutes
with a polling interval of 5 seconds to accommodate variable restart times while
avoiding fragile hardcoded waits.

Source: Coding guidelines


54-54: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Feature annotation mismatch with outer Describe block.

The When block uses [Feature:PKIOperatorTLS] but the outer Describe at line 463 uses [Feature:PKIOperator]. Per AGENTS.md section 19, for single-feature files the Feature annotation should be on the Describe level only. Having different Feature tags at different levels may cause confusion in Sippy Component Readiness mapping.

Consider removing the Feature annotation from the When block since the outer Describe already declares the feature scope.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/control_plane_pki_operator_test.go` at line 54, The `When`
block at line 54 declares `[Feature:PKIOperatorTLS]` which conflicts with the
outer `Describe` block's `[Feature:PKIOperator]` annotation. Per AGENTS.md
section 19, for single-feature files the Feature annotation should only be
declared at the `Describe` level. Remove the `[Feature:PKIOperatorTLS]`
annotation from the `When` block description string, keeping only the
descriptive text "PKI operator TLS configuration is modified" along with the
`Ordered` parameter, since the outer `Describe` block already establishes the
feature scope.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/v2/tests/control_plane_pki_operator_test.go`:
- Around line 217-220: The NotFound error handling block in the Eventually loop
has unreachable code due to a logical conflict. When apierrors.IsNotFound(err)
is true, the g.Expect(err).NotTo(HaveOccurred()) assertion will fail immediately
since err is non-nil, and the return statement will never execute. Fix this by
removing either the if check and its Expect statement (to let the subsequent
g.Expect handle all errors) or by removing the g.Expect assertion and keeping
only the return (if the intent is to retry the Eventually loop when ConfigMap
doesn't exist).

---

Nitpick comments:
In `@test/e2e/v2/tests/control_plane_pki_operator_test.go`:
- Around line 246-247: Replace the hardcoded time.Sleep call with an Eventually
polling loop that waits for the PKI operator pod to actually restart and become
ready. Use the mgmtClient to list pods in the ControlPlaneNamespace with the
label selector matching the control-plane-pki-operator, then poll until the
pod's Phase is Running and its Ready condition is True. Set the timeout to
around 2 minutes with a polling interval of 5 seconds to accommodate variable
restart times while avoiding fragile hardcoded waits.
- Line 54: The `When` block at line 54 declares `[Feature:PKIOperatorTLS]` which
conflicts with the outer `Describe` block's `[Feature:PKIOperator]` annotation.
Per AGENTS.md section 19, for single-feature files the Feature annotation should
only be declared at the `Describe` level. Remove the `[Feature:PKIOperatorTLS]`
annotation from the `When` block description string, keeping only the
descriptive text "PKI operator TLS configuration is modified" along with the
`Ordered` parameter, since the outer `Describe` block already establishes the
feature scope.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dd603fd6-5224-45c7-9d8c-c5f22c060279

📥 Commits

Reviewing files that changed from the base of the PR and between 8019810 and 857b5f7.

📒 Files selected for processing (1)
  • test/e2e/v2/tests/control_plane_pki_operator_test.go

Comment thread test/e2e/v2/tests/control_plane_pki_operator_test.go
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.28%. Comparing base (f69e734) to head (e3f3779).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8793   +/-   ##
=======================================
  Coverage   43.28%   43.28%           
=======================================
  Files         771      771           
  Lines       95503    95503           
=======================================
  Hits        41335    41335           
  Misses      51284    51284           
  Partials     2884     2884           
Flag Coverage Δ
cmd-support 36.67% <ø> (ø)
cpo-hostedcontrolplane 45.31% <ø> (ø)
cpo-other 45.10% <ø> (ø)
hypershift-operator 53.59% <ø> (ø)
other 31.69% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kaleemsiddiqu kaleemsiddiqu force-pushed the pkioperator-tls-profile branch from 857b5f7 to e8f538f Compare June 23, 2026 16:09

@csrwng csrwng left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #8793 — e2e v2 test for control-plane-pki-operator TLS configuration.

Overall the test structure follows the v2 framework conventions well. Found one likely bug in the conflict-handling pattern (appears twice), a cleanup gap in AfterAll, and several robustness/maintainability improvements.

11 comments total: 1 bug, 1 cleanup gap, 9 suggestions.

hostedCluster := tc.GetHostedCluster()

// Check if profile is default or intermediate (both support TLS 1.2)
if hostedCluster.Spec.Configuration != nil &&

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile guard: "not Modern" != "supports TLS 1.2"

This skip guard uses != configv1.TLSProfileModernType as a proxy for "should have minTLSVersion TLS 1.2", but if a future profile type is introduced (e.g., Custom with TLS 1.3-only ciphers), this test would run and incorrectly assert VersionTLS12. Consider guarding on the positive case instead — skip unless the profile is nil (default) or explicitly IntermediateType.

The same pattern applies at lines 125-129.

if hostedCluster.Spec.Configuration != nil &&
hostedCluster.Spec.Configuration.APIServer != nil &&
hostedCluster.Spec.Configuration.APIServer.TLSSecurityProfile != nil &&
hostedCluster.Spec.Configuration.APIServer.TLSSecurityProfile.Type == configv1.TLSProfileModernType {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extract the TLS profile type check into a helper

This same multi-line nil-check-then-compare pattern appears in at least 4 tests (lines 95-99, 125-129, 239-242, 319-322). Consider extracting it into a small helper like hostedClusterHasTLSProfileType(hc, profileType) bool to reduce repetition and make the skip guards more readable.


// Get control-plane-pki-operator pod from management cluster
mgmtClient := tc.MgmtClient
mgmtRestConfig, err := e2eutil.GetConfig()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extract mgmt REST config and kube client setup

e2eutil.GetConfig() + kubernetes.NewForConfig() is repeated in 3 tests (here, line 247, line 389). Consider setting these up once in BeforeAll alongside tc and storing them as shared variables, or extracting into a helper.

GinkgoWriter.Printf("Testing TLS connections to PKI operator pod %s on port %s\n", pkiPod.Name, metricsPort)

// Test TLS 1.2 connection using openssl s_client from within the PKI operator pod
tls12Result, _ := v2util.RunCommandInPod(tc.Context, mgmtKubeClient, mgmtRestConfig,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug risk: RunCommandInPod error is silently discarded

The error return is ignored here and in every other RunCommandInPod call in this file (lines 157, 165, 280, 288, 422, 430). While || true prevents a non-zero exit from the shell command itself, an infrastructure-level failure (e.g., pod exec API error, container not found) would also be silently swallowed, and the test would assert against an empty or garbage string. Consider checking the error:

tls12Result, err := v2util.RunCommandInPod(...)
Expect(err).NotTo(HaveOccurred(), "failed to exec into PKI operator pod")

// the control-plane-pki-operator config reflects the correct minTLSVersion and that
// the API server enforces those TLS versions correctly.
func VerifyPKIOperatorTLSConfigTest(getTestCtx internal.TestContextGetter) {
When("PKI operator TLS configuration is modified", Ordered, func() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add Label("Lifecycle"): this container includes tests that mutate the HostedCluster

Per the v2 framework guidelines (Rule 13), only lifecycle tests may mutate cluster state. The It blocks at lines 173 and 307 update the HostedCluster's TLS security profile. Adding Label("Lifecycle") here ensures CI can filter these appropriately (longer timeouts, avoid parallel execution with other mutation tests):

Suggested change
When("PKI operator TLS configuration is modified", Ordered, func() {
When("PKI operator TLS configuration is modified", Ordered, Label("Lifecycle"), func() {

"TLS 1.2 connection should be rejected with modern profile")
})

It("should downgrade to minTLSVersion VersionTLS12 when Modern TLS profile is removed", func() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test mutates the HostedCluster (removes TLS profile to downgrade to Intermediate). The containing When block should carry Label("Lifecycle") per framework Rule 13.


// Update to Modern TLS profile in the HostedCluster CR
hostedCluster.Spec.Configuration.APIServer.TLSSecurityProfile = &configv1.TLSSecurityProfile{
Type: configv1.TLSProfileModernType,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: conflict-retry pattern silently succeeds without applying the update

g.Expect(err).To(HaveOccurred(), ...) on a non-nil err always passes, so when a conflict occurs this Eventually iteration returns successfullyEventually considers the body satisfied and stops retrying. The update was never applied, but the test proceeds as if it was.

To actually retry on conflict, you need to fail the iteration explicitly:

Suggested change
Type: configv1.TLSProfileModernType,
err = mgmtClient.Update(tc.Context, hostedCluster)
g.Expect(apierrors.IsConflict(err)).To(BeFalse(), "conflict on update, retrying")
g.Expect(err).NotTo(HaveOccurred(), "failed to update HostedCluster TLS profile to Modern")

The same bug appears at line 335-337 in the downgrade test.


pkiPod = &pkiPodList.Items[0]
g.Expect(pkiPod.Status.Phase).To(Equal(corev1.PodRunning),
"control-plane-pki-operator pod should be running")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vacuous pass: PodReady condition may not exist

Per framework Rule 16, condition-search loops need a found guard. If the pod has no PodReady condition in its status (e.g., during init), no assertion fires and the Eventually iteration succeeds without actually verifying readiness. Same issue at lines 410-414.

Suggested change
"control-plane-pki-operator pod should be running")
readyFound := false
for _, cond := range pkiPod.Status.Conditions {
if cond.Type == corev1.PodReady {
readyFound = true
g.Expect(cond.Status).To(Equal(corev1.ConditionTrue),
"control-plane-pki-operator pod should be ready")
}
}
g.Expect(readyFound).To(BeTrue(), "PodReady condition not found on pod %s", pkiPod.Name)

fmt.Sprintf("timeout 2 openssl s_client -connect localhost:%s -tls1_2 2>&1 || true", metricsPort))

// TLS 1.2 should be rejected - expect error or alert in output, or no TLSv1.2 in successful output
lowerResult := strings.ToLower(tls12Result)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overly broad assertion: ContainSubstring("error") matches successful connections too

The Or() matcher has several issues:

  1. ContainSubstring("error") is too broad — even a successful openssl handshake outputs debug lines containing "error" (e.g., Verify return code: ... preceded by error-related lines, error writing to BIO). This arm would match on a successful TLS 1.2 connection, making the test always pass.
  2. ContainSubstring("alert") is also common in successful TLS output.
  3. Not(ContainSubstring("protocol : tlsv1.2")) with the double space is openssl-version-dependent formatting.

Consider asserting the negative instead — that the connection was NOT successfully negotiated at TLS 1.2:

Expect(lowerResult).NotTo(
    And(
        ContainSubstring("new,"),
        ContainSubstring("tlsv1.2"),
    ),
    "TLS 1.2 connection should be rejected with modern profile, got: %s", tls12Result)

Or check for the specific openssl error that TLS 1.3-only servers return: ContainSubstring("no protocols available") or ContainSubstring("unsupported protocol").


AfterAll(func() {
// Restore original TLS profile if it was captured
if originalTLSProfile != nil && tc != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup gap: AfterAll doesn't restore when the original TLS profile was nil

If the HostedCluster starts with no TLS profile (the common default — originalTLSProfile is nil), this cleanup block is entirely skipped. If the test fails after setting Modern (line 195) but before the downgrade test runs (line 307), AfterAll won't restore the HC — it stays mutated with Modern. With Ordered, a failure in any earlier It causes later ones to be skipped, so this is a realistic scenario.

Consider always running the cleanup and handling the nil case:

Suggested change
if originalTLSProfile != nil && tc != nil {
AfterAll(func() {
if tc == nil {
return
}
GinkgoWriter.Printf("Restoring original TLS security profile\n")
err := wait.PollUntilContextTimeout(tc.Context, 5*time.Second, 1*time.Minute, true, func(ctx context.Context) (bool, error) {
hostedCluster := &hyperv1.HostedCluster{}
err := tc.MgmtClient.Get(ctx, crclient.ObjectKey{
Namespace: tc.ClusterNamespace,
Name: tc.ClusterName,
}, hostedCluster)
if err != nil {
return false, err
}
if hostedCluster.Spec.Configuration == nil {
hostedCluster.Spec.Configuration = &hyperv1.ClusterConfiguration{}
}
if hostedCluster.Spec.Configuration.APIServer == nil {
hostedCluster.Spec.Configuration.APIServer = &configv1.APIServerSpec{}
}
hostedCluster.Spec.Configuration.APIServer.TLSSecurityProfile = originalTLSProfile

This way, if originalTLSProfile is nil, it sets the profile back to nil (restoring the default).

@kaleemsiddiqu kaleemsiddiqu force-pushed the pkioperator-tls-profile branch from e8f538f to 92505ae Compare June 29, 2026 09:04

@ricardomaraschini ricardomaraschini left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered making this part of the existing test on the origin repository ?


// VerifyPKIOperatorTLSConfigTest validates that when TLS security profile changes are applied,
// the control-plane-pki-operator config reflects the correct minTLSVersion and that
// the API server enforces those TLS versions correctly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What API server are we referring to here ?

@kaleemsiddiqu kaleemsiddiqu Jun 29, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated text with better description.

}, cm)

if apierrors.IsNotFound(err) {
Skip(fmt.Sprintf("ConfigMap %s/%s not found - this test requires the PKI operator config to be present",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this a failure instead ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this as failure.

isDefaultOrIntermediate := !hasProfile || hostedClusterHasTLSProfileType(hostedCluster, configv1.TLSProfileIntermediateType)

if !isDefaultOrIntermediate {
Skip("HostedCluster does not have default or Intermediate TLS profile - skipping TLS 1.2 test")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we keep skipping tests ? Shouldn't we make sure the transitions work ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an additional check here before the transition (upgrade/downgrade) in configmap itself before upgrade transition

@kaleemsiddiqu

Copy link
Copy Markdown
Author

Have you considered making this part of the existing test on the origin repository ?

Those tests are very broad and covers things in general involving all related operators on both standard cluster and hypershift, this is specific to hypershift only thats why i thought to keep the test for this here only.

@kaleemsiddiqu kaleemsiddiqu force-pushed the pkioperator-tls-profile branch from 92505ae to cac6721 Compare June 29, 2026 14:57
@csrwng

csrwng commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

/approve

@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, kaleemsiddiqu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2026
@kaleemsiddiqu kaleemsiddiqu force-pushed the pkioperator-tls-profile branch from cac6721 to 3ea0648 Compare June 30, 2026 14:40
@kaleemsiddiqu kaleemsiddiqu changed the title test: add e2e v2 test for control-plane-pki-operator TLS configuration CNTRLPLANE-3624: e2e v2 test for control-plane-pki-operator TLS configuration Jun 30, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 30, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 30, 2026

Copy link
Copy Markdown

@kaleemsiddiqu: This pull request references CNTRLPLANE-3624 which is a valid jira issue.

Details

In response to this:

CNTRLPLANE-3624: Add end-to-end v2 test to verify control-plane-pki-operator TLS configuration behavior.

What this PR does / why we need it:

e2e test for #8768

Summary by CodeRabbit

  • Tests
  • Added an end-to-end test suite that validates control-plane-pki-operator TLS behavior when the HostedCluster API server TLSSecurityProfile changes, including checks for required minTLSVersion values and connectivity success or failure for TLS 1.2 vs TLS 1.3 across default/intermediate and Modern scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ricardomaraschini

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2072215604164562944 | Cost: $3.0053817499999997 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@kaleemsiddiqu

Copy link
Copy Markdown
Author

/retest

@jparrill jparrill left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped some comments. Thanks!

// VerifyPKIOperatorTLSConfigTest validates that when TLS security profile changes are applied
// to the HostedCluster, the control-plane-pki-operator config reflects the correct minTLSVersion
// and that the control-plane-pki-operator's HTTPS endpoint enforces those TLS versions correctly.
func VerifyPKIOperatorTLSConfigTest(getTestCtx internal.TestContextGetter) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs Serial here too. This test mutates the HC TLS profile and triggers pod restarts — without Serial, parallel tests checking pod health or certs will flake. See backup_restore_test.go for the pattern.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added serial here

Expect(configYAML).To(ContainSubstring("TLS"), "config.yaml should contain TLS configuration")
})

It("should have minTLSVersion set to VersionTLS12 with default/intermediate profile", func() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion is meaningless — "TLS" appears everywhere in field names and cipher strings. The subsequent tests already validate the real config. Remove this line.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed that check


// Get control-plane-pki-operator pod from management cluster
mgmtClient := tc.MgmtClient
pkiPodList := &corev1.PodList{}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declared 3 times as a local variable. Hoist to a constant next to pkiOperatorConfigMapName.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved as constant and replaced wherever it was used.

GinkgoWriter.Printf("Testing TLS connections to PKI operator pod %s on port %s\n", pkiPod.Name, metricsPort)

// Test TLS 1.2 connection using openssl s_client from within the PKI operator pod
tls12Result, err := v2util.RunCommandInPod(tc.Context, mgmtKubeClient, mgmtRestConfig,

@jparrill jparrill Jul 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g.Expect(apierrors.IsConflict(err)).To(BeFalse()) works by accident but reads wrong. Use:

if apierrors.IsConflict(err) {
    return
}
g.Expect(err).NotTo(HaveOccurred())

Same at lines 287-288.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with suggested code

pkiPodList := &corev1.PodList{}
g.Expect(mgmtClient.List(tc.Context, pkiPodList,
crclient.InNamespace(tc.ControlPlaneNamespace),
crclient.MatchingLabels{"app": "control-plane-pki-operator"},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You wait for a pod to be Ready, but never verify it is a new pod. The old pod with old config can still be Running+Ready briefly. TLS 1.3 passes against both old and new config, so you get a false positive. Capture pod UID before mutation, wait for a different UID after.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added code to compare pods UID before and after.

g.Expect(cond.Status).To(Equal(corev1.ConditionTrue),
"control-plane-pki-operator pod should be ready")
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout 2 is pretty aggressive for CI. The pod exec goes through SPDY streaming via the API server — in a loaded cluster, 2 seconds might not be enough. Bump to timeout 5 across all the openssl commands.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout of 5 secods added.

Namespace: tc.ClusterNamespace,
Name: tc.ClusterName,
}, hostedCluster)
if err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. wait.PollUntilContextTimeout: No other v2 test uses this — the framework convention is Eventually. You already use Eventually with conflict retry earlier in the same file. Switching also lets you drop the "context" and wait imports.

  2. Silent failure: The warning log swallows the restore error. If this fails, the cluster is left with whatever TLS profile was last set, breaking subsequent tests. Use Expect(err).NotTo(HaveOccurred()) to fail loudly.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Replaced with Eventually
  2. Used Expect(err).*

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

Now I have a clear picture. Let me output the final report.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Process did not finish before 2h0m0s timeout
Container test exited with code 127, reason Error
Step e2e-aws-4-22-hypershift-aws-run-e2e-nested failed after 2h30m8s.

Summary

The e2e test suite exceeded the 2-hour process timeout. The TestCreateCluster test was the only test that did not complete — it was stuck in its EnsureMetricsForwarderWorking subtest, repeatedly polling for a Prometheus kube-apiserver scrape target that never appeared. All other tests (TestAutoscaling, TestKarpenter, TestKarpenterUpgradeControlPlane, TestUpgradeControlPlane, etc.) completed successfully. This failure is unrelated to PR #8793, which only adds a new v2 e2e test file (test/e2e/v2/tests/control_plane_pki_operator_test.go) with build tag e2ev2 — this file is not compiled into the standard e2e binary used by this job.

Root Cause

The root cause is a cumulative e2e suite timeout (2-hour wall clock), not a specific test assertion failure.

What happened step-by-step:

  1. The e2e test binary started at 12:24:07Z with a 2-hour Prow entrypoint timeout.
  2. ~14 tests ran in parallel. The longest-running tests were:
    • TestKarpenterUpgradeControlPlane: 4022s (67 min)
    • TestAutoscaling: 3815s (64 min)
    • TestKarpenter: 3517s (59 min)
  3. TestCreateCluster ran its sequential "Main" subtests: break-glass-credentials → EnsureAPIUX → ... → EnsureGlobalPullSecret → EnsureMetricsForwarderWorking.
  4. EnsureMetricsForwarderWorking began polling Prometheus targets inside the guest cluster. Despite enabling the metrics-forwarder, the kube-apiserver target via control-plane-metrics-forwarder was never found in Prometheus active targets. The test has a 10-minute internal polling timeout, but before that timeout could expire, the entire process was killed at 14:24:07Z.
  5. The Prow entrypoint sent SIGTERM. The process failed to exit within the 30-minute grace period and was killed at 14:54:07Z with exit code 127.

Why EnsureMetricsForwarderWorking was slow:
The test successfully deployed the endpoint-resolver and metrics-proxy in the management namespace, and the metrics-forwarder was deployed and running in the guest cluster. However, the Prometheus kube-apiserver scrape target via the forwarder never appeared as healthy in the active targets API. The initial exec into the Prometheus pod failed with error dialing backend: EOF, suggesting temporary connectivity issues to the guest cluster's monitoring stack. Subsequent retries found the targets API accessible but the specific scrape target was never present, indicating a configuration or scrape discovery issue in the metrics-forwarder chain.

Why the PR is not at fault:
PR #8793 adds a single file test/e2e/v2/tests/control_plane_pki_operator_test.go with build tag //go:build e2ev2. This tag is not included in the standard e2e test binary compiled for e2e-aws-4-22, so the file has no effect on test compilation, test count, or test execution.

Recommendations
  1. Retry the job — This appears to be a transient infrastructure/timing issue where the metrics-forwarder Prometheus scrape target did not register in time. The failure is not related to the PR's code changes.

  2. Investigate EnsureMetricsForwarderWorking reliability — This test has been the bottleneck before. Consider:

    • Adding more diagnostic logging when the target is not found (e.g., dump all active targets to help debug)
    • Reducing the polling timeout from 10 minutes to 5 minutes so it fails fast rather than consuming budget from the overall 2h timeout
    • Adding a test-level context.WithTimeout to TestCreateCluster to ensure it fails explicitly rather than being killed by the process timeout
  3. Consider the overall test suite timing budget — With tests like TestKarpenterUpgradeControlPlane (67 min) and TestAutoscaling (64 min) running concurrently, the 2-hour budget is tight. If TestCreateCluster hits any slowness in its sequential subtests, it risks timeout.

Evidence
Evidence Detail
Process timeout Process did not finish before 2h0m0s timeout at 2026-07-01T14:24:07Z
Failed step e2e-aws-4-22-hypershift-aws-run-e2e-nested (exit code 127)
Stuck test TestCreateCluster/Main/EnsureMetricsForwarderWorking — never completed
Stuck polling message kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry) — repeated 6+ times
Initial exec error exec in openshift-monitoring/prometheus-k8s-0 container prometheus failed: error dialing backend: EOF
PR change scope Single file: test/e2e/v2/tests/control_plane_pki_operator_test.go with build tag e2ev2 — not compiled into the e2e binary
All other tests All 14 other tests PASSED (TestAutoscaling 3815s, TestKarpenter 3517s, TestKarpenterUpgradeControlPlane 4022s, etc.)
Test start time 2026-07-01T12:24:07Z
Timeout hit 2026-07-01T14:24:07Z (exactly 2 hours)
Grace period exceeded Process did not exit before 30m0s grace period at 2026-07-01T14:54:07Z

@kaleemsiddiqu kaleemsiddiqu force-pushed the pkioperator-tls-profile branch from 3ea0648 to f92ebfb Compare July 1, 2026 15:31
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026
CNTRLPLANE-3624: Add end-to-end v2 test to verify control-plane-pki-operator
TLS configuration behavior.

Signed-off-by: Kaleemullah Siddiqui <ksiddiqu@redhat.com>
@kaleemsiddiqu kaleemsiddiqu force-pushed the pkioperator-tls-profile branch from f92ebfb to e3f3779 Compare July 2, 2026 08:37
@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

@kaleemsiddiqu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-4-22 3ea0648 link true /test e2e-aws-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jparrill

jparrill commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 2, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants