Skip to content

OCPBUGS-91656: fix(test): add retry logic to GetLogs in Karpenter kubelet propagatio…#8805

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
mgencur:OCPBUGS-91656_retry_getlogs
Jun 25, 2026
Merged

OCPBUGS-91656: fix(test): add retry logic to GetLogs in Karpenter kubelet propagatio…#8805
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
mgencur:OCPBUGS-91656_retry_getlogs

Conversation

@mgencur

@mgencur mgencur commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

…n test

What this PR does / why we need it:

The kubelet serving certificate on freshly provisioned Karpenter nodes may not be ready immediately after the node is marked Ready, causing transient "tls: internal error" or "http2: client connection lost" failures on the proxied log request. Wrap the GetLogs call in g.Eventually with a 2-minute timeout to tolerate this window.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-91656

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Improved reliability of node provisioning tests on newly provisioned clusters.

…n test

The kubelet serving certificate on freshly provisioned Karpenter nodes
may not be ready immediately after the node is marked Ready, causing
transient "tls: internal error" or "http2: client connection lost"
failures on the proxied log request. Wrap the GetLogs call in
g.Eventually with a 2-minute timeout to tolerate this window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 23, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mgencur: This pull request references Jira Issue OCPBUGS-91656, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

…n test

What this PR does / why we need it:

The kubelet serving certificate on freshly provisioned Karpenter nodes may not be ready immediately after the node is marked Ready, causing transient "tls: internal error" or "http2: client connection lost" failures on the proxied log request. Wrap the GetLogs call in g.Eventually with a 2-minute timeout to tolerate this window.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-91656

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from bryan-cox and sjenning June 23, 2026 08:58
@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6931fc5f-7141-44bf-8d05-a22278dfa0eb

📥 Commits

Reviewing files that changed from the base of the PR and between d3162ab and 1ba5c1a.

📒 Files selected for processing (1)
  • test/e2e/karpenter_test.go

📝 Walkthrough

Walkthrough

In testKubeletPropagation, the pod log retrieval for the checker container is changed from a single immediate fetch to a retry loop. A logBytes variable is declared outside the loop, and g.Eventually polls the Kubernetes pod log streaming endpoint every 10 seconds for up to 2 minutes, storing the result in logBytes only on successful reads. The subsequent log output and phase assertion (PodSucceeded or PodFailed) remain unchanged.


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error The PR logs kubelet.conf file contents via the test pod output. Kubelet.conf contains sensitive authentication material including client certificates, private keys, and CA certificates that should... Modify the checker pod to not output the full kubelet.conf file, or implement log filtering to redact sensitive certificate/key data before logging.
Test Structure And Quality ⚠️ Warning The g.Eventually block wrapping GetLogs (lines 1336, 1339) lacks meaningful failure messages on assertions. Per requirement 4, assertions should help diagnose failures; these are especially critica... Add failure messages to both g.Expect(err).NotTo(HaveOccurred()) calls in the GetLogs g.Eventually block to clarify what TLS/HTTP2/cert errors are being retried.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning testKubeletPropagation test requires pulling container images from public registries (quay.io/openshift/origin-pod:4.22.0 and alpine) which will fail in disconnected IPv6-only environments without... Configure image mirrors for public registries or add [Skipped:Disconnected] to test name if it cannot work in disconnected environments. Verify test works on IPv6 with `/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e...
✅ Passed checks (8 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding retry logic to GetLogs in the Karpenter kubelet propagation test to handle transient failures.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The test file uses Go's standard t.Run() testing framework, not Ginkgo. All test names are static strings with no dynamic content. The check for Ginkgo test stability is not applicable here.
Topology-Aware Scheduling Compatibility ✅ Passed Check is not applicable: PR modifies only test file (test/e2e/karpenter_test.go), adding retry logic for test reliability, not deployment/operator/controller code with scheduling constraints.
No-Weak-Crypto ✅ Passed PR only modifies test/e2e/karpenter_test.go to add retry logic for pod log retrieval; no weak cryptographic algorithms or non-constant-time comparisons introduced.
Container-Privileges ✅ Passed The PR adds a test fixture pod with privileged: true for accessing kubelet.conf on nodes in e2e tests. This is a test-only change with justified privileged access for diagnostic purposes.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.12%. Comparing base (d3162ab) to head (1ba5c1a).
⚠️ Report is 52 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8805   +/-   ##
=======================================
  Coverage   42.12%   42.12%           
=======================================
  Files         767      767           
  Lines       95067    95067           
=======================================
  Hits        40051    40051           
  Misses      52202    52202           
  Partials     2814     2814           
Flag Coverage Δ
cmd-support 35.46% <ø> (ø)
cpo-hostedcontrolplane 44.53% <ø> (ø)
cpo-other 44.25% <ø> (ø)
hypershift-operator 51.92% <ø> (ø)
other 31.69% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@enxebre

enxebre commented Jun 23, 2026

Copy link
Copy Markdown
Member

/approve

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2026

@mehabhalodiya mehabhalodiya left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, mehabhalodiya, mgencur

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mgencur

mgencur commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@cwbotbot

cwbotbot commented Jun 23, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

e2e-aks

@mgencur

mgencur commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@hypershift-jira-solve-ci

Copy link
Copy Markdown

I now have all the evidence needed for a comprehensive analysis. Here is the report:

Test Failure Analysis Complete

Job Information

  • Prow Job: Red Hat Konflux / hypershift-cli-mce-50-on-pull-request
  • Build ID: hypershift-cli-mce-50-on-pull-request-wxbrr
  • Pipeline Run: hypershift-cli-mce-50-on-pull-request-wxbrr
  • PR: #8805 — OCPBUGS-91656: fix(test): add retry logic to GetLogs in Karpenter kubelet propagation
  • Started: 2026-06-24T05:42:36Z
  • Completed: 2026-06-24T07:42:37Z (2 hours — pipeline timeout)

Test Failure Analysis

Error

task sast-snyk-check has the status "TaskRunCancelled":
TaskRun "hypershift-cli-mce-50-on-pull-request-wxbrr-sast-snyk-check" was cancelled.
TaskRun cancelled as the PipelineRun it belongs to has timed out.

Summary

The sast-snyk-check Konflux pipeline task hung for over 1 hour (vs. a normal ~35-second runtime) and was ultimately cancelled when the 2-hour PipelineRun timeout was reached. All 16 other pipeline tasks succeeded. This is a transient Snyk service infrastructure issue unrelated to the PR's code changes, which only modify a single test file (test/e2e/karpenter_test.go). A concurrent run of the same pipeline on PR #8820 completed sast-snyk-check in 35 seconds roughly one hour later, confirming the transient nature of the failure.

Root Cause

The root cause is a transient Snyk service hang during the SAST (Static Application Security Testing) scan step of the Konflux pipeline. The sast-snyk-check task became unresponsive — likely due to a Snyk API backend issue, network connectivity problem, or resource contention in the Konflux shared infrastructure — and ran for over 1 hour instead of its normal ~35-second execution time.

Key evidence proving this is not related to the PR:

  1. The PR only changes test/e2e/karpenter_test.go (a 31-line diff adding retry logic to GetLogs). Test files do not affect Snyk SAST scanning behavior — Snyk scans source code for security vulnerabilities, and test code changes cannot cause scan hangs.

  2. All other 16 pipeline tasks succeeded, including other SAST checks (sast-shell-check: 25s, sast-unicode-check: 1min), vulnerability scanning (clair-scan: 45s, clamav-scan: 1min), and the full image build chain. Only sast-snyk-check was affected.

  3. The identical pipeline passed on PR OCPBUGS-91650: bump consolidateAfter to 60s in karpenter e2e base NodePool #8820 (same repo, same Konflux pipeline definition) approximately 1 hour later, with sast-snyk-check completing in 35 seconds. This confirms the Snyk service recovered.

  4. The failure mode is TaskRunCancelled, not TaskRunFailed — the Snyk task did not report an error; it simply never completed before the 2-hour PipelineRun deadline.

The downstream enterprise-contract-mce-50 / hypershift-cli-mce-50 check was also cancelled as a consequence of the same pipeline timeout.

Recommendations
  1. Re-trigger the pipeline — This is a transient infrastructure issue. Simply re-running the Konflux pipeline should succeed, as the Snyk service has already recovered (confirmed by PR OCPBUGS-91650: bump consolidateAfter to 60s in karpenter e2e base NodePool #8820's successful run).

  2. No code changes needed — The PR's changes to test/e2e/karpenter_test.go are entirely unrelated to the Snyk scan failure. The retry logic addition is correct and should not be modified due to this CI failure.

  3. If the failure recurs, check the Snyk Status Page for ongoing incidents and consider opening a support request with the Konflux/RHTAP team, as the sast-snyk-check task may need a dedicated timeout shorter than the pipeline-level timeout to fail fast rather than consuming the entire pipeline budget.

Evidence
Evidence Detail
Failed task sast-snyk-check — ran for 1 hour, then cancelled by pipeline timeout
Normal duration sast-snyk-check completes in 35 seconds (PR #8820 baseline)
Failure mode TaskRunCancelled — not a scan failure, but a hang/timeout
Pipeline timeout 2 hours (05:42:36Z → 07:42:37Z)
Other tasks All 16 other tasks succeeded (init, clone, build, scans, etc.)
PR diff Single file: test/e2e/karpenter_test.go (31 lines, test-only change)
Comparison run PR #8820 same pipeline passed at 07:15:19Z, sast-snyk-check in 35s
Collateral damage enterprise-contract-mce-50 / hypershift-cli-mce-50 also cancelled (same pipeline)

@mgencur

mgencur commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

/verified by unit test

This PR only includes a fix for the given unit test.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 24, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mgencur: This PR has been marked as verified by unit test.

Details

In response to this:

/verified by unit test

This PR only includes a fix for the given unit test.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mgencur

mgencur commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 25, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mgencur: This pull request references Jira Issue OCPBUGS-91656, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@mgencur: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 652dbf2 into openshift:main Jun 25, 2026
41 checks passed
@openshift-ci-robot

Copy link
Copy Markdown

@mgencur: Jira Issue Verification Checks: Jira Issue OCPBUGS-91656
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-91656 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

…n test

What this PR does / why we need it:

The kubelet serving certificate on freshly provisioned Karpenter nodes may not be ready immediately after the node is marked Ready, causing transient "tls: internal error" or "http2: client connection lost" failures on the proxied log request. Wrap the GetLogs call in g.Eventually with a 2-minute timeout to tolerate this window.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/OCPBUGS-91656

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
  • Improved reliability of node provisioning tests on newly provisioned clusters.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot

Copy link
Copy Markdown
Contributor

Fix included in release 5.0.0-0.nightly-2026-06-25-194049

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Indicates the PR includes changes for e2e testing jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants