Skip to content

CNTRLPLANE-3026: Decouple AWS AMI resolution for dual-stream support#8699

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
jparrill:CNTRLPLANE-3026
Jun 17, 2026
Merged

CNTRLPLANE-3026: Decouple AWS AMI resolution for dual-stream support#8699
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
jparrill:CNTRLPLANE-3026

Conversation

@jparrill

@jparrill jparrill commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it

Phase 1 step 3 (AWS) of the Dual-Stream RHEL 9/10 NodePool enhancement (OCPSTRAT-3014). Decouples defaultNodePoolAMI() from ReleaseImage.StreamMetadata so it resolves stream metadata via StreamForName(), enabling stream-aware boot image resolution.

Changes:

  • defaultNodePoolAMI() — accepts a streamName parameter, resolves metadata via ReleaseImage.StreamForName(streamName) instead of reading StreamMetadata directly
  • resolveAWSAMI() — passes "" as streamName (backward compatible)
  • setAWSConditions() — passes "" as streamName (backward compatible)
  • setKarpenterAMILabels() — passes "" as streamName (backward compatible)

StreamForName("") returns StreamMetadata (the legacy default). For multi-stream payloads (>= 5.0), callers will pass a resolved stream name to look up the correct entry in OSStreams. When CNTRLPLANE-3553 wires the NodePool API field, only the "" arguments change — defaultNodePoolAMI itself needs no modification.

This establishes the pattern for other platform resolvers (GCP, Azure, KubeVirt) in CNTRLPLANE-3027.

Which issue(s) this PR fixes

Fixes CNTRLPLANE-3026

Special notes for your reviewer

  • defaultNodePoolAMI signature change is not cosmetic — it decouples AMI resolution from ReleaseImage.StreamMetadata so callers can resolve any stream via StreamForName(). For single-stream payloads (OCP < 5.0), StreamForName("") falls back to StreamMetadata. For multi-stream payloads (>= 5.0), it looks up the named stream in OSStreams.
  • resolveAWSAMI stays unexported — no external callers need it.

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Test plan

go test -run "TestResolveAWSAMI|TestDefaultNodePoolAMI" ./hypershift-operator/controllers/nodepool/ -v -count=1
make lint-fix  # 0 issues
make verify    # passes

🤖 Generated with Claude Code

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 9, 2026
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Jun 9, 2026

Copy link
Copy Markdown

@jparrill: This pull request references CNTRLPLANE-3026 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it

Phase 1 step 3 (AWS) of the Dual-Stream RHEL 9/10 NodePool enhancement (OCPSTRAT-3014). Decouples defaultNodePoolAMI() from ReleaseImage so it accepts *stream.Stream directly, enabling stream-aware boot image resolution.

Changes:

  • defaultNodePoolAMI() — accepts *stream.Stream instead of *releaseinfo.ReleaseImage (core decoupling)
  • ResolveAWSAMI() — exported (was resolveAWSAMI), resolves stream internally via StreamForName("")
  • setAWSConditions() — resolves stream internally via StreamForName("")
  • setKarpenterAMILabels() — resolves stream internally via StreamForName("")
  • All callers pass "" for now (backward compatible). When CNTRLPLANE-3553 wires the NodePool API field, only the "" arguments change.

Integration tests:

  • TestBootImageStreamResolution/AWS — 7 test cases covering multi-stream resolution, user AMI override, legacy 4.x backward compat, missing region, unsupported arch, nil metadata

This establishes the pattern for other platform resolvers (GCP, Azure, KubeVirt) in CNTRLPLANE-3027.

Which issue(s) this PR fixes

Fixes CNTRLPLANE-3026

Dependencies

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.
  • This change includes integration tests.

Test plan

  • go build ./hypershift-operator/... — compiles
  • go test ./hypershift-operator/controllers/nodepool/... -count=1 — unit tests pass
  • go test -tags=integration ./test/integration/osstreams/... -count=1 — integration tests pass
  • make verify — lint and generation pass (only verify-git-clean expected to fail pre-merge)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds multi-stream RHEL CoreOS support to HyperShift's NodePool AMI resolution. The ReleaseImage type gains an OSStreams field and StreamForName method to resolve stream metadata by name. Deserialization now handles both single-stream (4.x) and multi-stream (5.x) ConfigMap payloads with automatic fallback to rhel-9 when only multi-stream data is present. A new GetRHELStream helper selects between rhel-9 and rhel-10 based on OCP version and runc usage, enforcing version constraints (OCP >= 5 for rhel-10, incompatible with runc). The defaultNodePoolAMI function is refactored to accept resolved *stream.Stream instead of *ReleaseImage. AWS and Karpenter AMI resolution paths now call StreamForName before AMI computation. Provider implementations (registry client and mirror decorator) are updated to populate and propagate OSStreams. A comprehensive 5.0 fixture with both single-stream and multi-stream payloads validates deserialization and AMI selection across architectures and providers.

Possibly related PRs

  • openshift/hypershift#8673: Related nodepool AMI resolution refactoring that migrates AMI metadata handling to coreos/stream-metadata-go types in overlapping controller areas.
🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: decoupling AWS AMI resolution to enable dual-stream support, which is the core objective of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR uses standard Go testing framework (testing.T) with table-driven tests, not Ginkgo. All test case names are static string literals with descriptive "When/Then" naming; none contain dynamic...
Test Structure And Quality ✅ Passed PR contains table-driven Go unit tests, not Ginkgo tests. Check for "Ginkgo test code" is not applicable; standard Go tests are well-structured with 25 parallel test cases, descriptive names, isola...
Topology-Aware Scheduling Compatibility ✅ Passed PR contains no deployment manifests or operator code introducing scheduling constraints. Changes are limited to boot image resolution logic, deserialization helpers, and unit tests—not topology-sen...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR contains only standard Go unit tests (*testing.T) without Ginkgo imports or e2e patterns. Custom check does not apply.
No-Weak-Crypto ✅ Passed No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret comparisons found. SHA256 usage in fixtures is legitimate for integrity verification.
Container-Privileges ✅ Passed PR contains no container/K8s workload specs with privilege escalation: 13 Go files have controller logic only, 1 YAML ConfigMap fixture has no Pod/Deployment specs. No privileged, hostPID/Network/I...
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data exposure detected. PR contains status condition messages with public image references and AMI IDs (not secrets), plus error messages about metadata validation with no credentials,...
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jun 9, 2026
@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.83%. Comparing base (7507291) to head (b228747).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8699      +/-   ##
==========================================
+ Coverage   41.79%   41.83%   +0.03%     
==========================================
  Files         759      759              
  Lines       94037    94043       +6     
==========================================
+ Hits        39304    39342      +38     
+ Misses      51983    51946      -37     
- Partials     2750     2755       +5     
Files with missing lines Coverage Δ
hypershift-operator/controllers/nodepool/aws.go 80.17% <100.00%> (+10.14%) ⬆️
...erator/controllers/nodepool/nodepool_controller.go 43.18% <100.00%> (-0.09%) ⬇️
hypershift-operator/controllers/nodepool/token.go 82.59% <100.00%> (+0.05%) ⬆️
Flag Coverage Δ
cmd-support 35.11% <ø> (ø)
cpo-hostedcontrolplane 44.10% <ø> (ø)
cpo-other 43.45% <ø> (ø)
hypershift-operator 52.01% <100.00%> (+0.14%) ⬆️
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jparrill

jparrill commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@jparrill

jparrill commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/test verify

@cwbotbot

cwbotbot commented Jun 9, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hypershift-operator/controllers/nodepool/aws.go (1)

121-125: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard exported ResolveAWSAMI inputs to prevent nil-pointer panics.

ResolveAWSAMI dereferences hostedCluster, nodePool, and releaseImage without validating them first. Since this helper is now exported, bad caller input can crash reconciliation instead of returning a typed error.

Suggested fix
 func ResolveAWSAMI(hostedCluster *hyperv1.HostedCluster, nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
+	if hostedCluster == nil || hostedCluster.Spec.Platform.AWS == nil {
+		return "", fmt.Errorf("hostedCluster or hostedCluster.Spec.Platform.AWS is nil")
+	}
+	if nodePool == nil || nodePool.Spec.Platform.AWS == nil {
+		return "", fmt.Errorf("nodePool or nodePool.Spec.Platform.AWS is nil")
+	}
+	if releaseImage == nil {
+		return "", fmt.Errorf("release image is nil")
+	}
+
 	// TODO: Should the region be included in the NodePool platform information?
 	region := hostedCluster.Spec.Platform.AWS.Region

Also applies to: 137-140

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/aws.go` around lines 121 - 125,
ResolveAWSAMI currently dereferences hostedCluster, nodePool, and releaseImage
unguarded; add nil checks at the start of ResolveAWSAMI to validate those inputs
and return a clear error instead of panicking (e.g., if hostedCluster == nil ||
hostedCluster.Spec.Platform.AWS == nil/empty region, or nodePool == nil, or
releaseImage == nil return formatted errors). Also guard accesses to
nodePool.Spec.Platform.AWS and nodePool.Spec.Arch before using them and ensure
any downstream uses (the code around the later AMI resolution logic referenced
in lines ~137-140) similarly check for nil platform fields and return typed
errors. Ensure the returned error messages include which parameter was nil to
aid callers and reconciliation error handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@hypershift-operator/controllers/nodepool/aws.go`:
- Around line 121-125: ResolveAWSAMI currently dereferences hostedCluster,
nodePool, and releaseImage unguarded; add nil checks at the start of
ResolveAWSAMI to validate those inputs and return a clear error instead of
panicking (e.g., if hostedCluster == nil || hostedCluster.Spec.Platform.AWS ==
nil/empty region, or nodePool == nil, or releaseImage == nil return formatted
errors). Also guard accesses to nodePool.Spec.Platform.AWS and
nodePool.Spec.Arch before using them and ensure any downstream uses (the code
around the later AMI resolution logic referenced in lines ~137-140) similarly
check for nil platform fields and return typed errors. Ensure the returned error
messages include which parameter was nil to aid callers and reconciliation error
handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dcad894a-d232-448c-a323-5c53e80f5f67

📥 Commits

Reviewing files that changed from the base of the PR and between b59120e and ee50909.

📒 Files selected for processing (16)
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/stream.go
  • hypershift-operator/controllers/nodepool/stream_test.go
  • hypershift-operator/controllers/nodepool/token.go
  • support/releaseinfo/deserialize.go
  • support/releaseinfo/deserialize_test.go
  • support/releaseinfo/fixtures/5.0-installer-coreos-bootimages.yaml
  • support/releaseinfo/fixtures/fixtures.go
  • support/releaseinfo/registry_mirror_provider.go
  • support/releaseinfo/registryclient_provider.go
  • support/releaseinfo/releaseinfo.go
  • support/releaseinfo/releaseinfo_test.go
  • test/integration/osstreams/osstreams_test.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • support/releaseinfo/fixtures/fixtures.go
  • support/releaseinfo/registry_mirror_provider.go
  • hypershift-operator/controllers/nodepool/stream.go
  • hypershift-operator/controllers/nodepool/token.go
  • support/releaseinfo/registryclient_provider.go
  • hypershift-operator/controllers/nodepool/stream_test.go
  • support/releaseinfo/releaseinfo_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • test/integration/osstreams/osstreams_test.go
  • support/releaseinfo/deserialize.go

"github.com/openshift/hypershift/support/releaseinfo/fixtures"
)

func TestOSStreamsEndToEnd(t *testing.T) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please let's don't use test/integration. It's a legacy folder.
Also let's name unit tests after the function being tested

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, removed the integration test entirely and dropped the commit. Same as #8669.

if err != nil {
return "", fmt.Errorf("couldn't resolve stream metadata: %w", err)
}
ami, err := defaultNodePoolAMI(region, arch, streamMeta)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the value of changing this sigature? seems cosmetic

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not cosmetic — it's the core decoupling this PR does. Today defaultNodePoolAMI receives *releaseinfo.ReleaseImage and can only resolve from the default stream. By changing it to accept *stream.Stream, we enable the caller to pass any stream — which is what CNTRLPLANE-3553 needs when wiring spec.osImageStream into the resolution path.

Right now all callers pass StreamForName("") (backward compatible), but when the NodePool API field lands, they'll pass StreamForName(nodePool.Spec.OSImageStream.Name) and defaultNodePoolAMI doesn't need to change at all — only the caller does.

}

func resolveAWSAMI(hostedCluster *hyperv1.HostedCluster, nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
func ResolveAWSAMI(hostedCluster *hyperv1.HostedCluster, nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we changing the scope of these funcs? if it's to invoke from another test package that's not a good reason

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the only reason was to call it from the integration test package. Since we removed the integration test (per your feedback on #8669), there's no reason to export it anymore. Reverted to resolveAWSAMI.

}
// Default behavior for Linux/RHCOS AMIs
ami, err := defaultNodePoolAMI(region, arch, releaseImage)
streamMeta, err := releaseImage.StreamForName("")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this value not coming from GetRHELStream?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetRHELStream in resolveAWSAMIGetRHELStream is the policy layer (deciding which stream), this PR is the mechanism layer (making the code able to use any stream). Wiring GetRHELStream here means also threading releaseVersion and usesRunc through the call chain, which is exactly what CNTRLPLANE-3553 does when it adds the osImageStream API field. Pulling that wiring into this PR conflates two concerns. For now, "" gives us the default stream — same behavior as before, just through the new decoupled path.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add TODOs where relevant if you want to follow up in different PRs

}
supported := 0
for _, arch := range supportedArchitectures {
ami, err := defaultNodePoolAMI(region, arch, releaseImage)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we pass the streamName to defaultNodePoolAMI and let the lookup happen inside so it doesn't rely on the caller?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the lookup inside defaultNodePoolAMI — I considered it, but it re-couples the function to ReleaseImage, which is exactly what this PR undoes. With *stream.Stream the function is pure and testable without building a full ReleaseImage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud it's a fair point. If every caller always does the same StreamForName + defaultNodePoolAMI dance, keeping them separate just spreads boilerplate. I'll fold the lookup back into defaultNodePoolAMI so it takes (region, arch, streamName, releaseImage) and does the resolution internally. That way callers just pass the stream name and the function handles the rest atomically.

@sdminonne

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@enxebre

enxebre commented Jun 17, 2026

Copy link
Copy Markdown
Member

/hold
please update the PR desc and the commit to reflect current state

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2026
Refactor defaultNodePoolAMI() to accept a streamName parameter and
resolve stream metadata via ReleaseImage.StreamForName() instead of
reading ReleaseImage.StreamMetadata directly.

For single-stream payloads (OCP < 5.0), StreamForName("") falls back
to StreamMetadata. For multi-stream payloads (>= 5.0), it looks up
the named stream in OSStreams.

Callers pass "" for now (backward compatible). When CNTRLPLANE-3553
wires spec.osImageStream, only the "" argument changes to the resolved
stream name.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@jparrill

Copy link
Copy Markdown
Contributor Author

/hold cancel

Commit msg + PR desc updated

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2026
@jparrill

Copy link
Copy Markdown
Contributor Author

/verified by unit tests

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 17, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@jparrill: This PR has been marked as verified by unit tests.

Details

In response to this:

/verified by unit tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdminonne

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@jparrill

Copy link
Copy Markdown
Contributor Author

/retest-required

@hypershift-jira-solve-ci

Copy link
Copy Markdown

Now I have all the evidence to write the report. The failure is a management cluster (IPI) bootstrap timeout, unrelated to the PR's changes (which are about decoupling AWS AMI resolution for HyperShift hosted clusters). The root cause is a cascade of timing issues during bootstrap initialization.

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced
  • Build ID: 2067266452200624128
  • Target: e2e-kubevirt-aws-ovn-reduced
  • Failed Step: ipi-install-install (pre phase — cluster installation)
  • OCP Version: 4.22.0-0.ci-2026-06-17-021229 (openshift-install 5.0-alpha.0)

Test Failure Analysis

Error

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
Installer exit with code 5

Summary

The management cluster (IPI) installation failed due to a bootstrap timeout — the cluster could not complete initialization within the 45-minute window. The root cause is a cascade of timing failures: the single c5n.metal (bare metal) worker node took approximately 33 minutes to join (FailedScheduling from 16:20 to 16:53), delaying the ingress router deployment. The router then failed its startup probe 1315 times because has-synced and backend-proxy-http checks could not pass while the openshift-apiserver remained unavailable (PreconditionNotReady). Simultaneously, the kube-controller-manager installer pods failed repeatedly due to a missing client-ca configmap. These cascading delays consumed the 45-minute bootstrap window. This is an infrastructure/timing failure unrelated to PR #8699's code changes (AWS AMI resolution decoupling for HyperShift hosted clusters).

Root Cause

Infrastructure bootstrap timeout — cascading control plane initialization delays.

The failure chain:

  1. Worker node provisioning delay (~33 min): The cluster uses a single c5n.metal (bare metal) worker node. Bare metal instances have significantly longer provisioning times. The router pod experienced FailedScheduling ("0/3 nodes are available: 3 node(s) had untolerated taint(s)") from cluster creation until ~16:53 UTC, when the worker node ip-10-0-43-180 finally joined and became schedulable.

  2. kube-controller-manager installer failures: The KCM installer pod failed 4 times (initial + 3 retries) with configmaps "client-ca" not found in the openshift-kube-controller-manager namespace. This is a transient race condition during bootstrap where the configmap hasn't been created yet. The KCM eventually recovered at revision 4, but these failures consumed valuable time.

  3. Router startup probe failures (1,315 occurrences): Once the router was scheduled on the worker node, it failed its startup probe 1,315 times with HTTP 500: [-]backend-proxy-http failed and [-]has-synced failed. The router was killed and restarted 11 times due to failed startup probes. The has-synced check requires the Route API (routes.route.openshift.io) to be available, which depends on the openshift-apiserver.

  4. openshift-apiserver stuck at PreconditionNotReady: The openshift-apiserver reported APIServicesAvailable: PreconditionNotReady since 16:34:50Z and never recovered. This prevented the Route API from being registered, blocking both the ingress operator and monitoring operator.

  5. Monitoring operator blocked: The monitoring operator could not create Route objects (the server could not find the requested resource (post routes.route.openshift.io)) because the Route API was unavailable.

  6. Ingress operator crash-looping (11 restarts): The ingress-operator crashed with "failed to wait for route_metrics_controller caches to sync kind source: *v1.Route: timed out waiting for cache to be synced for Kind *v1.Route".

The 45-minute bootstrap window expired at 17:15 UTC with authentication, ingress, monitoring, and openshift-apiserver operators all degraded or unavailable.

This failure is unrelated to PR #8699 — the PR modifies AWS AMI resolution logic for HyperShift hosted clusters, while this failure occurred during the management cluster IPI installation step before any HyperShift-specific code was exercised.

Recommendations
  1. Retry the job — This is a transient infrastructure/timing failure. The c5n.metal provisioning delay and bootstrap race conditions are non-deterministic. A retry is likely to succeed.

  2. No code changes needed — The PR's changes (AWS AMI resolution decoupling) are not exercised during the IPI management cluster installation step that failed. The failure is in the pre phase before any HyperShift test code runs.

  3. Consider filing an infra issue — If this job frequently fails at the install step due to c5n.metal provisioning delays, the CI configuration may benefit from either:

    • Increasing the bootstrap timeout
    • Using a faster-provisioning instance type for the worker node
    • Adding a second worker node for redundancy
Evidence
Evidence Detail
Failed step ipi-install-install (pre phase) — exit code 5 (bootstrap timeout)
Bootstrap timeout 45 minutes expired at 17:15 UTC (started waiting at ~16:30 UTC)
Worker node type c5n.metal (bare metal — long provisioning time)
Worker node join delay ~33 minutes — FailedScheduling from 16:20 to worker ready at ~16:53 UTC
Router startup probe failures 1,315 ProbeError events: backend-proxy-http failed, has-synced failed (HTTP 500)
Router restarts 11 restarts due to failed startup probes
Ingress operator restarts 11 restarts — timed out waiting for cache to be synced for Kind *v1.Route
openshift-apiserver Available=False since 16:34:50Z — APIServicesAvailable: PreconditionNotReady
Monitoring operator routes.route.openshift.io: the server could not find the requested resource
KCM installer failures 4 failures — configmaps "client-ca" not found (eventually recovered at revision 4)
Degraded operators at timeout authentication, ingress, monitoring, openshift-apiserver
hypershift-k8sgpt (post phase) Expected failure — the server doesn't have a resource type "hostedcluster" (cluster never installed)
PR relevance None — PR #8699 modifies HyperShift hosted cluster AMI resolution; failure is in management cluster IPI install

@jparrill

Copy link
Copy Markdown
Contributor Author

/override ci/prow/e2e-kubevirt-aws-ovn-reduced

Systematically failing in all PRs. Overriding it.

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@jparrill: Overrode contexts on behalf of jparrill: ci/prow/e2e-kubevirt-aws-ovn-reduced

Details

In response to this:

/override ci/prow/e2e-kubevirt-aws-ovn-reduced

Systematically failing in all PRs. Overriding it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit c35f662 into openshift:main Jun 17, 2026
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants