Skip to content

CNTRLPLANE-3553: Wire osImageStream into NodePool controller (hash, token, status, validation)#8730

Open
sdminonne wants to merge 3 commits into
openshift:mainfrom
sdminonne:CNTRLPLANE-3553-osImageStream-controller
Open

CNTRLPLANE-3553: Wire osImageStream into NodePool controller (hash, token, status, validation)#8730
sdminonne wants to merge 3 commits into
openshift:mainfrom
sdminonne:CNTRLPLANE-3553-osImageStream-controller

Conversation

@sdminonne

@sdminonne sdminonne commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Wire spec.osImageStream and status.osImageStream into the NodePool
reconciliation loop: validation, config hash (triggering rollouts on
stream change), token secret propagation, observation-based status
reporting, and boot image function parameter threading.

Aligns with the dual-stream RHEL NodePool enhancement
Phase 2 controller plumbing.

Dependencies

Changes

Stream resolution (stream.go, osstream.go)

  • GetRHELStream() returns:
    • The explicit spec.osImageStream.name when set (with validation)
    • "rhel-10" when unset and release >= 5.0
    • "rhel-9" when unset and release < 5.0 (always returns a concrete
      stream name so that downstream consumers like StreamForName() work
      correctly when legacy StreamMetadata is removed)
  • getRHELStream() wrapper in osstream.go extracts spec and version
    from NodePool/ReleaseImage and delegates to GetRHELStream()
  • validateOSImageStream() delegates to GetRHELStream() for
    consistent validation (no duplicate resolution logic)

Validation (conditions.go)

  • osImageStream validation runs before NewConfigGenerator inside
    validMachineConfigCondition for fail-fast on invalid stream names
    before expensive MCO config generation

Config hash integration (config.go)

  • Add rhelStream to rolloutConfig, Hash(), and HashWithoutVersion()
  • Normalize rhelStream in NewConfigGenerator: when the explicit
    spec.osImageStream.name matches the version-derived default (e.g.
    "rhel-9" on a 4.x release), it is kept as "" so the hash doesn't
    change and no spurious rollout is triggered. Only a non-default stream
    produces a different hash
  • Keep resolvedRHELStream on ConfigGenerator (not rolloutConfig)
    for downstream consumers that need a concrete stream name (GCP, AWS
    AMI, token secret)

Token secret (token.go)

  • Write os-stream key into the token secret for future ignition-server
    consumption (TODO: not yet read downstream)

Status observation (version.go)

  • rhcosStreamFromOSImage() parses Machine NodeInfo.OSImage to
    determine RHEL generation (RHCOS 4xx → rhel-9, 5xx → rhel-10)
  • osImageStreamFromMachines() determines the observed stream using
    strict majority consensus (count > total/2) across the pool
  • setOSImageStreamStatus() sets status.osImageStream when a
    majority of machines report a consistent stream, aligning with the
    enhancement's observation-based status design

Boot image threading (aws.go, gcp.go)

  • Thread resolved stream through defaultNodePoolAMI,
    defaultNodePoolGCPImage, setAWSConditions, and their callers via
    getRHELStream() for consistent boot image resolution

What's not yet implemented (future work)

  • usesRunc guard (explicit rhel-10 + runc → error, implicit >=5.0 + runc → fallback to rhel-9)
  • Multi-stream boot image metadata parsing (CNTRLPLANE-3553)
  • Ignition server plumbing (OSImageStream CR generation in GetPayload())
  • StreamMetadata / OSStreams naming consolidation in ReleaseImage struct

Test plan

  • TestGetRHELStream — covers explicit stream, 4.x/5.x/6.x defaults, unparsable version
  • TestValidateOSImageStream — covers empty, valid, and invalid stream names
  • TestOSImageStreamHashStability — covers default stream producing same hash, non-default producing different hash
  • TestHash / TestHashWithoutVersion — covers non-default stream changing the hash
  • TestRhcosStreamFromOSImage — covers RHCOS 4xx/5xx parsing, unknown versions, unrecognised OS
  • TestOsImageStreamFromMachines — covers majority consensus: single replica, all same, majority during upgrade, even split, no NodeInfo, unrecognised OS
  • TestSetOSImageStreamStatus — covers integration with fake client: no machines, all RHEL 9, majority RHEL 10, preserve previous status
  • TestReconcileMachineDeploymentStatus — covers config annotation update
  • make verify passes clean

🤖 Generated with Claude Code

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 12, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 12, 2026

Copy link
Copy Markdown

@sdminonne: This pull request references CNTRLPLANE-3553 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Wire spec.osImageStream into the NodePool reconciliation loop: config hash
(triggering rollouts on stream change), token secret propagation, status
reporting, and boot image function parameter threading.

DRAFT — depends on #8669 and #8719 merging first.
This PR will be rebased and conflict-resolved after those land.
Opening early for design review on the config hash, token secret,
and boot image threading approach.

Dependencies

This PR overlaps with both on condition constants and stream resolution logic.
After they merge, the overlapping pieces (osstream.go condition constants,
validOSImageStreamCondition allowlist) will be dropped and replaced with
the implementations from those PRs.

What this PR adds (unique to this PR)

  • Config hash integration (config.go): add rhelStream to rolloutConfig,
    Hash(), and HashWithoutVersion() — backward compatible when empty
  • Token secret (token.go): write os-stream key for downstream ignition
  • Status propagation (capi.go): set status.osImageStream on rollout
    completion in both Replace (MachineDeployment) and InPlace (MachineSet) paths
  • Boot image threading: add rhelStream parameter to defaultNodePoolAMI,
    defaultNodePoolGCPImage, and all callers (aws.go, gcp.go, token.go).
    Parameter is accepted but not yet used — marked with TODO([CNTRLPLANE-3553](https://redhat.atlassian.net/browse/CNTRLPLANE-3553))
    for when StreamForName() from CNTRLPLANE-3552: Multi-stream CoreOS metadata parsing and stream resolution #8669 is wired in

What overlaps (will be resolved on rebase)

Test plan

  • go test ./hypershift-operator/controllers/nodepool/... -count=1 — all pass
  • make lint-fix — 0 issues
  • go build ./... — full repo builds
  • CI will fail until dependencies merge — expected for draft

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR propagates a resolved RHEL OS image stream (rhelStream) through the full NodePool reconciliation stack. Two new API constants (OSImageStreamRHEL9, OSImageStreamRHEL10) are added alongside a CEL immutability rule for osImageStream. The defaultRHELStream, getRHELStream, and validateOSImageStream helpers are rewritten to delegate to the shared GetRHELStream helper. The resolved stream is threaded into AWS and GCP image resolution functions, written into token secrets under a new TokenSecretOSStreamKey constant, included in rollout hash computation (normalized to avoid spurious changes when the default is explicitly set), and used to set NodePool.Status.OSImageStream upon MachineDeployment/MachineSet rollout completion. An early validateOSImageStream check is added to validMachineConfigCondition.

Possibly related PRs

  • openshift/hypershift#8669: Introduced the GetRHELStream shared helper and constants that this PR now delegates to across defaultRHELStream, getRHELStream, and validateOSImageStream.
  • openshift/hypershift#8699: Reworks AWS NodePool AMI resolution flow in defaultNodePoolAMI/resolveAWSAMI/setAWSConditions, directly overlapping with this PR's changes to those same call paths.

Suggested reviewers

  • sjenning
  • devguyio
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Assertions in osstream_test.go lack meaningful failure messages (e.g., g.Expect(err).ToNot(HaveOccurred()) with no message), inconsistent with quality patterns seen in config_test.go where assert... Add failure messages to all Expect() assertions in osstream_test.go, e.g., g.Expect(err).ToNot(HaveOccurred(), "failed to get RHEL stream for test case: %s", tc.name).
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: wiring osImageStream into the NodePool controller across hash, token, status, and validation components.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the PR are stable and deterministic. No dynamic values (UUIDs, timestamps, pod names, namespaces, etc.) appear in test titles. Tests use static string fields from table-driven tes...
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces no scheduling constraints. Changes are limited to OS image stream threading through boot image resolution, config hashing, status propagation, and validation—not affecting Pod specs,...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds only standard Go unit tests (osstream_test.go with table-driven tests), not Ginkgo e2e tests. Custom check only applies to Ginkgo e2e tests; therefore not applicable.
No-Weak-Crypto ✅ Passed PR uses FNV-1a hashing for config tracking and contains no weak crypto (MD5, SHA1, DES, RC4, etc.), custom crypto implementations, or insecure secret comparisons.
Container-Privileges ✅ Passed PR contains no Kubernetes or container manifests (YAML/JSON/Dockerfile). All 14+ changed files are Go source code implementing controller logic; no privileged container configurations present.
No-Sensitive-Data-In-Logs ✅ Passed PR adds logging for osImageStream validation errors and writes stream names to token secrets. All logged/stored data are public OCP version identifiers (rhel-9, rhel-10) and public release versions...
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform and removed do-not-merge/needs-area labels Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 74.61538% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.29%. Comparing base (adccbd6) to head (e5e4f1d).
⚠️ Report is 139 commits behind head on main.

Files with missing lines Patch % Lines
...rshift-operator/controllers/nodepool/conditions.go 0.00% 10 Missing ⚠️
hypershift-operator/controllers/nodepool/config.go 57.14% 6 Missing and 3 partials ⚠️
...erator/controllers/nodepool/nodepool_controller.go 14.28% 4 Missing and 2 partials ⚠️
...pershift-operator/controllers/nodepool/osstream.go 85.00% 2 Missing and 1 partial ⚠️
...ypershift-operator/controllers/nodepool/version.go 92.68% 2 Missing and 1 partial ⚠️
hypershift-operator/controllers/nodepool/gcp.go 75.00% 1 Missing ⚠️
hypershift-operator/controllers/nodepool/token.go 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8730      +/-   ##
==========================================
+ Coverage   41.86%   42.29%   +0.43%     
==========================================
  Files         759      771      +12     
  Lines       94101    97129    +3028     
==========================================
+ Hits        39392    41081    +1689     
- Misses      51949    53179    +1230     
- Partials     2760     2869     +109     
Files with missing lines Coverage Δ
hypershift-operator/controllers/nodepool/aws.go 80.83% <100.00%> (+0.66%) ⬆️
hypershift-operator/controllers/nodepool/stream.go 100.00% <100.00%> (ø)
hypershift-operator/controllers/nodepool/gcp.go 66.21% <75.00%> (-0.30%) ⬇️
hypershift-operator/controllers/nodepool/token.go 82.70% <83.33%> (+0.10%) ⬆️
...pershift-operator/controllers/nodepool/osstream.go 85.00% <85.00%> (ø)
...ypershift-operator/controllers/nodepool/version.go 94.11% <92.68%> (-0.97%) ⬇️
...erator/controllers/nodepool/nodepool_controller.go 43.37% <14.28%> (+0.19%) ⬆️
hypershift-operator/controllers/nodepool/config.go 82.91% <57.14%> (-2.61%) ⬇️
...rshift-operator/controllers/nodepool/conditions.go 53.27% <0.00%> (-0.66%) ⬇️

... and 62 files with indirect coverage changes

Flag Coverage Δ
cmd-support 35.63% <ø> (+0.50%) ⬆️
cpo-hostedcontrolplane 45.31% <ø> (+1.16%) ⬆️
cpo-other 45.10% <ø> (+1.64%) ⬆️
hypershift-operator 50.74% <74.61%> (-1.32%) ⬇️
other 31.69% <ø> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hypershift-operator/controllers/nodepool/config.go (1)

88-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Thread the resolved OS stream, not the raw spec field.

getRHELStream() is the contract for this feature: when spec.osImageStream is unset, the controller derives rhel-9/rhel-10 from the release. These sites currently cache/forward nodePool.Spec.OSImageStream.Name directly, so the value collapses to "" for defaulted NodePools. That leaves the hash path out of sync with status.osImageStream today, and it will feed the wrong stream into AWS/GCP image selection as soon as the dependent multi-stream metadata work starts using rhelStream after rebase.

Compute the effective stream once from getRHELStream(nodePool, releaseImage) and thread that value through rolloutConfig, CAPI, and the platform validation helpers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/config.go` around lines 88 - 94,
Compute the effective RHEL stream once by calling getRHELStream(nodePool,
releaseImage) and pass that computed value into rolloutConfig.rhelStream instead
of using nodePool.Spec.OSImageStream.Name; also propagate the same computed
value into any CAPI-related code paths and platform validation helper calls that
previously read spec.osImageStream so all consumers use the resolved stream
(e.g., update usages in rolloutConfig initialization, CAPI helpers, and platform
validation functions to accept/consume the new resolved rhelStream variable).
🧹 Nitpick comments (2)
hypershift-operator/controllers/nodepool/osstream_test.go (1)

17-80: ⚡ Quick win

Add test coverage for getRHELStream error path.

TestGetRHELStream only covers success cases. Add a test case for when releaseImage.Version() returns an invalid semver string, to verify that getRHELStream returns an error as expected (Line 26-29 in osstream.go handles this, but it's untested).

📋 Proposed test case for invalid semver
 		expectedStream: "rhel-10",
 	},
+	{
+		name: "When spec.osImageStream.Name is empty and version is invalid, it should return error",
+		nodePool: &hyperv1.NodePool{
+			Spec: hyperv1.NodePoolSpec{},
+		},
+		releaseImage: &releaseinfo.ReleaseImage{
+			ImageStream: &imageapi.ImageStream{ObjectMeta: metav1.ObjectMeta{Name: "invalid-version"}},
+		},
+		expectErr: true,
+	},
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/osstream_test.go` around lines 17 -
80, TestGetRHELStream lacks coverage for the error path in getRHELStream when
releaseImage.Version() yields an invalid semver; add a new test case in
TestGetRHELStream that supplies a releaseinfo.ReleaseImage whose
ImageStream.ObjectMeta.Name is an invalid version string (e.g., "not-a-semver"),
call getRHELStream(nodePool, releaseImage) and assert that an error is returned
(Expect(err).To(HaveOccurred())); reference the existing test harness and types
(TestGetRHELStream, getRHELStream, releaseinfo.ReleaseImage,
imageapi.ImageStream/ObjectMeta.Name) so the new case is consistent with the
other table-driven cases.
hypershift-operator/controllers/nodepool/nodepool_controller.go (1)

407-409: ⚡ Quick win

Log or surface getRHELStream errors in all three status-update paths.

In nodepool_controller.go (Line 407-409), capi.go reconcileMachineDeploymentStatus (Line 617-619), and capi.go reconcileMachineSet (Line 1018-1020), errors from getRHELStream are silently ignored with if err == nil guards. The shared root cause is missing error visibility: when OS stream resolution fails (e.g., invalid semver in release version), Status.OSImageStream remains unset or stale and no log or condition surfaces the issue to the user. As per coding guidelines, "Always check errors in Go — don't ignore them." Add logging or set a condition in each path to surface failures.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/nodepool_controller.go` around lines
407 - 409, getRHELStream errors are being ignored in three places
(nodepool_controller.go when setting nodePool.Status.OSImageStream, capi.go
reconcileMachineDeploymentStatus, and capi.go reconcileMachineSet); update each
location to handle the error path: when getRHELStream returns an error, log the
error with the controller logger (including err.Error()) and set a clear status
condition (e.g., "OSImageStreamResolutionFailed") on the owning object with the
error message so users see the failure, and when getRHELStream succeeds clear
that condition and set Status.OSImageStream as before; reference the
getRHELStream call sites and the Status.OSImageStream update points and ensure
status updates are persisted.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/osstream.go`:
- Around line 21-35: The getRHELStream function dereferences nodePool
(nodePool.Spec.OSImageStream.Name) and releaseImage (releaseImage.Version())
without nil checks; add defensive nil guards at the top of getRHELStream to
return a clear error if nodePool or releaseImage is nil, and also guard
nodePool.Spec or nodePool.Spec.OSImageStream as needed before accessing Name to
avoid panics, then proceed with semver.Parse on releaseImage.Version() only when
releaseImage is non-nil.

---

Outside diff comments:
In `@hypershift-operator/controllers/nodepool/config.go`:
- Around line 88-94: Compute the effective RHEL stream once by calling
getRHELStream(nodePool, releaseImage) and pass that computed value into
rolloutConfig.rhelStream instead of using nodePool.Spec.OSImageStream.Name; also
propagate the same computed value into any CAPI-related code paths and platform
validation helper calls that previously read spec.osImageStream so all consumers
use the resolved stream (e.g., update usages in rolloutConfig initialization,
CAPI helpers, and platform validation functions to accept/consume the new
resolved rhelStream variable).

---

Nitpick comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 407-409: getRHELStream errors are being ignored in three places
(nodepool_controller.go when setting nodePool.Status.OSImageStream, capi.go
reconcileMachineDeploymentStatus, and capi.go reconcileMachineSet); update each
location to handle the error path: when getRHELStream returns an error, log the
error with the controller logger (including err.Error()) and set a clear status
condition (e.g., "OSImageStreamResolutionFailed") on the owning object with the
error message so users see the failure, and when getRHELStream succeeds clear
that condition and set Status.OSImageStream as before; reference the
getRHELStream call sites and the Status.OSImageStream update points and ensure
status updates are persisted.

In `@hypershift-operator/controllers/nodepool/osstream_test.go`:
- Around line 17-80: TestGetRHELStream lacks coverage for the error path in
getRHELStream when releaseImage.Version() yields an invalid semver; add a new
test case in TestGetRHELStream that supplies a releaseinfo.ReleaseImage whose
ImageStream.ObjectMeta.Name is an invalid version string (e.g., "not-a-semver"),
call getRHELStream(nodePool, releaseImage) and assert that an error is returned
(Expect(err).To(HaveOccurred())); reference the existing test harness and types
(TestGetRHELStream, getRHELStream, releaseinfo.ReleaseImage,
imageapi.ImageStream/ObjectMeta.Name) so the new case is consistent with the
other table-driven cases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e87df744-44ee-4acb-9a91-4746ce49edd8

📥 Commits

Reviewing files that changed from the base of the PR and between 39f04eb and afe4b10.

⛔ Files ignored due to path filters (1)
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (13)
  • api/hypershift/v1beta1/nodepool_conditions.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/osstream.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/token.go
  • hypershift-operator/controllers/nodepool/token_test.go

Comment on lines +21 to +35
func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
if nodePool.Spec.OSImageStream.Name != "" {
return nodePool.Spec.OSImageStream.Name, nil
}

version, err := semver.Parse(releaseImage.Version())
if err != nil {
return "", fmt.Errorf("failed to parse release image version %q: %w", releaseImage.Version(), err)
}

if version.Major >= 5 {
return "rhel-10", nil
}
return "rhel-9", nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add nil-safety guards to prevent panics.

getRHELStream dereferences nodePool (Line 22) and releaseImage (Line 26) without nil checks. If either parameter is nil, the function will panic. Current callers appear to pass valid pointers, but defensive validation would prevent runtime panics if the contract changes or a new caller is added.

🛡️ Proposed fix to add nil guards
 func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
+	if nodePool == nil {
+		return "", fmt.Errorf("nodePool cannot be nil")
+	}
+	if releaseImage == nil {
+		return "", fmt.Errorf("releaseImage cannot be nil")
+	}
 	if nodePool.Spec.OSImageStream.Name != "" {
 		return nodePool.Spec.OSImageStream.Name, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/osstream.go` around lines 21 - 35,
The getRHELStream function dereferences nodePool
(nodePool.Spec.OSImageStream.Name) and releaseImage (releaseImage.Version())
without nil checks; add defensive nil guards at the top of getRHELStream to
return a clear error if nodePool or releaseImage is nil, and also guard
nodePool.Spec or nodePool.Spec.OSImageStream as needed before accessing Name to
avoid panics, then proceed with semver.Parse on releaseImage.Version() only when
releaseImage is non-nil.

@sdminonne sdminonne force-pushed the CNTRLPLANE-3553-osImageStream-controller branch from 03a5cef to 8e027da Compare June 17, 2026 07:40
@sdminonne sdminonne changed the title CNTRLPLANE-3553: Wire osImageStream into NodePool controller (config hash, token, boot image) CNTRLPLANE-3553: Wire osImageStream into NodePool controller (hash, token, status, validation) Jun 17, 2026
@openshift-ci openshift-ci Bot added the area/cli Indicates the PR includes changes for CLI label Jun 17, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
api/hypershift/v1beta1/nodepool_types.go (1)

51-60: 💤 Low value

Consider nil check for defensive safety.

The function dereferences nodePool at Line 52 without a nil check. While current callers in context snippets (conditions.go:390) pass valid pointers, adding a guard would prevent panics if a new caller is added or the contract changes.

🛡️ Optional defensive nil guard
 func validateOSImageStream(nodePool *hyperv1.NodePool) error {
+	if nodePool == nil {
+		return fmt.Errorf("nodePool cannot be nil")
+	}
 	name := nodePool.Spec.OSImageStream.Name
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/hypershift/v1beta1/nodepool_types.go` around lines 51 - 60, Locate the
function that dereferences the nodePool parameter (likely in the
nodepool_types.go file around the specified line range) and add a nil check
guard before the dereference operation. The check should verify that nodePool is
not nil before attempting to access its fields or methods, returning an
appropriate error or early exit if the pointer is nil. This defensive check
prevents potential panics if the function contract changes or new callers pass
nil values in the future.
hypershift-operator/controllers/nodepool/capi_test.go (2)

2771-2773: ⚡ Quick win

Add nil-safety to OSImageStream assertion.

The assertion at line 2772 directly accesses nodePool.Status.OSImageStream.Name without first checking if nodePool.Status.OSImageStream is nil. If the status field is unexpectedly nil when expectedOSImageStream is non-empty, this will panic rather than provide a clear test failure message.

Safer assertion pattern
 if tc.expectedOSImageStream != "" {
+    g.Expect(nodePool.Status.OSImageStream).ToNot(BeNil(), "OSImageStream should be set when stream is expected")
     g.Expect(nodePool.Status.OSImageStream.Name).To(Equal(tc.expectedOSImageStream))
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/capi_test.go` around lines 2771 -
2773, In the test assertion block where tc.expectedOSImageStream is checked, add
a nil-safety check before accessing the Name field. When
tc.expectedOSImageStream is non-empty, first verify that
nodePool.Status.OSImageStream is not nil before attempting to access
nodePool.Status.OSImageStream.Name in the Expect call. This will prevent a panic
and provide a clear test failure message if the OSImageStream field is
unexpectedly nil.

2636-2674: ⚡ Quick win

Add test case to verify OSImageStream status population on rollout completion.

The test infrastructure includes an expectedOSImageStream field and conditional assertion (lines 2771-2773), but no test case actually verifies that nodePool.Status.OSImageStream is populated when the MachineDeployment completes. The test case at line 2652 sets expectedOSImageStream: "", which skips the assertion entirely.

According to the layer description, reconcileMachineDeploymentStatus should call getRHELStream and populate nodePool.Status.OSImageStream when the deployment reaches completion. Add a test case that sets a non-empty expectedOSImageStream value and verifies the status field is correctly populated.

Consider also updating the test name at line 2652 to reflect what is actually being verified, or add a separate test case specifically for OSImageStream status population.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/capi_test.go` around lines 2636 -
2674, Add a new test case to the testCases slice in the
TestReconcileMachineDeploymentStatus function that verifies OSImageStream status
population when a MachineDeployment completes. This test case should have a
non-empty expectedOSImageStream value (such as a valid RHEL stream identifier)
and match the completion conditions of the existing test case (all replica
counts at 3, ObservedGeneration matching Generation). This will ensure the
conditional assertion at lines 2771-2773 that checks
nodePool.Status.OSImageStream is actually exercised with a meaningful
verification rather than being skipped due to an empty expected value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 406-408: In the nodepool_controller.go file, the call to
getRHELStream is silently ignoring errors by only handling the success case with
if err == nil. Instead of dropping the error, properly handle failure cases by
either logging the error with context and returning it to trigger
reconciliation, or updating the NodePool status to reflect the error condition.
This ensures that stream-resolution failures are visible and the status remains
synchronized with the actual state rather than remaining stale and unset.

In `@hypershift-operator/controllers/nodepool/token.go`:
- Around line 358-359: The assignment of TokenSecretOSStreamKey to
tokenSecret.Data only executes within a conditional block that checks if
tokenSecret.Data is nil, meaning existing token secrets never get the os-stream
field populated on reconciliation. Move the line
tokenSecret.Data[TokenSecretOSStreamKey] = []byte(t.rhelStream) outside of the
nil-check conditional so it runs on every reconciliation, ensuring all token
secrets have the os-stream field persisted consistently. Additionally, add unit
tests to verify that the os-stream field is set during both initial creation and
subsequent reconciliations, and add e2e tests to validate this behavior impacts
consumer behavior correctly.

---

Nitpick comments:
In `@api/hypershift/v1beta1/nodepool_types.go`:
- Around line 51-60: Locate the function that dereferences the nodePool
parameter (likely in the nodepool_types.go file around the specified line range)
and add a nil check guard before the dereference operation. The check should
verify that nodePool is not nil before attempting to access its fields or
methods, returning an appropriate error or early exit if the pointer is nil.
This defensive check prevents potential panics if the function contract changes
or new callers pass nil values in the future.

In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 2771-2773: In the test assertion block where
tc.expectedOSImageStream is checked, add a nil-safety check before accessing the
Name field. When tc.expectedOSImageStream is non-empty, first verify that
nodePool.Status.OSImageStream is not nil before attempting to access
nodePool.Status.OSImageStream.Name in the Expect call. This will prevent a panic
and provide a clear test failure message if the OSImageStream field is
unexpectedly nil.
- Around line 2636-2674: Add a new test case to the testCases slice in the
TestReconcileMachineDeploymentStatus function that verifies OSImageStream status
population when a MachineDeployment completes. This test case should have a
non-empty expectedOSImageStream value (such as a valid RHEL stream identifier)
and match the completion conditions of the existing test case (all replica
counts at 3, ObservedGeneration matching Generation). This will ensure the
conditional assertion at lines 2771-2773 that checks
nodePool.Status.OSImageStream is actually exercised with a meaningful
verification rather than being skipped due to an empty expected value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8ebdba94-e5e5-4bf0-8e2a-0fa112fff63f

📥 Commits

Reviewing files that changed from the base of the PR and between 03a5cef and 8e027da.

⛔ Files ignored due to path filters (5)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OSStreams.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/featuregated.nodepools.osimagestream.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (16)
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/config_test.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/osstream.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/token.go
  • hypershift-operator/controllers/nodepool/token_test.go
✅ Files skipped from review due to trivial changes (2)
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/aws_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/config_test.go

Comment thread hypershift-operator/controllers/nodepool/nodepool_controller.go Outdated
Comment thread hypershift-operator/controllers/nodepool/token.go Outdated
@sdminonne sdminonne force-pushed the CNTRLPLANE-3553-osImageStream-controller branch from 49e64cb to ebde395 Compare June 17, 2026 08:31
@sdminonne

Copy link
Copy Markdown
Contributor Author

I'll rebase this once #8699 will be merged

@sdminonne sdminonne marked this pull request as ready for review June 17, 2026 13:10
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2026
@openshift-ci openshift-ci Bot requested review from Nirshal and cblecker June 17, 2026 13:11

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 2771-2773: The assertion for nodePool.Status.OSImageStream.Name is
conditionally guarded, which skips verification when tc.expectedOSImageStream is
empty. This allows incorrect non-empty status values to pass undetected in test
cases expecting empty values. Remove the conditional guard `if
tc.expectedOSImageStream != ""` and assert nodePool.Status.OSImageStream.Name
unconditionally against tc.expectedOSImageStream to ensure all test cases
properly verify the expected OS image stream value regardless of whether it is
empty or not.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b312f697-6295-426e-b155-a62af1d22c0d

📥 Commits

Reviewing files that changed from the base of the PR and between 8e027da and ebde395.

⛔ Files ignored due to path filters (5)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OSStreams.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/featuregated.nodepools.osimagestream.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (16)
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/config_test.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/osstream.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/token.go
  • hypershift-operator/controllers/nodepool/token_test.go
🚧 Files skipped from review as they are similar to previous changes (13)
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/token_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/config_test.go
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/token.go

Comment thread hypershift-operator/controllers/nodepool/capi_test.go Outdated

@jparrill jparrill left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped some comments. Thanks!

Note: #8699 (CNTRLPLANE-3026) is still pending merge. Once it lands, this PR will need a rebase — defaultNodePoolAMI signature and behavior change significantly there (streamName becomes functional via StreamForName). The //nolint:unparam and TODOs referencing CNTRLPLANE-3553 for that function will be obsolete after rebase.

We'll do multiple review rounds as the dependency chain settles.

// behavior: no OSImageStream CR is generated and the MCC uses
// BaseOSContainerImage from ControllerConfig as-is).
func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
if nodePool.Spec.OSImageStream.Name != "" {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reimplements GetRHELStream() from stream.go in the same package, but with divergent semantics: defaultRHELStream says major >= 5 → rhel-10 unconditionally, while GetRHELStream handles major >= 5 && usesRunc → rhel-9. These two functions will return different answers for 5.x + runc.

Same with validateOSImageStreamGetRHELStream already validates unknown stream names and rejects invalid combinations (rhel-10 on 4.x, rhel-10 with runc).

Can these delegate to GetRHELStream instead of reimplementing? Having two resolution paths for the same question in the same package is a maintenance hazard — when a new stream is added, both must be updated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Both defaultRHELStream, getRHELStream, and validateOSImageStream now delegate to GetRHELStream from stream.go. The duplicate resolution logic is removed — see 63532a2.

if rhelStream != "" {
defaultStream, err := defaultRHELStream(releaseImage)
if err == nil && rhelStream == defaultStream {
rhelStream = ""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This normalization is the most load-bearing part of the PR — it's what prevents spurious fleet-wide rollouts when a user explicitly sets the default stream. But `TestNewConfigGenerator` has zero test cases that set `nodePool.Spec.OSImageStream`.

Since the plan is to consolidate on `GetRHELStream()` from `stream.go` (see osstream.go comment), the normalization internals will change. But the observable contract should be tested now:

  • Setting `osImageStream.Name` to the version-derived default (e.g. `"rhel-9"` on 4.x) must produce the same hash as not setting it at all (no spurious rollout).
  • Setting a non-default stream (e.g. `"rhel-10"` on 4.x) must produce a different hash (rollout triggered).

These tests verify the hash stability guarantee regardless of which internal function does the resolution.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added TestOSImageStreamHashStability in config_test.go covering both cases:

  • Setting osImageStream.Name to the version-derived default (e.g. "rhel-9" on 4.x) produces the same hash as not setting it (no spurious rollout).
  • Setting a non-default stream (e.g. "rhel-10" on 4.x) produces a different hash (rollout triggered).

The TestConfigHash and TestConfigHashWithoutVersion tables also include rhelStream cases — see 0b8280f and 257394d.

@jparrill

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@jparrill

Copy link
Copy Markdown
Contributor

/retest required

@jparrill

Copy link
Copy Markdown
Contributor

/retest-required

@jparrill

Copy link
Copy Markdown
Contributor

/verified by unit tests

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 29, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@jparrill: This PR has been marked as verified by unit tests.

Details

In response to this:

/verified by unit tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdminonne

Copy link
Copy Markdown
Contributor Author

/retest-required

…geStream is unset

getRHELStream was always resolving to a concrete stream name (e.g.
"rhel-9") even when spec.osImageStream.Name was empty. This caused
StreamForName to use the OSStreams lookup path instead of the legacy
StreamMetadata path, potentially producing a different AMI and
changing the AWSMachineTemplate spec hash. During HyperShift Operator
upgrades this triggered UpdatingPlatformMachineTemplate=True and an
unintended node rollout.

Return "" from getRHELStream when osImageStream.Name is unset so that
all callers (CAPI machine templates, conditions, token secrets) pass
an empty stream to StreamForName, preserving the pre-PR behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdminonne

Copy link
Copy Markdown
Contributor Author

Upgrade test regression: spurious UpdatingPlatformMachineTemplate=True

Root cause: getRHELStream always resolves to a concrete stream name (e.g. "rhel-9") even when spec.osImageStream.Name is empty. This changes the AMI lookup path through StreamForName:

  • Before this PR: defaultNodePoolAMI(region, arch, "", releaseImage)StreamForName("") → returns ReleaseImage.StreamMetadata directly
  • After this PR: defaultNodePoolAMI(region, arch, "rhel-9", releaseImage)StreamForName("rhel-9") → looks up ReleaseImage.OSStreams["rhel-9"], falling back to StreamMetadata only for single-stream payloads

For current single-stream payloads, the fallback produces the same AMI, but the lookup path difference is enough to change the AWSMachineTemplate spec hash in certain conditions, generating a new template name and triggering UpdatingPlatformMachineTemplate=True.

The upgrade test creates a cluster with the old HO (no osImageStream wiring), upgrades to the new HO, and asserts no rollout occurs. The changed template name fails that assertion.

Fix (bbde575199): Return "" from getRHELStream when spec.osImageStream.Name is empty, preserving the legacy StreamForName("") behavior. The concrete stream resolution only kicks in when the user explicitly sets osImageStream. This is consistent with the rhelStream normalization already done in NewConfigGenerator for the config hash — both paths now treat "unset" as "don't change anything".

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Jun 29, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2026
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

New changes are detected. LGTM label has been removed.

@sdminonne

Copy link
Copy Markdown
Contributor Author

/test e2e-aws-upgrade-hypershift-operator

@sdminonne

Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@sdminonne

Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@sdminonne

Copy link
Copy Markdown
Contributor Author

/test e2e-kubevirt-aws-ovn-reduced

@sdminonne

Copy link
Copy Markdown
Contributor Author

/test e2e-aws-4-22

@sdminonne

Copy link
Copy Markdown
Contributor Author

/test e2e-aks-4-22

@jparrill jparrill left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped some comments. Thanks!

// Default behavior for Linux/RHCOS AMIs.
// Use the resolved stream (via GetRHELStream) rather than the raw spec field,
// so that the AMI lookup is consistent with the CAPI path.
rhelStream, err := getRHELStream(nodePool, releaseImage)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new error path (getRHELStream failure -> condition False + return error) is untested. TestSetAWSConditions doesn't have a case with an invalid osImageStream.Name — e.g. rhel-10 on a 4.x release. Worth adding one to make sure the condition message surfaces correctly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread hypershift-operator/controllers/nodepool/conditions.go
tokenSecret.Data[TokenSecretHCConfigurationHashKey] = t.globalConfigHash
// TODO(CNTRLPLANE-3553): consumed by the ignition-server's TokenSecretReconciler once
// multi-stream ignition support lands. Until then this key is written but not read downstream.
tokenSecret.Data[TokenSecretOSStreamKey] = []byte(t.resolvedRHELStream)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This key is written now but consumed downstream later (per the TODO). If someone refactors the tokenSecret.Data == nil init block, the key could be dropped silently — nothing would catch it until the ignition-server starts reading it in a separate PR, likely weeks later. A test asserting tokenSecret.Data["os-stream"] exists with the expected value would catch that at the source. Same applies to the other keys in this block if they're not already tested.

// changes during HyperShift Operator upgrades.
// When the field is explicitly set, it delegates to GetRHELStream for
// version-aware validation and runc constraint checking.
func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor test gap: there's no case for a set osImageStream.Name with an unparsable version (e.g. Name: "rhel-9" + version "not-a-version"). The empty-name + unparsable case is covered (returns "" via short-circuit), but the set-name path through semver.Parse at line 24 would fail and return an error — worth one test case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}{
{
name: "When MachineDeployment is complete, it should update nodePool version and annotations",
name: "When MachineDeployment is complete, it should update nodePool version",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this rename intentional? The test behavior didn't change, just the name. If there's no reason for it, I'd revert to avoid noise in git blame.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Sorry for this, TY!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amending it

…erage

- Add TestSetAWSConditions case for invalid osImageStream (rhel-10 on 4.x)
- Add TODO for integration test on validMachineConfigCondition path
- Assert os-stream key in token secret reconciliation test
- Add getRHELStream test for set name with unparsable version
- Revert unintentional test rename in capi_test.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 30, 2026

Copy link
Copy Markdown

Now I have all the evidence. Both jobs were aborted by the Prow trigger plugin at the exact same time (07:43:27Z). The kubevirt job had a genuine ingress degradation flake on one cluster (while the other cluster passed), and the AKS job was killed mid-execution with all 36 failures being context cancellation cascades. Neither failure is related to the PR's code changes.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

[e2e-kubevirt-aws-ovn-reduced] TestAutoscaling/ValidateHostedCluster (2310s):
  ClusterVersionSucceeding=False: ClusterOperatorDegraded(Cluster operator ingress is degraded)
  ClusterVersionAvailable=False: FromClusterVersion
  wanted most recent version history to have state Completed, has state Partial

[e2e-aks-4-22] 36 test failures — all "context canceled" after SIGTERM:
  "Entrypoint received interrupt: terminated" at 07:43:27Z
  "Aborted by trigger plugin."

Summary

Neither failure is caused by the PR's code changes. Both jobs were externally aborted by the Prow trigger plugin at the exact same timestamp (07:43:27Z), indicating a new push or re-trigger on PR #8730. The kubevirt job had one genuine infrastructure flake — the ingress ClusterOperator degraded on the single-node autoscaling HostedCluster (while the parallel TestCreateCluster on a separate 2-node cluster passed with all checks green). The AKS job's 36 failures are entirely context-cancellation cascades from the abort killing tests mid-execution (upgrades, config rollouts, and scale operations were actively progressing when terminated). The PR's changes (adding osImageStream wiring to config hash, token, CAPI, and validation) are not exercised by these e2e tests since no osImageStream field is set on any test NodePool.

Root Cause

Both jobs: Aborted by Prow trigger pluginprowjob.json for both jobs shows "state": "aborted" with "description": "Aborted by trigger plugin." Both received SIGTERM at the identical timestamp 2026-06-30T07:43:27Z, confirming a simultaneous external kill (new commit push or manual re-trigger on the PR).

e2e-kubevirt-aws-ovn-reduced (2 failures / 47 tests):

  • TestAutoscaling/ValidateHostedCluster timed out after 30 minutes waiting for the HostedCluster autoscaling-rd7bw to complete rollout. The ingress ClusterOperator in the single-node KubeVirt guest cluster became degraded and never recovered, keeping ClusterVersionSucceeding=False and ClusterVersionAvailable=False. The version history remained in Partial state.
  • This is an infrastructure flake specific to the autoscaling cluster's single-node KubeVirt configuration. The parallel TestCreateCluster (2-replica cluster create-cluster-2vd4t) passed all checks — its rollout completed in 4m3s with all conditions healthy. All other HostedCluster conditions (Available, KubeAPIServerAvailable, EtcdAvailable, InfrastructureReady) were True — only ingress was degraded.
  • No panics, compilation errors, or symptom detections (no quota issues, no segfaults) were found.

e2e-aks-4-22 (36 failures / 278 tests):

  • Every single failure is a context canceled error from Go context propagation after the SIGTERM. Tests that were actively making progress include:
    • TestAutoscaling: Scaling from 3→1 nodes after workload deletion, canceled after 4m41s
    • TestUpgradeControlPlane: openshift-apiserver deployment had 2/3 new replicas updated, 15+ components waiting
    • TestNodePool: Rolling upgrade, config updates, in-place updates — all mid-operation
    • TestCreateCluster: Metrics forwarder validation, ingress operator check — both in-progress
  • The 7 Teardown failures are all "Failed to destroy cluster: context canceled" — cleanup couldn't complete after abort.
  • 203 tests passed, 39 were skipped — no genuine assertion failures occurred before the abort.

The PR's code changes are not implicated. The osImageStream wiring (config hash, token secret, CAPI template, validation) is only activated when spec.osImageStream.Name is set on a NodePool. None of the e2e test NodePools set this field, so the new code paths are not exercised in these tests.

Recommendations
  1. Re-run both jobs — Both were aborted externally, not by genuine test failures. A /retest should produce clean results.
  2. No code changes needed — The PR's changes to osImageStream wiring are not exercised by these tests and are not related to the observed failures.
  3. The kubevirt ingress flake is a known pattern — Single-node KubeVirt guest clusters occasionally see ingress operator degradation, likely due to resource constraints on the single worker VM. This is not specific to this PR (the parallel TestCreateCluster with 2 replicas passed cleanly with the same code).
Evidence
Evidence Detail
prowjob.json state (both jobs) "state": "aborted", "description": "Aborted by trigger plugin."
SIGTERM timestamp (both jobs) 2026-06-30T07:43:27Z — identical for both jobs
kubevirt: TestCreateCluster PASSED (1237s) — same PR code, 2-node cluster, all conditions healthy
kubevirt: TestAutoscaling FAILED — ingress degraded on single-node cluster only
kubevirt test stats 45 passed, 2 failed, 0 skipped out of 47 tests
AKS test stats 203 passed, 36 failed, 39 skipped out of 278 tests
AKS failure pattern All 36 failures contain context canceled — no genuine assertion failures
AKS: TestUpgradeControlPlane openshift-apiserver 2/3 replicas updated when killed — active progress
AKS: TestNodePool rolling upgrades UpdatingConfig=True, UpdatingVersion=True — mid-operation
PR code path exercised? No — spec.osImageStream.Name is empty on all test NodePools
kubevirt: symptom JUnit No panics, no segfaults, no quota issues detected

@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

@sdminonne: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-4-22 bbde575 link true /test e2e-aws-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

nodePool *hyperv1.NodePool
controlplaneNamespace string
// resolvedRHELStream is the explicit RHEL stream name resolved via
// getRHELStream. Unlike rhelStream (which may be empty after normalization

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this consumed?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it's in awsMachineTemplateSpec(

this always holds a concrete stream name (e.g.
// "rhel-9", "rhel-10")

this seems to contradict https://github.com/openshift/hypershift/pull/8730/changes#diff-c54291673a20f27874b97852879404ae26450496aeda317cf79561255f0d8f1dR21 which returns empty

"github.com/blang/semver"
)

// getRHELStream returns the effective RHEL stream for the NodePool.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When spec.osImageStream.Name is unset, it returns "" so that callers
// pass an empty stream to StreamForName, preserving the legacy
// StreamMetadata

That doesn't seem right if the nodepool is on >5

ObjectMeta: metav1.ObjectMeta{
Annotations: tc.nodePoolAnnotations,
},
Spec: tc.nodePool,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add test cases for resolveAWSAMI, setKarpenterAMILabels, and gcpMachineTemplateSpec with all possible rhelStream values

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants