OCPBUGS-92182: Use providerSpec.Template in vSphere machineset reconciliation#6234
OCPBUGS-92182: Use providerSpec.Template in vSphere machineset reconciliation#6234djoshy wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@djoshy: This pull request references Jira Issue OCPBUGS-92182, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@djoshy: This pull request references Jira Issue OCPBUGS-92182, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughThe PR adds a helper that resolves the target vSphere template by checking the configured name, then the computed name, and either renames a rollback VM or creates a new template from the OVA when lookup fails. The caller now uses that helper and returns early after creation. ChangesvSphere template resolution
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 262-283: The rollback VM lookup in createTemplateForFailureDomain
treats any oldErr from finder.VirtualMachine as if the VM does not exist, which
can hide real vCenter failures and incorrectly trigger OVA recreation. Update
this branch to only take the “no existing template” path for a
*find.NotFoundError, matching the computed-name lookup behavior, and return
other errors immediately so transient lookup failures are surfaced instead of
falling back to createNewVMTemplateWithNameForFailureDomain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e445c5ba-c20b-48d5-9b12-d660d0468c8a
📒 Files selected for processing (1)
pkg/controller/bootimage/vsphere_helpers.go
|
/retest-required |
1 similar comment
|
/retest-required |
|
@djoshy surprised that there are no unit tests |
jcpowermac
left a comment
There was a problem hiding this comment.
Review by Claude (AI Reviewer)
This review was performed by Claude Code (claude-opus-4-6) on behalf of @jcpowermac. I am the reviewer.
The refactor into resolveExistingTemplateVM is clean — extracting the resolution logic reduces cyclomatic complexity and the errors.As fix correctly handles govmomi error types. The create-from-OVA path for new failure domains fills the missing gap.
However there is a critical issue with what happens after the providerSpec.Template lookup succeeds — the caller still returns the computed name, which will patch the MachineSet to point at a VM that doesn't exist. See inline comments for details.
Findings summary:
- 🔴 Critical: providerSpec.Template lookup succeeds but caller overwrites with non-existent computed name
- 🟡 Silent error swallowing on providerSpec.Template lookup failure
- 🟡 Hardcoded
"thin"disk type for new templates - ⚪ Typo in doc comment
| // already has a valid template doesn't fail just because the infra computed name isn't there yet. | ||
| if providerSpec.Template != "" && providerSpec.Template != name { | ||
| if tmplVM, tmplErr := finder.VirtualMachine(ctx, providerSpec.Template); tmplErr == nil { | ||
| return tmplVM, false, nil |
There was a problem hiding this comment.
[Claude Review — Critical] When this branch succeeds and returns the VM found via providerSpec.Template, the caller in createNewVMTemplate still has name set to the computed {infraID}-rhcos-{failureDomainName}. The caller proceeds to check the RHCOS version and at line 621-623 does:
if providerSpec.Template != name {
return name, true, nil
}This patches the MachineSet's template to the computed name — but no VM with that name exists in vSphere. The MachineSet gets updated to reference a non-existent template, which will cause the same failure on the next reconciliation.
The fix needs to either:
- Return the resolved name alongside the VM (e.g. add a
resolvedName stringreturn), so the caller uses the name that actually matched, or - Have the caller skip the
providerSpec.Template != namedivergence check when the template was resolved via providerSpec.
| if providerSpec.Template != "" && providerSpec.Template != name { | ||
| if tmplVM, tmplErr := finder.VirtualMachine(ctx, providerSpec.Template); tmplErr == nil { | ||
| return tmplVM, false, nil | ||
| } |
There was a problem hiding this comment.
[Claude Review] If finder.VirtualMachine fails with a non-NotFoundError (e.g., permission denied, network timeout), the error is silently swallowed and we fall through to the computed name lookup. This could mask real infrastructure issues and make debugging harder. Consider at least logging:
klog.Warningf("Failed to look up providerSpec.Template %s, falling back to computed name: %v", providerSpec.Template, tmplErr)There was a problem hiding this comment.
This is valid.
vCenter won't always be available (upgrade - maintenance, etc.)
There was a problem hiding this comment.
Updated with a log line.
| return nil, false, fmt.Errorf("failed to download %s: %w", ova.Location, ovaErr) | ||
| } | ||
| if createErr := createNewVMTemplateWithNameForFailureDomain(ctx, providerSpec, failureDomain, finder, client, tagManager, name, ovaPath, infraID, "thin"); createErr != nil { | ||
| return nil, false, createErr |
There was a problem hiding this comment.
[Claude Review] The existing update path uses getDiskTypeFromExistingVM(vmMo) to preserve the disk type from the current template. Hardcoding "thin" here for new failure domains may not match the cluster's convention (e.g., thick or eagerZeroedThick). Might be fine as a default but worth either a comment explaining why, or reading the disk type from a sibling failure domain's existing template if one was already processed in this loop.
There was a problem hiding this comment.
I suppose in this scenario no template exists anywhere. I am unsure if we should select disk type for a user.
There was a problem hiding this comment.
Yeah, should we just error out instead of creating a new VM?
There was a problem hiding this comment.
[Claude Review] I checked the API types — disk type is not persisted anywhere in the OpenShift API. It is not in VSphereMachineProviderSpec, Infrastructure spec, or the failure domain topology. The install-config diskType is consumed by the installer at template creation time but never written to a cluster object that MCO can read.
So getDiskTypeFromExistingVM() is the only way to determine disk type at runtime — by inspecting an existing VM backing. For a brand new failure domain with no template, the options are:
- Read it from a sibling failure domain template — createNewVMTemplate loops over all failure domains, so if one was already resolved earlier in the loop with an existing VM, its disk type could be captured and reused here. This preserves whatever the installer originally set.
- Error out rather than guessing — force the user to ensure the template exists before adding the failure domain, since MCO cannot know the intended disk type.
Given that picking the wrong disk type could have real storage/performance implications and MCO should not be making that choice for the user, erroring out (option 2) with a clear message seems safer. Alternatively, option 1 is reasonable if we can assume all failure domains in a cluster use the same disk type.
There was a problem hiding this comment.
@djoshy I am concerned about erroring, I have seen issues in testing when mco goes degraded for the entire cluster, other operators themselves become degraded as a result - I need to investigate this further.
When a new failure domain is added, the controller always looked up the template by the computed name, ignoring providerSpec.Template, causing reconciliation to fail. Fix by checking providerSpec.Template first, using errors.As for NotFoundError, and creating the template from OVA when none exists.
It probably wasn't trivial to unit test govmomi calls without additional vendors when it was originally written. E2Es are required for feature promotion, so we probably decided to prioritize those instead. Although at this point, the amount of vSphere carveouts in the boot image controller warrants it 😅 ....but I'd rather not block this bug fix on that. |
|
/test ? |
|
/test e2e-vsphere e2e-vsphere-ovn-zones e2e-vsphere-ovn-upi |
|
@djoshy: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
When a new failure domain is added, the controller always looked up the template by the computed name, ignoring
providerSpec.Template, causing reconciliation to fail. Fix by checking providerSpec.Template first, using errors.As forNotFoundError, and creating the template from OVA when none exists. I broke out the resolution step into a separate function to reduce the cyclomatic complexity as the lint check failed(which is fair 😄).- How to verify it
A
machinesetwith a non standard vsphere template name(not matching the old name the MCO compute from the infra object) should be successfully reconciled.Existing vSphere boot image e2es should also continue to pass.
Summary by CodeRabbit