OCPBUGS-91745: e2e: fix: clear hugepages before switching kernelPageSize to 4k#1550
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughA new ChangesARM kernelPageSize test refactoring with new profile update helper
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/performanceprofile/functests/14_arm/kernelpagesize.go`:
- Around line 78-80: The condition checking
`*perfProfile.Spec.RealTimeKernel.Enabled` can panic when RealTimeKernel exists
but the Enabled field is nil. In the if statement that checks
`perfProfile.Spec.RealTimeKernel != nil`, replace the direct dereference of
Enabled with `ptr.Deref(perfProfile.Spec.RealTimeKernel.Enabled, false)` to
safely handle the case where Enabled might be nil, returning false as the
default value if it is.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b1fc153b-dd57-47e2-82d6-26d418eb1899
📒 Files selected for processing (1)
test/e2e/performanceprofile/functests/14_arm/kernelpagesize.go
| validKernelPageSize = currentKernelPageSize != kernelPageSizeBytes4k | ||
| if perfProfile.Spec.RealTimeKernel != nil && *perfProfile.Spec.RealTimeKernel.Enabled { | ||
| testlog.Info("Setting RealTimeKernel.Enabled to false") | ||
| perfProfile.Spec.RealTimeKernel = &performancev2.RealTimeKernel{Enabled: ptr.To(false)} |
There was a problem hiding this comment.
Can we move this if conditions to a helper function
like
func resettodefaults(profile *performancev2.PerformanceProfile) bool {
GinkgoHelper()
needsUpdate := false
// KernelPageSize
if profile.Spec.KernelPageSize == nil || *profile.Spec.KernelPageSize != performancev2.KernelPageSize(kernelPageSize4k) {
testlog.Infof("Setting KernelPageSize to %s (was %v)", kernelPageSize4k, getKernelPageSizeValue(profile))
profile.Spec.KernelPageSize = ptr.To(performancev2.KernelPageSize(kernelPageSize4k))
needsUpdate = true
}
.....
return needsupdate
}
There was a problem hiding this comment.
I found that the need for such function is very common so i created a more general one.
Now the BeforeAll + AfterAll and the test are shorter and easier to read.
Tal-or
left a comment
There was a problem hiding this comment.
Please address the existing review comments (including the CodeRabbit one which make sense in this case)
otherwise, LGTM
86a4692 to
8df1d51
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/performanceprofile/functests/utils/profilesupdate/profile_update.go (1)
181-196: ⚡ Quick winConsider distinguishing timeout from fetch errors.
The function returns
falsefor any non-nil error fromPollUntilContextTimeout, conflating two distinct cases: the expected timeout (MC genuinely unchanged) and unexpected fetch errors (transient API issues). If fetching the MachineConfig fails during polling, the test silently skips the rollout wait, which could cause downstream flakiness.♻️ Suggested improvement to fail explicitly on fetch errors
func hasMCResourceVersionChanged(ctx context.Context, profile *performancev2.PerformanceProfile, resourceVersionBefore string) bool { mcName := machineconfig.GetMachineConfigName(profile.Name) err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 60*time.Second, true, func(ctx context.Context) (bool, error) { currentResourceVersion, err := fetchMCResourceVersion(ctx, profile) if err != nil { - return false, err + testlog.Errorf("transient error fetching MachineConfig %q: %v", mcName, err) + return false, nil // retry on transient errors } return currentResourceVersion != resourceVersionBefore, nil }) if err == nil { testlog.Infof("MachineConfig %q resourceVersion changed", mcName) return true } + // Timeout reached without change — expected when profile update doesn't affect MC return false }As per coding guidelines, "Never ignore error returns" — logging transient errors while continuing to poll makes failures diagnosable without aborting on brief API hiccups.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/utils/profilesupdate/profile_update.go` around lines 181 - 196, The hasMCResourceVersionChanged function currently returns false for any error from PollUntilContextTimeout, which masks transient fetch errors from fetchMCResourceVersion. After the poll completes with a non-nil error, check if the error is a context deadline exceeded (which indicates the expected timeout case where the MachineConfig resource version did not change), and only then return false. For any other error from the polling operation, log the error details using testlog.Infof or testlog.Warnf to make diagnostic information available, then return false to avoid aborting the test but allow visibility into what went wrong during the fetch attempt.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@test/e2e/performanceprofile/functests/utils/profilesupdate/profile_update.go`:
- Around line 181-196: The hasMCResourceVersionChanged function currently
returns false for any error from PollUntilContextTimeout, which masks transient
fetch errors from fetchMCResourceVersion. After the poll completes with a
non-nil error, check if the error is a context deadline exceeded (which
indicates the expected timeout case where the MachineConfig resource version did
not change), and only then return false. For any other error from the polling
operation, log the error details using testlog.Infof or testlog.Warnf to make
diagnostic information available, then return false to avoid aborting the test
but allow visibility into what went wrong during the fetch attempt.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 36a7b1aa-fde9-4e2d-9ed2-6f00a3130da0
📒 Files selected for processing (2)
test/e2e/performanceprofile/functests/14_arm/kernelpagesize.gotest/e2e/performanceprofile/functests/utils/profilesupdate/profile_update.go
|
/retest |
8df1d51 to
fc1ac89
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/performanceprofile/functests/14_arm/kernelpagesize.go`:
- Around line 65-70: The `initialProfile` variable can be nil if the ARM check
in `BeforeAll` calls Skip(), but Ginkgo v2 still executes `AfterAll` afterward,
causing a nil pointer dereference when `ApplyProfileIfChanged` tries to access
the Spec field of the nil `initialProfile`. Fix this by adding a nil check guard
before calling `ApplyProfileIfChanged(ctx, profile, initialProfile)` in the
`AfterAll` block, so the rollback operation only proceeds if `initialProfile`
was successfully initialized and is not nil.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 385bbf23-f2fa-483e-9837-56a83349721e
📒 Files selected for processing (2)
test/e2e/performanceprofile/functests/14_arm/kernelpagesize.gotest/e2e/performanceprofile/functests/utils/profilesupdate/profile_update.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/performanceprofile/functests/utils/profilesupdate/profile_update.go
fc1ac89 to
22b4db1
Compare
BeforeAll set kernelPageSize=4k without clearing hugepages (512M is default on arm ci cluster), causing the performance admission webhook to reject the incompatible combination on aarch64. Extract common 3-step profile update pattern (UpdateWithRetry + WaitForTuningUpdating + WaitForTuningUpdated) into ApplyProfileAndWait helper in profilesupdate package. - Declarative baseline in BeforeAll/AfterAll with DeepEqual guard - Fix error check ordering for GetByNodeLabels - Remove unused imports and variables Ref: CNF-25175
22b4db1 to
7085ded
Compare
|
|
||
| // ApplyProfileAndWait updates the profile and waits for the MCP rollout (or nodepool on hypershift) to complete. | ||
| // Replaces the common 3-step pattern: UpdateWithRetry + WaitForTuningUpdating + WaitForTuningUpdated. | ||
| func ApplyProfileAndWait(ctx context.Context, profile *performancev2.PerformanceProfile) { |
There was a problem hiding this comment.
I would suggest to add another wrapper
ApplyProfileIfNeededAndWait(ctx context.Context, initialProfile, profile *performancev2.PerformanceProfile) {
if equality.Semantic.DeepEqual(profile.Spec, initialProfile.Spec) {
return
}
ApplyProfileAndWait(ctx, profile)
}
can be done on separate PR if you're planning major refactoring.
There was a problem hiding this comment.
Yes, will add that exact function when refactoring.
|
/retest |
|
@oblau: This pull request references Jira Issue OCPBUGS-91745, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@oblau: This pull request references Jira Issue OCPBUGS-91745, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified by oblau |
|
@oblau: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@oblau: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: oblau, Tal-or, yanirq The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@oblau: Jira Issue Verification Checks: Jira Issue OCPBUGS-91745 Jira Issue OCPBUGS-91745 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry pick release-4.22 |
|
/cherry-pick release-4.22 |
|
@oblau: new pull request created: #1554 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Fix included in release 5.0.0-0.nightly-2026-06-25-122140 |
BeforeAll set kernelPageSize=4k without clearing hugepages (512M default
on arm CI cluster), causing the admission webhook to reject the
incompatible combination on aarch64.
hugepages, disable RT; apply only if spec differs (DeepEqual guard)
Summary by CodeRabbit
kernelPageSizeend-to-end coverage by standardizing baseline preparation and making the test skip non-ARM nodes.