OCPBUGS-78154: Use HA leader election defaults for MCO on SNO#5764
OCPBUGS-78154: Use HA leader election defaults for MCO on SNO#5764lucaconsalvi wants to merge 5 commits intoopenshift:mainfrom
Conversation
|
Skipping CI for Draft Pull Request. |
|
@lucaconsalvi: This pull request references Jira Issue OCPBUGS-78154, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughAdds a new function Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 7 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (7 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.3)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: lucaconsalvi The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Refactor GetLeaderElectionConfig to accept a flag that allows callers to use HA default leader election timings regardless of cluster topology. All MCO components (MCC, MCO, MOB) pass true to avoid inflated SNO lease timings that cause ~5.5min delay after node reboot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/common/helpers.go`:
- Around line 72-74: When useDefaultTimings is true the code returns
defaultLeaderElection silently; add an informational log just before that return
to record that HA default timings are being forced regardless of topology (e.g.,
log.Printf or the project's logger: log.Infof/klog.Infof) and include the value
being returned (defaultLeaderElection) and the flag name useDefaultTimings so
operators can see this decision in logs; update imports if necessary and place
the log immediately above the return in the same function where
useDefaultTimings and defaultLeaderElection are referenced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 204a9bcc-63b8-4156-a8c7-8d82e3e88901
📒 Files selected for processing (4)
cmd/common/helpers.gocmd/machine-config-controller/start.gocmd/machine-config-operator/start.gocmd/machine-os-builder/start.go
Add info log when useDefaultTimings flag bypasses topology-specific timings for better runtime visibility during troubleshooting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a dedicated function that returns HA default leader election timings regardless of cluster topology. Only MCC uses it, since it runs a single replica with no lease contention and the inflated SNO timings add ~5.5min of unnecessary recovery latency after node reboot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@lucaconsalvi: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
- What I did
In SNO clusters, after a non-graceful node reboot the Machine Config Controller (MCC) takes ~5-6 minutes to acquire its leader lease. This is because
GetLeaderElectionConfig()applies library-go's SNO leader election timings (LeaseDuration=270s, RetryPeriod=60s, worst non-graceful acquisition=330s).Since MCO runs exactly one replica on SNO, there is zero lease contention — the inflated SNO timings provide no benefit and only add recovery latency.
This PR adds a GetDefaultLeaderElectionConfig() function that returns HA defaults (LeaseDuration=137s, RetryPeriod=26s, worst non-graceful acquisition=163s) regardless of topology. Only MCC uses it; MCO and MOB retain the existing topology-aware configuration.
Fixes: https://issues.redhat.com/browse/OCPBUGS-78154
- How to verify it
ssh core@<node> 'sudo systemctl reboot --force'oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controlleroc get lease -n openshift-machine-config-operator machine-config-controller -o yaml-leaseDurationSecondsshould be137(was270)- Description for the changelog
Use HA leader election defaults for MCO on SNO to reduce post-reboot lease acquisition delay from ~5.5min to ~2.5min
Summary by CodeRabbit