Skip to content

OCPBUGS-78154: Use HA leader election defaults for MCO on SNO#5764

Open
lucaconsalvi wants to merge 5 commits intoopenshift:mainfrom
lucaconsalvi:OCPBUGS-78154-mcc-leader-election-sno
Open

OCPBUGS-78154: Use HA leader election defaults for MCO on SNO#5764
lucaconsalvi wants to merge 5 commits intoopenshift:mainfrom
lucaconsalvi:OCPBUGS-78154-mcc-leader-election-sno

Conversation

@lucaconsalvi
Copy link

@lucaconsalvi lucaconsalvi commented Mar 13, 2026

- What I did

In SNO clusters, after a non-graceful node reboot the Machine Config Controller (MCC) takes ~5-6 minutes to acquire its leader lease. This is because GetLeaderElectionConfig() applies library-go's SNO leader election timings (LeaseDuration=270s, RetryPeriod=60s, worst non-graceful acquisition=330s).

Since MCO runs exactly one replica on SNO, there is zero lease contention — the inflated SNO timings provide no benefit and only add recovery latency.

This PR adds a GetDefaultLeaderElectionConfig() function that returns HA defaults (LeaseDuration=137s, RetryPeriod=26s, worst non-graceful acquisition=163s) regardless of topology. Only MCC uses it; MCO and MOB retain the existing topology-aware configuration.

Fixes: https://issues.redhat.com/browse/OCPBUGS-78154

- How to verify it

  1. Deploy a SNO cluster (tested on OCP 4.20.15 and 4.22.0-ec.3)
  2. Reboot the node: ssh core@<node> 'sudo systemctl reboot --force'
  3. After reboot, check MCC logs: oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller
  4. Verify the log line: "SNO topology detected, using HA leader election defaults"
  5. Check lease values: oc get lease -n openshift-machine-config-operator machine-config-controller -o yaml - leaseDurationSeconds should be 137 (was 270)
  6. Measure lease acquisition time — expected ~2m30s (was ~5m30s)

- Description for the changelog
Use HA leader election defaults for MCO on SNO to reduce post-reboot lease acquisition delay from ~5.5min to ~2.5min

Summary by CodeRabbit

  • Refactor
    • Enhanced high availability configuration handling with standardized default leader election timings independent of cluster topology.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 13, 2026
@openshift-ci-robot
Copy link
Contributor

@lucaconsalvi: This pull request references Jira Issue OCPBUGS-78154, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

- What I did

In SNO clusters, after a non-graceful node reboot the Machine Config Controller (MCC) takes ~5-6 minutes to acquire its leader lease. This is because GetLeaderElectionConfig() applies library-go's SNO leader election timings (LeaseDuration=270s, RetryPeriod=60s, worst non-graceful acquisition=330s).

Since MCO runs exactly one replica on SNO, there is zero lease contention — the inflated SNO timings provide no benefit and only add recovery latency.

This PR removes the LeaderElectionSNOConfig() override so MCO uses HA defaults (LeaseDuration=137s, RetryPeriod=26s, worst non-graceful acquisition=163s) on all topologies.

Fixes: https://issues.redhat.com/browse/OCPBUGS-78154

- How to verify it

  1. Deploy a SNO cluster (tested on OCP 4.20.15 and 4.22.0-ec.3)
  2. Reboot the node: ssh core@<node> 'sudo systemctl reboot --force'
  3. After reboot, check MCC logs: oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller
  4. Verify the log line: "SNO topology detected, using HA leader election defaults"
  5. Check lease values: oc get lease -n openshift-machine-config-operator machine-config-controller -o yaml - leaseDurationSeconds should be 137 (was 270)
  6. Measure lease acquisition time — expected ~2m30s (was ~5m30s)

- Description for the changelog
Use HA leader election defaults for MCO on SNO to reduce post-reboot lease acquisition delay from ~5.5min to ~2.5min

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 958ee88d-46b7-4933-bf9f-78de9fb6e379

📥 Commits

Reviewing files that changed from the base of the PR and between 65d648b and 34cd2aa.

📒 Files selected for processing (2)
  • cmd/common/helpers.go
  • cmd/machine-config-controller/start.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/common/helpers.go

Walkthrough

Adds a new function GetDefaultLeaderElectionConfig() that returns default HA leader election timings independent of cluster topology. Updates machine-config-controller to use this new function instead of the topology-aware variant.

Changes

Cohort / File(s) Summary
Leader Election Configuration
cmd/common/helpers.go, cmd/machine-config-controller/start.go
Added GetDefaultLeaderElectionConfig() function that initializes and returns default leader election timings with logging. Updated start.go to call this new function instead of GetLeaderElectionConfig() for configuration retrieval.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 7 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references the Jira ticket (OCPBUGS-78154) and clearly describes the main change: using HA leader election defaults for the Machine Config Controller on Single Node OpenShift clusters to reduce acquisition latency.
Stable And Deterministic Test Names ✅ Passed PR modifies only non-test files (helpers.go and start.go). No Ginkgo test files are added or modified.
Test Structure And Quality ✅ Passed No Ginkgo test files were modified in this PR, only implementation files. The custom check for test code quality is not applicable.
Microshift Test Compatibility ✅ Passed This PR does not introduce any new Ginkgo e2e tests. Modifications are limited to two source files adding a helper function and updating the start command. No e2e tests with Ginkgo patterns are added.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not add new Ginkgo e2e tests; changes only affect production code in cmd/common/helpers.go and cmd/machine-config-controller/start.go, making this check not applicable.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR does not add Ginkgo e2e tests; changes are limited to implementation files modifying leader election configuration logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lucaconsalvi
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lucaconsalvi lucaconsalvi marked this pull request as ready for review March 13, 2026 10:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026
Refactor GetLeaderElectionConfig to accept a flag that allows callers
to use HA default leader election timings regardless of cluster topology.
All MCO components (MCC, MCO, MOB) pass true to avoid inflated SNO
lease timings that cause ~5.5min delay after node reboot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/common/helpers.go`:
- Around line 72-74: When useDefaultTimings is true the code returns
defaultLeaderElection silently; add an informational log just before that return
to record that HA default timings are being forced regardless of topology (e.g.,
log.Printf or the project's logger: log.Infof/klog.Infof) and include the value
being returned (defaultLeaderElection) and the flag name useDefaultTimings so
operators can see this decision in logs; update imports if necessary and place
the log immediately above the return in the same function where
useDefaultTimings and defaultLeaderElection are referenced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 204a9bcc-63b8-4156-a8c7-8d82e3e88901

📥 Commits

Reviewing files that changed from the base of the PR and between f016a8e and 165da5a.

📒 Files selected for processing (4)
  • cmd/common/helpers.go
  • cmd/machine-config-controller/start.go
  • cmd/machine-config-operator/start.go
  • cmd/machine-os-builder/start.go

lucaconsalvi and others added 2 commits March 13, 2026 16:22
Add info log when useDefaultTimings flag bypasses topology-specific
timings for better runtime visibility during troubleshooting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a dedicated function that returns HA default leader election timings
regardless of cluster topology. Only MCC uses it, since it runs a single
replica with no lease contention and the inflated SNO timings add ~5.5min
of unnecessary recovery latency after node reboot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 13, 2026

@lucaconsalvi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op-part2 34cd2aa link true /test e2e-gcp-op-part2
ci/prow/e2e-aws-ovn-upgrade 34cd2aa link true /test e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants