OCPBUGS-78154: Use HA leader election defaults for MCO on SNO by lucaconsalvi · Pull Request #5764 · openshift/machine-config-operator

lucaconsalvi · 2026-03-13T09:16:41Z

- What I did

In SNO clusters, after a non-graceful node reboot the Machine Config Controller (MCC) takes ~5-6 minutes to acquire its leader lease. This is because GetLeaderElectionConfig() applies library-go's SNO leader election timings (LeaseDuration=270s, RetryPeriod=60s, worst non-graceful acquisition=330s).

Since MCO runs exactly one replica on SNO, there is zero lease contention — the inflated SNO timings provide no benefit and only add recovery latency.

This PR adds a GetDefaultLeaderElectionConfig() function that returns HA defaults (LeaseDuration=137s, RetryPeriod=26s, worst non-graceful acquisition=163s) regardless of topology. Only MCC uses it; MCO and MOB retain the existing topology-aware configuration.

Fixes: https://issues.redhat.com/browse/OCPBUGS-78154

- How to verify it

Deploy a SNO cluster (tested on OCP 4.20.15 and 4.22.0-ec.3)
Reboot the node: ssh core@<node> 'sudo systemctl reboot --force'
After reboot, check MCC logs: oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller
Verify the log line: "SNO topology detected, using HA leader election defaults"
Check lease values: oc get lease -n openshift-machine-config-operator machine-config-controller -o yaml - leaseDurationSeconds should be 137 (was 270)
Measure lease acquisition time — expected ~2m30s (was ~5m30s)

- Description for the changelog
Use HA leader election defaults for MCO on SNO to reduce post-reboot lease acquisition delay from ~5.5min to ~2.5min

Summary by CodeRabbit

Refactor
- Enhanced high availability configuration handling with standardized default leader election timings independent of cluster topology.

openshift-ci · 2026-03-13T09:16:45Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-03-13T09:16:48Z

@lucaconsalvi: This pull request references Jira Issue OCPBUGS-78154, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

- What I did

In SNO clusters, after a non-graceful node reboot the Machine Config Controller (MCC) takes ~5-6 minutes to acquire its leader lease. This is because GetLeaderElectionConfig() applies library-go's SNO leader election timings (LeaseDuration=270s, RetryPeriod=60s, worst non-graceful acquisition=330s).

Since MCO runs exactly one replica on SNO, there is zero lease contention — the inflated SNO timings provide no benefit and only add recovery latency.

This PR removes the LeaderElectionSNOConfig() override so MCO uses HA defaults (LeaseDuration=137s, RetryPeriod=26s, worst non-graceful acquisition=163s) on all topologies.

Fixes: https://issues.redhat.com/browse/OCPBUGS-78154

- How to verify it

Deploy a SNO cluster (tested on OCP 4.20.15 and 4.22.0-ec.3)

Reboot the node: ssh core@<node> 'sudo systemctl reboot --force'

After reboot, check MCC logs: oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller

Verify the log line: "SNO topology detected, using HA leader election defaults"

Check lease values: oc get lease -n openshift-machine-config-operator machine-config-controller -o yaml - leaseDurationSeconds should be 137 (was 270)

Measure lease acquisition time — expected ~2m30s (was ~5m30s)

- Description for the changelog
Use HA leader election defaults for MCO on SNO to reduce post-reboot lease acquisition delay from ~5.5min to ~2.5min

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-03-13T09:17:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 958ee88d-46b7-4933-bf9f-78de9fb6e379

📥 Commits

Reviewing files that changed from the base of the PR and between 65d648b and 34cd2aa.

📒 Files selected for processing (2)

cmd/common/helpers.go
cmd/machine-config-controller/start.go

🚧 Files skipped from review as they are similar to previous changes (1)

cmd/common/helpers.go

Walkthrough

Adds a new function GetDefaultLeaderElectionConfig() that returns default HA leader election timings independent of cluster topology. Updates machine-config-controller to use this new function instead of the topology-aware variant.

Changes

Cohort / File(s)	Summary
Leader Election Configuration `cmd/common/helpers.go`, `cmd/machine-config-controller/start.go`	Added `GetDefaultLeaderElectionConfig()` function that initializes and returns default leader election timings with logging. Updated start.go to call this new function instead of `GetLeaderElectionConfig()` for configuration retrieval.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 7 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (7 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly references the Jira ticket (OCPBUGS-78154) and clearly describes the main change: using HA leader election defaults for the Machine Config Controller on Single Node OpenShift clusters to reduce acquisition latency.
Stable And Deterministic Test Names	✅ Passed	PR modifies only non-test files (helpers.go and start.go). No Ginkgo test files are added or modified.
Test Structure And Quality	✅ Passed	No Ginkgo test files were modified in this PR, only implementation files. The custom check for test code quality is not applicable.
Microshift Test Compatibility	✅ Passed	This PR does not introduce any new Ginkgo e2e tests. Modifications are limited to two source files adding a helper function and updating the start command. No e2e tests with Ginkgo patterns are added.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR does not add new Ginkgo e2e tests; changes only affect production code in cmd/common/helpers.go and cmd/machine-config-controller/start.go, making this check not applicable.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR does not add Ginkgo e2e tests; changes are limited to implementation files modifying leader election configuration logic.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-03-13T09:17:11Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lucaconsalvi
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Refactor GetLeaderElectionConfig to accept a flag that allows callers to use HA default leader election timings regardless of cluster topology. All MCO components (MCC, MCO, MOB) pass true to avoid inflated SNO lease timings that cause ~5.5min delay after node reboot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/common/helpers.go`:
- Around line 72-74: When useDefaultTimings is true the code returns
defaultLeaderElection silently; add an informational log just before that return
to record that HA default timings are being forced regardless of topology (e.g.,
log.Printf or the project's logger: log.Infof/klog.Infof) and include the value
being returned (defaultLeaderElection) and the flag name useDefaultTimings so
operators can see this decision in logs; update imports if necessary and place
the log immediately above the return in the same function where
useDefaultTimings and defaultLeaderElection are referenced.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 204a9bcc-63b8-4156-a8c7-8d82e3e88901

📥 Commits

Reviewing files that changed from the base of the PR and between f016a8e and 165da5a.

📒 Files selected for processing (4)

cmd/common/helpers.go
cmd/machine-config-controller/start.go
cmd/machine-config-operator/start.go
cmd/machine-os-builder/start.go

cmd/common/helpers.go

Add info log when useDefaultTimings flag bypasses topology-specific timings for better runtime visibility during troubleshooting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a dedicated function that returns HA default leader election timings regardless of cluster topology. Only MCC uses it, since it runs a single replica with no lease contention and the inflated SNO timings add ~5.5min of unnecessary recovery latency after node reboot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-03-13T22:26:24Z

@lucaconsalvi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-gcp-op-part2	`34cd2aa`	link	true	`/test e2e-gcp-op-part2`
ci/prow/e2e-aws-ovn-upgrade	`34cd2aa`	link	true	`/test e2e-aws-ovn-upgrade`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

lucaconsalvi added 2 commits March 13, 2026 09:52

OCPBUGS-78154: Use HA leader election defaults for MCO on SNO.

a7b45fe

Trim comment in GetLeaderElectionConfig.

f016a8e

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026

lucaconsalvi marked this pull request as ready for review March 13, 2026 10:50

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026

openshift-ci bot requested review from RishabhSaini and umohnani8 March 13, 2026 10:51

coderabbitai bot reviewed Mar 13, 2026

View reviewed changes

cmd/common/helpers.go Outdated Show resolved Hide resolved

lucaconsalvi and others added 2 commits March 13, 2026 16:22

OCPBUGS-78154: Log when using HA default leader election timings

65d648b

Add info log when useDefaultTimings flag bypasses topology-specific timings for better runtime visibility during troubleshooting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCPBUGS-78154: Use HA leader election defaults for MCO on SNO#5764

OCPBUGS-78154: Use HA leader election defaults for MCO on SNO#5764
lucaconsalvi wants to merge 5 commits intoopenshift:mainfrom
lucaconsalvi:OCPBUGS-78154-mcc-leader-election-sno

lucaconsalvi commented Mar 13, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

openshift-ci-robot commented Mar 13, 2026

Uh oh!

coderabbitai bot commented Mar 13, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lucaconsalvi commented Mar 13, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

openshift-ci-robot commented Mar 13, 2026

Uh oh!

coderabbitai bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

openshift-ci bot commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lucaconsalvi commented Mar 13, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 13, 2026 •

edited

Loading