CNTRLPLANE-2800: Fix APIService race condition and gate HCP availability on aggregated API readiness#7810
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
/auto-cc |
|
@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-2800 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-2800 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an AggregatedAPIServicesAvailable condition and reason; evaluates it in HostedControlPlane availability reconciliation with an alreadyAvailable latch; reorders service/endpoints before APIServices reconciliation; centralizes HCP status patching via patchHCPWithCondition; implements aggregated APIServices reconciliation and tests; documents the new condition. Changes
Sequence DiagramsequenceDiagram
participant HCPController as HostedControlPlane Controller
participant Reconciler as Resources Reconciler
participant APIServer as Guest APIServices
participant CPClient as ControlPlane API
HCPController->>Reconciler: reconcileAggregatedAPIServicesAvailableCondition(ctx,hcp)
Reconciler->>APIServer: List expected APIServices (OpenShift, OAuth?, OLM)
APIServer-->>Reconciler: Return APIService objects / errors
Reconciler->>Reconciler: Evaluate aggregated availability
alt All available
Reconciler->>CPClient: patchHCPWithCondition(AggregatedAPIServicesAvailable=True)
else Any unavailable/missing/error
Reconciler->>CPClient: patchHCPWithCondition(AggregatedAPIServicesAvailable=False, reason=APIServicesNotAvailable)
end
HCPController->>HCPController: reconcileAvailabilityStatus(..., alreadyAvailable)
HCPController->>CPClient: Read HostedControlPlane.Status.Conditions
alt AggregatedAPIServicesAvailable != True and !alreadyAvailable
HCPController->>HCPController: Set Ready=false, reason=APIServicesNotAvailable
else
HCPController->>HCPController: Continue other availability checks
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Hi @hypershift-jira-solve-ci[bot]. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-2800 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (1)
747-753: Gate on explicitConditionTrueinstead of onlyConditionFalse.Current logic allows
ConditionUnknownto pass through to Available. Requiring explicit true keeps the readiness gate strict.Suggested patch
- case !alreadyAvailable && (apiServicesCondition == nil || apiServicesCondition.Status == metav1.ConditionFalse): + case !alreadyAvailable && (apiServicesCondition == nil || apiServicesCondition.Status != metav1.ConditionTrue):🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go` around lines 747 - 753, The code currently treats only ConditionFalse as unavailable, letting ConditionUnknown slip through; change the check in the block that sets reason = hyperv1.AggregatedAPIServicesNotAvailableReason to gate on explicit metav1.ConditionTrue by replacing the condition (apiServicesCondition == nil || apiServicesCondition.Status == metav1.ConditionFalse) with (apiServicesCondition == nil || apiServicesCondition.Status != metav1.ConditionTrue), and keep the existing message assignment using apiServicesCondition.Message when apiServicesCondition != nil otherwise the default string.control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go (2)
1923-1925: Also assert empty message on success-path cases.Right now message is only checked when non-empty is expected. Add an explicit empty assertion for cases where no message should be present to catch stale-message regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go` around lines 1923 - 1925, The test currently only asserts when tc.expectedMessage is non-empty; add an explicit assertion for the success-path to fail on stale messages by checking that message is empty when tc.expectedMessage == "". After the existing conditional, or replace it with a conditional branch, assert g.Expect(message).To(BeEmpty()) (or To(Equal(""))) whenever tc.expectedMessage == "" and keep the existing g.Expect(message).To(ContainSubstring(tc.expectedMessage)) for non-empty expectations; reference the variables tc.expectedMessage and message in the hostedcontrolplane_controller_test.go test loop.
1866-1919: Prefer exercising the shared availability evaluator instead of re-implementing it in-test.This block mirrors the production decision tree inline, which can drift and still pass. Consider calling the extracted availability-status function (or the reconciler path that sets the condition) and asserting its outputs directly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go` around lines 1866 - 1919, This test duplicates the availability decision tree; instead call the shared evaluator used by the reconciler rather than re-implementing the switch. Replace the inline switch that inspects infrastructureCondition, etcdCondition, kubeAPIServerCondition, apiServicesCondition and the manual componentsErr/componentsNotAvailableMsg logic with a direct invocation of the reconciler’s availability evaluation path (use the HostedControlPlaneReconciler instance you already construct and call the same method the controller uses to set the HostedControlPlaneAvailable condition—i.e., invoke the reconciler function that computes/sets HostedControlPlaneAvailable and assert its resulting status/reason/message rather than mirroring the logic; you may still call controlPlaneComponentsAvailable if the evaluator relies on it). Ensure assertions validate the final HostedControlPlaneAvailable condition returned/attached by that evaluator.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 525-528: The error messages incorrectly say "service" when
reconciling APIServices; update the fmt.Errorf messages in the calls to
reconcileOpenshiftAPIServerAPIServices and reconcileKubeAPIServerAPIServices so
they use a clear identifier like "apiservice" (e.g. change "failed to reconcile
openshift apiserver service: %w" to "failed to reconcile openshift apiserver
apiservice: %w" and similarly for the kube apiserver) so the error reflects the
actual resource being reconciled.
- Around line 1490-1497: patchHCPWithCondition currently patches
HostedControlPlane status directly and can overwrite concurrent updates; change
it to use optimistic locking by wrapping the status patch in a
retry.RetryOnConflict loop: fetch the latest HostedControlPlane (e.g., via
r.cpClient.Get), apply the condition to the fetched object's Status.Conditions,
call r.cpClient.Status().Patch with client.MergeFrom(originalFetchedCopy) and
retry on conflict until success or non-retryable error; keep references to the
existing function name patchHCPWithCondition and the call site
r.cpClient.Status().Patch so reviewers can locate and replace the direct patch
with the fetch-modify-patch retry pattern.
In `@docs/content/reference/api.md`:
- Around line 4927-4928: Update the API comment that currently states "This
condition is an HCP implementation detail set by the HCCO" to correctly
attribute the condition to the control-plane-operator (e.g., replace "HCCO" with
"control-plane-operator" in the condition description comment for the relevant
condition constant/type), then regenerate the docs so
docs/content/reference/api.md reflects the corrected comment and no longer
misleads operators about the controller responsible for reconciling that
condition.
---
Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go`:
- Around line 1923-1925: The test currently only asserts when tc.expectedMessage
is non-empty; add an explicit assertion for the success-path to fail on stale
messages by checking that message is empty when tc.expectedMessage == "". After
the existing conditional, or replace it with a conditional branch, assert
g.Expect(message).To(BeEmpty()) (or To(Equal(""))) whenever tc.expectedMessage
== "" and keep the existing
g.Expect(message).To(ContainSubstring(tc.expectedMessage)) for non-empty
expectations; reference the variables tc.expectedMessage and message in the
hostedcontrolplane_controller_test.go test loop.
- Around line 1866-1919: This test duplicates the availability decision tree;
instead call the shared evaluator used by the reconciler rather than
re-implementing the switch. Replace the inline switch that inspects
infrastructureCondition, etcdCondition, kubeAPIServerCondition,
apiServicesCondition and the manual componentsErr/componentsNotAvailableMsg
logic with a direct invocation of the reconciler’s availability evaluation path
(use the HostedControlPlaneReconciler instance you already construct and call
the same method the controller uses to set the HostedControlPlaneAvailable
condition—i.e., invoke the reconciler function that computes/sets
HostedControlPlaneAvailable and assert its resulting status/reason/message
rather than mirroring the logic; you may still call
controlPlaneComponentsAvailable if the evaluator relies on it). Ensure
assertions validate the final HostedControlPlaneAvailable condition
returned/attached by that evaluator.
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 747-753: The code currently treats only ConditionFalse as
unavailable, letting ConditionUnknown slip through; change the check in the
block that sets reason = hyperv1.AggregatedAPIServicesNotAvailableReason to gate
on explicit metav1.ConditionTrue by replacing the condition
(apiServicesCondition == nil || apiServicesCondition.Status ==
metav1.ConditionFalse) with (apiServicesCondition == nil ||
apiServicesCondition.Status != metav1.ConditionTrue), and keep the existing
message assignment using apiServicesCondition.Message when apiServicesCondition
!= nil otherwise the default string.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (1)
vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.gois excluded by!**/vendor/**,!vendor/**
📒 Files selected for processing (7)
api/hypershift/v1beta1/hostedcluster_conditions.gocontrol-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.godocs/content/reference/aggregated-docs.mddocs/content/reference/api.md
| This condition is an HCP implementation detail set by the HCCO and is not | ||
| bubbled up to the HostedCluster level.</p> |
There was a problem hiding this comment.
Incorrect controller attribution in condition description
The text says this condition is set by the HCCO, but this PR’s behavior describes it as reconciled by the control-plane-operator. Please correct the source API comment (then regenerate api.md) to avoid misleading operators during debugging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/content/reference/api.md` around lines 4927 - 4928, Update the API
comment that currently states "This condition is an HCP implementation detail
set by the HCCO" to correctly attribute the condition to the
control-plane-operator (e.g., replace "HCCO" with "control-plane-operator" in
the condition description comment for the relevant condition constant/type),
then regenerate the docs so docs/content/reference/api.md reflects the corrected
comment and no longer misleads operators about the controller responsible for
reconciling that condition.
There was a problem hiding this comment.
The attribution is actually correct. The reconcileAggregatedAPIServicesAvailableCondition function lives in control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go — that is the HCCO (Hosted Cluster Config Operator). The HCCO runs inside the guest cluster context and is the component that lists APIServices from the guest cluster and sets this condition on the HCP status. No change needed.
AI-assisted response via Claude Code
There was a problem hiding this comment.
Skipped: comment is from another GitHub bot.
|
@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-2800 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-2800 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@hypershift-jira-solve-ci[bot]: This pull request references CNTRLPLANE-2800 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
This PR needs rebase now, please fix conflicts if any, without changing any functionality or test coverage. /test address-review-comments |
|
/test address-review-comments |
|
/ok-to-test |
|
Addressed the nitpick suggestions from the review:
AI-assisted response via Claude Code |
|
hypershift-jira-solve-ci - This PR needs rebase now, please fix conflicts if any, without changing any functionality or test coverage. |
|
/test address-review-comments |
96f8d34 to
c2c25eb
Compare
Review Comments StatusRebased branch onto latest Verification:
Review comments: All 13 inline review comments were previously addressed. No new review feedback to handle. Commits (8):
AI-assisted response via Claude Code |
c2c25eb to
28568e5
Compare
Review Comments StatusRebased all 8 commits onto latest ✅ All previous review comments (inline from @jparrill, @coderabbitai, and issue comments from @enxebre, @bryan-cox) have been addressed in prior sessions. No outstanding code review items found. AI-assisted response via Claude Code |
28568e5 to
204290e
Compare
Add AggregatedAPIServicesAvailable condition type and AggregatedAPIServicesNotAvailableReason reason constant to track aggregated APIService readiness on the HostedControlPlane. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dition Fix resource ordering in the HCCO: create backing Service and Endpoints before the APIService for all three aggregated API groups (OpenShift API Server, OAuth API Server, OLM PackageServer), eliminating the race condition where the Kubernetes API aggregator picks up an APIService before its backend exists. Add reconcileAggregatedAPIServicesAvailableCondition which lists all expected aggregated APIServices in the guest cluster and sets the AggregatedAPIServicesAvailable condition on the HostedControlPlane. Extract patchHCPWithCondition from inline closure to a shared method on the reconciler for reuse across condition reconcilers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add AggregatedAPIServicesAvailable as a new case in reconcileAvailabilityStatus that blocks the HCP from transitioning to Available until all aggregated APIServices are functional. Uses != ConditionTrue to also gate on Unknown status. Protected by the alreadyAvailable latch to prevent flapping once the HCP has been marked Available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…y gate Add test cases to TestReconcileAvailabilityStatus for the new AggregatedAPIServicesAvailable gate: condition False blocks, condition nil blocks, condition True allows, latch prevents regression, ConditionUnknown blocks. Add TestReconcileAggregatedAPIServicesAvailableCondition with 7 cases: all available (OAuth enabled/disabled), missing APIServices, unavailable APIServices, empty conditions slice, OAuth-conditional, client error. Add TestPatchHCPWithConditionCPClientFailure to verify cpClient patch error propagation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Apply optimistic locking in patchHCPWithCondition using MergeFromWithOptions with OptimisticLock to match the codebase convention used throughout the HCP controller, preventing lost status-condition updates from concurrent reconcilers. Fix test case name from "condition is missing" to "condition is False" to accurately reflect the test setup which sets AggregatedAPIServicesAvailable=False. Include generated API docs for the new AggregatedAPIServicesAvailable condition type to fix make verify. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use manifests.OpenShiftAPIServerAPIService(group).Name and manifests.OpenShiftOAuthAPIServerAPIService(group).Name instead of fmt.Sprintf for building APIService names, matching the pattern already used for OLMPackageServerAPIService. This improves resilience against naming changes and keeps the code consistent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gate APIService creation on successful Service/Endpoints reconciliation to prevent publishing APIServices that point to non-ready backends. This closes the remaining race window where a Service or Endpoints error still allowed the corresponding APIService to be created, causing transient 503 errors from the Kubernetes API aggregator. Addresses coderabbitai review feedback for OpenShift API, OAuth API, and PackageServer APIService reconciliation paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…liation The error message when reconciling OLM PackageServer Endpoints incorrectly reported "failed to reconcile OLM packageserver service" instead of "failed to reconcile OLM packageserver endpoints". This made debugging harder when the Endpoints reconciliation failed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
204290e to
14acc45
Compare
Review Comments StatusRebased onto latest All 29 comments flagged for review are either:
No outstanding review comments require code changes or replies. AI-assisted response via Claude Code |
|
/retest |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@hypershift-jira-solve-ci[bot]: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Stale PRs are closed after 21d of inactivity. If this PR is still relevant, comment to refresh it or remove the stale label. If this PR is safe to close now please do so with /lifecycle stale |
|
hypershift bot - this PR needs rebased |
|
/address-review-comments |
|
🤖 Addressing review comments: workflow run |
|
Stale PRs rot after 14d of inactivity. Mark the PR as fresh by commenting If this PR is safe to close now please do so with /lifecycle rotten |
|
Now I have all the information. Let me compile the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryNone of the five failures are caused by the PR's code changes. The PR branch Root CauseThere are three independent root causes across the five failing jobs: 1. Merge Conflicts (affects: ci/prow/security, tide)
The Prow security job aborts immediately during the git merge phase before any test or scan execution. Tide reports "Not mergeable. PR has a merge conflict." The PR also lacks 2. Flaky CRD Teardown Timeout (affects: Envtest Vanilla Kube 1.35.0, Conclusion) 3. Git Server HTTP 500 (affects: Konflux hypershift-release-mce-50-on-pull-request) Recommendations
Evidence
|
What this PR does / why we need it:
Fixes a race condition in the control-plane-operator where APIService resources were created before their backing Service and Endpoints, causing transient 503 errors on aggregated API groups (OpenShift API Server, OAuth API Server, OLM PackageServer). Also adds a new availability gate to prevent the HostedControlPlane from being marked Available until all aggregated APIServices are functional.
Changes:
API (
api/hypershift/v1beta1)AggregatedAPIServicesAvailablecondition type andAggregatedAPIServicesNotAvailableReasonreason constant to track aggregated APIService readiness on the HostedControlPlane.Control Plane Operator (
control-plane-operator/)resources.go: Create backing Service and Endpoints before the APIService for all three aggregated API groups, eliminating the race condition where the Kubernetes API aggregator picks up an APIService before its backend exists.reconcileAggregatedAPIServicesAvailableCondition: Lists all expected aggregated APIServices in the guest cluster and sets theAggregatedAPIServicesAvailablecondition on the HostedControlPlane. UsesListfor efficiency and extractspatchHCPWithConditionas a shared helper.AvailableuntilAggregatedAPIServicesAvailableis True. Protected by the existingalreadyAvailablelatch to prevent flapping.Refactoring
reconcileAvailabilityStatusinto a pure function for testability of the availability decision cascade without full reconciler setup.Tests
reconcileAggregatedAPIServicesAvailableConditioncovering: all APIServices available (OAuth enabled/disabled), missing APIServices, unavailable APIServices, OAuth-conditional behavior, and client errors.TestReconcileAvailabilityStatuswith 11 cases covering every branch of the availability decision cascade.TestEtcdStatefulSetConditionwith 5 cases for etcd quorum scenarios.Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/CNTRLPLANE-2800
Special notes for your reviewer:
The availability gate is protected by the
alreadyAvailablelatch — once the HCP is already Available, a transientAggregatedAPIServicesAvailable=Falsewill not regress availability. This prevents flapping during temporary API service disruptions while still ensuring initial availability requires all aggregated APIs to be functional.Checklist:
Always review AI generated responses prior to use.
Generated with Claude Code via
/jira:solve [CNTRLPLANE-2800](https://issues.redhat.com/browse/CNTRLPLANE-2800)Generated via
/ci:trigger-periodic periodic-ci-openshift-hypershift-main-periodic-jira-agent MULTISTAGE_PARAM_OVERRIDE_JIRA_AGENT_ISSUE_KEY=[CNTRLPLANE-2800](https://issues.redhat.com/browse/CNTRLPLANE-2800)Jira Agent Report
Summary by CodeRabbit
New Features
Documentation