allow missing cluster service cluster ID #4752

Open
deads2k wants to merge 6 commits into main from cs-152-allow-missing-cluster-service-id
Conversation

@deads2k
Collaborator

@deads2k deads2k commented Apr 3, 2026

Necessary for moving creation to the backend.

Critically, we never serialize a nil; we write it as an empty value instead. This allows the n-1 level (if we need to revert) to read data created by the new version. This version reads and tolerates nil, so the n+1 version can start writing nil.

@openshift-ci

openshift-ci bot commented Apr 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Apr 3, 2026
@openshift-ci

openshift-ci bot commented Apr 3, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Comment thread frontend/pkg/frontend/cluster.go Outdated
database.OperationRequestCreate,
newInternalCluster.ID,
- newInternalCluster.ServiceProviderProperties.ClusterServiceID,
+ *newInternalCluster.ServiceProviderProperties.ClusterServiceID,
Collaborator Author

Need to track down where operations use the cluster-service-id so they can handle an empty value.

@deads2k deads2k marked this pull request as ready for review April 3, 2026 17:32
@openshift-ci openshift-ci bot requested review from geoberle and mbarnes April 3, 2026 17:32
@machi1990
Collaborator

@JakobGray PTAL, this is relevant for the work around moving cluster creation to the backend.

/assign @JakobGray

@deads2k deads2k force-pushed the cs-152-allow-missing-cluster-service-id branch from dd66af0 to 9728a5a Compare April 6, 2026 19:17
@deads2k deads2k force-pushed the cs-152-allow-missing-cluster-service-id branch from 9728a5a to 1f0df3d Compare April 6, 2026 19:29
@deads2k
Collaborator Author

deads2k commented Apr 7, 2026

/retest

Comment on lines +70 to +72
// we do this to keep serialization the same so that we can go to n-1 where this field isn't a pointer.
// on the reading side, we handle the pointer as expected.
cosmosObj.InternalState.InternalAPI.ServiceProviderProperties.ClusterServiceID = &api.InternalID{}
Collaborator Author

This is the critical part of the PR.

Comment thread internal/api/types_cluster.go
// TODO should we take into account that at some point in the future we will implement migration between management
// clusters, where a cluster could have bundles allocated to different provision shards at the same time?
- clusterCSShard, err := c.clusterServiceClient.GetClusterProvisionShard(ctx, cluster.ServiceProviderProperties.ClusterServiceID)
+ clusterCSShard, err := c.clusterServiceClient.GetClusterProvisionShard(ctx, *cluster.ServiceProviderProperties.ClusterServiceID)
Collaborator

@deads2k deads2k force-pushed the cs-152-allow-missing-cluster-service-id branch from 1f0df3d to e93d6fc Compare April 7, 2026 20:59
if cluster.ServiceProviderProperties.ClusterServiceID == nil {
// we don't have enough information to proceed. We will retrigger once the information is present.
// TODO remove this once we have the information all in cosmos.
continue
Collaborator

@miguelsorianod miguelsorianod Apr 8, 2026

I think we should return an error here instead of continuing. If we have a serviceprovidercluster but, because of the continue here, never record which shard it belongs to, the cleanup logic might find a bundle with no corresponding bundle reference and incorrectly delete it.

The downside is that this could happen often, since it requires every cluster to have the CS ID at this point.

Collaborator Author

If we don't have a clusterserviceID, how did the bundle get created since there's nothing in the clusterservice to provide the maestro?

Collaborator

The race condition is:

  1. The orphan deleter gets the SPCs and discards those whose cluster has no CS ID.
  2. The orphan deleter lists bundles. Between steps 1 and 2, a bundle was created and assigned to an SPC we discarded. The deleter finds no corresponding SPC and deletes the bundle, which shouldn't happen because an SPC does own it.

We discussed this general race condition a few days ago. In #4599 I fixed it in the delete orphan controller. In that PR I also updated the maestro readonly bundle controllers for nodepool create and read to handle the case where the CS ID is empty.

Comment thread frontend/pkg/frontend/external_auth.go
Comment thread internal/api/types_cluster.go
}
clustersByClusterServiceID := make(map[string]*api.HCPOpenShiftCluster)
for _, internalCluster := range internalClusterIterator.Items(ctx) {
if internalCluster.ServiceProviderProperties.ClusterServiceID == nil {
Collaborator

Does this mean ArmResourceListClusters will omit clusters that don't have the CS ID set yet? That could be confusing to consumers of the API endpoint; just something to be aware of.

Collaborator Author

> Does this mean ArmResourceListClusters will omit clusters that still don't have the CS ID set? it could be confusing to consumers of the API endpoint, just to be aware

It goes away in #4610
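
Until then, the omission the reviewer describes looks roughly like the sketch below. It is a simplified model, not the frontend code: the `cluster` type and `indexByClusterServiceID` are hypothetical, and returning the omitted names (for logging, say) is an illustrative addition rather than something the PR does.

```go
package main

import "fmt"

// cluster is a hypothetical, reduced stand-in for api.HCPOpenShiftCluster.
type cluster struct {
	Name             string
	ClusterServiceID *string // nil until the cluster has a CS ID
}

// indexByClusterServiceID keys clusters by their CS ID. Clusters without
// an ID cannot be keyed, so they are left out of the map; their names are
// returned separately so the caller can at least log the omission.
func indexByClusterServiceID(clusters []cluster) (map[string]cluster, []string) {
	byID := make(map[string]cluster)
	var omitted []string
	for _, c := range clusters {
		if c.ClusterServiceID == nil {
			omitted = append(omitted, c.Name)
			continue
		}
		byID[*c.ClusterServiceID] = c
	}
	return byID, omitted
}

func main() {
	id := "cs-1"
	byID, omitted := indexByClusterServiceID([]cluster{
		{Name: "ready", ClusterServiceID: &id},
		{Name: "pending"},
	})
	fmt.Println("listed:", len(byID), "omitted:", omitted)
}
```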

Comment thread frontend/pkg/frontend/cluster.go Outdated
Comment thread frontend/pkg/frontend/cluster.go
logger.Info("clusterService cluster missing, trying to clean up", "err", err)
} else if err != nil {
return utils.TrackError(err)
if cluster.ServiceProviderProperties.ClusterServiceID != nil {
Collaborator

Collaborator Author

Similar answer

Comment thread frontend/pkg/frontend/cluster.go
// TODO this overwrite will transformed into a "set" function as we transition fields to ownership in cosmos
// TODO remove the azure location once we have migrated every record to store the location
func mergeToInternalCluster(csCluster *arohcpv1alpha1.Cluster, internalCluster *api.HCPOpenShiftCluster, azureLocation string) (*api.HCPOpenShiftCluster, error) {
if csCluster == nil {
Collaborator

Have we analyzed how this nil check and error in mergeToInternalCluster and readInternalClusterFromClusterService affect the code that calls them?

Collaborator Author

The function goes away in #4610, which has to merge before we actually write the nil, so this will resolve itself.

@openshift-ci

openshift-ci bot commented Apr 9, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci bot commented Apr 16, 2026

@deads2k: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/lint | cefc148 | link | true | /test lint |
| ci/prow/test-unit | cefc148 | link | true | /test test-unit |
| ci/prow/cspr | cefc148 | link | true | /test cspr |
| ci/prow/images-push | cefc148 | link | true | /test images-push |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants