allow missing cluster service cluster ID by deads2k · Pull Request #4752 · Azure/ARO-HCP

deads2k · 2026-04-03T16:22:08Z

Necessary for moving creation to the backend.

Critically, we never serialize a nil; we write an empty value instead. This lets the n-1 version (if we need to revert) read data created by the new version. This version reads and tolerates nil, so the n+1 version can start writing nil.
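The write/read asymmetry described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `InternalID`, `Properties`, `encodeForStorage`, `decodeFromStorage`, and `hasClusterServiceID` are hypothetical stand-ins, and treating an empty ID the same as nil on the read side is an assumption of the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InternalID is a hypothetical stand-in for api.InternalID.
type InternalID struct {
	Path string `json:"path"`
}

// Properties mirrors only the field discussed in this PR; the shape is assumed.
type Properties struct {
	ClusterServiceID *InternalID `json:"clusterServiceID"`
}

// encodeForStorage never writes a nil ID: it substitutes an empty value so the
// stored document keeps the shape an n-1 reader expects.
func encodeForStorage(p Properties) ([]byte, error) {
	if p.ClusterServiceID == nil {
		p.ClusterServiceID = &InternalID{}
	}
	return json.Marshal(p)
}

// decodeFromStorage accepts both a null ID (which n+1 may write) and an empty one.
func decodeFromStorage(data []byte) (Properties, error) {
	var p Properties
	err := json.Unmarshal(data, &p)
	return p, err
}

// hasClusterServiceID treats nil and empty as "not assigned yet"
// (an assumption of this sketch).
func hasClusterServiceID(p Properties) bool {
	return p.ClusterServiceID != nil && p.ClusterServiceID.Path != ""
}

func main() {
	data, err := encodeForStorage(Properties{})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data)) // the ID is present but empty, never null

	p, err := decodeFromStorage([]byte(`{"clusterServiceID":null}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(hasClusterServiceID(p)) // a null ID reads as "not assigned"
}
```

The point of the split is that the writer stays conservative for one release while the reader is already permissive, so a rollback to n-1 and a roll-forward to n+1 both see data they can handle.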

openshift-ci · 2026-04-03T16:22:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

Needs approval from an approver in each of these files:

~~OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-04-03T16:22:45Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

deads2k · 2026-04-03T17:06:53Z

 		database.OperationRequestCreate,
 		newInternalCluster.ID,
-		newInternalCluster.ServiceProviderProperties.ClusterServiceID,
+		*newInternalCluster.ServiceProviderProperties.ClusterServiceID,


need to track down operation cluster-service-id usage to handle empty.
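For context, a bare `*ptr` dereference like the one in this hunk panics when the pointer is nil. One way to handle the empty case, sketched here with hypothetical names (`valueOrZero` and the stand-in `InternalID` are not from the PR), is to fall back to the zero value:

```go
package main

import "fmt"

// InternalID is a hypothetical stand-in for api.InternalID.
type InternalID struct{ Path string }

// valueOrZero dereferences p, returning the zero value of T when p is nil
// instead of panicking.
func valueOrZero[T any](p *T) T {
	if p == nil {
		var zero T
		return zero
	}
	return *p
}

func main() {
	var missing *InternalID
	present := &InternalID{Path: "example-id"} // made-up value for illustration

	fmt.Printf("%q\n", valueOrZero(missing).Path)
	fmt.Printf("%q\n", valueOrZero(present).Path)
}
```

Whether a zero value is acceptable to the operation record is exactly the open question in the comment above; the helper only removes the panic.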

machi1990 · 2026-04-04T06:07:14Z

@JakobGray PTAL, this is relevant for the work around moving cluster creation to the backend.

/assign @JakobGray

deads2k · 2026-04-07T14:22:35Z

/retest

deads2k · 2026-04-07T14:23:32Z

+	// we do this to keep serialization the same so that we can go to n-1 where this field isn't a pointer.
+	// on the reading side, we handle the pointer as expected.
+	cosmosObj.InternalState.InternalAPI.ServiceProviderProperties.ClusterServiceID = &api.InternalID{}


critical part of the PR

miguelsorianod · 2026-04-07T16:57:09Z

 		// TODO should we take into account that at some point in the future we will implement migration between management
 		// clusters, where a cluster could have bundles allocated to different provision shards at the same time?
-		clusterCSShard, err := c.clusterServiceClient.GetClusterProvisionShard(ctx, cluster.ServiceProviderProperties.ClusterServiceID)
+		clusterCSShard, err := c.clusterServiceClient.GetClusterProvisionShard(ctx, *cluster.ServiceProviderProperties.ClusterServiceID)


Similar comment to https://github.com/Azure/ARO-HCP/pull/4752/changes#r3046562829

This ensures the process hangs and doesn't fail if we run n-1 backend after removing synchronous creation. We need to remember to first add the controller handling.

miguelsorianod · 2026-04-08T11:34:38Z

+		if cluster.ServiceProviderProperties.ClusterServiceID == nil {
+			// we don't have enough information to proceed.  We will retrigger once the information is present.
+			// TODO remove this once we have the information all in cosmos.
+			continue


I think we should return an error here instead of continuing. If for some reason we have a serviceprovidercluster but don't record which shard it belongs to (because of the `continue` here), the cleanup logic might find a bundle that has no corresponding bundle reference, because that serviceprovidercluster's info was skipped by this check, and it would incorrectly delete the bundle.

The downside is that the error could be common, since it requires every cluster to have the CS ID at this point.

If we don't have a clusterserviceID, how did the bundle get created since there's nothing in the clusterservice to provide the maestro?

The race condition is:

1. The orphan deleter gets the SPCs and discards those that don't have a cluster with CS ID.
2. The orphan deleter lists bundles. In the meantime, between 1 and 2, the bundle was created and assigned to the SPC we discarded. It finds that there's no corresponding SPC, so it deletes it, which shouldn't occur as there's an SPC that has it.

We discussed this general race condition some days ago. In #4599 I fixed it in the delete orphan controller. In that PR I also updated the nodepool create and read maestro readonly bundle controllers to handle the case where the CS ID is empty.

miguelsorianod · 2026-04-08T14:45:08Z

 	}
 	clustersByClusterServiceID := make(map[string]*api.HCPOpenShiftCluster)
 	for _, internalCluster := range internalClusterIterator.Items(ctx) {
+		if internalCluster.ServiceProviderProperties.ClusterServiceID == nil {


Does this mean ArmResourceListClusters will omit clusters that don't yet have the CS ID set? That could be confusing to consumers of the API endpoint; just something to be aware of.

It goes away in #4610

miguelsorianod · 2026-04-08T14:51:47Z

-		logger.Info("clusterService cluster missing, trying to clean up", "err", err)
-	} else if err != nil {
-		return utils.TrackError(err)
+	if cluster.ServiceProviderProperties.ClusterServiceID != nil {


Similar comment to https://github.com/Azure/ARO-HCP/pull/4752/changes#r3052173454

Similar answer

miguelsorianod · 2026-04-08T14:53:36Z

 // TODO this overwrite will transformed into a "set" function as we transition fields to ownership in cosmos
 // TODO remove the azure location once we have migrated every record to store the location
 func mergeToInternalCluster(csCluster *arohcpv1alpha1.Cluster, internalCluster *api.HCPOpenShiftCluster, azureLocation string) (*api.HCPOpenShiftCluster, error) {
+	if csCluster == nil {


Have we analyzed the implications of this nil check and error in mergeToInternalCluster and readInternalClusterFromClusterService from consumer code?

The function goes away in #4610, which has to merge before we actually write the nil, so this will resolve itself.

openshift-ci · 2026-04-09T09:19:00Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-04-16T21:36:27Z

@deads2k: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/lint | `cefc148` | link | true | `/test lint` |
| ci/prow/test-unit | `cefc148` | link | true | `/test test-unit` |
| ci/prow/cspr | `cefc148` | link | true | `/test cspr` |
| ci/prow/images-push | `cefc148` | link | true | `/test images-push` |

openshift-ci bot added the do-not-merge/work-in-progress label Apr 3, 2026

openshift-ci bot added the approved label Apr 3, 2026

deads2k commented Apr 3, 2026

deads2k marked this pull request as ready for review April 3, 2026 17:32

openshift-ci bot removed the do-not-merge/work-in-progress label Apr 3, 2026

openshift-ci bot requested review from geoberle and mbarnes April 3, 2026 17:32

openshift-ci bot assigned JakobGray Apr 4, 2026

openshift-ci bot added the needs-rebase label Apr 4, 2026

deads2k force-pushed the cs-152-allow-missing-cluster-service-id branch from dd66af0 to 9728a5a April 6, 2026 19:17

openshift-ci bot removed the needs-rebase label Apr 6, 2026

deads2k force-pushed the cs-152-allow-missing-cluster-service-id branch from 9728a5a to 1f0df3d April 6, 2026 19:29

deads2k commented Apr 7, 2026

miguelsorianod reviewed Apr 7, 2026

Comment thread backend/pkg/controllers/create_cluster_scoped_maestro_readonly_bundles_controller.go

miguelsorianod reviewed Apr 7, 2026

Comment thread internal/api/types_cluster.go

miguelsorianod reviewed Apr 7, 2026

Comment thread backend/pkg/controllers/operationcontrollers/operation_cluster_create.go

miguelsorianod reviewed Apr 7, 2026

Comment thread ...g/controllers/read_and_persist_cluster_scoped_maestro_readonly_bundles_content_controller.go

deads2k added 5 commits April 7, 2026 13:56

make the clusterServiceID optional for clusters

a106f00

AI modifications to handle pointer

108c830

correct problems with nil clusterserviceID that AI missed

dd61ac9

Adjust operation handling to not fail on missing cluster internalID

2bd4379

This ensures the process hangs and doesn't fail if we run n-1 backend after removing synchronous creation. We need to remember to first add the controller handling.

allow missing clusterserviceID with maestro controllers

e93d6fc

deads2k force-pushed the cs-152-allow-missing-cluster-service-id branch from 1f0df3d to e93d6fc April 7, 2026 20:59