Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k.
Skipping CI for Draft Pull Request.
```diff
 database.OperationRequestCreate,
 newInternalCluster.ID,
-newInternalCluster.ServiceProviderProperties.ClusterServiceID,
+*newInternalCluster.ServiceProviderProperties.ClusterServiceID,
```
Need to track down the operation cluster-service-id usage to handle the empty case.
@JakobGray PTAL, this is relevant for the work around moving cluster creation to the backend. /assign @JakobGray
Force-pushed from dd66af0 to 9728a5a.
Force-pushed from 9728a5a to 1f0df3d.
/retest
```go
// we do this to keep serialization the same so that we can go to n-1 where this field isn't a pointer.
// on the reading side, we handle the pointer as expected.
cosmosObj.InternalState.InternalAPI.ServiceProviderProperties.ClusterServiceID = &api.InternalID{}
```
This is the critical part of the PR.
```diff
 // TODO should we take into account that at some point in the future we will implement migration between management
 // clusters, where a cluster could have bundles allocated to different provision shards at the same time?
-clusterCSShard, err := c.clusterServiceClient.GetClusterProvisionShard(ctx, cluster.ServiceProviderProperties.ClusterServiceID)
+clusterCSShard, err := c.clusterServiceClient.GetClusterProvisionShard(ctx, *cluster.ServiceProviderProperties.ClusterServiceID)
```
Similar comment to https://github.com/Azure/ARO-HCP/pull/4752/changes#r3046562829
This ensures the process hangs rather than failing if we run the n-1 backend after removing synchronous creation. We need to remember to add the controller handling first.
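The "hang, don't fail" behavior amounts to guarding the dereference and requeueing when the ID is absent. A minimal sketch, with hypothetical names (`reconcileShard`, `getShard` are not the real functions):

```go
package main

import "fmt"

// reconcileShard sketches the nil guard discussed above; all names are
// hypothetical. It returns (shard, done, err); done=false means "not enough
// information yet, requeue and retry later" rather than crashing on nil.
func reconcileShard(id *string, getShard func(string) (string, error)) (string, bool, error) {
	if id == nil {
		// n-1 data may not carry the cluster-service ID yet: wait for it
		// instead of dereferencing a nil pointer.
		return "", false, nil
	}
	shard, err := getShard(*id)
	if err != nil {
		return "", false, err
	}
	return shard, true, nil
}

func main() {
	getShard := func(id string) (string, error) { return "shard-for-" + id, nil }

	_, done, _ := reconcileShard(nil, getShard)
	fmt.Println("nil id, requeued:", !done)

	id := "abc"
	shard, _, _ := reconcileShard(&id, getShard)
	fmt.Println("resolved:", shard)
}
```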
Force-pushed from 1f0df3d to e93d6fc.
```go
if cluster.ServiceProviderProperties.ClusterServiceID == nil {
	// we don't have enough information to proceed. We will retrigger once the information is present.
	// TODO remove this once we have the information all in cosmos.
	continue
```
I think here we should return an error instead of continuing: if for some reason we have a serviceprovidercluster but we don't store the information about which shard it belongs to (because of the `continue` here), the cleanup logic might find a bundle that has no corresponding bundle reference, because that serviceprovidercluster info is missing due to this check, and it would incorrectly delete the bundle.
The downside is that this could be common, since it requires all the clusters to have the CS ID at this point.
If we don't have a clusterServiceID, how did the bundle get created, since there's nothing in cluster service to provide to maestro?
The race condition is:

1. The orphan deleter gets the SPCs and discards those that don't have a cluster with a CS ID.
2. The orphan deleter lists bundles.

In the meantime, between steps 1 and 2, a bundle was created and assigned to an SPC we discarded. The deleter finds no corresponding SPC, so it deletes the bundle, which shouldn't happen because an SPC does own it.

We discussed this general race condition some days ago. In #4599 I fixed it in the delete-orphan controller. In that PR I also updated the maestro read-only bundle controllers for nodepool create and read to consider the case where the CS ID is empty.
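One common mitigation for this list/list race (a generic sketch, not necessarily what #4599 implements) is to re-check ownership against a fresh snapshot immediately before deleting, so a bundle assigned between the two listings is spared:

```go
package main

import "fmt"

// deleteOrphanedBundles sketches a race-safe orphan cleanup; all names are
// hypothetical stand-ins for the real controllers. listOwned returns the set
// of bundle IDs currently owned by an SPC.
func deleteOrphanedBundles(listOwned func() map[string]bool, bundles []string) (deleted []string) {
	owned := listOwned()        // step 1: snapshot SPC ownership
	for _, b := range bundles { // step 2: bundles listed after the snapshot
		if owned[b] {
			continue
		}
		// Re-check a fresh snapshot before deleting, so a bundle that was
		// assigned between steps 1 and 2 is not removed incorrectly.
		if fresh := listOwned(); fresh[b] {
			continue
		}
		deleted = append(deleted, b)
	}
	return deleted
}

func main() {
	calls := 0
	listOwned := func() map[string]bool {
		calls++
		if calls == 1 {
			return map[string]bool{"a": true} // "b" not yet assigned
		}
		return map[string]bool{"a": true, "b": true} // "b" assigned in the meantime
	}
	// Only "c" is a true orphan; "b" is saved by the re-check.
	fmt.Println(deleteOrphanedBundles(listOwned, []string{"a", "b", "c"}))
}
```

The re-check narrows the window rather than closing it completely; a correctness proof still depends on the consistency guarantees of the underlying listings.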
```go
}
clustersByClusterServiceID := make(map[string]*api.HCPOpenShiftCluster)
for _, internalCluster := range internalClusterIterator.Items(ctx) {
	if internalCluster.ServiceProviderProperties.ClusterServiceID == nil {
```
Does this mean ArmResourceListClusters will omit clusters that don't yet have the CS ID set? It could be confusing to consumers of the API endpoint, just to be aware.

> Does this mean ArmResourceListClusters will omit clusters that don't yet have the CS ID set?

It goes away in #4610.
```go
	logger.Info("clusterService cluster missing, trying to clean up", "err", err)
} else if err != nil {
	return utils.TrackError(err)
if cluster.ServiceProviderProperties.ClusterServiceID != nil {
```
Similar comment to https://github.com/Azure/ARO-HCP/pull/4752/changes#r3052173454
```go
// TODO this overwrite will be transformed into a "set" function as we transition fields to ownership in cosmos
// TODO remove the azure location once we have migrated every record to store the location
func mergeToInternalCluster(csCluster *arohcpv1alpha1.Cluster, internalCluster *api.HCPOpenShiftCluster, azureLocation string) (*api.HCPOpenShiftCluster, error) {
	if csCluster == nil {
```
Have we analyzed the implications of this nil check and error in mergeToInternalCluster and readInternalClusterFromClusterService for consumer code?
The function goes away in #4610, which has to merge before we actually write the nil, so it will solve itself.
PR needs rebase.
@deads2k: The following tests failed. Full PR test history. Your PR dashboard.
Necessary for moving creation to the backend.
Critically, we never serialize a nil; we keep it as an empty value. This allows the n-1 level (if we need to revert) to read data created by the new version. This version reads and tolerates nil, so the n+1 version can write the nil.
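The read-side half of that compatibility story can be sketched as follows. The types are simplified stand-ins, and the point is that the reader accepts all three shapes the field can take across versions: absent/null (what n+1 may write), the empty value (what this version writes), and a populated value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InternalID is a simplified stand-in for the real cluster-service ID type.
type InternalID struct {
	Path string `json:"path"`
}

// IsZero reports whether the ID carries no information.
func (id InternalID) IsZero() bool { return id.Path == "" }

type ServiceProviderProperties struct {
	ClusterServiceID *InternalID `json:"clusterServiceID"`
}

// readClusterServiceID tolerates absent, null, empty, and populated values,
// normalizing the first three to nil so callers have a single "not set" case.
func readClusterServiceID(data []byte) (*InternalID, error) {
	var props ServiceProviderProperties
	if err := json.Unmarshal(data, &props); err != nil {
		return nil, err
	}
	if props.ClusterServiceID == nil || props.ClusterServiceID.IsZero() {
		return nil, nil // treat empty the same as missing
	}
	return props.ClusterServiceID, nil
}

func main() {
	for _, doc := range []string{
		`{}`,
		`{"clusterServiceID":null}`,
		`{"clusterServiceID":{"path":""}}`,
		`{"clusterServiceID":{"path":"/api/clusters_mgmt/v1/clusters/abc"}}`,
	} {
		id, _ := readClusterServiceID([]byte(doc))
		fmt.Println(doc, "-> set:", id != nil)
	}
}
```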