New machine should not move to the pending state if node label is missing #1070
Other than that, the code changes seem fine. Should handle the case.
takoverflow
left a comment
Thanks for the fix, LGTM!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: takoverflow
LGTM label has been added. Git tree hash: aa1ba4d084c5a3462a5b56027cd4cb0951f4fdbf
/cherry-pick rel-v0.61
@takoverflow: once the present PR merges, I will cherry-pick it on top of rel-v0.61 in a new PR and assign it to you.
@takoverflow: new pull request created: #1072
What this PR does / why we need it:
We recently ran into a bug where a machine was stuck in the `Pending` state while its corresponding node had already joined the cluster.

The reason for this was that the node name label on the machine had not yet been populated. This can occur due to an oversight in the machine creation flow, where a newly created machine is moved to the `Pending` state after its underlying node (or VM) has been created and initialised. The machine creation flow makes a best-effort attempt to populate the node label, but it does not requeue the machine if it could not.

The reason we do not want machines to move to the `Pending` state in this case is that a machine in this state is not allowed to re-enter the machine creation flow, since we are just waiting for the node to join the k8s cluster. This causes an issue because the node name label is only updated by the machine creation flow and the machine deletion flow, leading to machines being stuck in this state indefinitely.

These kinds of bugs can surface when a CSP is not able to provide a node name in time. The machine creation flow in turn moves the machine to the `Pending` state after initialising the node (VM), and the machine gets stuck.

This PR enhances the machine creation flow so that newly created machines do not proceed to the `Pending` phase when the `NodeName` and `ProviderID` are not available.

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
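For readers skimming the description, the intended gating can be sketched roughly as below. This is a minimal illustration and not the actual diff: the `Machine` type, the `node` label key, the `advanceAfterCreate` helper, and the 5-second retry period are all hypothetical names and values chosen for the example. It also assumes that a missing value for either `NodeName` or `ProviderID` is enough to block the transition.

```go
package main

import (
	"errors"
	"time"
)

// Machine is a hypothetical, trimmed-down stand-in for the real machine object.
type Machine struct {
	Labels     map[string]string
	ProviderID string
	Phase      string
}

const (
	nodeLabelKey = "node"          // assumed label key, for illustration only
	phasePending = "Pending"       // phase entered once the node identity is known
	retryPeriod  = 5 * time.Second // assumed requeue delay
)

var errNodeIdentityMissing = errors.New("node name or provider ID not yet available")

// advanceAfterCreate runs after the VM has been created and initialised.
// It returns a non-zero requeue delay and an error when the provider has not
// yet reported the node identity, leaving the machine's phase untouched so
// that it re-enters the creation flow on the next reconcile.
func advanceAfterCreate(m *Machine, nodeName, providerID string) (time.Duration, error) {
	if nodeName == "" || providerID == "" {
		// Do not move to Pending: the node label could never be set from there.
		return retryPeriod, errNodeIdentityMissing
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[nodeLabelKey] = nodeName
	m.ProviderID = providerID
	m.Phase = phasePending
	return 0, nil
}
```

The point of the sketch is the ordering: the phase is only changed after the label and provider ID have been recorded, so a machine can never be waiting in `Pending` without a node name.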
Release note: