
Add support for Preservation of Machines and Backing nodes#1059

Open
thiyyakat wants to merge 74 commits into gardener:master from thiyyakat:feat/preserve-machine

Conversation

Member

@thiyyakat thiyyakat commented Dec 10, 2025

What this PR does / why we need it:

This PR introduces a feature that allows operators and end users to preserve a machine/node and the backing VM for diagnostic purposes.

The expected behaviour, use cases and usage are detailed in the proposal that can be found here

Which issue(s) this PR fixes:
Fixes #1008

Special notes for your reviewer:

The following tests were carried out serially with the machine-controller-manager-provider-virtual: #1059 (comment)

Please also take a look at the questions asked here.

Release note:

Introduce support for preservation of machines (both Running and Failed), and the backing node (if it exists). 

@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else needs/rebase Needs git rebase labels Dec 10, 2025
@gardener-robot

@thiyyakat You need to rebase this pull request with the latest master branch. Please check.

@gardener-robot gardener-robot added needs/review Needs review size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 10, 2025
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch 2 times, most recently from 06ecf58 to 89f2900 Compare December 10, 2025 12:06
Member Author

thiyyakat commented Dec 11, 2025

Questions that remain unanswered:

  1. On recovery of a preserved machine, it transitions from Failed to Running. However, if the preserve annotation was when-failed, the node continues to be preserved in Running even though the annotation says when-failed - is that okay? The node needs to be preserved so that pods can get scheduled onto it without CA scaling it down.
    Update: We allow the annotation to stay, but we clear PreserveExpiryTime and set the node condition to false. The CA annotation remains until manually removed from the node.
  2. Drain timeout is currently checked by calculating the time from LastUpdateTime (when the machine moved to Failed) to now. Is there a better way to do it?
    timeOutOccurred = utiltime.HasTimeOutOccurred(machine.Status.CurrentStatus.LastUpdateTime, timeOutDuration)
    In the normal drain, it is checked with respect to DeletionTimestamp.
  3. In some parts of the code, checks are performed to see if the returned error is due to a Conflict, in which case ConflictRetry rather than ShortRetry is returned. When should these checks be performed? The preservation flow has a lot of update calls. Addressed: use ConflictRetry when appropriate.

Member Author

@thiyyakat thiyyakat left a comment


Note: A review meeting was held today for this PR. The comments were given during the meeting.

During the meeting, we revisited the decision to move drain to Failed state for preserved machine. The reason discussed previously was that it didn't make sense semantically to move the machine to Terminating and then do the drain, because there is a possibility that the machine may recover. Since Terminating is a final state, the drain (separate from the drain in triggerDeletionFlow) will be performed in Failed phase. There was no change proposed during the meeting. This design decision was only reconfirmed.

Member

@takoverflow takoverflow left a comment


I have only gone through half of the PR and have some suggestions, PTAL.

Comment on lines 2475 to 2493
err := nodeops.AddOrUpdateConditionsOnNode(ctx, c.targetCoreClient, nodeName, preservedCondition)
if err != nil {
	return err
}
// Step 2: remove CA's scale-down disabled annotations to allow CA to scale down node if needed
CAAnnotations := make(map[string]string)
CAAnnotations[autoscaler.ClusterAutoscalerScaleDownDisabledAnnotationKey] = ""
latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
if err != nil {
	klog.Errorf("error trying to get backing node %q for machine %s. Retrying, error: %v", nodeName, machine.Name, err)
	return err
}
latestNodeCopy := latestNode.DeepCopy()
latestNodeCopy, _, _ = annotations.RemoveAnnotation(latestNodeCopy, CAAnnotations) // error can be ignored, always returns nil
_, err = c.targetCoreClient.CoreV1().Nodes().Update(ctx, latestNodeCopy, metav1.UpdateOptions{})
if err != nil {
	klog.Errorf("Node UPDATE failed for node %q of machine %q. Retrying, error: %s", nodeName, machine.Name, err)
	return err
}
Member

@takoverflow takoverflow Dec 18, 2025


Is there a reason why there are two get and update calls made for a node? Can these not be combined into a single atomic node object update?

And I know this is not part of your PR, but can we update this RemoveAnnotation function? It's needlessly complicated.
All you have to do after fetching the object and checking that annotations are non-nil is

delete(obj.Annotations, annotationKey)

Creating a dummy annotation map, passing it in, and then creating a new map which doesn't have the key: all of this complication can be avoided.
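As a sketch of the suggested simplification (with a minimal stand-in node type; the real corev1.Node is not needed to show the point):

```go
package main

import "fmt"

// node is a minimal stand-in for corev1.Node, for illustration only.
type node struct {
	Annotations map[string]string
}

// removeAnnotation deletes key from the object's annotations in place,
// following the simplification suggested in the review: guard against a
// nil map, then delete. No dummy map and no rebuilt copy are needed.
func removeAnnotation(n *node, key string) {
	if n.Annotations == nil {
		return
	}
	delete(n.Annotations, key)
}

func main() {
	n := &node{Annotations: map[string]string{
		"cluster-autoscaler.kubernetes.io/scale-down-disabled": "true",
		"node.machine.sapcloud.io/preserve":                    "now",
	}}
	removeAnnotation(n, "cluster-autoscaler.kubernetes.io/scale-down-disabled")
	fmt.Println(len(n.Annotations)) // 1: only the preserve annotation remains
}
```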

Member Author

@thiyyakat thiyyakat Dec 23, 2025


By 2 Get() calls are you referring to the call within AddOrUpdateConditionsOnNode and the following Get() here:
latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})?

The first one can be avoided if we didn't use the function. The second one is required because step 1 adds conditions to the node object, and the function does not return the updated node object. Fetching from the cache doesn't guarantee an up-to-date node object (tested this out empirically). I could potentially avoid fetching the objects if I didn't use the function. Will test it out.

The two update calls cannot be combined since step 1 requires an UpdateStatus() call, and step 2 updates the Spec, and requires an Update() call.

I will update the RemoveAnnotation function as recommended by you.

Edit: The RemoveAnnotation function returns a boolean indicating whether or not an update is needed, and this value is used at other call sites, so the function cannot be changed. I will use your suggestion directly instead of calling the function, since the boolean value is not required in this case.

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Dec 18, 2025
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from 22c646e to 7c062b5 Compare December 19, 2025 08:30
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from e2a7ea7 to 74603a4 Compare December 31, 2025 09:56
@thiyyakat thiyyakat marked this pull request as ready for review January 6, 2026 05:56
@thiyyakat thiyyakat requested a review from a team as a code owner January 6, 2026 05:56
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch 2 times, most recently from a487a18 to 508b1ba Compare January 12, 2026 04:24
Member

@aaronfern aaronfern left a comment


Thanks for the PR @thiyyakat!
A few questions/nits from me, please address them

UpdateFailed string = "UpdateFailed"
)

const (
Member

@elankath elankath Jan 13, 2026


These condition constants feel like they are in the wrong place, as we already have conditions at pkg/apis/machine/types.go. Also, I don't think the Node prefix should be used for the condition constant names, as they are used in Machine objects too. @unmarshall should these even be exposed in the API?

Member Author


I've added them here after seeing the constants for InPlaceUpdates added just above:

NodeInPlaceUpdate corev1.NodeConditionType = "InPlaceUpdate"

The NodeCondition for InPlace is named NodeInPlaceUpdate, and I've followed the same.

@elankath , @unmarshall , please let me know what change you would like me to make.

Member

@elankath elankath Jan 14, 2026


@thiyyakat Ok, but the reason constants like NodePreservedByMCM, etc. should just be PreservedByMCM; that is also the convention followed by the in-place update constants.

PreservedNodeDrainSuccessful -> DrainSuccessful

Member Author


Will make the change to the other constant names and shorten them.

As for PreservedNodeDrainSuccessful -> DrainSuccessful, I am unsure what to do. DrainSuccessful is already used as a Reason for InPlaceUpdate, and the comment indicates the same. Is it okay to re-use it for a Message?
Ref:

// DrainSuccessful is a constant for reason in condition that indicates node drain is successful
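For context, here is a hedged sketch of what the shortened constants discussed above could look like. The reason strings PreservedByUser, PreservedByMCM, and PreservationStopped appear as condition reasons in the manual tests later in this thread, but the exact declarations and their placement in pkg/apis are assumptions, not the final API; NodeConditionType is a local stand-in for corev1.NodeConditionType so the sketch compiles without the Kubernetes dependency:

```go
package main

import "fmt"

// NodeConditionType is a local stand-in for corev1.NodeConditionType
// (illustration only, so the snippet compiles standalone).
type NodeConditionType string

// One possible shape after the renaming discussed in review: the condition
// type keeps its name, while the reason constants drop the Node prefix
// because they are used on Machine objects too. Assumed, not final.
const (
	NodePreserved NodeConditionType = "Preserved"

	// Condition reasons
	PreservedByUser     = "PreservedByUser"
	PreservedByMCM      = "PreservedByMCM"
	PreservationStopped = "PreservationStopped"
)

func main() {
	fmt.Println(NodePreserved, PreservedByUser)
}
```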

- Modify sort function to de-prioritize preserve machines
- Add test for the same
- Improve logging
- Fix bug in stopMachinePreservationIfPreserved when node is not found
- Update default MachinePreserveTimeout to 3 days as per doc
- Reuse function to write annotation on machine
- Minor refactoring
- Make changes to add auto-preserve-stopped on recovered, auto-preserved previously failed machines.
- Change stopMachinePreservationIfPreserved to removeCA annotation when preserve=false on a recovered failed, preserved machine
* remove stop annotation value
* remove CA scale-down annotation when preservation stops
* change preservation annotation handling semantics for machine and node
* remove auto-preserve-stopped annotation value
* Add preserveExpiryTime to NodeCondition.Message
* modify test cases
…eserved machines if autoPreservedFailedMachineMax is decreased in the shoot spec.
…liedNodePreserveValue for persisting node annotation values that have been applied.
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from 6636d86 to 132539d Compare February 20, 2026 10:38
@gardener-prow gardener-prow bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 20, 2026
@thiyyakat
Member Author

Manual Testing carried out with MCM-P-virtual with latest changes

18th Feb, 2026

Note: The following tests were carried out serially

Annotating node object

Annotating with "preserve=now"

  • annotation present on node
  • CA scale down disabled annotation present on node
  • CA scale down disabled by MCM annotation present on node
  • Node Condition (type Preserved) added
    • Reason set to "Preserved by User"
    • Status set to "True"
    • Message contains info about preserve expiry time
  • PreserveExpiryTime set in machine.CurrentStatus
  • On the machine object, node.machine.sapcloud.io/last-applied-node-preserve-value updated with "now"
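For reference, the scenario above was driven by plain node annotations. Assuming a working kubeconfig against the cluster (so this is shown as a CLI fragment, not a runnable script; the node name is a placeholder), the kubectl invocations look like this:

```shell
# Start preservation immediately (annotation key introduced by this PR):
kubectl annotate node <node-name> node.machine.sapcloud.io/preserve=now

# Preserve only if/when the machine fails:
kubectl annotate node <node-name> node.machine.sapcloud.io/preserve=when-failed --overwrite

# Stop preservation:
kubectl annotate node <node-name> node.machine.sapcloud.io/preserve=false --overwrite
```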

Node object:

❯ k get no shoot--i749592--test-ca-wp-sys-z1-78474-k2x4c -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
    cluster-autoscaler.kubernetes.io/scale-down-disabled-by-mcm: "true"
    node.machine.sapcloud.io/preserve: now

...
conditions:
  ...
  - lastHeartbeatTime: null
    lastTransitionTime: "2026-02-18T05:00:41Z"
    message: Machine preserved until 2026-02-21 10:30:41 +0530 IST
    reason: Preserved by user
    status: "True"
    type: Preserved

Machine object:

❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-k2x4c -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/last-applied-node-preserve-value: now

...
  currentStatus:
    lastUpdateTime: "2026-02-18T05:00:41Z"
    phase: Running
    preserveExpiryTime: "2026-02-21T05:00:41Z"

Annotating with "preserve=false"

  • annotation present on node
  • both CA scale down disabled annotations removed from node
  • Node Condition (type Preserved) present
    • Reason set to "Preservation Stopped"
    • Status set to "False"
  • PreserveExpiryTime removed from machine.CurrentStatus
  • node.machine.sapcloud.io/last-applied-node-preserve-value updated to "false" on machine object

Node object:

❯ k get no shoot--i749592--test-ca-wp-sys-z1-78474-k2x4c -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.machine.sapcloud.io/preserve: "false"

...
conditions:
  - lastHeartbeatTime: null
    lastTransitionTime: "2026-02-18T05:04:49Z"
    reason: Preservation stopped
    status: "False"
    type: Preserved

Machine object:

❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-k2x4c -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/last-applied-node-preserve-value: "false"
...
  currentStatus:
    lastUpdateTime: "2026-02-18T05:04:49Z"
    phase: Running

Annotated "preserve=when-failed"

When machine is Running

  • annotation present on node
  • No change in Node Condition for Preserved
  • PreserveExpiryTime not set in machine object
  • node.machine.sapcloud.io/last-applied-node-preserve-value updated to "when-failed" on machine
    Node object:
❯ k get no shoot--i749592--test-ca-wp-sys-z1-78474-k2x4c -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.machine.sapcloud.io/preserve: when-failed
...
conditions: # remains unchanged after setting preserve to "false"
  - lastHeartbeatTime: null
    lastTransitionTime: "2026-02-18T05:04:49Z"
    reason: Preservation stopped
    status: "False"
    type: Preserved

Machine object:

❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-k2x4c -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/last-applied-node-preserve-value: when-failed
...
  currentStatus: # (no change)
    lastUpdateTime: "2026-02-18T05:04:49Z"
    phase: Running

When machine has Failed

  • CA scale down disabled by MCM annotation added on node
  • CA scale down disabled annotation added on node
  • Node Condition (type Preserved) changed
    • Reason set to "Preserved by User"
    • Status set to "True"
    • Message set to "Preserved node drained successfully"
  • PreserveExpiryTime set in machine.CurrentStatus
    Node object:
❯ k get no shoot--i749592--test-ca-wp-z3-7c579-ns8l9 -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
    cluster-autoscaler.kubernetes.io/scale-down-disabled-by-mcm: "true"
    node.machine.sapcloud.io/preserve: when-failed
...
spec:
  ...
  unschedulable: true
...
conditions:
  - lastHeartbeatTime: null
    lastTransitionTime: "2026-01-23T06:48:58Z"
    message: Preserved node drained successfully
    reason: PreservedByUser
    status: "True"
    type: Preserved

Machine object:

 currentStatus:
    lastUpdateTime: "2026-02-18T08:35:12Z"
    phase: Failed
    preserveExpiryTime: "2026-02-21T08:35:12Z"

When node with "preserved=when-failed" recovers to Running

  • annotation still present on node
  • CA scale down disabled annotation still present on node
  • Node Condition (type Preserved) present
    • Reason set to "PreservationStopped"
    • Status set to "False"
  • PreserveExpiryTime set to nil in machine.CurrentStatus
  • Node no longer marked "unschedulable"
    Node object:
❯ k get no shoot--i749592--test-ca-wp-z3-7c579-ns8l9 -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.machine.sapcloud.io/preserve: when-failed
...
conditions:
  - lastHeartbeatTime: null
    lastTransitionTime: "2026-02-18T08:41:28Z"
    reason: Preservation stopped.
    status: "False"
    type: Preserved

Machine object:

...
  currentStatus:
    lastUpdateTime: "2026-02-18T08:41:28Z"
    phase: Running
  lastOperation:
    description: Machine shoot--i749592--test-ca-wp-z3-7c579-ns8l9 successfully re-joined the cluster

Annotating machine object

Annotating with "preserve=when-failed"

  • Annotation present on machine
  • No change in node object
  • No Node Condition of type=Preserved added on the node
  • No PreserveExpiryTime set
    Machine object
❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-7bhxv -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/preserve: when-failed
  ...
  currentStatus:
    lastUpdateTime: "2026-02-18T08:25:28Z"
    phase: Running

When machine moved to Failed

  • Annotation unchanged on machine
  • Node still has no preserve annotation
  • CA scale-down disabled annotation added on node
  • CA scale-down disabled by MCM annotation added on node
  • PreserveExpiryTime set
  • Node condition (type=Preserved) set on node:
    • Reason set to "Preserved by User"
    • Status set to "True"
    • Message set to "Preserved node drained successfully"
  • Node marked unschedulable
    Machine object:
❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-7bhxv -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/preserve: when-failed
...
  currentStatus:
    lastUpdateTime: "2026-02-18T08:57:37Z"
    phase: Failed
    preserveExpiryTime: "2026-02-21T08:57:37Z"

Node object:

❯ k get no shoot--i749592--test-ca-wp-sys-z1-78474-7bhxv -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
    cluster-autoscaler.kubernetes.io/scale-down-disabled-by-mcm: "true"
...
spec:
  unschedulable: true
...
conditions:
  - lastHeartbeatTime: null
    lastTransitionTime: "2026-02-18T08:57:37Z"
    message: Preserved node drained successfully. Machine preserved until 2026-02-21 14:27:37 +0530 IST.
    reason: Preserved by user.
    status: "True"
    type: Preserved

When machine recovers to Running

  • Machine annotation remains unchanged
  • PreserveExpiryTime cleared
  • Node Conditions (type=Preserved) changed:
    • Reason set to "PreservationStopped"
    • Status set to "False"
  • Node no longer unschedulable
  • CA scale-down annotations removed from node
    Machine object:
❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-7bhxv -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/preserve: when-failed
...
  currentStatus:
    lastUpdateTime: "2026-02-18T09:02:10Z"
    phase: Running
  lastOperation:
    description: Machine shoot--i749592--test-ca-wp-sys-z1-78474-7bhxv successfully re-joined the cluster
    lastUpdateTime: "2026-02-18T09:02:10Z"
    state: Successful
    type: HealthCheck


Node object:

❯ k get no shoot--i749592--test-ca-wp-sys-z1-78474-7bhxv -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2026-02-18T08:25:13Z"

...
# spec.unschedulable no longer true
conditions:
  - lastHeartbeatTime: null
    lastTransitionTime: "2026-02-18T09:02:10Z"
    reason: Preservation stopped.
    status: "False"
    type: Preserved

Other Scenarios checked:

  • On PreserveExpiryTime lapsing, the machine preservation stops.
    • for machines and nodes annotated with "now", the preservation is removed
  • When preserve="now" is changed to preserve="when-failed" and machine is Running: machine preservation stops
  • When preserve="when-failed" is changed to preserve="now": machine preservation starts again
  • When preserve="now" is changed to preserve="when-failed" and the machine is Failed: preservation continues without interruption

AutoPreservation

  • Machine annotated with "auto-preserve"
  • Node annotated with CA scale-down-disabled annotation
  • Node annotated with CA scale-down-disabled by MCM annotation
  • PreserveExpiryTime set on the machine
  • autoPreserveFailedMachineCount increased to 1 on the MachineSet
  • Node Condition updated to indicate auto-preservation is ongoing

MachineSet object:

❯ k get mcs shoot--i749592--test-ca-wp-sys-z1-78474 -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineSet
...
spec:
  autoPreserveFailedMachineMax: 1
...
status:
  autoPreserveFailedMachineCount: 1

Machine Object:

❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-pph6w -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/preserve: auto-preserved
...
   currentStatus:
    lastUpdateTime: "2026-02-18T09:58:20Z"
    phase: Failed
    preserveExpiryTime: "2026-02-21T09:58:20Z"

Node Object:

❯ k get no shoot--i749592--test-ca-wp-sys-z1-78474-pph6w -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
    cluster-autoscaler.kubernetes.io/scale-down-disabled-by-mcm: "true"
...
spec:
  providerID: aws:///eu-west-1/i-56079dc3b6be5ab55
  unschedulable: true
...
status:
   conditions:
   - lastHeartbeatTime: null
     lastTransitionTime: "2026-02-18T09:58:20Z"
     message: Preserved node drained successfully. Machine preserved until 2026-02-21 15:28:20 +0530 IST.
     reason: Preserved by MCM.
     status: "True"
     type: Preserved


Stopping AutoPreservation by clearing annotation on machine

  • While the annotation is temporarily removed, the machineset controller sees that the machine is Failed and auto-preserves it again.

Stopping AutoPreservation by setting preserve=false

❯ k get mc shoot--i749592--test-ca-wp-sys-z1-78474-pph6w -oyaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  annotations:
    machinepriority.machine.sapcloud.io: "3"
    node.machine.sapcloud.io/preserve: "false"
...
   currentStatus:
    lastUpdateTime: "2026-02-18T10:06:35Z"
    phase: Terminating

  • Also tested preventing Machine from being AutoPreserved by annotating with preserve=false
  • Tested increasing/decreasing AutoPreserveFailedMachineMax in the MCD. This value is propagated to the MCS. If the value is decreased below AutoPreserveFailedMachineCount, auto-preservation is stopped for the required number of Failed machines.
  • When machinePreserveTimeout is increased/decreased on re-launching MCM, the existing preserved machines do not undergo any change. All preservations made after the change have preserveExpiryTime set according to the new value.

Scaling down machineset with preserved machines and unpreserved machines

  • shoot--i749592--test-worker-test-z1-56676-9wt9c --> annotated with "preserve=now"
  • shoot--i749592--test-worker-test-z1-56676-ndqfp --> annotated with "preserve=when-failed" and node marked Not Ready
  • shoot--i749592--test-worker-test-z1-56676-xsrdj --> unpreserved
❯ k get mc
NAME                                              STATUS    AGE    NODE
shoot--i749592--test-worker-test-z1-56676-9wt9c   Running   111s   shoot--i749592--test-worker-test-z1-56676-9wt9c
shoot--i749592--test-worker-test-z1-56676-ndqfp   Failed    107m   shoot--i749592--test-worker-test-z1-56676-ndqfp
shoot--i749592--test-worker-test-z1-56676-xsrdj   Running   111s   shoot--i749592--test-worker-test-z1-56676-xsrdj
❯ k scale mcd shoot--i749592--test-worker-test-z1 --replicas=2
machinedeployment.machine.sapcloud.io/shoot--i749592--test-worker-test-z1 scaled
❯ k get mc
NAME                                              STATUS        AGE    NODE
shoot--i749592--test-worker-test-z1-56676-9wt9c   Running       2m1s   shoot--i749592--test-worker-test-z1-56676-9wt9c
shoot--i749592--test-worker-test-z1-56676-ndqfp   Failed        107m   shoot--i749592--test-worker-test-z1-56676-ndqfp
shoot--i749592--test-worker-test-z1-56676-xsrdj   Terminating   2m1s   shoot--i749592--test-worker-test-z1-56676-xsrdj
❯ k scale mcd shoot--i749592--test-worker-test-z1 --replicas=1
machinedeployment.machine.sapcloud.io/shoot--i749592--test-worker-test-z1 scaled
❯ k get mc
NAME                                              STATUS        AGE     NODE
shoot--i749592--test-worker-test-z1-56676-9wt9c   Running       2m12s   shoot--i749592--test-worker-test-z1-56676-9wt9c
shoot--i749592--test-worker-test-z1-56676-ndqfp   Terminating   107m    shoot--i749592--test-worker-test-z1-56676-ndqfp
shoot--i749592--test-worker-test-z1-56676-xsrdj   Terminating   2m12s   shoot--i749592--test-worker-test-z1-56676-xsrdj

The first machine to be scaled down was the unpreserved Running machine, next the preserved Failed machine, and lastly the preserved Running machine.

@aaronfern aaronfern self-assigned this Feb 26, 2026

gardener-prow bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from aaronfern. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from 6332ab3 to 28b3c11 Compare February 26, 2026 07:55
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from 28b3c11 to 38ff3d1 Compare February 26, 2026 09:27

Labels

cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/api-change API change with impact on API users needs/changes Needs (more) changes needs/rebase Needs git rebase needs/review Needs review needs/second-opinion Needs second review by someone else size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Preservation of Failed Machines for diagnostics

7 participants