feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14
Merged
Revives the restore-from-hibernate consumer (vk-cocoon) that had no producer: when the operator (re)creates a pod for a currently-hibernated agent, stamp vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from its :hibernate snapshot instead of booting fresh. Without it, a hibernated agent whose pod moves nodes boots clean and a later re-hibernate overwrites the real snapshot: silent state loss.

Predicate = intent AND existence: a CocoonHibernation CR in phase Hibernated/Waking (or Spec.Suspend) AND Registry.HasManifest(vmName, hibernate). The registry probe is the same lookup vk runs at wake, so it never flags without a snapshot to restore, and it fails closed on a probe error.

Wired at the three create sites (main/sub/suspend-main); extracted createMainAgent to keep Reconcile under the complexity budget.

Follow-ups (out of scope):
- a sub deleted during whole-set suspend isn't recreated until unsuspend (no CR then)
- a desire=Hibernate agent recreated mid-hibernation restores then re-hibernates (wasteful but preserves state)
- extract hibernationPodNames (shared by podsRestorableByCR + podsHibernatedByCR) and hasHibernateSnapshot (shared with allOwnedPodsHibernated)
- bind logger at the top of markRestoreIfHibernated; pass a literal true from the suspend path (intent is unconditional there); rename restorableFromHibernateByCR to the noun-led podsRestorableByCR; drop the non-ASCII AND symbol from the doc
ensureToolboxes built and created toolbox pods directly, bypassing the producer the main/sub/suspend-main paths use. Managed toolboxes are VM-backed and hibernatable (per-CR or whole-set suspend), so a hibernated toolbox that goes terminal or drifts would cold-boot on recreate and a later hibernate would then overwrite the real :hibernate snapshot — the same state loss the producer prevents elsewhere. Compute podsRestorableByCR once and stamp each rebuilt toolbox, mirroring ensureSubAgents. Regression test included.
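The "compute once, stamp each rebuilt toolbox" shape might look like the sketch below. It assumes a simplified pod struct, a precomputed set of restorable pod names, and an annotation value of "true"; the real operator works on Kubernetes Pod objects, also probes the registry per pod, and the stored value is not stated in this PR.

```go
package main

import "fmt"

// restoreAnnotation is the annotation key named in the PR.
const restoreAnnotation = "vm.cocoonstack.io/restore-from-hibernate"

// pod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod.
type pod struct {
	Name        string
	Annotations map[string]string
}

// stampRestorable marks every rebuilt pod whose name is in the restorable
// set and returns how many pods were stamped. The set is built once by the
// caller so the loop does no per-pod CR listing.
func stampRestorable(rebuilt []*pod, restorable map[string]struct{}) int {
	stamped := 0
	for _, p := range rebuilt {
		if _, ok := restorable[p.Name]; !ok {
			continue
		}
		if p.Annotations == nil {
			p.Annotations = map[string]string{}
		}
		p.Annotations[restoreAnnotation] = "true" // value is an assumption
		stamped++
	}
	return stamped
}

func main() {
	restorable := map[string]struct{}{"toolbox-a": {}}
	pods := []*pod{{Name: "toolbox-a"}, {Name: "toolbox-b"}}
	n := stampRestorable(pods, restorable)
	fmt.Println(n, pods[0].Annotations, pods[1].Annotations)
}
```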
CMGS pushed a commit that referenced this pull request on Jul 2, 2026

…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14): the control plane patches CocoonSet.spec.nodeName and the operator hibernates the main agent, waits for the :hibernate snapshot in the OCI registry, deletes the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate, and drops the snapshot once the restored VM runs with a fresh VMID. Decisions are pure functions of durable state (spec.nodeName, status.phase, the pod, the snapshot), so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error now owns the reconcile (handled=true); falling through would let applyUnsuspend unwind the migration or fresh-boot over the only copy of the state
- a :hibernate tag on a pod this controller never quiesced is treated as a leftover (suspend/unsuspend never deletes the tag) and dropped instead of restored; a raw presence check would delete a live pod and roll back state
- re-targeting nodeName back to the current node mid-migration wakes the pod in place instead of deadlocking (unless a CocoonHibernation CR owns it)
- clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot behind a fresh boot
- steady-state pinned sets skip the registry probe (Migrating is persisted before the first side effect, so in-flight migrations are never mistaken)

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
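To illustrate why "decisions are pure functions of durable state" makes each step idempotent, here is a rough Go sketch of a step selector. Everything in it is an assumption made for illustration: the observed struct, the step names, and the branch order are not taken from the operator, and the "pod and snapshot both missing" branch is a guess, since the commit message does not describe it.

```go
package main

import "fmt"

// observed is a hypothetical snapshot of the durable state the commit lists.
type observed struct {
	TargetNode     string // spec.nodeName; "" means unpinned
	Migrating      bool   // status.phase is the migrating phase
	PodExists      bool
	PodNode        string // node the current pod is bound to
	PodQuiesced    bool   // this controller hibernated the pod
	PodRunning     bool   // restored VM is up with a fresh VMID
	SnapshotExists bool   // registry has the :hibernate tag
}

type step string

const (
	stepNone          step = "none"
	stepMarkMigrating step = "mark-migrating"
	stepHibernate     step = "hibernate"
	stepWaitSnapshot  step = "wait-snapshot"
	stepDeletePod     step = "delete-pod"
	stepRecreate      step = "recreate-with-restore"
	stepWaitRestore   step = "wait-restore"
	stepDropSnapshot  step = "drop-snapshot"
	stepWakeInPlace   step = "wake-in-place"
	stepFinish        step = "finish"
)

// nextStep has no side effects, so re-running it after a crash on the same
// observation always yields the same step.
func nextStep(o observed) step {
	if !o.Migrating {
		// Steady state needs no registry probe; the phase is persisted
		// before the first side effect.
		if o.TargetNode != "" && o.PodExists && o.PodNode != o.TargetNode {
			return stepMarkMigrating
		}
		return stepNone
	}
	if !o.PodExists {
		if o.SnapshotExists {
			return stepRecreate // also covers nodeName cleared in this window
		}
		return stepFinish // assumption: nothing left to restore
	}
	if o.TargetNode == "" || o.PodNode == o.TargetNode {
		switch {
		case o.PodQuiesced:
			return stepWakeInPlace // re-targeted back mid-migration
		case !o.PodRunning:
			return stepWaitRestore
		case o.SnapshotExists:
			return stepDropSnapshot
		default:
			return stepFinish
		}
	}
	switch { // pod still sits on the old node
	case !o.PodQuiesced && o.SnapshotExists:
		return stepDropSnapshot // leftover tag: drop, never restore
	case !o.PodQuiesced:
		return stepHibernate
	case !o.SnapshotExists:
		return stepWaitSnapshot
	default:
		return stepDeletePod
	}
}

func main() {
	o := observed{TargetNode: "node-b", PodExists: true, PodNode: "node-a"}
	fmt.Println(nextStep(o))
	o.Migrating = true
	fmt.Println(nextStep(o))
	o.PodQuiesced, o.SnapshotExists = true, true
	fmt.Println(nextStep(o))
}
```

Because the selector only reads the observation, a reconcile that crashes between two steps simply recomputes the same answer on the next pass.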
CMGS pushed a commit that referenced this pull request on Jul 2, 2026

…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14): the control plane patches CocoonSet.spec.nodeName and the operator hibernates the main agent, waits for the :hibernate snapshot in the OCI registry, deletes the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate, and drops the snapshot once the restored VM runs with a fresh VMID. Decisions are pure functions of durable state (spec.nodeName, status.phase, the pod, the snapshot), so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile (handled=true); falling through would let applyUnsuspend unwind the migration or fresh-boot over the only copy of the state
- a :hibernate tag on a pod this controller never quiesced is a leftover (suspend/unsuspend never deletes the tag) and is dropped, not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot
- steady-state pinned sets skip the registry probe (Migrating is persisted before the first side effect); a CR wake mid-flight is not repainted as a migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
CMGS pushed a commit that referenced this pull request on Jul 2, 2026

…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14): the control plane patches CocoonSet.spec.nodeName and the operator hibernates the main agent, waits for the :hibernate snapshot in the OCI registry, deletes the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate, and drops the snapshot once the restored VM runs with a fresh VMID. Decisions are pure functions of durable state, so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile; falling through would let applyUnsuspend unwind the migration or fresh-boot over the only snapshot
- a :hibernate tag on a never-quiesced pod is a leftover and is dropped, not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot
- steady-state pinned sets skip the registry probe; a CR wake mid-flight is not repainted as a migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
CMGS added a commit that referenced this pull request on Jul 2, 2026

…ne) (#11)

* feat(cocoonset): cross-node migration (nodeName affinity + migration state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14): the control plane patches CocoonSet.spec.nodeName and the operator hibernates the main agent, waits for the :hibernate snapshot in the OCI registry, deletes the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate, and drops the snapshot once the restored VM runs with a fresh VMID. Decisions are pure functions of durable state, so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile; falling through would let applyUnsuspend unwind the migration or fresh-boot over the only snapshot
- a :hibernate tag on a never-quiesced pod is a leftover and is dropped, not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot
- steady-state pinned sets skip the registry probe; a CR wake mid-flight is not repainted as a migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.

* fix(main): surface controller-runtime errors through core/log

crlog was set to logr.Discard(), so every reconcile error controller-runtime retried (returned by reconcilers, not logged at call sites) vanished; the migration E2E's registry 403s were invisible. Bridge logr to core/log: errors always forwarded (nil-err anomaly reports downgrade to Warn since core/log drops nil-err Error lines), V(0) info kept, V(1)+ internals chatter dropped.

---------

Co-authored-by: CMGS <ilskdw@gmail.com>
What
Revives the restore-from-hibernate consumer in vk-cocoon (which had no producer). When the operator (re)creates a pod for a currently-hibernated agent, it stamps vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from its :hibernate snapshot instead of booting fresh.

Why
Without a producer, a hibernated agent whose pod moves nodes (drain/failure) boots clean; a subsequent re-hibernate then overwrites the real snapshot, causing silent state loss. This closes that gap (the annotation's doc comment already specced "Written by the operator on the rebuilt pod").

How
Predicate = intent ∧ existence:
- Intent: a CocoonHibernation CR in phase Hibernated/Waking, or Spec.Suspend.
- Existence: Registry.HasManifest(vmName, hibernate), the same lookup vk runs at wake, so it never flags without a snapshot to restore, and fails closed on a probe error.

Wired at the 3 create sites (main / sub / suspend-main); extracted createMainAgent to keep Reconcile under the gocyclo budget.

Depends on
cocoonstack/cocoon-common#5 (MarkRestoreFromHibernate); go.mod pins the setter commit. Rebump to common main once #5 merges.

Known follow-ups (out of scope)
- A sub deleted during whole-set suspend isn't recreated until unsuspend (no CR then).
- A desire=Hibernate agent recreated mid-hibernation restores then re-hibernates (wasteful but preserves state).

Tests: markRestoreIfHibernated (4 cases + nil-registry), restorableFromHibernateByCR (phase filter).