feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14

Merged
CMGS merged 4 commits into main from feat/restore-from-hibernate-producer on Jul 1, 2026

CMGS commented Jul 1, 2026

What

Revives the restore-from-hibernate consumer in vk-cocoon (which had no producer). When the operator (re)creates a pod for a currently-hibernated agent, it stamps vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from its :hibernate snapshot instead of booting fresh.

Why

Without a producer, a hibernated agent whose pod moves nodes (drain/failure) boots clean; a subsequent re-hibernate then overwrites the real snapshot → silent state loss. This closes that gap (the annotation's doc comment already specced "Written by the operator on the rebuilt pod").

How

Predicate = intent ∧ existence:

  • intent: a CocoonHibernation CR in phase Hibernated/Waking, or Spec.Suspend.
  • existence: Registry.HasManifest(vmName, hibernate) — the same lookup vk runs at wake, so it never flags without a snapshot to restore, and fails closed on a probe error.

Wired at the 3 create sites (main / sub / suspend-main); extracted createMainAgent to keep Reconcile under the gocyclo budget.
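
The predicate above can be sketched as follows. This is a minimal illustration, not the operator's code: the `ManifestProber` interface, the `Phase` constants, the `hasIntent` and `markRestore` helpers, the fake registry, and the annotation value `"true"` are all assumptions made for the example. Only the annotation key, the phase names Hibernated/Waking, and the `HasManifest(vmName, hibernate)` lookup come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a hypothetical stand-in for the CocoonHibernation status phase.
type Phase string

const (
	PhaseHibernating Phase = "Hibernating" // assumed in-progress phase; not restorable in this sketch
	PhaseHibernated  Phase = "Hibernated"
	PhaseWaking      Phase = "Waking"
)

const (
	restoreAnnotation = "vm.cocoonstack.io/restore-from-hibernate"
	hibernateTag      = "hibernate"
)

// ManifestProber is a hypothetical interface shaped after the
// Registry.HasManifest(vmName, hibernate) lookup named in the description.
type ManifestProber interface {
	HasManifest(vmName, tag string) (bool, error)
}

// hasIntent reports whether the agent is meant to be hibernated:
// a CR in phase Hibernated/Waking, or a whole-set suspend.
func hasIntent(phase Phase, suspend bool) bool {
	return suspend || phase == PhaseHibernated || phase == PhaseWaking
}

// markRestore stamps the annotation only when intent AND existence both hold.
// It fails closed: a nil registry, a probe error, or a missing snapshot all
// leave the annotations untouched. The caller must pass a non-nil map.
func markRestore(annotations map[string]string, reg ManifestProber, vmName string, intent bool) bool {
	if !intent || reg == nil {
		return false
	}
	ok, err := reg.HasManifest(vmName, hibernateTag)
	if err != nil || !ok {
		return false
	}
	annotations[restoreAnnotation] = "true" // the value is an assumption
	return true
}

// errProbe and fakeRegistry exist only to exercise the sketch.
var errProbe = errors.New("registry unavailable")

type fakeRegistry struct {
	tags map[string]bool
	err  error
}

func (f fakeRegistry) HasManifest(vmName, tag string) (bool, error) {
	if f.err != nil {
		return false, f.err
	}
	return f.tags[vmName+":"+tag], nil
}

func main() {
	reg := fakeRegistry{tags: map[string]bool{"agent-0:hibernate": true}}
	ann := map[string]string{}
	fmt.Println(markRestore(ann, reg, "agent-0", hasIntent(PhaseHibernated, false)), ann)
	fmt.Println(markRestore(map[string]string{}, fakeRegistry{err: errProbe}, "agent-0", true))
}
```

Keeping intent and existence as separate inputs lets the suspend path pass an unconditional intent while still sharing the same registry check.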

Depends on

cocoonstack/cocoon-common#5 (MarkRestoreFromHibernate); go.mod pins the setter commit — rebump to common main once #5 merges.

Known follow-ups (out of scope)

  • A sub deleted during whole-set suspend isn't recreated until unsuspend (no CR then).
  • A desire=Hibernate agent recreated mid-hibernation restores then re-hibernates (wasteful but preserves state).

Tests: markRestoreIfHibernated (4 cases + nil-registry), restorableFromHibernateByCR (phase filter).

CMGS added 4 commits July 2, 2026 00:43
Revives the restore-from-hibernate consumer (vk-cocoon) that had no producer:
when the operator (re)creates a pod for a currently-hibernated agent, stamp
vm.cocoonstack.io/restore-from-hibernate so the new node restores the VM from
its :hibernate snapshot instead of booting fresh. Without it, a hibernated agent
whose pod moves nodes boots clean and a later re-hibernate overwrites the real
snapshot — silent state loss.

Predicate = intent AND existence: a CocoonHibernation CR in phase
Hibernated/Waking (or Spec.Suspend) AND Registry.HasManifest(vmName, hibernate).
The registry probe is the same lookup vk runs at wake, so it never flags without
a snapshot to restore, and it fails closed on a probe error. Wired at the three
create sites (main/sub/suspend-main); extracted createMainAgent to keep Reconcile
under the complexity budget.

Follow-ups (out of scope): a sub deleted during whole-set suspend isn't recreated
until unsuspend (no CR then); a desire=Hibernate agent recreated mid-hibernation
restores then re-hibernates (wasteful but preserves state).

- extract hibernationPodNames (shared by podsRestorableByCR + podsHibernatedByCR)
  and hasHibernateSnapshot (shared with allOwnedPodsHibernated)
- bind logger at the top of markRestoreIfHibernated; pass a literal true from the
  suspend path (intent is unconditional there); rename restorableFromHibernateByCR
  to the noun-led podsRestorableByCR; drop the non-ASCII AND symbol from the doc

ensureToolboxes built and created toolbox pods directly, bypassing the producer
the main/sub/suspend-main paths use. Managed toolboxes are VM-backed and
hibernatable (per-CR or whole-set suspend), so a hibernated toolbox that goes
terminal or drifts would cold-boot on recreate and a later hibernate would then
overwrite the real :hibernate snapshot — the same state loss the producer
prevents elsewhere. Compute podsRestorableByCR once and stamp each rebuilt
toolbox, mirroring ensureSubAgents. Regression test included.
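
The "compute once, stamp each rebuilt toolbox" shape might look roughly like this. It is a sketch under assumptions: the trimmed `Pod` type, the `stampRestorable` helper, the `hasSnapshot` callback, and the annotation value `"true"` are hypothetical; the real code works on Kubernetes pod objects and the set produced by podsRestorableByCR.

```go
package main

import "fmt"

const restoreAnnotation = "vm.cocoonstack.io/restore-from-hibernate"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// stampRestorable marks each rebuilt pod that is both in the restorable set
// (intent, computed once by the caller) and has a snapshot (existence).
// It returns how many pods were stamped.
func stampRestorable(rebuilt []*Pod, restorable map[string]bool, hasSnapshot func(name string) bool) int {
	stamped := 0
	for _, p := range rebuilt {
		if !restorable[p.Name] || !hasSnapshot(p.Name) {
			continue // no intent or nothing to restore: boot fresh
		}
		if p.Annotations == nil {
			p.Annotations = map[string]string{}
		}
		p.Annotations[restoreAnnotation] = "true" // the value is an assumption
		stamped++
	}
	return stamped
}

func main() {
	// The restorable set is built once, before the loop over toolboxes.
	restorable := map[string]bool{"toolbox-a": true, "toolbox-b": true}
	snapshots := map[string]bool{"toolbox-a": true}
	hasSnapshot := func(name string) bool { return snapshots[name] }

	pods := []*Pod{{Name: "toolbox-a"}, {Name: "toolbox-b"}, {Name: "toolbox-c"}}
	fmt.Println(stampRestorable(pods, restorable, hasSnapshot))
	for _, p := range pods {
		fmt.Println(p.Name, p.Annotations)
	}
}
```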
CMGS merged commit 5485764 into main on Jul 1, 2026
2 checks passed
CMGS deleted the feat/restore-from-hibernate-producer branch on July 1, 2026 18:12
CMGS pushed a commit that referenced this pull request Jul 2, 2026
…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14):
the control plane patches CocoonSet.spec.nodeName and the operator hibernates
the main agent, waits for the :hibernate snapshot in the OCI registry, deletes
the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate,
and drops the snapshot once the restored VM runs with a fresh VMID. Decisions
are pure functions of durable state (spec.nodeName, status.phase, the pod, the
snapshot), so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error now owns the reconcile (handled=true) — falling
  through would let applyUnsuspend unwind the migration or fresh-boot over the
  only copy of the state
- a :hibernate tag on a pod this controller never quiesced is treated as a
  leftover (suspend/unsuspend never deletes the tag) and dropped instead of
  restored — a raw presence check would delete a live pod and roll back state
- re-targeting nodeName back to the current node mid-migration wakes the pod
  in place instead of deadlocking (unless a CocoonHibernation CR owns it)
- clearing nodeName in the deleted-pod window finishes the restore instead of
  stranding the snapshot behind a fresh boot
- steady-state pinned sets skip the registry probe (Migrating is persisted
  before the first side effect, so in-flight migrations are never mistaken)

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
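
A rough sketch of "decisions are pure functions of durable state" is shown below. Every name in it (`State`, `Action`, `decide`, and the field and action names) is hypothetical, and it deliberately leaves out the re-targeting, cleared-nodeName, leftover-tag, and CR-owned cases from the hardening list. The fallback for a missing pod with no snapshot is an assumption, not something the commit message states.

```go
package main

import "fmt"

// Action names the single next step the reconciler would take.
type Action string

const (
	ActionNone          Action = "none"
	ActionMarkMigrating Action = "persist-migrating"
	ActionHibernate     Action = "hibernate-main"
	ActionWaitSnapshot  Action = "wait-for-snapshot"
	ActionDeleteOldPod  Action = "delete-old-pod"
	ActionRecreate      Action = "recreate-with-affinity-and-restore"
	ActionWaitRestore   Action = "wait-for-restored-vm"
	ActionDropSnapshot  Action = "drop-snapshot"
	ActionFinish        Action = "clear-migrating"
	ActionRequeue       Action = "requeue-after-probe-error"
)

// State is a hypothetical snapshot of the durable inputs.
type State struct {
	TargetNode     string // spec.nodeName
	Migrating      bool   // status.phase is Migrating
	PodExists      bool
	PodNode        string
	PodHibernated  bool // quiesced by this controller
	RestoredFresh  bool // pod runs a restored VM with a fresh VMID
	SnapshotExists bool // :hibernate tag present in the registry
	ProbeFailed    bool // registry probe returned an error
}

// decide has no side effects, so re-running it after a crash yields the
// same step for the same durable state.
func decide(s State) Action {
	if !s.Migrating {
		// Steady state: no registry probe is needed to get here.
		if s.TargetNode == "" || !s.PodExists || s.PodNode == s.TargetNode {
			return ActionNone
		}
		return ActionMarkMigrating // persisted before the first side effect
	}
	if s.ProbeFailed {
		return ActionRequeue // the probe error owns the reconcile
	}
	switch {
	case !s.PodExists:
		if s.SnapshotExists {
			return ActionRecreate
		}
		return ActionFinish // assumption: nothing left to restore
	case s.PodNode != s.TargetNode:
		if !s.PodHibernated {
			return ActionHibernate
		}
		if !s.SnapshotExists {
			return ActionWaitSnapshot
		}
		return ActionDeleteOldPod
	case s.SnapshotExists && s.RestoredFresh:
		return ActionDropSnapshot
	case s.SnapshotExists:
		return ActionWaitRestore
	default:
		return ActionFinish
	}
}

func main() {
	s := State{TargetNode: "node-b", PodExists: true, PodNode: "node-a"}
	fmt.Println(decide(s))
	s.Migrating = true
	fmt.Println(decide(s))
	s.PodHibernated, s.SnapshotExists = true, true
	fmt.Println(decide(s))
}
```

Because the function only reads state, each step can be retried safely: a crash between two steps simply leads to the same decision on the next reconcile.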
CMGS pushed a commit that referenced this pull request Jul 2, 2026
…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14):
the control plane patches CocoonSet.spec.nodeName and the operator hibernates
the main agent, waits for the :hibernate snapshot in the OCI registry, deletes
the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate,
and drops the snapshot once the restored VM runs with a fresh VMID. Decisions
are pure functions of durable state (spec.nodeName, status.phase, the pod, the
snapshot), so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile (handled=true) — falling through
  would let applyUnsuspend unwind the migration or fresh-boot over the only
  copy of the state
- a :hibernate tag on a pod this controller never quiesced is a leftover
  (suspend/unsuspend never deletes the tag) and is dropped, not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of
  deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of
  stranding the snapshot
- steady-state pinned sets skip the registry probe (Migrating is persisted
  before the first side effect); a CR wake mid-flight is not repainted as a
  migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
CMGS pushed a commit that referenced this pull request Jul 2, 2026
…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14):
the control plane patches CocoonSet.spec.nodeName and the operator hibernates
the main agent, waits for the :hibernate snapshot in the OCI registry, deletes
the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate,
and drops the snapshot once the restored VM runs with a fresh VMID. Decisions
are pure functions of durable state, so every step is idempotent and
crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile — falling through would let
  applyUnsuspend unwind the migration or fresh-boot over the only snapshot
- a :hibernate tag on a never-quiesced pod is a leftover and is dropped,
  not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of
  deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of
  stranding the snapshot
- steady-state pinned sets skip the registry probe; a CR wake mid-flight is
  not repainted as a migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
CMGS added a commit that referenced this pull request Jul 2, 2026
…ne) (#11)

* feat(cocoonset): cross-node migration (nodeName affinity + migration state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14):
the control plane patches CocoonSet.spec.nodeName and the operator hibernates
the main agent, waits for the :hibernate snapshot in the OCI registry, deletes
the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate,
and drops the snapshot once the restored VM runs with a fresh VMID. Decisions
are pure functions of durable state, so every step is idempotent and
crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile — falling through would let
  applyUnsuspend unwind the migration or fresh-boot over the only snapshot
- a :hibernate tag on a never-quiesced pod is a leftover and is dropped,
  not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of
  deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of
  stranding the snapshot
- steady-state pinned sets skip the registry probe; a CR wake mid-flight is
  not repainted as a migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.

* fix(main): surface controller-runtime errors through core/log

crlog was set to logr.Discard(), so every reconcile error controller-runtime
retried (returned by reconcilers, not logged at call sites) vanished — the
migration E2E's registry 403s were invisible. Bridge logr to core/log: errors
always forwarded (nil-err anomaly reports downgrade to Warn since core/log
drops nil-err Error lines), V(0) info kept, V(1)+ internals chatter dropped.
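
The routing policy described above can be modelled in a few lines. This sketch does not import logr or core/log: the `bridge` and `record` types are hypothetical, and the two methods only mirror the shape of the Info and Error methods of logr's LogSink interface rather than implementing the whole interface.

```go
package main

import (
	"errors"
	"fmt"
)

// record is a hypothetical captured log line.
type record struct {
	Severity string
	Msg      string
}

// bridge collects what would be forwarded to the backing logger.
type bridge struct {
	out []record
}

// Info keeps V(0) messages and drops V(1)+ internals chatter.
func (b *bridge) Info(level int, msg string, kv ...interface{}) {
	if level > 0 {
		return
	}
	b.out = append(b.out, record{"info", msg})
}

// Error always forwards. A nil err is downgraded to a warning so that a
// backend which drops nil-err error lines still records the report.
func (b *bridge) Error(err error, msg string, kv ...interface{}) {
	severity := "error"
	if err == nil {
		severity = "warn"
	}
	b.out = append(b.out, record{severity, msg})
}

// errForbidden exists only to exercise the sketch.
var errForbidden = errors.New("registry returned 403")

func main() {
	b := &bridge{}
	b.Info(0, "starting controller")
	b.Info(1, "cache sync detail") // dropped
	b.Error(errForbidden, "reconciler error")
	b.Error(nil, "anomaly without error")
	for _, r := range b.out {
		fmt.Println(r.Severity, r.Msg)
	}
}
```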

---------

Co-authored-by: CMGS <ilskdw@gmail.com>