
feat(restore): restore VM from :hibernate on cross-node create #24

Merged: CMGS merged 4 commits into main from feat/cross-node-vm-restore on Jul 1, 2026

@tonicmuroq

Contributor

Lets a hibernated VM come back on a different node instead of booting fresh from the base image — the core capability behind cross-node migrate.

What

  • CreatePod/bringUpVM gains a first-priority restore branch (gated on the restore-from-hibernate annotation): resolves the :hibernate snapshot (local or pulled from epoch) and clones from it, rather than from spec.Image.
  • Extracts wake()'s restore core into the shared cloneFromHibernate and dispatchHibernateRestore helpers, so the cross-node path reuses exactly the same logic, including the CH+Windows 0-NIC --nics 1 hot-add and the "wait for the fresh NIC's DHCP lease, no PnP recycle" post-step. wake() is refactored to call them (behavior-preserving).
  • A Windows restore therefore hot-adds the NIC and does not redundantly recycle.
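The branch ordering described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `pod` type, the `cloner` interface, the `recorder` fake, the annotation key, and the `"true"` value format are all assumptions made for the example. It also assumes a failed restore returns an error rather than falling back to the base image.

```go
package main

import (
	"errors"
	"fmt"
)

// restoreAnnotation is a placeholder key, not the real cocoon-common constant.
const restoreAnnotation = "example.invalid/restore-from-hibernate"

// pod is a hypothetical stand-in for the fields bring-up reads.
type pod struct {
	Name        string
	Image       string
	Annotations map[string]string
}

// cloner separates the two clone sources so the branch order is observable.
type cloner interface {
	FromHibernate(name string) error
	FromImage(name, image string) error
}

// bringUp checks the restore annotation before anything else and only clones
// from the base image when the pod did not ask for a restore.
func bringUp(c cloner, p pod) (string, error) {
	switch {
	case p.Annotations[restoreAnnotation] == "true": // assumed value format
		if err := c.FromHibernate(p.Name); err != nil {
			return "", fmt.Errorf("restore %s: %w", p.Name, err)
		}
		return "restore", nil
	default:
		if err := c.FromImage(p.Name, p.Image); err != nil {
			return "", fmt.Errorf("create %s: %w", p.Name, err)
		}
		return "image", nil
	}
}

// recorder is a fake cloner that logs which source was used.
type recorder struct {
	calls   []string
	failHib bool
}

func (r *recorder) FromHibernate(name string) error {
	r.calls = append(r.calls, "hibernate:"+name)
	if r.failHib {
		return errors.New("snapshot not found")
	}
	return nil
}

func (r *recorder) FromImage(name, image string) error {
	r.calls = append(r.calls, "image:"+image)
	return nil
}

func main() {
	r := &recorder{}
	p := pod{Name: "vm-a", Image: "base", Annotations: map[string]string{restoreAnnotation: "true"}}
	path, err := bringUp(r, p)
	fmt.Println(path, err, r.calls)
}
```

Putting the restore case first means an annotated pod never touches spec.Image, which is the property the description relies on.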

Dependency

Depends on cocoonstack/cocoon-common#3 (the restore-from-hibernate annotation). go.mod currently pins the branch commit via pseudo-version; bump to the cocoon-common release tag after #3 merges.

Tests

restore_test.go (Windows hot-adds NIC / Linux inherits); full suite + make lint clean on linux + darwin.

@CMGS CMGS force-pushed the feat/cross-node-vm-restore branch from 42af387 to 506d0bb June 29, 2026 08:03
@CMGS

CMGS commented Jun 29, 2026

Contributor

Rebased onto main (code-style cleanup + dep bumps). The deps commit became a no-op — main is already on common v0.2.2 / epoch v0.2.4, both of which carry AnnotationRestoreFromHibernate — so it dropped, leaving 2 commits.

The create.go + update.go changes were auto-merged onto a reorganized codebase, so I verified the merge is intact: cloneFromHibernate / dispatchHibernateRestore / finalizeDropNICWake / wake / bringUpVM are each defined exactly once, the restore branch is the first bringUpVM case, and restoring is consistently gated (dispatch / skip-SAC / skip-base-post-clone / defer-Ready). build / vet / test (incl. restore_test + delete_test) all green. No correctness must-fix — the restore flow is solid.

One metric-consistency call for you (non-blocking):

dispatchHibernateRestore's CH+Windows (dropNIC) branch returns before the op=="update" gate and delegates to finalizeDropNICWake, which unconditionally bumps WakeTotal / WakeIPWaitTotal. So a cross-node create restore of CH+Windows counts into wake-specific metrics, while the Linux create-restore path (gated) does not. It's a semantics judgment — should cross-node restore count as wake or create? Either (a) op-gate finalizeDropNICWake's metric bumps, or (b) accept restore as uniformly "wake" (then the Linux path is the odd one out).
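To make the two options concrete, here is a small sketch of how each policy would count a wake ("update") followed by a cross-node restore ("create"). The `wakePolicy` type, the `counters` struct, and `recordRestore` are hypothetical names for illustration, and the `"create"` op value is assumed; only the `op=="update"` gate comes from the code under review.

```go
package main

import "fmt"

// wakePolicy selects between the two options discussed in the review.
type wakePolicy int

const (
	gateOnUpdate  wakePolicy = iota // option (a): only a wake bumps wake metrics
	restoreIsWake                   // option (b): every restore counts as a wake
)

// counters is a hypothetical stand-in for the real metrics.
type counters struct {
	WakeTotal int
}

// recordRestore bumps the wake counter according to the chosen policy.
// op is assumed to be "update" for a wake and "create" for a cross-node restore.
func recordRestore(c *counters, policy wakePolicy, op string) {
	if policy == restoreIsWake || op == "update" {
		c.WakeTotal++
	}
}

func main() {
	for _, p := range []wakePolicy{gateOnUpdate, restoreIsWake} {
		c := &counters{}
		recordRestore(c, p, "update")
		recordRestore(c, p, "create")
		fmt.Printf("policy=%d wake_total=%d\n", p, c.WakeTotal)
	}
	// Prints:
	// policy=0 wake_total=1
	// policy=1 wake_total=2
}
```

Whichever policy is chosen, the point is that the CH+Windows and Linux paths should go through the same decision so their counts agree.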

Two nits while you're in there: VMBootDuration label is spec.Mode on create-restore vs "clone" on wake; and the -hibernate-import copy leaks if a cross-node pull succeeds but the subsequent Clone fails (pre-existing wake pattern, inherited here — separate cleanup).

Otherwise #24 is merge-ready.

CreatePod restores a hibernated VM on a new node instead of cloning a
fresh one from the base image. Extracts wake()'s restore core into
cloneFromHibernate (resolve source + CH+Windows --nics 1 hot-add + drop
import copy) and dispatchHibernateRestore (CH+Windows waits on the fresh
NIC's lease, others run runPostCloneSetup). bringUpVM gains a
first-priority restore branch gated on the restore-from-hibernate
annotation; the create dispatch skips base-image post-clone setup for
restores, so a Windows restore hot-adds the NIC without a redundant
PnP recycle.

Requires cocoon-common with AnnotationRestoreFromHibernate; bump the
require before merge.

Cross-node migration deletes the old pod after hibernate has already removed and
forgotten the VM. This pins that DeletePod then takes its v==nil early return and
removes no snapshot, so the epoch :hibernate checkpoint survives the deletion.
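The behavior being pinned can be illustrated with a toy provider. This is only a sketch under stated assumptions: the `provider` and `vm` types and the `removed` slice are invented for the example, and the non-nil path removing a snapshot is inferred from the message above rather than taken from the real DeletePod.

```go
package main

import "fmt"

// vm is a hypothetical record of a tracked VM and its snapshot reference.
type vm struct {
	Snapshot string
}

// provider is a toy stand-in that tracks VMs and records snapshot removals.
type provider struct {
	vms     map[string]*vm
	removed []string
}

// DeletePod returns early when the VM is unknown, so no snapshot is touched.
func (p *provider) DeletePod(name string) error {
	v := p.vms[name]
	if v == nil {
		// Hibernate already removed and forgot the VM: nothing to clean up.
		return nil
	}
	p.removed = append(p.removed, v.Snapshot)
	delete(p.vms, name)
	return nil
}

func main() {
	p := &provider{vms: map[string]*vm{}}
	err := p.DeletePod("vm-a") // VM was forgotten by an earlier hibernate
	fmt.Println(err, len(p.removed))
	// Prints: <nil> 0
}
```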
@CMGS CMGS force-pushed the feat/cross-node-vm-restore branch from 506d0bb to 8c6bf0c July 1, 2026 13:55
CMGS added 2 commits July 1, 2026 22:24
…on clone failure

- dispatchHibernateRestore: drop the op==update gate so a cross-node create restore
  bumps WakeTotal(ok) like a wake (the CH+Windows dropNIC path already did via
  finalizeDropNICWake) — a restore is a wake regardless of trigger.
- CreatePod: on a restore bring-up failure, bump WakeTotal(failed) too, so the
  cross-node create path is symmetric with wake()'s failure accounting.
- create.go: label a restore's VMBootDuration "clone" like wake, not spec.Mode.
- cloneFromHibernate: defer cleanupWakeImport so a failed Clone doesn't leak the
  cross-node import copy (also fixes the pre-existing wake-path leak).

Drop the cloneFromHibernate inline comment that restated its doc; condense the
create-path restore note.
@CMGS CMGS merged commit 293ae37 into main Jul 1, 2026
2 checks passed
@CMGS CMGS deleted the feat/cross-node-vm-restore branch July 1, 2026 14:37
2 participants