feat(vm): dm-snapshot based block-level CoW for sandbox rootfs #207

Open
lucas77778 wants to merge 4 commits into master from refactor/sandbox-rootfs-boot

lucas77778 (Member) commented Apr 7, 2026

Summary

  • Introduce CowManager (snapshot_cow.rs) that uses Linux dm-snapshot to share a single read-only template ext4 image across sandboxes. Each sandbox gets a sparse COW file backed by a dm-snapshot device — only written blocks consume disk space, eliminating the need for full rootfs copies.
  • Wire dm-snapshot into the sandbox lifecycle in direct mode (jailer: None). do_boot() creates a snapshot device at /dev/mapper/arcbox-snap-{id} and passes it to Firecracker; remove_sandbox_impl() tears it down after the process exits. Falls back to direct rootfs usage if dm-snapshot is unavailable.
  • Fix agent tracing filter to include arcbox_vm=info so CowManager diagnostics are visible in agent.log.
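
To make the sparse COW file and the dm-snapshot mapping concrete, here is a minimal sketch. It is not the code in snapshot_cow.rs: the function names, the non-persistent default and the chunk size are assumptions for illustration. The table-line layout (`<start> <length> snapshot <origin> <cow> <P|N> <chunk_sectors>`) follows the kernel's device-mapper snapshot documentation.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Device-mapper tables are expressed in 512-byte sectors.
const SECTOR_SIZE: u64 = 512;

/// Create (or truncate) a sparse file of `size_bytes`.
/// `set_len` extends the file without writing data, so on filesystems that
/// support holes no blocks are allocated until something is written.
/// (Hypothetical helper, not taken from the PR.)
pub fn create_sparse_cow_file(path: &Path, size_bytes: u64) -> io::Result<()> {
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    file.set_len(size_bytes)
}

/// Build the table line that would be handed to `dmsetup create`.
/// `persistent` selects `P` (metadata survives reboot) or `N` (it does not);
/// which one the PR uses is not stated, so it is left as a parameter.
pub fn snapshot_table(
    origin_dev: &str,
    cow_dev: &str,
    origin_bytes: u64,
    chunk_sectors: u32,
    persistent: bool,
) -> String {
    let sectors = origin_bytes / SECTOR_SIZE;
    let mode = if persistent { "P" } else { "N" };
    format!("0 {sectors} snapshot {origin_dev} {cow_dev} {mode} {chunk_sectors}")
}
```

For the 1GB rootfs used in the benchmark, `snapshot_table("/dev/loop0", "/dev/loop1", 1 << 30, 8, false)` would describe a 2097152-sector mapping with 4 KiB chunks; the loop device names here are placeholders.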

Key design decisions

  • std::process::Command + spawn_blocking instead of tokio::process::Command to avoid SIGCHLD conflicts with the PID-1 reaper in arcbox-agent.
  • AsyncMutex serializes busybox losetup -f + losetup DEV FILE two-step operations to prevent TOCTOU races on concurrent sandbox creation.
  • cleanup_stale_sync() runs synchronously at CowManager::new() to remove orphaned dm devices and COW files from previous crashes.
  • Conditional COW file deletion in teardown() — only unlinks the backing file after both dm device and loop device are successfully released.
  • Jailer mode and checkpoint/restore unchanged (Phase 2/3).
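
As an illustration of the stale-cleanup step, the sketch below picks out orphaned snapshot devices from a device listing. It assumes the listing comes from something like `dmsetup ls` (one device per line, name first) and that stale devices are recognized by the `arcbox-snap-` prefix; how cleanup_stale_sync() actually enumerates devices is not shown in this PR page, and the function name is hypothetical.

```rust
/// Prefix of snapshot device names, taken from the PR description
/// (`/dev/mapper/arcbox-snap-{id}`).
const SNAP_PREFIX: &str = "arcbox-snap-";

/// Return the names of snapshot devices found in a `dmsetup ls`-style listing.
/// Only the first whitespace-separated token of each line is inspected, so
/// trailing "(major:minor)" columns and messages such as "No devices found"
/// are ignored. (Hypothetical helper, not taken from the PR.)
pub fn stale_snapshot_names(listing: &str) -> Vec<String> {
    listing
        .lines()
        .filter_map(|line| line.split_whitespace().next())
        .filter(|name| name.starts_with(SNAP_PREFIX))
        .map(str::to_owned)
        .collect()
}
```

Each returned name would then be removed before any new sandbox is created, which is why running this synchronously at init is attractive: nothing else can be holding the devices yet.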

Prerequisites

Benchmark (1GB rootfs, 1536MB memory, 5 sandboxes)

| Sandbox | Boot time |
|---------|-----------|
| 1       | 204ms     |
| 2       | 175ms     |
| 3       | 170ms     |
| 4       | 163ms     |
| 5       | 240ms     |
| Avg     | 190ms     |

Test plan

  • E2E: 5 sandboxes created with dm-snapshot, all reached ready state
  • Verified that "dm-snapshot created" and "dm-snapshot teardown complete" appear in agent.log
  • Template loop device reuse confirmed (refcount increments on subsequent sandboxes)
  • Fallback path works when dmsetup is absent (warn at init, debug per-boot)
  • cargo clippy and cargo fmt pass with zero warnings
  • CI

Introduce `CowManager` in `snapshot_cow.rs` that uses Linux dm-snapshot
to share a single read-only template ext4 image across sandboxes. Each
sandbox gets a sparse COW file; only written blocks consume disk space.

Key design decisions:
- Uses `std::process::Command` + `spawn_blocking` instead of
  `tokio::process::Command` to avoid SIGCHLD conflicts with the PID-1
  reaper in arcbox-agent.
- Two-step busybox `losetup -f` + `losetup DEV FILE` serialized via
  `AsyncMutex` to prevent TOCTOU races on concurrent sandbox creation.
- Synchronous `cleanup_stale_sync()` runs at init to remove orphaned
  dm devices and COW files from previous crashes.
- Template loop devices are refcounted; each is detached when the last
  sandbox using that template is removed.
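
The refcounting described in the last bullet could look roughly like this. It is a sketch under assumptions, not the PR's data structure: `TemplateLoops`, `acquire` and `release` are hypothetical names, and the real attach step (losetup) is abstracted into a closure so the logic can be read on its own.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Maps a template image to its loop device and the number of sandboxes
/// currently using it. (Hypothetical type, not taken from the PR.)
#[derive(Default)]
pub struct TemplateLoops {
    entries: HashMap<PathBuf, (String, usize)>,
}

impl TemplateLoops {
    /// Return the loop device for `template`. `attach` is only called the
    /// first time a template is seen; later calls just bump the refcount.
    pub fn acquire<F>(&mut self, template: &Path, attach: F) -> Result<String, String>
    where
        F: FnOnce(&Path) -> Result<String, String>,
    {
        if let Some((dev, count)) = self.entries.get_mut(template) {
            *count += 1;
            return Ok(dev.clone());
        }
        let dev = attach(template)?;
        self.entries.insert(template.to_path_buf(), (dev.clone(), 1));
        Ok(dev)
    }

    /// Drop one reference. Returns the loop device that should now be
    /// detached if this was the last user, otherwise `None`.
    pub fn release(&mut self, template: &Path) -> Option<String> {
        let (_, count) = self.entries.get_mut(template)?;
        *count -= 1;
        if *count == 0 {
            self.entries.remove(template).map(|(dev, _)| dev)
        } else {
            None
        }
    }
}
```

Returning the device from `release` instead of detaching inside it keeps the bookkeeping free of I/O, so the caller decides when and how to run the detach command.
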
In direct mode (jailer: None), `do_boot()` now calls
`cow_manager.setup()` to create a dm-snapshot device and passes
`/dev/mapper/arcbox-snap-{id}` to Firecracker as the rootfs block
device. On failure, falls back to using the rootfs image directly.
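
A minimal sketch of the fallback just described, assuming the setup step returns either a device path or an error. `resolve_rootfs` and `snapshot_device_path` are hypothetical names; the real `cow_manager.setup()` signature is not shown on this page, so it is modelled as a closure.

```rust
use std::path::{Path, PathBuf};

/// Device path for a sandbox's snapshot, using the naming from the PR
/// description. (Hypothetical helper.)
pub fn snapshot_device_path(sandbox_id: &str) -> PathBuf {
    PathBuf::from(format!("/dev/mapper/arcbox-snap-{sandbox_id}"))
}

/// Pick the block device to hand to Firecracker: the snapshot device if
/// `setup` succeeds, otherwise the rootfs image itself.
/// `setup` stands in for the real CoW setup call, whose signature is assumed.
pub fn resolve_rootfs<F>(sandbox_id: &str, rootfs_image: &Path, setup: F) -> PathBuf
where
    F: FnOnce(&str, &Path) -> Result<PathBuf, String>,
{
    match setup(sandbox_id, rootfs_image) {
        Ok(device) => device,
        // The real code logs the reason; here it is simply discarded.
        Err(_reason) => rootfs_image.to_path_buf(),
    }
}
```
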

`remove_sandbox_impl()` tears down the dm-snapshot after the
Firecracker process exits. TTL expiry and restore paths also pass
the cow_manager through for proper cleanup.

Jailer mode and checkpoint/restore are unchanged (Phase 2/3).

The default EnvFilter only allowed `arcbox_agent=info`, which silently
dropped all logs from the `arcbox_vm` crate (including dm-snapshot
setup/teardown). Add `arcbox_vm=info` so CowManager diagnostics are
visible in agent.log.
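
Why the `arcbox_vm` logs were dropped can be shown with a toy model of target-based filtering. This is not tracing-subscriber's implementation (the real EnvFilter also handles spans, fields and most-specific-match rules); it only assumes that a `target=level` directive enables events whose target starts with that name, and that targets matching no directive are disabled.

```rust
/// Log levels ordered from least to most verbose.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Toy filter: an event passes if some directive's target is a prefix of the
/// event target and the event is no more verbose than the directive's level.
/// (Simplified model for illustration only.)
pub fn is_enabled(directives: &[(&str, Level)], target: &str, level: Level) -> bool {
    directives
        .iter()
        .any(|(prefix, max)| target.starts_with(*prefix) && level <= *max)
}
```

Under this model, `[("arcbox_agent", Level::Info)]` rejects an info event from `arcbox_vm::snapshot_cow`, while adding `("arcbox_vm", Level::Info)` lets it through. Debug-level events from `arcbox_vm` remain filtered in both cases.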
Copilot AI review requested due to automatic review settings April 7, 2026 11:56
Copilot AI (Contributor) left a comment

Pull request overview

Adds dm-snapshot–backed block-level copy-on-write to avoid per-sandbox rootfs image copies when running nested Firecracker sandboxes inside the guest VM.

Changes:

  • Introduces CowManager/CowHandle to create and manage dm-snapshot devices backed by sparse COW files.
  • Wires CoW setup/teardown into sandbox boot and removal (direct mode; jailer mode deferred).
  • Extends logging defaults in arcbox-agent to include arcbox_vm at info.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| virt/arcbox-vm/src/snapshot_cow.rs | New dm-snapshot CoW implementation (loop attach, dmsetup create/remove, stale cleanup, small unit tests). |
| virt/arcbox-vm/src/sandbox.rs | Integrates CoW into sandbox lifecycle: setup during boot (direct mode) and teardown during remove. |
| virt/arcbox-vm/src/lib.rs | Exposes the new snapshot_cow module. |
| virt/arcbox-vm/src/error.rs | Adds VmmError::DeviceMapper for dm/loop/dmsetup-related failures. |
| guest/arcbox-agent/src/main.rs | Expands default tracing filter to include arcbox_vm=info. |

Comment thread virt/arcbox-vm/src/snapshot_cow.rs
Comment thread virt/arcbox-vm/src/snapshot_cow.rs Outdated
Comment thread virt/arcbox-vm/src/sandbox.rs Outdated
- teardown(): only delete COW file after both dm device removal and
  loop detach succeed; avoids unlinking a still-referenced backing file
  that would delay space reclamation.
- Downgrade dm-snapshot fallback log from warn to debug in do_boot();
  CowManager::new() already warns once at init when dmsetup is missing,
  so per-sandbox warnings are redundant noise.
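
The conditional deletion in teardown() can be sketched as follows. The dm removal and loop detach are modelled as closures, the function signature is hypothetical, and skipping the loop detach when dm removal fails is an assumption (the snapshot would still hold the loop device open); the PR's actual ordering may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Tear down a sandbox's CoW resources. The COW file is unlinked only when
/// both the dm device removal and the loop detach succeeded; otherwise it is
/// left in place for a later cleanup pass. Returns whether the file was
/// deleted. (Hypothetical signature, not taken from the PR.)
pub fn teardown<R, D>(cow_file: &Path, remove_dm: R, detach_loop: D) -> io::Result<bool>
where
    R: FnOnce() -> Result<(), String>,
    D: FnOnce() -> Result<(), String>,
{
    let dm_removed = remove_dm().is_ok();
    // Assumption: only try to detach the loop device once the dm device
    // that references it is gone.
    let loop_detached = dm_removed && detach_loop().is_ok();

    if loop_detached {
        fs::remove_file(cow_file)?;
        Ok(true)
    } else {
        Ok(false)
    }
}
```

Leaving the file behind on partial failure pairs with the stale cleanup at init: anything that could not be released now is picked up on the next start.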
lucas77778 changed the title from "Refactor/sandbox rootfs boot" to "feat(vm): dm-snapshot based block-level CoW for sandbox rootfs" Apr 7, 2026
lucas77778 marked this pull request as ready for review April 7, 2026 12:19
Copilot AI review requested due to automatic review settings April 7, 2026 12:19
lucas77778 requested a review from AprilNEA April 7, 2026 12:21

This comment was marked as outdated.

lucas77778 requested a review from PeronGH April 8, 2026 13:04