Problem
When a tmpfs mount is shared across multiple containers in a Pod (mount hint share=pod, or share=container) and the whole sandbox is checkpointed and then restored, restore fails for every container past the first. The first container restores and runs; the second dies at start:
OCI runtime restore failed: starting container: starting sub-container
[/counter --tick=1s --state-file=/workspace/b.state]:
inconsistent private memory files on restore:
savedMFOwners = [writer-a:/ writer-a:/workspace writer-b:/],
mfmap = map[writer-a:/ ... writer-a:/workspace ... writer-b:/ ... writer-b:/workspace ...]
savedMFOwners has one entry for the shared /workspace overlay (owned by the first container), but mfmap on restore has one per mounting container, so the counts disagree and loadPrivateMemoryFiles (pkg/sentry/kernel/kernel_restore.go) aborts.
Root cause
A pod-shared overlay is backed by a single MemoryFile. At runtime the first container mounts the master via getSharedMount (runsc/boot/vfs.go); peer containers reuse that master and their extra filestore FD is closed, so only one private MemoryFile exists and PrepareSave registers exactly one owner.
On restore, configureRestore (runsc/boot/vfs.go) doesn't account for the sharing: it creates and registers a private MemoryFile for every container's submount that carries a filestore FD. So a Pod with the overlay mounted in two containers restores with two MemoryFile entries for that overlay against one saved owner, and the counts do not match.
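The count mismatch can be reproduced in a toy model. The sketch below is not gVisor code: the mount type, savedOwners, naiveRestore, and the source name workspace-vol are hypothetical stand-ins, and it only assumes that each container also has a private rootfs entry, as the error log suggests.

```go
package main

import "fmt"

// mount is a hypothetical, simplified stand-in for a container submount that
// carries a filestore FD. Source is non-empty only for pod-shared overlays.
type mount struct {
	Container string
	Dest      string
	Source    string
}

// savedOwners models the save side: a shared overlay is registered once,
// under the first container that mounted it; peers reuse the master.
func savedOwners(mounts []mount) []string {
	seen := map[string]bool{}
	var owners []string
	for _, m := range mounts {
		if m.Source != "" {
			if seen[m.Source] {
				continue // peer mount: no additional private MemoryFile
			}
			seen[m.Source] = true
		}
		owners = append(owners, m.Container+":"+m.Dest)
	}
	return owners
}

// naiveRestore models the buggy restore side: one entry per submount,
// regardless of whether the overlay is shared.
func naiveRestore(mounts []mount) map[string]bool {
	mfmap := map[string]bool{}
	for _, m := range mounts {
		mfmap[m.Container+":"+m.Dest] = true
	}
	return mfmap
}

// podMounts is the two-container layout from the error log.
func podMounts() []mount {
	return []mount{
		{"writer-a", "/", ""},
		{"writer-a", "/workspace", "workspace-vol"},
		{"writer-b", "/", ""},
		{"writer-b", "/workspace", "workspace-vol"},
	}
}

func main() {
	m := podMounts()
	fmt.Println("saved owners:", savedOwners(m))
	fmt.Println("restored entries:", len(naiveRestore(m)))
}
```

In this model the save side yields three owners and the naive restore side yields four entries, which is the shape of the mismatch in the log.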
Reproduce
Create a two-container Pod that shares a disk-backed tmpfs overlay via the mount hints dev.gvisor.spec.mount.<name>.share=pod and type=tmpfs (the OCI mount is kept as bind, so the overlay is on disk). Have both containers write to the mount, checkpoint the sandbox with runsc checkpoint, then restore it (e.g. into a fresh Pod carrying the restore annotation). The first container resumes; the second fails with the error above. The same overlay mounted by a single container checkpoints and restores cleanly, which points at the per-container duplicate rather than the overlay itself.
Fix
Mirror getSharedMount in configureRestore: track the shared-overlay sources already registered and, for a peer container's shared mount, close the extra filestore FD and skip creating a duplicate MemoryFile, so the restored map holds exactly one entry per shared overlay and matches the saved owners. With that, both containers restore, the shared workspace comes back on a fresh emptyDir, and writes from either container stay coherent after restore.
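The deduplication described above can be sketched as follows. This is a simplified illustration, not the actual configureRestore change: fakeFD, submount, restoreDeduped, and examplePod are hypothetical names, and the FD is modeled by a flag rather than a real file descriptor.

```go
package main

import "fmt"

// fakeFD is a hypothetical stand-in for a filestore FD; Close only records
// that the FD was released.
type fakeFD struct{ closed bool }

func (f *fakeFD) Close() { f.closed = true }

// submount is a simplified model of a container submount with a filestore FD.
// Source is non-empty only for pod-shared overlays.
type submount struct {
	Container, Dest, Source string
	FD                      *fakeFD
}

// restoreDeduped registers one entry per private mount and one entry per
// shared overlay source. For a peer's shared mount it closes the extra FD
// and skips creating a duplicate entry.
func restoreDeduped(mounts []submount) map[string]*fakeFD {
	mfmap := map[string]*fakeFD{}
	registered := map[string]bool{} // shared sources already handled
	for _, m := range mounts {
		if m.Source != "" {
			if registered[m.Source] {
				m.FD.Close()
				continue
			}
			registered[m.Source] = true
		}
		mfmap[m.Container+":"+m.Dest] = m.FD
	}
	return mfmap
}

// examplePod builds the two-container layout from the error log.
func examplePod() []submount {
	return []submount{
		{"writer-a", "/", "", &fakeFD{}},
		{"writer-a", "/workspace", "workspace-vol", &fakeFD{}},
		{"writer-b", "/", "", &fakeFD{}},
		{"writer-b", "/workspace", "workspace-vol", &fakeFD{}},
	}
}

func main() {
	mounts := examplePod()
	mfmap := restoreDeduped(mounts)
	fmt.Println("entries:", len(mfmap))
	fmt.Println("peer FD closed:", mounts[3].FD.closed)
}
```

With the peer's entry skipped, the restored map in this model has three entries, the same count as the saved owners.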
One caveat: keying the restored MemoryFile by {container, mount-destination} makes correctness depend on containers being restored in creation order (fine for the guaranteed unnamed-container case, and for a kubelet restoring in Pod-spec order). A more robust variant keys the shared overlay's MemoryFile on the pod-global mount source so ordering stops mattering.
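To illustrate the ordering caveat, the toy model below compares the two keying schemes under forward and reversed restore order. All names are hypothetical, and the "shared:" key format is only an assumption made for this sketch, not a format gVisor uses.

```go
package main

import (
	"fmt"
	"sort"
)

// sharedMount is a simplified model of a pod-shared overlay submount.
type sharedMount struct{ Container, Dest, Source string }

// keys returns the MemoryFile keys that a deduplicating restore would
// register, either by {container, mount-destination} of the first mounter
// seen, or by the pod-global mount source.
func keys(mounts []sharedMount, bySource bool) []string {
	seen := map[string]bool{}
	var out []string
	for _, m := range mounts {
		if seen[m.Source] {
			continue
		}
		seen[m.Source] = true
		if bySource {
			out = append(out, "shared:"+m.Source)
		} else {
			out = append(out, m.Container+":"+m.Dest)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	forward := []sharedMount{
		{"writer-a", "/workspace", "workspace-vol"},
		{"writer-b", "/workspace", "workspace-vol"},
	}
	reverse := []sharedMount{forward[1], forward[0]}

	// Keyed by container and destination: the key follows restore order.
	fmt.Println(keys(forward, false), keys(reverse, false))
	// Keyed by source: the key is the same in either order.
	fmt.Println(keys(forward, true), keys(reverse, true))
}
```

In this model, container-keyed restore produces writer-a:/workspace in one order and writer-b:/workspace in the other, so only the creation order matches a saved owner of writer-a:/workspace, while the source-keyed variant produces the same key either way.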
Environment
runsc built from the #13326 branch (Kubernetes pod checkpoint/restore), linux/arm64, containerd v2.x, single-node kind cluster.