containerd-shim: support a disk-backed, pod-shared emptyDir that runsc checkpoint can capture #13595

Description

@mayur-tolexo

Problem

The containerd shim's UpdateVolumeAnnotations (pkg/shim/v1/utils/volumes.go) gives an emptyDir one of two shapes when it carries a mount hint: an empty emptyDir becomes a memory-backed tmpfs, and a force-shared/non-empty one becomes a share=shared gofer bind. Neither is simultaneously disk-backed, shared by two containers in the Pod, and captured by runsc checkpoint — even though the runsc runtime itself supports exactly that combination.

Use case: a Pod with two containers sharing a working directory (one writes, the other reads/serves it). We want the directory on ephemeral disk (it can grow large, so RAM isn't acceptable), shared coherently between both containers, and included in runsc checkpoint / runsc fscheckpoint so a snapshot captures the workload's files together with its memory.

runsc already supports this (raw OCI bundle)

When the mount is declared as a bind from a real directory and carries a hint of type=tmpfs + share=pod, runsc builds a disk-backed SelfOverlay that is shared across both containers and captured by checkpoint. The output below comes from driving runsc directly (a sandbox container plus two app containers joining via io.kubernetes.cri.sandbox-id), with /workspace mounted as {type: bind, source: <emptyDir node dir>, options: [rbind,rw]} and the annotations dev.gvisor.spec.mount.ws.{type=tmpfs, share=pod, source=<same dir>}:

# disk-backed SelfOverlay filestore, on disk, inside the source dir:
$ ls -la /var/lib/kubelet/pods/013952ee-.../volumes/kubernetes.io~empty-dir/data
-rw-r--r-- 1 root root 1073741824 .gvisor.filestore.ed

# second container reads the first container's writes (shared, no error):
=== (shared) sidecar sees agent.txt? ===
agent n=3
SIDE:
agent n=6

# captured by fscheckpoint:
$ runsc ... fscheckpoint --leave-running --image-path=/out --path=/workspace ed
$ cat /out/fscheckpoint.json
... "tmpfs":[{"resource_id":{"path":"/workspace"},"tar_start":0,"tar_end":3584}] ...
multitar bytes: 3584

# captured by full checkpoint:
$ runsc ... checkpoint --leave-running --image-path=/out2 ed
-rw-r--r-- 1 root root 340340 checkpoint.img
-rw-r--r-- 1 root root 606208 pages.img
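
For reference, the mount and hint shape used in this raw-bundle test can be sketched in Go as follows. This is only an illustration of the bind + type=tmpfs + share=pod combination described above: the helper name podSharedDiskMount, the ociMount struct, and the /path/to/emptydir placeholder are hypothetical and are not part of runsc or the shim.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ociMount mirrors the mount fields of an OCI config.json entry.
// It is a local illustration type, not the runtime-spec Go type.
type ociMount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type"`
	Source      string   `json:"source"`
	Options     []string `json:"options"`
}

// podSharedDiskMount (hypothetical helper) returns the bind mount and the
// dev.gvisor.spec.mount.<name>.* hints used in the raw-bundle test above.
func podSharedDiskMount(name, dest, sourceDir string) (ociMount, map[string]string) {
	prefix := "dev.gvisor.spec.mount." + name + "."
	mount := ociMount{
		Destination: dest,
		Type:        "bind", // the OCI mount stays a bind from a real directory
		Source:      sourceDir,
		Options:     []string{"rbind", "rw"},
	}
	hints := map[string]string{
		prefix + "type":   "tmpfs", // hint type, not the OCI mount type
		prefix + "share":  "pod",
		prefix + "source": sourceDir, // same directory as the bind source
	}
	return mount, hints
}

func main() {
	// "/path/to/emptydir" is a placeholder for the emptyDir node directory.
	mount, hints := podSharedDiskMount("ws", "/workspace", "/path/to/emptydir")
	out, err := json.MarshalIndent(struct {
		Mount       ociMount          `json:"mount"`
		Annotations map[string]string `json:"annotations"`
	}{Mount: mount, Annotations: hints}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The point of the sketch is that the OCI mount type (bind) and the hint type (tmpfs) differ on purpose; that pairing is what the shim currently cannot produce for an emptyDir.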

Through the containerd shim (a real Pod) it is unreachable

  1. Empty emptyDir + the same type=tmpfs/share=pod hint: the shim rewrites the OCI mount from bind to tmpfs, so it becomes memory-backed. Two containers still share it, but the on-disk dir stays empty — the contents are in RAM, not on disk:
# pod sp-2c: emptyDir + dev.gvisor.spec.mount.ws.type=tmpfs + .share=pod, 2 containers
$ kubectl get pod sp-2c
NAME    READY   STATUS    RESTARTS   AGE
sp-2c   2/2     Running   0          8s
# second container shares /workspace:
SEES:
agent n=7
# but the emptyDir's node dir is empty -> RAM, no .gvisor.filestore.* on disk:
$ ls -la .../kubernetes.io~empty-dir/ws/
total 8
drwxrwxrwx 2 root root 4096 .
drwxr-xr-x 3 root root 4096 ..
# the full runsc checkpoint does capture this mount (it lives in the main MemoryFile):
$ runsc checkpoint --leave-running --image-path=/out <cid>
-rw-r--r-- 1 root root 444279 checkpoint.img
-rw-r--r-- 1 root root 446464 pages.img

So sharing across two containers does work this way — but only with the workspace in RAM, which is what we need to avoid for a workspace that can grow large. (Whether fscheckpoint can also capture a memory-backed tmpfs is a separate matter, tracked in #13566; the request here is to keep it on disk.)

  2. Two containers + --overlay2=all:self (no hint): the second container fails to start, since each tries to create its own SelfOverlay filestore on the same source:
$ kubectl get pod ws-2c
NAME    READY   STATUS       RESTARTS   AGE
ws-2c   1/2     StartError   0          8s

OCI runtime start failed: starting container: creating gofer filestore files:
".../kubernetes.io~empty-dir/ws" mount source already has a filestore file at
".../.gvisor.filestore.ec604a9e65cf5077ca693b1a142226f9e21b10f38cda6032654e39e41883a228";
repeated submounts are not supported with overlay optimizations: unknown
  3. force-shared: the shim keeps the bind but sets share=shared (a gofer passthrough). The result is on disk and shared, but not captured by runsc checkpoint.

force-shared already shows the shim can keep an emptyDir a bind; the missing piece is doing so with share=pod (the sandbox-local, checkpointable overlay) rather than share=shared (gofer).

Proposed behavior

When an emptyDir opts into a shared overlay — e.g. dev.gvisor.spec.mount.<name>.share=pod, or a new dev.gvisor.empty-dir.<name> field — have the shim keep the OCI mount a bind and set the hint type=tmpfs, instead of rewriting the mount to tmpfs or forcing share=shared. That hands runsc the bind + type=tmpfs + share=pod shape it already understands, yielding a disk-backed SelfOverlay shared across the Pod's containers and captured by runsc checkpoint / runsc fscheckpoint. The kubelet continues to own the emptyDir lifecycle.
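
A minimal sketch of the proposed decision, written as a simplified model rather than the actual UpdateVolumeAnnotations code. All names here (volumeInput, volumeShape, decideShape, the DiskBackedPodShared opt-in flag) are hypothetical, and the two existing branches only restate the behavior described in this issue.

```go
package main

import "fmt"

// volumeInput is a simplified, hypothetical view of what the shim knows
// about an emptyDir that carries a mount hint.
type volumeInput struct {
	Empty               bool   // the emptyDir has no contents yet
	ForceShared         bool   // the existing force-shared case
	ShareHint           string // value of the .share hint, e.g. "pod"
	DiskBackedPodShared bool   // proposed opt-in (hypothetical flag)
}

// volumeShape is the resulting OCI mount type plus the gVisor hint values.
type volumeShape struct {
	OCIMountType string
	HintType     string // left empty where the issue does not state it
	HintShare    string
}

// decideShape models the two existing shapes and the proposed third one.
func decideShape(in volumeInput) volumeShape {
	switch {
	case in.DiskBackedPodShared:
		// Proposed: keep the bind, hint type=tmpfs, share=pod.
		return volumeShape{OCIMountType: "bind", HintType: "tmpfs", HintShare: "pod"}
	case in.ForceShared || !in.Empty:
		// Existing: gofer passthrough, on disk but not checkpointed.
		return volumeShape{OCIMountType: "bind", HintShare: "shared"}
	default:
		// Existing: mount rewritten to tmpfs, so it is memory-backed.
		return volumeShape{OCIMountType: "tmpfs", HintType: "tmpfs", HintShare: in.ShareHint}
	}
}

func main() {
	cases := []volumeInput{
		{Empty: true, ShareHint: "pod"},
		{Empty: true, ForceShared: true},
		{Empty: true, ShareHint: "pod", DiskBackedPodShared: true},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %+v\n", c, decideShape(c))
	}
}
```

Checking the opt-in first means the two existing shapes stay unchanged for Pods that do not ask for the new behavior. Whether the opt-in should be the share=pod hint itself or a separate field is left open, as in the proposal above.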

Environment

runsc from PR #13326 (containerd-shim checkpoint/restore), commit 2f05ec978e4c; reproduced on kind, containerd v2.2.0, single-node, arm64.
