containerd-shim: support a disk-backed, pod-shared emptyDir that runsc checkpoint can capture #13595

### Problem

The containerd shim's `UpdateVolumeAnnotations` (`pkg/shim/v1/utils/volumes.go`) gives an `emptyDir` one of two shapes when it carries a mount hint: an empty `emptyDir` becomes a memory-backed tmpfs, and a `force-shared`/non-empty one becomes a `share=shared` gofer bind. Neither is simultaneously **disk-backed**, **shared by two containers in the Pod**, and **captured by `runsc checkpoint`** — even though the runsc runtime itself supports exactly that combination.

Use case: a Pod with two containers sharing a working directory (one writes, the other reads/serves it). We want the directory on ephemeral disk (it can grow large, so RAM isn't acceptable), shared coherently between both containers, and included in `runsc checkpoint` / `runsc fscheckpoint` so a snapshot captures the workload's files together with its memory.

### runsc already supports this (raw OCI bundle)

With the mount declared as a `bind` from a real directory plus a hint of `type=tmpfs` + `share=pod`, runsc builds a disk-backed SelfOverlay that is shared across both containers and captured by checkpoint. The output below comes from driving runsc directly (a sandbox container plus two app containers joining via `io.kubernetes.cri.sandbox-id`), with `/workspace` mounted as `{type: bind, source: <emptyDir node dir>, options: [rbind,rw]}` and the annotations `dev.gvisor.spec.mount.ws.{type=tmpfs, share=pod, source=<same dir>}`:

```
# disk-backed SelfOverlay filestore, on disk, inside the source dir:
$ ls -la /var/lib/kubelet/pods/013952ee-.../volumes/kubernetes.io~empty-dir/data
-rw-r--r-- 1 root root 1073741824 .gvisor.filestore.ed

# second container reads the first container's writes (shared, no error):
=== (shared) sidecar sees agent.txt? ===
agent n=3
SIDE:
agent n=6

# captured by fscheckpoint:
$ runsc ... fscheckpoint --leave-running --image-path=/out --path=/workspace ed
$ cat /out/fscheckpoint.json
... "tmpfs":[{"resource_id":{"path":"/workspace"},"tar_start":0,"tar_end":3584}] ...
multitar bytes: 3584

# captured by full checkpoint:
$ runsc ... checkpoint --leave-running --image-path=/out2 ed
-rw-r--r-- 1 root root 340340 checkpoint.img
-rw-r--r-- 1 root root 606208 pages.img
```
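For reference, the mount shape used in the run above can be written out as a small Go sketch. The struct and the function name `podSharedDiskMount` are illustrative only (they are not part of runsc or the shim), and the source path is a placeholder; the mount fields and annotation keys are the ones quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Mount mirrors the OCI runtime-spec mount fields used in this issue.
// It is a local stand-in, not the runtime-spec Go type.
type Mount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type"`
	Source      string   `json:"source"`
	Options     []string `json:"options,omitempty"`
}

// podSharedDiskMount (hypothetical helper) returns the bind mount and the
// dev.gvisor.spec.mount.<name>.* hints for a volume backed by source.
func podSharedDiskMount(name, source, destination string) (Mount, map[string]string) {
	prefix := "dev.gvisor.spec.mount." + name + "."
	m := Mount{
		Destination: destination,
		Type:        "bind", // the OCI mount stays a bind from the real directory
		Source:      source,
		Options:     []string{"rbind", "rw"},
	}
	hints := map[string]string{
		prefix + "type":   "tmpfs",
		prefix + "share":  "pod",
		prefix + "source": source,
	}
	return m, hints
}

func main() {
	// Placeholder path: substitute the emptyDir's directory on the node.
	m, hints := podSharedDiskMount("ws", "/path/to/emptydir", "/workspace")
	out, err := json.MarshalIndent(map[string]any{"mount": m, "annotations": hints}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```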

### Through the containerd shim (a real Pod) it is unreachable

1. Empty `emptyDir` + the same `type=tmpfs`/`share=pod` hint: the shim rewrites the OCI mount from `bind` to `tmpfs`, so it becomes memory-backed. Two containers still share it, but the on-disk dir stays empty — the contents are in RAM, not on disk:

```
# pod sp-2c: emptyDir + dev.gvisor.spec.mount.ws.type=tmpfs + .share=pod, 2 containers
$ kubectl get pod sp-2c
NAME    READY   STATUS    RESTARTS   AGE
sp-2c   2/2     Running   0          8s
# second container shares /workspace:
SEES:
agent n=7
# but the emptyDir's node dir is empty -> RAM, no .gvisor.filestore.* on disk:
$ ls -la .../kubernetes.io~empty-dir/ws/
total 8
drwxrwxrwx 2 root root 4096 .
drwxr-xr-x 3 root root 4096 ..
# the full runsc checkpoint does capture this mount (it lives in the main MemoryFile):
$ runsc checkpoint --leave-running --image-path=/out <cid>
-rw-r--r-- 1 root root 444279 checkpoint.img
-rw-r--r-- 1 root root 446464 pages.img
```

   So sharing across two containers does work this way — but only with the workspace in **RAM**, which is what we need to avoid for a workspace that can grow large. (Whether `fscheckpoint` can also capture a *memory*-backed tmpfs is a separate matter, tracked in #13566; the request here is to keep it on **disk**.)

2. Two containers + `--overlay2=all:self` (no hint): the second container fails to start, since each tries to create its own SelfOverlay filestore on the same source:

```
$ kubectl get pod ws-2c
NAME    READY   STATUS       RESTARTS   AGE
ws-2c   1/2     StartError   0          8s

OCI runtime start failed: starting container: creating gofer filestore files:
".../kubernetes.io~empty-dir/ws" mount source already has a filestore file at
".../.gvisor.filestore.ec604a9e65cf5077ca693b1a142226f9e21b10f38cda6032654e39e41883a228";
repeated submounts are not supported with overlay optimizations: unknown
```

3. `force-shared`: keeps the bind but sets `share=shared` (a gofer passthrough) — on disk and shared, but not captured by `runsc checkpoint`.

`force-shared` already shows that the shim can keep an `emptyDir` as a `bind`; the missing piece is doing so with `share=pod` (the sandbox-local, checkpointable overlay) rather than `share=shared` (gofer).

### Proposed behavior

When an `emptyDir` opts into a shared overlay (e.g. via `dev.gvisor.spec.mount.<name>.share=pod`, or a new `dev.gvisor.empty-dir.<name>` field), have the shim keep the OCI mount as a `bind` and set the hint `type=tmpfs`, instead of rewriting the mount to tmpfs or forcing `share=shared`. That hands runsc the `bind` + `type=tmpfs` + `share=pod` shape it already understands, yielding a disk-backed SelfOverlay that is shared across the Pod's containers and captured by `runsc checkpoint` / `runsc fscheckpoint`. The kubelet continues to own the `emptyDir` lifecycle.
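The requested branch could look roughly like the sketch below. It is not a patch against `UpdateVolumeAnnotations`: the types `volumeHint` and `rewrite` and the function `proposedEmptyDirRewrite` are hypothetical, and the sketch assumes the opt-in is expressed as `share=pod`. It only shows the decision being asked for.

```go
package main

import "fmt"

// volumeHint (hypothetical) holds the per-volume value read from annotations.
type volumeHint struct {
	Share string // dev.gvisor.spec.mount.<name>.share, "" if unset
}

// rewrite (hypothetical) is what the shim would hand to runsc for one emptyDir.
type rewrite struct {
	MountType string // OCI mount type
	HintType  string // dev.gvisor.spec.mount.<name>.type
	HintShare string // dev.gvisor.spec.mount.<name>.share
}

// proposedEmptyDirRewrite returns the bind + type=tmpfs + share=pod shape when
// the volume opts in with share=pod. It reports false otherwise, so the caller
// can fall through to the shim's existing behavior.
func proposedEmptyDirRewrite(h volumeHint) (rewrite, bool) {
	if h.Share != "pod" {
		return rewrite{}, false
	}
	return rewrite{MountType: "bind", HintType: "tmpfs", HintShare: "pod"}, true
}

func main() {
	for _, share := range []string{"pod", "shared", ""} {
		r, ok := proposedEmptyDirRewrite(volumeHint{Share: share})
		fmt.Printf("share=%q handled=%v rewrite=%+v\n", share, ok, r)
	}
}
```

Keeping the new branch separate from the existing paths leaves the current tmpfs and `force-shared` behavior untouched for volumes that do not opt in.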

### Environment

runsc from PR #13326 (containerd-shim checkpoint/restore), commit `2f05ec978e4c`; reproduced on kind, containerd v2.2.0, single-node, arm64.

