libcontainer: reuse tmpfs for directory masks#5262

Open: dims wants to merge 1 commit into opencontainers:main from dims:maskpaths-shared-tmpfs

dims (Contributor) commented Apr 25, 2026

Could a problem that recently showed up in Kubernetes, after we added more masked paths, be handled better in runc itself? Details below:

Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths. On large Intel systems this can produce many directory masks for a single container. runc currently handles each directory mask with a separate read-only tmpfs mount, and therefore a separate tmpfs superblock.
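To make the scaling concrete, here is a minimal sketch of how the number of directory masks grows with the CPU count. The function name `thermalThrottleMasks` and the exact sysfs path layout are assumptions for illustration, not taken from the Kubernetes or runc sources.

```go
package main

import (
	"fmt"
	"runtime"
)

// thermalThrottleMasks returns one masked path per CPU index.
// Hypothetical helper: the path layout below is an assumed example of a
// per-CPU sysfs thermal_throttle directory.
func thermalThrottleMasks(numCPU int) []string {
	paths := make([]string, 0, numCPU)
	for i := 0; i < numCPU; i++ {
		paths = append(paths, fmt.Sprintf("/sys/devices/system/cpu/cpu%d/thermal_throttle", i))
	}
	return paths
}

func main() {
	masks := thermalThrottleMasks(runtime.NumCPU())
	// With one tmpfs per directory mask, every entry costs one superblock.
	fmt.Printf("%d directory masks for a single container\n", len(masks))
}
```

On a host with hundreds of CPUs, the list has hundreds of entries, and each one is a separate tmpfs under the per-target approach described above.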

On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs superblocks can contend on the global shrinker_rwsem when containers start or stop concurrently.

Use one read-only tmpfs for directory masks and bind-mount it over the remaining directory targets. The first non-procfs-fd directory mount is reopened through the container root fd before it is reused. File masks still bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target behavior because they are fd aliases rather than stable rootfs paths.
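The decision logic in the paragraph above can be sketched roughly as follows. This is not the code in the PR: the types `maskTarget` and `maskOp`, the function `planMasks`, and the action names are hypothetical, and the sketch only plans the operations instead of calling mount(2). It also leaves out the step where the first directory mount is reopened through the container root fd.

```go
package main

import "fmt"

// targetKind classifies a masked path. The caller supplies the kind here;
// how a real runtime detects it is outside this sketch.
type targetKind int

const (
	kindFile        targetKind = iota // regular file target
	kindDir                           // directory with a stable rootfs path
	kindProcfsFdDir                   // directory reached through a procfs fd alias
)

type maskTarget struct {
	path string
	kind targetKind
}

// maskOp describes one planned mount. All names are illustrative.
type maskOp struct {
	target string
	action string // "bind-devnull", "new-tmpfs", or "bind-shared"
	source string // bind source; empty when a fresh tmpfs is created
}

// planMasks creates at most one tmpfs for ordinary directory targets and
// plans bind mounts of that tmpfs for the remaining ones.
func planMasks(targets []maskTarget) []maskOp {
	var ops []maskOp
	shared := "" // path of the first ordinary directory mask, once mounted
	for _, t := range targets {
		switch t.kind {
		case kindFile:
			ops = append(ops, maskOp{target: t.path, action: "bind-devnull", source: "/dev/null"})
		case kindProcfsFdDir:
			// fd aliases are not stable rootfs paths, so each keeps its own tmpfs.
			ops = append(ops, maskOp{target: t.path, action: "new-tmpfs"})
		case kindDir:
			if shared == "" {
				shared = t.path
				ops = append(ops, maskOp{target: t.path, action: "new-tmpfs"})
			} else {
				ops = append(ops, maskOp{target: t.path, action: "bind-shared", source: shared})
			}
		}
	}
	return ops
}

func main() {
	ops := planMasks([]maskTarget{
		{path: "/proc/kcore", kind: kindFile},
		{path: "/proc/acpi", kind: kindDir},
		{path: "/sys/firmware", kind: kindDir},
	})
	for _, op := range ops {
		fmt.Printf("%-14s %-12s %s\n", op.target, op.action, op.source)
	}
}
```

With N ordinary directory targets, this plan creates one tmpfs and N-1 bind mounts instead of N separate tmpfs mounts.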

The bind mounts do not create additional tmpfs superblocks. They also retain the read-only mount flag inherited from the source vfsmount, so the masking semantics remain unchanged.

xref: kubernetes/kubernetes#138512
xref: kubernetes/kubernetes#138388
xref: kubernetes/kubernetes#131018

(With some assistance from claude/codex)

Comment thread libcontainer/rootfs_linux.go
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
dims force-pushed the maskpaths-shared-tmpfs branch from ec09615 to df913ca on April 26, 2026 09:58
dims changed the title from "[WIP] libcontainer: reuse tmpfs for directory masks" to "libcontainer: reuse tmpfs for directory masks" on Apr 26, 2026
dims (Contributor, Author) commented Apr 27, 2026

cc @kolyshkin @AkihiroSuda

(I think the Fedora CI break is a flake.)

thaJeztah (Member) commented

> (I think the Fedora CI break is a flake.)

Kicked CI; it's green now.

kolyshkin (Contributor) commented

A similar fix was made in crun a while back, and it makes sense (especially as the number of masked paths and containers grows).

(I wish the kernel had a special mount flag for in-container /proc and /sys, but until we have it...)

dims (Contributor, Author) commented Apr 28, 2026

Thanks, @thaJeztah and @kolyshkin!
