libcontainer: reuse tmpfs for directory masks #5262
Open
dims wants to merge 1 commit into opencontainers:main
AkihiroSuda reviewed Apr 25, 2026
Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths. On large Intel systems this can produce many directory masks for a single container. runc currently handles each directory mask with a separate read-only tmpfs mount, and therefore a separate tmpfs superblock.

On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs superblocks can contend on the global shrinker_rwsem when containers start or stop concurrently.

Use one read-only tmpfs for directory masks and bind-mount it over the remaining directory targets. The first non-procfs-fd directory mount is reopened through the container root fd before it is reused. File masks still bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target behavior because they are fd aliases rather than stable rootfs paths.

The bind mounts do not create additional tmpfs superblocks. They also retain the read-only mount flag inherited from the source vfsmount, so the masking semantics remain unchanged.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Contributor, Author
(I think the Fedora CI break is a flake.)
Member
Kicked CI; it's green now.
Contributor
A similar fix was made to crun a while back, and it makes sense (especially with more masked paths and containers). (I wish the kernel had a special mount flag for in-container /proc and /sys, but until we have it...)
Contributor, Author
Thanks @thaJeztah @kolyshkin!
Wondering if a problem that showed up recently in k8s when we added more masked paths can be handled better in runc itself? Please see details below:

Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths. On large Intel systems this can produce many directory masks for a single container. runc currently handles each directory mask with a separate read-only tmpfs mount, and therefore a separate tmpfs superblock.
On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs superblocks can contend on the global shrinker_rwsem when containers start or stop concurrently.
Use one read-only tmpfs for directory masks and bind-mount it over the remaining directory targets. The first non-procfs-fd directory mount is reopened through the container root fd before it is reused. File masks still bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target behavior because they are fd aliases rather than stable rootfs paths.
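The decision described above can be sketched as a pure planning function. This is only an illustration, not the code in this PR: the names `maskTarget`, `maskOp`, `planMasks`, and `countSuperblocks` are hypothetical, the paths in `main` are illustrative, and the step that reopens the first directory mount through the container root fd is not modeled.

```go
package main

import "fmt"

// maskKind says how one masked path is hidden (hypothetical type).
type maskKind int

const (
	bindDevNull     maskKind = iota // file target: bind-mount /dev/null over it
	freshTmpfs                      // new read-only tmpfs, i.e. a new superblock
	bindSharedTmpfs                 // bind-mount the tmpfs that is already mounted
)

func (k maskKind) String() string {
	return [...]string{"bind /dev/null", "new tmpfs", "bind shared tmpfs"}[k]
}

// maskTarget is a hypothetical description of one maskedPaths entry.
type maskTarget struct {
	Path     string
	IsDir    bool
	IsProcFd bool // reached through a procfs fd alias, not a stable rootfs path
}

// maskOp is one planned mount operation.
type maskOp struct {
	Kind   maskKind
	Source string
	Target string
}

// planMasks gives the first ordinary directory a read-only tmpfs and plans
// bind mounts of that tmpfs for every later ordinary directory. Files still
// get /dev/null, and procfs fd targets each keep their own tmpfs.
func planMasks(targets []maskTarget) []maskOp {
	var ops []maskOp
	shared := "" // path of the first ordinary directory mask, once mounted
	for _, t := range targets {
		switch {
		case !t.IsDir:
			ops = append(ops, maskOp{bindDevNull, "/dev/null", t.Path})
		case t.IsProcFd:
			ops = append(ops, maskOp{freshTmpfs, "tmpfs", t.Path})
		case shared == "":
			shared = t.Path
			ops = append(ops, maskOp{freshTmpfs, "tmpfs", t.Path})
		default:
			ops = append(ops, maskOp{bindSharedTmpfs, shared, t.Path})
		}
	}
	return ops
}

// countSuperblocks counts how many tmpfs superblocks the plan would create.
func countSuperblocks(ops []maskOp) int {
	n := 0
	for _, op := range ops {
		if op.Kind == freshTmpfs {
			n++
		}
	}
	return n
}

func main() {
	// Illustrative input: one file mask plus one directory mask per CPU.
	targets := []maskTarget{{Path: "/proc/kcore"}}
	for cpu := 0; cpu < 4; cpu++ {
		p := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/thermal_throttle", cpu)
		targets = append(targets, maskTarget{Path: p, IsDir: true})
	}
	ops := planMasks(targets)
	for _, op := range ops {
		fmt.Printf("%-18s %s -> %s\n", op.Kind, op.Source, op.Target)
	}
	fmt.Println("tmpfs superblocks:", countSuperblocks(ops))
}
```

In this model the number of tmpfs superblocks no longer grows with the number of ordinary directory masks; only procfs fd targets add to it.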
The bind mounts do not create additional tmpfs superblocks. They also retain the read-only mount flag inherited from the source vfsmount, so the masking semantics remain unchanged.
xref: kubernetes/kubernetes#138512
xref: kubernetes/kubernetes#138388
xref: kubernetes/kubernetes#131018
(With some assistance from claude/codex)