fix(shim): use per-container id for TaskOOM routing (fixes #12838)#13097
Open
copybara-service[bot] wants to merge 1 commit intomasterfrom
Open
fix(shim): use per-container id for TaskOOM routing (fixes #12838)#13097copybara-service[bot] wants to merge 1 commit intomasterfrom
copybara-service[bot] wants to merge 1 commit intomasterfrom
Conversation
## Summary Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with `reason:Error` instead of `reason:OOMKilled`. PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes `TaskOOM{ContainerID: s.id}`, and `s.id` is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing. ## Root cause 1. `pkg/shim/v1/manager.go:newCommand()` spawns the shim binary with `-namespace`, `-address`, and `-publish-binary` only — no `-id`. 2. In the spawned shim, containerd's vendored shim framework parses the flags; `-id` defaults to `""` (`runtime/v2/shim/shim.go`: `flag.StringVar(&id, "id", "", ...)`). With `manager == nil`, that empty id is threaded into `runsc.NewTaskService(ctx, id, ...)`, so `runscService.id = ""`. 3. Every `TaskOOM{ContainerID: s.id}` — from the async `oom_v2.go`'s `run()` and from the sync `service.go`'s `checkProcesses()` added by #12843 — ships with `ContainerID=""`. 4. containerd's CRI handler (`internal/cri/server/events.go`) routes TaskOOM via `containerStore.Get(e.ContainerID)`. `Get("")` returns `ErrEmptyPrefix` (from `internal/truncindex`). That is **not** `IsNotFound`, so `handleEvent` wraps and returns the error; containerd logs `can't find container for TaskOOM event: prefix can't be empty`, and the container's `Status.Reason` is never set. 5. `TaskExit` is unaffected because containerd routes TaskExit by `e.ID` (init process id), not `e.ContainerID` — different field, different lookup. ## Fix Don't rely on `s.id` for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed. * `CreateWithFSRestore`: register the cgroup with `rfs.Create.ID` (the per-container id from the CreateTaskRequest) rather than `s.id`. The id stored in the watcher's `cgroups` and `lastOOM` maps is now non-empty and correctly keyed per container, so both the async (`EventChan -> run`) and sync (`isOOM` at exit) paths publish with a real `ContainerID`. * `checkProcesses`: locate the container whose process matched the exit event, use that container's id for `isOOM`, and use it for both `TaskOOM.ContainerID` and `TaskExit.ContainerID`. `oom_v2.go` is unchanged — with `add()` keyed by `rfs.Create.ID`, the existing async publish path already emits TaskOOM with the correct id via `i.id`. ## Validation End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch): * **Before**: OOM'd pods show `reason:Error`; containerd journal shows `TaskOOM event ` (trailing space = empty ContainerID) and `Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty"`. * **After**: OOM'd pods show `reason:OOMKilled`; journal shows `TaskOOM event container_id:"<real-64-hex>"` and zero `prefix can't be empty` errors. Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report `reason:OOMKilled` on both arches. ## Scope This PR touches only the cgroups v2 init-exit path (`CreateWithFSRestore` and `checkProcesses` in `runsc/service.go`). The cgroups v1 poller (`epoll.go`) has the same structural reliance on `s.id` and should get the equivalent treatment — happy to file a follow-up once this lands. ## Follow-ups noted in this work (separate PRs) * cgroups v1 shim (`epoll.go`) has the same empty-id TaskOOM path. * `pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath` documents `%ID%` substitution but the implementation only appends a filename when the path ends in `/`. As a result the shim currently writes all container logs into a literal `/var/log/runsc/%ID%/` directory rather than per-container directories. Noticed while gathering logs for this PR. FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594 PiperOrigin-RevId: 911236015
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix(shim): use per-container id for TaskOOM routing (fixes #12838)
Summary
Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with
reason:Errorinstead ofreason:OOMKilled.PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes
TaskOOM{ContainerID: s.id}, ands.idis empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.Root cause
pkg/shim/v1/manager.go:newCommand()spawns the shim binary with-namespace,-address, and-publish-binaryonly — no-id.-iddefaults to""(runtime/v2/shim/shim.go:flag.StringVar(&id, "id", "", ...)). Withmanager == nil, that empty id is threaded intorunsc.NewTaskService(ctx, id, ...), sorunscService.id = "".TaskOOM{ContainerID: s.id}— from the asyncoom_v2.go'srun()and from the syncservice.go'scheckProcesses()added by fix(shim): synchronously check cgroup for OOM at container exit #12843 — ships withContainerID="".internal/cri/server/events.go) routes TaskOOM viacontainerStore.Get(e.ContainerID).Get("")returnsErrEmptyPrefix(frominternal/truncindex). That is notIsNotFound, sohandleEventwraps and returns the error; containerd logscan't find container for TaskOOM event: prefix can't be empty, and the container'sStatus.Reasonis never set.TaskExitis unaffected because containerd routes TaskExit bye.ID(init process id), note.ContainerID— different field, different lookup.Fix
Don't rely on
s.idfor TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.CreateWithFSRestore: register the cgroup withrfs.Create.ID(the per-container id from the CreateTaskRequest) rather thans.id. The id stored in the watcher'scgroupsandlastOOMmaps is now non-empty and correctly keyed per container, so both the async (EventChan -> run) and sync (isOOMat exit) paths publish with a realContainerID.checkProcesses: locate the container whose process matched the exit event, use that container's id forisOOM, and use it for bothTaskOOM.ContainerIDandTaskExit.ContainerID.oom_v2.gois unchanged — withadd()keyed byrfs.Create.ID, the existing async publish path already emits TaskOOM with the correct id viai.id.Validation
End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):
reason:Error; containerd journal showsTaskOOM event(trailing space = empty ContainerID) andFailed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty".reason:OOMKilled; journal showsTaskOOM event container_id:"<real-64-hex>"and zeroprefix can't be emptyerrors.Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report
reason:OOMKilledon both arches.Scope
This PR touches only the cgroups v2 init-exit path (
CreateWithFSRestoreandcheckProcessesinrunsc/service.go). The cgroups v1 poller (epoll.go) has the same structural reliance ons.idand should get the equivalent treatment — happy to file a follow-up once this lands.Follow-ups noted in this work (separate PRs)
epoll.go) has the same empty-id TaskOOM path.pkg/shim/v1/runsccmd/utils.go:FormatShimLogPathdocuments%ID%substitution but the implementation only appends a filename when the path ends in/. As a result the shim currently writes all container logs into a literal/var/log/runsc/%ID%/directory rather than per-container directories. Noticed while gathering logs for this PR.FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594