OOM and process telemetry by Sayan- · Pull Request #254 · kernel/kernel-images

Sayan- · 2026-05-27T02:35:15Z

Why

Customers running browsers on Kernel currently get no signal when something inside the VM dies. Chrome renderer gets OOM-killed by the kernel? Silent. Mutter segfaults? Silent. The session just degrades and the customer has to guess. This PR adds two always-on event types that flow through the existing EventStream → SSE/S2 pipeline so they show up on the dashboard like any other browser event.

Two sources, no overlap

Different failures only surface on different channels, so we need both:

| Failure | kmsg | supervisord |
| --- | --- | --- |
| Chrome renderer subprocess OOM | ✓ | ✗ (not supervised) |
| Chromium parent OOM | ✓ | ✓ |
| Chromium segfault / V8 abort | ✗ | ✓ |
| `kernel-images-api` panic | ✗ | ✓ (on respawn) |

No de-dup if both fire for the same kill — that overlap itself is a signal (RAM exhaustion vs. process bug).

Design choices

Public schema is decoupled from the implementation. Events carry phase: startup|running|gave_up instead of supervisord's RUNNING|STARTING|BACKOFF, and Source.Event: "service.crashed" instead of "supervisord.process_exited". If we ever swap supervisord for something systemd-style, the contract holds.
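To make the decoupling concrete, here is a minimal sketch of how a supervisord state could be translated into the public phase. The type name `Phase`, the function `phaseFromState`, and the exact pairing (STARTING to startup, RUNNING to running, BACKOFF to gave_up) are assumptions for illustration; the PR only states that these two vocabularies exist and that unknown states are skipped.

```go
package main

import "fmt"

// Phase is a hypothetical name for the public, supervisor-agnostic
// lifecycle phase carried by service_crashed events.
type Phase string

const (
	PhaseStartup Phase = "startup"
	PhaseRunning Phase = "running"
	PhaseGaveUp  Phase = "gave_up"
)

// phaseFromState maps a supervisord state name to a public phase.
// The pairing below is an assumption. The second return value is false
// for states we do not recognise, so the caller can skip the event
// instead of guessing.
func phaseFromState(fromState string) (Phase, bool) {
	switch fromState {
	case "STARTING":
		return PhaseStartup, true
	case "RUNNING":
		return PhaseRunning, true
	case "BACKOFF":
		return PhaseGaveUp, true
	default:
		return "", false
	}
}

func main() {
	for _, s := range []string{"STARTING", "RUNNING", "BACKOFF", "STOPPING"} {
		p, ok := phaseFromState(s)
		fmt.Printf("%-8s -> %q (known=%v)\n", s, p, ok)
	}
}
```

Because only the mapping function knows supervisord's vocabulary, replacing the supervisor would mean rewriting this one function while the emitted phases stay the same.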

OOM payload answers the customer's actual question. "Chrome died at 2GB RSS" tells you nothing; "chrome held 4.8GB on a 2GB system with 17MB free, top tasks were chrome/mutter/sshd/systemd" tells you it was a chrome memory leak, not an infra problem. We parse the kernel's full Mem-Info + Tasks-state dump (atomically present in kmsg before the kill line) for total/free memory and the top-5 processes by RSS. No race, no separate /proc/meminfo read.
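The top-N selection can be sketched roughly as follows. This is not the PR's parser: it assumes the kmsg prefix has already been stripped, that rows follow the 5.x column layout (`[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name`), that `rss` is counted in 4 KiB pages, and that names are single tokens. `Task`, `parseTaskRow` and `topByRSS` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// pageSizeKB assumes 4 KiB pages, matching the limitation noted in the PR.
const pageSizeKB = 4

// Task is a hypothetical shape for one row of the Tasks-state dump.
type Task struct {
	PID   int
	Name  string
	RSSKB int64
}

// parseTaskRow parses one row of the assumed 5.x task table.
// It returns false for the header row and for anything malformed.
func parseTaskRow(line string) (Task, bool) {
	open := strings.Index(line, "[")
	end := strings.Index(line, "]")
	if open < 0 || end < open {
		return Task{}, false
	}
	pid, err := strconv.Atoi(strings.TrimSpace(line[open+1 : end]))
	if err != nil {
		return Task{}, false // the "[  pid  ]" header row lands here
	}
	// Remaining columns: uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
	f := strings.Fields(line[end+1:])
	if len(f) < 8 {
		return Task{}, false
	}
	rssPages, err := strconv.ParseInt(f[3], 10, 64)
	if err != nil {
		return Task{}, false
	}
	return Task{PID: pid, Name: f[7], RSSKB: rssPages * pageSizeKB}, true
}

// topByRSS returns at most n tasks, largest resident set first.
func topByRSS(lines []string, n int) []Task {
	var tasks []Task
	for _, l := range lines {
		if t, ok := parseTaskRow(l); ok {
			tasks = append(tasks, t)
		}
	}
	sort.SliceStable(tasks, func(i, j int) bool { return tasks[i].RSSKB > tasks[j].RSSKB })
	if len(tasks) > n {
		tasks = tasks[:n]
	}
	return tasks
}

func main() {
	dump := []string{
		"[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name",
		"[    101]     0   101     5000      300    65536        0             0 sshd",
		"[    202]     0   202  2000000  1200000  9830400        0           300 chromium",
		"[    303]     0   303    90000    20000   245760        0             0 mutter",
	}
	for _, t := range topByRSS(dump, 5) {
		fmt.Printf("%-10s pid=%d rss_kb=%d\n", t.Name, t.PID, t.RSSKB)
	}
}
```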

Custom state machine on euank/go-kmsg-parser rather than cadvisor/oomparser. cadvisor's parser is solid but doesn't extract rss_kb, hard-codes /dev/kmsg open-from-start (which would replay the ring buffer on every API restart), and drags in klog. Building on the line-level parser directly lets us SeekEnd() on open, extract the full payload, and stay in our logger.
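The shape of such a line-level state machine might look like the sketch below. It uses only the standard library and a slice of strings in place of the kmsg reader, so it says nothing about the go-kmsg-parser API. The names `Machine`, `OOMKill` and `Feed` are hypothetical, and the marker strings (`invoked oom-killer`, `oom-kill:constraint=`, `Killed process`) are assumed to match the kernel output listed in the test plan.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	constraintRe = regexp.MustCompile(`oom-kill:constraint=CONSTRAINT_([A-Z_]+)`)
	killedRe     = regexp.MustCompile(`Killed process (\d+) \((.*)\)`)
)

// OOMKill is a hypothetical, trimmed-down event payload.
type OOMKill struct {
	Constraint string // "none", "memcg", ...; empty when the dump has no constraint line
	VictimPID  int
	VictimName string
}

// Machine accumulates one OOM dump at a time.
type Machine struct {
	collecting bool
	cur        OOMKill
}

// Feed consumes one kmsg line and reports a completed event, if any.
func (m *Machine) Feed(line string) (OOMKill, bool) {
	if strings.Contains(line, "invoked oom-killer") {
		// Opener: start a fresh dump, discarding any abandoned section.
		m.collecting = true
		m.cur = OOMKill{}
		return OOMKill{}, false
	}
	if !m.collecting {
		return OOMKill{}, false // noise outside a dump
	}
	if sm := constraintRe.FindStringSubmatch(line); sm != nil {
		m.cur.Constraint = strings.ToLower(sm[1])
		return OOMKill{}, false
	}
	if sm := killedRe.FindStringSubmatch(line); sm != nil {
		pid, err := strconv.Atoi(sm[1])
		m.collecting = false
		if err != nil {
			return OOMKill{}, false
		}
		ev := m.cur
		ev.VictimPID, ev.VictimName = pid, sm[2]
		m.cur = OOMKill{}
		return ev, true
	}
	return OOMKill{}, false // Mem-Info, task rows, etc. are ignored in this sketch
}

func main() {
	lines := []string{
		"chromium invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=300",
		"Mem-Info:",
		"oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=chromium,pid=202,uid=0",
		"Out of memory: Killed process 202 (chromium) total-vm:8000000kB, anon-rss:4800000kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:9600kB oom_score_adj:300",
	}
	var m Machine
	for _, l := range lines {
		if ev, ok := m.Feed(l); ok {
			fmt.Printf("%+v\n", ev)
		}
	}
}
```

Emitting only on the kill line is what lets the sketch treat everything before it as one dump; a real parser would also collect Mem-Info and the task table during the collecting state.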

Shim is a separate binary talking to the API over HTTP. Supervisord's eventlistener model requires a separate process, the API already exposes /telemetry/events, and the traffic stays inside the localhost trust boundary, so no auth is needed. Adding a unix socket would gain nothing.

startretries=999999 on the shim. Supervisord doesn't emit events about its own eventlisteners, so if the shim ever entered FATAL state we'd lose all service_crashed telemetry silently. The shim is tiny and side-effect-free; effectively infinite restarts is the right safety bias.

Known limitations (intentional)

API-self-crash isn't captured — when kernel-images-api dies, the shim's POST fails and the event is lost. Unikraft Cloud's VM-level monitoring catches it at the platform layer. Buffering inside the shim would close the gap but isn't worth the scope here.
process_name truncates at 15 chars — fundamental kernel TASK_COMM_LEN limit. Documented.
Page size hard-coded to 4 KiB — correct on x86_64 Unikraft; would be wrong on ARM 16K/64K. Documented.
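The last two limitations come down to small pieces of arithmetic, sketched here under stated assumptions. `truncateComm` and `pagesToKB` are hypothetical helpers, and the constant 16 reflects the kernel's TASK_COMM_LEN (15 visible bytes plus a terminating NUL). `os.Getpagesize()` is shown only as one possible way to avoid the hard-coded value; the PR itself keeps 4 KiB.

```go
package main

import (
	"fmt"
	"os"
)

// taskCommLen mirrors the kernel's TASK_COMM_LEN, which includes the NUL byte.
const taskCommLen = 16

// truncateComm shows what the kernel reports for a longer executable name.
func truncateComm(name string) string {
	if len(name) > taskCommLen-1 {
		return name[:taskCommLen-1]
	}
	return name
}

// pagesToKB converts a page count from the OOM dump into kB for a given page size.
func pagesToKB(pages int64, pageSizeBytes int) int64 {
	return pages * int64(pageSizeBytes) / 1024
}

func main() {
	fmt.Println(truncateComm("chrome_crashpad_handler"))

	const rssPages = 1200000
	fmt.Println("4 KiB pages: ", pagesToKB(rssPages, 4096), "kB")
	fmt.Println("16 KiB pages:", pagesToKB(rssPages, 16384), "kB")
	fmt.Println("this host:   ", pagesToKB(rssPages, os.Getpagesize()), "kB")
}
```

The same page count reads four times larger on a 16 KiB-page system, which is why a hard-coded 4 KiB would under-report there.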

Test plan

State machine unit tests: canonical 5.x dump, legacy 4.x (no Mem-Info), top-N capping, sequential kills, abandoned section, watchdog discriminates noise vs recognised lines
Round-trip: stub kmsg source → Monitor → EventStream → JSON
Shim: byte-level regression test for RESULT 2\nOK (no trailing newline; previous version deadlocked the listener after one event), phase mapping for RUNNING/STARTING/BACKOFF, unknown-state skip
go vet, go test -race (full suite minus e2e), cross-platform build

End-to-end tests

| Scenario | Mechanism | Expected event |
| --- | --- | --- |
| Service crash mid-running | `docker exec ... kill -KILL $(pgrep -f /opt/chrome-for-testing/chromium)` | `service_crashed` with `phase=running` |
| Service exhausts restart retries | install a flaky supervisord program (`sleep 0.1; exit 1`, `startretries=3`) and `supervisorctl start` it | `service_crashed` with `phase=gave_up`, no `pid` |
| Clean stop suppressed (negative) | `docker exec ... supervisorctl stop chromium` | no event (only SSE keepalive) |
| Synthetic OOM dump | inject canonical kmsg lines via `echo "<6>$line" > /dev/kmsg` (opener, `CPU/PID/Comm`, Mem-Info, Tasks-state, `oom-kill:constraint=...`, `Killed process`) | one `system_oom_kill` with `constraint=none`, `mem_total_kb`, top-1 task `chromium`, `trigger_process_name=chromium` |
| Real cgroup OOM | `docker run --memory 512m ...` then run a python memory-hog inside | `system_oom_kill` with `constraint=memcg`; `mem_total_kb`/`mem_free_kb` omitted (memcg dumps skip global Mem-Info); `top_tasks` names are single tokens |
| Real Linux 6.x VM | `echo 1 > /proc/sys/kernel/sysrq; echo f > /proc/sysrq-trigger` | `system_oom_kill` visible in `/var/log/supervisord/kernel-images-api` |

Full reproduction steps for each row live in server/lib/sysmon/README.md.

Note

Medium Risk
Introduces always-on system telemetry and a new supervisord eventlistener in browser images; failures are mostly best-effort (dropped POSTs, optional kmsg) but mis-parsing or shim protocol bugs could affect crash/OOM visibility.

Overview
Adds always-on VM failure telemetry on the existing EventStream → SSE/S2 path: system_oom_kill (kernel OOM via /dev/kmsg) and service_crashed (unexpected supervisord exits / restart give-up).

In-process: lib/sysmon tails kmsg (with SeekEnd on start), parses full OOM dumps (Mem-Info, Tasks state, constraint, trigger vs victim), and publishes system_oom_kill. Wired from cmd/api/main; a failed kmsg open is logged and the API keeps running.

Sidecar: cmd/supervisord-shim is a supervisord eventlistener that maps PROCESS_STATE_EXITED (expected=0) and PROCESS_STATE_FATAL to service_crashed phases (startup / running / gave_up) and sends them to POST /telemetry/events on localhost. It always ACKs supervisord (RESULT 2\nOK without a trailing newline).

Images & schema: Headful/headless Dockerfiles build/install the shim and ship supervisord-shim.conf (larger event buffer, very high startretries). OpenAPI/oapi gain the new event types and payloads; go-kmsg-parser is a new dependency. Unit tests cover the kmsg state machine, shim protocol, and end-to-end publish.

Reviewed by Cursor Bugbot for commit 5cdbd63. Bugbot is set up for automated code reviews on this repo.

Two new telemetry event types surface VM-level failures alongside the existing browser events:

- `system_oom_kill`: emitted by an in-process /dev/kmsg reader in the api server whenever the kernel OOM-killer terminates a process, including unsupervised Chrome renderer subprocesses.
- `service_crashed`: emitted by a tiny supervisord eventlistener binary that POSTs to the local /telemetry/events endpoint whenever a supervised service unexpectedly exits (PROCESS_STATE_EXITED with expected=0, or PROCESS_STATE_FATAL).

Both events flow through the existing EventStream and inherit the SSE and S2 sinks for free. Categorized as `system` so they're always-on.

The shim is shipped in both the chromium-headful and chromium-headless images and registered as `[eventlistener:supervisord-shim]`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

firetiger-agent · 2026-05-27T02:35:31Z

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR title and empty description don't indicate changes to API endpoints or Temporal workflows; please confirm if this modifies kernel API before merging.

To monitor this PR anyway, reply with @firetiger monitor this.

socket-security · 2026-05-27T02:35:39Z

Review the following changes in direct dependencies.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	golang/github.com/euank/go-kmsg-parser/v2@v2.1.0

hiroTamada

lgtm — solid design, the multi-source story (kmsg + supervisord) is well-thought-out and the state-machine tests cover the cases I'd worry about. two notes that aren't blockers:

Naming

images/chromium-{headful,headless}/.../supervisord-shim.conf, server/cmd/supervisord-shim/ — name describes adjacency to supervisord, not purpose. Pretty generic if more supervisord helpers ever ship. Consider crash-event-listener / service-crash-reporter / crash-telemetry-shim.

Questions / minor

server/lib/sysmon/sysmon.go:147-149 — MemFreeKb is suppressed from the payload when zero, but 0 kB free is a plausible (and informative) value at the moment of an OOM. Consider distinguishing "absent" from "zero" with a different sentinel or always emitting when parsed.
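One common Go way to separate "absent" from "zero" is a pointer field with `omitempty`, sketched below. The struct `oomPayload` and the helper `int64Ptr` are hypothetical; only the JSON field names `mem_total_kb` and `mem_free_kb` come from the PR. Whether the generated oapi types allow this shape is not something the thread confirms.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oomPayload is a hypothetical slice of the event payload.
// A nil pointer means "not parsed"; a pointer to 0 means "parsed, and zero".
type oomPayload struct {
	MemTotalKb *int64 `json:"mem_total_kb,omitempty"`
	MemFreeKb  *int64 `json:"mem_free_kb,omitempty"`
}

func int64Ptr(v int64) *int64 { return &v }

// encode marshals the payload, returning an empty string on failure.
func encode(p oomPayload) string {
	b, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Global OOM with Mem-Info parsed and no free memory left.
	fmt.Println(encode(oomPayload{MemTotalKb: int64Ptr(2097152), MemFreeKb: int64Ptr(0)}))
	// memcg OOM: Mem-Info absent, so both fields are omitted.
	fmt.Println(encode(oomPayload{}))
}
```

With `omitempty` on a pointer, only nil is dropped, so a parsed value of 0 kB free still reaches the dashboard.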

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-05-28T02:18:29Z

+	Memcg        BrowserSystemOomKillEventDataConstraint = "memcg"
+	MemoryPolicy BrowserSystemOomKillEventDataConstraint = "memory_policy"
+	None         BrowserSystemOomKillEventDataConstraint = "none"
+)


Exported None constant invites future naming collisions

Low Severity

The new constraint enum constants None, Cpuset, Memcg, MemoryPolicy are exported from the shared oapi package with very generic names. In the same commit, Exited/Running were renamed to ProcessStatusStateExited/ProcessStatusStateRunning specifically to disambiguate — but the OOM constraint constants weren't given equivalent prefixes. None in particular is highly collision-prone if any future OpenAPI enum also includes a "none" value, which would force a breaking rename later.
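For comparison, a prefixed variant of the constants might look like the sketch below. The type name is taken from the diff; the prefixed constant names, the `"cpuset"` value, and the second enum `ExampleOtherMode` are assumptions added only to show that two enums with a `"none"` value can then coexist in one package.

```go
package main

import "fmt"

type BrowserSystemOomKillEventDataConstraint string

// Hypothetical prefixed names, following the ProcessStatusState* pattern
// mentioned in the review comment.
const (
	BrowserSystemOomKillEventDataConstraintCpuset       BrowserSystemOomKillEventDataConstraint = "cpuset"
	BrowserSystemOomKillEventDataConstraintMemcg        BrowserSystemOomKillEventDataConstraint = "memcg"
	BrowserSystemOomKillEventDataConstraintMemoryPolicy BrowserSystemOomKillEventDataConstraint = "memory_policy"
	BrowserSystemOomKillEventDataConstraintNone         BrowserSystemOomKillEventDataConstraint = "none"
)

// ExampleOtherMode is an invented second enum that also has a "none" value.
type ExampleOtherMode string

const ExampleOtherModeNone ExampleOtherMode = "none"

func main() {
	// Both constants exist side by side; unprefixed, both would be named None.
	fmt.Println(BrowserSystemOomKillEventDataConstraintNone, ExampleOtherModeNone)
}
```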

rgarcia · 2026-05-28T17:04:22Z

+// Monitor runs the in-process sysmon goroutine and publishes events
+// directly to the EventStream. System-category events are always
+// captured regardless of any active TelemetrySession config, so we
+// deliberately bypass TelemetrySession here.


we are carving out special "you're gonna get this telemetry no matter what" which I think is fine but also might make the API harder to understand unless we're explicit about it, e.g. in the openapi.yaml description of telemetry config we should probably mention which events are not possible to disable

totally agree! it's actually something I've been wanting to revisit a bit, esp thinking through the cdp / live view stuff. will follow up in slack!

Sayan- and others added 7 commits May 22, 2026 21:45

shim: bump eventlistener buffer_size to 100

4ecdc0d

The default supervisord eventlistener buffer is 10. When several supervised services flap in close succession (which is exactly what happens during a real failure cascade) supervisord drops events before the shim has a chance to drain them.

oapi spec

e0d008b

generated

a44813d

go mod

ce8977d

vibe code

aebf939

shim conf

ec172f9

cursor Bot reviewed May 27, 2026

Comment thread server/cmd/supervisord-shim/main.go

Comment thread server/cmd/supervisord-shim/main.go Outdated

hiroTamada self-requested a review May 27, 2026 12:49

Sayan- added 2 commits May 27, 2026 09:05

vibe code

c7d70f7

self review

faa2e9c

cursor Bot reviewed May 27, 2026

Comment thread server/lib/sysmon/kmsg.go

Comment thread images/chromium-headful/supervisor/services/supervisord-shim.conf

Sayan- added 3 commits May 27, 2026 11:44

testing finds

da5fe8d

vibe code

b7ff02a

review and clean up

3c7d3c1

Sayan- changed the title ~~[wip] oom and process telemetry~~ OOM and process telemetry May 28, 2026

fix up

5cdbd63

Sayan- requested a review from rgarcia May 28, 2026 02:12

hiroTamada approved these changes May 28, 2026

cursor Bot reviewed May 28, 2026

rgarcia approved these changes May 28, 2026

Sayan- merged commit 0dc1a65 into main May 28, 2026
12 of 13 checks passed

Sayan- deleted the sayan/kernel-1316-ooms-process-crash-telemetry branch May 28, 2026 18:21

Sayan- merged 13 commits into main from sayan/kernel-1316-ooms-process-crash-telemetry
