OOM and process telemetry #254
Two new telemetry event types surface VM-level failures alongside the existing browser events:

- `system_oom_kill`: emitted by an in-process /dev/kmsg reader in the api server whenever the kernel OOM-killer terminates a process, including unsupervised Chrome renderer subprocesses.
- `service_crashed`: emitted by a tiny supervisord eventlistener binary that POSTs to the local /telemetry/events endpoint whenever a supervised service unexpectedly exits (PROCESS_STATE_EXITED with expected=0, or PROCESS_STATE_FATAL).

Both events flow through the existing EventStream and inherit the SSE and S2 sinks for free. Categorized as `system` so they're always-on. The shim is shipped in both the chromium-headful and chromium-headless images and registered as `[eventlistener:supervisord-shim]`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The default supervisord eventlistener buffer is 10. When several supervised services flap in close succession (which is exactly what happens during a real failure cascade) supervisord drops events before the shim has a chance to drain them.
hiroTamada
left a comment
lgtm — solid design, the multi-source story (kmsg + supervisord) is well-thought-out and the state-machine tests cover the cases I'd worry about. two notes that aren't blockers:
Naming
`images/chromium-{headful,headless}/.../supervisord-shim.conf`, `server/cmd/supervisord-shim/`: name describes adjacency to supervisord, not purpose. Pretty generic if more supervisord helpers ever ship. Consider `crash-event-listener` / `service-crash-reporter` / `crash-telemetry-shim`.
Questions / minor
`server/lib/sysmon/sysmon.go:147-149`: `MemFreeKb` is suppressed from the payload when zero, but `0 kB free` is a plausible (and informative) value at the moment of an OOM. Consider distinguishing "absent" from "zero" with a different sentinel or always emitting when parsed.
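One common way to separate "absent" from "zero" in Go is a pointer field combined with `omitempty`. The sketch below only illustrates that idea: the `oomPayload` type and the `mem_free_kb` JSON tag are assumptions made for the example, not the PR's actual struct.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oomPayload is a hypothetical, trimmed-down payload used only for this
// illustration; the real struct in the PR may look different.
type oomPayload struct {
	// nil means "the dump carried no Mem-Info value"; a pointer to 0 means
	// "parsed, and the kernel reported 0 kB free".
	MemFreeKb *int64 `json:"mem_free_kb,omitempty"`
}

// encode marshals the payload and returns the JSON text.
func encode(p oomPayload) string {
	b, err := json.Marshal(p)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := int64(0)
	fmt.Println(encode(oomPayload{}))                 // field omitted: value was never parsed
	fmt.Println(encode(oomPayload{MemFreeKb: &zero})) // field present with an explicit 0
}
```

With a plain `int64` and `omitempty`, both cases would serialize identically, which is the ambiguity raised above.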
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5cdbd63.
    Memcg BrowserSystemOomKillEventDataConstraint = "memcg"
    MemoryPolicy BrowserSystemOomKillEventDataConstraint = "memory_policy"
    None BrowserSystemOomKillEventDataConstraint = "none"
    )
Exported None constant invites future naming collisions
Low Severity
The new constraint enum constants None, Cpuset, Memcg, MemoryPolicy are exported from the shared oapi package with very generic names. In the same commit, Exited/Running were renamed to ProcessStatusStateExited/ProcessStatusStateRunning specifically to disambiguate — but the OOM constraint constants weren't given equivalent prefixes. None in particular is highly collision-prone if any future OpenAPI enum also includes a "none" value, which would force a breaking rename later.
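The collision concern can be made concrete with a small sketch. The type and constant names below (`OomConstraint`, `ProxyMode`, and their prefixed constants) are hypothetical stand-ins, not names from the generated `oapi` package; they only show how a type-name prefix lets two enums share a `"none"` wire value within one Go package.

```go
package main

import "fmt"

// Hypothetical enum types standing in for generated OpenAPI types.
type OomConstraint string
type ProxyMode string

// Prefixed constants: each name carries its enum's type name.
const (
	OomConstraintNone         OomConstraint = "none"
	OomConstraintCpuset       OomConstraint = "cpuset"
	OomConstraintMemcg        OomConstraint = "memcg"
	OomConstraintMemoryPolicy OomConstraint = "memory_policy"
)

// A second enum with its own "none" value. If both enums exported a bare
// `None`, the package would fail to compile with a redeclaration error.
const (
	ProxyModeNone ProxyMode = "none"
)

func main() {
	fmt.Println(OomConstraintNone, ProxyModeNone)
}
```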
    // Monitor runs the in-process sysmon goroutine and publishes events
    // directly to the EventStream. System-category events are always
    // captured regardless of any active TelemetrySession config, so we
    // deliberately bypass TelemetrySession here.
we are carving out special "you're gonna get this telemetry no matter what" which I think is fine but also might make the API harder to understand unless we're explicit about it, e.g. in the openapi.yaml description of telemetry config we should probably mention which events are not possible to disable
totally agree! it's actually something I've been wanting to revisit a bit, esp thinking through the cdp / live view stuff. will follow up in slack!


Why
Customers running browsers on Kernel currently get no signal when something inside the VM dies. Chrome renderer gets OOM-killed by the kernel? Silent. Mutter segfaults? Silent. The session just degrades and the customer has to guess. This PR adds two always-on event types that flow through the existing `EventStream` → SSE/S2 pipeline so they show up on the dashboard like any other browser event.

Two sources, no overlap
Different failures only surface on different channels, so we need both:
`kernel-images-api` panic
No de-dup if both fire for the same kill; that overlap itself is a signal (RAM exhaustion vs. process bug).
Design choices
- Public schema is decoupled from impl. `phase: startup|running|gave_up` instead of supervisord's `RUNNING|STARTING|BACKOFF`; `Source.Event: "service.crashed"` instead of `"supervisord.process_exited"`. If we ever swap supervisord for systemd-style, the contract holds.
- OOM payload answers the customer's actual question. "Chrome died at 2GB RSS" tells you nothing; "chrome held 4.8GB on a 2GB system with 17MB free, top tasks were chrome/mutter/sshd/systemd" tells you it was a chrome memory leak, not an infra problem. We parse the kernel's full Mem-Info + Tasks-state dump (atomically present in kmsg before the kill line) for total/free memory and the top-5 processes by RSS. No race, no separate `/proc/meminfo` read.
- Custom state machine on `euank/go-kmsg-parser` rather than `cadvisor/oomparser`. cadvisor's parser is solid but doesn't extract `rss_kb`, hard-codes `/dev/kmsg` open-from-start (which would replay the ring buffer on every API restart), and drags in `klog`. Building on the line-level parser directly lets us `SeekEnd()` on open, extract the full payload, and stay in our logger.
- Shim is a separate binary talking to the API over HTTP. Supervisord's eventlistener model requires a separate process; the API already exposes `/telemetry/events`; it runs on a localhost trusted boundary, so no auth is needed. Adding a unix socket would gain nothing.
- `startretries=999999` on the shim. Supervisord doesn't emit events about its own eventlisteners, so if the shim ever entered FATAL state we'd lose all `service_crashed` telemetry silently. The shim is tiny and side-effect-free; effectively infinite restarts is the right safety bias.

Known limitations (intentional)
- If `kernel-images-api` dies, the shim's POST fails and the event is lost. Unikraft Cloud's VM-level monitoring catches it at the platform layer. Buffering inside the shim would close the gap but isn't worth the scope here.
- `process_name` truncates at 15 chars: fundamental kernel `TASK_COMM_LEN` limit. Documented.

Test plan
- `RESULT 2\nOK` (no trailing newline; previous version deadlocked the listener after one event), phase mapping for RUNNING/STARTING/BACKOFF, unknown-state skip
- `go vet`, `go test -race` (full suite minus e2e), cross-platform build

end to end tests
| Trigger | Observed |
| --- | --- |
| `docker exec ... kill -KILL $(pgrep -f /opt/chrome-for-testing/chromium)` | `service_crashed` with `phase=running` |
| (`sleep 0.1; exit 1`, `startretries=3`) and `supervisorctl start` it | `service_crashed` with `phase=gave_up`, no `pid` |
| `docker exec ... supervisorctl stop chromium` | |
| `echo "<6>$line" > /dev/kmsg` (opener, `CPU/PID/Comm`, Mem-Info, Tasks-state, `oom-kill:constraint=...`, `Killed process`) | `system_oom_kill` with `constraint=none`, `mem_total_kb`, top-1 task `chromium`, `trigger_process_name=chromium` |
| `docker run --memory 512m ...` then run a python memory-hog inside | `system_oom_kill` with `constraint=memcg`; `mem_total_kb`/`mem_free_kb` omitted (memcg dumps skip global Mem-Info); `top_tasks` names are single tokens |
| `echo 1 > /proc/sys/kernel/sysrq; echo f > /proc/sysrq-trigger` | `system_oom_kill` visible in `/var/log/supervisord/kernel-images-api` |

Full reproduction steps for each row live in `server/lib/sysmon/README.md`.

Note
Medium Risk
Introduces always-on system telemetry and a new supervisord eventlistener in browser images; failures are mostly best-effort (dropped POSTs, optional kmsg) but mis-parsing or shim protocol bugs could affect crash/OOM visibility.
Overview
Adds always-on VM failure telemetry on the existing `EventStream` → SSE/S2 path: `system_oom_kill` (kernel OOM via `/dev/kmsg`) and `service_crashed` (unexpected supervisord exits / restart give-up).

In-process: `lib/sysmon` tails kmsg (with `SeekEnd` on start), parses full OOM dumps (Mem-Info, Tasks state, constraint, trigger vs victim), and publishes `system_oom_kill`. Wired from `cmd/api/main`; a failed kmsg open is logged and the API keeps running.

Sidecar: `cmd/supervisord-shim` is a supervisord eventlistener that maps `PROCESS_STATE_EXITED` (`expected=0`) and `PROCESS_STATE_FATAL` to `service_crashed` phases (`startup`/`running`/`gave_up`) and POSTs to `POST /telemetry/events` on localhost. It always ACKs supervisord (`RESULT 2\nOK` without a trailing newline).

Images & schema: Headful/headless Dockerfiles build/install the shim and ship `supervisord-shim.conf` (larger event buffer, very high `startretries`). OpenAPI/oapi gain the new event types and payloads; `go-kmsg-parser` is a new dependency. Unit tests cover the kmsg state machine, shim protocol, and end-to-end publish.