
OOM and process telemetry #254

Merged
Sayan- merged 13 commits into main from sayan/kernel-1316-ooms-process-crash-telemetry
May 28, 2026

@Sayan- Sayan- commented May 27, 2026

Why

Customers running browsers on Kernel currently get no signal when something inside the VM dies. Chrome renderer gets OOM-killed by the kernel? Silent. Mutter segfaults? Silent. The session just degrades and the customer has to guess. This PR adds two always-on event types that flow through the existing EventStream → SSE/S2 pipeline so they show up on the dashboard like any other browser event.

Two sources, no overlap

Different failures only surface on different channels, so we need both:

| Failure | kmsg | supervisord |
| --- | --- | --- |
| Chrome renderer subprocess OOM | | ✗ (not supervised) |
| Chromium parent OOM | | |
| Chromium segfault / V8 abort | | |
| kernel-images-api panic | | ✓ (on respawn) |

No de-dup if both fire for the same kill — that overlap itself is a signal (RAM exhaustion vs. process bug).

Design choices

Public schema is decoupled from impl. phase: startup|running|gave_up instead of supervisord's RUNNING|STARTING|BACKOFF; Source.Event: "service.crashed" instead of "supervisord.process_exited". If we ever swap supervisord for systemd-style, the contract holds.
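A minimal sketch of what that decoupling could look like in the shim. The `Phase` type, the constant names, and `phaseFromSupervisord` are hypothetical, and the exact pairing of supervisord states to public phases is an assumption inferred from the ordering above, not taken from the PR's code.

```go
package main

import "fmt"

// Phase is the public, implementation-neutral lifecycle phase.
// The type and constant names are illustrative, not the PR's identifiers.
type Phase string

const (
	PhaseStartup Phase = "startup"
	PhaseRunning Phase = "running"
	PhaseGaveUp  Phase = "gave_up"
)

// phaseFromSupervisord translates a supervisord state name into the public
// phase. The pairing below is an assumption. Unknown states report false so
// the caller can skip the event instead of leaking a supervisord-specific
// value into the public schema.
func phaseFromSupervisord(state string) (Phase, bool) {
	switch state {
	case "STARTING":
		return PhaseStartup, true
	case "RUNNING":
		return PhaseRunning, true
	case "BACKOFF":
		return PhaseGaveUp, true
	default:
		return "", false
	}
}

func main() {
	for _, s := range []string{"STARTING", "RUNNING", "BACKOFF", "STOPPING"} {
		p, ok := phaseFromSupervisord(s)
		fmt.Printf("%-9s -> %q (recognised=%t)\n", s, p, ok)
	}
}
```

Keeping the translation in one function means a future supervisor swap would only need a new mapping, while consumers keep seeing the same three phase values.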

OOM payload answers the customer's actual question. "Chrome died at 2GB RSS" tells you nothing; "chrome held 4.8GB on a 2GB system with 17MB free, top tasks were chrome/mutter/sshd/systemd" tells you it was a chrome memory leak, not an infra problem. We parse the kernel's full Mem-Info + Tasks-state dump (atomically present in kmsg before the kill line) for total/free memory and the top-5 processes by RSS. No race, no separate /proc/meminfo read.
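As a rough illustration of the top-N extraction, the sketch below parses Tasks-state rows and ranks them by RSS. It assumes the 5.x column layout (`[pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name`) and 4 KiB pages; the `task` type, `parseTaskRow`, and `topByRSS` are hypothetical names and not the PR's state machine.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// pageSizeKB assumes 4 KiB pages (x86_64); 16K/64K ARM kernels would differ.
const pageSizeKB = 4

// task is a hypothetical, trimmed-down view of one Tasks-state row.
type task struct {
	PID   int
	Name  string
	RSSKB int64
}

// parseTaskRow parses one row of the assumed 5.x layout:
//   [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
// It returns false for the header row and for anything it cannot parse.
func parseTaskRow(line string) (task, bool) {
	open := strings.Index(line, "[")
	end := strings.Index(line, "]")
	if open < 0 || end < open {
		return task{}, false
	}
	pid, err := strconv.Atoi(strings.TrimSpace(line[open+1 : end]))
	if err != nil {
		return task{}, false // header row: "pid" is not a number
	}
	f := strings.Fields(line[end+1:])
	if len(f) < 8 {
		return task{}, false
	}
	rssPages, err := strconv.ParseInt(f[3], 10, 64)
	if err != nil {
		return task{}, false
	}
	return task{PID: pid, Name: f[7], RSSKB: rssPages * pageSizeKB}, true
}

// topByRSS returns at most n tasks, largest RSS first, without mutating the input.
func topByRSS(tasks []task, n int) []task {
	sorted := append([]task(nil), tasks...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].RSSKB > sorted[j].RSSKB })
	if len(sorted) > n {
		sorted = sorted[:n]
	}
	return sorted
}

func main() {
	dump := []string{
		"[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name",
		"[    101]     0   101     4096      512    65536        0             0 sshd",
		"[    202]     0   202  2500000  1200000  9830400        0             0 chrome",
		"[    303]     0   303   300000    40000   524288        0             0 mutter",
	}
	var tasks []task
	for _, line := range dump {
		if t, ok := parseTaskRow(line); ok {
			tasks = append(tasks, t)
		}
	}
	for _, t := range topByRSS(tasks, 2) {
		fmt.Printf("%s pid=%d rss_kb=%d\n", t.Name, t.PID, t.RSSKB)
	}
}
```

Because the rows come from the same dump as the kill line, the ranking reflects memory at the moment of the kill rather than whatever `/proc` reports afterwards.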

Custom state machine on euank/go-kmsg-parser rather than cadvisor/oomparser. cadvisor's parser is solid but doesn't extract rss_kb, hard-codes /dev/kmsg open-from-start (which would replay the ring buffer on every API restart), and drags klog. Building on the line-level parser directly lets us SeekEnd() on open, extract the full payload, and stay in our logger.

Shim is a separate binary talking to the API over HTTP. Supervisord's eventlistener model requires a separate process, the API already exposes /telemetry/events, and the traffic stays on a trusted localhost boundary, so no auth is needed. Adding a unix socket would gain nothing.

startretries=999999 on the shim. Supervisord doesn't emit events about its own eventlisteners, so if the shim ever entered FATAL state we'd lose all service_crashed telemetry silently. The shim is tiny and side-effect-free; effectively infinite restarts is the right safety bias.

Known limitations (intentional)

  • API-self-crash isn't captured — when kernel-images-api dies, the shim's POST fails and the event is lost. Unikraft Cloud's VM-level monitoring catches it at the platform layer. Buffering inside the shim would close the gap but isn't worth the scope here.
  • process_name truncates at 15 chars — fundamental kernel TASK_COMM_LEN limit. Documented.
  • Page size hard-coded to 4 KiB — correct on x86_64 Unikraft; would be wrong on ARM 16K/64K. Documented.

Test plan

  • State machine unit tests: canonical 5.x dump, legacy 4.x (no Mem-Info), top-N capping, sequential kills, abandoned section, watchdog discriminates noise vs recognised lines
  • Round-trip: stub kmsg source → Monitor → EventStream → JSON
  • Shim: byte-level regression test for RESULT 2\nOK (no trailing newline; previous version deadlocked the listener after one event), phase mapping for RUNNING/STARTING/BACKOFF, unknown-state skip
  • go vet, go test -race (full suite minus e2e), cross-platform build

End-to-end tests

| Scenario | Mechanism | Expected event |
| --- | --- | --- |
| Service crash mid-running | `docker exec ... kill -KILL $(pgrep -f /opt/chrome-for-testing/chromium)` | `service_crashed` with `phase=running` |
| Service exhausts restart retries | install a flaky supervisord program (`sleep 0.1; exit 1`, `startretries=3`) and `supervisorctl start` it | `service_crashed` with `phase=gave_up`, no `pid` |
| Clean stop suppressed (negative) | `docker exec ... supervisorctl stop chromium` | no event (only SSE keepalive) |
| Synthetic OOM dump | inject canonical kmsg lines via `echo "<6>$line" > /dev/kmsg` (opener, CPU/PID/Comm, Mem-Info, Tasks-state, `oom-kill:constraint=...`, Killed process) | one `system_oom_kill` with `constraint=none`, `mem_total_kb`, top-1 task `chromium`, `trigger_process_name=chromium` |
| Real cgroup OOM | `docker run --memory 512m ...` then run a python memory-hog inside | `system_oom_kill` with `constraint=memcg`; `mem_total_kb`/`mem_free_kb` omitted (memcg dumps skip global Mem-Info); `top_tasks` names are single tokens |
| Real Linux 6.x VM | `echo 1 > /proc/sys/kernel/sysrq; echo f > /proc/sysrq-trigger` | `system_oom_kill` visible in `/var/log/supervisord/kernel-images-api` |

Full reproduction steps for each row live in server/lib/sysmon/README.md.


Note

Medium Risk
Introduces always-on system telemetry and a new supervisord eventlistener in browser images; failures are mostly best-effort (dropped POSTs, optional kmsg) but mis-parsing or shim protocol bugs could affect crash/OOM visibility.

Overview
Adds always-on VM failure telemetry on the existing EventStream → SSE/S2 path: system_oom_kill (kernel OOM via /dev/kmsg) and service_crashed (unexpected supervisord exits / restart give-up).

In-process: lib/sysmon tails kmsg (with SeekEnd on start), parses full OOM dumps (Mem-Info, Tasks state, constraint, trigger vs victim), and publishes system_oom_kill. Wired from cmd/api/main; a failed kmsg open is logged and the API keeps running.

Sidecar: cmd/supervisord-shim is a supervisord eventlistener that maps PROCESS_STATE_EXITED (expected=0) and PROCESS_STATE_FATAL to service_crashed phases (startup / running / gave_up) and POSTs to POST /telemetry/events on localhost. It always ACKs supervisord (RESULT 2\nOK without a trailing newline).

Images & schema: Headful/headless Dockerfiles build/install the shim and ship supervisord-shim.conf (larger event buffer, very high startretries). OpenAPI/oapi gain the new event types and payloads; go-kmsg-parser is a new dependency. Unit tests cover the kmsg state machine, shim protocol, and end-to-end publish.

Reviewed by Cursor Bugbot for commit 5cdbd63. Bugbot is set up for automated code reviews on this repo. Configure here.

Sayan- and others added 7 commits May 22, 2026 21:45
Two new telemetry event types surface VM-level failures alongside the
existing browser events:

- `system_oom_kill` — emitted by an in-process /dev/kmsg reader in the
  api server whenever the kernel OOM-killer terminates a process,
  including unsupervised Chrome renderer subprocesses.
- `service_crashed` — emitted by a tiny supervisord eventlistener
  binary that POSTs to the local /telemetry/events endpoint whenever
  a supervised service unexpectedly exits (PROCESS_STATE_EXITED with
  expected=0, or PROCESS_STATE_FATAL).

Both events flow through the existing EventStream and inherit the SSE
and S2 sinks for free. Categorized as `system` so they're always-on.

The shim is shipped in both the chromium-headful and chromium-headless
images and registered as `[eventlistener:supervisord-shim]`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The default supervisord eventlistener buffer is 10. When several
supervised services flap in close succession (which is exactly what
happens during a real failure cascade) supervisord drops events before
the shim has a chance to drain them.
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR title and empty description don't indicate changes to API endpoints or Temporal workflows; please confirm if this modifies kernel API before merging.

To monitor this PR anyway, reply with @firetiger monitor this.

@socket-security

socket-security Bot commented May 27, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | golang/github.com/euank/go-kmsg-parser/v2@v2.1.0 | 100 | 100 | 100 | 100 | 100 |


Comment thread server/cmd/supervisord-shim/main.go
Comment thread server/cmd/supervisord-shim/main.go Outdated
@hiroTamada hiroTamada self-requested a review May 27, 2026 12:49
Comment thread server/lib/sysmon/kmsg.go
Comment thread images/chromium-headful/supervisor/services/supervisord-shim.conf
@Sayan- Sayan- changed the title from "[wip] oom and process telemetry" to "OOM and process telemetry" May 28, 2026
@Sayan- Sayan- requested a review from rgarcia May 28, 2026 02:12

@hiroTamada hiroTamada left a comment


lgtm — solid design, the multi-source story (kmsg + supervisord) is well-thought-out and the state-machine tests cover the cases I'd worry about. two notes that aren't blockers:

Naming

  • images/chromium-{headful,headless}/.../supervisord-shim.conf, server/cmd/supervisord-shim/ — name describes adjacency to supervisord, not purpose. Pretty generic if more supervisord helpers ever ship. Consider crash-event-listener / service-crash-reporter / crash-telemetry-shim.

Questions / minor

  • server/lib/sysmon/sysmon.go:147-149: MemFreeKb is suppressed from the payload when zero, but 0 kB free is a plausible (and informative) value at the moment of an OOM. Consider distinguishing "absent" from "zero" with a different sentinel or always emitting when parsed.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread server/lib/oapi/oapi.go
Memcg BrowserSystemOomKillEventDataConstraint = "memcg"
MemoryPolicy BrowserSystemOomKillEventDataConstraint = "memory_policy"
None BrowserSystemOomKillEventDataConstraint = "none"
)

Exported None constant invites future naming collisions

Low Severity

The new constraint enum constants None, Cpuset, Memcg, MemoryPolicy are exported from the shared oapi package with very generic names. In the same commit, Exited/Running were renamed to ProcessStatusStateExited/ProcessStatusStateRunning specifically to disambiguate — but the OOM constraint constants weren't given equivalent prefixes. None in particular is highly collision-prone if any future OpenAPI enum also includes a "none" value, which would force a breaking rename later.


// Monitor runs the in-process sysmon goroutine and publishes events
// directly to the EventStream. System-category events are always
// captured regardless of any active TelemetrySession config, so we
// deliberately bypass TelemetrySession here.
Contributor


we are carving out special "you're gonna get this telemetry no matter what" which I think is fine but also might make the API harder to understand unless we're explicit about it, e.g. in the openapi.yaml description of telemetry config we should probably mention which events are not possible to disable

Contributor Author


totally agree! it's actually something I've been wanting to revisit a bit, esp thinking through the cdp / live view stuff. will follow up in slack!

@Sayan- Sayan- merged commit 0dc1a65 into main May 28, 2026
12 of 13 checks passed
@Sayan- Sayan- deleted the sayan/kernel-1316-ooms-process-crash-telemetry branch May 28, 2026 18:21
3 participants