fix(server): tighten startChromiumAndWait readiness gate#257
Merged
Conversation
… duration
startChromiumAndWait could return "devtools ready" within ~135 ms of a
supervisorctl-start ack, while the new Chromium hadn't yet bound its
DevTools listener. Two contributing failure modes:
1. The tryReady "allowCurrent" shortcut dialed the *previous*
Chromium's UpstreamManager URL the instant supervisorctl start
returned. That URL points at the just-killed Chromium, and the
bare websocket.Dial could complete against a half-closed socket
of the dying process (or against a freshly bound but not-yet-
routed listener of the new one), producing a false "ready".
2. tryReady never made a real CDP round-trip after dial, so it
couldn't tell the difference between a live Chromium and a stale
listener.
End-to-end this caused POST /configure to return 200 several seconds
before Chromium could actually serve requests, leaving live view blank
during the gap (measured ~5 s on production sfo, ~8.5 s reproducing
locally on Apple Silicon under Rosetta).
Fix:
* lib/cdpclient: add Client.GetBrowserVersion which sends a single
Browser.getVersion CDP message. A successful round-trip proves the
WebSocket is wired to a live, CDP-responsive Chromium.
* cmd/api/api/chromium.go: in startChromiumAndWait, drop the
allowCurrent parameter; tryReady now always requires the upstream
URL to differ from prevUpstream AND a successful
Browser.getVersion. The supervisorctl-start doneCh signal is used
only to short-circuit on start failure, not to attempt readiness.
Also adds an `outcome` + `elapsed` + `elapsed_ms` info log when
stopChromium completes successfully. This is the first numeric
duration field on the stop path, enabling SigNoz p99(elapsed_ms)
aggregations without parsing the existing Go duration strings.
Tests:
* TestGetBrowserVersion covers happy path and CDP-error propagation.
* go vet ./... and go test -race ./... pass (non-e2e).
Reproducer benchmark (separate commit) measures the api-vs-real gap
for POST /configure with viewport + --kiosk on kernel-images-headless
docker images:
before: GAP = 8488 ms median /configure returned 200 ~8.5 s
before Browser.getVersion worked
after: GAP = 45 ms median gap collapses to probe-interval
noise; chromium really is ready
Co-authored-by: Cursor <cursoragent@cursor.com>
Reproducer for the readiness false-positive fixed in the previous
commit. Boots a kernel-images-headless container, waits for the
initial chromium to be CDP-responsive, issues POST /configure with
viewport (1280x800@60) + chromium_flags ["--kiosk"], then concurrently
times:
- configureMs: POST /configure return-to-200
- newUUIDMs: when /json/version starts reporting a NEW
webSocketDebuggerUrl (proves new chromium emitted
its `DevTools listening` line)
- cdpReadyMs: when Browser.getVersion CDP round-trip succeeds
against that new URL
GAP = cdpReadyMs − configureMs is the user-visible bug window: how
long after the api claims success the new chromium is actually
serving requests.
Usage:
docker build kernel-images-headless:before # upstream main
docker build kernel-images-headless:after # with this fix
bun bench/bench-configure-readiness.ts \\
--image kernel-images-headless:before --label before
bun bench/bench-configure-readiness.ts \\
--image kernel-images-headless:after --label after
Local results on Apple Silicon (linux/amd64 under Rosetta), n=3:
before GAP min=8488ms median=8516ms max=8829ms
after GAP min= 42ms median= 45ms max= 47ms
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: Changes are to server/cmd/api/api/chromium.go and server/lib/cdpclient, which are not in the monitored API endpoint paths (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal); this appears to be a different codebase structure. To monitor this PR anyway, reply with |
rgarcia
approved these changes
May 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Checklist
Summary
startChromiumAndWaitcould return"devtools ready"within ~135 ms ofsupervisorctl startack'ing — long before the new Chromium had actually bound its DevTools listener. Two issues combined to cause this:tryReady'sallowCurrent=trueshortcut dialed the previous Chromium's URL the instantsupervisorctl startreturned. That URL points at the just-killed Chromium, and a barewebsocket.Dialcould complete against a half-closed socket of the dying process (or against a freshly bound but not-yet-routed listener of the new one), producing a false-positive ready.tryReadynever made a real CDP round-trip after dial, so it couldn't distinguish a live Chromium from a stale listener.End-to-end this caused
POST /configure(andPATCH /chromium/*/PATCH /display) to return200several seconds before Chromium could actually serve requests. We measured this both in production (~5 s gap) and via a controlled local benchmark (~8.5 s gap on Apple Silicon under Rosetta).What changes
server/lib/cdpclientClient.GetBrowserVersion(ctx)that sends a singleBrowser.getVersionCDP message. A successful round-trip proves the WebSocket is wired to a live, CDP-responsive Chromium.server/cmd/api/api/chromium.gostartChromiumAndWait:allowCurrentparameter fromtryReady. The readiness check is now: upstream URL must differ fromprevUpstreamAND aBrowser.getVersionround-trip must succeed.doneCh(supervisorctl start command return) is used only to short-circuit on start failure — it no longer attempts readiness against the current upstream.stopChromium:"chromium stopped"info log on the success paths withoutcome,elapsed, andelapsed_msattributes. First numeric duration field on the stop path; enablesp99(elapsed_ms)aggregations in SigNoz without parsing the existing Go duration strings.Test plan
go vet ./...✅go test -race $(go list ./... | grep -v /e2e$)✅ — all packages passBenchmark
Two
kernel-images-headlessimages built locally (linux/amd64 under Rosetta on Apple Silicon):kernel-images-headless:before— upstreammainkernel-images-headless:after— this PRFor each, the container is started, the initial Chromium is verified CDP-responsive, then
POST /configureis issued withdisplay {1280x800@60}+chromium_flags ["--kiosk"](which triggers the stop/start path). The benchmark times:configureMs— when/configurereturns 200newUUIDMs— when/json/versionreports a newwebSocketDebuggerUrl(proves new chromium emitted itsDevTools listeningline)cdpReadyMs— whenBrowser.getVersionsucceeds against that new UUIDGAP = cdpReadyMs − configureMs— the user-visible bug magnituden=3 per image:
/configurereturns 200Browser.getVersionokTotal time-to-real-readiness is identical (~12 s — Chromium itself takes that long to fully restart with kiosk + viewport under Rosetta). The fix doesn't make Chromium faster; it stops the api from lying about when Chromium is ready. The remaining 45 ms in
afteris just the 25 ms probe interval + WebSocket round-trip — measurement noise.Benchmark script lives under
bench/bench-configure-readiness.ts. Re-run with:Not addressed (suggested follow-ups)
stopsignal=KILL stopwaitsecs=0inimages/chromium-{headful,headless}/.../supervisor/services/chromium.confcauses SIGKILL spam and skips Chromium's graceful shutdown path. Recommendstopsignal=TERM stopwaitsecs=5 killasgroup=true stopasgroup=trueplus aSetpgid: trueon the chromium exec inchromium-launchersokillasgrouphas a group to target. Worth a separate PR because it changes shutdown semantics (graceful exit can let us drop theSingletonLockcleanup hack inchromium-launcher/main.go).elapsed_msnumeric attribute onstopChromiumcould be retrofitted onto the other sixelapsed=time.Since(start).String()log lines in this file for consistent aggregation in SigNoz.Made with Cursor
Note
Medium Risk
Changes core restart readiness for configure and related Chromium APIs; responses may take longer but should be more accurate; mis-tuned timeouts could increase 500s on slow restarts.
Overview
Fixes false-positive DevTools readiness after Chromium restarts so configure/flags/policy/extension endpoints no longer return success while the browser is still unusable.
startChromiumAndWaitnow treats Chromium as ready only when the upstream DevTools URL differs from the pre-restart URL and aBrowser.getVersionCDP round-trip succeeds on that URL. It removes theallowCurrentshortcut and no longer declares ready whensupervisorctl startreturns (which only means the process was forked).stopChromiumgains structuredchromium stoppedlogs withoutcome,elapsed, andelapsed_msfor observability.cdpclientaddsGetBrowserVersion(plus unit tests). A newbench/bench-configure-readiness.tsscript measures the gap betweenPOST /configurereturning 200 and real CDP readiness for before/after images.Reviewed by Cursor Bugbot for commit df450e2. Bugbot is set up for automated code reviews on this repo. Configure here.