Skip to content

fix(server): tighten startChromiumAndWait readiness gate#257

Merged
hiroTamada merged 2 commits into
mainfrom
fix/chromium-readiness-false-positive
May 28, 2026
Merged

fix(server): tighten startChromiumAndWait readiness gate#257
hiroTamada merged 2 commits into
mainfrom
fix/chromium-readiness-false-positive

Conversation

@hiroTamada
Copy link
Copy Markdown
Contributor

@hiroTamada hiroTamada commented May 28, 2026

Checklist

  • A link to a related issue in our repository.
  • A description of the changes proposed in the pull request.
  • @mentions of the person or team responsible for reviewing proposed changes.

Summary

startChromiumAndWait could return "devtools ready" within ~135 ms of supervisorctl start ack'ing — long before the new Chromium had actually bound its DevTools listener. Two issues combined to cause this:

  1. tryReady's allowCurrent=true shortcut dialed the previous Chromium's URL the instant supervisorctl start returned. That URL points at the just-killed Chromium, and a bare websocket.Dial could complete against a half-closed socket of the dying process (or against a freshly bound but not-yet-routed listener of the new one), producing a false-positive ready.
  2. tryReady never made a real CDP round-trip after dial, so it couldn't distinguish a live Chromium from a stale listener.

End-to-end this caused POST /configure (and PATCH /chromium/* / PATCH /display) to return 200 several seconds before Chromium could actually serve requests. We measured this both in production (~5 s gap) and via a controlled local benchmark (~8.5 s gap on Apple Silicon under Rosetta).

What changes

server/lib/cdpclient

  • New Client.GetBrowserVersion(ctx) that sends a single Browser.getVersion CDP message. A successful round-trip proves the WebSocket is wired to a live, CDP-responsive Chromium.
  • Test coverage for both happy path and CDP-error-from-chromium.

server/cmd/api/api/chromium.go

  • startChromiumAndWait:
    • Drop the allowCurrent parameter from tryReady. The readiness check is now: upstream URL must differ from prevUpstream AND a Browser.getVersion round-trip must succeed.
    • doneCh (supervisorctl start command return) is used only to short-circuit on start failure — it no longer attempts readiness against the current upstream.
  • stopChromium:
    • Add an "chromium stopped" info log on the success paths with outcome, elapsed, and elapsed_ms attributes. First numeric duration field on the stop path; enables p99(elapsed_ms) aggregations in SigNoz without parsing the existing Go duration strings.

Test plan

  • go vet ./...
  • go test -race $(go list ./... | grep -v /e2e$) ✅ — all packages pass

Benchmark

Two kernel-images-headless images built locally (linux/amd64 under Rosetta on Apple Silicon):

  • kernel-images-headless:before — upstream main
  • kernel-images-headless:after — this PR

For each, the container is started, the initial Chromium is verified CDP-responsive, then POST /configure is issued with display {1280x800@60} + chromium_flags ["--kiosk"] (which triggers the stop/start path). The benchmark times:

  • configureMs — when /configure returns 200
  • newUUIDMs — when /json/version reports a new webSocketDebuggerUrl (proves new chromium emitted its DevTools listening line)
  • cdpReadyMs — when Browser.getVersion succeeds against that new UUID
  • GAP = cdpReadyMs − configureMs — the user-visible bug magnitude

n=3 per image:

before (median) after (median)
/configure returns 200 3,586 ms 12,046 ms
New UUID observed 12,006 ms 12,009 ms
Browser.getVersion ok 12,074 ms 12,093 ms
GAP 8,516 ms 45 ms

Total time-to-real-readiness is identical (~12 s — Chromium itself takes that long to fully restart with kiosk + viewport under Rosetta). The fix doesn't make Chromium faster; it stops the api from lying about when Chromium is ready. The remaining 45 ms in after is just the 25 ms probe interval + WebSocket round-trip — measurement noise.

Benchmark script lives under bench/bench-configure-readiness.ts. Re-run with:

bun bench/bench-configure-readiness.ts --image kernel-images-headless:before --label before
bun bench/bench-configure-readiness.ts --image kernel-images-headless:after  --label after

Not addressed (suggested follow-ups)

  • stopsignal=KILL stopwaitsecs=0 in images/chromium-{headful,headless}/.../supervisor/services/chromium.conf causes SIGKILL spam and skips Chromium's graceful shutdown path. Recommend stopsignal=TERM stopwaitsecs=5 killasgroup=true stopasgroup=true plus a Setpgid: true on the chromium exec in chromium-launcher so killasgroup has a group to target. Worth a separate PR because it changes shutdown semantics (graceful exit can let us drop the SingletonLock cleanup hack in chromium-launcher/main.go).
  • The new elapsed_ms numeric attribute on stopChromium could be retrofitted onto the other six elapsed=time.Since(start).String() log lines in this file for consistent aggregation in SigNoz.

Made with Cursor


Note

Medium Risk
Changes core restart readiness for configure and related Chromium APIs; responses may take longer but should be more accurate; mis-tuned timeouts could increase 500s on slow restarts.

Overview
Fixes false-positive DevTools readiness after Chromium restarts so configure/flags/policy/extension endpoints no longer return success while the browser is still unusable.

startChromiumAndWait now treats Chromium as ready only when the upstream DevTools URL differs from the pre-restart URL and a Browser.getVersion CDP round-trip succeeds on that URL. It removes the allowCurrent shortcut and no longer declares ready when supervisorctl start returns (which only means the process was forked). stopChromium gains structured chromium stopped logs with outcome, elapsed, and elapsed_ms for observability.

cdpclient adds GetBrowserVersion (plus unit tests). A new bench/bench-configure-readiness.ts script measures the gap between POST /configure returning 200 and real CDP readiness for before/after images.

Reviewed by Cursor Bugbot for commit df450e2. Bugbot is set up for automated code reviews on this repo. Configure here.

hiroTamada and others added 2 commits May 28, 2026 13:53
… duration

startChromiumAndWait could return "devtools ready" within ~135 ms of a
supervisorctl-start ack, while the new Chromium hadn't yet bound its
DevTools listener. Two contributing failure modes:

  1. The tryReady "allowCurrent" shortcut dialed the *previous*
     Chromium's UpstreamManager URL the instant supervisorctl start
     returned. That URL points at the just-killed Chromium, and the
     bare websocket.Dial could complete against a half-closed socket
     of the dying process (or against a freshly bound but not-yet-
     routed listener of the new one), producing a false "ready".
  2. tryReady never made a real CDP round-trip after dial, so it
     couldn't tell the difference between a live Chromium and a stale
     listener.

End-to-end this caused POST /configure to return 200 several seconds
before Chromium could actually serve requests, leaving live view blank
during the gap (measured ~5 s on production sfo, ~8.5 s reproducing
locally on Apple Silicon under Rosetta).

Fix:

* lib/cdpclient: add Client.GetBrowserVersion which sends a single
  Browser.getVersion CDP message. A successful round-trip proves the
  WebSocket is wired to a live, CDP-responsive Chromium.
* cmd/api/api/chromium.go: in startChromiumAndWait, drop the
  allowCurrent parameter; tryReady now always requires the upstream
  URL to differ from prevUpstream AND a successful
  Browser.getVersion. The supervisorctl-start doneCh signal is used
  only to short-circuit on start failure, not to attempt readiness.

Also adds an `outcome` + `elapsed` + `elapsed_ms` info log when
stopChromium completes successfully. This is the first numeric
duration field on the stop path, enabling SigNoz p99(elapsed_ms)
aggregations without parsing the existing Go duration strings.

Tests:

* TestGetBrowserVersion covers happy path and CDP-error propagation.
* go vet ./... and go test -race ./... pass (non-e2e).

Reproducer benchmark (separate commit) measures the api-vs-real gap
for POST /configure with viewport + --kiosk on kernel-images-headless
docker images:

    before: GAP = 8488 ms median   /configure returned 200 ~8.5 s
                                   before Browser.getVersion worked
    after:  GAP =   45 ms median   gap collapses to probe-interval
                                   noise; chromium really is ready

Co-authored-by: Cursor <cursoragent@cursor.com>
Reproducer for the readiness false-positive fixed in the previous
commit. Boots a kernel-images-headless container, waits for the
initial chromium to be CDP-responsive, issues POST /configure with
viewport (1280x800@60) + chromium_flags ["--kiosk"], then concurrently
times:

  - configureMs:  POST /configure return-to-200
  - newUUIDMs:    when /json/version starts reporting a NEW
                  webSocketDebuggerUrl (proves new chromium emitted
                  its `DevTools listening` line)
  - cdpReadyMs:   when Browser.getVersion CDP round-trip succeeds
                  against that new URL

GAP = cdpReadyMs − configureMs is the user-visible bug window: how
long after the api claims success the new chromium is actually
serving requests.

Usage:

    docker build kernel-images-headless:before  # upstream main
    docker build kernel-images-headless:after   # with this fix
    bun bench/bench-configure-readiness.ts \\
      --image kernel-images-headless:before --label before
    bun bench/bench-configure-readiness.ts \\
      --image kernel-images-headless:after  --label after

Local results on Apple Silicon (linux/amd64 under Rosetta), n=3:

    before  GAP min=8488ms median=8516ms max=8829ms
    after   GAP min=  42ms median=  45ms max=  47ms

Co-authored-by: Cursor <cursoragent@cursor.com>
@firetiger-agent
Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: Changes are to server/cmd/api/api/chromium.go and server/lib/cdpclient, which are not in the monitored API endpoint paths (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal); this appears to be a different codebase structure.

To monitor this PR anyway, reply with @firetiger monitor this.

@hiroTamada hiroTamada requested a review from rgarcia May 28, 2026 17:56
@hiroTamada hiroTamada merged commit 3ca0415 into main May 28, 2026
10 checks passed
@hiroTamada hiroTamada deleted the fix/chromium-readiness-false-positive branch May 28, 2026 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants