Skip to content

fix(sandbox): pin gateway.reload=hot and add gateway serving watchdog (#4710)#5181

Open
ericksoa wants to merge 8 commits into
mainfrom
fix/4710-gateway-wedge-v2
Open

fix(sandbox): pin gateway.reload=hot and add gateway serving watchdog (#4710)#5181
ericksoa wants to merge 8 commits into
mainfrom
fix/4710-gateway-wedge-v2

Conversation

@ericksoa

@ericksoa ericksoa commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes out the remaining failure mode behind #4710: with the healthcheck marker made truthful (#5116), docker-driver sandboxes still go (unhealthy) whenever the in-sandbox OpenClaw gateway wedges — a restart-class openclaw.json change makes the gateway SIGUSR1-restart itself in-process, and a failed restart parks the process alive with its HTTP listener closed (gateway startup failed: ... Process will stay alive). The PID-wait respawn loop (#2757) never sees it because the process never exits. This PR prevents the wedge and makes any future wedge self-heal.

Update: #5116 has merged; this branch has absorbed it via merge with main, so the diff now shows only this PR's own changes (reload pin, watchdog, HEALTHCHECK pattern, docs, tests). Conflicts with the merged #5116 were resolved in favor of its final wording/test shape.

Related Issue

Addresses #4710 (final closure after QA re-verification on a rebuilt sandbox image).

Changes

  • Pin gateway.reload.mode: "hot" in the generated sandbox config (scripts/generate-openclaw-config.mts). In the default hybrid mode, restart-class config paths (plugins.installs, models.pricing, unrecognized keys, …) trigger a self-SIGUSR1 in-process restart that closes the listener; in containers OpenClaw deliberately disables fresh-PID respawn, and a failed in-process restart parks the gateway alive-but-deaf. Hot mode ignores plan-driven restarts, so the listener never closes; NemoClaw applies restart-class changes via rebuild or nemoclaw <name> recover instead. The fix(openclaw): merge restored config with runtime state #5174 restore merge preserves the freshly generated gateway section, so the pin survives host-side state restores.
  • Gateway serving watchdog in scripts/nemoclaw-start.sh (defense-in-depth; covers writer-intent restarts that bypass hot mode and any future wedge cause): once the gateway has served at least one probe, sustained connection-refused (4 × 30s, env-tunable) on the dashboard port with the PID still alive logs a CRITICAL line to /tmp/gateway.log, TERM/KILLs the gateway, and lets the existing respawn loop relaunch it. Only curl exit 7 counts — timeouts/HTTP errors mean a listener exists and stay the Docker HEALTHCHECK's call. A PID-1-owned pidfile written at every launch/respawn site keeps the watchdog aimed at the live gateway PID; a cmdline identity check guards against PID reuse before killing.
  • HEALTHCHECK liveness pattern alignment (Dockerfile): adds the anchored ^openclaw$ alternative for OpenClaw builds that truncate the rewritten process title to the bare binary name, matching the recovery-script pattern family without letting ordinary openclaw <subcommand> invocations satisfy the probe.
  • Docs: troubleshooting section explaining the wedge signature, the hot-mode pin, and how to apply restart-class config changes (recover / rebuild).
  • Tests: new test/nemoclaw-start-gateway-health.test.ts (watchdog kill/arm/reset/identity behavior against a real process, marker idempotency, behavioral launch-block wiring for both root and non-root paths with OPENSHELL_DRIVERS=docker exported to lock the fix: skip gateway marker for docker-driver sandboxes #4748 env-gate regression, respawn-loop pidfile refresh); new test/generate-openclaw-config-reload.test.ts (pin emission, env permutations, regeneration over an existing config); gateway.reload restore-preservation case in openclaw-config-merge.test.ts; HEALTHCHECK pattern matrix updates in sandbox-provisioning.test.ts.

Architecture

  • One lifecycle owner. The repo's supervision model makes the PID-1 entrypoint (nemoclaw-start.sh, [Station][Recovery] openclaw gateway killed — no auto-respawn; parent daemon exits with child; recovery requires nemoclaw connect #2757 respawn loop) the sole authority over the gateway's lifecycle. The default hybrid reload mode gave OpenClaw a second, unobservable restart authority inside the container; the hot pin removes it, routing every restart-class change through the supervisor's own observable paths (recover/rebuild). Same direction as fix(sandbox): tie gateway healthcheck marker to launch site, not env hint (#4710) #5116's marker work: observable, true-by-construction state over env-hint heuristics and hidden in-process transitions.
  • The watchdog composes with — never duplicates — the existing supervisor. It converts the one state PID-wait cannot see (alive-but-deaf) into the event the respawn loop already handles (process death). No second recovery mechanism, no new state machine; per-PID arming keeps its kill authority subordinate to observed serving.
  • Trust boundaries. The pidfile follows the repo's established symlink-safe /tmp discipline (mktemp + atomic rename, cf. prepare_plugin_refresh_log): root never opens a sandbox-influencable path, and a hostile sandbox user can only degrade its own self-healing.
  • Single source of truth for process identity. The HEALTHCHECK liveness pattern now matches the host-side recovery script's gateway-process family, eliminating drift between the two probes.
  • Workaround contract. The pin and watchdog carry an explicit removal condition (an OpenClaw release that exits non-zero on failed re-bind, proven by a wedge drill) — the same source-boundary/removal discipline as the existing OpenClaw dist-patches, without growing the patch surface.
  • Test architecture. Behavioral harnesses against real processes and the compiled artifacts (source-shape budget 0), with gateway-launch coverage consolidated into a focused gateway-health suite and the legacy file's size budget ratcheted down — the direction the growth guardrails enforce.

Type of Change

  • Code change with doc updates

Verification

  • npx prek run --all-files passes — except the test-cli hook's two pre-existing, machine-bound failures that reproduce identically on a clean main checkout on this host (live e2e scenario ubuntu-repo-cloud-openclaw attempting a real onboard, and the timing-sensitive auto-pair keepalive retry test); all other hooks pass including shellcheck, shfmt, hadolint, Biome, source-shape and size budgets.
  • npm test passes for all touched suites (nemoclaw-start-gateway-health, nemoclaw-start, nemoclaw-start-plugin-refresh, sandbox-provisioning, generate-openclaw-config, generate-openclaw-config-reload, openclaw-config-merge) — same two pre-existing host-bound failures as above in the full run.
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes

End-to-end proof still required before closing #4710 (needs an image rebuild on a docker-driver Ubuntu host): fresh onboard → docker inspect .State.Health.Status stays healthy for ≥3 health intervals; write a restart-class key in-sandbox and confirm config reload requires gateway restart; hot mode ignoring with the listener staying up; wedge drill (manual SIGUSR1) → watchdog kill → respawn → healthy.


Signed-off-by: Aaron Erickson aerickson@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Improved gateway liveness detection so sandboxes are marked unhealthy only when truly unresponsive, including handling of truncated process titles.
  • New Features

    • Added an in-sandbox HTTP listener watchdog that detects and restarts gateways that stop accepting connections.
  • Documentation

    • New troubleshooting guidance for “alive but not listening” failures with recover/rebuild steps.
  • Tests

    • Added and expanded tests covering watchdog behavior, pidfile/marker handling, and liveness-pattern matching.
  • Chores

    • Generated gateway config now pins reload mode to hot; reduced a test file size budget.

nvshaxie and others added 2 commits June 10, 2026 23:56
…hint (#4710)

The early conditional gating mark_in_container_gateway on
OPENSHELL_DRIVERS=docker (PR #4748) never fires inside the sandbox
container because OpenShell 0.0.44 does not export OPENSHELL_DRIVERS
into the sandbox env. hulynn's 2026-06-08 re-verification on v0.0.60
confirmed the symptom is unchanged from v0.0.57: marker file is created
on every NEMOCLAW_CMD-empty container, so the Dockerfile HEALTHCHECK
short-circuit (`[ -f /tmp/nemoclaw-gateway-local ] || exit 0`) is
unreachable for docker-driver sandboxes and the container is marked
(unhealthy) on every fresh onboard.

Fix: drop the early env-gated marker write and call
mark_in_container_gateway() directly before each
`openclaw gateway run --port ...` launch (both the non-root and
root-mode paths). The marker is now true-by-construction: it exists
if-and-only-if this container is about to start the gateway, regardless
of how the deployment-mode signal is (or isn't) plumbed through the
sandbox env. Restart-loop fallback launches inherit the marker from the
first launch — the function is idempotent (`: >file`) so calling it
again on retry is a no-op.

Verified by two new regression tests in test/nemoclaw-start.test.ts:
- 'ties the in-container gateway healthcheck marker to the gateway
  launch site (#4503, #4710)': asserts (a) the region immediately after
  the function definition contains no early call and no
  OPENSHELL_DRIVERS conditional, and (b) every `openclaw gateway run
  --port` invocation is preceded by mark_in_container_gateway within
  6 lines (or sits inside a restart-loop body).
- 'mark_in_container_gateway writes the marker file idempotently
  (#4710)': behavioral check on the helper itself.

The two existing launch-block tests now stub
`mark_in_container_gateway() { :; }` in their preamble to keep the
extracted shell snippet self-contained. Test count: 120 passing
(vs. 119 baseline; 19 pre-existing failures unchanged).

Note: a separate OpenClaw-side concern surfaced in @Dongni-Yang's
2026-06-04 thread (in-sandbox gateway loses its HTTP listener while
the process stays alive) is out of scope here — this fix only
addresses the marker-file regression hulynn re-confirmed in #4710 for
docker-driver sandboxes where the gateway legitimately runs on the
host. Standalone deployments where the gateway runs in-container are
unaffected: the marker is still created (just at the launch site
instead of at startup), so the existing pgrep healthcheck behavior is
preserved bit-for-bit.

Fixes #4710.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
…#4710)

The in-sandbox OpenClaw gateway watches openclaw.json and, in its default
hybrid reload mode, SIGUSR1-restarts itself in-process on restart-class
config changes (plugins.installs, models.pricing, unrecognized keys, ...).
In containers a failed in-process restart parks the gateway alive with its
HTTP listener closed ("gateway startup failed: ... Process will stay
alive"), which the PID-wait respawn loop (#2757) cannot observe. With the
healthcheck marker now truthful, Docker correctly reports the container
(unhealthy) forever — the remaining failure mode behind #4710.

Two complementary defenses:

- Pin gateway.reload.mode=hot in the generated sandbox config so plan-driven
  in-process restarts never close the listener; NemoClaw applies
  restart-class changes via rebuild or 'nemoclaw <name> recover' instead.
  Host-side state restore already preserves the freshly generated gateway
  section (#5174), so the pin survives restores.
- Add a serving watchdog to nemoclaw-start.sh: once the gateway has served
  at least one probe, sustained connection-refused on the dashboard port
  with the process still alive triggers TERM/KILL of the gateway PID so the
  existing respawn loop relaunches it. A pidfile written by PID 1 at every
  launch/respawn site keeps the watchdog aimed at the live gateway PID, and
  a cmdline identity check guards against PID reuse before killing.

Also align the Dockerfile HEALTHCHECK liveness pattern with the recovery
script's gateway-process pattern family, adding an anchored ^openclaw$
alternative for builds that truncate the rewritten process title, and
document the hot-mode pin in the troubleshooting page.

The #4503/#4710 marker coverage moves from test/nemoclaw-start.test.ts (at
its size budget) into a focused gateway-health suite, converted from
source-text assertions to behavioral launch-block and respawn-loop
harnesses per the source-shape budget; the legacy size budget ratchets
down accordingly.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8d87354f-0bd8-4147-a428-d6de5aedadb8

📥 Commits

Reviewing files that changed from the base of the PR and between 6bb5c3e and 9c18e06.

📒 Files selected for processing (2)
  • ci/test-file-size-budget.json
  • docs/reference/troubleshooting.mdx
✅ Files skipped from review due to trivial changes (2)
  • ci/test-file-size-budget.json
  • docs/reference/troubleshooting.mdx

📝 Walkthrough

Walkthrough

Pins gateway reload mode to "hot", broadens Docker HEALTHCHECK process matching to include truncated/rewritten OpenClaw titles, adds an in-container HTTP-listener watchdog that kills a non-serving gateway to trigger respawn, and adds tests and troubleshooting docs covering these behaviors.

Changes

Gateway Health and Listener Recovery

Layer / File(s) Summary
Health check pattern refinement
Dockerfile, test/sandbox-provisioning.test.ts
HEALTHCHECK liveness detection updated to use `^openclaw$
Gateway reload mode pinning
scripts/generate-openclaw-config.mts, test/generate-openclaw-config-reload.test.ts, src/lib/state/openclaw-config-merge.test.ts
Config generation and merge behavior enforce gateway.reload.mode: "hot"; tests confirm the pin is preserved across env permutations and when regenerating existing configs.
In-container watchdog system
scripts/nemoclaw-start.sh
Start script adds pidfile recording, PID validation, and a background watchdog probing /health; on sustained connection-refused (curl exit code 7) after serving, it SIGTERM/SIGKILLs the gateway PID to trigger respawn; integrated into both root and non-root launch/respawn flows.
Test coverage for watchdog and reload behavior
test/nemoclaw-start-gateway-health.test.ts, test/nemoclaw-start.test.ts
New end-to-end and unit tests exercise watchdog scenarios, pidfile behavior, gateway PID validation, marker idempotency, launch wiring, respawn pidfile refresh, and removal of an older launch-signal test block.
User documentation and maintenance
docs/reference/troubleshooting.mdx, ci/test-file-size-budget.json
Troubleshooting entry documents listener-drop failure mode and recovery commands; test file size budget for test/nemoclaw-start.test.ts reduced.

Sequence Diagram

sequenceDiagram
  participant Watchdog
  participant Gateway
  participant RespawnLoop
  participant ProcFS
  Watchdog->>Gateway: curl /health probe
  Gateway-->>Watchdog: success or exit 7 (connection refused)
  Note over Watchdog: after observed serving, count consecutive refusals
  Watchdog->>ProcFS: validate gateway PID cmdline matches openclaw identity
  Watchdog->>Gateway: SIGTERM (if refused threshold reached)
  Watchdog->>Gateway: SIGKILL (if not exiting)
  Gateway-->>RespawnLoop: process exits
  RespawnLoop->>Watchdog: writes new PID to pidfile
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#5174: Related to merge/restore logic and tests for preserving gateway.reload.mode: "hot".
  • NVIDIA/NemoClaw#5177: Related changes to merge/restore logic for preserving gateway.reload.mode: "hot" during config restore.

Suggested labels

bug-fix, v0.0.63

Suggested reviewers

  • cv
  • jyaunches
  • prekshivyas

"I am a rabbit, watching the gate,
When listeners drop I end their wait,
Hot reload keeps the config tight,
I nudge a kill to wake the light,
Sandbox wakes — gateway back in flight."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main changes: pinning gateway reload mode to hot and adding a gateway watchdog to address the sandbox wedging issue.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/4710-gateway-wedge-v2

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 1 needs attention, 4 worth checking, 1 nice ideas
Since last review: 0 prior items resolved, 5 still apply, 1 new item found

Review findings

🛠️ Needs attention

🔎 Worth checking

  • Source-of-truth review needed: Dockerfile absent-marker healthcheck shortcut and scripts/nemoclaw-start.sh marker timing: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: Dockerfile exits 0 on curl exit 7 when `/tmp/nemoclaw-gateway-local` is absent; `nemoclaw-start.sh` writes the marker only immediately before initial gateway launch.
  • Source-of-truth review needed: Dockerfile and watchdog compatibility for bare `openclaw` process titles: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: Dockerfile uses `^openclaw$|openclaw[ -]gateway`; `gateway_pid_is_openclaw_gateway()` uses `openclaw([ -]gateway| gateway run|$)`.
  • Absent marker still conflates host-owned gateway delivery with local pre-launch startup (Dockerfile:1089): The Docker HEALTHCHECK exits healthy on curl exit 7 whenever `/tmp/nemoclaw-gateway-local` is absent. Moving marker creation to the gateway launch sites avoids the unreliable `OPENSHELL_DRIVERS` heuristic, but marker absence also describes the standalone/local-serving startup window before the launch site is reached. A local setup hang or failure before `mark_in_container_gateway` can therefore be masked as healthy instead of being treated as a local gateway boot failure.
    • Recommendation: Use an authoritative ownership/intent signal instead of overloading marker absence, such as a distinct host-owned state or a local-serving-intent marker written before long setup starts. Align the Dockerfile and script comments with that state model and add negative coverage showing standalone/local-serving pre-launch failure does not pass health solely because the marker has not yet been written.
    • Evidence: The Dockerfile has `[ -f /tmp/nemoclaw-gateway-local ] || exit 0` after curl exit 7. `scripts/nemoclaw-start.sh` writes the marker only immediately before the root and non-root `openclaw gateway run` launch sites. Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 comments also dispute whether marker suppression can mask a real in-container listener failure.
  • Gateway liveness and watchdog identity still accept spoofable process-title matches (scripts/nemoclaw-start.sh:3328): The Docker health fallback and watchdog PID-reuse check rely on process argv regexes, including newly accepted bare `openclaw` compatibility. A sandbox process can generally choose argv, and the non-bare alternatives are not fully anchored, so an adversarial or accidental non-gateway process can satisfy liveness checks once marker/log/pidfile prerequisites line up. This weakens sandbox lifecycle policy even though it is not a confirmed sandbox escape.
    • Recommendation: Tighten gateway identity checks where possible: verify expected UID, executable path or inode, exact argv forms with boundaries, parent/session relationship, or a root-owned launch record rather than only `pgrep -f`/grep over cmdline. Add adversarial negative tests for `exec -a openclaw sleep 60`, suffix variants such as `openclaw-gateway-helper`, and embedded matches such as `myopenclaw-gateway`.
    • Evidence: Dockerfile uses `pgrep --ignore-ancestors -f '^openclaw$|openclaw[ -]gateway'`. `gateway_pid_is_openclaw_gateway()` uses `grep -qE 'openclaw([ -]gateway| gateway run|$)'`. Current tests cover ordinary CLI negatives but not spoofed argv or suffix/embedded variants.

🌱 Nice ideas

  • Avoid broad `@ts-nocheck` in the new reload-config test (test/generate-openclaw-config-reload.test.ts:1): The new test file disables TypeScript checking for the whole file. That can hide drift in the `buildConfig` and file-I/O test harness contracts, especially around generated config shape.
    • Recommendation: Prefer adding narrow types for the config object and helper environment overrides, or use localized casts for dynamic JSON assertions instead of file-wide `@ts-nocheck`.
    • Evidence: `test/generate-openclaw-config-reload.test.ts` starts with `// @ts-nocheck`.
Consider writing more tests for
  • **Runtime validation** — Docker-driver-shaped entrypoint/container leaves `/tmp/nemoclaw-gateway-local` absent when no in-container dashboard gateway launch occurs.. The unit and extracted-shell coverage is broad, but this PR changes Docker HEALTHCHECK behavior, sandbox entrypoint launch, watchdog recovery, and OpenClaw config reload semantics. The key acceptance properties depend on actual runtime/container topology, Docker health state, PID 1 behavior, and OpenShell Docker-driver delivery.
  • **Runtime validation** — Docker-driver-shaped container with curl exit 7 reports Docker health `Status="healthy"` and `FailingStreak=0` without invoking `pgrep`.. The unit and extracted-shell coverage is broad, but this PR changes Docker HEALTHCHECK behavior, sandbox entrypoint launch, watchdog recovery, and OpenClaw config reload semantics. The key acceptance properties depend on actual runtime/container topology, Docker health state, PID 1 behavior, and OpenShell Docker-driver delivery.
  • **Runtime validation** — Standalone/local-serving startup that stalls before `openclaw gateway run` does not pass health solely because `/tmp/nemoclaw-gateway-local` is absent, or writes a separate local-serving-intent state before long setup.. The unit and extracted-shell coverage is broad, but this PR changes Docker HEALTHCHECK behavior, sandbox entrypoint launch, watchdog recovery, and OpenClaw config reload semantics. The key acceptance properties depend on actual runtime/container topology, Docker health state, PID 1 behavior, and OpenShell Docker-driver delivery.
  • **Runtime validation** — Rebuilt sandbox wedge drill: after `/health` has served once, force listener refusal while the PID remains alive; assert the watchdog logs CRITICAL, kills the PID, respawn records the new PID, and `/health` becomes reachable.. The unit and extracted-shell coverage is broad, but this PR changes Docker HEALTHCHECK behavior, sandbox entrypoint launch, watchdog recovery, and OpenClaw config reload semantics. The key acceptance properties depend on actual runtime/container topology, Docker health state, PID 1 behavior, and OpenShell Docker-driver delivery.
  • **Runtime validation** — Healthcheck and watchdog process identity reject adversarial argv shaping: `exec -a openclaw sleep 60`, `openclaw-gateway-helper`, and `myopenclaw-gateway`.. The unit and extracted-shell coverage is broad, but this PR changes Docker HEALTHCHECK behavior, sandbox entrypoint launch, watchdog recovery, and OpenClaw config reload semantics. The key acceptance properties depend on actual runtime/container topology, Docker health state, PID 1 behavior, and OpenShell Docker-driver delivery.
  • **Acceptance clause:** Issue [Station][Recovery] openclaw gateway killed — no auto-respawn; parent daemon exits with child; recovery requires nemoclaw connect #2757 Description: "After kill -9 on the openclaw-gateway process inside the sandbox, the parent openclaw daemon exits with the child and no auto-respawn occurs. Recovery requires the user to manually run \"nemoclaw connect\"." — add test evidence or identify existing coverage. The existing respawn loop remains and the PR adds a watchdog that converts alive-but-deaf gateway wedges into process death so the loop can relaunch. Tests cover respawn pidfile refresh, but not a real sandbox `kill -9` followed by `nemoclaw status` recovery.
  • **Acceptance clause:** Issue [Station][Recovery] openclaw gateway killed — no auto-respawn; parent daemon exits with child; recovery requires nemoclaw connect #2757 Expected Result: "Within 60 seconds, the openclaw gateway auto-respawns without user intervention. nemoclaw status shows the sandbox healthy again." — add test evidence or identify existing coverage. `scripts/nemoclaw-start.sh` still waits on `GATEWAY_PID` and respawns non-zero exits; the new watchdog records the PID and kills alive-but-deaf gateways. The diff does not include runtime `nemoclaw status` evidence.
  • **Acceptance clause:** Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Description: "marker file `/tmp/nemoclaw-gateway-local` present = gateway runs in THIS container (standalone deployments); marker absent = gateway runs on host (OpenShell Docker-driver, [Ubuntu 24.04][Sandbox][GitHub Issue #4503] sandbox Docker HEALTHCHECK always (unhealthy) — pgrep openclaw-gateway runs inside container but gateway runs on host #4503) — in-container probe cannot observe it, so HEALTHCHECK should exit 0 and defer to OpenShell host-side delivery-chain monitoring." — add test evidence or identify existing coverage. `mark_in_container_gateway` is now called at the root and non-root `openclaw gateway run` launch sites, and `test/sandbox-provisioning.test.ts` verifies absent marker plus curl exit 7 exits healthy. The remaining gap is proving a real Docker-driver-shaped sandbox leaves the marker absent; marker absence also still covers local pre-launch state.
Since last review details

Current findings:

  • Source-of-truth review needed: Dockerfile absent-marker healthcheck shortcut and scripts/nemoclaw-start.sh marker timing: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: Dockerfile exits 0 on curl exit 7 when `/tmp/nemoclaw-gateway-local` is absent; `nemoclaw-start.sh` writes the marker only immediately before initial gateway launch.
  • Source-of-truth review needed: Dockerfile and watchdog compatibility for bare `openclaw` process titles: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: Dockerfile uses `^openclaw$|openclaw[ -]gateway`; `gateway_pid_is_openclaw_gateway()` uses `openclaw([ -]gateway| gateway run|$)`.
  • Runtime evidence is still missing for the [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Docker-driver health contract (test/nemoclaw-start-gateway-health.test.ts:588): Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710's observable expected result is that Docker-driver sandboxes leave `/tmp/nemoclaw-gateway-local` absent and Docker reports `Status="healthy"` with `FailingStreak=0`. The changed tests cover the Dockerfile command in isolation, launch-site marker creation, pidfile/watchdog behavior, and config reload pinning, but they do not exercise a Docker-driver-shaped entrypoint/container lifecycle to prove the marker stays absent and Docker reaches the requested health state end-to-end.
    • Recommendation: Add or identify targeted runtime/integration validation that starts a Docker-driver-shaped sandbox/entrypoint, confirms `/tmp/nemoclaw-gateway-local` is absent when the dashboard gateway is not launched in-container, and confirms the Docker HEALTHCHECK result for curl exit 7 is healthy without invoking `pgrep` and with `FailingStreak=0`.
    • Evidence: Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Expected Result: "Specifically at step 4, the marker file should NOT exist; step 3 should show `Status=\"healthy\"` with `FailingStreak=0`." The diff adds extracted-shell and Dockerfile-command tests, not a real Docker-driver container health assertion.
  • Absent marker still conflates host-owned gateway delivery with local pre-launch startup (Dockerfile:1089): The Docker HEALTHCHECK exits healthy on curl exit 7 whenever `/tmp/nemoclaw-gateway-local` is absent. Moving marker creation to the gateway launch sites avoids the unreliable `OPENSHELL_DRIVERS` heuristic, but marker absence also describes the standalone/local-serving startup window before the launch site is reached. A local setup hang or failure before `mark_in_container_gateway` can therefore be masked as healthy instead of being treated as a local gateway boot failure.
    • Recommendation: Use an authoritative ownership/intent signal instead of overloading marker absence, such as a distinct host-owned state or a local-serving-intent marker written before long setup starts. Align the Dockerfile and script comments with that state model and add negative coverage showing standalone/local-serving pre-launch failure does not pass health solely because the marker has not yet been written.
    • Evidence: The Dockerfile has `[ -f /tmp/nemoclaw-gateway-local ] || exit 0` after curl exit 7. `scripts/nemoclaw-start.sh` writes the marker only immediately before the root and non-root `openclaw gateway run` launch sites. Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 comments also dispute whether marker suppression can mask a real in-container listener failure.
  • Gateway liveness and watchdog identity still accept spoofable process-title matches (scripts/nemoclaw-start.sh:3328): The Docker health fallback and watchdog PID-reuse check rely on process argv regexes, including newly accepted bare `openclaw` compatibility. A sandbox process can generally choose argv, and the non-bare alternatives are not fully anchored, so an adversarial or accidental non-gateway process can satisfy liveness checks once marker/log/pidfile prerequisites line up. This weakens sandbox lifecycle policy even though it is not a confirmed sandbox escape.
    • Recommendation: Tighten gateway identity checks where possible: verify expected UID, executable path or inode, exact argv forms with boundaries, parent/session relationship, or a root-owned launch record rather than only `pgrep -f`/grep over cmdline. Add adversarial negative tests for `exec -a openclaw sleep 60`, suffix variants such as `openclaw-gateway-helper`, and embedded matches such as `myopenclaw-gateway`.
    • Evidence: Dockerfile uses `pgrep --ignore-ancestors -f '^openclaw$|openclaw[ -]gateway'`. `gateway_pid_is_openclaw_gateway()` uses `grep -qE 'openclaw([ -]gateway| gateway run|$)'`. Current tests cover ordinary CLI negatives but not spoofed argv or suffix/embedded variants.
  • Avoid broad `@ts-nocheck` in the new reload-config test (test/generate-openclaw-config-reload.test.ts:1): The new test file disables TypeScript checking for the whole file. That can hide drift in the `buildConfig` and file-I/O test harness contracts, especially around generated config shape.
    • Recommendation: Prefer adding narrow types for the config object and helper environment overrides, or use localized casts for dynamic JSON assertions instead of file-wide `@ts-nocheck`.
    • Evidence: `test/generate-openclaw-config-reload.test.ts` starts with `// @ts-nocheck`.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: sandbox-survival-e2e, issue-2478-crash-loop-recovery-e2e, sandbox-operations-e2e, test-e2e-gateway-isolation
Optional E2E: cloud-onboard-e2e, runtime-overrides-e2e, gateway-guard-recovery, test-e2e-port-overrides

Dispatch hint: sandbox-survival-e2e,issue-2478-crash-loop-recovery-e2e,sandbox-operations-e2e

Auto-dispatched E2E: sandbox-survival-e2e, issue-2478-crash-loop-recovery-e2e, sandbox-operations-e2e via nightly-e2e.yaml at 9c18e06cd48234ec8f132be881c601ec29c50ed6nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • sandbox-survival-e2e (high (~30 min, live NVIDIA API key)): High-signal live proof for sandbox lifecycle after gateway restart: onboards a real sandbox, verifies workspace/state persistence, restarts the gateway path, and confirms inference still works. This is directly relevant to the new gateway watchdog and respawn pidfile behavior.
  • issue-2478-crash-loop-recovery-e2e (high (~30 min, live NVIDIA API key)): Exercises repeated in-sandbox gateway kill/recovery cycles and verifies the gateway returns to a serving state with inference after recovery. This is the closest existing live E2E to the changed gateway supervision/respawn path in nemoclaw-start.sh.
  • sandbox-operations-e2e (very high (~60 min, live sandbox operations)): Broad sandbox operations suite covering onboarded sandbox behavior, recovery operations, multi-sandbox lifecycle, and destructive gateway-kill recovery. Required because this PR changes entrypoint-managed gateway lifecycle and health semantics.
  • test-e2e-gateway-isolation (medium (Docker image build plus container checks)): Builds the production sandbox image and validates gateway/sandbox user separation plus entrypoint hardening. Required because the new watchdog pidfile/log behavior touches /tmp trust boundaries and gateway process control inside the image.

Optional E2E

  • cloud-onboard-e2e (high (~45 min, live NVIDIA API key)): Useful broad confidence that the generated OpenClaw config, image entrypoint, sandbox health, security checks, and inference.local all work in the standard cloud onboarding flow.
  • runtime-overrides-e2e (medium (~45 min, Docker image build)): Useful image-level smoke for generated sandbox config and the real nemoclaw-start entrypoint. It can catch malformed config/entrypoint regressions adjacent to the new gateway.reload.mode pin, though it does not directly exercise live reload behavior.
  • gateway-guard-recovery (high (~45 min, live scenario)): Adjacent Vitest live recovery scenario for gateway recovery after guard-chain disruption. It is not a direct watchdog/listener-drop test, but it increases confidence in the same recover/relaunch surface.
  • test-e2e-port-overrides (medium (requires built production image)): Optional Docker-image confidence for the real entrypoint and dashboard port handling, adjacent to the Docker HEALTHCHECK changes that probe the dashboard health endpoint.

New E2E recommendations

  • gateway listener watchdog / OpenClaw reload wedge (high): Existing E2E tests kill or restart gateway processes, but none appear to create the [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 condition where the OpenClaw process stays alive while its HTTP listener disappears after a restart-class config reload. Add a live test that onboards a sandbox, induces or simulates the listener-drop state, verifies [gateway-watchdog] CRITICAL is logged, confirms the process is killed/respawned, and proves /health plus inference recover.
    • Suggested test: gateway-reload-watchdog-e2e
  • generated OpenClaw reload pin in live sandbox (medium): Unit tests assert gateway.reload.mode: hot in generated config, but there is no live E2E that verifies the deployed sandbox carries the pin and that a restart-class change such as plugin installation does not self-restart/wedge the gateway before a supervised nemoclaw <name> recover.
    • Suggested test: openclaw-hot-reload-pin-live-e2e
  • Docker HEALTHCHECK process-title compatibility (medium): The Dockerfile pgrep pattern now accepts bare openclaw; current coverage is mostly command-level/unit-style. A small image E2E could run the production HEALTHCHECK command against controlled process-title variants and assert no false healthy result for ordinary openclaw <subcommand> CLI processes.
    • Suggested test: docker-healthcheck-openclaw-title-e2e

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: sandbox-survival-e2e,issue-2478-crash-loop-recovery-e2e,sandbox-operations-e2e

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: ubuntu-repo-cloud-openclaw
Optional Vitest E2E scenarios: ubuntu-repo-docker-post-reboot-recovery

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • ubuntu-repo-cloud-openclaw: Core live-supported OpenClaw Docker onboarding scenario exercises the changed Dockerfile entrypoint/healthcheck path plus generated OpenClaw config and validates the gateway and sandbox reach a healthy ready state.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional Vitest E2E scenarios

  • ubuntu-repo-docker-post-reboot-recovery: Adjacent live-supported recovery scenario exercises Docker-driver sandbox preservation after lifecycle disruption; useful because the PR changes gateway supervision/health behavior, but it is not the primary startup health path.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-docker-post-reboot-recovery

Relevant changed files

  • Dockerfile
  • scripts/generate-openclaw-config.mts
  • scripts/nemoclaw-start.sh

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
Dockerfile (1)

1091-1103: Run the container-level E2E matrix before merge.

Line 1091-Line 1103 changes are runtime/container-health semantics; please execute the recommended E2E jobs before closing this PR.

As per coding guidelines, Dockerfile layer ordering, permissions, and baked config changes are only testable with a real container build.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile` around lines 1091 - 1103, The Dockerfile HEALTHCHECK block and
related runtime semantics (HEALTHCHECK, NEMOCLAW_DASHBOARD_PORT,
OPENCLAW_GATEWAY_PORT, CHAT_UI_URL, /tmp/nemoclaw-gateway-local, pgrep
'^openclaw$|openclaw[ -]gateway', and /tmp/gateway.log) affect container
behavior and must be validated by running the container-level E2E matrix before
merging; build the image and run the recommended E2E jobs that exercise the
HEALTHCHECK path (including variants with and without CHAT_UI_URL and both port
resolution fallbacks), verify layer ordering, file permissions, and baked config
inside the running container, and confirm the healthcheck exits/status match
expectations under success, curl failure (non-7 rc), curl
timeout/connection-refused (rc 7) and local-gateway fallback scenarios prior to
approving the PR.

Source: Coding guidelines

scripts/nemoclaw-start.sh (1)

3218-3320: Run the entrypoint E2Es for both launch paths.

This touches PID 1 startup, respawn, and watchdog wiring in both the non-root and root branches. Unit tests help here, but they will not cover the process-lifecycle and sandbox-behavior risks this file owns. As per coding guidelines, scripts/nemoclaw-start.sh changes should be verified with sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and openclaw-slack-pairing-e2e.

Also applies to: 3427-3454, 3658-3704

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 3218 - 3320, Run the full entrypoint
end-to-end suites for both launch paths (non-root and root) to validate PID 1
startup, respawn and watchdog wiring changes touching GATEWAY_PID_FILE,
record_gateway_pid, gateway_pid_is_openclaw_gateway,
start_gateway_serving_watchdog and GATEWAY_WATCHDOG_PID; specifically execute
sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e and
openclaw-slack-pairing-e2e against the updated scripts/nemoclaw-start.sh
behavior (also verify the other touched regions mentioned: the similar changes
around lines referenced in the review) and report any lifecycle or sandbox
failures for fixes.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/troubleshooting.mdx`:
- Around line 369-372: Split the long sentence that currently spans multiple
lines into separate source lines so each sentence is on its own line, and
replace the colon after "stays alive" with a period (i.e., change "stays alive:
a restart-class..." to "stays alive. A restart-class..."); ensure any remaining
colons are only used to introduce explicit lists, and update the sentence breaks
in the paragraph about the in-sandbox OpenClaw gateway /tmp/gateway.log and
Docker health status accordingly.

In `@scripts/nemoclaw-start.sh`:
- Around line 3267-3291: The watchdog must reset armed and refused_streak when
the pidfile value changes; detect when the read pid (from "$GATEWAY_PID_FILE"
into pid) differs from a saved prior_pid and set armed=0 and refused_streak=0
(and update prior_pid) immediately after reading pid (before checking kill -0 or
probing the health endpoint) so a newly spawned gateway does not inherit the old
process's armed/refused state; use the existing pid/GATEWAY_PID_FILE variables
and maintain a prior_pid variable in the loop to perform this reset.
- Around line 3264-3268: The watchdog uses
NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and
NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD directly; validate both in
start_gateway_serving_watchdog before any sleep or numeric comparison by
ensuring they match a positive-integer regex (e.g. ^[1-9][0-9]*$) so zero-like
values ("0","00","000") are rejected, log an error and either exit or fall back
to a safe default when invalid, then convert/assign validated values to interval
and refused_threshold and use those for sleep "$interval" and the numeric test [
"$refused_streak" -lt "$refused_threshold" ].

---

Nitpick comments:
In `@Dockerfile`:
- Around line 1091-1103: The Dockerfile HEALTHCHECK block and related runtime
semantics (HEALTHCHECK, NEMOCLAW_DASHBOARD_PORT, OPENCLAW_GATEWAY_PORT,
CHAT_UI_URL, /tmp/nemoclaw-gateway-local, pgrep '^openclaw$|openclaw[
-]gateway', and /tmp/gateway.log) affect container behavior and must be
validated by running the container-level E2E matrix before merging; build the
image and run the recommended E2E jobs that exercise the HEALTHCHECK path
(including variants with and without CHAT_UI_URL and both port resolution
fallbacks), verify layer ordering, file permissions, and baked config inside the
running container, and confirm the healthcheck exits/status match expectations
under success, curl failure (non-7 rc), curl timeout/connection-refused (rc 7)
and local-gateway fallback scenarios prior to approving the PR.

In `@scripts/nemoclaw-start.sh`:
- Around line 3218-3320: Run the full entrypoint end-to-end suites for both
launch paths (non-root and root) to validate PID 1 startup, respawn and watchdog
wiring changes touching GATEWAY_PID_FILE, record_gateway_pid,
gateway_pid_is_openclaw_gateway, start_gateway_serving_watchdog and
GATEWAY_WATCHDOG_PID; specifically execute sandbox-survival-e2e,
sandbox-operations-e2e, cloud-e2e and openclaw-slack-pairing-e2e against the
updated scripts/nemoclaw-start.sh behavior (also verify the other touched
regions mentioned: the similar changes around lines referenced in the review)
and report any lifecycle or sandbox failures for fixes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ba0b254f-f450-4a0a-8b84-9a2a441c6f15

📥 Commits

Reviewing files that changed from the base of the PR and between 4517ecd and 092a677.

📒 Files selected for processing (10)
  • Dockerfile
  • ci/test-file-size-budget.json
  • docs/reference/troubleshooting.mdx
  • scripts/generate-openclaw-config.mts
  • scripts/nemoclaw-start.sh
  • src/lib/state/openclaw-config-merge.test.ts
  • test/generate-openclaw-config-reload.test.ts
  • test/nemoclaw-start-gateway-health.test.ts
  • test/nemoclaw-start.test.ts
  • test/sandbox-provisioning.test.ts

Comment thread docs/reference/troubleshooting.mdx Outdated
Comment thread scripts/nemoclaw-start.sh
Comment thread scripts/nemoclaw-start.sh
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27312060801
Target ref: 092a677c1273784a8aff457ba00ddb1568e970ec
Workflow ref: main
Requested jobs: sandbox-survival-e2e,sandbox-operations-e2e,issue-2478-crash-loop-recovery-e2e,rebuild-openclaw-e2e
Summary: 3 passed, 1 failed, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ✅ success
rebuild-openclaw-e2e ❌ failure
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success

Failed jobs: rebuild-openclaw-e2e. Check run artifacts for logs.

ericksoa added 2 commits June 11, 2026 08:04
…e-v2

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	ci/test-file-size-budget.json
…nobs (#4710)

Review follow-ups from #5181:

- The serving watchdog's armed state and refused streak survived a fast
  respawn: if the pidfile switched to a relaunched gateway between probes,
  the new process inherited its predecessor's serve history and could be
  killed for refusals emitted during its own boot. Track the last observed
  PID and re-arm from scratch whenever it changes.
- Validate NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and
  NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD as positive integers; a zero
  or garbage interval would busy-loop the probe and a zero threshold would
  kill on the first refusal. Invalid values log a warning and fall back to
  the defaults.
- Reflow the troubleshooting section to one sentence per source line per
  the docs style guide.

The pid-swap regression test fails against the previous watchdog body and
passes with the per-PID reset.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)

3269-3282: 💤 Low value

Glob pattern accepts trailing non-digits.

The pattern [1-9][0-9]* uses glob semantics where * matches any characters. Values like "30abc" pass validation (3→[1-9], 0→[0-9], abc→*), then sleep "30abc" fails, silently killing the watchdog subshell.

The past review suggested *[!0-9]* which rejects any string containing a non-digit:

♻️ Stricter validation pattern
     case "$interval" in
-      [1-9] | [1-9][0-9]*) ;;
+      '' | *[!0-9]* | 0)
+        echo "[gateway-watchdog] invalid NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS='${interval}'; defaulting to 30" >&2
+        interval=30
+        ;;
+    esac
+    case "$refused_threshold" in
+      '' | *[!0-9]* | 0)
+        echo "[gateway-watchdog] invalid NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD='${refused_threshold}'; defaulting to 4" >&2
+        refused_threshold=4
+        ;;
+    esac
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 3269 - 3282, The case patterns for
interval and refused_threshold allow trailing non-digits due to glob *, so
update the validation to explicitly reject empty or any non-digit characters
first (e.g. match ''|*[!0-9]* ) and set the defaults, then fall through to the
existing numeric patterns; apply this change for the variables interval and
refused_threshold (and their related env names
NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and
NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD) so values like "30abc" are treated
as invalid rather than passed to sleep.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 3269-3282: The case patterns for interval and refused_threshold
allow trailing non-digits due to glob *, so update the validation to explicitly
reject empty or any non-digit characters first (e.g. match ''|*[!0-9]* ) and set
the defaults, then fall through to the existing numeric patterns; apply this
change for the variables interval and refused_threshold (and their related env
names NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and
NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD) so values like "30abc" are treated
as invalid rather than passed to sleep.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 65a4472d-1d22-4ce8-84df-42d8a0ce36b9

📥 Commits

Reviewing files that changed from the base of the PR and between 092a677 and 5d439f9.

📒 Files selected for processing (8)
  • Dockerfile
  • ci/test-file-size-budget.json
  • docs/reference/troubleshooting.mdx
  • scripts/generate-openclaw-config.mts
  • scripts/nemoclaw-start.sh
  • src/lib/state/openclaw-config-merge.test.ts
  • test/nemoclaw-start-gateway-health.test.ts
  • test/sandbox-provisioning.test.ts
💤 Files with no reviewable changes (1)
  • test/nemoclaw-start-gateway-health.test.ts
✅ Files skipped from review due to trivial changes (1)
  • ci/test-file-size-budget.json
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lib/state/openclaw-config-merge.test.ts
  • scripts/generate-openclaw-config.mts
  • docs/reference/troubleshooting.mdx
  • test/sandbox-provisioning.test.ts
  • Dockerfile

PR Review Advisor security finding on #5181: record_gateway_pid wrote the
pidfile in sticky, world-writable /tmp with unlink-then-open plus chmod.
In root mode a sandbox process could race-plant a symlink at that path
between respawns; PID 1 (root) would follow it on open, yielding an
arbitrary in-container root file write/chmod primitive.

Write to a mktemp-owned 0600 file and atomically rename it into place:
rename(2) replaces a planted symlink as a directory entry instead of
following it, and the result stays owned by PID 1's uid. The sandbox user
can at worst deny its own sandbox's self-healing (e.g. by pre-creating a
directory at the path), which the watchdog already tolerates. Regression
test plants a symlink to a sensitive file and proves the target is never
opened, written, or chmod-ed.

Also document the removal condition for the reload pin and watchdog next
to the pin (an OpenClaw release that exits non-zero when a failed
in-process restart cannot re-bind, proven by a wedge drill).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa

Copy link
Copy Markdown
Contributor Author

Re: PR Review Advisor findings —

Pidfile symlink race (needs attention): valid — fixed in 0fd52dc. record_gateway_pid now writes to a mktemp-owned 0600 file and atomically renames it into place; rename(2) replaces a planted symlink as a directory entry instead of following it, so PID 1 never opens an attacker-controlled /tmp pathname. Worst case for a hostile sandbox process is denying its own sandbox's self-healing (e.g. pre-creating a directory at the path), which the watchdog tolerates. Regression test plants a symlink to a sensitive 0600 file and proves the target is never opened, written, or chmod-ed.

Removal condition for the hot pin / watchdog: added in 0fd52dc next to the reload: { mode: "hot" } pin — both can be revisited once the pinned OpenClaw release exits non-zero when a failed in-process restart cannot re-bind its listener (so the PID-wait respawn loop sees the death), proven by a wedge drill. An upstream OpenClaw report with the captured evidence is being prepared.

Marker-absence startup window (worth checking): acknowledged as a real but bounded trade-off inherited from the #4503/#5116 marker design rather than introduced here: before the launch site runs, curl-7 plus no marker reads as healthy, bounded by the HEALTHCHECK start-period and the fact that every local-serving path writes the marker immediately before openclaw gateway run. A distinct local-serving-intent state is a reasonable follow-up to the marker design, but it belongs with #5116's marker semantics, not this PR.

Runtime evidence for the Docker-driver health contract: the dispatched ubuntu-repo-cloud-openclaw Vitest scenario covers the onboard path on this branch, and the auto-dispatched sandbox-survival/sandbox-operations/crash-loop-recovery/rebuild-openclaw e2e jobs run against this head. The full Docker-driver health acceptance (marker absent + healthy + FailingStreak=0 on a real docker-driver host) is the QA re-verification gate for closing #4710, called out in the PR description's end-to-end proof section.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27327674517
Target ref: 5d439f91317cc66ab18f8a0c48cf08baf7bc3bf3
Workflow ref: main
Requested jobs: device-auth-health-e2e,sandbox-survival-e2e,issue-2478-crash-loop-recovery-e2e,openclaw-inference-switch-e2e
Summary: 4 passed, 0 failed, 0 skipped

Job Result
device-auth-health-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
sandbox-survival-e2e ✅ success

@ericksoa ericksoa self-assigned this Jun 11, 2026
@ericksoa ericksoa added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery integration: openclaw OpenClaw integration behavior platform: container Affects Docker, containerd, Podman, or images v0.0.64 Release target labels Jun 11, 2026
…e-v2

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	ci/test-file-size-budget.json
#	scripts/nemoclaw-start.sh
#	test/nemoclaw-start.test.ts
Comment thread test/nemoclaw-start-gateway-health.test.ts Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/nemoclaw-start.sh (1)

2310-2335: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a unique temp path for each repair write.

save() always stages through .<name>.tmp, so two concurrent openclaw devices approve repairs can trample the same temp file. One invocation can os.replace() the staging path out from under the other and leave pending.json / paired.json only half-repaired.

💡 Minimal fix
 import json
 import os
 import re
+import tempfile
 from pathlib import Path
@@
 def save(name, value):
     path = root / name
-    tmp = path.with_name(f".{path.name}.tmp")
-    with tmp.open("w", encoding="utf-8") as handle:
-        handle.write(json.dumps(value, indent=2, sort_keys=True) + "\n")
-        handle.flush()
-        os.fsync(handle.fileno())
-    os.replace(tmp, path)
+    fd, tmp_name = tempfile.mkstemp(
+        prefix=f".{path.name}.",
+        suffix=".tmp",
+        dir=path.parent,
+    )
+    try:
+        with os.fdopen(fd, "w", encoding="utf-8") as handle:
+            handle.write(json.dumps(value, indent=2, sort_keys=True) + "\n")
+            handle.flush()
+            os.fsync(handle.fileno())
+        os.replace(tmp_name, path)
+    finally:
+        try:
+            os.unlink(tmp_name)
+        except FileNotFoundError:
+            pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 2310 - 2335, The save() helper
currently always stages writes to a fixed temp name (tmp =
path.with_name(f".{path.name}.tmp")), causing concurrent approve processes to
clobber each other's staging file; change save() to create a unique temp file
per invocation in the same directory (e.g., include PID and a short random/UUID
suffix or use a per-call tempfile in root) so tmp is unique, write and fsync to
that unique tmp, then os.replace(tmp, path) to atomically swap; ensure the
symbols to modify are save(), path, tmp, and os.replace usage so concurrent runs
don't trample each other's temp files.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 2310-2335: The save() helper currently always stages writes to a
fixed temp name (tmp = path.with_name(f".{path.name}.tmp")), causing concurrent
approve processes to clobber each other's staging file; change save() to create
a unique temp file per invocation in the same directory (e.g., include PID and a
short random/UUID suffix or use a per-call tempfile in root) so tmp is unique,
write and fsync to that unique tmp, then os.replace(tmp, path) to atomically
swap; ensure the symbols to modify are save(), path, tmp, and os.replace usage
so concurrent runs don't trample each other's temp files.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 42172e84-7862-429e-94a2-45757bf1e0ae

📥 Commits

Reviewing files that changed from the base of the PR and between 5d439f9 and cdae954.

📒 Files selected for processing (5)
  • ci/test-file-size-budget.json
  • scripts/generate-openclaw-config.mts
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start-gateway-health.test.ts
  • test/nemoclaw-start.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • ci/test-file-size-budget.json
  • scripts/generate-openclaw-config.mts
  • test/nemoclaw-start-gateway-health.test.ts

…en-use races (#4710)

Two CI follow-ups on the main sync:

- CodeQL flagged a js/file-system-race (high) in the pidfile symlink test:
  lstat-then-read leaves a check-to-use window. Open the pidfile once with
  O_NOFOLLOW so a single syscall both rejects symlinks and reads content —
  a strictly stronger assertion — and replace every remaining
  exists-then-read pattern in the gateway-health suite with a race-free
  read-and-catch helper.
- The codebase growth guardrail requires the test size budget to stay
  monotonic against main, which ratcheted test/nemoclaw-start.test.ts to
  5231 lines while this branch's merge resolution sat at 5237. Move the
  gateway-launch signal-handling suite into the gateway-health file — the
  file that owns gateway-launch coverage — bringing the legacy file to
  5100 lines and ratcheting its budget entry down accordingly.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa requested a review from cv June 11, 2026 12:28
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27346544424
Target ref: 6bb5c3e80c109e79b2a54c1e06b06fd2e7fd27af
Workflow ref: main
Requested jobs: cloud-e2e,sandbox-operations-e2e,sandbox-survival-e2e,issue-2478-crash-loop-recovery-e2e
Summary: 1 passed, 3 failed, 0 skipped

Job Result
cloud-e2e ❌ failure
issue-2478-crash-loop-recovery-e2e ❌ failure
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ✅ success

Failed jobs: cloud-e2e, issue-2478-crash-loop-recovery-e2e, sandbox-operations-e2e. Check run artifacts for logs.

@ericksoa ericksoa removed the request for review from cv June 11, 2026 12:43
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27347711217
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: cloud-e2e,sandbox-operations-e2e,sandbox-survival-e2e,issue-2478-crash-loop-recovery-e2e
Summary: 2 passed, 2 failed, 0 skipped

Job Result
cloud-e2e ❌ failure
issue-2478-crash-loop-recovery-e2e ✅ success
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ✅ success

Failed jobs: cloud-e2e, sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27348600338
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: cloud-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ✅ success

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27349135996
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27347711217
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: cloud-e2e,sandbox-operations-e2e,sandbox-survival-e2e,issue-2478-crash-loop-recovery-e2e
Summary: 2 passed, 2 failed, 0 skipped

Job Result
cloud-e2e ❌ failure
issue-2478-crash-loop-recovery-e2e ✅ success
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ✅ success

Failed jobs: cloud-e2e, sandbox-operations-e2e. Check run artifacts for logs.

…e-v2

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	ci/test-file-size-budget.json
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27377892065
Target ref: 9c18e06cd48234ec8f132be881c601ec29c50ed6
Workflow ref: main
Requested jobs: sandbox-survival-e2e,issue-2478-crash-loop-recovery-e2e,sandbox-operations-e2e
Summary: 1 passed, 2 failed, 0 cancelled, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ❌ failure
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27378332375
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: all (no filter)
Summary: 27 passed, 38 failed, 0 cancelled, 3 skipped

Job Result
agent-turn-latency-e2e ❌ failure
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ❌ failure
channels-stop-start-hermes-e2e ❌ failure
channels-stop-start-openclaw-e2e ❌ failure
cloud-e2e ❌ failure
cloud-inference-e2e ❌ failure
cloud-onboard-e2e ✅ success
common-egress-agent-e2e ✅ success
concurrent-gateway-ports-e2e ✅ success
credential-migration-e2e ❌ failure
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ❌ failure
device-auth-health-e2e ✅ success
diagnostics-e2e ❌ failure
docs-validation-e2e ✅ success
double-onboard-e2e ❌ failure
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ❌ failure
hermes-dashboard-e2e ❌ failure
hermes-discord-e2e ❌ failure
hermes-e2e ❌ failure
hermes-inference-switch-e2e ❌ failure
hermes-onboard-security-posture-e2e ❌ failure
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ❌ failure
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ❌ failure
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ❌ failure
issue-4462-scope-upgrade-approval-e2e ❌ failure
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ❌ failure
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ❌ failure
onboard-negative-paths-e2e ❌ failure
onboard-repair-e2e ❌ failure
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ❌ failure
openclaw-onboard-security-posture-e2e ❌ failure
openclaw-skill-cli-e2e ❌ failure
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ❌ failure
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ❌ failure
rebuild-hermes-stale-base-e2e ❌ failure
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ❌ failure
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ❌ failure
sessions-agents-cli-e2e ❌ failure
shields-config-e2e ❌ failure
skill-agent-e2e ❌ failure
snapshot-commands-e2e ❌ failure
state-backup-restore-e2e ✅ success
telegram-injection-e2e ❌ failure
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ❌ failure

Failed jobs: agent-turn-latency-e2e, channels-add-remove-e2e, channels-stop-start-hermes-e2e, channels-stop-start-openclaw-e2e, cloud-e2e, cloud-inference-e2e, credential-migration-e2e, cron-preflight-inference-local-e2e, diagnostics-e2e, double-onboard-e2e, hermes-anthropic-inference-switch-e2e, hermes-dashboard-e2e, hermes-discord-e2e, hermes-e2e, hermes-inference-switch-e2e, hermes-onboard-security-posture-e2e, hermes-slack-e2e, issue-2478-crash-loop-recovery-e2e, issue-4462-gateway-pinned-approval-characterization-e2e, issue-4462-scope-upgrade-approval-e2e, launchable-smoke-e2e, network-policy-e2e, onboard-negative-paths-e2e, onboard-repair-e2e, openclaw-inference-switch-e2e, openclaw-onboard-security-posture-e2e, openclaw-skill-cli-e2e, openclaw-tui-chat-correlation-e2e, rebuild-hermes-e2e, rebuild-hermes-stale-base-e2e, runtime-overrides-e2e, sandbox-survival-e2e, sessions-agents-cli-e2e, shields-config-e2e, skill-agent-e2e, snapshot-commands-e2e, telegram-injection-e2e, upgrade-stale-sandbox-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27393873567
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 cancelled, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27394430307
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: all (no filter)
Summary: 52 passed, 13 failed, 0 cancelled, 3 skipped

Job Result
agent-turn-latency-e2e ✅ success
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-hermes-e2e ✅ success
channels-stop-start-openclaw-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ❌ failure
cloud-onboard-e2e ❌ failure
common-egress-agent-e2e ✅ success
concurrent-gateway-ports-e2e ✅ success
credential-migration-e2e ❌ failure
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ❌ failure
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ✅ success
hermes-dashboard-e2e ❌ failure
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ❌ failure
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ✅ success
issue-4462-scope-upgrade-approval-e2e ❌ failure
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ❌ failure
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-skill-cli-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ❌ failure
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ❌ failure
rebuild-hermes-stale-base-e2e ❌ failure
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ✅ success
sessions-agents-cli-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ❌ failure
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: cloud-inference-e2e, cloud-onboard-e2e, credential-migration-e2e, cron-preflight-inference-local-e2e, hermes-dashboard-e2e, issue-2478-crash-loop-recovery-e2e, issue-4462-scope-upgrade-approval-e2e, network-policy-e2e, openclaw-tui-chat-correlation-e2e, rebuild-hermes-e2e, rebuild-hermes-stale-base-e2e, sandbox-operations-e2e, skill-agent-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27394430307
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: all (no filter)
Summary: 62 passed, 3 failed, 0 cancelled, 3 skipped

Job Result
agent-turn-latency-e2e ✅ success
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-hermes-e2e ✅ success
channels-stop-start-openclaw-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
common-egress-agent-e2e ✅ success
concurrent-gateway-ports-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ✅ success
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ❌ failure
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ✅ success
issue-4462-scope-upgrade-approval-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-skill-cli-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ❌ failure
rebuild-hermes-stale-base-e2e ❌ failure
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
sessions-agents-cli-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: issue-2478-crash-loop-recovery-e2e, rebuild-hermes-e2e, rebuild-hermes-stale-base-e2e. Check run artifacts for logs.

1 similar comment
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27394430307
Target ref: fix/4710-gateway-wedge-v2
Requested jobs: all (no filter)
Summary: 62 passed, 3 failed, 0 cancelled, 3 skipped

Job Result
agent-turn-latency-e2e ✅ success
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-hermes-e2e ✅ success
channels-stop-start-openclaw-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
common-egress-agent-e2e ✅ success
concurrent-gateway-ports-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ✅ success
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ❌ failure
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ✅ success
issue-4462-scope-upgrade-approval-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-skill-cli-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ❌ failure
rebuild-hermes-stale-base-e2e ❌ failure
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
sessions-agents-cli-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: issue-2478-crash-loop-recovery-e2e, rebuild-hermes-e2e, rebuild-hermes-stale-base-e2e. Check run artifacts for logs.

@cv cv added v0.0.65 Release target and removed v0.0.64 Release target labels Jun 12, 2026
@ericksoa ericksoa requested review from cv and removed request for cjagwani June 12, 2026 17:11
cv added a commit that referenced this pull request Jun 13, 2026
#5182)

## Summary

`nemoclaw <name> recover` declared success on a single gateway health
probe. A wedged gateway (#4710) serves for ~20 seconds after logging
ready and then drops its HTTP listener while the process stays alive —
so recover reported a healthy gateway that was already on its way back
to the wedge. This PR makes recovery require *sustained* serving and
surfaces the wedge signature when it recurs.

## Related Issue

Related: #4710 (host-side hardening; helps existing sandboxes without an
image rebuild — complements the sandbox-side fix).

## Changes

- `waitForRecoveredSandboxGateway`
(`src/lib/actions/sandbox/process-recovery.ts`): after the first
successful probe, wait a settle window
(`NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS`, default 25s, `0` disables)
and require a confirm probe to still succeed before declaring recovery.
Prints a one-line progress note so the wait doesn't read as a hang.
- New `collectGatewayWedgeDiagnostics`: on confirm failure, greps
`/tmp/gateway.log` for the wedge signature (`config change requires
gateway restart` / `gateway startup failed` / `Process will stay alive`)
and prints the matching lines so the operator sees why the gateway is
unreachable despite a live PID.
- Both functions take injectable probe/sleep/exec seams; tests cover
settle-confirm success, the serve-then-drop wedge, the `0`-disable path,
initial polling interplay, and diagnostics extraction/edge cases.

## Architecture

- **Health means proven serving, not existence.** Recovery already moved
from pgrep to HTTP probes (#2342); the settle-confirm extends that same
principle from point-in-time proof to sustained proof, matched to the
wedge's ~20s time constant. Success criteria stay aligned with what the
operator actually needs to be true.
- **Monolith reduction.** The wedge diagnostics live in a focused
`gateway-wedge-diagnostics` module rather than growing the high-churn
`process-recovery.ts` lifecycle monolith (the deterministic growth
gate's direction). The sandbox exec is passed explicitly, keeping the
import graph acyclic.
- **Trust boundaries.** Matched gateway-log lines are sandbox-writable
bytes: control characters are neutralized and credential shapes redacted
before anything reaches an operator terminal.
- **Workaround contract.** The module header documents the invalid
OpenClaw state, the upstream source boundary, and the removal condition
— this PR is detection/confirmation only and is deliberately scoped as
the host-side complement to the sandbox-side fix (#5181); it does not
claim the original #4710 HEALTHCHECK/marker acceptance.
- **Seams and parity.** Probe/sleep/exec are injectable in the
established `execImpl` style, and tests run against the compiled `dist/`
like the rest of the CLI suite, with the CLI-level settle test isolated
in a focused file under the size budgets.


## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] `npx prek run --all-files` passes — except the `test-cli` hook's
two pre-existing, machine-bound failures that reproduce identically on a
clean `main` checkout on this host (live e2e scenario
`ubuntu-repo-cloud-openclaw`, timing-sensitive auto-pair keepalive
retry); all other hooks pass.
- [x] `npm test` passes for the touched suites
(`src/lib/actions/sandbox/process-recovery.test.ts`,
`test/process-recovery.test.ts`).
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes — env var follows
the existing undocumented
`NEMOCLAW_GATEWAY_RECOVERY_WAIT_SECONDS`/`_POLL_INTERVAL_SECONDS`
convention in the same function; happy to add a reference entry if
preferred.

---
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Sandbox recovery now prints sanitized gateway "wedge" diagnostics on
failure and supports an optional post-recovery "settle" confirmation
(configurable; can be disabled).

* **Tests**
* Expanded coverage for settle-confirm behavior, polling/timeouts,
intermittent listener-drop scenarios, diagnostic
extraction/sanitization, and a CLI test path that disables the settle
delay for faster runs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@cv cv added v0.0.66 Release target and removed v0.0.65 Release target labels Jun 15, 2026
@cv

cv commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Flagging one overlap with #5517: that PR hardens the targeted /tmp creation/write paths in scripts/nemoclaw-start.sh, including /tmp/nemoclaw-gateway.pid, using same-directory temp files plus atomic rename and best-effort startup semantics.

This watchdog work still looks distinct and useful, but after #5517 lands it will likely need a rebase. Please keep/reuse the safe PID write path from #5517 rather than reintroducing a separate direct writer for /tmp/nemoclaw-gateway.pid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery integration: openclaw OpenClaw integration behavior platform: container Affects Docker, containerd, Podman, or images v0.0.66 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants