fix(sandbox): pin gateway.reload=hot and add gateway serving watchdog (#4710)#5181
fix(sandbox): pin gateway.reload=hot and add gateway serving watchdog (#4710)#5181ericksoa wants to merge 8 commits into
Conversation
…hint (#4710) The early conditional gating mark_in_container_gateway on OPENSHELL_DRIVERS=docker (PR #4748) never fires inside the sandbox container because OpenShell 0.0.44 does not export OPENSHELL_DRIVERS into the sandbox env. hulynn's 2026-06-08 re-verification on v0.0.60 confirmed the symptom is unchanged from v0.0.57: marker file is created on every NEMOCLAW_CMD-empty container, so the Dockerfile HEALTHCHECK short-circuit (`[ -f /tmp/nemoclaw-gateway-local ] || exit 0`) is unreachable for docker-driver sandboxes and the container is marked (unhealthy) on every fresh onboard. Fix: drop the early env-gated marker write and call mark_in_container_gateway() directly before each `openclaw gateway run --port ...` launch (both the non-root and root-mode paths). The marker is now true-by-construction: it exists if-and-only-if this container is about to start the gateway, regardless of how the deployment-mode signal is (or isn't) plumbed through the sandbox env. Restart-loop fallback launches inherit the marker from the first launch — the function is idempotent (`: >file`) so calling it again on retry is a no-op. Verified by two new regression tests in test/nemoclaw-start.test.ts: - 'ties the in-container gateway healthcheck marker to the gateway launch site (#4503, #4710)': asserts (a) the region immediately after the function definition contains no early call and no OPENSHELL_DRIVERS conditional, and (b) every `openclaw gateway run --port` invocation is preceded by mark_in_container_gateway within 6 lines (or sits inside a restart-loop body). - 'mark_in_container_gateway writes the marker file idempotently (#4710)': behavioral check on the helper itself. The two existing launch-block tests now stub `mark_in_container_gateway() { :; }` in their preamble to keep the extracted shell snippet self-contained. Test count: 120 passing (vs. 119 baseline; 19 pre-existing failures unchanged). Note: a separate OpenClaw-side concern surfaced in @Dongni-Yang's 2026-06-04 thread (in-sandbox gateway loses its HTTP listener while the process stays alive) is out of scope here — this fix only addresses the marker-file regression hulynn re-confirmed in #4710 for docker-driver sandboxes where the gateway legitimately runs on the host. Standalone deployments where the gateway runs in-container are unaffected: the marker is still created (just at the launch site instead of at startup), so the existing pgrep healthcheck behavior is preserved bit-for-bit. Fixes #4710. Signed-off-by: Shawn Xie <shaxie@nvidia.com>
…#4710) The in-sandbox OpenClaw gateway watches openclaw.json and, in its default hybrid reload mode, SIGUSR1-restarts itself in-process on restart-class config changes (plugins.installs, models.pricing, unrecognized keys, ...). In containers a failed in-process restart parks the gateway alive with its HTTP listener closed ("gateway startup failed: ... Process will stay alive"), which the PID-wait respawn loop (#2757) cannot observe. With the healthcheck marker now truthful, Docker correctly reports the container (unhealthy) forever — the remaining failure mode behind #4710. Two complementary defenses: - Pin gateway.reload.mode=hot in the generated sandbox config so plan-driven in-process restarts never close the listener; NemoClaw applies restart-class changes via rebuild or 'nemoclaw <name> recover' instead. Host-side state restore already preserves the freshly generated gateway section (#5174), so the pin survives restores. - Add a serving watchdog to nemoclaw-start.sh: once the gateway has served at least one probe, sustained connection-refused on the dashboard port with the process still alive triggers TERM/KILL of the gateway PID so the existing respawn loop relaunches it. A pidfile written by PID 1 at every launch/respawn site keeps the watchdog aimed at the live gateway PID, and a cmdline identity check guards against PID reuse before killing. Also align the Dockerfile HEALTHCHECK liveness pattern with the recovery script's gateway-process pattern family, adding an anchored ^openclaw$ alternative for builds that truncate the rewritten process title, and document the hot-mode pin in the troubleshooting page. The #4503/#4710 marker coverage moves from test/nemoclaw-start.test.ts (at its size budget) into a focused gateway-health suite, converted from source-text assertions to behavioral launch-block and respawn-loop harnesses per the source-shape budget; the legacy size budget ratchets down accordingly. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (2)
📝 WalkthroughWalkthroughPins gateway reload mode to "hot", broadens Docker HEALTHCHECK process matching to include truncated/rewritten OpenClaw titles, adds an in-container HTTP-listener watchdog that kills a non-serving gateway to trigger respawn, and adds tests and troubleshooting docs covering these behaviors. ChangesGateway Health and Listener Recovery
Sequence DiagramsequenceDiagram
participant Watchdog
participant Gateway
participant RespawnLoop
participant ProcFS
Watchdog->>Gateway: curl /health probe
Gateway-->>Watchdog: success or exit 7 (connection refused)
Note over Watchdog: after observed serving, count consecutive refusals
Watchdog->>ProcFS: validate gateway PID cmdline matches openclaw identity
Watchdog->>Gateway: SIGTERM (if refused threshold reached)
Watchdog->>Gateway: SIGKILL (if not exiting)
Gateway-->>RespawnLoop: process exits
RespawnLoop->>Watchdog: writes new PID to pidfile
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🌿 Preview your docs: https://nvidia-preview-pr-5181.docs.buildwithfern.com/nemoclaw |
PR Review AdvisorFindings: 1 needs attention, 4 worth checking, 1 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: Dispatch required Vitest E2E scenarios:
Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
Dockerfile (1)
1091-1103: Run the container-level E2E matrix before merge.Line 1091-Line 1103 changes are runtime/container-health semantics; please execute the recommended E2E jobs before closing this PR.
As per coding guidelines,
Dockerfilelayer ordering, permissions, and baked config changes are only testable with a real container build.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Dockerfile` around lines 1091 - 1103, The Dockerfile HEALTHCHECK block and related runtime semantics (HEALTHCHECK, NEMOCLAW_DASHBOARD_PORT, OPENCLAW_GATEWAY_PORT, CHAT_UI_URL, /tmp/nemoclaw-gateway-local, pgrep '^openclaw$|openclaw[ -]gateway', and /tmp/gateway.log) affect container behavior and must be validated by running the container-level E2E matrix before merging; build the image and run the recommended E2E jobs that exercise the HEALTHCHECK path (including variants with and without CHAT_UI_URL and both port resolution fallbacks), verify layer ordering, file permissions, and baked config inside the running container, and confirm the healthcheck exits/status match expectations under success, curl failure (non-7 rc), curl timeout/connection-refused (rc 7) and local-gateway fallback scenarios prior to approving the PR.Source: Coding guidelines
scripts/nemoclaw-start.sh (1)
3218-3320: Run the entrypoint E2Es for both launch paths.This touches PID 1 startup, respawn, and watchdog wiring in both the non-root and root branches. Unit tests help here, but they will not cover the process-lifecycle and sandbox-behavior risks this file owns. As per coding guidelines,
scripts/nemoclaw-start.shchanges should be verified withsandbox-survival-e2e,sandbox-operations-e2e,cloud-e2e, andopenclaw-slack-pairing-e2e.Also applies to: 3427-3454, 3658-3704
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 3218 - 3320, Run the full entrypoint end-to-end suites for both launch paths (non-root and root) to validate PID 1 startup, respawn and watchdog wiring changes touching GATEWAY_PID_FILE, record_gateway_pid, gateway_pid_is_openclaw_gateway, start_gateway_serving_watchdog and GATEWAY_WATCHDOG_PID; specifically execute sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e and openclaw-slack-pairing-e2e against the updated scripts/nemoclaw-start.sh behavior (also verify the other touched regions mentioned: the similar changes around lines referenced in the review) and report any lifecycle or sandbox failures for fixes.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/troubleshooting.mdx`:
- Around line 369-372: Split the long sentence that currently spans multiple
lines into separate source lines so each sentence is on its own line, and
replace the colon after "stays alive" with a period (i.e., change "stays alive:
a restart-class..." to "stays alive. A restart-class..."); ensure any remaining
colons are only used to introduce explicit lists, and update the sentence breaks
in the paragraph about the in-sandbox OpenClaw gateway /tmp/gateway.log and
Docker health status accordingly.
In `@scripts/nemoclaw-start.sh`:
- Around line 3267-3291: The watchdog must reset armed and refused_streak when
the pidfile value changes; detect when the read pid (from "$GATEWAY_PID_FILE"
into pid) differs from a saved prior_pid and set armed=0 and refused_streak=0
(and update prior_pid) immediately after reading pid (before checking kill -0 or
probing the health endpoint) so a newly spawned gateway does not inherit the old
process's armed/refused state; use the existing pid/GATEWAY_PID_FILE variables
and maintain a prior_pid variable in the loop to perform this reset.
- Around line 3264-3268: The watchdog uses
NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and
NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD directly; validate both in
start_gateway_serving_watchdog before any sleep or numeric comparison by
ensuring they match a positive-integer regex (e.g. ^[1-9][0-9]*$) so zero-like
values ("0","00","000") are rejected, log an error and either exit or fall back
to a safe default when invalid, then convert/assign validated values to interval
and refused_threshold and use those for sleep "$interval" and the numeric test [
"$refused_streak" -lt "$refused_threshold" ].
---
Nitpick comments:
In `@Dockerfile`:
- Around line 1091-1103: The Dockerfile HEALTHCHECK block and related runtime
semantics (HEALTHCHECK, NEMOCLAW_DASHBOARD_PORT, OPENCLAW_GATEWAY_PORT,
CHAT_UI_URL, /tmp/nemoclaw-gateway-local, pgrep '^openclaw$|openclaw[
-]gateway', and /tmp/gateway.log) affect container behavior and must be
validated by running the container-level E2E matrix before merging; build the
image and run the recommended E2E jobs that exercise the HEALTHCHECK path
(including variants with and without CHAT_UI_URL and both port resolution
fallbacks), verify layer ordering, file permissions, and baked config inside the
running container, and confirm the healthcheck exits/status match expectations
under success, curl failure (non-7 rc), curl timeout/connection-refused (rc 7)
and local-gateway fallback scenarios prior to approving the PR.
In `@scripts/nemoclaw-start.sh`:
- Around line 3218-3320: Run the full entrypoint end-to-end suites for both
launch paths (non-root and root) to validate PID 1 startup, respawn and watchdog
wiring changes touching GATEWAY_PID_FILE, record_gateway_pid,
gateway_pid_is_openclaw_gateway, start_gateway_serving_watchdog and
GATEWAY_WATCHDOG_PID; specifically execute sandbox-survival-e2e,
sandbox-operations-e2e, cloud-e2e and openclaw-slack-pairing-e2e against the
updated scripts/nemoclaw-start.sh behavior (also verify the other touched
regions mentioned: the similar changes around lines referenced in the review)
and report any lifecycle or sandbox failures for fixes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ba0b254f-f450-4a0a-8b84-9a2a441c6f15
📒 Files selected for processing (10)
Dockerfileci/test-file-size-budget.jsondocs/reference/troubleshooting.mdxscripts/generate-openclaw-config.mtsscripts/nemoclaw-start.shsrc/lib/state/openclaw-config-merge.test.tstest/generate-openclaw-config-reload.test.tstest/nemoclaw-start-gateway-health.test.tstest/nemoclaw-start.test.tstest/sandbox-provisioning.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 27312060801
|
…e-v2 Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # ci/test-file-size-budget.json
…nobs (#4710) Review follow-ups from #5181: - The serving watchdog's armed state and refused streak survived a fast respawn: if the pidfile switched to a relaunched gateway between probes, the new process inherited its predecessor's serve history and could be killed for refusals emitted during its own boot. Track the last observed PID and re-arm from scratch whenever it changes. - Validate NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD as positive integers; a zero or garbage interval would busy-loop the probe and a zero threshold would kill on the first refusal. Invalid values log a warning and fall back to the defaults. - Reflow the troubleshooting section to one sentence per source line per the docs style guide. The pid-swap regression test fails against the previous watchdog body and passes with the per-PID reset. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
3269-3282: 💤 Low valueGlob pattern accepts trailing non-digits.
The pattern
[1-9][0-9]*uses glob semantics where*matches any characters. Values like"30abc"pass validation (3→[1-9], 0→[0-9], abc→*), thensleep "30abc"fails, silently killing the watchdog subshell.The past review suggested
*[!0-9]*which rejects any string containing a non-digit:♻️ Stricter validation pattern
case "$interval" in - [1-9] | [1-9][0-9]*) ;; + '' | *[!0-9]* | 0) + echo "[gateway-watchdog] invalid NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS='${interval}'; defaulting to 30" >&2 + interval=30 + ;; + esac + case "$refused_threshold" in + '' | *[!0-9]* | 0) + echo "[gateway-watchdog] invalid NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD='${refused_threshold}'; defaulting to 4" >&2 + refused_threshold=4 + ;; + esac🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 3269 - 3282, The case patterns for interval and refused_threshold allow trailing non-digits due to glob *, so update the validation to explicitly reject empty or any non-digit characters first (e.g. match ''|*[!0-9]* ) and set the defaults, then fall through to the existing numeric patterns; apply this change for the variables interval and refused_threshold (and their related env names NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD) so values like "30abc" are treated as invalid rather than passed to sleep.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 3269-3282: The case patterns for interval and refused_threshold
allow trailing non-digits due to glob *, so update the validation to explicitly
reject empty or any non-digit characters first (e.g. match ''|*[!0-9]* ) and set
the defaults, then fall through to the existing numeric patterns; apply this
change for the variables interval and refused_threshold (and their related env
names NEMOCLAW_GATEWAY_WATCHDOG_INTERVAL_SECONDS and
NEMOCLAW_GATEWAY_WATCHDOG_REFUSED_THRESHOLD) so values like "30abc" are treated
as invalid rather than passed to sleep.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 65a4472d-1d22-4ce8-84df-42d8a0ce36b9
📒 Files selected for processing (8)
Dockerfileci/test-file-size-budget.jsondocs/reference/troubleshooting.mdxscripts/generate-openclaw-config.mtsscripts/nemoclaw-start.shsrc/lib/state/openclaw-config-merge.test.tstest/nemoclaw-start-gateway-health.test.tstest/sandbox-provisioning.test.ts
💤 Files with no reviewable changes (1)
- test/nemoclaw-start-gateway-health.test.ts
✅ Files skipped from review due to trivial changes (1)
- ci/test-file-size-budget.json
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lib/state/openclaw-config-merge.test.ts
- scripts/generate-openclaw-config.mts
- docs/reference/troubleshooting.mdx
- test/sandbox-provisioning.test.ts
- Dockerfile
PR Review Advisor security finding on #5181: record_gateway_pid wrote the pidfile in sticky, world-writable /tmp with unlink-then-open plus chmod. In root mode a sandbox process could race-plant a symlink at that path between respawns; PID 1 (root) would follow it on open, yielding an arbitrary in-container root file write/chmod primitive. Write to a mktemp-owned 0600 file and atomically rename it into place: rename(2) replaces a planted symlink as a directory entry instead of following it, and the result stays owned by PID 1's uid. The sandbox user can at worst deny its own sandbox's self-healing (e.g. by pre-creating a directory at the path), which the watchdog already tolerates. Regression test plants a symlink to a sensitive file and proves the target is never opened, written, or chmod-ed. Also document the removal condition for the reload pin and watchdog next to the pin (an OpenClaw release that exits non-zero when a failed in-process restart cannot re-bind, proven by a wedge drill). Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Re: PR Review Advisor findings — Pidfile symlink race (needs attention): valid — fixed in 0fd52dc. Removal condition for the hot pin / watchdog: added in 0fd52dc next to the Marker-absence startup window (worth checking): acknowledged as a real but bounded trade-off inherited from the #4503/#5116 marker design rather than introduced here: before the launch site runs, curl-7 plus no marker reads as healthy, bounded by the HEALTHCHECK start-period and the fact that every local-serving path writes the marker immediately before Runtime evidence for the Docker-driver health contract: the dispatched |
Selective E2E Results — ✅ All requested jobs passedRun: 27327674517
|
…e-v2 Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # ci/test-file-size-budget.json # scripts/nemoclaw-start.sh # test/nemoclaw-start.test.ts
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/nemoclaw-start.sh (1)
2310-2335:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a unique temp path for each repair write.
save()always stages through.<name>.tmp, so two concurrentopenclaw devices approverepairs can trample the same temp file. One invocation canos.replace()the staging path out from under the other and leavepending.json/paired.jsononly half-repaired.💡 Minimal fix
import json import os import re +import tempfile from pathlib import Path @@ def save(name, value): path = root / name - tmp = path.with_name(f".{path.name}.tmp") - with tmp.open("w", encoding="utf-8") as handle: - handle.write(json.dumps(value, indent=2, sort_keys=True) + "\n") - handle.flush() - os.fsync(handle.fileno()) - os.replace(tmp, path) + fd, tmp_name = tempfile.mkstemp( + prefix=f".{path.name}.", + suffix=".tmp", + dir=path.parent, + ) + try: + with os.fdopen(fd, "w", encoding="utf-8") as handle: + handle.write(json.dumps(value, indent=2, sort_keys=True) + "\n") + handle.flush() + os.fsync(handle.fileno()) + os.replace(tmp_name, path) + finally: + try: + os.unlink(tmp_name) + except FileNotFoundError: + pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 2310 - 2335, The save() helper currently always stages writes to a fixed temp name (tmp = path.with_name(f".{path.name}.tmp")), causing concurrent approve processes to clobber each other's staging file; change save() to create a unique temp file per invocation in the same directory (e.g., include PID and a short random/UUID suffix or use a per-call tempfile in root) so tmp is unique, write and fsync to that unique tmp, then os.replace(tmp, path) to atomically swap; ensure the symbols to modify are save(), path, tmp, and os.replace usage so concurrent runs don't trample each other's temp files.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 2310-2335: The save() helper currently always stages writes to a
fixed temp name (tmp = path.with_name(f".{path.name}.tmp")), causing concurrent
approve processes to clobber each other's staging file; change save() to create
a unique temp file per invocation in the same directory (e.g., include PID and a
short random/UUID suffix or use a per-call tempfile in root) so tmp is unique,
write and fsync to that unique tmp, then os.replace(tmp, path) to atomically
swap; ensure the symbols to modify are save(), path, tmp, and os.replace usage
so concurrent runs don't trample each other's temp files.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 42172e84-7862-429e-94a2-45757bf1e0ae
📒 Files selected for processing (5)
ci/test-file-size-budget.jsonscripts/generate-openclaw-config.mtsscripts/nemoclaw-start.shtest/nemoclaw-start-gateway-health.test.tstest/nemoclaw-start.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- ci/test-file-size-budget.json
- scripts/generate-openclaw-config.mts
- test/nemoclaw-start-gateway-health.test.ts
…en-use races (#4710) Two CI follow-ups on the main sync: - CodeQL flagged a js/file-system-race (high) in the pidfile symlink test: lstat-then-read leaves a check-to-use window. Open the pidfile once with O_NOFOLLOW so a single syscall both rejects symlinks and reads content — a strictly stronger assertion — and replace every remaining exists-then-read pattern in the gateway-health suite with a race-free read-and-catch helper. - The codebase growth guardrail requires the test size budget to stay monotonic against main, which ratcheted test/nemoclaw-start.test.ts to 5231 lines while this branch's merge resolution sat at 5237. Move the gateway-launch signal-handling suite into the gateway-health file — the file that owns gateway-launch coverage — bringing the legacy file to 5100 lines and ratcheting its budget entry down accordingly. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27346544424
|
Selective E2E Results — ❌ Some jobs failedRun: 27347711217
|
Selective E2E Results — ✅ All requested jobs passedRun: 27348600338
|
Selective E2E Results — ✅ All requested jobs passedRun: 27349135996
|
Selective E2E Results — ❌ Some jobs failedRun: 27347711217
|
…e-v2 Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # ci/test-file-size-budget.json
Selective E2E Results — ❌ Some jobs failedRun: 27377892065
|
Selective E2E Results — ❌ Some jobs failedRun: 27378332375
|
Selective E2E Results — ✅ All requested jobs passedRun: 27393873567
|
Selective E2E Results — ❌ Some jobs failedRun: 27394430307
|
Selective E2E Results — ❌ Some jobs failedRun: 27394430307
|
1 similar comment
Selective E2E Results — ❌ Some jobs failedRun: 27394430307
|
#5182) ## Summary `nemoclaw <name> recover` declared success on a single gateway health probe. A wedged gateway (#4710) serves for ~20 seconds after logging ready and then drops its HTTP listener while the process stays alive — so recover reported a healthy gateway that was already on its way back to the wedge. This PR makes recovery require *sustained* serving and surfaces the wedge signature when it recurs. ## Related Issue Related: #4710 (host-side hardening; helps existing sandboxes without an image rebuild — complements the sandbox-side fix). ## Changes - `waitForRecoveredSandboxGateway` (`src/lib/actions/sandbox/process-recovery.ts`): after the first successful probe, wait a settle window (`NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS`, default 25s, `0` disables) and require a confirm probe to still succeed before declaring recovery. Prints a one-line progress note so the wait doesn't read as a hang. - New `collectGatewayWedgeDiagnostics`: on confirm failure, greps `/tmp/gateway.log` for the wedge signature (`config change requires gateway restart` / `gateway startup failed` / `Process will stay alive`) and prints the matching lines so the operator sees why the gateway is unreachable despite a live PID. - Both functions take injectable probe/sleep/exec seams; tests cover settle-confirm success, the serve-then-drop wedge, the `0`-disable path, initial polling interplay, and diagnostics extraction/edge cases. ## Architecture - **Health means proven serving, not existence.** Recovery already moved from pgrep to HTTP probes (#2342); the settle-confirm extends that same principle from point-in-time proof to sustained proof, matched to the wedge's ~20s time constant. Success criteria stay aligned with what the operator actually needs to be true. - **Monolith reduction.** The wedge diagnostics live in a focused `gateway-wedge-diagnostics` module rather than growing the high-churn `process-recovery.ts` lifecycle monolith (the deterministic growth gate's direction). The sandbox exec is passed explicitly, keeping the import graph acyclic. - **Trust boundaries.** Matched gateway-log lines are sandbox-writable bytes: control characters are neutralized and credential shapes redacted before anything reaches an operator terminal. - **Workaround contract.** The module header documents the invalid OpenClaw state, the upstream source boundary, and the removal condition — this PR is detection/confirmation only and is deliberately scoped as the host-side complement to the sandbox-side fix (#5181); it does not claim the original #4710 HEALTHCHECK/marker acceptance. - **Seams and parity.** Probe/sleep/exec are injectable in the established `execImpl` style, and tests run against the compiled `dist/` like the rest of the CLI suite, with the CLI-level settle test isolated in a focused file under the size budgets. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx prek run --all-files` passes — except the `test-cli` hook's two pre-existing, machine-bound failures that reproduce identically on a clean `main` checkout on this host (live e2e scenario `ubuntu-repo-cloud-openclaw`, timing-sensitive auto-pair keepalive retry); all other hooks pass. - [x] `npm test` passes for the touched suites (`src/lib/actions/sandbox/process-recovery.test.ts`, `test/process-recovery.test.ts`). - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes — env var follows the existing undocumented `NEMOCLAW_GATEWAY_RECOVERY_WAIT_SECONDS`/`_POLL_INTERVAL_SECONDS` convention in the same function; happy to add a reference entry if preferred. --- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Sandbox recovery now prints sanitized gateway "wedge" diagnostics on failure and supports an optional post-recovery "settle" confirmation (configurable; can be disabled). * **Tests** * Expanded coverage for settle-confirm behavior, polling/timeouts, intermittent listener-drop scenarios, diagnostic extraction/sanitization, and a CLI test path that disables the settle delay for faster runs. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
|
Flagging one overlap with #5517: that PR hardens the targeted /tmp creation/write paths in scripts/nemoclaw-start.sh, including /tmp/nemoclaw-gateway.pid, using same-directory temp files plus atomic rename and best-effort startup semantics. This watchdog work still looks distinct and useful, but after #5517 lands it will likely need a rebase. Please keep/reuse the safe PID write path from #5517 rather than reintroducing a separate direct writer for /tmp/nemoclaw-gateway.pid. |
Summary
Closes out the remaining failure mode behind #4710: with the healthcheck marker made truthful (#5116), docker-driver sandboxes still go
(unhealthy)whenever the in-sandbox OpenClaw gateway wedges — a restart-classopenclaw.jsonchange makes the gateway SIGUSR1-restart itself in-process, and a failed restart parks the process alive with its HTTP listener closed (gateway startup failed: ... Process will stay alive). The PID-wait respawn loop (#2757) never sees it because the process never exits. This PR prevents the wedge and makes any future wedge self-heal.Related Issue
Addresses #4710 (final closure after QA re-verification on a rebuilt sandbox image).
Changes
gateway.reload.mode: "hot"in the generated sandbox config (scripts/generate-openclaw-config.mts). In the defaulthybridmode, restart-class config paths (plugins.installs,models.pricing, unrecognized keys, …) trigger a self-SIGUSR1 in-process restart that closes the listener; in containers OpenClaw deliberately disables fresh-PID respawn, and a failed in-process restart parks the gateway alive-but-deaf. Hot mode ignores plan-driven restarts, so the listener never closes; NemoClaw applies restart-class changes via rebuild ornemoclaw <name> recoverinstead. The fix(openclaw): merge restored config with runtime state #5174 restore merge preserves the freshly generatedgatewaysection, so the pin survives host-side state restores.scripts/nemoclaw-start.sh(defense-in-depth; covers writer-intent restarts that bypass hot mode and any future wedge cause): once the gateway has served at least one probe, sustained connection-refused (4 × 30s, env-tunable) on the dashboard port with the PID still alive logs a CRITICAL line to/tmp/gateway.log, TERM/KILLs the gateway, and lets the existing respawn loop relaunch it. Only curl exit 7 counts — timeouts/HTTP errors mean a listener exists and stay the Docker HEALTHCHECK's call. A PID-1-owned pidfile written at every launch/respawn site keeps the watchdog aimed at the live gateway PID; a cmdline identity check guards against PID reuse before killing.Dockerfile): adds the anchored^openclaw$alternative for OpenClaw builds that truncate the rewritten process title to the bare binary name, matching the recovery-script pattern family without letting ordinaryopenclaw <subcommand>invocations satisfy the probe.recover/rebuild).test/nemoclaw-start-gateway-health.test.ts(watchdog kill/arm/reset/identity behavior against a real process, marker idempotency, behavioral launch-block wiring for both root and non-root paths withOPENSHELL_DRIVERS=dockerexported to lock the fix: skip gateway marker for docker-driver sandboxes #4748 env-gate regression, respawn-loop pidfile refresh); newtest/generate-openclaw-config-reload.test.ts(pin emission, env permutations, regeneration over an existing config);gateway.reloadrestore-preservation case inopenclaw-config-merge.test.ts; HEALTHCHECK pattern matrix updates insandbox-provisioning.test.ts.Architecture
nemoclaw-start.sh, [Station][Recovery] openclaw gateway killed — no auto-respawn; parent daemon exits with child; recovery requires nemoclaw connect #2757 respawn loop) the sole authority over the gateway's lifecycle. The defaulthybridreload mode gave OpenClaw a second, unobservable restart authority inside the container; thehotpin removes it, routing every restart-class change through the supervisor's own observable paths (recover/rebuild). Same direction as fix(sandbox): tie gateway healthcheck marker to launch site, not env hint (#4710) #5116's marker work: observable, true-by-construction state over env-hint heuristics and hidden in-process transitions./tmpdiscipline (mktemp + atomic rename, cf.prepare_plugin_refresh_log): root never opens a sandbox-influencable path, and a hostile sandbox user can only degrade its own self-healing.gateway-healthsuite and the legacy file's size budget ratcheted down — the direction the growth guardrails enforce.Type of Change
Verification
npx prek run --all-filespasses — except thetest-clihook's two pre-existing, machine-bound failures that reproduce identically on a cleanmaincheckout on this host (live e2e scenarioubuntu-repo-cloud-openclawattempting a real onboard, and the timing-sensitive auto-pair keepalive retry test); all other hooks pass including shellcheck, shfmt, hadolint, Biome, source-shape and size budgets.npm testpasses for all touched suites (nemoclaw-start-gateway-health,nemoclaw-start,nemoclaw-start-plugin-refresh,sandbox-provisioning,generate-openclaw-config,generate-openclaw-config-reload,openclaw-config-merge) — same two pre-existing host-bound failures as above in the full run.End-to-end proof still required before closing #4710 (needs an image rebuild on a docker-driver Ubuntu host): fresh onboard →
docker inspect .State.Health.Statusstayshealthyfor ≥3 health intervals; write a restart-class key in-sandbox and confirmconfig reload requires gateway restart; hot mode ignoringwith the listener staying up; wedge drill (manual SIGUSR1) → watchdog kill → respawn → healthy.Signed-off-by: Aaron Erickson aerickson@nvidia.com
Summary by CodeRabbit
Bug Fixes
New Features
Documentation
Tests
Chores