fix(sandbox): self-heal recovery preload guards by laitingsheng · Pull Request #5265 · NVIDIA/NemoClaw

laitingsheng · 2026-06-12T02:30:07Z

Summary

The gateway recovery script aborts with refusing unguarded gateway relaunch when /tmp/nemoclaw-proxy-env.sh is present but does not install the safety-net / ciao --require preload guards into NODE_OPTIONS. Sandboxes onboarded by an older entrypoint that emitted only proxy / OPENCLAW_* exports into that file — or that have lost the staged .js preloads from /tmp — therefore cannot recover. Recovery now regenerates the staged copies from the immutable /usr/local/lib/nemoclaw/preloads/ source files, verifies provenance (regular non-symlink, mode 444, root-owned when running as uid 0) before grafting any path into NODE_OPTIONS, pins the trailing _GUARDS_MISSING check to the exact --require /tmp/<preload>.js token, and keeps the existing refusal as the final invariant when provenance fails.

This PR fixes the in-sandbox refusal mechanism — provenance, trusted-source install, and the exact-path guard check. End-to-end coverage of gateway recovery (process tree, dashboard reachability, inference healthy) is exercised by the existing test/e2e/test-issue-2478-crash-loop-recovery.sh soak; the unit tests added here pin the new boundary inside that flow.

Related Issue

Fixes #5253

Changes

src/lib/agent/runtime-recovery-preload.ts — new module. Hosts GATEWAY_PRELOAD_GUARDS (pairs each /tmp/<preload>.js with its immutable /usr/local/lib/nemoclaw/preloads/<basename>.js source) and buildGatewayPreloadSelfHealLines(). The latter emits a _nemoclaw_install_recovery_preload shell function that (a) recreates the staged /tmp copy from the trusted source via stage-and-rename when missing, mirroring emit_sandbox_sourced_file; (b) refuses to add an entry to NODE_OPTIONS unless the staged copy is a regular non-symlink with mode 444 (and root-owned when uid 0); (c) only grafts --require <path> once that exact path is not already present, so a stale marker substring no longer satisfies the check.
src/lib/agent/runtime.ts — splice the self-heal into both buildOpenClawRecoveryScript and buildRecoveryScript immediately after sourcing /tmp/nemoclaw-proxy-env.sh, wrapped in if [ "$_PE_MISSING" = "0" ]; then … fi; at the call site so the conditional is visible to reviewers. Tighten the trailing _GUARDS_MISSING check from a marker substring to the exact --require /tmp/<preload>.js token, closing the bypass where a NODE_OPTIONS that mentioned the marker text but lacked the real --require entry was treated as guarded.
src/lib/agent/runtime-recovery-preload.test.ts — new dedicated file. Shape tests pin the installer function, the _PE_MISSING=0 gate at each splice site, the trusted source paths, the ordering across both buildRecoveryScript and buildOpenClawRecoveryScript, and the exact-path form of the trailing guard check. Behavioural tests spawn bash against the extracted block with stubbed paths and explicit NODE_OPTIONS="" env override and cover: install-from-source on a clean /tmp, reuse of an already-staged mode 444 file, refusal of a symlink target, refusal of a mode 666 target, warn-and-skip when both staged and source are absent, marker-substring-without---require still installs, idempotency when both --require entries are already present, and _PE_MISSING=1 bypassing the block entirely.
src/lib/agent/runtime.test.ts — drops the obsolete shape and bash-probe tests that asserted the previous [ -r <path> ] form; the surviving #2478 tests in this file continue to pin the proxy-env sourcing, refusal, and gateway-log invariants.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

New Features
- Recovery scripts now include a gated self-heal step that stages trusted preload files into secure temporary locations and ensures safe application to Node startup.
Bug Fixes
- Improved startup resilience by restoring missing preload entries before other recovery checks, avoiding duplicates and honoring security/provenance checks.
Tests
- Added end-to-end tests covering generation, execution, permission checks, idempotence, gating behavior, and various edge cases.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai · 2026-06-12T02:30:21Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Emit guarded shell fragments that stage trusted preload files into /tmp with provenance checks and idempotently append matching --require entries to NODE_OPTIONS; add tests validating script shape, provenance refusals, idempotency, and NODE_OPTIONS behavior.

Changes

Gateway Recovery Self-Heal Mechanism

Layer / File(s)	Summary
Self-heal implementation and recovery script integration `src/lib/agent/runtime-recovery-preload.ts`, `src/lib/agent/runtime.ts`	Add `GATEWAY_PRELOAD_GUARDS` and `buildGatewayPreloadSelfHealLines()` which emit an installer function and per-preload call lines; wire emitted lines into both `buildOpenClawRecoveryScript()` and `buildRecoveryScript()` gated by `_PE_MISSING="0"` and update NODE_OPTIONS guard checks to exact `--require /tmp/...` paths.
Test harness and fixture setup `src/lib/agent/runtime-recovery-preload.test.ts`	Test helpers: minimal `AgentDefinition` factory, regex extraction of the self-heal block, per-test temp/source workspace creation, and fixture lifecycle helpers.
Probe execution helper `src/lib/agent/runtime-recovery-preload.test.ts`	`runProbe()` executes the rewritten self-heal snippet via `bash -c` with controlled `_PE_MISSING` and seeded `NODE_OPTIONS`, returning exit status/stdout/stderr.
Script structure and ordering tests `src/lib/agent/runtime-recovery-preload.test.ts`	Assert generated scripts include the installer function, are gated on `_PE_MISSING=0`, reference immutable trusted source paths, and run self-heal before proxy-env guard-refusal logic (including OpenClaw ordering and pinned `--require` checks).
Self-heal behavioral execution tests `src/lib/agent/runtime-recovery-preload.test.ts`	Execute rewritten snippets to assert filesystem effects and NODE_OPTIONS handling: create missing `/tmp` preloads with mode 444 and inject `--require`; preserve existing mode-444 files (no mtime/content change); refuse symlinks/non-444 modes; warn/skip when sources missing; treat marker substrings without `--require` as not installed; avoid duplicate `--require`; skip when `_PE_MISSING=1`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

#5262: Updates the same recovery script generation to regenerate missing preload-related NODE_OPTIONS entries; this change implements conditional self-heal wiring addressing that objective.

Suggested labels

area: sandbox

Suggested reviewers

cv
ericksoa

🐰 I stitched the preloads with careful paws,
I checked each bit and audited the cause.
When markers vanish and NODE_OPTIONS sighs,
I hop in, copy, seal, and tidy the ties.
Safe, idempotent—no duplicate cries.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(sandbox): self-heal recovery preload guards' clearly describes the main change - implementing self-healing logic for recovery preload guards in the sandbox environment.
Linked Issues check	✅ Passed	All changes directly address `#5253` requirements: recovering missing preload files from trusted sources, validating provenance, and safely restarting the gateway with proper guards while maintaining safety invariants.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing the recovery preload self-healing mechanism for `#5253`; no unrelated modifications to other subsystems were introduced.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/recover-preload-guard-self-heal

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-06-12T02:31:26Z

E2E Advisor Recommendation

Required E2E: gateway-guard-recovery, issue-2478-crash-loop-recovery-e2e, hermes-secret-boundary-e2e
Optional E2E: hermes-e2e, full

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gateway-guard-recovery (medium): Direct live Vitest scenario for recovery restoring the /tmp guard chain after guard files are wiped. This PR changes exactly that generated recovery logic and guard matching.
issue-2478-crash-loop-recovery-e2e (medium): Existing long-running bash E2E for gateway crash-loop recovery and NODE_OPTIONS guard preservation across repeated relaunches. Required because the PR changes recovery guard self-heal behavior and final refusal checks.
hermes-secret-boundary-e2e (medium): Validates Hermes sandbox secret boundary and startup refusal behavior in a container/image context. Required because Hermes recovery now sources proxy-env with trusted --require paths and still must refuse raw secret leakage without logging secret values.

Optional E2E

hermes-e2e (medium): Useful confidence that the modified generic non-OpenClaw recovery script still launches and recovers Hermes in the normal user flow, beyond the targeted secret-boundary checks.
full (medium): Broad install/onboard/sandbox/inference smoke for OpenClaw. Optional because the targeted guard-recovery E2Es cover the changed recovery behavior more directly.

New E2E recommendations

None.

github-actions · 2026-06-12T02:31:27Z

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: gateway-guard-recovery
Optional Vitest E2E scenarios: ubuntu-repo-cloud-openclaw

Dispatch required Vitest E2E scenarios:

gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=gateway-guard-recovery

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

gateway-guard-recovery: The PR changes gateway recovery script generation and NODE_OPTIONS preload guard self-healing in src/lib/agent/runtime.ts and the new runtime-recovery-preload helper. The existing free-standing Vitest live job gateway-guard-recovery is wired in e2e-vitest-scenarios.yaml to exercise gateway guard recovery behavior through test/e2e-scenario/live/gateway-guard-recovery.test.ts, making it the smallest live Vitest dispatch for this recovery/preload surface.
- Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=gateway-guard-recovery

Optional Vitest E2E scenarios

ubuntu-repo-cloud-openclaw: Optional baseline OpenClaw live scenario coverage because the recovery-script change affects default OpenClaw gateway relaunch paths beyond the focused gateway-guard-recovery job.
- Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Relevant changed files

src/lib/agent/runtime-recovery-preload.ts
src/lib/agent/runtime.ts
src/lib/agent/runtime-hermes-secret-boundary-behavioural.test.ts
src/lib/agent/runtime-recovery-preload.test.ts

github-actions · 2026-06-12T02:34:29Z

PR Review Advisor

Findings: 1 needs attention, 4 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 4 still apply, 1 new item found

Review findings

🛠️ Needs attention

Full Hermes recover acceptance path is still not proven (src/lib/agent/runtime-recovery-preload.test.ts:125): Issue [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 expects `nemohermes <name> recover` to recover a broken Hermes dashboard/gateway, restart the Hermes gateway with the safety-net and ciao preloads, re-establish `http://127.0.0.1:18789/\`, exit 0 with a clear restart message, and leave status/inference healthy. The changed tests cover generated script shape and an extracted self-heal shell fragment, but they do not run the composed Hermes recovery path, host port-forward recovery, dashboard reachability, command output, or post-recovery health checks.
- Recommendation: Add or identify targeted runtime/integration validation for the complete [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 path: start from a healthy Hermes sandbox or equivalent stubbed recovery harness, break dashboard/port-forward access, run the real recovery path, assert exact trusted preloads are applied before launch, assert gateway/dashboard recovery succeeds, assert the command returns success with the expected clear message, and assert dashboard curl plus status/inference remain healthy. If this PR intentionally fixes only the preload refusal mechanism, narrow the linked acceptance claim.
- Evidence: `describe("gateway recovery preload self-heal ([All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253)")` uses `makeFixture()` to regex-match `SELF_HEAL_RE` and `runProbe()` to execute only that extracted block under bash. No changed test executes `nemohermes hermes recover`, `process-recovery.ts`, host port-forward recovery, dashboard HTML reachability, or status/inference verification.

🔎 Worth checking

Non-root recovery can still trust a sandbox-controlled preload (src/lib/agent/runtime-recovery-preload.ts:80): The installer appends a pre-existing `/tmp/nemoclaw-*.js` file to `NODE_OPTIONS` after checking non-symlink, regular-file, and mode `444`, but it only checks ownership when recovery runs as uid 0. Non-root recovery runs as the sandbox user, and a sandbox-owned `0444` file can be chmod'd writable again by its owner and modified before Node loads it. Because these files are security preloads, accepting sandbox-controlled bytes weakens the sandbox policy boundary.
- Recommendation: Do not reuse pre-existing `/tmp` preload bytes unless they are proven to match the trusted source. Prefer always replacing the staged copy from `/usr/local/lib/nemoclaw/preloads/*` immediately before adding it to `NODE_OPTIONS`, or verify content equivalence/immutable ownership in non-root recovery. Add a non-root full recovery test where a sandbox-owned tampered `0444` `/tmp/nemoclaw-sandbox-safety-net.js` is refused or replaced before launch.
- Evidence: `if [ "$(id -u)" -eq 0 ]; then owner=... expected root ... fi;` gates the owner check to root mode, then `export NODE_OPTIONS="${NODE_OPTIONS:+$NODE_OPTIONS }--require $tmp"` appends the existing file. `scripts/lib/sandbox-init.sh` documents non-root `sandbox:sandbox 444` as best-effort because the owner can chmod it back.
Full recovery script composition remains under-tested (src/lib/agent/runtime-recovery-preload.test.ts:65): The new fragment tests are useful, but the security-critical behavior is composed inside the full generated recovery scripts. The full path includes health probing, hardened log setup, stale process cleanup, binary validation, proxy-env sourcing, self-heal, final guard refusal, Hermes runtime boundary checks, gateway launch, dashboard recovery, and port-forward recovery. Fragment extraction can miss ordering and caller/callee contract bugs across those boundaries.
- Recommendation: Add full-script behavioral tests with stubbed `curl`, `pkill`, `pgrep`, `stat`, `id`, gateway binaries, and log files for successful trusted-source install-to-launch, missing trusted sources refusing before launch, marker-only `NODE_OPTIONS` refusing without exact `--require` entries, wrong-owner root-mode refusal, non-root tampered-preload refusal or replacement, and persistence of provenance failures to the selected gateway log.
- Evidence: `makeFixture()` matches `SELF_HEAL_RE`, rewrites only the extracted block, and `runProbe()` prints `NODE_OPTIONS` after executing that block. The full `_PE_MISSING`, `_GUARDS_MISSING`, refusal, log setup, process cleanup, launch, Hermes dashboard recovery, and host-forward recovery control flow is not exercised by the new tests.
OpenClaw self-heal failures may not persist to the selected gateway log (src/lib/agent/runtime.ts:196): The OpenClaw recovery script now runs the self-heal block before the health probe, hardened log setup, and `_GATEWAY_LOG` selection. The helper writes provenance failures to `_GATEWAY_LOG` only when that variable is set, so OpenClaw self-heal failures can be visible only on stderr instead of the durable gateway log. That conflicts with the surrounding [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478 recovery invariant that recovery warnings/errors should survive for sysadmin tailing and with the helper tests, which inject `_GATEWAY_LOG` artificially.
- Recommendation: Move OpenClaw self-heal after hardened gateway log preparation/selection, or initialize a safe `_GATEWAY_LOG` before the self-heal block. Add a full OpenClaw recovery test where missing trusted sources or wrong-owner staged preloads persist the provenance message to the selected gateway log before refusing launch.
- Evidence: `buildOpenClawRecoveryScript()` sources `/tmp/nemoclaw-proxy-env.sh`, runs `buildGatewayPreloadSelfHealLines()`, and performs the final guard check before `_GW_CODE`, `buildGatewayLogSetup(...)`, and `buildGatewayLogSelection()`. The helper logs with `[ -n "${_GATEWAY_LOG:-}" ] && echo "$_msg" >> "$_GATEWAY_LOG"`.
Localized preload self-heal lacks source-of-truth closure (src/lib/agent/runtime-recovery-preload.ts:4): This is a localized recovery workaround for legacy sandboxes whose `proxy-env.sh` or staged preloads are missing required guard exports. The helper documents the invalid legacy state and trusted source paths, but it does not explain why the authoritative startup/proxy-env source cannot be fixed or regenerated here, does not add regression coverage for those source writers, and does not define a removal condition. Active overlapping work in fix(agent): regenerate proxy-env.sh guard chain during recovery (#2701) #5259 also touches `runtime.ts` around proxy-env regeneration during recovery.
- Recommendation: Document the source boundary and removal condition, and add source-level regression coverage that OpenClaw and Hermes startup/proxy-env writers emit exact safety-net and ciao `--require /tmp/nemoclaw-*.js` entries with non-writable sourced-file permissions. Coordinate with overlapping proxy-env regeneration work so this helper does not become a second long-term source of truth for guard installation.
- Evidence: `runtime-recovery-preload.ts` says it restores guards for a legacy sandbox whose proxy-env was emitted before the entrypoint emitted those guards. The authoritative writers live around `scripts/nemoclaw-start.sh`, `agents/hermes/start.sh`, and `scripts/lib/sandbox-init.sh`, while this PR tests only the localized recovery installer block.

🌱 Nice ideas

None.

Consider writing more tests for

**Runtime validation** — Full Hermes recovery script with legacy proxy-env present, no guard exports, and trusted preload sources available self-heals exact `--require /tmp/nemoclaw-*.js` entries and reaches the gateway launch branch.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
**Runtime validation** — Full Hermes recovery path after dashboard port-forward is killed re-establishes `http://127.0.0.1:18789/\`, returns success, and leaves status/inference healthy.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
**Runtime validation** — Full Hermes recovery with marker-only `NODE_OPTIONS`, absent trusted sources, and no exact `/tmp/nemoclaw-*.js` `--require` entries exits before launch with `refusing unguarded gateway relaunch`.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
**Runtime validation** — Full non-root Hermes recovery with a pre-existing sandbox-owned tampered `0444` `/tmp/nemoclaw-sandbox-safety-net.js` refuses or replaces it from the trusted source before appending to `NODE_OPTIONS`.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
**Runtime validation** — Full OpenClaw root-mode recovery with wrong-owner `/tmp/nemoclaw-ciao-network-guard.js` refuses before launch and persists the owner error to the selected gateway log.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
**Full recovery script composition remains under-tested** — Add full-script behavioral tests with stubbed `curl`, `pkill`, `pgrep`, `stat`, `id`, gateway binaries, and log files for successful trusted-source install-to-launch, missing trusted sources refusing before launch, marker-only `NODE_OPTIONS` refusing without exact `--require` entries, wrong-owner root-mode refusal, non-root tampered-preload refusal or replacement, and persistence of provenance failures to the selected gateway log.
**Acceptance clause:** Hermes sandbox exists and is healthy: — add test evidence or identify existing coverage. No changed test sets up or simulates a healthy Hermes sandbox before recovery; tests synthesize an `AgentDefinition` and execute only an extracted self-heal block.
**Acceptance clause:** `curl -sf http://127.0.0.1:18789/\` returns `curl-exit:0` and dashboard HTML. — add test evidence or identify existing coverage. No changed test probes host dashboard reachability or dashboard HTML before or after recovery.

Since last review details

Current findings:

Full Hermes recover acceptance path is still not proven (src/lib/agent/runtime-recovery-preload.test.ts:125): Issue [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 expects `nemohermes <name> recover` to recover a broken Hermes dashboard/gateway, restart the Hermes gateway with the safety-net and ciao preloads, re-establish `http://127.0.0.1:18789/\`, exit 0 with a clear restart message, and leave status/inference healthy. The changed tests cover generated script shape and an extracted self-heal shell fragment, but they do not run the composed Hermes recovery path, host port-forward recovery, dashboard reachability, command output, or post-recovery health checks.
- Recommendation: Add or identify targeted runtime/integration validation for the complete [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 path: start from a healthy Hermes sandbox or equivalent stubbed recovery harness, break dashboard/port-forward access, run the real recovery path, assert exact trusted preloads are applied before launch, assert gateway/dashboard recovery succeeds, assert the command returns success with the expected clear message, and assert dashboard curl plus status/inference remain healthy. If this PR intentionally fixes only the preload refusal mechanism, narrow the linked acceptance claim.
- Evidence: `describe("gateway recovery preload self-heal ([All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253)")` uses `makeFixture()` to regex-match `SELF_HEAL_RE` and `runProbe()` to execute only that extracted block under bash. No changed test executes `nemohermes hermes recover`, `process-recovery.ts`, host port-forward recovery, dashboard HTML reachability, or status/inference verification.
Non-root recovery can still trust a sandbox-controlled preload (src/lib/agent/runtime-recovery-preload.ts:80): The installer appends a pre-existing `/tmp/nemoclaw-*.js` file to `NODE_OPTIONS` after checking non-symlink, regular-file, and mode `444`, but it only checks ownership when recovery runs as uid 0. Non-root recovery runs as the sandbox user, and a sandbox-owned `0444` file can be chmod'd writable again by its owner and modified before Node loads it. Because these files are security preloads, accepting sandbox-controlled bytes weakens the sandbox policy boundary.
- Recommendation: Do not reuse pre-existing `/tmp` preload bytes unless they are proven to match the trusted source. Prefer always replacing the staged copy from `/usr/local/lib/nemoclaw/preloads/*` immediately before adding it to `NODE_OPTIONS`, or verify content equivalence/immutable ownership in non-root recovery. Add a non-root full recovery test where a sandbox-owned tampered `0444` `/tmp/nemoclaw-sandbox-safety-net.js` is refused or replaced before launch.
- Evidence: `if [ "$(id -u)" -eq 0 ]; then owner=... expected root ... fi;` gates the owner check to root mode, then `export NODE_OPTIONS="${NODE_OPTIONS:+$NODE_OPTIONS }--require $tmp"` appends the existing file. `scripts/lib/sandbox-init.sh` documents non-root `sandbox:sandbox 444` as best-effort because the owner can chmod it back.
Full recovery script composition remains under-tested (src/lib/agent/runtime-recovery-preload.test.ts:65): The new fragment tests are useful, but the security-critical behavior is composed inside the full generated recovery scripts. The full path includes health probing, hardened log setup, stale process cleanup, binary validation, proxy-env sourcing, self-heal, final guard refusal, Hermes runtime boundary checks, gateway launch, dashboard recovery, and port-forward recovery. Fragment extraction can miss ordering and caller/callee contract bugs across those boundaries.
- Recommendation: Add full-script behavioral tests with stubbed `curl`, `pkill`, `pgrep`, `stat`, `id`, gateway binaries, and log files for successful trusted-source install-to-launch, missing trusted sources refusing before launch, marker-only `NODE_OPTIONS` refusing without exact `--require` entries, wrong-owner root-mode refusal, non-root tampered-preload refusal or replacement, and persistence of provenance failures to the selected gateway log.
- Evidence: `makeFixture()` matches `SELF_HEAL_RE`, rewrites only the extracted block, and `runProbe()` prints `NODE_OPTIONS` after executing that block. The full `_PE_MISSING`, `_GUARDS_MISSING`, refusal, log setup, process cleanup, launch, Hermes dashboard recovery, and host-forward recovery control flow is not exercised by the new tests.
OpenClaw self-heal failures may not persist to the selected gateway log (src/lib/agent/runtime.ts:196): The OpenClaw recovery script now runs the self-heal block before the health probe, hardened log setup, and `_GATEWAY_LOG` selection. The helper writes provenance failures to `_GATEWAY_LOG` only when that variable is set, so OpenClaw self-heal failures can be visible only on stderr instead of the durable gateway log. That conflicts with the surrounding [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478 recovery invariant that recovery warnings/errors should survive for sysadmin tailing and with the helper tests, which inject `_GATEWAY_LOG` artificially.
- Recommendation: Move OpenClaw self-heal after hardened gateway log preparation/selection, or initialize a safe `_GATEWAY_LOG` before the self-heal block. Add a full OpenClaw recovery test where missing trusted sources or wrong-owner staged preloads persist the provenance message to the selected gateway log before refusing launch.
- Evidence: `buildOpenClawRecoveryScript()` sources `/tmp/nemoclaw-proxy-env.sh`, runs `buildGatewayPreloadSelfHealLines()`, and performs the final guard check before `_GW_CODE`, `buildGatewayLogSetup(...)`, and `buildGatewayLogSelection()`. The helper logs with `[ -n "${_GATEWAY_LOG:-}" ] && echo "$_msg" >> "$_GATEWAY_LOG"`.
Localized preload self-heal lacks source-of-truth closure (src/lib/agent/runtime-recovery-preload.ts:4): This is a localized recovery workaround for legacy sandboxes whose `proxy-env.sh` or staged preloads are missing required guard exports. The helper documents the invalid legacy state and trusted source paths, but it does not explain why the authoritative startup/proxy-env source cannot be fixed or regenerated here, does not add regression coverage for those source writers, and does not define a removal condition. Active overlapping work in fix(agent): regenerate proxy-env.sh guard chain during recovery (#2701) #5259 also touches `runtime.ts` around proxy-env regeneration during recovery.
- Recommendation: Document the source boundary and removal condition, and add source-level regression coverage that OpenClaw and Hermes startup/proxy-env writers emit exact safety-net and ciao `--require /tmp/nemoclaw-*.js` entries with non-writable sourced-file permissions. Coordinate with overlapping proxy-env regeneration work so this helper does not become a second long-term source of truth for guard installation.
- Evidence: `runtime-recovery-preload.ts` says it restores guards for a legacy sandbox whose proxy-env was emitted before the entrypoint emitted those guards. The authoritative writers live around `scripts/nemoclaw-start.sh`, `agents/hermes/start.sh`, and `scripts/lib/sandbox-init.sh`, while this PR tests only the localized recovery installer block.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/agent/runtime.test.ts`:
- Around line 287-314: The test "self-heal lines install both --require preloads
into NODE_OPTIONS under bash" currently calls spawnSync("bash", ["-c", probe],
...) which inherits the parent environment and can pick up pre-existing
NODE_OPTIONS; update the spawnSync invocation to pass an explicit env object
that clones process.env but sets NODE_OPTIONS to an empty string (or undefined)
so the shell probe runs with a clean NODE_OPTIONS; locate the spawnSync call in
this test and add the env override to ensure deterministic behavior.

In `@src/lib/agent/runtime.ts`:
- Around line 141-145: The self-heal preload lines produced by
buildGatewayPreloadSelfHealLines() are always spliced in, causing NODE_OPTIONS
to be rewritten even when preload files are missing; gate that recovery behavior
on the _PE_MISSING env flag by only emitting the self-heal lines when
_PE_MISSING is set to "0" (or by wrapping the generated shell snippet in a test
like [ "${_PE_MISSING:-}" = "0" ] && ...). Update
buildGatewayPreloadSelfHealLines() to implement this guard and make the same
change where the same snippet is injected (the other splices referenced in the
diff around the code that merges GATEWAY_PRELOAD_GUARDS at the other sites) so
the warning about “proxy-env exists but NODE_OPTIONS is stale” remains accurate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ee3b1c6e-ea3b-485d-b8b1-01bb323b65d7

📥 Commits

Reviewing files that changed from the base of the PR and between fa09629 and 8b3adec.

📒 Files selected for processing (2)

src/lib/agent/runtime.test.ts
src/lib/agent/runtime.ts

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

…PTIONS Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

…uard-self-heal

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions · 2026-06-12T05:11:49Z

Selective E2E Results — ❌ Some jobs failed

Run: 27395540540
Target ref: 00fdcaa2b37cd404c73b4b70b814f037dfb91127
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 cancelled, 0 skipped

Job	Result
issue-2478-crash-loop-recovery-e2e	❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions · 2026-06-12T05:31:20Z

Selective E2E Results — ❌ Some jobs failed

Run: 27396243506
Target ref: 4e2e94cc7186520422b85d22d20e849f490e6d0e
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 cancelled, 0 skipped

Job	Result
issue-2478-crash-loop-recovery-e2e	❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

…uire guard Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

cv · 2026-06-12T06:17:40Z

Closing this in favor of #5321. The replacement carries forward the trusted packaged preload/provenance approach here and adds the missing _PE_MISSING=1 restoration path, so crash-loop recovery can recreate proxy-env.sh after a pod/tmp wipe. Thanks for the clean base.

## Summary This PR addresses the shared gateway recovery-state failure from #2701 by restoring the guard chain from trusted packaged preload sources when `/tmp/nemoclaw-proxy-env.sh` is missing or incomplete. Recovery now recreates a minimal proxy-env file for the critical safety-net and ciao preloads, validates exact `--require` entries, and refuses an unguarded relaunch if trusted staging fails. ## Related Issue Addresses the shared guard-chain recovery failure in #2701; does not close the broader hardware/provider/recreate-trigger validation matrix. Refs #2701 Refs #2478 Supersedes #5259 Supersedes #5265 ## Changes - Added a shared recovery helper that stages safety-net and ciao preloads from `/usr/local/lib/nemoclaw/preloads/` into hardened `/tmp` files. - Wired OpenClaw and agent-specific gateway recovery through the helper instead of the old warning-only missing-proxy-env path. - Added shell-executed tests for missing proxy-env restoration, exact `--require` matching, symlink replacement, missing trusted sources, and Hermes recovery harness integration. - Documented the live E2E coverage scope: production `connect --probe-only` / sandbox-exec recovery after a pod-recreate-equivalent guard-chain wipe, with DGX Spark / GB10 / aarch64, provider breadth, and destructive recreate triggers intentionally left for dedicated platform validation. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ---  Signed-off-by: Carlos Villela <cvillela@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Recovery scripts now stage and validate trusted preload modules, restore guard chains early, and hard-fail if critical guards remain missing. * Added test harness helpers to simulate trusted preload sources and rewrite recovery scripts for E2E tests. * **Bug Fixes** * Hardened proxy-env handling: reject symlinks, enforce strict permissions/ownership, avoid sourcing attacker-controlled content, and prevent duplicate/unsafe preload entries. * **Tests** * Expanded E2E and unit coverage with ordering, permission/symlink scenarios, logging and PID-stability assertions.  --------- Signed-off-by: Carlos Villela <cvillela@nvidia.com>

fix(sandbox): self-heal recovery preload guards

8b3adec

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the bug-fix PR fixes a bug or regression label Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread src/lib/agent/runtime.test.ts Outdated

Comment thread src/lib/agent/runtime.ts Outdated

style(test): apply biome formatting to recovery preload tests

c01dfe2

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added integration: hermes Hermes integration behavior area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow labels Jun 12, 2026

laitingsheng added 4 commits June 12, 2026 04:11

fix(sandbox): require provenance before grafting preloads into NODE_O…

4873978

…PTIONS Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Merge remote-tracking branch 'origin/main' into fix/recover-preload-g…

3faa2ca

…uard-self-heal

refactor(sandbox): extract gateway recovery preload self-heal builder

d91698e

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

refactor(sandbox): hoist _PE_MISSING gate to recovery splice sites

00fdcaa

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(sandbox): pin recovery guard check to trusted --require path

4e2e94c

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

test(sandbox): align hermes secret-boundary fixtures with exact --req…

b362700

…uire guard Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

cv mentioned this pull request Jun 12, 2026

fix(agent): restore gateway guard chain during recovery #5321

Merged

12 tasks

cv closed this Jun 12, 2026

Conversation

laitingsheng commented Jun 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Vitest E2E Scenario Recommendation

Vitest E2E Scenario Advisor

Required Vitest E2E scenarios

Optional Vitest E2E scenarios

Relevant changed files

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Advisor

🛠️ Needs attention

🔎 Worth checking

🌱 Nice ideas

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 12, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

github-actions Bot commented Jun 12, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

cv commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

laitingsheng commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

github-actions Bot commented Jun 12, 2026 •

edited

Loading

github-actions Bot commented Jun 12, 2026 •

edited

Loading

github-actions Bot commented Jun 12, 2026 •

edited

Loading