Skip to content

fix(sandbox): self-heal recovery preload guards#5265

Closed
laitingsheng wants to merge 8 commits into
mainfrom
fix/recover-preload-guard-self-heal
Closed

fix(sandbox): self-heal recovery preload guards#5265
laitingsheng wants to merge 8 commits into
mainfrom
fix/recover-preload-guard-self-heal

Conversation

@laitingsheng

@laitingsheng laitingsheng commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

The gateway recovery script aborts with refusing unguarded gateway relaunch when /tmp/nemoclaw-proxy-env.sh is present but does not install the safety-net / ciao --require preload guards into NODE_OPTIONS. Sandboxes onboarded by an older entrypoint that emitted only proxy / OPENCLAW_* exports into that file — or that have lost the staged .js preloads from /tmp — therefore cannot recover. Recovery now regenerates the staged copies from the immutable /usr/local/lib/nemoclaw/preloads/ source files, verifies provenance (regular non-symlink, mode 444, root-owned when running as uid 0) before grafting any path into NODE_OPTIONS, pins the trailing _GUARDS_MISSING check to the exact --require /tmp/<preload>.js token, and keeps the existing refusal as the final invariant when provenance fails.

This PR fixes the in-sandbox refusal mechanism — provenance, trusted-source install, and the exact-path guard check. End-to-end coverage of gateway recovery (process tree, dashboard reachability, inference healthy) is exercised by the existing test/e2e/test-issue-2478-crash-loop-recovery.sh soak; the unit tests added here pin the new boundary inside that flow.

Related Issue

Fixes #5253

Changes

  • src/lib/agent/runtime-recovery-preload.ts — new module. Hosts GATEWAY_PRELOAD_GUARDS (pairs each /tmp/<preload>.js with its immutable /usr/local/lib/nemoclaw/preloads/<basename>.js source) and buildGatewayPreloadSelfHealLines(). The latter emits a _nemoclaw_install_recovery_preload shell function that (a) recreates the staged /tmp copy from the trusted source via stage-and-rename when missing, mirroring emit_sandbox_sourced_file; (b) refuses to add an entry to NODE_OPTIONS unless the staged copy is a regular non-symlink with mode 444 (and root-owned when uid 0); (c) only grafts --require <path> once that exact path is not already present, so a stale marker substring no longer satisfies the check.
  • src/lib/agent/runtime.ts — splice the self-heal into both buildOpenClawRecoveryScript and buildRecoveryScript immediately after sourcing /tmp/nemoclaw-proxy-env.sh, wrapped in if [ "$_PE_MISSING" = "0" ]; then … fi; at the call site so the conditional is visible to reviewers. Tighten the trailing _GUARDS_MISSING check from a marker substring to the exact --require /tmp/<preload>.js token, closing the bypass where a NODE_OPTIONS that mentioned the marker text but lacked the real --require entry was treated as guarded.
  • src/lib/agent/runtime-recovery-preload.test.ts — new dedicated file. Shape tests pin the installer function, the _PE_MISSING=0 gate at each splice site, the trusted source paths, the ordering across both buildRecoveryScript and buildOpenClawRecoveryScript, and the exact-path form of the trailing guard check. Behavioural tests spawn bash against the extracted block with stubbed paths and explicit NODE_OPTIONS="" env override and cover: install-from-source on a clean /tmp, reuse of an already-staged mode 444 file, refusal of a symlink target, refusal of a mode 666 target, warn-and-skip when both staged and source are absent, marker-substring-without---require still installs, idempotency when both --require entries are already present, and _PE_MISSING=1 bypassing the block entirely.
  • src/lib/agent/runtime.test.ts — drops the obsolete shape and bash-probe tests that asserted the previous [ -r <path> ] form; the surviving #2478 tests in this file continue to pin the proxy-env sourcing, refusal, and gateway-log invariants.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Recovery scripts now include a gated self-heal step that stages trusted preload files into secure temporary locations and ensures safe application to Node startup.
  • Bug Fixes

    • Improved startup resilience by restoring missing preload entries before other recovery checks, avoiding duplicates and honoring security/provenance checks.
  • Tests

    • Added end-to-end tests covering generation, execution, permission checks, idempotence, gating behavior, and various edge cases.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Emit guarded shell fragments that stage trusted preload files into /tmp with provenance checks and idempotently append matching --require entries to NODE_OPTIONS; add tests validating script shape, provenance refusals, idempotency, and NODE_OPTIONS behavior.

Changes

Gateway Recovery Self-Heal Mechanism

Layer / File(s) Summary
Self-heal implementation and recovery script integration
src/lib/agent/runtime-recovery-preload.ts, src/lib/agent/runtime.ts
Add GATEWAY_PRELOAD_GUARDS and buildGatewayPreloadSelfHealLines() which emit an installer function and per-preload call lines; wire emitted lines into both buildOpenClawRecoveryScript() and buildRecoveryScript() gated by _PE_MISSING="0" and update NODE_OPTIONS guard checks to exact --require /tmp/... paths.
Test harness and fixture setup
src/lib/agent/runtime-recovery-preload.test.ts
Test helpers: minimal AgentDefinition factory, regex extraction of the self-heal block, per-test temp/source workspace creation, and fixture lifecycle helpers.
Probe execution helper
src/lib/agent/runtime-recovery-preload.test.ts
runProbe() executes the rewritten self-heal snippet via bash -c with controlled _PE_MISSING and seeded NODE_OPTIONS, returning exit status/stdout/stderr.
Script structure and ordering tests
src/lib/agent/runtime-recovery-preload.test.ts
Assert generated scripts include the installer function, are gated on _PE_MISSING=0, reference immutable trusted source paths, and run self-heal before proxy-env guard-refusal logic (including OpenClaw ordering and pinned --require checks).
Self-heal behavioral execution tests
src/lib/agent/runtime-recovery-preload.test.ts
Execute rewritten snippets to assert filesystem effects and NODE_OPTIONS handling: create missing /tmp preloads with mode 444 and inject --require; preserve existing mode-444 files (no mtime/content change); refuse symlinks/non-444 modes; warn/skip when sources missing; treat marker substrings without --require as not installed; avoid duplicate --require; skip when _PE_MISSING=1.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #5262: Updates the same recovery script generation to regenerate missing preload-related NODE_OPTIONS entries; this change implements conditional self-heal wiring addressing that objective.

Suggested labels

area: sandbox

Suggested reviewers

  • cv
  • ericksoa

🐰 I stitched the preloads with careful paws,
I checked each bit and audited the cause.
When markers vanish and NODE_OPTIONS sighs,
I hop in, copy, seal, and tidy the ties.
Safe, idempotent—no duplicate cries.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(sandbox): self-heal recovery preload guards' clearly describes the main change - implementing self-healing logic for recovery preload guards in the sandbox environment.
Linked Issues check ✅ Passed All changes directly address #5253 requirements: recovering missing preload files from trusted sources, validating provenance, and safely restarting the gateway with proper guards while maintaining safety invariants.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the recovery preload self-healing mechanism for #5253; no unrelated modifications to other subsystems were introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/recover-preload-guard-self-heal

Comment @coderabbitai help to get the list of available commands and usage tips.

@laitingsheng laitingsheng added the bug-fix PR fixes a bug or regression label Jun 12, 2026
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: gateway-guard-recovery, issue-2478-crash-loop-recovery-e2e, hermes-secret-boundary-e2e
Optional E2E: hermes-e2e, full

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gateway-guard-recovery (medium): Direct live Vitest scenario for recovery restoring the /tmp guard chain after guard files are wiped. This PR changes exactly that generated recovery logic and guard matching.
  • issue-2478-crash-loop-recovery-e2e (medium): Existing long-running bash E2E for gateway crash-loop recovery and NODE_OPTIONS guard preservation across repeated relaunches. Required because the PR changes recovery guard self-heal behavior and final refusal checks.
  • hermes-secret-boundary-e2e (medium): Validates Hermes sandbox secret boundary and startup refusal behavior in a container/image context. Required because Hermes recovery now sources proxy-env with trusted --require paths and still must refuse raw secret leakage without logging secret values.

Optional E2E

  • hermes-e2e (medium): Useful confidence that the modified generic non-OpenClaw recovery script still launches and recovers Hermes in the normal user flow, beyond the targeted secret-boundary checks.
  • full (medium): Broad install/onboard/sandbox/inference smoke for OpenClaw. Optional because the targeted guard-recovery E2Es cover the changed recovery behavior more directly.

New E2E recommendations

  • None.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: gateway-guard-recovery
Optional Vitest E2E scenarios: ubuntu-repo-cloud-openclaw

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=gateway-guard-recovery

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • gateway-guard-recovery: The PR changes gateway recovery script generation and NODE_OPTIONS preload guard self-healing in src/lib/agent/runtime.ts and the new runtime-recovery-preload helper. The existing free-standing Vitest live job gateway-guard-recovery is wired in e2e-vitest-scenarios.yaml to exercise gateway guard recovery behavior through test/e2e-scenario/live/gateway-guard-recovery.test.ts, making it the smallest live Vitest dispatch for this recovery/preload surface.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=gateway-guard-recovery

Optional Vitest E2E scenarios

  • ubuntu-repo-cloud-openclaw: Optional baseline OpenClaw live scenario coverage because the recovery-script change affects default OpenClaw gateway relaunch paths beyond the focused gateway-guard-recovery job.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Relevant changed files

  • src/lib/agent/runtime-recovery-preload.ts
  • src/lib/agent/runtime.ts
  • src/lib/agent/runtime-hermes-secret-boundary-behavioural.test.ts
  • src/lib/agent/runtime-recovery-preload.test.ts

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 1 needs attention, 4 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 4 still apply, 1 new item found

Review findings

🛠️ Needs attention

🔎 Worth checking

  • Non-root recovery can still trust a sandbox-controlled preload (src/lib/agent/runtime-recovery-preload.ts:80): The installer appends a pre-existing `/tmp/nemoclaw-*.js` file to `NODE_OPTIONS` after checking non-symlink, regular-file, and mode `444`, but it only checks ownership when recovery runs as uid 0. Non-root recovery runs as the sandbox user, and a sandbox-owned `0444` file can be chmod'd writable again by its owner and modified before Node loads it. Because these files are security preloads, accepting sandbox-controlled bytes weakens the sandbox policy boundary.
    • Recommendation: Do not reuse pre-existing `/tmp` preload bytes unless they are proven to match the trusted source. Prefer always replacing the staged copy from `/usr/local/lib/nemoclaw/preloads/*` immediately before adding it to `NODE_OPTIONS`, or verify content equivalence/immutable ownership in non-root recovery. Add a non-root full recovery test where a sandbox-owned tampered `0444` `/tmp/nemoclaw-sandbox-safety-net.js` is refused or replaced before launch.
    • Evidence: `if [ "$(id -u)" -eq 0 ]; then owner=... expected root ... fi;` gates the owner check to root mode, then `export NODE_OPTIONS="${NODE_OPTIONS:+$NODE_OPTIONS }--require $tmp"` appends the existing file. `scripts/lib/sandbox-init.sh` documents non-root `sandbox:sandbox 444` as best-effort because the owner can chmod it back.
  • Full recovery script composition remains under-tested (src/lib/agent/runtime-recovery-preload.test.ts:65): The new fragment tests are useful, but the security-critical behavior is composed inside the full generated recovery scripts. The full path includes health probing, hardened log setup, stale process cleanup, binary validation, proxy-env sourcing, self-heal, final guard refusal, Hermes runtime boundary checks, gateway launch, dashboard recovery, and port-forward recovery. Fragment extraction can miss ordering and caller/callee contract bugs across those boundaries.
    • Recommendation: Add full-script behavioral tests with stubbed `curl`, `pkill`, `pgrep`, `stat`, `id`, gateway binaries, and log files for successful trusted-source install-to-launch, missing trusted sources refusing before launch, marker-only `NODE_OPTIONS` refusing without exact `--require` entries, wrong-owner root-mode refusal, non-root tampered-preload refusal or replacement, and persistence of provenance failures to the selected gateway log.
    • Evidence: `makeFixture()` matches `SELF_HEAL_RE`, rewrites only the extracted block, and `runProbe()` prints `NODE_OPTIONS` after executing that block. The full `_PE_MISSING`, `_GUARDS_MISSING`, refusal, log setup, process cleanup, launch, Hermes dashboard recovery, and host-forward recovery control flow is not exercised by the new tests.
  • OpenClaw self-heal failures may not persist to the selected gateway log (src/lib/agent/runtime.ts:196): The OpenClaw recovery script now runs the self-heal block before the health probe, hardened log setup, and `_GATEWAY_LOG` selection. The helper writes provenance failures to `_GATEWAY_LOG` only when that variable is set, so OpenClaw self-heal failures can be visible only on stderr instead of the durable gateway log. That conflicts with the surrounding [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478 recovery invariant that recovery warnings/errors should survive for sysadmin tailing and with the helper tests, which inject `_GATEWAY_LOG` artificially.
    • Recommendation: Move OpenClaw self-heal after hardened gateway log preparation/selection, or initialize a safe `_GATEWAY_LOG` before the self-heal block. Add a full OpenClaw recovery test where missing trusted sources or wrong-owner staged preloads persist the provenance message to the selected gateway log before refusing launch.
    • Evidence: `buildOpenClawRecoveryScript()` sources `/tmp/nemoclaw-proxy-env.sh`, runs `buildGatewayPreloadSelfHealLines()`, and performs the final guard check before `_GW_CODE`, `buildGatewayLogSetup(...)`, and `buildGatewayLogSelection()`. The helper logs with `[ -n "${_GATEWAY_LOG:-}" ] && echo "$_msg" >> "$_GATEWAY_LOG"`.
  • Localized preload self-heal lacks source-of-truth closure (src/lib/agent/runtime-recovery-preload.ts:4): This is a localized recovery workaround for legacy sandboxes whose `proxy-env.sh` or staged preloads are missing required guard exports. The helper documents the invalid legacy state and trusted source paths, but it does not explain why the authoritative startup/proxy-env source cannot be fixed or regenerated here, does not add regression coverage for those source writers, and does not define a removal condition. Active overlapping work in fix(agent): regenerate proxy-env.sh guard chain during recovery (#2701) #5259 also touches `runtime.ts` around proxy-env regeneration during recovery.
    • Recommendation: Document the source boundary and removal condition, and add source-level regression coverage that OpenClaw and Hermes startup/proxy-env writers emit exact safety-net and ciao `--require /tmp/nemoclaw-*.js` entries with non-writable sourced-file permissions. Coordinate with overlapping proxy-env regeneration work so this helper does not become a second long-term source of truth for guard installation.
    • Evidence: `runtime-recovery-preload.ts` says it restores guards for a legacy sandbox whose proxy-env was emitted before the entrypoint emitted those guards. The authoritative writers live around `scripts/nemoclaw-start.sh`, `agents/hermes/start.sh`, and `scripts/lib/sandbox-init.sh`, while this PR tests only the localized recovery installer block.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Full Hermes recovery script with legacy proxy-env present, no guard exports, and trusted preload sources available self-heals exact `--require /tmp/nemoclaw-*.js` entries and reaches the gateway launch branch.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
  • **Runtime validation** — Full Hermes recovery path after dashboard port-forward is killed re-establishes `http://127.0.0.1:18789/\`, returns success, and leaves status/inference healthy.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
  • **Runtime validation** — Full Hermes recovery with marker-only `NODE_OPTIONS`, absent trusted sources, and no exact `/tmp/nemoclaw-*.js` `--require` entries exits before launch with `refusing unguarded gateway relaunch`.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
  • **Runtime validation** — Full non-root Hermes recovery with a pre-existing sandbox-owned tampered `0444` `/tmp/nemoclaw-sandbox-safety-net.js` refuses or replaces it from the trusted source before appending to `NODE_OPTIONS`.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
  • **Runtime validation** — Full OpenClaw root-mode recovery with wrong-owner `/tmp/nemoclaw-ciao-network-guard.js` refuses before launch and persists the owner error to the selected gateway log.. The changed code generates shell executed across sandbox recovery, `NODE_OPTIONS` preload trust boundaries, process cleanup, logging, Hermes launch/dashboard recovery, and host port-forward recovery. Unit and shell-fragment tests are useful but do not cover the composed runtime behavior or the linked [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 acceptance path.
  • **Full recovery script composition remains under-tested** — Add full-script behavioral tests with stubbed `curl`, `pkill`, `pgrep`, `stat`, `id`, gateway binaries, and log files for successful trusted-source install-to-launch, missing trusted sources refusing before launch, marker-only `NODE_OPTIONS` refusing without exact `--require` entries, wrong-owner root-mode refusal, non-root tampered-preload refusal or replacement, and persistence of provenance failures to the selected gateway log.
  • **Acceptance clause:** Hermes sandbox exists and is healthy: — add test evidence or identify existing coverage. No changed test sets up or simulates a healthy Hermes sandbox before recovery; tests synthesize an `AgentDefinition` and execute only an extracted self-heal block.
  • **Acceptance clause:** `curl -sf http://127.0.0.1:18789/\` returns `curl-exit:0` and dashboard HTML. — add test evidence or identify existing coverage. No changed test probes host dashboard reachability or dashboard HTML before or after recovery.
Since last review details

Current findings:

  • Full Hermes recover acceptance path is still not proven (src/lib/agent/runtime-recovery-preload.test.ts:125): Issue [All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload #5253 expects `nemohermes <name> recover` to recover a broken Hermes dashboard/gateway, restart the Hermes gateway with the safety-net and ciao preloads, re-establish `http://127.0.0.1:18789/\`, exit 0 with a clear restart message, and leave status/inference healthy. The changed tests cover generated script shape and an extracted self-heal shell fragment, but they do not run the composed Hermes recovery path, host port-forward recovery, dashboard reachability, command output, or post-recovery health checks.
  • Non-root recovery can still trust a sandbox-controlled preload (src/lib/agent/runtime-recovery-preload.ts:80): The installer appends a pre-existing `/tmp/nemoclaw-*.js` file to `NODE_OPTIONS` after checking non-symlink, regular-file, and mode `444`, but it only checks ownership when recovery runs as uid 0. Non-root recovery runs as the sandbox user, and a sandbox-owned `0444` file can be chmod'd writable again by its owner and modified before Node loads it. Because these files are security preloads, accepting sandbox-controlled bytes weakens the sandbox policy boundary.
    • Recommendation: Do not reuse pre-existing `/tmp` preload bytes unless they are proven to match the trusted source. Prefer always replacing the staged copy from `/usr/local/lib/nemoclaw/preloads/*` immediately before adding it to `NODE_OPTIONS`, or verify content equivalence/immutable ownership in non-root recovery. Add a non-root full recovery test where a sandbox-owned tampered `0444` `/tmp/nemoclaw-sandbox-safety-net.js` is refused or replaced before launch.
    • Evidence: `if [ "$(id -u)" -eq 0 ]; then owner=... expected root ... fi;` gates the owner check to root mode, then `export NODE_OPTIONS="${NODE_OPTIONS:+$NODE_OPTIONS }--require $tmp"` appends the existing file. `scripts/lib/sandbox-init.sh` documents non-root `sandbox:sandbox 444` as best-effort because the owner can chmod it back.
  • Full recovery script composition remains under-tested (src/lib/agent/runtime-recovery-preload.test.ts:65): The new fragment tests are useful, but the security-critical behavior is composed inside the full generated recovery scripts. The full path includes health probing, hardened log setup, stale process cleanup, binary validation, proxy-env sourcing, self-heal, final guard refusal, Hermes runtime boundary checks, gateway launch, dashboard recovery, and port-forward recovery. Fragment extraction can miss ordering and caller/callee contract bugs across those boundaries.
    • Recommendation: Add full-script behavioral tests with stubbed `curl`, `pkill`, `pgrep`, `stat`, `id`, gateway binaries, and log files for successful trusted-source install-to-launch, missing trusted sources refusing before launch, marker-only `NODE_OPTIONS` refusing without exact `--require` entries, wrong-owner root-mode refusal, non-root tampered-preload refusal or replacement, and persistence of provenance failures to the selected gateway log.
    • Evidence: `makeFixture()` matches `SELF_HEAL_RE`, rewrites only the extracted block, and `runProbe()` prints `NODE_OPTIONS` after executing that block. The full `_PE_MISSING`, `_GUARDS_MISSING`, refusal, log setup, process cleanup, launch, Hermes dashboard recovery, and host-forward recovery control flow is not exercised by the new tests.
  • OpenClaw self-heal failures may not persist to the selected gateway log (src/lib/agent/runtime.ts:196): The OpenClaw recovery script now runs the self-heal block before the health probe, hardened log setup, and `_GATEWAY_LOG` selection. The helper writes provenance failures to `_GATEWAY_LOG` only when that variable is set, so OpenClaw self-heal failures can be visible only on stderr instead of the durable gateway log. That conflicts with the surrounding [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478 recovery invariant that recovery warnings/errors should survive for sysadmin tailing and with the helper tests, which inject `_GATEWAY_LOG` artificially.
    • Recommendation: Move OpenClaw self-heal after hardened gateway log preparation/selection, or initialize a safe `_GATEWAY_LOG` before the self-heal block. Add a full OpenClaw recovery test where missing trusted sources or wrong-owner staged preloads persist the provenance message to the selected gateway log before refusing launch.
    • Evidence: `buildOpenClawRecoveryScript()` sources `/tmp/nemoclaw-proxy-env.sh`, runs `buildGatewayPreloadSelfHealLines()`, and performs the final guard check before `_GW_CODE`, `buildGatewayLogSetup(...)`, and `buildGatewayLogSelection()`. The helper logs with `[ -n "${_GATEWAY_LOG:-}" ] && echo "$_msg" >> "$_GATEWAY_LOG"`.
  • Localized preload self-heal lacks source-of-truth closure (src/lib/agent/runtime-recovery-preload.ts:4): This is a localized recovery workaround for legacy sandboxes whose `proxy-env.sh` or staged preloads are missing required guard exports. The helper documents the invalid legacy state and trusted source paths, but it does not explain why the authoritative startup/proxy-env source cannot be fixed or regenerated here, does not add regression coverage for those source writers, and does not define a removal condition. Active overlapping work in fix(agent): regenerate proxy-env.sh guard chain during recovery (#2701) #5259 also touches `runtime.ts` around proxy-env regeneration during recovery.
    • Recommendation: Document the source boundary and removal condition, and add source-level regression coverage that OpenClaw and Hermes startup/proxy-env writers emit exact safety-net and ciao `--require /tmp/nemoclaw-*.js` entries with non-writable sourced-file permissions. Coordinate with overlapping proxy-env regeneration work so this helper does not become a second long-term source of truth for guard installation.
    • Evidence: `runtime-recovery-preload.ts` says it restores guards for a legacy sandbox whose proxy-env was emitted before the entrypoint emitted those guards. The authoritative writers live around `scripts/nemoclaw-start.sh`, `agents/hermes/start.sh`, and `scripts/lib/sandbox-init.sh`, while this PR tests only the localized recovery installer block.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/agent/runtime.test.ts`:
- Around line 287-314: The test "self-heal lines install both --require preloads
into NODE_OPTIONS under bash" currently calls spawnSync("bash", ["-c", probe],
...) which inherits the parent environment and can pick up pre-existing
NODE_OPTIONS; update the spawnSync invocation to pass an explicit env object
that clones process.env but sets NODE_OPTIONS to an empty string (or undefined)
so the shell probe runs with a clean NODE_OPTIONS; locate the spawnSync call in
this test and add the env override to ensure deterministic behavior.

In `@src/lib/agent/runtime.ts`:
- Around line 141-145: The self-heal preload lines produced by
buildGatewayPreloadSelfHealLines() are always spliced in, causing NODE_OPTIONS
to be rewritten even when preload files are missing; gate that recovery behavior
on the _PE_MISSING env flag by only emitting the self-heal lines when
_PE_MISSING is set to "0" (or by wrapping the generated shell snippet in a test
like [ "${_PE_MISSING:-}" = "0" ] && ...). Update
buildGatewayPreloadSelfHealLines() to implement this guard and make the same
change where the same snippet is injected (the other splices referenced in the
diff around the code that merges GATEWAY_PRELOAD_GUARDS at the other sites) so
the warning about “proxy-env exists but NODE_OPTIONS is stale” remains accurate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ee3b1c6e-ea3b-485d-b8b1-01bb323b65d7

📥 Commits

Reviewing files that changed from the base of the PR and between fa09629 and 8b3adec.

📒 Files selected for processing (2)
  • src/lib/agent/runtime.test.ts
  • src/lib/agent/runtime.ts

Comment thread src/lib/agent/runtime.test.ts Outdated
Comment thread src/lib/agent/runtime.ts Outdated
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added integration: hermes Hermes integration behavior area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow labels Jun 12, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27395540540
Target ref: 00fdcaa2b37cd404c73b4b70b814f037dfb91127
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 cancelled, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27396243506
Target ref: 4e2e94cc7186520422b85d22d20e849f490e6d0e
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 cancelled, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

…uire guard

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@cv

cv commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Closing this in favor of #5321. The replacement carries forward the trusted packaged preload/provenance approach here and adds the missing _PE_MISSING=1 restoration path, so crash-loop recovery can recreate proxy-env.sh after a pod/tmp wipe. Thanks for the clean base.

@cv cv closed this Jun 12, 2026
cv added a commit that referenced this pull request Jun 12, 2026
## Summary
This PR addresses the shared gateway recovery-state failure from #2701
by restoring the guard chain from trusted packaged preload sources when
`/tmp/nemoclaw-proxy-env.sh` is missing or incomplete. Recovery now
recreates a minimal proxy-env file for the critical safety-net and ciao
preloads, validates exact `--require` entries, and refuses an unguarded
relaunch if trusted staging fails.

## Related Issue
Addresses the shared guard-chain recovery failure in #2701; does not
close the broader hardware/provider/recreate-trigger validation matrix.
Refs #2701
Refs #2478
Supersedes #5259
Supersedes #5265

## Changes
- Added a shared recovery helper that stages safety-net and ciao
preloads from `/usr/local/lib/nemoclaw/preloads/` into hardened `/tmp`
files.
- Wired OpenClaw and agent-specific gateway recovery through the helper
instead of the old warning-only missing-proxy-env path.
- Added shell-executed tests for missing proxy-env restoration, exact
`--require` matching, symlink replacement, missing trusted sources, and
Hermes recovery harness integration.
- Documented the live E2E coverage scope: production `connect
--probe-only` / sandbox-exec recovery after a pod-recreate-equivalent
guard-chain wipe, with DGX Spark / GB10 / aarch64, provider breadth, and
destructive recreate triggers intentionally left for dedicated platform
validation.

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Carlos Villela <cvillela@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Recovery scripts now stage and validate trusted preload modules,
restore guard chains early, and hard-fail if critical guards remain
missing.
* Added test harness helpers to simulate trusted preload sources and
rewrite recovery scripts for E2E tests.

* **Bug Fixes**
* Hardened proxy-env handling: reject symlinks, enforce strict
permissions/ownership, avoid sourcing attacker-controlled content, and
prevent duplicate/unsafe preload entries.

* **Tests**
* Expanded E2E and unit coverage with ordering, permission/symlink
scenarios, logging and PID-stability assertions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression integration: hermes Hermes integration behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[All Platforms][Sandbox] nemohermes <name> recover refuses gateway relaunch due to missing NODE_OPTIONS safety-net preload

3 participants