Skip to content

fix(agent): restore gateway guard chain during recovery#5321

Merged
cv merged 8 commits into
mainfrom
codex/supersede-recovery-guard-prs
Jun 12, 2026
Merged

fix(agent): restore gateway guard chain during recovery#5321
cv merged 8 commits into
mainfrom
codex/supersede-recovery-guard-prs

Conversation

@cv

@cv cv commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

This PR addresses the shared gateway recovery-state failure from #2701 by restoring the guard chain from trusted packaged preload sources when /tmp/nemoclaw-proxy-env.sh is missing or incomplete. Recovery now recreates a minimal proxy-env file for the critical safety-net and ciao preloads, validates exact --require entries, and refuses an unguarded relaunch if trusted staging fails.

Related Issue

Addresses the shared guard-chain recovery failure in #2701; does not close the broader hardware/provider/recreate-trigger validation matrix.
Refs #2701
Refs #2478
Supersedes #5259
Supersedes #5265

Changes

  • Added a shared recovery helper that stages safety-net and ciao preloads from /usr/local/lib/nemoclaw/preloads/ into hardened /tmp files.
  • Wired OpenClaw and agent-specific gateway recovery through the helper instead of the old warning-only missing-proxy-env path.
  • Added shell-executed tests for missing proxy-env restoration, exact --require matching, symlink replacement, missing trusted sources, and Hermes recovery harness integration.
  • Documented the live E2E coverage scope: production connect --probe-only / sandbox-exec recovery after a pod-recreate-equivalent guard-chain wipe, with DGX Spark / GB10 / aarch64, provider breadth, and destructive recreate triggers intentionally left for dedicated platform validation.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

  • New Features

    • Recovery scripts now stage and validate trusted preload modules, restore guard chains early, and hard-fail if critical guards remain missing.
    • Added test harness helpers to simulate trusted preload sources and rewrite recovery scripts for E2E tests.
  • Bug Fixes

    • Hardened proxy-env handling: reject symlinks, enforce strict permissions/ownership, avoid sourcing attacker-controlled content, and prevent duplicate/unsafe preload entries.
  • Tests

    • Expanded E2E and unit coverage with ordering, permission/symlink scenarios, logging and PID-stability assertions.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv added security Potential vulnerability, unsafe behavior, or access risk area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 12, 2026
@cv cv self-assigned this Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 58b1d2d4-2159-47b8-974a-2c6f938a3c6a

📥 Commits

Reviewing files that changed from the base of the PR and between 9e5653d and 03a3168.

📒 Files selected for processing (2)
  • src/lib/agent/runtime-recovery-preload.test.ts
  • src/lib/agent/runtime.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/agent/runtime-recovery-preload.test.ts

📝 Walkthrough

Walkthrough

Adds packaged preload guard mappings and a builder that stages/validates trusted preloads into /tmp, wires that logic into generated recovery scripts to rebuild/source /tmp/nemoclaw-proxy-env.sh before launch, and adds helpers plus unit/behavioral/e2e tests exercising repair and refusal cases.

Changes

Gateway Recovery Preload Guard Restoration

Layer / File(s) Summary
Preload guard definition and recovery-line generation
src/lib/agent/runtime-recovery-preload.ts
GATEWAY_PRELOAD_GUARDS defines two Node --require pairs. buildGatewayGuardRecoveryLines() emits shell helpers to validate and stage packaged preloads into /tmp, regenerate /tmp/nemoclaw-proxy-env.sh when missing/unsafe, append missing --require entries to NODE_OPTIONS, and set _GUARDS_MISSING if critical guards fail to install.
Integration into recovery script builders
src/lib/agent/runtime.ts, src/lib/agent/runtime.test.ts
buildOpenClawRecoveryScript and buildRecoveryScript import and insert buildGatewayGuardRecoveryLines() instead of inline guard checks: scripts run guard-recovery lines early, validate/source the proxy env only after validation, and hard-fail when _GUARDS_MISSING=1 (refs #2478/#2701). Tests updated to assert sourcing, restoration messaging, permission/readiness ordering, and guard-readiness checks.
Test helpers
test/helpers/runtime-recovery-preload-test-helpers.ts
Adds utilities to compute harness-specific preload paths, write placeholder staged preload sources, and rewrite emitted recovery-script path strings for test harness execution.
End-to-end recovery-preload validation
src/lib/agent/runtime-recovery-preload.test.ts
Adds a Vitest suite that executes generated recovery scripts in a temporary filesystem, covering repair (restore missing proxy-env, rebuild unsafe env, replace symlinks) and refusal cases (missing/unsafe packaged sources, directory proxy-env), and asserts file contents, permission hardening, symlink replacement, exit codes, and log markers.
Behavioral and live e2e updates
src/lib/agent/runtime-hermes-secret-boundary-behavioural.test.ts, test/e2e-scenario/live/gateway-guard-recovery.test.ts
Behavioral harness now threads preload harness paths and rewrites generated scripts; runtime-env fixtures are hardened to read-only in tests; live e2e assertions now expect explicit restoration messages and absence of an unguarded-launch log entry rather than asserting WARNING absence.

Sequence Diagram(s)

sequenceDiagram
  participant Pod as Sandbox Pod
  participant Recovery as Recovery Script
  participant Preloader as Packaged Preloads
  participant Tmp as /tmp
  participant ProxyEnv as /tmp/nemoclaw-proxy-env.sh
  participant Gateway as Gateway Process

  Pod->>Recovery: trigger recovery (connect/relaunch)
  Recovery->>Tmp: check for proxy-env.sh
  alt proxy-env missing or unsafe
    Recovery->>Preloader: stage trusted preloads (safety-net, ciao)
    Preloader-->>Tmp: staged preload files
    Recovery->>ProxyEnv: regenerate proxy-env from staged exports
    ProxyEnv-->>Recovery: sourced
  else proxy-env present and valid
    ProxyEnv-->>Recovery: sourced
  end
  Recovery->>Recovery: append missing --require entries to NODE_OPTIONS
  Recovery->>Recovery: validate guard readiness (_GUARDS_MISSING/_NEMOCLAW_CRITICAL_GUARDS_READY)
  alt guards ready
    Recovery->>Gateway: proceed with launch
  else guards missing
    Recovery-->>Pod: FAIL (abort launch, `#2478/#2701`)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/NemoClaw#5049: Updates live gateway-guard-recovery E2E expectations aligned with the new guard-repair logic.

Suggested labels

area: security

"🐇
I hop to staged archives, tiny and spry,
copy safety-net and ciao by-and-by,
restore the env with permissions tight,
source the guards so the gateway's right,
no naked launches tonight."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly summarizes the main change: adding functionality to restore the gateway guard chain during the recovery process, which is the core objective across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/supersede-recovery-guard-prs

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: gateway-guard-recovery
Optional E2E: hermes-e2e-vitest, network-policy-vitest

Dispatch hint: gateway-guard-recovery

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gateway-guard-recovery (~45 minutes; live OpenShell sandbox; requires NVIDIA_API_KEY): Direct coverage for the changed production recovery route: wipes /tmp guard-chain state, kills the gateway, triggers nemoclaw <sandbox> connect --probe-only, and asserts trusted preload restoration plus stable gateway behavior.

Optional E2E

  • hermes-e2e-vitest (live Hermes sandbox; requires NVIDIA_API_KEY; moderate/high cost): Adjacent confidence for non-OpenClaw agent/Hermes live flows because buildRecoveryScript changed for Hermes and Hermes runtime-env secret-boundary recovery tests were touched. Useful, but the PR’s primary acceptance path is OpenClaw gateway guard-chain recovery.
  • network-policy-vitest (~90 minutes; live OpenShell sandbox; requires NVIDIA_API_KEY): Adjacent confidence that the ciao/network guard security boundary remains effective in a live sandbox after changes to the critical NODE_OPTIONS preload guard chain. It does not specifically cover recovery after /tmp wipe, so it is optional.

New E2E recommendations

  • platform-runtime/aarch64-pod-recreate (medium): The updated live test intentionally models pod-recreate-equivalent state on a Docker-driver runner, but does not exercise physical DGX Spark/GB10/aarch64, real host reboot/OOM/supervisor crash, or Kubernetes pod deletion triggers called out in the test comments.
    • Suggested test: Add a dedicated platform-runtime E2E job that runs gateway guard-chain recovery on DGX Spark/GB10 or an aarch64 OpenShell sandbox after an actual pod/container recreate event.
  • agent-specific-recovery/hermes (medium): Hermes recovery uses the same recovery script builder and secret-boundary guards, but there is no existing live E2E that poisons or wipes proxy-env/preload state and then triggers Hermes recovery to prove runtime-env validation still happens before relaunch.
    • Suggested test: Add a Hermes live recovery scenario that wipes /tmp preload/proxy-env state, triggers recovery for a Hermes sandbox, and asserts trusted guard restoration plus secret-boundary refusal on raw runtime secrets.

Dispatch hint

  • Workflow: .github/workflows/e2e-vitest-scenarios.yaml
  • jobs input: gateway-guard-recovery

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: gateway-guard-recovery
Optional Vitest E2E scenarios: None

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=gateway-guard-recovery

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • gateway-guard-recovery: Focused free-standing Vitest job wired for changed live test test/e2e-scenario/live/gateway-guard-recovery.test.ts.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=gateway-guard-recovery

Optional Vitest E2E scenarios

  • None.

Relevant changed files

  • src/lib/agent/runtime-recovery-preload.ts
  • src/lib/agent/runtime.ts
  • test/e2e-scenario/live/gateway-guard-recovery.test.ts

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 8 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 4 still apply, 1 new item found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: Minimal recovered proxy-env writer: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `_nemoclaw_write_recovered_proxy_env` emits only recovery comment plus two critical `NODE_OPTIONS` exports, while the entrypoint writer emits the full runtime shell environment.
  • Source-of-truth review needed: Metadata-valid proxy-env shell content: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: After `_nemoclaw_validate_recovery_proxy_env` succeeds, the helper immediately executes `. /tmp/nemoclaw-proxy-env.sh`.
  • Source-of-truth review needed: Hermes dashboard-only recovery proxy-env sourcing: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `buildHermesDashboardProcessRecoveryScript()` still emits `if [ -r /tmp/nemoclaw-proxy-env.sh ]; then . /tmp/nemoclaw-proxy-env.sh; fi;` and is not wired through `buildGatewayGuardRecoveryLines()`.
  • Metadata-valid proxy-env content is still executed as shell (src/lib/agent/runtime-recovery-preload.ts:143): The new helper rejects symlinks, unsafe modes, and wrong root ownership in root mode, which is a substantial improvement. However, once `_nemoclaw_validate_recovery_proxy_env` succeeds, recovery immediately sources `/tmp/nemoclaw-proxy-env.sh`. A regular mode-0444 file containing exact required guard entries plus arbitrary shell commands would pass the current checks and execute before content authenticity is established. This also keeps credential-exfiltration risk in scope because the sourced file normally carries gateway/proxy environment state.
    • Recommendation: Prefer rebuilding the recovery proxy-env from trusted generated content whenever this volatile `/tmp` trust boundary is crossed, or add a documented trust model plus a regression test where a metadata-valid proxy-env with exact guard entries and hostile shell content must not execute.
    • Evidence: `_nemoclaw_validate_recovery_proxy_env` checks symlink/file/mode/root owner, then the returned recovery line is `if _nemoclaw_validate_recovery_proxy_env /tmp/nemoclaw-proxy-env.sh; then . /tmp/nemoclaw-proxy-env.sh; _PE_MISSING=0; else _PE_MISSING=1; fi;`.
  • Hermes dashboard-only recovery still bypasses hardened proxy-env validation (src/lib/agent/runtime.ts:183): OpenClaw and generic gateway recovery now validate/rebuild the proxy-env before shell init and health probing, but the Hermes dashboard-only path keeps the older direct readable-file source behavior and still sources `~/.bashrc` first. That leaves this recovery route outside the new symlink, mode, owner, regular-file, and trusted-preload checks.
    • Recommendation: Route `buildHermesDashboardProcessRecoveryScript()` through the same hardened proxy-env/preload handling, or document why dashboard-only recovery is safe and add negative tests for symlinked, world-writable, and hostile `/tmp/nemoclaw-proxy-env.sh` content.
    • Evidence: `buildHermesDashboardProcessRecoveryScript()` emits `[ -f ~/.bashrc ] && . ~/.bashrc;` followed by `if [ -r /tmp/nemoclaw-proxy-env.sh ]; then . /tmp/nemoclaw-proxy-env.sh; fi;`; `buildGatewayGuardRecoveryLines()` is wired into `buildOpenClawRecoveryScript()` and `buildRecoveryScript()` only.
  • Non-root packaged-preload trust model is implicit (src/lib/agent/runtime-recovery-preload.ts:94): The recovery helper treats `/usr/local/lib/nemoclaw/preloads/*.js` as trusted packaged sources. In root mode it verifies owner UID 0 and rejects group/other-writable modes, but in non-root mode it only checks readable, non-symlink, regular-file status. If a non-root recovery environment can alter those packaged source paths, recovery could install hostile preload code into `/tmp` and append it to `NODE_OPTIONS`.
    • Recommendation: Document why packaged preload sources are immutable in non-root recovery, or apply equivalent owner/writability validation where possible and add a non-root tampered-source regression test.
    • Evidence: `_nemoclaw_validate_trusted_preload_source` checks owner and group/other writability only inside `if [ "$(id -u)" -eq 0 ]; then ... fi;`; tests cover group-writable trusted source only with `fakeRoot: true`.
  • Recovered proxy-env remains a minimal second source of runtime shell state (src/lib/agent/runtime-recovery-preload.ts:127): The entrypoint remains the full source of truth for `/tmp/nemoclaw-proxy-env.sh`, but recovery now has a second writer that emits only the two critical `NODE_OPTIONS` exports. The code comment usefully bounds this to gateway recovery and says sandbox restart refreshes the full proxy env, but the long-term ownership/removal condition and the effects of omitted proxy, token, OpenClaw path, guard-function, and tool-redirect state remain only partially documented.
    • Recommendation: Either extract/reuse the entrypoint proxy-env emission source for recovery, or add a short design note that explicitly bounds this minimal env to gateway recovery, describes effects on later connect shells until restart, and states when this workaround can be removed.
    • Evidence: `_nemoclaw_write_recovered_proxy_env` writes a recovery comment plus safety-net and ciao `NODE_OPTIONS` exports, while `scripts/nemoclaw-start.sh::write_runtime_shell_env` emits proxy variables, OpenClaw paths, gateway URL/token, guard functions, and tool redirects.
  • Broader [DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701 platform and recreate-trigger clauses remain out of scope (test/e2e-scenario/live/gateway-guard-recovery.test.ts:82): The PR correctly narrows its claim to the shared guard-chain recovery contract and explicitly does not close the broader hardware/provider/recreate-trigger matrix. If this PR is later treated as full [DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701 closure, the named hardware, provider breadth, and destructive recreate triggers still lack direct evidence.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Run OpenClaw/generic recovery with a 0444 regular `/tmp/nemoclaw-proxy-env.sh` containing exact safety-net/ciao `--require` entries plus `touch /tmp/hostile-marker`; assert the marker is not created or the file is rebuilt before sourcing.. The changed behavior generates shell that runs across sandbox lifecycle recovery, volatile `/tmp` trust boundaries, root/non-root ownership postures, packaged preload staging, shell init hooks, and health-probe fast paths. Unit shell tests are strong, but targeted runtime/integration validation would improve confidence.
  • **Runtime validation** — Run non-root recovery with a user-owned 0444 proxy-env containing exact guard entries plus hostile shell; assert it is rejected/rebuilt or document why that ownership is trusted.. The changed behavior generates shell that runs across sandbox lifecycle recovery, volatile `/tmp` trust boundaries, root/non-root ownership postures, packaged preload staging, shell init hooks, and health-probe fast paths. Unit shell tests are strong, but targeted runtime/integration validation would improve confidence.
  • **Runtime validation** — Run Hermes dashboard-only recovery with symlinked and world-writable `/tmp/nemoclaw-proxy-env.sh`; assert hostile shell content is not executed before dashboard launch.. The changed behavior generates shell that runs across sandbox lifecycle recovery, volatile `/tmp` trust boundaries, root/non-root ownership postures, packaged preload staging, shell init hooks, and health-probe fast paths. Unit shell tests are strong, but targeted runtime/integration validation would improve confidence.
  • **Runtime validation** — Run dashboard-only Hermes recovery with missing or incomplete proxy-env; assert it restores the guard chain or explicitly documents why dashboard-only recovery does not require it.. The changed behavior generates shell that runs across sandbox lifecycle recovery, volatile `/tmp` trust boundaries, root/non-root ownership postures, packaged preload staging, shell init hooks, and health-probe fast paths. Unit shell tests are strong, but targeted runtime/integration validation would improve confidence.
  • **Runtime validation** — Run sandbox-level recovery where `/tmp/nemoclaw-proxy-env.sh` and both tmp preload files are missing while the health probe would return already-running; assert repair happens before `ALREADY_RUNNING`.. The changed behavior generates shell that runs across sandbox lifecycle recovery, volatile `/tmp` trust boundaries, root/non-root ownership postures, packaged preload staging, shell init hooks, and health-probe fast paths. Unit shell tests are strong, but targeted runtime/integration validation would improve confidence.
  • **Acceptance clause:** This PR addresses the shared gateway recovery-state failure from [DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701 by restoring the guard chain from trusted packaged preload sources when `/tmp/nemoclaw-proxy-env.sh` is missing or incomplete. — add test evidence or identify existing coverage. Met for OpenClaw/generic gateway recovery via `buildGatewayGuardRecoveryLines()` wiring, unit coverage for missing/incomplete proxy-env repair, and live E2E `connect --probe-only` after guard-chain wipe. Partial because Hermes dashboard-only recovery still directly sources `/tmp/nemoclaw-proxy-env.sh` without the new validation/rebuild path.
  • **Acceptance clause:** openshell sandbox exec -- sh -c OpenClaw recovery script — add test evidence or identify existing coverage. The E2E comments identify this as the production route reached through `connect --probe-only`, and the unit tests assert `buildOpenClawRecoveryScript()` wiring. The diff does not directly instrument the internal `openshell sandbox exec -- sh -c` invocation.
  • **Minimal recovered proxy-env writer** — Unit tests cover missing and incomplete proxy-env repair; the live E2E wipes the guard chain, kills the gateway tree, runs `connect --probe-only`, and asserts guard restoration.. `_nemoclaw_write_recovered_proxy_env` emits only recovery comment plus two critical `NODE_OPTIONS` exports, while the entrypoint writer emits the full runtime shell environment.
Since last review details

Current findings:

  • Source-of-truth review needed: Minimal recovered proxy-env writer: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `_nemoclaw_write_recovered_proxy_env` emits only recovery comment plus two critical `NODE_OPTIONS` exports, while the entrypoint writer emits the full runtime shell environment.
  • Source-of-truth review needed: Metadata-valid proxy-env shell content: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: After `_nemoclaw_validate_recovery_proxy_env` succeeds, the helper immediately executes `. /tmp/nemoclaw-proxy-env.sh`.
  • Source-of-truth review needed: Hermes dashboard-only recovery proxy-env sourcing: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `buildHermesDashboardProcessRecoveryScript()` still emits `if [ -r /tmp/nemoclaw-proxy-env.sh ]; then . /tmp/nemoclaw-proxy-env.sh; fi;` and is not wired through `buildGatewayGuardRecoveryLines()`.
  • Metadata-valid proxy-env content is still executed as shell (src/lib/agent/runtime-recovery-preload.ts:143): The new helper rejects symlinks, unsafe modes, and wrong root ownership in root mode, which is a substantial improvement. However, once `_nemoclaw_validate_recovery_proxy_env` succeeds, recovery immediately sources `/tmp/nemoclaw-proxy-env.sh`. A regular mode-0444 file containing exact required guard entries plus arbitrary shell commands would pass the current checks and execute before content authenticity is established. This also keeps credential-exfiltration risk in scope because the sourced file normally carries gateway/proxy environment state.
    • Recommendation: Prefer rebuilding the recovery proxy-env from trusted generated content whenever this volatile `/tmp` trust boundary is crossed, or add a documented trust model plus a regression test where a metadata-valid proxy-env with exact guard entries and hostile shell content must not execute.
    • Evidence: `_nemoclaw_validate_recovery_proxy_env` checks symlink/file/mode/root owner, then the returned recovery line is `if _nemoclaw_validate_recovery_proxy_env /tmp/nemoclaw-proxy-env.sh; then . /tmp/nemoclaw-proxy-env.sh; _PE_MISSING=0; else _PE_MISSING=1; fi;`.
  • Hermes dashboard-only recovery still bypasses hardened proxy-env validation (src/lib/agent/runtime.ts:183): OpenClaw and generic gateway recovery now validate/rebuild the proxy-env before shell init and health probing, but the Hermes dashboard-only path keeps the older direct readable-file source behavior and still sources `~/.bashrc` first. That leaves this recovery route outside the new symlink, mode, owner, regular-file, and trusted-preload checks.
    • Recommendation: Route `buildHermesDashboardProcessRecoveryScript()` through the same hardened proxy-env/preload handling, or document why dashboard-only recovery is safe and add negative tests for symlinked, world-writable, and hostile `/tmp/nemoclaw-proxy-env.sh` content.
    • Evidence: `buildHermesDashboardProcessRecoveryScript()` emits `[ -f ~/.bashrc ] && . ~/.bashrc;` followed by `if [ -r /tmp/nemoclaw-proxy-env.sh ]; then . /tmp/nemoclaw-proxy-env.sh; fi;`; `buildGatewayGuardRecoveryLines()` is wired into `buildOpenClawRecoveryScript()` and `buildRecoveryScript()` only.
  • Non-root packaged-preload trust model is implicit (src/lib/agent/runtime-recovery-preload.ts:94): The recovery helper treats `/usr/local/lib/nemoclaw/preloads/*.js` as trusted packaged sources. In root mode it verifies owner UID 0 and rejects group/other-writable modes, but in non-root mode it only checks readable, non-symlink, regular-file status. If a non-root recovery environment can alter those packaged source paths, recovery could install hostile preload code into `/tmp` and append it to `NODE_OPTIONS`.
    • Recommendation: Document why packaged preload sources are immutable in non-root recovery, or apply equivalent owner/writability validation where possible and add a non-root tampered-source regression test.
    • Evidence: `_nemoclaw_validate_trusted_preload_source` checks owner and group/other writability only inside `if [ "$(id -u)" -eq 0 ]; then ... fi;`; tests cover group-writable trusted source only with `fakeRoot: true`.
  • Recovered proxy-env remains a minimal second source of runtime shell state (src/lib/agent/runtime-recovery-preload.ts:127): The entrypoint remains the full source of truth for `/tmp/nemoclaw-proxy-env.sh`, but recovery now has a second writer that emits only the two critical `NODE_OPTIONS` exports. The code comment usefully bounds this to gateway recovery and says sandbox restart refreshes the full proxy env, but the long-term ownership/removal condition and the effects of omitted proxy, token, OpenClaw path, guard-function, and tool-redirect state remain only partially documented.
    • Recommendation: Either extract/reuse the entrypoint proxy-env emission source for recovery, or add a short design note that explicitly bounds this minimal env to gateway recovery, describes effects on later connect shells until restart, and states when this workaround can be removed.
    • Evidence: `_nemoclaw_write_recovered_proxy_env` writes a recovery comment plus safety-net and ciao `NODE_OPTIONS` exports, while `scripts/nemoclaw-start.sh::write_runtime_shell_env` emits proxy variables, OpenClaw paths, gateway URL/token, guard functions, and tool redirects.
  • Broader [DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701 platform and recreate-trigger clauses remain out of scope (test/e2e-scenario/live/gateway-guard-recovery.test.ts:82): The PR correctly narrows its claim to the shared guard-chain recovery contract and explicitly does not close the broader hardware/provider/recreate-trigger matrix. If this PR is later treated as full [DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701 closure, the named hardware, provider breadth, and destructive recreate triggers still lack direct evidence.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@cv cv requested review from laitingsheng and sandl99 June 12, 2026 06:29
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27398589871
Workflow ref: codex/supersede-recovery-guard-prs
Requested scenarios: (default — all supported)
Requested jobs: gateway-guard-recovery,hermes-e2e-vitest
Summary: 2 passed, 1 failed, 13 skipped

Job Result
credential-migration-vitest ⏭️ skipped
double-onboard-vitest ⏭️ skipped
gateway-guard-recovery ❌ failure
generate-matrix ✅ success
hermes-e2e-vitest ⚠️ cancelled
inference-routing-vitest ⏭️ skipped
issue-4434-tui-unreachable-inference-vitest ⏭️ skipped
launchable-smoke-vitest ⏭️ skipped
live-scenarios ⏭️ skipped
network-policy-vitest ⏭️ skipped
onboard-negative-paths-vitest ⏭️ skipped
openclaw-tui-chat-correlation-vitest ⏭️ skipped
openshell-version-pin-vitest ⏭️ skipped
rebuild-openclaw-vitest ⏭️ skipped
runtime-overrides-vitest ⏭️ skipped
token-rotation-vitest ⏭️ skipped
validate-jobs ✅ success

Failed jobs: gateway-guard-recovery. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27399005694
Workflow ref: codex/supersede-recovery-guard-prs
Requested scenarios: (default — all supported)
Requested jobs: gateway-guard-recovery,hermes-e2e-vitest
Summary: 3 passed, 1 failed, 13 skipped

Job Result
credential-migration-vitest ⏭️ skipped
double-onboard-vitest ⏭️ skipped
gateway-guard-recovery ✅ success
generate-matrix ✅ success
hermes-e2e-vitest ❌ failure
inference-routing-vitest ⏭️ skipped
issue-4434-tui-unreachable-inference-vitest ⏭️ skipped
launchable-smoke-vitest ⏭️ skipped
live-scenarios ⏭️ skipped
network-policy-vitest ⏭️ skipped
onboard-negative-paths-vitest ⏭️ skipped
openclaw-tui-chat-correlation-vitest ⏭️ skipped
openshell-version-pin-vitest ⏭️ skipped
rebuild-openclaw-vitest ⏭️ skipped
runtime-overrides-vitest ⏭️ skipped
token-rotation-vitest ⏭️ skipped
validate-jobs ✅ success

Failed jobs: hermes-e2e-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ✅ All jobs passed

Run: 27399558627
Workflow ref: codex/supersede-recovery-guard-prs
Requested scenarios: (default — all supported)
Requested jobs: hermes-e2e-vitest
Summary: 3 passed, 0 failed, 14 skipped

Job Result
credential-migration-vitest ⏭️ skipped
double-onboard-vitest ⏭️ skipped
gateway-guard-recovery ⏭️ skipped
generate-matrix ✅ success
hermes-e2e-vitest ✅ success
inference-routing-vitest ⏭️ skipped
issue-4434-tui-unreachable-inference-vitest ⏭️ skipped
launchable-smoke-vitest ⏭️ skipped
live-scenarios ⏭️ skipped
network-policy-vitest ⏭️ skipped
onboard-negative-paths-vitest ⏭️ skipped
openclaw-tui-chat-correlation-vitest ⏭️ skipped
openshell-version-pin-vitest ⏭️ skipped
rebuild-openclaw-vitest ⏭️ skipped
runtime-overrides-vitest ⏭️ skipped
token-rotation-vitest ⏭️ skipped
validate-jobs ✅ success

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ❌ Some jobs failed

Run: 27399005694
Workflow ref: codex/supersede-recovery-guard-prs
Requested scenarios: (default — all supported)
Requested jobs: gateway-guard-recovery,hermes-e2e-vitest
Summary: 3 passed, 1 failed, 13 skipped

Job Result
credential-migration-vitest ⏭️ skipped
double-onboard-vitest ⏭️ skipped
gateway-guard-recovery ✅ success
generate-matrix ✅ success
hermes-e2e-vitest ❌ failure
inference-routing-vitest ⏭️ skipped
issue-4434-tui-unreachable-inference-vitest ⏭️ skipped
launchable-smoke-vitest ⏭️ skipped
live-scenarios ⏭️ skipped
network-policy-vitest ⏭️ skipped
onboard-negative-paths-vitest ⏭️ skipped
openclaw-tui-chat-correlation-vitest ⏭️ skipped
openshell-version-pin-vitest ⏭️ skipped
rebuild-openclaw-vitest ⏭️ skipped
runtime-overrides-vitest ⏭️ skipped
token-rotation-vitest ⏭️ skipped
validate-jobs ✅ success

Failed jobs: hermes-e2e-vitest. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Results — ✅ All jobs passed

Run: 27399005694
Workflow ref: codex/supersede-recovery-guard-prs
Requested scenarios: (default — all supported)
Requested jobs: gateway-guard-recovery,hermes-e2e-vitest
Summary: 4 passed, 0 failed, 13 skipped

Job Result
credential-migration-vitest ⏭️ skipped
double-onboard-vitest ⏭️ skipped
gateway-guard-recovery ✅ success
generate-matrix ✅ success
hermes-e2e-vitest ✅ success
inference-routing-vitest ⏭️ skipped
issue-4434-tui-unreachable-inference-vitest ⏭️ skipped
launchable-smoke-vitest ⏭️ skipped
live-scenarios ⏭️ skipped
network-policy-vitest ⏭️ skipped
onboard-negative-paths-vitest ⏭️ skipped
openclaw-tui-chat-correlation-vitest ⏭️ skipped
openshell-version-pin-vitest ⏭️ skipped
rebuild-openclaw-vitest ⏭️ skipped
runtime-overrides-vitest ⏭️ skipped
token-rotation-vitest ⏭️ skipped
validate-jobs ✅ success

Comment thread src/lib/agent/runtime-recovery-preload.test.ts Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/agent/runtime-recovery-preload.ts (1)

140-148: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Safe-but-incomplete proxy-env.sh never gets repaired on disk.

If /tmp/nemoclaw-proxy-env.sh is 0444/root-owned but missing one exact guard, this path sets _PE_MISSING=0, sources it, and only fixes the current recovery shell via _nemoclaw_append_node_require. The file itself stays incomplete, so an ALREADY_RUNNING fast-path exit — or any later relaunch that reuses the file — still starts from the stale guard chain instead of the persisted repair described in the PR objective for “missing or incomplete” env files.

Either mark incomplete env files for rewrite as well, or persist the missing --require entries back into the file before the health fast path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/agent/runtime-recovery-preload.ts` around lines 140 - 148, The
recovery flow currently sources /tmp/nemoclaw-proxy-env.sh after
_nemoclaw_validate_recovery_proxy_env sets _PE_MISSING, but when
_nemoclaw_append_node_require only fixes the in-memory shell the on-disk file
can remain incomplete; update the logic so that when _PE_MISSING was 1 and you
successfully apply fixes (via _nemoclaw_append_node_require or
_nemoclaw_write_recovered_proxy_env) you also persist the repaired require
entries back into /tmp/nemoclaw-proxy-env.sh (or mark the file for rewrite and
call _nemoclaw_write_recovered_proxy_env) before allowing the ALREADY_RUNNING
fast-path; locate this behavior around the code referencing _PE_MISSING,
_NEMOCLAW_CRITICAL_GUARDS_READY, _nemoclaw_append_node_require and
_nemoclaw_write_recovered_proxy_env and ensure the on-disk file is updated when
fixes are applied.
🧹 Nitpick comments (1)
src/lib/agent/runtime-recovery-preload.test.ts (1)

188-210: ⚡ Quick win

Add an execution test for a metadata-safe but incomplete proxy-env.sh.

These new cases prove missing and unsafe files are rebuilt, but they still do not assert what happens when proxy-env.sh is a valid regular 0444 file and is only missing one exact --require. That branch is where stdout can look repaired while the persisted file stays stale, so this suite should pin that contract down explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/agent/runtime-recovery-preload.test.ts` around lines 188 - 210, Add a
new test alongside the existing case in runtime-recovery-preload.test.ts that
uses runGuardRecovery to exercise the branch where proxy-env.sh is metadata-safe
(mode 0o444) but missing exactly one packaged --require entry; in the
beforeScript write a proxy-env.sh with correct permissions (0o444) and all
required lines except the specific --require that should be injected, then
assert result.status is 0 and result.stdout shows the repaired --require present
(e.g., contains `--require ${result.paths.tmpSafetyNet}`) while the persisted
file (result.files.proxyEnv) remains unchanged and does not contain that
--require, also assert files.proxyEnvMode is 0o444 and gatewayLog contains the
expected "unsafe mode"/"rebuilding from packaged preloads" messages and that
hostileProxyEnvSourced remains false to pin the contract that stdout may show a
repair while the on-disk file stays stale.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/agent/runtime-recovery-preload.ts`:
- Around line 140-148: The recovery flow currently sources
/tmp/nemoclaw-proxy-env.sh after _nemoclaw_validate_recovery_proxy_env sets
_PE_MISSING, but when _nemoclaw_append_node_require only fixes the in-memory
shell the on-disk file can remain incomplete; update the logic so that when
_PE_MISSING was 1 and you successfully apply fixes (via
_nemoclaw_append_node_require or _nemoclaw_write_recovered_proxy_env) you also
persist the repaired require entries back into /tmp/nemoclaw-proxy-env.sh (or
mark the file for rewrite and call _nemoclaw_write_recovered_proxy_env) before
allowing the ALREADY_RUNNING fast-path; locate this behavior around the code
referencing _PE_MISSING, _NEMOCLAW_CRITICAL_GUARDS_READY,
_nemoclaw_append_node_require and _nemoclaw_write_recovered_proxy_env and ensure
the on-disk file is updated when fixes are applied.

---

Nitpick comments:
In `@src/lib/agent/runtime-recovery-preload.test.ts`:
- Around line 188-210: Add a new test alongside the existing case in
runtime-recovery-preload.test.ts that uses runGuardRecovery to exercise the
branch where proxy-env.sh is metadata-safe (mode 0o444) but missing exactly one
packaged --require entry; in the beforeScript write a proxy-env.sh with correct
permissions (0o444) and all required lines except the specific --require that
should be injected, then assert result.status is 0 and result.stdout shows the
repaired --require present (e.g., contains `--require
${result.paths.tmpSafetyNet}`) while the persisted file (result.files.proxyEnv)
remains unchanged and does not contain that --require, also assert
files.proxyEnvMode is 0o444 and gatewayLog contains the expected "unsafe
mode"/"rebuilding from packaged preloads" messages and that
hostileProxyEnvSourced remains false to pin the contract that stdout may show a
repair while the on-disk file stays stale.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9cd38010-f9e0-4f90-8df4-dd709b5df4d6

📥 Commits

Reviewing files that changed from the base of the PR and between 9058d77 and cd55950.

📒 Files selected for processing (6)
  • src/lib/agent/runtime-hermes-secret-boundary-behavioural.test.ts
  • src/lib/agent/runtime-recovery-preload.test.ts
  • src/lib/agent/runtime-recovery-preload.ts
  • src/lib/agent/runtime.test.ts
  • src/lib/agent/runtime.ts
  • test/helpers/runtime-recovery-preload-test-helpers.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/agent/runtime-recovery-preload.test.ts`:
- Around line 326-343: The test currently anchors ordering to the sentinel
string "_NEMOCLAW_CRITICAL_GUARDS_READY=1" which can drift; change the assertion
to locate the actual staged-repair call sites (search for the staged preload
invocation, e.g. the literal "_nemoclaw_stage_recovery_preload" or the guard
source paths via SAFETY_NET_GUARD.sourcePath and CIAO_GUARD.sourcePath) instead
of the sentinel, then assert those indices occur before the "_GW_CODE=" health
probe and before the ALREADY_RUNNING fast-path; leave the existing
source-before-sentinel validation to the other test and only assert the
repair-before-health/fast-path ordering here (use the same loop over
buildRecoveryScript/buildOpenClawRecoveryScript and replace repairIdx with the
index(es) of the actual preload/guard source calls).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7150a8cc-bf2f-4831-9398-6539b2689a5f

📥 Commits

Reviewing files that changed from the base of the PR and between ea191c3 and f0418cf.

📒 Files selected for processing (3)
  • src/lib/agent/runtime-recovery-preload.test.ts
  • src/lib/agent/runtime-recovery-preload.ts
  • src/lib/agent/runtime.test.ts
💤 Files with no reviewable changes (1)
  • src/lib/agent/runtime.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/agent/runtime-recovery-preload.ts

Comment thread src/lib/agent/runtime-recovery-preload.test.ts
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27404283063
Target ref: 35c188c949fac64667f35781dd406acbdc5692bc
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e,hermes-secret-boundary-e2e
Summary: 1 passed, 1 failed, 0 cancelled, 0 skipped

Job Result
hermes-secret-boundary-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

@cv cv merged commit b747bfa into main Jun 12, 2026
43 checks passed
@cv cv deleted the codex/supersede-recovery-guard-prs branch June 12, 2026 17:14
cv added a commit that referenced this pull request Jun 13, 2026
## Summary
Hardens recovery proxy-env handling so gateway and Hermes dashboard
recovery inspect existing `/tmp/nemoclaw-proxy-env.sh` only for
diagnostics, then source a freshly generated minimal recovery env. This
prevents metadata-valid `/tmp` shell content from being executed during
recovery while preserving the trusted guard-chain restoration path from
#5321.

## Related Issue
Refs #2701
Refs #2478
Follow-up to #5321

## Changes
- Regenerate and source a fresh minimal recovery proxy-env instead of
sourcing pre-existing `/tmp/nemoclaw-proxy-env.sh` content.
- Reuse the hardened guard recovery path for Hermes dashboard-only
recovery before shell init and runtime-env validation.
- Reject group/other-writable packaged preload sources in non-root
recovery and add regression coverage for hostile proxy-env content.

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Carlos Villela <cvillela@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Expanded recovery suites to verify rebuilt proxy envs never expose raw
secrets, confirm absence of secret values in logs/output, add cases for
safe rebuilding, guard-file repairs, unsafe-preload fail-closed
behavior, and unified validate-then-source assertions. Added helpers for
temp cleanup, waiting for markers, and for inspecting generated recovery
env contents. E2E crash-loop tests now use bounded log-window checks.

* **Refactor**
* Hardened runtime recovery flow: earlier guard validation, adjusted
shell init/sourcing order, and sourcing of a dedicated recovery env.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Julie Yaunches <jyaunches@nvidia.com>
@cv cv added the v0.0.65 Release target label Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression security Potential vulnerability, unsafe behavior, or access risk v0.0.65 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants