fix(sandbox): self-heal recovery preload guards#5265
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughEmit guarded shell fragments that stage trusted preload files into /tmp with provenance checks and idempotently append matching ChangesGateway Recovery Self-Heal Mechanism
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
|
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: Dispatch required Vitest E2E scenarios:
Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
PR Review AdvisorFindings: 1 needs attention, 4 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/agent/runtime.test.ts`:
- Around line 287-314: The test "self-heal lines install both --require preloads
into NODE_OPTIONS under bash" currently calls spawnSync("bash", ["-c", probe],
...) which inherits the parent environment and can pick up pre-existing
NODE_OPTIONS; update the spawnSync invocation to pass an explicit env object
that clones process.env but sets NODE_OPTIONS to an empty string (or undefined)
so the shell probe runs with a clean NODE_OPTIONS; locate the spawnSync call in
this test and add the env override to ensure deterministic behavior.
In `@src/lib/agent/runtime.ts`:
- Around line 141-145: The self-heal preload lines produced by
buildGatewayPreloadSelfHealLines() are always spliced in, causing NODE_OPTIONS
to be rewritten even when preload files are missing; gate that recovery behavior
on the _PE_MISSING env flag by only emitting the self-heal lines when
_PE_MISSING is set to "0" (or by wrapping the generated shell snippet in a test
like [ "${_PE_MISSING:-}" = "0" ] && ...). Update
buildGatewayPreloadSelfHealLines() to implement this guard and make the same
change where the same snippet is injected (the other splices referenced in the
diff around the code that merges GATEWAY_PRELOAD_GUARDS at the other sites) so
the warning about “proxy-env exists but NODE_OPTIONS is stale” remains accurate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ee3b1c6e-ea3b-485d-b8b1-01bb323b65d7
📒 Files selected for processing (2)
src/lib/agent/runtime.test.tssrc/lib/agent/runtime.ts
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…PTIONS Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27395540540
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27396243506
|
…uire guard Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Closing this in favor of #5321. The replacement carries forward the trusted packaged preload/provenance approach here and adds the missing |
## Summary This PR addresses the shared gateway recovery-state failure from #2701 by restoring the guard chain from trusted packaged preload sources when `/tmp/nemoclaw-proxy-env.sh` is missing or incomplete. Recovery now recreates a minimal proxy-env file for the critical safety-net and ciao preloads, validates exact `--require` entries, and refuses an unguarded relaunch if trusted staging fails. ## Related Issue Addresses the shared guard-chain recovery failure in #2701; does not close the broader hardware/provider/recreate-trigger validation matrix. Refs #2701 Refs #2478 Supersedes #5259 Supersedes #5265 ## Changes - Added a shared recovery helper that stages safety-net and ciao preloads from `/usr/local/lib/nemoclaw/preloads/` into hardened `/tmp` files. - Wired OpenClaw and agent-specific gateway recovery through the helper instead of the old warning-only missing-proxy-env path. - Added shell-executed tests for missing proxy-env restoration, exact `--require` matching, symlink replacement, missing trusted sources, and Hermes recovery harness integration. - Documented the live E2E coverage scope: production `connect --probe-only` / sandbox-exec recovery after a pod-recreate-equivalent guard-chain wipe, with DGX Spark / GB10 / aarch64, provider breadth, and destructive recreate triggers intentionally left for dedicated platform validation. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Carlos Villela <cvillela@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Recovery scripts now stage and validate trusted preload modules, restore guard chains early, and hard-fail if critical guards remain missing. * Added test harness helpers to simulate trusted preload sources and rewrite recovery scripts for E2E tests. * **Bug Fixes** * Hardened proxy-env handling: reject symlinks, enforce strict permissions/ownership, avoid sourcing attacker-controlled content, and prevent duplicate/unsafe preload entries. * **Tests** * Expanded E2E and unit coverage with ordering, permission/symlink scenarios, logging and PID-stability assertions. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Summary
The gateway recovery script aborts with
refusing unguarded gateway relaunchwhen/tmp/nemoclaw-proxy-env.shis present but does not install the safety-net / ciao--requirepreload guards intoNODE_OPTIONS. Sandboxes onboarded by an older entrypoint that emitted only proxy /OPENCLAW_*exports into that file — or that have lost the staged.jspreloads from/tmp— therefore cannot recover. Recovery now regenerates the staged copies from the immutable/usr/local/lib/nemoclaw/preloads/source files, verifies provenance (regular non-symlink, mode 444, root-owned when running as uid 0) before grafting any path intoNODE_OPTIONS, pins the trailing_GUARDS_MISSINGcheck to the exact--require /tmp/<preload>.jstoken, and keeps the existing refusal as the final invariant when provenance fails.This PR fixes the in-sandbox refusal mechanism — provenance, trusted-source install, and the exact-path guard check. End-to-end coverage of gateway recovery (process tree, dashboard reachability, inference healthy) is exercised by the existing
test/e2e/test-issue-2478-crash-loop-recovery.shsoak; the unit tests added here pin the new boundary inside that flow.Related Issue
Fixes #5253
Changes
src/lib/agent/runtime-recovery-preload.ts— new module. HostsGATEWAY_PRELOAD_GUARDS(pairs each/tmp/<preload>.jswith its immutable/usr/local/lib/nemoclaw/preloads/<basename>.jssource) andbuildGatewayPreloadSelfHealLines(). The latter emits a_nemoclaw_install_recovery_preloadshell function that (a) recreates the staged/tmpcopy from the trusted source via stage-and-rename when missing, mirroringemit_sandbox_sourced_file; (b) refuses to add an entry toNODE_OPTIONSunless the staged copy is a regular non-symlink with mode 444 (and root-owned when uid 0); (c) only grafts--require <path>once that exact path is not already present, so a stale marker substring no longer satisfies the check.src/lib/agent/runtime.ts— splice the self-heal into bothbuildOpenClawRecoveryScriptandbuildRecoveryScriptimmediately after sourcing/tmp/nemoclaw-proxy-env.sh, wrapped inif [ "$_PE_MISSING" = "0" ]; then … fi;at the call site so the conditional is visible to reviewers. Tighten the trailing_GUARDS_MISSINGcheck from a marker substring to the exact--require /tmp/<preload>.jstoken, closing the bypass where a NODE_OPTIONS that mentioned the marker text but lacked the real--requireentry was treated as guarded.src/lib/agent/runtime-recovery-preload.test.ts— new dedicated file. Shape tests pin the installer function, the_PE_MISSING=0gate at each splice site, the trusted source paths, the ordering across bothbuildRecoveryScriptandbuildOpenClawRecoveryScript, and the exact-path form of the trailing guard check. Behavioural tests spawnbashagainst the extracted block with stubbed paths and explicitNODE_OPTIONS=""env override and cover: install-from-source on a clean/tmp, reuse of an already-staged mode 444 file, refusal of a symlink target, refusal of a mode 666 target, warn-and-skip when both staged and source are absent, marker-substring-without---requirestill installs, idempotency when both--requireentries are already present, and_PE_MISSING=1bypassing the block entirely.src/lib/agent/runtime.test.ts— drops the obsolete shape and bash-probe tests that asserted the previous[ -r <path> ]form; the surviving#2478tests in this file continue to pin the proxy-env sourcing, refusal, and gateway-log invariants.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests