fix(agent): restore gateway guard chain during recovery#5321
Conversation
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds packaged preload guard mappings and a builder that stages/validates trusted preloads into /tmp, wires that logic into generated recovery scripts to rebuild/source /tmp/nemoclaw-proxy-env.sh before launch, and adds helpers plus unit/behavioral/e2e tests exercising repair and refusal cases. ChangesGateway Recovery Preload Guard Restoration
Sequence Diagram(s)sequenceDiagram
participant Pod as Sandbox Pod
participant Recovery as Recovery Script
participant Preloader as Packaged Preloads
participant Tmp as /tmp
participant ProxyEnv as /tmp/nemoclaw-proxy-env.sh
participant Gateway as Gateway Process
Pod->>Recovery: trigger recovery (connect/relaunch)
Recovery->>Tmp: check for proxy-env.sh
alt proxy-env missing or unsafe
Recovery->>Preloader: stage trusted preloads (safety-net, ciao)
Preloader-->>Tmp: staged preload files
Recovery->>ProxyEnv: regenerate proxy-env from staged exports
ProxyEnv-->>Recovery: sourced
else proxy-env present and valid
ProxyEnv-->>Recovery: sourced
end
Recovery->>Recovery: append missing --require entries to NODE_OPTIONS
Recovery->>Recovery: validate guard readiness (_GUARDS_MISSING/_NEMOCLAW_CRITICAL_GUARDS_READY)
alt guards ready
Recovery->>Gateway: proceed with launch
else guards missing
Recovery-->>Pod: FAIL (abort launch, `#2478/#2701`)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: Dispatch required Vitest E2E scenarios:
Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 8 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 27398589871
|
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 27399005694
|
Vitest E2E Scenario Results — ✅ All jobs passedRun: 27399558627
|
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 27399005694
|
Vitest E2E Scenario Results — ✅ All jobs passedRun: 27399005694
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/agent/runtime-recovery-preload.ts (1)
140-148:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftSafe-but-incomplete
proxy-env.shnever gets repaired on disk.If
/tmp/nemoclaw-proxy-env.shis0444/root-owned but missing one exact guard, this path sets_PE_MISSING=0, sources it, and only fixes the current recovery shell via_nemoclaw_append_node_require. The file itself stays incomplete, so anALREADY_RUNNINGfast-path exit — or any later relaunch that reuses the file — still starts from the stale guard chain instead of the persisted repair described in the PR objective for “missing or incomplete” env files.Either mark incomplete env files for rewrite as well, or persist the missing
--requireentries back into the file before the health fast path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/agent/runtime-recovery-preload.ts` around lines 140 - 148, The recovery flow currently sources /tmp/nemoclaw-proxy-env.sh after _nemoclaw_validate_recovery_proxy_env sets _PE_MISSING, but when _nemoclaw_append_node_require only fixes the in-memory shell the on-disk file can remain incomplete; update the logic so that when _PE_MISSING was 1 and you successfully apply fixes (via _nemoclaw_append_node_require or _nemoclaw_write_recovered_proxy_env) you also persist the repaired require entries back into /tmp/nemoclaw-proxy-env.sh (or mark the file for rewrite and call _nemoclaw_write_recovered_proxy_env) before allowing the ALREADY_RUNNING fast-path; locate this behavior around the code referencing _PE_MISSING, _NEMOCLAW_CRITICAL_GUARDS_READY, _nemoclaw_append_node_require and _nemoclaw_write_recovered_proxy_env and ensure the on-disk file is updated when fixes are applied.
🧹 Nitpick comments (1)
src/lib/agent/runtime-recovery-preload.test.ts (1)
188-210: ⚡ Quick winAdd an execution test for a metadata-safe but incomplete
proxy-env.sh.These new cases prove missing and unsafe files are rebuilt, but they still do not assert what happens when
proxy-env.shis a valid regular0444file and is only missing one exact--require. That branch is where stdout can look repaired while the persisted file stays stale, so this suite should pin that contract down explicitly.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/agent/runtime-recovery-preload.test.ts` around lines 188 - 210, Add a new test alongside the existing case in runtime-recovery-preload.test.ts that uses runGuardRecovery to exercise the branch where proxy-env.sh is metadata-safe (mode 0o444) but missing exactly one packaged --require entry; in the beforeScript write a proxy-env.sh with correct permissions (0o444) and all required lines except the specific --require that should be injected, then assert result.status is 0 and result.stdout shows the repaired --require present (e.g., contains `--require ${result.paths.tmpSafetyNet}`) while the persisted file (result.files.proxyEnv) remains unchanged and does not contain that --require, also assert files.proxyEnvMode is 0o444 and gatewayLog contains the expected "unsafe mode"/"rebuilding from packaged preloads" messages and that hostileProxyEnvSourced remains false to pin the contract that stdout may show a repair while the on-disk file stays stale.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/agent/runtime-recovery-preload.ts`:
- Around line 140-148: The recovery flow currently sources
/tmp/nemoclaw-proxy-env.sh after _nemoclaw_validate_recovery_proxy_env sets
_PE_MISSING, but when _nemoclaw_append_node_require only fixes the in-memory
shell the on-disk file can remain incomplete; update the logic so that when
_PE_MISSING was 1 and you successfully apply fixes (via
_nemoclaw_append_node_require or _nemoclaw_write_recovered_proxy_env) you also
persist the repaired require entries back into /tmp/nemoclaw-proxy-env.sh (or
mark the file for rewrite and call _nemoclaw_write_recovered_proxy_env) before
allowing the ALREADY_RUNNING fast-path; locate this behavior around the code
referencing _PE_MISSING, _NEMOCLAW_CRITICAL_GUARDS_READY,
_nemoclaw_append_node_require and _nemoclaw_write_recovered_proxy_env and ensure
the on-disk file is updated when fixes are applied.
---
Nitpick comments:
In `@src/lib/agent/runtime-recovery-preload.test.ts`:
- Around line 188-210: Add a new test alongside the existing case in
runtime-recovery-preload.test.ts that uses runGuardRecovery to exercise the
branch where proxy-env.sh is metadata-safe (mode 0o444) but missing exactly one
packaged --require entry; in the beforeScript write a proxy-env.sh with correct
permissions (0o444) and all required lines except the specific --require that
should be injected, then assert result.status is 0 and result.stdout shows the
repaired --require present (e.g., contains `--require
${result.paths.tmpSafetyNet}`) while the persisted file (result.files.proxyEnv)
remains unchanged and does not contain that --require, also assert
files.proxyEnvMode is 0o444 and gatewayLog contains the expected "unsafe
mode"/"rebuilding from packaged preloads" messages and that
hostileProxyEnvSourced remains false to pin the contract that stdout may show a
repair while the on-disk file stays stale.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9cd38010-f9e0-4f90-8df4-dd709b5df4d6
📒 Files selected for processing (6)
src/lib/agent/runtime-hermes-secret-boundary-behavioural.test.tssrc/lib/agent/runtime-recovery-preload.test.tssrc/lib/agent/runtime-recovery-preload.tssrc/lib/agent/runtime.test.tssrc/lib/agent/runtime.tstest/helpers/runtime-recovery-preload-test-helpers.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/agent/runtime-recovery-preload.test.ts`:
- Around line 326-343: The test currently anchors ordering to the sentinel
string "_NEMOCLAW_CRITICAL_GUARDS_READY=1" which can drift; change the assertion
to locate the actual staged-repair call sites (search for the staged preload
invocation, e.g. the literal "_nemoclaw_stage_recovery_preload" or the guard
source paths via SAFETY_NET_GUARD.sourcePath and CIAO_GUARD.sourcePath) instead
of the sentinel, then assert those indices occur before the "_GW_CODE=" health
probe and before the ALREADY_RUNNING fast-path; leave the existing
source-before-sentinel validation to the other test and only assert the
repair-before-health/fast-path ordering here (use the same loop over
buildRecoveryScript/buildOpenClawRecoveryScript and replace repairIdx with the
index(es) of the actual preload/guard source calls).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7150a8cc-bf2f-4831-9398-6539b2689a5f
📒 Files selected for processing (3)
src/lib/agent/runtime-recovery-preload.test.tssrc/lib/agent/runtime-recovery-preload.tssrc/lib/agent/runtime.test.ts
💤 Files with no reviewable changes (1)
- src/lib/agent/runtime.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/agent/runtime-recovery-preload.ts
Selective E2E Results — ❌ Some jobs failedRun: 27404283063
|
## Summary Hardens recovery proxy-env handling so gateway and Hermes dashboard recovery inspect existing `/tmp/nemoclaw-proxy-env.sh` only for diagnostics, then source a freshly generated minimal recovery env. This prevents metadata-valid `/tmp` shell content from being executed during recovery while preserving the trusted guard-chain restoration path from #5321. ## Related Issue Refs #2701 Refs #2478 Follow-up to #5321 ## Changes - Regenerate and source a fresh minimal recovery proxy-env instead of sourcing pre-existing `/tmp/nemoclaw-proxy-env.sh` content. - Reuse the hardened guard recovery path for Hermes dashboard-only recovery before shell init and runtime-env validation. - Reject group/other-writable packaged preload sources in non-root recovery and add regression coverage for hostile proxy-env content. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Carlos Villela <cvillela@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Expanded recovery suites to verify rebuilt proxy envs never expose raw secrets, confirm absence of secret values in logs/output, add cases for safe rebuilding, guard-file repairs, unsafe-preload fail-closed behavior, and unified validate-then-source assertions. Added helpers for temp cleanup, waiting for markers, and for inspecting generated recovery env contents. E2E crash-loop tests now use bounded log-window checks. * **Refactor** * Hardened runtime recovery flow: earlier guard validation, adjusted shell init/sourcing order, and sourcing of a dedicated recovery env. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Julie Yaunches <jyaunches@nvidia.com>
Summary
This PR addresses the shared gateway recovery-state failure from #2701 by restoring the guard chain from trusted packaged preload sources when
/tmp/nemoclaw-proxy-env.shis missing or incomplete. Recovery now recreates a minimal proxy-env file for the critical safety-net and ciao preloads, validates exact--requireentries, and refuses an unguarded relaunch if trusted staging fails.Related Issue
Addresses the shared guard-chain recovery failure in #2701; does not close the broader hardware/provider/recreate-trigger validation matrix.
Refs #2701
Refs #2478
Supersedes #5259
Supersedes #5265
Changes
/usr/local/lib/nemoclaw/preloads/into hardened/tmpfiles.--requirematching, symlink replacement, missing trusted sources, and Hermes recovery harness integration.connect --probe-only/ sandbox-exec recovery after a pod-recreate-equivalent guard-chain wipe, with DGX Spark / GB10 / aarch64, provider breadth, and destructive recreate triggers intentionally left for dedicated platform validation.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: Carlos Villela cvillela@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests