Skip to content

fix(agent): regenerate proxy-env.sh guard chain during recovery (#2701)#5259

Closed
hunglp6d wants to merge 2 commits into
mainfrom
fix/nightly-e2e-recovery-guard-chain-regen-a3ae21b
Closed

fix(agent): regenerate proxy-env.sh guard chain during recovery (#2701)#5259
hunglp6d wants to merge 2 commits into
mainfrom
fix/nightly-e2e-recovery-guard-chain-regen-a3ae21b

Conversation

@hunglp6d

@hunglp6d hunglp6d commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Fixes #5262

✨ [AI-generated issue]

Summary

When /tmp/nemoclaw-proxy-env.sh is missing at recovery time (e.g. after a pod recreate on DGX Spark), the recovery script now scans for preload .js files that still exist on disk and regenerates a minimal proxy-env.sh with the corresponding NODE_OPTIONS --require entries. Previously, the recovery script logged a WARNING and launched the gateway naked, which triggers the @homebridge/ciao crash loop on aarch64 / DGX Spark (#2701).

Related Issue

Changes

  • src/lib/agent/runtime.ts (buildOpenClawRecoveryScript + buildRecoveryScript): Replace the warn-and-proceed branch when _PE_MISSING=1 with a regenerate-and-proceed branch that:
    1. Iterates over known preload script paths (/tmp/nemoclaw-*.js)
    2. Builds a NODE_OPTIONS string from whichever files are present
    3. Writes the regenerated /tmp/nemoclaw-proxy-env.sh with mode 444
    4. Sources it and re-checks the guard chain
    5. Falls through to the existing WARNING only if no preload scripts exist at all

Validation

Custom-e2e validation was not run — the available token lacks the workflow scope required to push .github/workflows/custom-e2e.yaml. Manual validation: run issue-2478-crash-loop-recovery-e2e on this branch.

  • Original failing run: 27386272836 on a3ae21b6eed8399099fd390bd45ad43e78218258
  • Targeted job: issue-2478-crash-loop-recovery-e2e / run (#80933852556)

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes

AI Disclosure

  • AI-assisted — tool: Claude Code (nemoclaw-diagnosis skill)

Signed-off-by: Hung Le hple@nvidia.com

hunglp6d added 2 commits June 12, 2026 01:08
Commit 27ae4c3 extracted patchStagedDockerfile() from onboard.ts into
sandbox-dockerfile-patch-flow.ts but dropped the design-intent comment
that documents why darwinVmCompat=false for Docker builds. The
openshell-gateway-upgrade-e2e test greps for this comment in onboard.ts
as a design guard. Restore the comment in the new file and update the
test to grep the correct path.

Signed-off-by: Hung Le <hple@nvidia.com>
When /tmp/nemoclaw-proxy-env.sh is missing at recovery time (e.g. after
a pod recreate), the recovery script now scans for preload .js files
that still exist on disk and regenerates a minimal proxy-env.sh with
the corresponding NODE_OPTIONS --require entries. This prevents the
@homebridge/ciao crash loop on aarch64 / DGX Spark where the gateway
respawns without the ciao-network-guard preload and hits an unhandled
os.networkInterfaces() exception.

Both the OpenClaw and non-OpenClaw agent recovery paths are fixed.

Signed-off-by: Hung Le <hple@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 064ebcf3-b9c5-4742-942b-2055eac341d4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/nightly-e2e-recovery-guard-chain-regen-a3ae21b

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: gateway-guard-recovery
Optional E2E: issue-2478-crash-loop-recovery-e2e, sandbox-survival-e2e, hermes-e2e-vitest

Dispatch hint: gateway-guard-recovery

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

Optional E2E

New E2E recommendations

  • None.

Dispatch hint

  • Workflow: .github/workflows/e2e-vitest-scenarios.yaml
  • jobs input: gateway-guard-recovery

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: gateway-guard-recovery
Optional Vitest E2E scenarios: ubuntu-repo-docker-post-reboot-recovery

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field jobs=gateway-guard-recovery

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

Optional Vitest E2E scenarios

  • ubuntu-repo-docker-post-reboot-recovery: Adjacent live-supported typed scenario that exercises OpenClaw lifecycle recovery through the registry-driven Vitest path; useful as broader coverage beyond the targeted guard-chain recovery job.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-docker-post-reboot-recovery

Relevant changed files

  • src/lib/agent/runtime.ts

@github-actions

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 4 needs attention, 2 worth checking, 0 nice ideas
Top item: Regenerate guard chain from trusted packaged preloads, not unverified /tmp leftovers

Review findings

🛠️ Needs attention

  • Recovery still launches naked when the linked pod-recreate scenario wipes all guard files (src/lib/agent/runtime.ts:208): The new branch only regenerates /tmp/nemoclaw-proxy-env.sh from preload scripts that still exist under /tmp. The linked [DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701 repro explicitly says pod recreate creates a fresh container where /tmp/nemoclaw-proxy-env.sh and the guard JavaScript files are gone. In that case _REGEN_OPTS stays empty, the script logs a warning, and recovery continues to launch the gateway without the safety-net/ciao preloads, preserving the crash loop this PR is meant to fix.
  • Recovery trusts predictable /tmp preload files as gateway code without provenance checks (src/lib/agent/runtime.ts:208): The regenerated NODE_OPTIONS chain is built from predictable files under /tmp using only `[ -f ]`, which follows symlinks and does not verify root ownership, mode, non-writability, or trusted content. Because these files are loaded with `--require` by the relaunched gateway, a sandbox-controlled or replaced /tmp file can cross the trusted-code boundary.
    • Recommendation: Do not rebuild the gateway preload chain from unverified /tmp artifacts. Prefer copying from immutable packaged preload sources using the same trust-boundary installer as startup. If any /tmp files are reused, verify regular-file, non-symlink, owner, mode, and non-writability before adding them to NODE_OPTIONS.
    • Evidence: The new code appends `--require $_f` for each `[ -f "$_f" ]` path including `/tmp/nemoclaw-sandbox-safety-net.js` and `/tmp/nemoclaw-ciao-network-guard.js`. Nearby `scripts/lib/sandbox-init.sh` documents /tmp sourced files as a cross-user trust boundary, while the new recovery branch performs no comparable validation.
  • proxy-env regeneration writes a sourced trust-boundary file with raw redirection (src/lib/agent/runtime.ts:208): The recovery path writes `/tmp/nemoclaw-proxy-env.sh` using `> /tmp/nemoclaw-proxy-env.sh && chmod 444`, which follows symlinks and is not atomic. That diverges from the repository's existing trust-boundary helpers and from the no-follow care already used in this file for gateway logs.
    • Recommendation: Write the regenerated proxy-env file through a safe temp-file plus rename/no-follow helper with the expected ownership and permissions, or invoke the shared sandbox sourced-file installer. Reject symlinked or otherwise unsafe destinations.
    • Evidence: The new branch uses `printf ... > /tmp/nemoclaw-proxy-env.sh && chmod 444 /tmp/nemoclaw-proxy-env.sh`. `scripts/lib/sandbox-init.sh` says files sourced across user boundaries must be root-owned 444 in root mode and are written via `emit_sandbox_sourced_file` to avoid rm/recreate and symlink races.
  • Implementation does not follow the linked issue's source-of-truth recovery design (src/lib/agent/runtime.ts:208): Issue [DGX Spark] Host reboot bricks sandbox until 5-min rebuild --yes: connect recovery path warns about missing /tmp guards but launches gateway naked → @homebridge/ciao crash loop #2701's remaining scope calls for installing from `/usr/local/lib/nemoclaw/preloads/` into `/tmp` with correct ownership/mode and emitting proxy-env dynamically. This PR instead reconstructs proxy-env from whichever /tmp files happen to remain, so it handles only a narrower invalid state and leaves the real source boundary unresolved.

🔎 Worth checking

  • Source-of-truth review needed: Gateway guard-chain recovery fallback in `buildOpenClawRecoveryScript` and `buildRecoveryScript`: The advisor marked localized patch analysis as missing.
  • New recovery and trust-boundary behavior lacks targeted unit coverage (src/lib/agent/runtime.ts:208): No tests are added for the new regeneration path, especially negative cases around missing all guard files, symlinked proxy-env, symlinked or writable preload candidates, partial chains, and preserving trusted permissions. Existing tests cover [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478 warning/refusal shape, but not this new fallback.
    • Recommendation: Add deterministic unit tests for the recovery script shape and negative paths before relying on E2E coverage. Include security cases for symlink/tamper handling and correctness cases for all-files-wiped recovery.
    • Evidence: The PR changes only `src/lib/agent/runtime.ts`. Existing `test/e2e-scenario/live/gateway-guard-recovery.test.ts` models wiping proxy-env and all guard files, while this diff adds no corresponding unit coverage for the new branch.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — buildOpenClawRecoveryScript restores guard chain from packaged preload sources when proxy-env and all /tmp guard files are absent. The changed code is deterministic script generation and should have unit tests, but because it affects sandbox recovery and trusted-code boundaries, runtime validation of the all-/tmp-wiped recovery path is also important.
  • **Runtime validation** — buildOpenClawRecoveryScript refuses or repairs rather than launching naked when no trusted preload scripts can be installed. The changed code is deterministic script generation and should have unit tests, but because it affects sandbox recovery and trusted-code boundaries, runtime validation of the all-/tmp-wiped recovery path is also important.
  • **Runtime validation** — buildOpenClawRecoveryScript rejects symlinked /tmp/nemoclaw-proxy-env.sh during regeneration. The changed code is deterministic script generation and should have unit tests, but because it affects sandbox recovery and trusted-code boundaries, runtime validation of the all-/tmp-wiped recovery path is also important.
  • **Runtime validation** — buildOpenClawRecoveryScript ignores symlinked or sandbox-writable /tmp/nemoclaw-*.js preload candidates. The changed code is deterministic script generation and should have unit tests, but because it affects sandbox recovery and trusted-code boundaries, runtime validation of the all-/tmp-wiped recovery path is also important.
  • **Runtime validation** — buildRecoveryScript fails before launch when regenerated NODE_OPTIONS lacks the ciao preload. The changed code is deterministic script generation and should have unit tests, but because it affects sandbox recovery and trusted-code boundaries, runtime validation of the all-/tmp-wiped recovery path is also important.
  • **New recovery and trust-boundary behavior lacks targeted unit coverage** — Add deterministic unit tests for the recovery script shape and negative paths before relying on E2E coverage. Include security cases for symlink/tamper handling and correctness cases for all-files-wiped recovery.
  • **Acceptance clause:** Issue [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478: `@homebridge/ciao` calls `os.networkInterfaces()` during init; inside the OpenShell sandbox the syscall fails with EPERM and the unhandled rejection takes the gateway down. — add test evidence or identify existing coverage. The PR continues to require the `nemoclaw-ciao-network-guard` marker after sourcing/regeneration, but only restores that guard if `/tmp/nemoclaw-ciao-network-guard.js` already exists.
  • **Acceptance clause:** Issue [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478: Workaround that worked: override `os.networkInterfaces` before ciao loads via a NODE_OPTIONS preload. — add test evidence or identify existing coverage. The diff regenerates a minimal `export NODE_OPTIONS="--require ..."` file from existing /tmp preload scripts, but it does not reinstall the preload from a durable source when /tmp is empty.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@cv

cv commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Closing this in favor of #5321. The replacement keeps the missing proxy-env.sh recovery path, but restores the guard chain from trusted packaged preloads, writes the recovered proxy env via staged file, validates exact --require entries, and fails closed if trusted staging is unavailable. Thanks for chasing the #2701 path.

@cv cv closed this Jun 12, 2026
@hunglp6d

Copy link
Copy Markdown
Contributor Author

@cv Thanks, this is AI-generated PR, will close the associated issue.

@hunglp6d hunglp6d deleted the fix/nightly-e2e-recovery-guard-chain-regen-a3ae21b branch June 12, 2026 14:15
cv added a commit that referenced this pull request Jun 12, 2026
## Summary
This PR addresses the shared gateway recovery-state failure from #2701
by restoring the guard chain from trusted packaged preload sources when
`/tmp/nemoclaw-proxy-env.sh` is missing or incomplete. Recovery now
recreates a minimal proxy-env file for the critical safety-net and ciao
preloads, validates exact `--require` entries, and refuses an unguarded
relaunch if trusted staging fails.

## Related Issue
Addresses the shared guard-chain recovery failure in #2701; does not
close the broader hardware/provider/recreate-trigger validation matrix.
Refs #2701
Refs #2478
Supersedes #5259
Supersedes #5265

## Changes
- Added a shared recovery helper that stages safety-net and ciao
preloads from `/usr/local/lib/nemoclaw/preloads/` into hardened `/tmp`
files.
- Wired OpenClaw and agent-specific gateway recovery through the helper
instead of the old warning-only missing-proxy-env path.
- Added shell-executed tests for missing proxy-env restoration, exact
`--require` matching, symlink replacement, missing trusted sources, and
Hermes recovery harness integration.
- Documented the live E2E coverage scope: production `connect
--probe-only` / sandbox-exec recovery after a pod-recreate-equivalent
guard-chain wipe, with DGX Spark / GB10 / aarch64, provider breadth, and
destructive recreate triggers intentionally left for dedicated platform
validation.

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Carlos Villela <cvillela@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Recovery scripts now stage and validate trusted preload modules,
restore guard chains early, and hard-fail if critical guards remain
missing.
* Added test harness helpers to simulate trusted preload sources and
rewrite recovery scripts for E2E tests.

* **Bug Fixes**
* Hardened proxy-env handling: reject symlinks, enforce strict
permissions/ownership, avoid sourcing attacker-controlled content, and
prevent duplicate/unsafe preload entries.

* **Tests**
* Expanded E2E and unit coverage with ordering, permission/symlink
scenarios, logging and PID-stability assertions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nightly-e2e: gateway recovery does not regenerate proxy-env.sh guard chain (#2701)

3 participants