Skip to content

test(e2e): accept recovered proxy env#5365

Closed
jyaunches wants to merge 1 commit into
mainfrom
test/2478-recovered-proxy-env-hypothesis
Closed

test(e2e): accept recovered proxy env#5365
jyaunches wants to merge 1 commit into
mainfrom
test/2478-recovered-proxy-env-hypothesis

Conversation

@jyaunches

@jyaunches jyaunches commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Tests the hypothesis from nightly run 27435780434 that issue-2478-crash-loop-recovery-e2e is failing because PR #5321 now restores a minimal hardened /tmp/nemoclaw-proxy-env.sh, while the bash test still expects to overwrite it with the original byte-identical snapshot.

This keeps the #2701 contract assertion intact, but after the negative-case recovery it accepts either:

  • byte-identical restoration of the original proxy env snapshot, or
  • the already-recovered proxy env when gateway_guards_active proves the safety-net + ciao guards are active.

Validation

  • bash -n test/e2e/test-issue-2478-crash-loop-recovery.sh
  • git diff --check
  • pre-commit hooks on commit, including shellcheck

Selective nightly run will be dispatched for issue-2478-crash-loop-recovery-e2e against this branch.

Refs #2478 #2701 #5321

Summary by CodeRabbit

  • Tests
    • Improved crash loop recovery testing with more flexible validation logic. The test now better handles edge cases during recovery scenarios while maintaining verification of critical guard mechanisms.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7a748329-4172-4f5b-9293-bdb68d83a2b3

📥 Commits

Reviewing files that changed from the base of the PR and between 7f583ed and 658b6e5.

📒 Files selected for processing (1)
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

📝 Walkthrough

Walkthrough

The test's negative-case proxy-env restoration validation now conditionally accepts recovered environments when the gateway guard chain is active, rather than requiring byte-identical file restoration after recovery.

Changes

Proxy-env Restoration Validation

Layer / File(s) Summary
Proxy-env restoration with conditional guard chain validation
test/e2e/test-issue-2478-crash-loop-recovery.sh
The negative-case restoration assertion compares restored file size against the snapshot. If they match, it logs success. If they differ, it checks whether the gateway guard chain is active; if guards are already active, the contract is satisfied and recovery is logged as acceptable, otherwise the test fails with a detailed mismatch message.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#5321: Updates the issue #2478 e2e recovery test to accept recovered guard environments when the gateway guard chain is active, aligning with proxy-env and guard-chain restoration behavior in the recovery test harness.

Suggested labels

nightly-e2e, area: e2e, bug-fix

Suggested reviewers

  • cv
  • sandl99

Poem

🐰 The proxy-env now knows when to bend,
Not byte-for-byte, but guard-chain friend,
If guards are active, all is well,
Recovery's contract it can tell! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title concisely and clearly describes the main change: modifying the e2e test to accept recovered proxy environment files under certain conditions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/2478-recovered-proxy-env-hypothesis

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: issue-2478-crash-loop-recovery-e2e

Dispatch hint: issue-2478-crash-loop-recovery-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. No required E2E: this PR changes only an existing E2E test script and cannot affect NemoClaw runtime or user flows unless the test is manually dispatched.

Optional E2E

  • issue-2478-crash-loop-recovery-e2e (high; real OpenShell sandbox with NVIDIA_API_KEY/GITHUB_TOKEN and 30 minute timeout): Optional validation that the modified E2E script still runs correctly and continues covering gateway recovery, guard-chain restoration, and inference availability after crash-loop recovery. This is useful confidence for the test change but not merge-blocking because no runtime code changed.

New E2E recommendations

  • None.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: issue-2478-crash-loop-recovery-e2e

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: None
Optional Vitest E2E scenarios: None

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • None. Changed file is a legacy shell E2E test under test/e2e/ and does not affect the Vitest scenario workflow, registry, runtime support, live tests, or shared test/e2e-scenario fixtures. No Vitest scenario dispatch is required.

Optional Vitest E2E scenarios

  • None.

Relevant changed files

  • None.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27438172279
Target ref: test/2478-recovered-proxy-env-hypothesis
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 cancelled, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 0 still apply, 1 new item found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: Phase 4 recovered `/tmp/nemoclaw-proxy-env.sh` tolerance: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The diff adds comments explaining recovered minimal hardened proxy env and changes restore verification from strict size equality to accepting `gateway_guards_active "$NEGATIVE_PID" 5`; however, that helper can use stale gateway log markers.
  • Recovered-env fallback can pass on stale guard log evidence (test/e2e/test-issue-2478-crash-loop-recovery.sh:549): The new fallback accepts a non-byte-identical recovered proxy env when `gateway_guards_active "$NEGATIVE_PID" 5` passes. That helper checks that `/tmp/nemoclaw-proxy-env.sh` contains the expected guard strings, that `/tmp/gateway.log` contains guard marker text, and that some gateway PID exists, but it does not prove those log markers were emitted by `NEGATIVE_PID` or after this recovery step. Earlier phases already produce matching guard markers, so this branch could accept a recovered/restore state using stale log evidence.
    • Recommendation: Before accepting the recovered-env branch, capture a fresh recovery boundary, such as truncating/capturing the gateway log offset or recording a start marker, then require guard preload markers emitted after that boundary. If possible, also make the helper verify the current gateway process/start rather than only any live gateway PID.
    • Evidence: The changed branch falls back from size equality to `elif gateway_guards_active "$NEGATIVE_PID" 5; then`, while `gateway_guards_active` greps `/tmp/gateway.log` for marker strings and calls `gateway_pid` independently rather than binding evidence to the supplied PID.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Recovered root-owned `/tmp/nemoclaw-proxy-env.sh` cannot be overwritten by the sandbox user, but the next `connect --probe-only` recovery launches a guarded gateway and inference remains available.. The change is test-only and cannot directly affect runtime behavior, but it changes the oracle for a security-relevant sandbox recovery E2E. Confidence would improve if the recovered-env path required fresh evidence from the current recovery rather than historical log signatures.
  • **Runtime validation** — `gateway_guards_active` fails when proxy env contains guard strings but the current recovered gateway did not consume NODE_OPTIONS.. The change is test-only and cannot directly affect runtime behavior, but it changes the oracle for a security-relevant sandbox recovery E2E. Confidence would improve if the recovered-env path required fresh evidence from the current recovery rather than historical log signatures.
  • **Runtime validation** — Recovered-env fallback requires guard preload markers emitted after the negative-case recovery boundary, not markers left in `/tmp/gateway.log` from earlier phases.. The change is test-only and cannot directly affect runtime behavior, but it changes the oracle for a security-relevant sandbox recovery E2E. Confidence would improve if the recovered-env path required fresh evidence from the current recovery rather than historical log signatures.
  • **Recovered-env fallback can pass on stale guard log evidence** — Before accepting the recovered-env branch, capture a fresh recovery boundary, such as truncating/capturing the gateway log offset or recording a start marker, then require guard preload markers emitted after that boundary. If possible, also make the helper verify the current gateway process/start rather than only any live gateway PID.
  • **Acceptance clause:** PR body: "accepts either: the already-recovered proxy env when `gateway_guards_active` proves the safety-net + ciao guards are active." — add test evidence or identify existing coverage. The diff adds the recovered-env branch, but the proof can be stale because `gateway_guards_active` does not tie log signatures to the current recovered gateway or to a fresh log boundary.
  • **Phase 4 recovered `/tmp/nemoclaw-proxy-env.sh` tolerance** — The existing bash Phase 4/5 still exercises missing proxy env, recovery, guard-chain presence, soak stability, and inference availability; the Vitest live test `test/e2e-scenario/live/gateway-guard-recovery.test.ts` also covers empty `/tmp` plus missing gateway process and requires restored guards.. The diff adds comments explaining recovered minimal hardened proxy env and changes restore verification from strict size equality to accepting `gateway_guards_active "$NEGATIVE_PID" 5`; however, that helper can use stale gateway log markers.
Since last review details

Current findings:

  • Source-of-truth review needed: Phase 4 recovered `/tmp/nemoclaw-proxy-env.sh` tolerance: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The diff adds comments explaining recovered minimal hardened proxy env and changes restore verification from strict size equality to accepting `gateway_guards_active "$NEGATIVE_PID" 5`; however, that helper can use stale gateway log markers.
  • Recovered-env fallback can pass on stale guard log evidence (test/e2e/test-issue-2478-crash-loop-recovery.sh:549): The new fallback accepts a non-byte-identical recovered proxy env when `gateway_guards_active "$NEGATIVE_PID" 5` passes. That helper checks that `/tmp/nemoclaw-proxy-env.sh` contains the expected guard strings, that `/tmp/gateway.log` contains guard marker text, and that some gateway PID exists, but it does not prove those log markers were emitted by `NEGATIVE_PID` or after this recovery step. Earlier phases already produce matching guard markers, so this branch could accept a recovered/restore state using stale log evidence.
    • Recommendation: Before accepting the recovered-env branch, capture a fresh recovery boundary, such as truncating/capturing the gateway log offset or recording a start marker, then require guard preload markers emitted after that boundary. If possible, also make the helper verify the current gateway process/start rather than only any live gateway PID.
    • Evidence: The changed branch falls back from size equality to `elif gateway_guards_active "$NEGATIVE_PID" 5; then`, while `gateway_guards_active` greps `/tmp/gateway.log` for marker strings and calls `gateway_pid` independently rather than binding evidence to the supplied PID.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27438652246
Target ref: test/2478-recovered-proxy-env-hypothesis
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 cancelled, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

@cv

cv commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

There's a follow-up here #5342 that is related to this as well.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27444194176
Target ref: test/2478-recovered-proxy-env-hypothesis
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 cancelled, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

@cv cv added the v0.0.65 Release target label Jun 13, 2026
@cv

cv commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Closing as superseded by #5342, which folded in the recovered proxy-env test update, tightened the guard-log proof to use fresh recovery boundaries, and has a passing issue-2478 crash-loop recovery E2E in run 27449849361.

@cv cv closed this Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.65 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants