fix(recover): require sustained gateway serving after recovery (#4710) #5182

Merged
cv merged 7 commits into main from fix/4710-recover-settle-window
Jun 13, 2026

Conversation

@ericksoa

@ericksoa ericksoa commented Jun 10, 2026

Summary

nemoclaw <name> recover declared success on a single gateway health probe. A wedged gateway (#4710) serves for ~20 seconds after logging ready and then drops its HTTP listener while the process stays alive — so recover reported a healthy gateway that was already on its way back to the wedge. This PR makes recovery require sustained serving and surfaces the wedge signature when it recurs.

Related Issue

Related: #4710 (host-side hardening; helps existing sandboxes without an image rebuild — complements the sandbox-side fix).

Changes

  • waitForRecoveredSandboxGateway (src/lib/actions/sandbox/process-recovery.ts): after the first successful probe, wait a settle window (NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS, default 25s, 0 disables) and require a confirm probe to still succeed before declaring recovery. Prints a one-line progress note so the wait doesn't read as a hang.
  • New collectGatewayWedgeDiagnostics: on confirm failure, greps /tmp/gateway.log for the wedge signature (config change requires gateway restart / gateway startup failed / Process will stay alive) and prints the matching lines so the operator sees why the gateway is unreachable despite a live PID.
  • Both functions take injectable probe/sleep/exec seams; tests cover settle-confirm success, the serve-then-drop wedge, the 0-disable path, initial polling interplay, and diagnostics extraction/edge cases.
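
The settle-confirm flow in the first bullet can be pictured with a minimal sketch. This is not the code from process-recovery.ts: the function name `waitForSustainedServing`, the option names, and the millisecond units are assumptions made for illustration. Only the injectable probe/sleep idea, the confirm probe after a settle window, and the "0 disables" rule come from the description above.

```typescript
// Hypothetical seam types; the real signatures in process-recovery.ts may differ.
type Probe = (sandboxName: string) => boolean | Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

interface SettleOptions {
  probe: Probe;           // health probe seam (an HTTP check in the real CLI)
  sleep: Sleep;           // sleep seam so tests do not wait in real time
  maxAttempts: number;    // initial polling budget, expressed as probe attempts
  pollIntervalMs: number; // delay between failed initial probes
  settleMs: number;       // settle window; 0 skips the confirm probe
}

async function waitForSustainedServing(
  sandboxName: string,
  opts: SettleOptions,
): Promise<boolean> {
  // Phase 1: poll until the gateway answers once or the budget runs out.
  let served = false;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    if (await opts.probe(sandboxName)) {
      served = true;
      break;
    }
    if (attempt < opts.maxAttempts) {
      await opts.sleep(opts.pollIntervalMs);
    }
  }
  if (!served) return false;

  // Phase 2: a settle window of 0 keeps the old single-probe behavior.
  if (opts.settleMs <= 0) return true;

  // Phase 3: wait out the settle window, then require a second success.
  await opts.sleep(opts.settleMs);
  return Boolean(await opts.probe(sandboxName));
}
```

With a probe that succeeds once and then fails, this sketch returns false after exactly two probe calls, which is the serve-then-drop shape the tests are described as covering.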

Architecture

  • Health means proven serving, not existence. Recovery already moved from pgrep to HTTP probes (#2342, the Brev Launchable issue where the OpenClaw Gateway Dashboard showed "Version n/a" and "Health Offline"); the settle-confirm extends that same principle from point-in-time proof to sustained proof, matched to the wedge's ~20s time constant. Success criteria stay aligned with what the operator actually needs to be true.
  • Monolith reduction. The wedge diagnostics live in a focused gateway-wedge-diagnostics module rather than growing the high-churn process-recovery.ts lifecycle monolith (the deterministic growth gate's direction). The sandbox exec is passed explicitly, keeping the import graph acyclic.
  • Trust boundaries. Matched gateway-log lines are sandbox-writable bytes: control characters are neutralized and credential shapes redacted before anything reaches an operator terminal.
  • Workaround contract. The module header documents the invalid OpenClaw state, the upstream source boundary, and the removal condition. This PR is detection/confirmation only and is deliberately scoped as the host-side complement to the sandbox-side fix (#5181, "fix(sandbox): pin gateway.reload=hot and add gateway serving watchdog (#4710)"); it does not claim the original #4710 HEALTHCHECK/marker acceptance (the Ubuntu 24.04 Docker-driver HEALTHCHECK issue).
  • Seams and parity. Probe/sleep/exec are injectable in the established execImpl style, and tests run against the compiled dist/ like the rest of the CLI suite, with the CLI-level settle test isolated in a focused file under the size budgets.

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npx prek run --all-files passes — except the test-cli hook's two pre-existing, machine-bound failures that reproduce identically on a clean main checkout on this host (live e2e scenario ubuntu-repo-cloud-openclaw, timing-sensitive auto-pair keepalive retry); all other hooks pass.
  • npm test passes for the touched suites (src/lib/actions/sandbox/process-recovery.test.ts, test/process-recovery.test.ts).
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes — env var follows the existing undocumented NEMOCLAW_GATEWAY_RECOVERY_WAIT_SECONDS/_POLL_INTERVAL_SECONDS convention in the same function; happy to add a reference entry if preferred.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
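
As an illustration of the settle setting mentioned in the description (default 25 seconds, 0 disables), the variable could be resolved roughly as follows. The helper name `resolveSettleSeconds` and the fallback to the default on malformed or negative input are assumptions; the PR text does not say how invalid values are handled. The environment is passed in as a plain object so the sketch stays self-contained.

```typescript
// Default taken from the PR description; everything else here is a hypothetical sketch.
const DEFAULT_SETTLE_SECONDS = 25;

function resolveSettleSeconds(env: Record<string, string | undefined>): number {
  const raw = env["NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS"];
  // Unset or blank: use the documented default.
  if (raw === undefined || raw.trim() === "") return DEFAULT_SETTLE_SECONDS;
  const parsed = Number(raw);
  // Assumption: malformed or negative values fall back to the default.
  if (!Number.isFinite(parsed) || parsed < 0) return DEFAULT_SETTLE_SECONDS;
  // 0 is a valid value and means "skip the confirm probe".
  return parsed;
}
```

A caller in a Node process would pass `process.env`; a test can pass a literal object instead.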

Summary by CodeRabbit

  • New Features

    • Sandbox recovery now prints sanitized gateway "wedge" diagnostics on failure and supports an optional post-recovery "settle" confirmation (configurable; can be disabled).
  • Tests

    • Expanded coverage for settle-confirm behavior, polling/timeouts, intermittent listener-drop scenarios, diagnostic extraction/sanitization, and a CLI test path that disables the settle delay for faster runs.

A wedged in-sandbox OpenClaw gateway serves for ~20 seconds after logging
ready and then drops its HTTP listener while the process stays alive (a
failed in-process restart triggered by a post-launch config write). The
recovery wait declared success on a single health probe inside that
window, so 'nemoclaw <name> recover' reported a healthy gateway that was
already on its way back to the wedge.

After the first successful probe, wait out a settle window
(NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS, default 25, 0 disables) and
require a confirm probe to still succeed before declaring recovery. On
confirm failure, surface the #4710 wedge signature from /tmp/gateway.log
(config-reload restart, gateway startup failed, process-will-stay-alive)
so the operator sees why the gateway is unreachable despite a live PID.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
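
The wedge-signature lookup described in the commit message above can be sketched as a pure function over log text. The three phrases come from the PR description and the cap of five lines from the review walkthrough; the function name `extractWedgeLines`, the case-insensitive matching, and the choice to keep the most recent matches are assumptions. The real helper runs grep inside the sandbox rather than reading a string.

```typescript
// Signature phrases quoted in the PR description; matching rules are assumed.
const WEDGE_PATTERNS: RegExp[] = [
  /config change requires gateway restart/i,
  /gateway startup failed/i,
  /Process will stay alive/i,
];

function extractWedgeLines(logText: string, maxLines = 5): string[] {
  if (maxLines <= 0) return [];
  const matches = logText
    .split(/\r?\n/)
    .filter((line) => WEDGE_PATTERNS.some((pattern) => pattern.test(line)));
  // Assumption: when there are more matches than the cap, keep the latest ones.
  return matches.slice(-maxLines);
}
```

An empty result means the log shows no sign of the wedge, in which case a caller would print nothing extra.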
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Exports waitForRecoveredSandboxGateway (injectable probe/sleep/quiet), adds an optional settle-window re-check after initial probe success, implements collect/print gateway wedge diagnostics with sanitization, integrates diagnostics into recovery/connect failure paths, and adds tests and CLI coverage.

Changes

Sandbox Recovery Enhancements

| Layer / File(s) | Summary |
| --- | --- |
| Export and make waitForRecoveredSandboxGateway testable (src/lib/actions/sandbox/process-recovery.test.ts, src/lib/actions/sandbox/process-recovery.ts) | Function is exported and accepts optional probeImpl, sleepImpl, and quiet parameters; tests updated to import the new export. |
| Settle-window confirmation logic and tests (src/lib/actions/sandbox/process-recovery.ts, src/lib/actions/sandbox/process-recovery.test.ts) | After an initial successful probe, reads NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS, optionally logs (unless quiet), sleeps the settle duration, and re-probes to confirm stability. Tests cover settle timing, listener drop after recovery, skipping via 0, polling through initial failures, and timeout on exhausted wait budget. |
| Wedge diagnostics implementation and tests (src/lib/actions/sandbox/gateway-wedge-diagnostics.ts, src/lib/actions/sandbox/gateway-wedge-diagnostics.test.ts) | Adds sanitizeWedgeLogLine, collectGatewayWedgeDiagnostics, and printGatewayWedgeDiagnostics that grep /tmp/gateway.log, return/sanitize up to five matching lines, and print a header plus sanitized lines; tests validate extraction, empty results, null exec handling, and redaction/sanitization. |
| Integrate diagnostics into recovery failure and connect probe (src/lib/actions/sandbox/process-recovery.ts, src/lib/actions/sandbox/connect.ts) | On gateway recovery timeout, call waitForRecoveredSandboxGateway(sandboxName, { quiet }); if unresponsive, invoke printGatewayWedgeDiagnostics(sandboxName, executeSandboxExecCommand) before existing guidance. connect also prints wedge diagnostics when probe-only/quiet recovery fails. |
| CLI tests and helpers for settle behavior (test/cli/connect-recovery-settle.test.ts, test/cli/helpers.ts) | Adds a CLI test that simulates a wedged gateway (one-time listener drop), asserts exit code and expected messages including the #4710 wedge signature, and sets NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS="0" in test helper for faster CLI test runs. |

Sequence Diagram

sequenceDiagram
  participant CLI as connect --probe-only / caller
  participant Wait as waitForRecoveredSandboxGateway
  participant Probe as isSandboxGatewayRunning (probeImpl)
  participant Timer as settle window (sleepImpl)
  participant Diag as printGatewayWedgeDiagnostics
  CLI->>Wait: trigger recovery probe
  Wait->>Probe: probeImpl(sandboxName)
  Probe-->>Wait: true/false
  alt initial success
    Wait->>Timer: sleep(settleSeconds)
    Timer-->>Wait: (wake)
    Wait->>Probe: probeImpl(sandboxName) (re-probe)
    Probe-->>Wait: false
    Wait->>Diag: printGatewayWedgeDiagnostics(sandboxName, exec)
  else never responsive
    Wait->>Diag: printGatewayWedgeDiagnostics(sandboxName, exec)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

bug-fix, v0.0.63

Suggested reviewers

  • cv
  • prekshivyas

Poem

🐰 I hop through logs where sleepy gateways dream,

I time the settles, watch the probes' soft gleam.
When listeners vanish in the midnight fog,
I sniff the wedge lines, and bring them to the log.
Cheers for fixes found beneath the shell's faint beam.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: requiring sustained gateway serving after recovery by implementing a settle-window confirmation before declaring recovery success. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@github-actions

github-actions Bot commented Jun 10, 2026

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 10, 2026

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: None
Optional Vitest E2E scenarios: None

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 10, 2026

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Top item: PR review advisor unavailable

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • PR review advisor unavailable: The automated advisor could not complete: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt
    • Recommendation: Re-run the PR Review Advisor or perform a manual review.
    • Evidence: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation**: Add or identify targeted runtime/integration validation for the changed behavior; do not report external E2E job pass/fail here. Runtime/sandbox/infrastructure paths need behavioral runtime validation: src/lib/actions/sandbox/connect.ts, src/lib/actions/sandbox/gateway-wedge-diagnostics.ts, src/lib/actions/sandbox/process-recovery.ts.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 27311911574
Target ref: f4b7c0837db93f426a7920da08c1506c4105df9d
Workflow ref: main
Requested jobs: sandbox-operations-e2e,sandbox-survival-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success

The probe-only connect path runs recovery with quiet=true and prints its
own failure summary, so the #4710 wedge signature added to the non-quiet
recovery path never reached the operator there. Extract a shared
printGatewayWedgeDiagnostics helper and call it from the probe-only
failure path too.

CLI test fallout from the settle-confirm: runWithEnv now disables the
25s settle by default (NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS=0) so the
existing connect-recovery tests stay fast — settle behavior keeps its
dedicated unit coverage — and a new focused CLI suite drives the full
wedge shape (serve once, then refuse) through 'connect --probe-only'
with a short settle window, asserting the failure exit, the
wedge-signature output, and that the confirm probe ran.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
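
The wedge shape that the new CLI suite drives (serve once, then refuse) can be modelled at unit level with a small test double. The factory below is a hypothetical illustration: the name `makeWedgedProbe` and its return shape are not from the PR, and the CLI suite itself exercises the real command rather than an in-process fake like this one.

```typescript
// Hypothetical test double: answers healthy a fixed number of times, then refuses.
interface WedgedProbe {
  probe: () => boolean;
  callCount: () => number;
}

function makeWedgedProbe(successesBeforeDrop = 1): WedgedProbe {
  let calls = 0;
  return {
    probe: () => {
      calls += 1;
      // The listener "drops" once the allowed successes are used up.
      return calls <= successesBeforeDrop;
    },
    // Lets a test assert that the confirm probe actually ran.
    callCount: () => calls,
  };
}
```

Exposing the call count mirrors the assertion described above that the confirm probe ran, not just that recovery failed.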

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/cli/helpers.ts`:
- Around line 174-178: Update the inline comment above
NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS to reference the correct test filename:
replace "connect-recovery.test.ts" with "connect-recovery-settle.test.ts" so the
comment correctly points to the targeted CLI test that overrides the settle
window.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4a211809-b5cb-4117-a80f-44c60536addb

📥 Commits

Reviewing files that changed from the base of the PR and between f4b7c08 and 2950317.

📒 Files selected for processing (4)
  • src/lib/actions/sandbox/connect.ts
  • src/lib/actions/sandbox/process-recovery.ts
  • test/cli/connect-recovery-settle.test.ts
  • test/cli/helpers.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/actions/sandbox/process-recovery.ts

Comment thread test/cli/helpers.ts
…lines (#4710)

PR Review Advisor follow-ups on #5182:

- Move the wedge-diagnostics helpers out of the high-churn
  process-recovery.ts monolith into a focused gateway-wedge-diagnostics
  module with the sandbox exec passed explicitly (keeps the import graph
  acyclic) and document the source-of-truth contract there: the invalid
  OpenClaw park-alive state, the upstream source boundary, and the removal
  condition for this detection.
- The printed lines come from a sandbox-writable log, so treat them as
  untrusted: strip terminal control characters and redact common
  credential shapes (bearer headers, key/token/secret/password
  assignments, nvapi- keys) before they reach the operator's terminal.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
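
The sanitization step in the second bullet of the commit message above could look roughly like the sketch below. The regular expressions, the `[REDACTED]` placeholder, and the name `sanitizeLogLineSketch` are assumptions chosen for illustration; the real sanitizeWedgeLogLine may use different patterns. The categories handled (terminal control characters, bearer headers, key/token/secret/password assignments, nvapi- keys) are the ones the commit message lists.

```typescript
// Hypothetical sketch: neutralize untrusted log bytes before printing them.
function sanitizeLogLineSketch(line: string): string {
  return line
    // Drop complete ANSI CSI escape sequences such as color codes.
    .replace(/\u001b\[[0-9;?]*[ -\/]*[@-~]/g, "")
    // Drop any remaining C0/C1 control characters and DEL.
    .replace(/[\u0000-\u001f\u007f-\u009f]/g, "")
    // Redact bearer credentials, keeping the scheme word for context.
    .replace(/\bBearer\s+[A-Za-z0-9._~+\/=-]+/gi, "Bearer [REDACTED]")
    // Redact NVIDIA-style API keys wherever they appear.
    .replace(/\bnvapi-[A-Za-z0-9_-]+/g, "[REDACTED]")
    // Redact values assigned to names containing key/token/secret/password.
    .replace(
      /\b(\w*(?:key|token|secret|password)\w*)\s*([=:])\s*\S+/gi,
      "$1$2[REDACTED]",
    );
}
```

Control characters are removed before the redaction passes so that an escape sequence embedded in the middle of a credential cannot split it and defeat the credential patterns.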
@ericksoa

Re: PR Review Advisor findings —

Scope (#4710 acceptance): this PR is explicitly the host-side complement, not the closure of the original HEALTHCHECK/marker acceptance. The marker/HEALTHCHECK and sandbox-side prevention clauses are owned by #5116 (marker tied to the launch site) and #5181 (gateway.reload=hot pin + serving watchdog + HEALTHCHECK pattern). This PR only hardens recover/connect --probe-only so they cannot declare success inside the pre-wedge window, and it does not close #4710 on its own (the body links it as "Related", not "Fixes").

Monolith growth: addressed in c9c71cf — the wedge-diagnostics helpers moved out of process-recovery.ts into a focused gateway-wedge-diagnostics.ts module, bringing the net process-recovery.ts delta down to the settle-confirm itself.

Source-of-truth / removal contract: documented in the new module header — invalid state (OpenClaw gateway parks alive-but-deaf after a failed in-process restart), source boundary (OpenClaw run loop; upstream report in progress), and removal condition (an OpenClaw release whose failed restart exits non-zero so PID-wait supervisors respawn it).

Log-line trust: also in c9c71cf — matched gateway.log lines are sanitized before printing (terminal control characters stripped; bearer headers, key/token/secret/password assignments, and nvapi- keys redacted), with unit coverage.

Runtime validation: dispatched the advisor-required Vitest scenario ubuntu-repo-docker-post-reboot-recovery against this branch; the auto-dispatched sandbox-operations-e2e and sandbox-survival-e2e runs already passed.

@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 27327118403
Target ref: 29503179ae5fcc88f8e2565a0bdb3b28c90818b0
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e,sandbox-operations-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ✅ success
sandbox-operations-e2e ✅ success

…est.ts

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 27327936491
Target ref: c9c71cf4cb44552efc9243e98dcaf0ad27146abd
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e,sandbox-operations-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled

@ericksoa ericksoa self-assigned this Jun 11, 2026
@ericksoa ericksoa added the area: cli, area: sandbox, and v0.0.64 labels Jun 11, 2026
@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 27328391360
Target ref: 14a88e652f8fc0b251fb99f66c34a13556a6e18b
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ✅ success
sandbox-survival-e2e ✅ success

@ericksoa ericksoa requested review from cv and removed request for cv June 11, 2026 12:28
@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27347713550
Target ref: fix/4710-recover-settle-window
Requested jobs: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e,sandbox-operations-e2e
Summary: 1 passed, 2 failed, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ✅ success
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: sandbox-operations-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27350076197
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
sandbox-operations-e2e ❌ failure

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27350329797
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
sandbox-operations-e2e ❌ failure

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27350806077
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
sandbox-operations-e2e ❌ failure

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 27352865672
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27353767791
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-survival-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
sandbox-survival-e2e ❌ failure

Failed jobs: sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 27377252394
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ⚠️ cancelled

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27377738233
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
sandbox-operations-e2e ❌ failure

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27378334061
Target ref: fix/4710-recover-settle-window
Requested jobs: all (no filter)
Summary: 27 passed, 37 failed, 3 skipped

Job Result
agent-turn-latency-e2e ❌ failure
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ❌ failure
channels-stop-start-e2e ❌ failure
cloud-e2e ❌ failure
cloud-inference-e2e ✅ success
cloud-onboard-e2e ❌ failure
common-egress-agent-e2e ❌ failure
concurrent-gateway-ports-e2e ❌ failure
credential-migration-e2e ❌ failure
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ✅ success
device-auth-health-e2e ❌ failure
diagnostics-e2e ❌ failure
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ✅ success
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ❌ failure
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ❌ failure
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ❌ failure
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ❌ failure
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ❌ failure
issue-4462-gateway-pinned-approval-characterization-e2e ❌ failure
issue-4462-scope-upgrade-approval-e2e ❌ failure
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ❌ failure
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ❌ failure
network-policy-e2e ❌ failure
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ❌ failure
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ❌ failure
openclaw-discord-pairing-e2e ❌ failure
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ❌ failure
openclaw-skill-cli-e2e ❌ failure
openclaw-slack-pairing-e2e ❌ failure
openclaw-tui-chat-correlation-e2e ❌ failure
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ❌ failure
rebuild-hermes-e2e ❌ failure
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ❌ failure
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
sessions-agents-cli-e2e ❌ failure
shields-config-e2e ❌ failure
skill-agent-e2e ✅ success
snapshot-commands-e2e ❌ failure
state-backup-restore-e2e ✅ success
telegram-injection-e2e ❌ failure
token-rotation-e2e ❌ failure
tunnel-lifecycle-e2e ❌ failure
upgrade-stale-sandbox-e2e ❌ failure

Failed jobs: agent-turn-latency-e2e, channels-add-remove-e2e, channels-stop-start-e2e, cloud-e2e, cloud-onboard-e2e, common-egress-agent-e2e, concurrent-gateway-ports-e2e, credential-migration-e2e, device-auth-health-e2e, diagnostics-e2e, hermes-discord-e2e, hermes-onboard-security-posture-e2e, hermes-slack-e2e, issue-2478-crash-loop-recovery-e2e, issue-4434-tui-unreachable-inference-e2e, issue-4462-gateway-pinned-approval-characterization-e2e, issue-4462-scope-upgrade-approval-e2e, launchable-smoke-e2e, messaging-providers-e2e, network-policy-e2e, onboard-repair-e2e, openclaw-anthropic-inference-switch-e2e, openclaw-discord-pairing-e2e, openclaw-onboard-security-posture-e2e, openclaw-skill-cli-e2e, openclaw-slack-pairing-e2e, openclaw-tui-chat-correlation-e2e, overlayfs-autofix-e2e, rebuild-hermes-e2e, rebuild-openclaw-e2e, sessions-agents-cli-e2e, shields-config-e2e, snapshot-commands-e2e, telegram-injection-e2e, token-rotation-e2e, tunnel-lifecycle-e2e, upgrade-stale-sandbox-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 27393874604
Target ref: fix/4710-recover-settle-window
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27394431256
Target ref: fix/4710-recover-settle-window
Requested jobs: all (no filter)
Summary: 51 passed, 13 failed, 3 skipped

Job Result
agent-turn-latency-e2e ✅ success
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-e2e ❌ failure
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
common-egress-agent-e2e ❌ failure
concurrent-gateway-ports-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ❌ failure
device-auth-health-e2e ❌ failure
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ❌ failure
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ✅ success
hermes-e2e ❌ failure
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ❌ failure
issue-4462-scope-upgrade-approval-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ❌ failure
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ❌ failure
network-policy-e2e ❌ failure
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-skill-cli-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ❌ failure
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
sessions-agents-cli-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ❌ failure
telegram-injection-e2e ✅ success
token-rotation-e2e ❌ failure
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: channels-stop-start-e2e, common-egress-agent-e2e, cron-preflight-inference-local-e2e, device-auth-health-e2e, hermes-anthropic-inference-switch-e2e, hermes-e2e, issue-4462-gateway-pinned-approval-characterization-e2e, launchable-smoke-e2e, messaging-providers-e2e, network-policy-e2e, overlayfs-autofix-e2e, state-backup-restore-e2e, token-rotation-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 27394431256
Target ref: fix/4710-recover-settle-window
Requested jobs: all (no filter)
Summary: 63 passed, 1 failed, 3 skipped

Job Result
agent-turn-latency-e2e ✅ success
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
common-egress-agent-e2e ✅ success
concurrent-gateway-ports-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ✅ success
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ❌ failure
issue-4462-scope-upgrade-approval-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-skill-cli-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
sessions-agents-cli-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

Failed jobs: issue-4462-gateway-pinned-approval-characterization-e2e. Check run artifacts for logs.

@cv cv added the v0.0.65 (Release target) label and removed the v0.0.64 (Release target) label Jun 12, 2026
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27394431256
Target ref: fix/4710-recover-settle-window
Requested jobs: all (no filter)
Summary: 64 passed, 0 failed, 3 skipped

Job Result
agent-turn-latency-e2e ✅ success
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
common-egress-agent-e2e ✅ success
concurrent-gateway-ports-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
cron-preflight-inference-local-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
gpu-jetson-nvmap-e2e ⏭️ skipped
hermes-anthropic-inference-switch-e2e ✅ success
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-secret-boundary-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4434-tui-unreachable-inference-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ✅ success
issue-4462-scope-upgrade-approval-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-anthropic-inference-switch-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-skill-cli-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
sessions-agents-cli-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@ericksoa ericksoa requested review from cv and removed request for cjagwani June 12, 2026 17:12
@cv cv merged commit 5fb47a5 into main Jun 13, 2026
39 checks passed
@cv cv deleted the fix/4710-recover-settle-window branch June 13, 2026 06:19
@miyoungc miyoungc mentioned this pull request Jun 16, 2026
13 tasks
cv pushed a commit that referenced this pull request Jun 17, 2026
## Summary
Refreshes release-prep documentation for NemoClaw v0.0.65.
Adds the v0.0.65 release-notes section and refreshes generated
`nemoclaw-user-*` skills from the Fern MDX source docs.

## Changes
- Added the v0.0.65 release notes to `docs/about/release-notes.mdx` with
links to the deeper docs pages for lifecycle, troubleshooting,
inference, CLI commands, messaging, credentials, network policy, Hermes,
and sub-agents.
- Regenerated the `nemoclaw-user-*` skills with
`scripts/docs-to-skills.py` so release-prep skill output matches the
merged source docs.
- Used the v0.0.65 announcement discussion as release context:
#5472.

## Source Summary
- #2492 -> `docs/about/release-notes.mdx`: Documents deadline-based
gateway wait reliability in the v0.0.65 recovery summary.
- #4958 -> `docs/about/release-notes.mdx`: Documents re-execed OpenClaw
gateway health check recovery in the sandbox recovery summary.
- #5163 -> `docs/about/release-notes.mdx`: Documents safer uninstall TTY
confirmation behavior in the day-two CLI summary.
- #5178 -> `docs/about/release-notes.mdx`: Documents fail-closed config
restore merge behavior in the rebuild and restore summary.
- #5179 -> `docs/about/release-notes.mdx`: Documents WeChat QR token
redaction in the messaging summary.
- #5182 -> `docs/about/release-notes.mdx`: Documents sustained gateway
serving checks in the recovery summary.
- #5194 -> `docs/about/release-notes.mdx`: Documents model-router
teardown during uninstall in the day-two CLI summary.
- #5195 -> `docs/about/release-notes.mdx`: Documents Shields
auto-restore lock reconfirmation in the rebuild and restore summary.
- #5198 -> `docs/about/release-notes.mdx`: Documents Docker Desktop WSL
CDI injection failure handling in the onboarding diagnostics summary.
- #5201 -> `docs/about/release-notes.mdx`: Documents sandbox
download/upload wrappers and sessions export in the day-two CLI summary.
- #5205 -> `docs/about/release-notes.mdx`: Documents reporter-owned
model metadata preservation in the rebuild and restore summary.
- #5214 -> `docs/about/release-notes.mdx`: Documents managed vLLM model
preflight before side effects in the inference setup summary.
- #5215 -> `docs/about/release-notes.mdx`: Documents managed vLLM extra
serve arguments in the inference setup summary.
- #5216 -> `docs/about/release-notes.mdx`: Documents silent OpenClaw
runtime fallback surfacing in the onboarding diagnostics summary.
- #5225 -> `docs/about/release-notes.mdx`: Documents persisted sandbox
gateway lookup in the gateway recovery summary.
- #5238 -> `docs/about/release-notes.mdx`: Documents sub-agent gateway
dial-back through the sandbox interface in the Hermes and sub-agent
summary.
- #5248 -> `docs/about/release-notes.mdx`: Documents Discord per-account
proxy resolution in the messaging summary.
- #5264 -> `docs/about/release-notes.mdx`: Documents reserved Hermes
port `8642` handling in the Hermes compatibility summary.
- #5267 -> `docs/about/release-notes.mdx`: Documents the narrower Hermes
baseline policy in the Hermes compatibility summary.
- #5321 -> `docs/about/release-notes.mdx`: Documents restored gateway
guard chains in the gateway recovery summary.
- #5328 -> `docs/about/release-notes.mdx`: Documents compact persisted
messaging plans in the messaging summary.
- #5338 -> `docs/about/release-notes.mdx`: Documents manifest channel
migration in the messaging summary.
- #5352 -> `docs/about/release-notes.mdx`: Documents persisted agent
preservation through registry recovery in the rebuild and restore
summary.
- #5371 ->
`.agents/skills/nemoclaw-user-reference/references/commands.md`:
Refreshes generated skill output for custom build cache and
layer-ordering source docs.
- #5379 -> `docs/about/release-notes.mdx`: Documents dashboard port
allocation across multiple NemoClaw gateways in the recovery summary.
- #5382 -> `docs/about/release-notes.mdx`: Documents recovery when an
active gateway has no sandbox spec in the recovery summary.
- #5389 ->
`.agents/skills/nemoclaw-user-reference/references/troubleshooting.md`:
Refreshes generated skill output for declared agent `forward_ports`
recovery source docs.
- #5400 -> `docs/about/release-notes.mdx`: Documents bounded compatible
endpoint probes in the inference setup summary.
- #5410 -> `docs/about/release-notes.mdx`: Documents provider credential
hash removal from sandbox registry entries in the messaging summary.
- #5418 -> `docs/about/release-notes.mdx`: Documents summarized
inference validation failures in the onboarding diagnostics summary.
- #5457 -> `docs/about/release-notes.mdx`: Documents context-window
recomputation after runtime model switches in the inference setup
summary.
- #5463 -> `docs/about/release-notes.mdx`: Documents cleanup of
hard-coded messaging channel stragglers in the messaging summary.
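
As an illustration of the #5182 entry above (sustained gateway serving checks), the settle-and-confirm idea might be sketched as follows. This is a minimal, synchronous sketch and not the NemoClaw implementation: the function name, the result shape, and the `scriptedProbe` helper are hypothetical, and the real code is likely asynchronous. Only the 25-second default and the "0 disables the settle window" behavior are taken from the PR description.

```typescript
// Illustrative sketch only: all names and shapes here are hypothetical.
type Probe = () => boolean;             // true when the gateway answers a health probe
type Sleep = (seconds: number) => void; // injectable so tests do not actually wait

type SettleOutcome =
  | "confirmed"             // served both before and after the settle window
  | "settle-disabled"       // settle window of 0: the first probe is accepted
  | "never-served"          // the first probe failed
  | "dropped-after-settle"; // served once, then stopped (wedge-like behavior)

interface SettleResult {
  recovered: boolean;
  outcome: SettleOutcome;
}

function confirmSustainedServing(
  probe: Probe,
  sleep: Sleep,
  settleSeconds: number = 25, // default from the PR description
): SettleResult {
  // Point-in-time proof: the gateway must answer at least once.
  if (!probe()) {
    return { recovered: false, outcome: "never-served" };
  }
  // A settle window of 0 (or less) skips the confirm probe entirely.
  if (settleSeconds <= 0) {
    return { recovered: true, outcome: "settle-disabled" };
  }
  // Sustained proof: wait out the settle window, then probe again.
  sleep(settleSeconds);
  return probe()
    ? { recovered: true, outcome: "confirmed" }
    : { recovered: false, outcome: "dropped-after-settle" };
}

// Hypothetical test helper: replays a fixed list of probe results,
// then reports failure once the list is exhausted.
function scriptedProbe(results: boolean[]): Probe {
  let i = 0;
  return () => (i < results.length ? results[i++] === true : false);
}

const noWait: Sleep = () => {};
const wedged = confirmSustainedServing(scriptedProbe([true, false]), noWait);
const healthy = confirmSustainedServing(scriptedProbe([true, true]), noWait);
console.log(wedged.outcome, healthy.outcome);
```

Injecting `probe` and `sleep` keeps the check testable without a live gateway or a real 25-second wait, which mirrors the injectable seams the PR description mentions.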

## Skipped
- #5366 matched `docs/.docs-skip` entries through skipped experimental
paths, so this PR does not add new release-note text for that commit.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] Git hooks passed during commit and push, or `npx prek run
--from-ref main --to-ref HEAD` passes
- [ ] Targeted tests pass for changed behavior
- [ ] Full `npm test` passes (broad runtime changes only)
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Verification notes:
- `npm run docs` passed after rerunning outside the sandbox. Fern
reported 0 errors and 1 hidden warning.
- The first sandboxed `npm run docs` attempt failed before validation
because `tsx` could not create its local IPC pipe under sandbox
restrictions.
- `npm run build:cli` passed before push to refresh the local `dist/`
artifacts used by the CLI typecheck hook.
- `npm test` was not run because this is a docs-only release refresh.

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Released NemoClaw v0.0.65 with improved gateway/sandbox recovery,
    safer day-two workflows, and enhanced Hermes compatibility.
  * Added managed vLLM extra-arguments configuration via
    `NEMOCLAW_VLLM_EXTRA_ARGS_JSON`.
  * Added Hermes troubleshooting guidance for port forwarding and health
    checks.

* **Documentation**
  * Updated NVIDIA Endpoints/NIM setup and examples to use
    `NVIDIA_INFERENCE_API_KEY`.
  * Refined NVIDIA network policy and Model Router API base configuration.
  * Expanded CLI/environment variable documentation (including sub-agent
    gateway connectivity) and plugin build performance tips.

* **Tests**
  * Expanded Vitest-backed E2E release validation coverage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Labels

area: cli (Command line interface, flags, terminal UX, or output)
area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
v0.0.65 (Release target)

2 participants