fix(recover): require sustained gateway serving after recovery (#4710)#5182
Conversation
A wedged in-sandbox OpenClaw gateway serves for ~20 seconds after logging ready and then drops its HTTP listener while the process stays alive (a failed in-process restart triggered by a post-launch config write). The recovery wait declared success on a single health probe inside that window, so 'nemoclaw <name> recover' reported a healthy gateway that was already on its way back to the wedge. After the first successful probe, wait out a settle window (NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS, default 25, 0 disables) and require a confirm probe to still succeed before declaring recovery. On confirm failure, surface the #4710 wedge signature from /tmp/gateway.log (config-reload restart, gateway startup failed, process-will-stay-alive) so the operator sees why the gateway is unreachable despite a live PID. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughExports waitForRecoveredSandboxGateway (injectable probe/sleep/quiet), adds an optional settle-window re-check after initial probe success, implements collect/print gateway wedge diagnostics with sanitization, integrates diagnostics into recovery/connect failure paths, and adds tests and CLI coverage. ChangesSandbox Recovery Enhancements
Sequence DiagramsequenceDiagram
participant CLI as connect --probe-only / caller
participant Wait as waitForRecoveredSandboxGateway
participant Probe as isSandboxGatewayRunning (probeImpl)
participant Timer as settle window (sleepImpl)
participant Diag as printGatewayWedgeDiagnostics
CLI->>Wait: trigger recovery probe
Wait->>Probe: probeImpl(sandboxName)
Probe-->>Wait: true/false
alt initial success
Wait->>Timer: sleep(settleSeconds)
Timer-->>Wait: (wake)
Wait->>Probe: probeImpl(sandboxName) (re-probe)
Probe-->>Wait: false
Wait->>Diag: printGatewayWedgeDiagnostics(sandboxName, exec)
else never responsive
Wait->>Diag: printGatewayWedgeDiagnostics(sandboxName, exec)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: None Full advisor summaryE2E Recommendation AdvisorFailed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt |
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: None Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorFailed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt |
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
This is an automated advisory review. A human maintainer must make the final merge decision. |
Selective E2E Results — ✅ All requested jobs passedRun: 27311911574
|
The probe-only connect path runs recovery with quiet=true and prints its own failure summary, so the #4710 wedge signature added to the non-quiet recovery path never reached the operator there. Extract a shared printGatewayWedgeDiagnostics helper and call it from the probe-only failure path too. CLI test fallout from the settle-confirm: runWithEnv now disables the 25s settle by default (NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS=0) so the existing connect-recovery tests stay fast — settle behavior keeps its dedicated unit coverage — and a new focused CLI suite drives the full wedge shape (serve once, then refuse) through 'connect --probe-only' with a short settle window, asserting the failure exit, the wedge-signature output, and that the confirm probe ran. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/cli/helpers.ts`:
- Around line 174-178: Update the inline comment above
NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS to reference the correct test filename:
replace "connect-recovery.test.ts" with "connect-recovery-settle.test.ts" so the
comment correctly points to the targeted CLI test that overrides the settle
window.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4a211809-b5cb-4117-a80f-44c60536addb
📒 Files selected for processing (4)
src/lib/actions/sandbox/connect.tssrc/lib/actions/sandbox/process-recovery.tstest/cli/connect-recovery-settle.test.tstest/cli/helpers.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/actions/sandbox/process-recovery.ts
…lines (#4710) PR Review Advisor follow-ups on #5182: - Move the wedge-diagnostics helpers out of the high-churn process-recovery.ts monolith into a focused gateway-wedge-diagnostics module with the sandbox exec passed explicitly (keeps the import graph acyclic) and document the source-of-truth contract there: the invalid OpenClaw park-alive state, the upstream source boundary, and the removal condition for this detection. - The printed lines come from a sandbox-writable log, so treat them as untrusted: strip terminal control characters and redact common credential shapes (bearer headers, key/token/secret/password assignments, nvapi- keys) before they reach the operator's terminal. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Re: PR Review Advisor findings — Scope (#4710 acceptance): this PR is explicitly the host-side complement, not the closure of the original HEALTHCHECK/marker acceptance. The marker/HEALTHCHECK and sandbox-side prevention clauses are owned by #5116 (marker tied to the launch site) and #5181 (gateway.reload=hot pin + serving watchdog + HEALTHCHECK pattern). This PR only hardens Monolith growth: addressed in c9c71cf — the wedge-diagnostics helpers moved out of Source-of-truth / removal contract: documented in the new module header — invalid state (OpenClaw gateway parks alive-but-deaf after a failed in-process restart), source boundary (OpenClaw run loop; upstream report in progress), and removal condition (an OpenClaw release whose failed restart exits non-zero so PID-wait supervisors respawn it). Log-line trust: also in c9c71cf — matched gateway.log lines are sanitized before printing (terminal control characters stripped; bearer headers, key/token/secret/password assignments, and nvapi- keys redacted), with unit coverage. Runtime validation: dispatched the advisor-required Vitest scenario |
Selective E2E Results — ✅ All requested jobs passedRun: 27327118403
|
…est.ts Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 27327936491
|
Selective E2E Results — ✅ All requested jobs passedRun: 27328391360
|
Selective E2E Results — ❌ Some jobs failedRun: 27347713550
|
Selective E2E Results — ❌ Some jobs failedRun: 27350076197
|
Selective E2E Results — ❌ Some jobs failedRun: 27350329797
|
Selective E2E Results — ❌ Some jobs failedRun: 27350806077
|
Selective E2E Results — ✅ All requested jobs passedRun: 27352865672
|
Selective E2E Results — ❌ Some jobs failedRun: 27353767791
|
Selective E2E Results — ❌ Some jobs failedRun: 27350076197
|
1 similar comment
Selective E2E Results — ❌ Some jobs failedRun: 27350076197
|
Selective E2E Results — ✅ All requested jobs passedRun: 27377252394
|
Selective E2E Results — ❌ Some jobs failedRun: 27353767791
|
Selective E2E Results — ❌ Some jobs failedRun: 27377738233
|
Selective E2E Results — ❌ Some jobs failedRun: 27378334061
|
Selective E2E Results — ✅ All requested jobs passedRun: 27393874604
|
Selective E2E Results — ❌ Some jobs failedRun: 27394431256
|
Selective E2E Results — ❌ Some jobs failedRun: 27394431256
|
Selective E2E Results — ✅ All requested jobs passedRun: 27394431256
|
## Summary Refreshes release-prep documentation for NemoClaw v0.0.65. Adds the v0.0.65 release-notes section and refreshes generated `nemoclaw-user-*` skills from the Fern MDX source docs. ## Changes - Added the v0.0.65 release notes to `docs/about/release-notes.mdx` with links to the deeper docs pages for lifecycle, troubleshooting, inference, CLI commands, messaging, credentials, network policy, Hermes, and sub-agents. - Regenerated the `nemoclaw-user-*` skills with `scripts/docs-to-skills.py` so release-prep skill output matches the merged source docs. - Used the v0.0.65 announcement discussion as release context: #5472. ## Source Summary - #2492 -> `docs/about/release-notes.mdx`: Documents deadline-based gateway wait reliability in the v0.0.65 recovery summary. - #4958 -> `docs/about/release-notes.mdx`: Documents re-execed OpenClaw gateway health check recovery in the sandbox recovery summary. - #5163 -> `docs/about/release-notes.mdx`: Documents safer uninstall TTY confirmation behavior in the day-two CLI summary. - #5178 -> `docs/about/release-notes.mdx`: Documents fail-closed config restore merge behavior in the rebuild and restore summary. - #5179 -> `docs/about/release-notes.mdx`: Documents WeChat QR token redaction in the messaging summary. - #5182 -> `docs/about/release-notes.mdx`: Documents sustained gateway serving checks in the recovery summary. - #5194 -> `docs/about/release-notes.mdx`: Documents model-router teardown during uninstall in the day-two CLI summary. - #5195 -> `docs/about/release-notes.mdx`: Documents Shields auto-restore lock reconfirmation in the rebuild and restore summary. - #5198 -> `docs/about/release-notes.mdx`: Documents Docker Desktop WSL CDI injection failure handling in the onboarding diagnostics summary. - #5201 -> `docs/about/release-notes.mdx`: Documents sandbox download/upload wrappers and sessions export in the day-two CLI summary. - #5205 -> `docs/about/release-notes.mdx`: Documents reporter-owned model metadata preservation in the rebuild and restore summary. - #5214 -> `docs/about/release-notes.mdx`: Documents managed vLLM model preflight before side effects in the inference setup summary. - #5215 -> `docs/about/release-notes.mdx`: Documents managed vLLM extra serve arguments in the inference setup summary. - #5216 -> `docs/about/release-notes.mdx`: Documents silent OpenClaw runtime fallback surfacing in the onboarding diagnostics summary. - #5225 -> `docs/about/release-notes.mdx`: Documents persisted sandbox gateway lookup in the gateway recovery summary. - #5238 -> `docs/about/release-notes.mdx`: Documents sub-agent gateway dial-back through the sandbox interface in the Hermes and sub-agent summary. - #5248 -> `docs/about/release-notes.mdx`: Documents Discord per-account proxy resolution in the messaging summary. - #5264 -> `docs/about/release-notes.mdx`: Documents reserved Hermes port `8642` handling in the Hermes compatibility summary. - #5267 -> `docs/about/release-notes.mdx`: Documents the narrower Hermes baseline policy in the Hermes compatibility summary. - #5321 -> `docs/about/release-notes.mdx`: Documents restored gateway guard chains in the gateway recovery summary. - #5328 -> `docs/about/release-notes.mdx`: Documents compact persisted messaging plans in the messaging summary. - #5338 -> `docs/about/release-notes.mdx`: Documents manifest channel migration in the messaging summary. - #5352 -> `docs/about/release-notes.mdx`: Documents persisted agent preservation through registry recovery in the rebuild and restore summary. - #5371 -> `.agents/skills/nemoclaw-user-reference/references/commands.md`: Refreshes generated skill output for custom build cache and layer-ordering source docs. - #5379 -> `docs/about/release-notes.mdx`: Documents dashboard port allocation across multiple NemoClaw gateways in the recovery summary. - #5382 -> `docs/about/release-notes.mdx`: Documents recovery when an active gateway has no sandbox spec in the recovery summary. - #5389 -> `.agents/skills/nemoclaw-user-reference/references/troubleshooting.md`: Refreshes generated skill output for declared agent `forward_ports` recovery source docs. - #5400 -> `docs/about/release-notes.mdx`: Documents bounded compatible endpoint probes in the inference setup summary. - #5410 -> `docs/about/release-notes.mdx`: Documents provider credential hash removal from sandbox registry entries in the messaging summary. - #5418 -> `docs/about/release-notes.mdx`: Documents summarized inference validation failures in the onboarding diagnostics summary. - #5457 -> `docs/about/release-notes.mdx`: Documents context-window recomputation after runtime model switches in the inference setup summary. - #5463 -> `docs/about/release-notes.mdx`: Documents cleanup of hard-coded messaging channel stragglers in the messaging summary. ## Skipped - #5366 matched `docs/.docs-skip` entries through skipped experimental paths, so this PR does not add new release-note text for that commit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] Git hooks passed during commit and push, or `npx prek run --from-ref main --to-ref HEAD` passes - [ ] Targeted tests pass for changed behavior - [ ] Full `npm test` passes (broad runtime changes only) - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Verification notes: - `npm run docs` passed after rerunning outside the sandbox. Fern reported 0 errors and 1 hidden warning. - The first sandboxed `npm run docs` attempt failed before validation because `tsx` could not create its local IPC pipe under sandbox restrictions. - `npm run build:cli` passed before push to refresh the local `dist/` artifacts used by the CLI typecheck hook. - `npm test` was not run because this is a docs-only release refresh. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Released NemoClaw v0.0.65 with improved gateway/sandbox recovery, safer day-two workflows, and enhanced Hermes compatibility. * Added managed vLLM extra-arguments configuration via `NEMOCLAW_VLLM_EXTRA_ARGS_JSON`. * Added Hermes troubleshooting guidance for port forwarding and health checks. * **Documentation** * Updated NVIDIA Endpoints/NIM setup and examples to use `NVIDIA_INFERENCE_API_KEY`. * Refined NVIDIA network policy and Model Router API base configuration. * Expanded CLI/environment variable documentation (including sub-agent gateway connectivity) and plugin build performance tips. * **Tests** * Expanded Vitest-backed E2E release validation coverage. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
nemoclaw <name> recoverdeclared success on a single gateway health probe. A wedged gateway (#4710) serves for ~20 seconds after logging ready and then drops its HTTP listener while the process stays alive — so recover reported a healthy gateway that was already on its way back to the wedge. This PR makes recovery require sustained serving and surfaces the wedge signature when it recurs.Related Issue
Related: #4710 (host-side hardening; helps existing sandboxes without an image rebuild — complements the sandbox-side fix).
Changes
waitForRecoveredSandboxGateway(src/lib/actions/sandbox/process-recovery.ts): after the first successful probe, wait a settle window (NEMOCLAW_GATEWAY_RECOVERY_SETTLE_SECONDS, default 25s,0disables) and require a confirm probe to still succeed before declaring recovery. Prints a one-line progress note so the wait doesn't read as a hang.collectGatewayWedgeDiagnostics: on confirm failure, greps/tmp/gateway.logfor the wedge signature (config change requires gateway restart/gateway startup failed/Process will stay alive) and prints the matching lines so the operator sees why the gateway is unreachable despite a live PID.0-disable path, initial polling interplay, and diagnostics extraction/edge cases.Architecture
gateway-wedge-diagnosticsmodule rather than growing the high-churnprocess-recovery.tslifecycle monolith (the deterministic growth gate's direction). The sandbox exec is passed explicitly, keeping the import graph acyclic.execImplstyle, and tests run against the compileddist/like the rest of the CLI suite, with the CLI-level settle test isolated in a focused file under the size budgets.Type of Change
Verification
npx prek run --all-filespasses — except thetest-clihook's two pre-existing, machine-bound failures that reproduce identically on a cleanmaincheckout on this host (live e2e scenarioubuntu-repo-cloud-openclaw, timing-sensitive auto-pair keepalive retry); all other hooks pass.npm testpasses for the touched suites (src/lib/actions/sandbox/process-recovery.test.ts,test/process-recovery.test.ts).NEMOCLAW_GATEWAY_RECOVERY_WAIT_SECONDS/_POLL_INTERVAL_SECONDSconvention in the same function; happy to add a reference entry if preferred.Signed-off-by: Aaron Erickson aerickson@nvidia.com
Summary by CodeRabbit
New Features
Tests