fix(onboard): allocate dashboard ports across NemoClaw gateways#5379
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughReads persisted sandbox registry to build cross-gateway dashboard-port occupancy, merges that with the active gateway forward-list when selecting ports, wires the registry map through create/resolve flows and tests, and emits specific guidance when a sandbox returns "sandbox has no spec". ChangesMulti-instance Sandbox Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: None Full advisor summaryE2E Recommendation AdvisorFailed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt |
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: None Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorFailed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt |
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
2541-2550: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winMove this lookup behind
dashboard-port.tssosrc/lib/onboard.tsstays within the growth budget.This extra wiring is what pushed
src/lib/onboard.tsover the entrypoint guardrail in CI. IfresolveCreateSandboxDashboardPort()computesgetRegistryOccupiedDashboardPorts(input.sandboxName)whenregistryOccupiedPortsis omitted, this call site can keep its old shape without losing the new behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2541 - 2550, The call site currently computes getRegistryOccupiedDashboardPorts(sandboxName) and passes it into resolveCreateSandboxDashboardPort, which bloats src/lib/onboard.ts; instead remove the registryOccupiedPorts argument from this call site (stop invoking getRegistryOccupiedDashboardPorts here) and make resolveCreateSandboxDashboardPort responsible for computing registry-occupied ports when its registryOccupiedPorts parameter is omitted/undefined. Update resolveCreateSandboxDashboardPort's implementation to import/use getRegistryOccupiedDashboardPorts(sandboxName) internally as the fallback, keep the function signature backward-compatible (optional param), and ensure existing behavior and tests remain unchanged.Source: Pipeline failures
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/dashboard-port.ts`:
- Around line 209-213: The blanket catch around entries = list().sandboxes is
hiding real IO/permission errors; instead either call the existing
listSandboxes() helper (which already degrades for missing/unparseable registry
files) or narrow the catch to only handle the safe fallback cases (e.g.,
error.code === 'ENOENT' or JSON parsing errors) and rethrow any other errors
(permission/unreadable file) so they abort; reference list(), listSandboxes(),
and readConfigFile to locate the logic and implement the safer error handling
(i.e., remove the unconditional catch or replace it with conditional checks and
rethrow).
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 2541-2550: The call site currently computes
getRegistryOccupiedDashboardPorts(sandboxName) and passes it into
resolveCreateSandboxDashboardPort, which bloats src/lib/onboard.ts; instead
remove the registryOccupiedPorts argument from this call site (stop invoking
getRegistryOccupiedDashboardPorts here) and make
resolveCreateSandboxDashboardPort responsible for computing registry-occupied
ports when its registryOccupiedPorts parameter is omitted/undefined. Update
resolveCreateSandboxDashboardPort's implementation to import/use
getRegistryOccupiedDashboardPorts(sandboxName) internally as the fallback, keep
the function signature backward-compatible (optional param), and ensure existing
behavior and tests remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 61beb437-b077-4a3c-8299-ad174f026e5d
📒 Files selected for processing (6)
src/lib/actions/sandbox/gateway-state-hints.test.tssrc/lib/actions/sandbox/gateway-state.tssrc/lib/onboard.tssrc/lib/onboard/dashboard-port.test.tssrc/lib/onboard/dashboard-port.tssrc/lib/onboard/dashboard.ts
…rt module Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…st registry state Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
## Summary Refreshes release-prep documentation for NemoClaw v0.0.65. Adds the v0.0.65 release-notes section and refreshes generated `nemoclaw-user-*` skills from the Fern MDX source docs. ## Changes - Added the v0.0.65 release notes to `docs/about/release-notes.mdx` with links to the deeper docs pages for lifecycle, troubleshooting, inference, CLI commands, messaging, credentials, network policy, Hermes, and sub-agents. - Regenerated the `nemoclaw-user-*` skills with `scripts/docs-to-skills.py` so release-prep skill output matches the merged source docs. - Used the v0.0.65 announcement discussion as release context: #5472. ## Source Summary - #2492 -> `docs/about/release-notes.mdx`: Documents deadline-based gateway wait reliability in the v0.0.65 recovery summary. - #4958 -> `docs/about/release-notes.mdx`: Documents re-execed OpenClaw gateway health check recovery in the sandbox recovery summary. - #5163 -> `docs/about/release-notes.mdx`: Documents safer uninstall TTY confirmation behavior in the day-two CLI summary. - #5178 -> `docs/about/release-notes.mdx`: Documents fail-closed config restore merge behavior in the rebuild and restore summary. - #5179 -> `docs/about/release-notes.mdx`: Documents WeChat QR token redaction in the messaging summary. - #5182 -> `docs/about/release-notes.mdx`: Documents sustained gateway serving checks in the recovery summary. - #5194 -> `docs/about/release-notes.mdx`: Documents model-router teardown during uninstall in the day-two CLI summary. - #5195 -> `docs/about/release-notes.mdx`: Documents Shields auto-restore lock reconfirmation in the rebuild and restore summary. - #5198 -> `docs/about/release-notes.mdx`: Documents Docker Desktop WSL CDI injection failure handling in the onboarding diagnostics summary. - #5201 -> `docs/about/release-notes.mdx`: Documents sandbox download/upload wrappers and sessions export in the day-two CLI summary. - #5205 -> `docs/about/release-notes.mdx`: Documents reporter-owned model metadata preservation in the rebuild and restore summary. - #5214 -> `docs/about/release-notes.mdx`: Documents managed vLLM model preflight before side effects in the inference setup summary. - #5215 -> `docs/about/release-notes.mdx`: Documents managed vLLM extra serve arguments in the inference setup summary. - #5216 -> `docs/about/release-notes.mdx`: Documents silent OpenClaw runtime fallback surfacing in the onboarding diagnostics summary. - #5225 -> `docs/about/release-notes.mdx`: Documents persisted sandbox gateway lookup in the gateway recovery summary. - #5238 -> `docs/about/release-notes.mdx`: Documents sub-agent gateway dial-back through the sandbox interface in the Hermes and sub-agent summary. - #5248 -> `docs/about/release-notes.mdx`: Documents Discord per-account proxy resolution in the messaging summary. - #5264 -> `docs/about/release-notes.mdx`: Documents reserved Hermes port `8642` handling in the Hermes compatibility summary. - #5267 -> `docs/about/release-notes.mdx`: Documents the narrower Hermes baseline policy in the Hermes compatibility summary. - #5321 -> `docs/about/release-notes.mdx`: Documents restored gateway guard chains in the gateway recovery summary. - #5328 -> `docs/about/release-notes.mdx`: Documents compact persisted messaging plans in the messaging summary. - #5338 -> `docs/about/release-notes.mdx`: Documents manifest channel migration in the messaging summary. - #5352 -> `docs/about/release-notes.mdx`: Documents persisted agent preservation through registry recovery in the rebuild and restore summary. - #5371 -> `.agents/skills/nemoclaw-user-reference/references/commands.md`: Refreshes generated skill output for custom build cache and layer-ordering source docs. - #5379 -> `docs/about/release-notes.mdx`: Documents dashboard port allocation across multiple NemoClaw gateways in the recovery summary. - #5382 -> `docs/about/release-notes.mdx`: Documents recovery when an active gateway has no sandbox spec in the recovery summary. - #5389 -> `.agents/skills/nemoclaw-user-reference/references/troubleshooting.md`: Refreshes generated skill output for declared agent `forward_ports` recovery source docs. - #5400 -> `docs/about/release-notes.mdx`: Documents bounded compatible endpoint probes in the inference setup summary. - #5410 -> `docs/about/release-notes.mdx`: Documents provider credential hash removal from sandbox registry entries in the messaging summary. - #5418 -> `docs/about/release-notes.mdx`: Documents summarized inference validation failures in the onboarding diagnostics summary. - #5457 -> `docs/about/release-notes.mdx`: Documents context-window recomputation after runtime model switches in the inference setup summary. - #5463 -> `docs/about/release-notes.mdx`: Documents cleanup of hard-coded messaging channel stragglers in the messaging summary. ## Skipped - #5366 matched `docs/.docs-skip` entries through skipped experimental paths, so this PR does not add new release-note text for that commit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] Git hooks passed during commit and push, or `npx prek run --from-ref main --to-ref HEAD` passes - [ ] Targeted tests pass for changed behavior - [ ] Full `npm test` passes (broad runtime changes only) - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Verification notes: - `npm run docs` passed after rerunning outside the sandbox. Fern reported 0 errors and 1 hidden warning. - The first sandboxed `npm run docs` attempt failed before validation because `tsx` could not create its local IPC pipe under sandbox restrictions. - `npm run build:cli` passed before push to refresh the local `dist/` artifacts used by the CLI typecheck hook. - `npm test` was not run because this is a docs-only release refresh. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Released NemoClaw v0.0.65 with improved gateway/sandbox recovery, safer day-two workflows, and enhanced Hermes compatibility. * Added managed vLLM extra-arguments configuration via `NEMOCLAW_VLLM_EXTRA_ARGS_JSON`. * Added Hermes troubleshooting guidance for port forwarding and health checks. * **Documentation** * Updated NVIDIA Endpoints/NIM setup and examples to use `NVIDIA_INFERENCE_API_KEY`. * Refined NVIDIA network policy and Model Router API base configuration. * Expanded CLI/environment variable documentation (including sub-agent gateway connectivity) and plugin build performance tips. * **Tests** * Expanded Vitest-backed E2E release validation coverage. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
NemoClaw allocated dashboard ports per gateway by parsing
openshell forward list, which only shows forwards owned by the currently selected gateway. A second sandbox onboarded against a non-defaultNEMOCLAW_GATEWAY_PORTcould not see the first gateway's allocations and re-handed-out the same dashboard port — both sandboxes ended up reporting the samehttp://127.0.0.1:18789/URL, and the first sandbox became unreachable with a raw gRPCsandbox has no specerror. The persisted sandbox registry already records each sandbox'sdashboardPortat host scope; the allocator now consults that view as a supplementary signal so a fresh onboard on a sibling gateway cannot collide with an existing sandbox's port.Related Issue
Fixes #4865
Fixes #5359
Changes
src/lib/onboard/dashboard-port.ts—findAvailableDashboardPortaccepts an additionalregistryOccupiedPortsview; privatemergeOccupiedPortslets the active gateway's forward-list entry win when both views see the same port. The allocator defaultsregistryOccupiedPortsto an empty map so its unit tests stay independent of whatever sandboxes happen to live in the test runner's~/.nemoclaw/sandboxes.json. New exported helpergetRegistryOccupiedDashboardPorts(currentSandboxName, listSandboxesFn?)reads~/.nemoclaw/sandboxes.jsonand returns a port → sandbox map excluding the sandbox currently being allocated for; it letslistSandboxes()handle missing or unparseable registry files and propagates any other error (e.g. permission-denied) instead of swallowing it.resolveCreateSandboxDashboardPortdefaultsinput.registryOccupiedPortsto the registry-derived map internally, so the create-time call site keeps its existing shape without growingsrc/lib/onboard.ts.src/lib/onboard/dashboard.ts—ensureDashboardForwardpassesgetRegistryOccupiedDashboardPorts(sandboxName)through tofindAvailableDashboardPortso the post-build forward-setup path applies the same cross-gateway view.src/lib/actions/sandbox/gateway-state.ts—printGatewayLifecycleHintadds a clause that recognises the gateway-sidesandbox has no specgRPC reply and surfaces a concreteopenshell gateway select <owning>hint with the sandbox's recorded per-port gateway name, instead of letting the raw gRPC string be the last word.src/lib/onboard/dashboard-port.test.ts— 8 new tests: 5 cover the allocator's cross-gateway behaviour (registry-occupied ports block reuse, the current sandbox can still reclaim its own port, registry entries with null / non-numeric ports are ignored, exhaustion errors include registry-owned ports, the active gateway's forward-list entry wins). 3 covergetRegistryOccupiedDashboardPorts.src/lib/actions/sandbox/gateway-state-hints.test.ts— new file, 3 tests covering the newsandbox has no spechint clause across default and per-port gateway names, plus a non-match check on unrelated lifecycle output.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
New Features
Tests