fix(onboard): allocate dashboard ports across NemoClaw gateways#5379

Merged
cv merged 4 commits into main from feat/multi-gateway-dashboard-binding
Jun 13, 2026

Conversation

@laitingsheng laitingsheng commented Jun 13, 2026

Contributor

Summary

NemoClaw allocated dashboard ports per gateway by parsing openshell forward list, which only shows forwards owned by the currently selected gateway. When a second sandbox was onboarded against a non-default NEMOCLAW_GATEWAY_PORT, the allocator could not see the first gateway's allocations and handed out the same dashboard port again. Both sandboxes ended up reporting the same http://127.0.0.1:18789/ URL, and the first sandbox became unreachable with a raw gRPC "sandbox has no spec" error. The persisted sandbox registry already records each sandbox's dashboardPort at host scope; the allocator now consults that view as a supplementary signal so a fresh onboard on a sibling gateway cannot collide with an existing sandbox's port.
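
The collision and the fix described above can be illustrated with a minimal sketch. It is not the code from this PR: the function name `pickDashboardPort`, the `Map<number, string>` shape, and the upper bound of the port range are assumptions made for the example. Only port 18789 is taken from the description.

```typescript
// Hypothetical sketch: names, range, and data shapes are assumptions, not the PR's code.
type OccupiedPorts = Map<number, string>; // port -> owning sandbox name

const RANGE_START = 18789; // port seen in the URL from the description
const RANGE_END = 18799; // assumed upper bound, for illustration only

function pickDashboardPort(
  sandboxName: string,
  forwardList: OccupiedPorts, // forwards visible on the currently selected gateway
  registry: OccupiedPorts = new Map(), // host-wide view from the persisted registry
): number {
  for (let port = RANGE_START; port <= RANGE_END; port++) {
    // The forward-list entry wins when both views know the port.
    const owner = forwardList.get(port) ?? registry.get(port);
    // A port is usable when nobody owns it or the requesting sandbox already owns it.
    if (owner === undefined || owner === sandboxName) return port;
  }
  throw new Error(`no free dashboard port in ${RANGE_START}-${RANGE_END}`);
}

// Gateway B is selected, so its forward list is empty; sandbox "alpha" lives on gateway A.
const forwardsOnB: OccupiedPorts = new Map();
const registryView: OccupiedPorts = new Map([[18789, "alpha"]]);

const before = pickDashboardPort("beta", forwardsOnB); // forward list only: collides with alpha
const after = pickDashboardPort("beta", forwardsOnB, registryView); // registry blocks reuse
console.log(before, after);
```

With only the forward list, "beta" receives the port that "alpha" already uses; with the registry view merged in, it moves to the next free port.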

Related Issue

Fixes #4865
Fixes #5359

Changes

  • src/lib/onboard/dashboard-port.ts: findAvailableDashboardPort accepts an additional registryOccupiedPorts view; private mergeOccupiedPorts lets the active gateway's forward-list entry win when both views see the same port. The allocator defaults registryOccupiedPorts to an empty map so its unit tests stay independent of whatever sandboxes happen to live in the test runner's ~/.nemoclaw/sandboxes.json. New exported helper getRegistryOccupiedDashboardPorts(currentSandboxName, listSandboxesFn?) reads ~/.nemoclaw/sandboxes.json and returns a port → sandbox map excluding the sandbox currently being allocated for; it lets listSandboxes() handle missing or unparseable registry files and propagates any other error (e.g. permission-denied) instead of swallowing it. resolveCreateSandboxDashboardPort defaults input.registryOccupiedPorts to the registry-derived map internally, so the create-time call site keeps its existing shape without growing src/lib/onboard.ts.
  • src/lib/onboard/dashboard.ts: ensureDashboardForward passes getRegistryOccupiedDashboardPorts(sandboxName) through to findAvailableDashboardPort so the post-build forward-setup path applies the same cross-gateway view.
  • src/lib/actions/sandbox/gateway-state.ts: printGatewayLifecycleHint adds a clause that recognises the gateway-side "sandbox has no spec" gRPC reply and surfaces a concrete openshell gateway select <owning> hint with the sandbox's recorded per-port gateway name, instead of letting the raw gRPC string be the last word.
  • src/lib/onboard/dashboard-port.test.ts: 8 new tests. 5 cover the allocator's cross-gateway behaviour (registry-occupied ports block reuse, the current sandbox can still reclaim its own port, registry entries with null / non-numeric ports are ignored, exhaustion errors include registry-owned ports, the active gateway's forward-list entry wins). 3 cover getRegistryOccupiedDashboardPorts.
  • src/lib/actions/sandbox/gateway-state-hints.test.ts: new file, 3 tests covering the new "sandbox has no spec" hint clause across default and per-port gateway names, plus a non-match check on unrelated lifecycle output.
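
As a rough illustration of the registry helper described in the first bullet, the sketch below builds a port → sandbox map from an injected list function. The name `registryOccupiedPorts`, the `RegistryEntry` shape, and the integer check are assumptions for the example; the real helper's validation rules may differ.

```typescript
// Hypothetical sketch of the registry helper; types and validation rules are assumptions.
interface RegistryEntry {
  name: string;
  dashboardPort?: unknown; // persisted data may be null, a string, or missing
}
type ListSandboxes = () => { sandboxes: RegistryEntry[] };

function registryOccupiedPorts(
  currentSandboxName: string,
  listSandboxes: ListSandboxes,
): Map<number, string> {
  const occupied = new Map<number, string>();
  // No try/catch here: any error thrown by listSandboxes propagates to the caller.
  for (const entry of listSandboxes().sandboxes) {
    if (entry.name === currentSandboxName) continue; // the current sandbox may reclaim its own port
    const port = entry.dashboardPort;
    if (typeof port !== "number" || !Number.isInteger(port)) continue; // skip null / non-numeric ports
    occupied.set(port, entry.name);
  }
  return occupied;
}

// Injected fake registry, so the example never touches the real registry file.
const fakeList: ListSandboxes = () => ({
  sandboxes: [
    { name: "alpha", dashboardPort: 18789 },
    { name: "beta", dashboardPort: 18790 },
    { name: "gamma", dashboardPort: null },
    { name: "delta", dashboardPort: "18791" },
  ],
});

const ports = registryOccupiedPorts("beta", fakeList);
console.log([...ports.entries()]);
```

Injecting the list function is what keeps a test independent of the runner's own registry state, which is the motivation the bullet gives for the optional listSandboxesFn parameter.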

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes

    • Clearer, user-facing guidance when a sandbox lacks a spec — prompts selecting the owning gateway and retrying instead of surfacing raw gRPC output.
  • New Features

    • Dashboard port allocation now accounts for ports persisted by sibling sandboxes across gateways to reduce conflicts and prefer appropriate owners.
  • Tests

    • Added unit coverage for gateway hint behavior and multi-gateway dashboard-port allocation scenarios.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
coderabbitai Bot commented Jun 13, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 48234cd7-0d4c-4624-bc27-4b5096678968

📥 Commits

Reviewing files that changed from the base of the PR and between c90427d and 390c7cb.

📒 Files selected for processing (2)
  • src/lib/onboard/dashboard-port.ts
  • src/lib/onboard/dashboard.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/dashboard-port.ts

Walkthrough

Reads persisted sandbox registry to build cross-gateway dashboard-port occupancy, merges that with the active gateway forward-list when selecting ports, wires the registry map through create/resolve flows and tests, and emits specific guidance when a sandbox returns "sandbox has no spec".

Changes

Multi-instance Sandbox Support

| Layer / File(s) | Summary |
|---|---|
| Registry occupancy types and retrieval / src/lib/onboard/dashboard-port.ts | Adds SandboxRegistryEntry/ListSandboxesFn types and getRegistryOccupiedDashboardPorts; builds a port→sandbox map excluding the current sandbox and invalid ports. |
| Dashboard port allocation wiring / src/lib/onboard/dashboard-port.ts | Adds optional registryOccupiedPorts to CreateSandboxDashboardPortInput; resolveCreateSandboxDashboardPort now fetches registry occupancy when not supplied and forwards it into allocation. |
| Allocation logic: merge & availability / src/lib/onboard/dashboard-port.ts | Implements mergeOccupiedPorts and extends findAvailableDashboardPort to consult merged occupancy (forward-list precedence) so ports owned by other sandboxes are skipped while allowing self-reuse. |
| Multi-gateway dashboard port allocation tests / src/lib/onboard/dashboard-port.test.ts | Imports and tests getRegistryOccupiedDashboardPorts; extends findAvailableDashboardPort tests for sibling registry occupancy, self-reuse, invalid entry filtering, exhaustion reporting, and forward-list precedence. |
| Dashboard ensure wiring / src/lib/onboard/dashboard.ts | Imports getRegistryOccupiedDashboardPorts and passes the registry map into findAvailableDashboardPort when ensuring dashboard forwards. |
| Error guidance for orphaned sandbox / src/lib/actions/sandbox/gateway-state.ts, src/lib/actions/sandbox/gateway-state-hints.test.ts | printGatewayLifecycleHint now detects "sandbox has no spec" and emits instructions to select the owning nemoclaw-<port> gateway and retry; tests validate per-port gateway naming and negative cases. |
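
The error-guidance row can be pictured with a small sketch. The function name `gatewayHintFor`, the hint wording, and the sample input strings are made up for illustration; only the "sandbox has no spec" text, the openshell gateway select command, and the nemoclaw-<port> naming pattern come from this PR.

```typescript
// Hypothetical sketch: function name, hint wording, and sample inputs are assumptions.
function gatewayHintFor(output: string, owningGateway: string): string | null {
  // Only the specific gateway-side reply triggers the hint; other output is left alone.
  if (!output.includes("sandbox has no spec")) return null;
  return [
    "The selected gateway has no spec for this sandbox.",
    `Run: openshell gateway select ${owningGateway}`,
    "Then retry the command.",
  ].join("\n");
}

// Made-up lifecycle output strings, used only to exercise both branches.
const hint = gatewayHintFor("rpc failed: sandbox has no spec", "nemoclaw-8081");
const none = gatewayHintFor("gateway is starting", "nemoclaw-8081");
console.log(hint);
console.log(none);
```

Returning null for unrelated output mirrors the negative case the tests are said to cover.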

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#5130: Overlaps at src/lib/onboard/dashboard-port.ts where resolve/create port logic was refactored; related changes to allocation flow.
  • NVIDIA/NemoClaw#5128: Earlier changes to dashboard-port creation wiring and resolveCreateSandboxDashboardPort that this PR extends.
  • NVIDIA/NemoClaw#5225: Related gateway-binding and lifecycle hint adjustments that intersect with printGatewayLifecycleHint behavior.

Suggested labels

area: onboarding, v0.0.64

Suggested reviewers

  • cv

🐰 I poked the registry, checked each gate,
Assigned ports so sandboxes wait no more;
When a sandbox cries "no spec" in fright,
Select its gateway — and try again tonight.
Hoppy deploys and happy ports!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(onboard): allocate dashboard ports across NemoClaw gateways' clearly and concisely describes the main change: fixing dashboard port allocation to work across multiple NemoClaw gateways. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses all primary coding objectives from issues #4865 and #5359: prevents cross-gateway dashboard port reuse via registry consultation, adds 'sandbox has no spec' error hint in gateway-state.ts, and includes comprehensive test coverage for multi-gateway scenarios. |
| Out of Scope Changes check | ✅ Passed | All changes in dashboard-port.ts, dashboard.ts, gateway-state.ts, and their tests directly support the core objectives of fixing cross-gateway port allocation and improving error diagnostics; no out-of-scope refactoring or unrelated changes detected. |

github-actions Bot commented Jun 13, 2026

Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt

github-actions Bot commented Jun 13, 2026

Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: None
Optional Vitest E2E scenarios: None

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt

github-actions Bot commented Jun 13, 2026

Contributor

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Top item: PR review advisor unavailable

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • PR review advisor unavailable: The automated advisor could not complete: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt
    • Recommendation: Re-run the PR Review Advisor or perform a manual review.
    • Evidence: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt

🌱 Nice ideas

  • None.
Consider writing more tests for
  • Runtime validation: Add or identify targeted runtime/integration validation for the changed behavior; do not report external E2E job pass/fail here. Runtime/sandbox/infrastructure paths need behavioral runtime validation: src/lib/actions/sandbox/gateway-state.ts, src/lib/onboard/dashboard-port.ts, src/lib/onboard/dashboard.ts.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@laitingsheng laitingsheng added the area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery) and bug-fix (PR fixes a bug or regression) labels Jun 13, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)

2541-2550: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Move this lookup behind dashboard-port.ts so src/lib/onboard.ts stays within the growth budget.

This extra wiring is what pushed src/lib/onboard.ts over the entrypoint guardrail in CI. If resolveCreateSandboxDashboardPort() computes getRegistryOccupiedDashboardPorts(input.sandboxName) when registryOccupiedPorts is omitted, this call site can keep its old shape without losing the new behavior.
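
The suggested shape, an optional parameter with an internal fallback, might look roughly like the sketch below. All names here (`CreatePortInput`, `lookupRegistryPorts`, `resolveCreatePort`) and the in-memory registry stub are assumptions for illustration, not the project's actual signatures.

```typescript
// Hypothetical sketch: names and the lookup stub are assumptions, not the PR's code.
type OccupiedPorts = Map<number, string>; // port -> owning sandbox name

interface CreatePortInput {
  sandboxName: string;
  registryOccupiedPorts?: OccupiedPorts; // optional, so existing call sites compile unchanged
}

// Stand-in for the registry read; a real lookup would consult the persisted registry.
function lookupRegistryPorts(currentSandboxName: string): OccupiedPorts {
  const persisted: Array<[number, string]> = [[18789, "alpha"]];
  return new Map(persisted.filter(([, owner]) => owner !== currentSandboxName));
}

function resolveCreatePort(input: CreatePortInput): number {
  // Fall back to the registry-derived view only when the caller did not supply one.
  const registry = input.registryOccupiedPorts ?? lookupRegistryPorts(input.sandboxName);
  let port = 18789; // starting port taken from the URL in the PR description
  while (registry.has(port)) port++; // no upper bound in this sketch
  return port;
}

// The call site keeps its old shape: no registry argument is passed.
const fromDefault = resolveCreatePort({ sandboxName: "beta" });
// Tests can still inject an explicit (here empty) view.
const fromOverride = resolveCreatePort({ sandboxName: "beta", registryOccupiedPorts: new Map() });
console.log(fromDefault, fromOverride);
```

Keeping the parameter optional means the entrypoint file does not need to import or call the registry helper at all.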

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2541 - 2550, The call site currently
computes getRegistryOccupiedDashboardPorts(sandboxName) and passes it into
resolveCreateSandboxDashboardPort, which bloats src/lib/onboard.ts; instead
remove the registryOccupiedPorts argument from this call site (stop invoking
getRegistryOccupiedDashboardPorts here) and make
resolveCreateSandboxDashboardPort responsible for computing registry-occupied
ports when its registryOccupiedPorts parameter is omitted/undefined. Update
resolveCreateSandboxDashboardPort's implementation to import/use
getRegistryOccupiedDashboardPorts(sandboxName) internally as the fallback, keep
the function signature backward-compatible (optional param), and ensure existing
behavior and tests remain unchanged.

Source: Pipeline failures

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/dashboard-port.ts`:
- Around line 209-213: The blanket catch around entries = list().sandboxes is
hiding real IO/permission errors; instead either call the existing
listSandboxes() helper (which already degrades for missing/unparseable registry
files) or narrow the catch to only handle the safe fallback cases (e.g.,
error.code === 'ENOENT' or JSON parsing errors) and rethrow any other errors
(permission/unreadable file) so they abort; reference list(), listSandboxes(),
and readConfigFile to locate the logic and implement the safer error handling
(i.e., remove the unconditional catch or replace it with conditional checks and
rethrow).
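
A narrowed catch of the kind this comment asks for could be sketched as follows. The helper names and the choice of ENOENT plus SyntaxError as the "safe" cases are assumptions drawn from the comment's own examples, not from the merged code.

```typescript
// Hypothetical sketch: the error classification is an assumption about the "safe" fallbacks.
interface Registry {
  sandboxes: { name: string }[];
}

function isMissingFile(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === "ENOENT";
}

function readRegistryOrEmpty(read: () => string): Registry {
  try {
    return JSON.parse(read()) as Registry;
  } catch (err) {
    // Missing file or unparseable JSON: degrade to an empty registry.
    if (isMissingFile(err) || err instanceof SyntaxError) return { sandboxes: [] };
    throw err; // permission errors and anything else abort the caller
  }
}

const missing = Object.assign(new Error("no such file"), { code: "ENOENT" });
const denied = Object.assign(new Error("permission denied"), { code: "EACCES" });

console.log(readRegistryOrEmpty(() => { throw missing; }).sandboxes.length);
console.log(readRegistryOrEmpty(() => "{not json").sandboxes.length);
try {
  readRegistryOrEmpty(() => { throw denied; });
} catch (err) {
  console.log("rethrown:", (err as Error).message);
}
```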

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 61beb437-b077-4a3c-8299-ad174f026e5d

📥 Commits

Reviewing files that changed from the base of the PR and between 1467e84 and 3c70f45.

📒 Files selected for processing (6)
  • src/lib/actions/sandbox/gateway-state-hints.test.ts
  • src/lib/actions/sandbox/gateway-state.ts
  • src/lib/onboard.ts
  • src/lib/onboard/dashboard-port.test.ts
  • src/lib/onboard/dashboard-port.ts
  • src/lib/onboard/dashboard.ts

Comment thread src/lib/onboard/dashboard-port.ts Outdated
…rt module

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…st registry state

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@cv cv merged commit f34ac4c into main Jun 13, 2026
46 checks passed
@cv cv deleted the feat/multi-gateway-dashboard-binding branch June 13, 2026 08:21
@cv cv added the v0.0.65 Release target label Jun 13, 2026
@miyoungc miyoungc mentioned this pull request Jun 16, 2026
13 tasks
cv pushed a commit that referenced this pull request Jun 17, 2026
## Summary
Refreshes release-prep documentation for NemoClaw v0.0.65.
Adds the v0.0.65 release-notes section and refreshes generated
`nemoclaw-user-*` skills from the Fern MDX source docs.

## Changes
- Added the v0.0.65 release notes to `docs/about/release-notes.mdx` with
links to the deeper docs pages for lifecycle, troubleshooting,
inference, CLI commands, messaging, credentials, network policy, Hermes,
and sub-agents.
- Regenerated the `nemoclaw-user-*` skills with
`scripts/docs-to-skills.py` so release-prep skill output matches the
merged source docs.
- Used the v0.0.65 announcement discussion as release context:
#5472.

## Source Summary
- #2492 -> `docs/about/release-notes.mdx`: Documents deadline-based
gateway wait reliability in the v0.0.65 recovery summary.
- #4958 -> `docs/about/release-notes.mdx`: Documents re-execed OpenClaw
gateway health check recovery in the sandbox recovery summary.
- #5163 -> `docs/about/release-notes.mdx`: Documents safer uninstall TTY
confirmation behavior in the day-two CLI summary.
- #5178 -> `docs/about/release-notes.mdx`: Documents fail-closed config
restore merge behavior in the rebuild and restore summary.
- #5179 -> `docs/about/release-notes.mdx`: Documents WeChat QR token
redaction in the messaging summary.
- #5182 -> `docs/about/release-notes.mdx`: Documents sustained gateway
serving checks in the recovery summary.
- #5194 -> `docs/about/release-notes.mdx`: Documents model-router
teardown during uninstall in the day-two CLI summary.
- #5195 -> `docs/about/release-notes.mdx`: Documents Shields
auto-restore lock reconfirmation in the rebuild and restore summary.
- #5198 -> `docs/about/release-notes.mdx`: Documents Docker Desktop WSL
CDI injection failure handling in the onboarding diagnostics summary.
- #5201 -> `docs/about/release-notes.mdx`: Documents sandbox
download/upload wrappers and sessions export in the day-two CLI summary.
- #5205 -> `docs/about/release-notes.mdx`: Documents reporter-owned
model metadata preservation in the rebuild and restore summary.
- #5214 -> `docs/about/release-notes.mdx`: Documents managed vLLM model
preflight before side effects in the inference setup summary.
- #5215 -> `docs/about/release-notes.mdx`: Documents managed vLLM extra
serve arguments in the inference setup summary.
- #5216 -> `docs/about/release-notes.mdx`: Documents silent OpenClaw
runtime fallback surfacing in the onboarding diagnostics summary.
- #5225 -> `docs/about/release-notes.mdx`: Documents persisted sandbox
gateway lookup in the gateway recovery summary.
- #5238 -> `docs/about/release-notes.mdx`: Documents sub-agent gateway
dial-back through the sandbox interface in the Hermes and sub-agent
summary.
- #5248 -> `docs/about/release-notes.mdx`: Documents Discord per-account
proxy resolution in the messaging summary.
- #5264 -> `docs/about/release-notes.mdx`: Documents reserved Hermes
port `8642` handling in the Hermes compatibility summary.
- #5267 -> `docs/about/release-notes.mdx`: Documents the narrower Hermes
baseline policy in the Hermes compatibility summary.
- #5321 -> `docs/about/release-notes.mdx`: Documents restored gateway
guard chains in the gateway recovery summary.
- #5328 -> `docs/about/release-notes.mdx`: Documents compact persisted
messaging plans in the messaging summary.
- #5338 -> `docs/about/release-notes.mdx`: Documents manifest channel
migration in the messaging summary.
- #5352 -> `docs/about/release-notes.mdx`: Documents persisted agent
preservation through registry recovery in the rebuild and restore
summary.
- #5371 ->
`.agents/skills/nemoclaw-user-reference/references/commands.md`:
Refreshes generated skill output for custom build cache and
layer-ordering source docs.
- #5379 -> `docs/about/release-notes.mdx`: Documents dashboard port
allocation across multiple NemoClaw gateways in the recovery summary.
- #5382 -> `docs/about/release-notes.mdx`: Documents recovery when an
active gateway has no sandbox spec in the recovery summary.
- #5389 ->
`.agents/skills/nemoclaw-user-reference/references/troubleshooting.md`:
Refreshes generated skill output for declared agent `forward_ports`
recovery source docs.
- #5400 -> `docs/about/release-notes.mdx`: Documents bounded compatible
endpoint probes in the inference setup summary.
- #5410 -> `docs/about/release-notes.mdx`: Documents provider credential
hash removal from sandbox registry entries in the messaging summary.
- #5418 -> `docs/about/release-notes.mdx`: Documents summarized
inference validation failures in the onboarding diagnostics summary.
- #5457 -> `docs/about/release-notes.mdx`: Documents context-window
recomputation after runtime model switches in the inference setup
summary.
- #5463 -> `docs/about/release-notes.mdx`: Documents cleanup of
hard-coded messaging channel stragglers in the messaging summary.

## Skipped
- #5366 matched `docs/.docs-skip` entries through skipped experimental
paths, so this PR does not add new release-note text for that commit.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] Git hooks passed during commit and push, or `npx prek run
--from-ref main --to-ref HEAD` passes
- [ ] Targeted tests pass for changed behavior
- [ ] Full `npm test` passes (broad runtime changes only)
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Verification notes:
- `npm run docs` passed after rerunning outside the sandbox. Fern
reported 0 errors and 1 hidden warning.
- The first sandboxed `npm run docs` attempt failed before validation
because `tsx` could not create its local IPC pipe under sandbox
restrictions.
- `npm run build:cli` passed before push to refresh the local `dist/`
artifacts used by the CLI typecheck hook.
- `npm test` was not run because this is a docs-only release refresh.

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Released NemoClaw v0.0.65 with improved gateway/sandbox recovery,
safer day-two workflows, and enhanced Hermes compatibility.
* Added managed vLLM extra-arguments configuration via
`NEMOCLAW_VLLM_EXTRA_ARGS_JSON`.
* Added Hermes troubleshooting guidance for port forwarding and health
checks.

* **Documentation**
* Updated NVIDIA Endpoints/NIM setup and examples to use
`NVIDIA_INFERENCE_API_KEY`.
* Refined NVIDIA network policy and Model Router API base configuration.
* Expanded CLI/environment variable documentation (including sub-agent
gateway connectivity) and plugin build performance tips.

* **Tests**
  * Expanded Vitest-backed E2E release validation coverage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Labels

area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery), bug-fix (PR fixes a bug or regression), v0.0.65 (Release target)

2 participants