perf(onboard): add deadline-based gateway wait#2492
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughGateway health polling during onboard transitions from a manual for/sleep loop to a ChangesPolling refactor and test expansion
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR introduces a reusable, synchronous, deadline-based polling helper and applies it to the gateway startup health wait loop as part of the onboard latency work (refs #2001), while aiming to preserve existing health poll env knob behavior.
Changes:
- Added
waitUntil(...)tosrc/lib/wait.tswith deadline, backoff, injectable clock/sleeper hooks, and optional attempt cap. - Replaced the gateway startup health polling loop in
src/lib/onboard.tswith the newwaitUntil(...)helper. - Expanded
test/wait.test.tswith focused unit tests forwaitUntil(...)behavior (immediate success, retries, deadline, backoff, and attempt caps).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/wait.test.ts | Adds unit coverage for the new waitUntil(...) helper. |
| src/lib/wait.ts | Introduces the waitUntil(...) synchronous polling primitive and options. |
| src/lib/onboard.ts | Switches gateway health polling to use waitUntil(...) with existing poll count/interval inputs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/wait.test.ts (1)
42-172: Add a contract test for invaliddeadlineMs.Please add a test that asserts
waitUntilthrowsTypeErrorfor non-finitedeadlineMs(e.g.,NaN), so the input validation path is locked in.✅ Suggested test
+ it("waitUntil throws when deadlineMs is non-finite", () => { + expect(() => + waitUntil(() => false, { + deadlineMs: Number.NaN, + }), + ).toThrow(TypeError); + });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/wait.test.ts` around lines 42 - 172, Add a new test in wait.test.ts using the existing waitUntil harness that verifies passing a non-finite deadlineMs (e.g., NaN) causes waitUntil to throw a TypeError: call waitUntil with a simple predicate and options including deadlineMs: NaN plus the same now and sleep mocks used in other tests, and assert with expect(() => waitUntil(...)).toThrow(TypeError) so the input validation path for deadlineMs is locked in.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3512-3541: The deadline calculation (healthDeadlineMs) only
accounts for sleep budget and can expire while each waitUntil attempt runs
multiple shell-outs (repairGatewayBootstrapSecrets, runCaptureOpenshell,
isGatewayHealthy), shortening the effective retry window; fix by extending or
removing the deadline: either compute healthDeadlineMs to also include an
allowance for per-attempt probe runtime (e.g., add healthPollCount *
probeTimeoutMs or add a conservative maxProbeRuntimeMs) so the deadline >= total
sleep budget + total probe runtime, or omit deadlineMs from the waitUntil call
to rely solely on maxAttempts/healthPollCount; update the healthDeadlineMs
computation or the waitUntil invocation accordingly (references:
healthPollIntervalMs, healthDeadlineMs, healthPollCount, waitUntil,
repairGatewayBootstrapSecrets, runCaptureOpenshell, isGatewayHealthy).
In `@src/lib/wait.ts`:
- Around line 98-99: The loop can hot-spin when intervalMs (or maxIntervalMs) is
0 and maxAttempts is unbounded; change the sleeper call and interval update to
enforce a minimum non-zero sleep (e.g., MIN_SLEEP_MS = 1) so you never call
sleeper(0). Specifically, in the block using sleeper(Math.min(intervalMs,
deadlineMs - currentMs)) and intervalMs = Math.min(maxIntervalMs, intervalMs *
backoffFactor), clamp the sleep duration with Math.max(MIN_SLEEP_MS, ...) and
ensure intervalMs is also floored to at least MIN_SLEEP_MS after applying
backoff so the loop yields CPU even when initialIntervalMs/maxIntervalMs are
zero (references: sleeper, intervalMs, maxIntervalMs, backoffFactor, deadlineMs,
currentMs).
---
Nitpick comments:
In `@test/wait.test.ts`:
- Around line 42-172: Add a new test in wait.test.ts using the existing
waitUntil harness that verifies passing a non-finite deadlineMs (e.g., NaN)
causes waitUntil to throw a TypeError: call waitUntil with a simple predicate
and options including deadlineMs: NaN plus the same now and sleep mocks used in
other tests, and assert with expect(() => waitUntil(...)).toThrow(TypeError) so
the input validation path for deadlineMs is locked in.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 88a34e7e-6332-4895-82d8-e2cfc9f03551
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/wait.tstest/wait.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/wait.ts`:
- Around line 94-108: The loop currently calls condition() even after the
deadline guard because the deadline check (using now() and deadlineMs) comes
after condition(); move the deadline/time check so it runs before invoking
condition() (and keep the existing attempts checks), ensuring you check
Number.isFinite(currentMs) and currentMs >= deadlineMs prior to calling
condition() to avoid the extra probe; update the loop in src/lib/wait.ts
(references: attempts, maxAttempts, deadlineMs, now(), condition()) accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d13eeb32-c6e5-4ffb-bbea-dae420740d03
📒 Files selected for processing (2)
src/lib/wait.tstest/wait.test.ts
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/lib/wait.ts (1)
93-109:⚠️ Potential issue | 🟠 MajorHonor an already-expired deadline before the first probe.
Line 95 only checks the deadline after at least one attempt, so
condition()on Line 107 still runs once whendeadlineMs <= now()at entry. For callers using this to drive real health probes, that is still one probe past the requested budget. Please move the deadline guard out of theattempts > 0branch and add a regression test for an already-expired deadline.Suggested fix
let attempts = 0; for (;;) { - if (attempts > 0) { - const currentMs = now(); - if (!Number.isFinite(currentMs) || currentMs >= deadlineMs) { - return false; - } + const currentMs = now(); + if (!Number.isFinite(currentMs) || currentMs >= deadlineMs) { + return false; } if (attempts >= maxAttempts) { return false; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/wait.ts` around lines 93 - 109, The loop currently only checks the deadline after the first probe (gated by attempts > 0), so move the deadline guard to run on every iteration before invoking condition(): call const currentMs = now(); if (!Number.isFinite(currentMs) || currentMs >= deadlineMs) return false; (remove the attempts > 0 gate) so an already-expired deadline causes an immediate false; also add a regression test that sets deadlineMs <= now() and asserts the wait function returns false (and that condition() is not invoked) to prevent regressions; refer to symbols attempts, now(), deadlineMs, condition(), and maxAttempts when making changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/wait.ts`:
- Around line 93-109: The loop currently only checks the deadline after the
first probe (gated by attempts > 0), so move the deadline guard to run on every
iteration before invoking condition(): call const currentMs = now(); if
(!Number.isFinite(currentMs) || currentMs >= deadlineMs) return false; (remove
the attempts > 0 gate) so an already-expired deadline causes an immediate false;
also add a regression test that sets deadlineMs <= now() and asserts the wait
function returns false (and that condition() is not invoked) to prevent
regressions; refer to symbols attempts, now(), deadlineMs, condition(), and
maxAttempts when making changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ab7479bd-599b-456e-8318-8c5f61cdf8fa
📒 Files selected for processing (2)
src/lib/wait.tstest/wait.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/wait.ts`:
- Around line 95-116: The code uses currentMs captured before calling
condition(), which can be stale if condition() is slow; update the timestamp
after the condition() call and before computing remainingMs/sleepDurationMs so
the sleep budget respects deadlineMs and avoids sleeping past the deadline; in
the wait function recompute currentMs = now() (or equivalent) after condition()
returns and use that value to derive remainingMs, requestedSleepMs and to clamp
sleeper(...) (referencing currentMs, now(), deadlineMs, condition(),
remainingMs, requestedSleepMs, MIN_UNCAPPED_SLEEP_MS, hasAttemptCap, and
sleeper).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3b8e8326-56d6-4585-bb6e-fa32ebce51cf
📒 Files selected for processing (2)
src/lib/wait.tstest/wait.test.ts
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
bb47782 to
5a709c4
Compare
|
✨ Thanks for submitting this pull request that proposes a way to improve the performance of NemoClaw's onboard process by introducing a deadline-based wait primitive. Related open issues: |
|
Sprint 5 planning update: we’re organizing this PR as the foundation PR for #3768, with #2001 remaining the umbrella tracker. Relationship:
This PR should be tracked for Sprint 5 review, but it should not auto-close #3768 or #2001 when merged. If it lands, #3768 should remain open for the remaining readiness paths called out there. |
|
@coderabbitai review |
✅ Action performedReview finished.
|
## Summary Refreshes release-prep documentation for NemoClaw v0.0.65. Adds the v0.0.65 release-notes section and refreshes generated `nemoclaw-user-*` skills from the Fern MDX source docs. ## Changes - Added the v0.0.65 release notes to `docs/about/release-notes.mdx` with links to the deeper docs pages for lifecycle, troubleshooting, inference, CLI commands, messaging, credentials, network policy, Hermes, and sub-agents. - Regenerated the `nemoclaw-user-*` skills with `scripts/docs-to-skills.py` so release-prep skill output matches the merged source docs. - Used the v0.0.65 announcement discussion as release context: #5472. ## Source Summary - #2492 -> `docs/about/release-notes.mdx`: Documents deadline-based gateway wait reliability in the v0.0.65 recovery summary. - #4958 -> `docs/about/release-notes.mdx`: Documents re-execed OpenClaw gateway health check recovery in the sandbox recovery summary. - #5163 -> `docs/about/release-notes.mdx`: Documents safer uninstall TTY confirmation behavior in the day-two CLI summary. - #5178 -> `docs/about/release-notes.mdx`: Documents fail-closed config restore merge behavior in the rebuild and restore summary. - #5179 -> `docs/about/release-notes.mdx`: Documents WeChat QR token redaction in the messaging summary. - #5182 -> `docs/about/release-notes.mdx`: Documents sustained gateway serving checks in the recovery summary. - #5194 -> `docs/about/release-notes.mdx`: Documents model-router teardown during uninstall in the day-two CLI summary. - #5195 -> `docs/about/release-notes.mdx`: Documents Shields auto-restore lock reconfirmation in the rebuild and restore summary. - #5198 -> `docs/about/release-notes.mdx`: Documents Docker Desktop WSL CDI injection failure handling in the onboarding diagnostics summary. - #5201 -> `docs/about/release-notes.mdx`: Documents sandbox download/upload wrappers and sessions export in the day-two CLI summary. - #5205 -> `docs/about/release-notes.mdx`: Documents reporter-owned model metadata preservation in the rebuild and restore summary. - #5214 -> `docs/about/release-notes.mdx`: Documents managed vLLM model preflight before side effects in the inference setup summary. - #5215 -> `docs/about/release-notes.mdx`: Documents managed vLLM extra serve arguments in the inference setup summary. - #5216 -> `docs/about/release-notes.mdx`: Documents silent OpenClaw runtime fallback surfacing in the onboarding diagnostics summary. - #5225 -> `docs/about/release-notes.mdx`: Documents persisted sandbox gateway lookup in the gateway recovery summary. - #5238 -> `docs/about/release-notes.mdx`: Documents sub-agent gateway dial-back through the sandbox interface in the Hermes and sub-agent summary. - #5248 -> `docs/about/release-notes.mdx`: Documents Discord per-account proxy resolution in the messaging summary. - #5264 -> `docs/about/release-notes.mdx`: Documents reserved Hermes port `8642` handling in the Hermes compatibility summary. - #5267 -> `docs/about/release-notes.mdx`: Documents the narrower Hermes baseline policy in the Hermes compatibility summary. - #5321 -> `docs/about/release-notes.mdx`: Documents restored gateway guard chains in the gateway recovery summary. - #5328 -> `docs/about/release-notes.mdx`: Documents compact persisted messaging plans in the messaging summary. - #5338 -> `docs/about/release-notes.mdx`: Documents manifest channel migration in the messaging summary. - #5352 -> `docs/about/release-notes.mdx`: Documents persisted agent preservation through registry recovery in the rebuild and restore summary. - #5371 -> `.agents/skills/nemoclaw-user-reference/references/commands.md`: Refreshes generated skill output for custom build cache and layer-ordering source docs. - #5379 -> `docs/about/release-notes.mdx`: Documents dashboard port allocation across multiple NemoClaw gateways in the recovery summary. - #5382 -> `docs/about/release-notes.mdx`: Documents recovery when an active gateway has no sandbox spec in the recovery summary. - #5389 -> `.agents/skills/nemoclaw-user-reference/references/troubleshooting.md`: Refreshes generated skill output for declared agent `forward_ports` recovery source docs. - #5400 -> `docs/about/release-notes.mdx`: Documents bounded compatible endpoint probes in the inference setup summary. - #5410 -> `docs/about/release-notes.mdx`: Documents provider credential hash removal from sandbox registry entries in the messaging summary. - #5418 -> `docs/about/release-notes.mdx`: Documents summarized inference validation failures in the onboarding diagnostics summary. - #5457 -> `docs/about/release-notes.mdx`: Documents context-window recomputation after runtime model switches in the inference setup summary. - #5463 -> `docs/about/release-notes.mdx`: Documents cleanup of hard-coded messaging channel stragglers in the messaging summary. ## Skipped - #5366 matched `docs/.docs-skip` entries through skipped experimental paths, so this PR does not add new release-note text for that commit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] Git hooks passed during commit and push, or `npx prek run --from-ref main --to-ref HEAD` passes - [ ] Targeted tests pass for changed behavior - [ ] Full `npm test` passes (broad runtime changes only) - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Verification notes: - `npm run docs` passed after rerunning outside the sandbox. Fern reported 0 errors and 1 hidden warning. - The first sandboxed `npm run docs` attempt failed before validation because `tsx` could not create its local IPC pipe under sandbox restrictions. - `npm run build:cli` passed before push to refresh the local `dist/` artifacts used by the CLI typecheck hook. - `npm test` was not run because this is a docs-only release refresh. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Released NemoClaw v0.0.65 with improved gateway/sandbox recovery, safer day-two workflows, and enhanced Hermes compatibility. * Added managed vLLM extra-arguments configuration via `NEMOCLAW_VLLM_EXTRA_ARGS_JSON`. * Added Hermes troubleshooting guidance for port forwarding and health checks. * **Documentation** * Updated NVIDIA Endpoints/NIM setup and examples to use `NVIDIA_INFERENCE_API_KEY`. * Refined NVIDIA network policy and Model Router API base configuration. * Expanded CLI/environment variable documentation (including sub-agent gateway connectivity) and plugin build performance tips. * **Tests** * Expanded Vitest-backed E2E release validation coverage. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Adds a reusable deadline-based wait primitive and applies it to gateway startup health polling as a scoped first step for #2001.
The rebased implementation keeps the existing health-poll environment knobs compatible and preserves the dual readiness gate: OpenShell metadata must be healthy and the host HTTP endpoint must be serving.
Related Issue
Refs #2001.
Changes
src/lib/core/wait.tswithwaitUntil(...),waitUntilAsync(...), absolute deadlines, configurable backoff, injectable clock/sleeper hooks, and optional attempt caps.src/lib/onboard/gateway-health-wait.ts, keepingsrc/lib/onboard.tsnet smaller for the growth guardrail.NEMOCLAW_HEALTH_POLL_COUNTandNEMOCLAW_HEALTH_POLL_INTERVALbehavior by using fixed intervals andmaxAttempts; per-probe shell/HTTP runtime is not consumed by a startup deadline.waitForGatewayHealth(...)call ordering, metadata repair refresh, dual readiness, attempt limits, final-attempt sleep behavior, and zero-attempt behavior.Type of Change
Verification
npx prek run --all-filespasses viamake checknpm testpassesmake docsbuilds without warnings (doc changes only)Additional validation run locally:
npm run build:clipassesnpm run typecheck:clipassesnpx vitest run src/lib/onboard/gateway-health-wait.test.ts test/wait.test.tspassesmake checkpassesNotes:
AI Disclosure
Signed-off-by: Ho Lim subhoya@gmail.com
Summary by CodeRabbit
Refactor
Tests
Chores