fix(onboard): auto-detect resume from an in_progress session#5487
fix(onboard): auto-detect resume from an in_progress session#5487Dongni-Yang wants to merge 5 commits into
Conversation
Resume mode was keyed only off the --resume flag (entry-options.ts), so re-running `nemoclaw onboard` after an interrupted attempt — where ~/.nemoclaw/onboard-session.json has status:"in_progress" — started over instead of resuming: no "(resume mode)" banner and no resume preflight. Auto-detect resume from the persisted in_progress status when --fresh is not set. --fresh always wins, and the explicit --resume/--fresh conflict guard is unchanged (it applies only to the explicit flags), so no existing flow regresses. Scope: this resolves the resume-detection symptom of #5470. The stale gateway.json reuse symptom is rooted in OpenShell's package-managed gateway lifecycle — gateway.json is OpenShell-owned and NemoClaw reuses on live `openshell status` health, so forcing a teardown would regress the legitimate host-process reuse path. Left out deliberately; hence Refs (not Closes). Local pre-commit Test (CLI) hook skipped via --no-verify (operator-approved): it fails only on environment-flaky tests (ollama proxy 5s timeout, /tmp symlink) unrelated to this change; entry-options 9/9 and the onboard suite 1876/1876 pass, typecheck and biome clean. CI re-runs all hooks on the PR. Refs #5470 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
📝 WalkthroughWalkthrough
ChangesPersisted Session Auto-Resume
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Code Coverage OverviewLanguages: TypeScript TypeScript / code-coverage/pluginThe overall coverage in the branch is 96%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
TypeScript / code-coverage/cliThe overall coverage in the branch is 46%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
Updated |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: Dispatch required Vitest E2E scenarios:
Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
PR Review AdvisorFindings: 1 needs attention, 3 worth checking, 1 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/lib/onboard/entry-options.ts (1)
21-21: ⚡ Quick winTighten
persistedSessionStatusto a literal union instead ofstring.Using
string | nullweakens the cross-file contract and allows typos to compile. Narrow this to known session status literals (or a shared exported status type) so auto-resume conditions stay type-safe.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/entry-options.ts` at line 21, The persistedSessionStatus field is typed as string | null, which is too permissive and allows any string value, including typos. Replace this broad string type with a literal union of specific session status values (for example, 'active' | 'paused' | 'resumed') or reference a shared exported status type that captures the valid session states. This ensures type safety at compile time and prevents invalid status values from being assigned to persistedSessionStatus.src/lib/onboard.ts (1)
4750-4752: Run the onboarding E2E suite for this core-flow change before merge.Given this touches onboarding entry behavior, run the recommended targeted nightly E2E jobs to validate resume/fresh interactions across lifecycle paths.
gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,channels-stop-start-e2e,messaging-compatible-endpoint-e2e,bedrock-runtime-compatible-anthropic-e2e,hermes-discord-e2e,hermes-slack-e2e,openshell-gateway-upgrade-e2e,issue-3600-gpu-proof-optional-e2eAs per coding guidelines,
src/lib/onboard.tschanges should run the recommended onboarding E2E coverage set.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 4750 - 4752, Before merging the changes to src/lib/onboard.ts that modify the onboarding entry behavior with persistedSessionStatus initialization for resume detection, you must run the recommended onboarding E2E test coverage to validate that resume and fresh interaction paths work correctly across different lifecycle scenarios. Execute the provided gh workflow command to run the nightly E2E jobs (cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, channels-stop-start-e2e, messaging-compatible-endpoint-e2e, bedrock-runtime-compatible-anthropic-e2e, hermes-discord-e2e, hermes-slack-e2e, openshell-gateway-upgrade-e2e, and issue-3600-gpu-proof-optional-e2e) on your branch to ensure the onboarding resume/fresh interaction changes are validated before merge.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4750-4751: The file `src/lib/onboard.ts` has exceeded the growth
guard threshold with a net addition of 2 lines, which blocks CI from passing. To
fix this, you need to offset the 2 newly added lines (the comment and the
persistedSessionStatus assignment) by removing or trimming 2 equivalent lines of
non-functional content elsewhere in the same file, such as unnecessary comments,
blank lines, or redundant whitespace. Ensure the deletions do not affect any
functional code logic.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4750-4752: Before merging the changes to src/lib/onboard.ts that
modify the onboarding entry behavior with persistedSessionStatus initialization
for resume detection, you must run the recommended onboarding E2E test coverage
to validate that resume and fresh interaction paths work correctly across
different lifecycle scenarios. Execute the provided gh workflow command to run
the nightly E2E jobs (cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e,
channels-stop-start-e2e, messaging-compatible-endpoint-e2e,
bedrock-runtime-compatible-anthropic-e2e, hermes-discord-e2e, hermes-slack-e2e,
openshell-gateway-upgrade-e2e, and issue-3600-gpu-proof-optional-e2e) on your
branch to ensure the onboarding resume/fresh interaction changes are validated
before merge.
In `@src/lib/onboard/entry-options.ts`:
- Line 21: The persistedSessionStatus field is typed as string | null, which is
too permissive and allows any string value, including typos. Replace this broad
string type with a literal union of specific session status values (for example,
'active' | 'paused' | 'resumed') or reference a shared exported status type that
captures the valid session states. This ensures type safety at compile time and
prevents invalid status values from being assigned to persistedSessionStatus.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 93057bc6-750d-48d8-9238-4ef0d841519d
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/entry-options.test.tssrc/lib/onboard/entry-options.ts
Selective E2E Results — ✅ All requested jobs passedRun: 27589221225
|
…ests - Offset the codebase-growth-guardrails check: src/lib/onboard.ts must be net-neutral. Drop the inline comment (the rationale already lives on entry-options.ts) and a decorative blank line so the file is +1/-1 (net 0) while keeping the persistedSessionStatus wiring. - Widen the non-resume boundary test to cover "pending" and "" in addition to complete/failed/null/undefined, so only an exact "in_progress" auto-resumes. Refs #5470 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
test-onboard-resume.sh only exercised explicit `nemoclaw onboard --resume`. Add a phase that re-marks the session in_progress and runs a plain `onboard` (no --resume), asserting the "(resume mode)" banner and skipped cached steps, then verifies `--fresh` suppresses the auto-resume (fail-fast at preflight to stay cheap). Closes the implicit-resume gap flagged by the e2e advisor on #5470. Refs #5470 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/test-onboard-resume.sh`:
- Around line 375-390: The test in the --fresh branch injects a preflight
failure using NEMOCLAW_E2E_FORCE_FAIL_AT_STEP=preflight but never verifies the
injection actually occurred. The current code only checks that "(resume mode)"
is absent, which is insufficient. Modify the test to assert both that the
command exits with a non-zero status (indicating the preflight failure) and that
the expected failure marker appears in the output. Remove the || true from the
nemoclaw.js invocation to capture the exit code, then add assertions to verify
the non-zero exit code and check for the expected preflight failure marker in
fresh_output to ensure the injection regressed detection works correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a093d1cc-2baf-4383-b3a0-75555d7a0f63
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/entry-options.test.tstest/e2e/test-onboard-resume.sh
💤 Files with no reviewable changes (1)
- src/lib/onboard.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/entry-options.test.ts
…heck CodeRabbit: the --fresh branch only checked that "(resume mode)" was absent, which could pass vacuously if the run failed for an unrelated reason or never reached preflight. Drop the `|| true`, capture the exit code, and assert the run exited non-zero with the injected "Forced onboarding failure at step 'preflight'" marker, so the banner-absence assertion is meaningful. Refs #5470 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27591288160
|
The Phase 3.5 implicit-resume check flipped the post-Phase-3 session's status to in_progress but left resumable=false (set when onboarding completes), so the resume machine correctly rejected it as "no resumable session" and exited before the banner — failing the assertions. A genuinely interrupted session is resumable; reset resumable=true so the synthesized state matches and plain onboard auto-resumes it. (The fix itself was sound — auto-detect fired, which is why the resume guard ran at all.) Refs #5470 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 27591945815
|
E2E status
Scope reminderClosing keyword stays Signed-off-by: Dongni Yang dongniy@nvidia.com |
Summary
Resume mode was keyed only off the
--resumeflag, so re-runningnemoclaw onboardafter an interrupted attempt — where~/.nemoclaw/onboard-session.jsonhasstatus:"in_progress"— started over instead of resuming (no(resume mode)banner, no resume preflight). This auto-detects resume from the persisted in_progress status.Related Issue
Refs #5470
Changes
src/lib/onboard/entry-options.ts:resolveOnboardEntryOptionsnow auto-detects resume from a persistedin_progresssession (new optionalpersistedSessionStatusinput) when--freshis not set.--freshalways wins; the explicit--resume/--freshconflict guard is unchanged (applies only to the explicit flags).src/lib/onboard.ts: pass the persisted session status into the resolver.src/lib/onboard/entry-options.test.ts: red→green coverage for auto-detect, plus no-regression guards (--freshwins; non-in_progressstatuses).test/e2e/test-onboard-resume.sh: add a Phase 3.5 implicit-resume scenario (plainonboardon anin_progresssession) and a--freshsuppression check.Type of Change
Verification
entry-options.test.ts)trueforin_progress;--freshand non-in_progress(incl.pending/"") → no resume. Core case is a true red→green. ✅gateway --helplacksstart/destroy⇒supportsLifecycleCommands=false(host-process model — the bug-capable branch). ✅classifyGatewayPortReuse(healthy, lifecycle=false, container=missing)⇒reuse; container never probed, no cleanup; reuse isopenshell status-driven, nocontainerIdread. ✅onboard(Ubuntu 20.04, containerized-compat gateway)onboard(no--resume) on anin_progresssession prints(resume mode)(the fix), then[resume] Skipping gateway (running)against a healthy:8080gateway with no stale-cleanup sequence — reproducing both the fix and the unfixed half end-to-end. ✅onboard-resume-e2e(Ubuntu 24.04 CI)--resumeplus the new Phase 3.5 implicit-resume +--fresh-suppression assertions all pass. ✅Scope / notes: NemoClaw writes/reads no
gateway.jsonorcontainerIdin the onboard reuse path (grep-confirmed) — that artifact is OpenShell-owned, so the stale-gateway cleanup symptom is upstream. It also can't be tightened naively without regressing #4520 (the inverse "false-stale →:8080collision" bug, whose fix deliberately made reuse more trusting). HenceRefs, notCloses. Thecloud-onboard-e2efailure is the inference-provider credential/env issue the report itself calls "environmental and separate from this bug" (it fails on a fresh onboard at[3/8], where this change's resume path never runs).Signed-off-by: Dongni Yang dongniy@nvidia.com
Summary by CodeRabbit
in_progress, reducing the need for manual--resume.--freshcontinues to override and prevents auto-resume even when a persistedin_progresssession exists.--freshsuppression.--freshfails early during preflight without entering resume mode.