
test(e2e): migrate cloud inference and routing scenarios #5057

Closed
cv wants to merge 24 commits into main from codex/e2e-fanout-06-cloud-inference-routing-scenarios

@cv

@cv cv commented Jun 9, 2026

Collaborator

Summary

Migrates the core cloud OpenClaw inference and inference-routing runtime suites into the registry-driven Vitest E2E path.

Related Issue

Refs #4941
Refs #4990
Refs #4349
Depends on #5046, #5052, and #5056.
Stacked on branch codex/e2e-fanout-05-inference-runtime-helpers.

Changes

  • Added RuntimePhaseFixture.runSuite(...) for inference, cloud-inference, and inference-routing.
  • Runs the wired runtime suites from test/e2e-scenario/live/registry-scenarios.test.ts after onboarding and state validation.
  • Promoted inference-routing into the canonical ubuntu-repo-cloud-openclaw suite list so live Vitest dispatch executes it explicitly.
  • Tracks executed runtimeSuites and remaining pendingRuntimeSuites in live scenario artifacts and matrix metadata.
  • Repointed the shell-era inference-routing assertion group at the routing-specific validation scripts instead of aliasing it to generic cloud inference.
  • Marks test/e2e/test-cloud-inference-e2e.sh and test/e2e/test-inference-routing.sh as bridge-probe in the migration inventory. The new Vitest runtime suites cover the active happy-path probes, but deletion stays blocked until the remaining legacy security, negative, provider-route, compatible-endpoint, and skill-filesystem assertions migrate.

Out of scope for this PR: provider switch, compatible endpoint, Kimi, Bedrock, and model-router provider-routed migrations. Those remain for the following inference stack PRs.
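
The suite tracking described in the Changes list can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the PR's `RuntimePhaseFixture`: the `RuntimeSuiteTracker` class, the `SuiteRunner` type, and the artifact shape are hypothetical names chosen for illustration. Only the suite ids and the `runtimeSuites` / `pendingRuntimeSuites` field names come from the description above.

```typescript
// Hypothetical sketch: only the suite ids and the runtimeSuites /
// pendingRuntimeSuites field names come from the PR description.
type RuntimeSuiteId = "inference" | "cloud-inference" | "inference-routing";

// Kept synchronous for brevity; a live suite would normally be async.
type SuiteRunner = () => void;

interface RuntimeSuiteArtifact {
  runtimeSuites: string[]; // suites that actually executed
  pendingRuntimeSuites: string[]; // declared suites that have not run
}

class RuntimeSuiteTracker {
  private readonly declared: readonly string[];
  private readonly runners: Partial<Record<RuntimeSuiteId, SuiteRunner>>;
  private readonly executed: string[] = [];

  constructor(
    declared: readonly string[],
    runners: Partial<Record<RuntimeSuiteId, SuiteRunner>>,
  ) {
    this.declared = declared;
    this.runners = runners;
  }

  // Runs one wired suite and records it; an unwired suite fails fast.
  runSuite(id: RuntimeSuiteId): void {
    const runner = this.runners[id];
    if (!runner) throw new Error(`runtime suite not wired: ${id}`);
    runner(); // throws if the suite's assertions fail, so it is not recorded
    this.executed.push(id);
  }

  // Splits the declared suite list into executed and still-pending entries.
  artifact(): RuntimeSuiteArtifact {
    return {
      runtimeSuites: [...this.executed],
      pendingRuntimeSuites: this.declared.filter((s) => !this.executed.includes(s)),
    };
  }
}
```

Recording a suite only after its runner returns means a failed suite stays in the pending list, which is one plausible way to keep the artifact honest about what actually ran.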

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Verified locally:

  • npx vitest run --project e2e-scenario-framework test/e2e-scenario/framework-tests/e2e-phase-runtime.test.ts test/e2e-scenario/framework-tests/e2e-live-registry-discovery.test.ts test/e2e-scenario/framework-tests/e2e-scenario-matrix.test.ts test/e2e-scenario/framework-tests/e2e-scenario-registry.test.ts test/e2e-scenario/framework-tests/e2e-migration-inventory.test.ts test/e2e-scenario/framework-tests/e2e-plan-compiler.test.ts --silent=false --reporter=default
  • npx vitest run --project e2e-scenario-framework test/e2e-scenario/framework-tests/e2e-migration-inventory.test.ts --silent=false --reporter=default
  • npx vitest run --project e2e-scenario-framework --silent=false --reporter=default
  • npm run typecheck:cli
  • npx prek run --files test/e2e-scenario/framework/phases/runtime.ts test/e2e-scenario/framework/phases/index.ts test/e2e-scenario/live/registry-scenarios.test.ts test/e2e-scenario/scenarios/runtime-support.ts test/e2e-scenario/scenarios/run.ts test/e2e-scenario/scenarios/scenarios/baseline.ts test/e2e-scenario/scenarios/assertions/registry.ts test/e2e-scenario/migration/legacy-inventory.json test/e2e-scenario/framework-tests/e2e-phase-runtime.test.ts test/e2e-scenario/framework-tests/e2e-live-registry-discovery.test.ts test/e2e-scenario/framework-tests/e2e-scenario-matrix.test.ts test/e2e-scenario/framework-tests/e2e-scenario-registry.test.ts --skip test-cli
  • npx prek run --files test/e2e-scenario/migration/legacy-inventory.json --skip test-cli
  • git diff --check

Note: the full pre-commit hook's test-cli step still fails locally in test/release-latest-tag.test.ts because this machine's global Git config enables SSH commit signing but the private signing key is unavailable. The focused E2E framework suite and CLI typecheck pass.


Signed-off-by: Carlos Villela <cvillela@nvidia.com>

Summary by CodeRabbit

  • Tests

    • Added new E2E runtime suite tests for inference and inference-routing functionality verification
    • Updated existing test scenarios to validate runtime suite execution and fixture behavior
  • Chores

    • Extended test framework with runtime suite support and assertion tracking
    • Updated test infrastructure to record and persist runtime suite execution results

jyaunches and others added 8 commits June 9, 2026 12:24
`liveScenarioSupport` previously rejected any scenario that declared an
`environment.lifecycle`, so post-onboard host mutations (reboot, rebuild,
upgrade, drift) could not surface in the live Vitest matrix at all.

Replace the unconditional reject with a `SUPPORTED_LIFECYCLES` whitelist
that starts with the single profile the upcoming post-reboot-recovery
fixture dispatches: `post-reboot-recovery`. Future profiles must land the
dispatcher branch and an expected-state in the same change set, so the
whitelist stays in lockstep with what the runner can actually execute.
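
A whitelist check of this kind might look like the sketch below. The names `SUPPORTED_LIFECYCLES` and `liveScenarioSupport` appear in the commit message, but the signature, the `ScenarioEnvironment` and `SupportResult` shapes, and the error wording are assumptions made for illustration.

```typescript
// Sketch only: the function signature and result shape are assumed,
// not taken from runtime-support.ts.
const SUPPORTED_LIFECYCLES: ReadonlySet<string> = new Set(["post-reboot-recovery"]);

interface ScenarioEnvironment {
  lifecycle?: string; // absent for scenarios with no post-onboard mutation
}

interface SupportResult {
  supported: boolean;
  reason?: string; // populated only when the scenario is rejected
}

function liveScenarioSupport(env: ScenarioEnvironment): SupportResult {
  // Scenarios without a lifecycle are unaffected by the whitelist.
  if (env.lifecycle === undefined) return { supported: true };
  // Only profiles the runner can actually dispatch are accepted.
  if (SUPPORTED_LIFECYCLES.has(env.lifecycle)) return { supported: true };
  return { supported: false, reason: `unsupported lifecycle: ${env.lifecycle}` };
}
```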

Prepares the runner for #4423's failing-test-first guard, which needs a
post-reboot lifecycle scenario to demonstrate registry preservation +
Docker-backed sandbox recovery on Linux/Spark Docker-driver hosts.

Refs #4423

Adds two host-side state-validation probes the live runner needs to
express the regression target tracked by #4423:

  * `local-registry-entry-present` reads `~/.nemoclaw/sandboxes.json`
    and asserts the scenario's sandbox name is still recorded. This is
    deliberately orthogonal to `sandbox.expected`: post-reboot bugs
    can wipe the local registry while the live OpenShell gateway is
    healthy, and only a host-side probe catches the data-loss
    regression.

  * `docker-sandbox-container-present` runs
    `docker ps -a --filter label=openshell.ai/sandbox-name=<name>` and
    accepts running, stopped, or `*-nemoclaw-gpu-backup-*` sibling
    containers. The label filter mirrors `OPENSHELL_SANDBOX_NAME_LABEL`
    used by `findOpenShellDockerSandboxContainerIds` in
    `src/lib/onboard/docker-gpu-patch.ts`, so the probe stays in lock-
    step with how OpenShell labels containers today.
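
The container probe can be sketched as two pure helpers: one builds the `docker ps` arguments, the other decides presence from the command's output. The helper names and the `--format "{{.Names}}"` flag are assumptions added for this example; the label key and the `-a --filter label=...` form come from the text above.

```typescript
// Label key as quoted in the commit message; the helpers are hypothetical.
const OPENSHELL_SANDBOX_NAME_LABEL = "openshell.ai/sandbox-name";

// Builds the argument list for `docker ps`. `-a` includes stopped
// containers, so exited and renamed backup siblings are listed too.
function dockerPsArgs(sandboxName: string): string[] {
  return [
    "ps",
    "-a",
    "--filter",
    `label=${OPENSHELL_SANDBOX_NAME_LABEL}=${sandboxName}`,
    "--format",
    "{{.Names}}", // assumption: print one container name per line
  ];
}

// Any non-empty output line means at least one labeled container exists,
// regardless of whether it is running, stopped, or a backup sibling.
function sandboxContainerPresent(psOutput: string): boolean {
  return psOutput.split("\n").some((line) => line.trim().length > 0);
}
```

Keeping the parsing separate from process execution lets a unit test feed in canned output instead of requiring a Docker daemon.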

Probe wiring:

  * `StateProbeId` extended with the two new probe ids.
  * `ExpectedState` gains `localRegistry` and `dockerSandboxContainer`
    optional dimensions; `probesForState` emits the new probes only
    for `expected: "present"`. Negative-direction probes are
    intentionally omitted today and pinned by a probesForState test.
  * `StateValidationPhaseFixture.from()` now accepts either an
    expected-state ID or an inline `ExpectedState`, so unit tests can
    drive new probes without registering synthetic states in the
    typed registry. The live runner still calls `from(id, instance)`.
  * Fixture takes an optional `ProbeIO` injection so tests can stub
    the registry reader without touching `~/.nemoclaw`.

No callers of the existing typed registry are affected: every shipped
expected-state leaves `localRegistry` and `dockerSandboxContainer`
unset, so `probesForState` returns the same probe lists as before.
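
The backward-compatibility argument above can be illustrated with a reduced model of the two new dimensions. This is a sketch, not the repository's `probesForState`: the `hostProbesForState` helper, the `ExpectedHostState` type, and the `{ expected: ... }` field shape are assumptions, and only the two probe ids are taken from the commit message.

```typescript
// Probe ids from the commit message; everything else here is hypothetical.
type HostProbeId =
  | "local-registry-entry-present"
  | "docker-sandbox-container-present";

type Presence = "present" | "absent";

// Reduced model: only the two optional dimensions added by this commit.
interface ExpectedHostState {
  localRegistry?: { expected: Presence };
  dockerSandboxContainer?: { expected: Presence };
}

function hostProbesForState(state: ExpectedHostState): HostProbeId[] {
  const probes: HostProbeId[] = [];
  // A probe is emitted only for the positive direction; an unset or
  // "absent" dimension contributes nothing, so older states are unchanged.
  if (state.localRegistry?.expected === "present") {
    probes.push("local-registry-entry-present");
  }
  if (state.dockerSandboxContainer?.expected === "present") {
    probes.push("docker-sandbox-container-present");
  }
  return probes;
}
```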

Refs #4423

Adds a Vitest phase fixture that mutates host state between onboarding
and state-validation, so live scenarios can express post-onboard
invariants the legacy bash runner has no equivalent for.

`LifecyclePhaseFixture.simulate("post-reboot-recovery", instance, opts)`
reproduces the host-side conditions of a DGX Spark / Linux Docker-driver
reboot in two modes:

  * `stop-original` (default)   — `openshell gateway stop` + `docker
                                   stop` of the labeled sandbox
                                   container. Models the common reboot
                                   outcome where OpenShell forgets the
                                   sandbox while Docker keeps the
                                   container exited but labeled.

  * `rename-to-gpu-backup`      — additionally `docker rename`s the
                                   container to a
                                   `*-nemoclaw-gpu-backup-<ts>` sibling,
                                   mirroring the GPU-patch reboot path in
                                   `src/lib/onboard/docker-gpu-patch.ts`.

Both modes register cleanups (in reverse order) to restore the
container so test teardown leaves Docker in a usable state.
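
Reverse-order cleanup is the usual last-in, first-out pattern: each mutation registers its own undo step, and teardown unwinds them from newest to oldest. The sketch below is illustrative only; `CleanupRegistry`, `simulateRenameToGpuBackup`, and the log strings are hypothetical stand-ins for the real fixture and its Docker calls.

```typescript
type Cleanup = () => void;

// Hypothetical LIFO cleanup queue; not the framework's actual registry.
class CleanupRegistry {
  private readonly stack: Cleanup[] = [];

  register(cleanup: Cleanup): void {
    this.stack.push(cleanup);
  }

  // Runs cleanups newest-first so each undo sees the state it expects.
  runAll(): void {
    for (
      let cleanup = this.stack.pop();
      cleanup !== undefined;
      cleanup = this.stack.pop()
    ) {
      cleanup();
    }
  }
}

// Stand-in for the rename mode: log entries replace real Docker commands.
function simulateRenameToGpuBackup(cleanups: CleanupRegistry, log: string[]): void {
  log.push("stop container");
  cleanups.register(() => log.push("start container"));
  log.push("rename container to backup");
  cleanups.register(() => log.push("rename container back"));
}
```

With this ordering the rename is undone before the container is started again, so the restart targets the container under its original name.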

Wiring:

  * `framework/phases/index.ts` re-exports the fixture and types.
  * `framework/e2e-test.ts` registers a `lifecycle` Vitest fixture on
    `E2EScenarioFixtures`, wired with the shared `host`, `sandbox`,
    and `cleanup` registries.
  * `live/registry-scenarios.test.ts` invokes
    `lifecycle.simulate(profile, instance)` between `onboard.from(...)`
    and `stateValidation.from(...)` whenever the scenario declares a
    whitelisted `environment.lifecycle`. Scenarios that omit lifecycle
    are unaffected. A scenario whose lifecycle is whitelisted by
    `runtime-support.ts` but NOT dispatched by the fixture fails fast
    with a clear error so the whitelist and dispatcher stay in lock-
    step.

Coverage in `e2e-phase-lifecycle.test.ts` exercises both modes,
gateway-stop tolerance, the no-labeled-container failure case, the
docker-discover failure case, the unsupported-profile rejection,
the cleanup queue order, and `buildBackupContainerName` truncation.

The fixture is intentionally narrow on profiles: only
`post-reboot-recovery` is dispatched today. Adding rebuild, upgrade,
or drift profiles is a separate, equally narrow change set that must
land the dispatcher branch and `SUPPORTED_LIFECYCLES` whitelist
together.

Refs #4423

Registers the failing-test-first guard for #4423 in the typed scenario
registry so the live Vitest matrix from #5006 fans it out as a
dedicated CI job. Builds on the framework primitives added earlier in
this PR (lifecycle phase fixture, host-side probes, lifecycle whitelist).

Additions:

  * `post-reboot-recovery-ready` expected-state in
    `scenarios/expected-states.ts` declaring the user-visible
    invariants that must hold after a `nemoclaw <name> status` call
    on a freshly-rebooted DGX Spark / Linux Docker-driver host:
      - cli installed,
      - gateway healthy (the user-systemd unit from #4580 brings it
        back up before status runs),
      - sandbox running (recovery completed in time),
      - localRegistry entry preserved (the user-visible regression
        target — destroyed on unfixed `main`),
      - dockerSandboxContainer present (recovery didn't delete the
        labeled container or its `*-nemoclaw-gpu-backup-*` sibling).

  * `ubuntu-repo-docker-post-reboot-recovery` scenario in
    `scenarios/scenarios/baseline.ts` wiring
    `ubuntuRepoDockerLifecycle("cloud-openclaw", "post-reboot-recovery")`
    against the new expected-state and a smoke suite. Carries a
    description that explains the RED/GREEN contract and points to the
    PR-A fix landing in `src/lib/`.

  * `manifests/openclaw-nvidia-post-reboot-recovery.yaml` declares
    `lifecycle: post-reboot-recovery` and the same NVIDIA_API_KEY
    credential ref the cloud-openclaw scenarios use.

  * `.github/workflows/e2e-scenarios.yaml` ROUTES table gains the new
    scenario so the workflow-boundary test
    (`e2e-scenarios-workflow.test.ts`) routes every typed id.

Test pinning:

  * `e2e-scenario-matrix.test.ts` updated from a 1-entry to a 2-entry
    live matrix expectation. The new entry asserts on
    `expectedStateId: "post-reboot-recovery-ready"` so a future
    accidental dropped-lifecycle change to the scenario regresses
    loudly.

  * `e2e-live-registry-discovery.test.ts` swaps the synthetic
    whitelist-coverage test for an assertion against the real
    `ubuntu-repo-docker-post-reboot-recovery` registry entry.

Behavior:

  * On unfixed `main`, the live runner's lifecycle phase stops the
    OpenShell gateway runtime and `docker stop`s the labeled sandbox
    container. State-validation then runs `nemoclaw <name> status`
    (which restarts the gateway via systemd) and the destructive
    `missing` branch in `src/lib/actions/sandbox/status.ts` wipes the
    local registry entry. The `local-registry-entry-present` probe
    fails. Scenario goes RED.

  * On the PR-A fix branch, the new Docker-driver sandbox recovery
    helper restarts the labeled container before stale-removal can
    fire, registry survives, all five probes pass. Scenario flips
    GREEN.

The bash-side legacy compiler emits a
`lifecycle.profile.post-reboot-recovery` PhaseAction pointing at
`nemoclaw_scenarios/lifecycle/dispatch.sh`, but the legacy bash worker
is intentionally not provided: this scenario is Vitest-only. The
typed runner's `LifecyclePhaseFixture` handles dispatch directly. If
the legacy runner is invoked against this scenario it errors out at
the dispatcher; that's the right failure mode while the bash side
stays on its own retirement clock.

Refs #4423

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv self-assigned this Jun 9, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cv cv added the area: e2e, area: inference, and chore labels Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2e0bdae7-e314-45f6-99f0-c2f4e7d6ba5a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

E2E Advisor Recommendation

Required E2E: e2e-vitest-scenarios:ubuntu-repo-cloud-openclaw, e2e-scenarios:ubuntu-repo-cloud-openclaw
Optional E2E: cloud-inference-e2e, inference-routing-e2e

Dispatch hint: scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/codex/e2e-fanout-05-inference-runtime-helpers
Head: HEAD
Confidence: high

Required E2E

  • e2e-vitest-scenarios:ubuntu-repo-cloud-openclaw (medium; live Docker sandbox and NVIDIA_API_KEY required): Exercises the touched live Vitest path end-to-end: onboarding the canonical OpenClaw scenario, running RuntimePhaseFixture.runSuite for inference and inference-routing, and verifying the new runtimeSuites artifact metadata.
  • e2e-scenarios:ubuntu-repo-cloud-openclaw (medium-high; live sandbox with smoke, inference, inference-routing, and credentials suites): Validates the canonical typed scenario runner after adding inference-routing to the baseline scenario and remapping the inference-routing validation suite to routing-specific shell assertions.

Optional E2E

  • cloud-inference-e2e (medium; legacy live E2E with NVIDIA_API_KEY): Useful parity check because the migration inventory marks test/e2e/test-cloud-inference-e2e.sh only partially represented by the new Vitest bridge and notes remaining unmigrated repo-skill and sandbox filesystem validation.
  • inference-routing-e2e (medium; legacy live E2E with NVIDIA_API_KEY): Useful parity check for the legacy routing script because the new bridge intentionally covers only part of credential isolation, negative classification, cleanup, provider-route, and compatible-endpoint behavior.

New E2E recommendations

  • inference-routing (high): The Vitest bridge still lacks migrated coverage for credential isolation, invalid-key/transport negative classification, cleanup, and compatible endpoint behavior called out in the legacy inventory.
    • Suggested test: Add live Vitest runtime-suite coverage for inference-routing negative and credential-isolation cases before retiring test/e2e/test-inference-routing.sh.
  • cloud-inference (medium): The cloud inference legacy script remains only partially bridged; repo-skill and sandbox skill filesystem validation are not covered by the new RuntimePhaseFixture inference suite.
    • Suggested test: Add Vitest assertions under the ubuntu-repo-cloud-openclaw live scenario for repo-skill availability and sandbox skill filesystem validation.

Dispatch hint

  • Workflow: .github/workflows/e2e-vitest-scenarios.yaml
  • jobs input: scenarios=ubuntu-repo-cloud-openclaw

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: e2e-scenarios-all
Optional scenario E2E: None

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/codex/e2e-fanout-05-inference-runtime-helpers
Head: HEAD
Confidence: high

Required scenario E2E

  • e2e-scenarios-all: Changes modify shared scenario framework/runtime fixtures, live registry execution, scenario matrix/runtime-support logic, assertion suite catalog wiring, and canonical scenario metadata. These affect scenario discovery/execution semantics across the suite, so the all-scenarios fan-out is required.
    • Dispatch: gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>

Optional scenario E2E

  • None.

Relevant changed files

  • test/e2e-scenario/framework-tests/e2e-live-registry-discovery.test.ts
  • test/e2e-scenario/framework-tests/e2e-phase-runtime.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-matrix.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-registry.test.ts
  • test/e2e-scenario/framework/phases/index.ts
  • test/e2e-scenario/framework/phases/runtime.ts
  • test/e2e-scenario/live/registry-scenarios.test.ts
  • test/e2e-scenario/migration/legacy-inventory.json
  • test/e2e-scenario/scenarios/assertions/registry.ts
  • test/e2e-scenario/scenarios/run.ts
  • test/e2e-scenario/scenarios/runtime-support.ts
  • test/e2e-scenario/scenarios/scenarios/baseline.ts

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 1 nice ideas
Since last review: 2 prior items resolved, 0 still apply, 1 new item found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • None.

🌱 Nice ideas

  • PR body still says the inference legacy scripts are marked covered: The PR description says `test/e2e/test-cloud-inference-e2e.sh` and `test/e2e/test-inference-routing.sh` are marked `covered`, but the current diff correctly marks both as `bridge-probe` with deletionReady false and explicit partial-coverage notes. The code is safer than the description, but the stale wording can confuse acceptance/source-of-truth review.
    • Recommendation: Update the PR body bullet to say these legacy scripts are marked `bridge-probe` or partially represented, not `covered`.
    • Evidence: PR body: “Marked `test/e2e/test-cloud-inference-e2e.sh` and `test/e2e/test-inference-routing.sh` as covered...”. Current `legacy-inventory.json` rows set both scripts to `status: "bridge-probe"`, include bridge probes, keep `deletionReady: false`, and list the remaining migration gaps.
Consider writing more tests for

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

jyaunches and others added 13 commits June 9, 2026 14:22
Prek hook auto-fixed formatting in 6 files added/touched by this PR.
No behavior change.
The biome-format commit accidentally added a node_modules symlink
alongside the formatting fixes. Remove it; the directory is already
in .gitignore.
…nventory-internals

# Conflicts:
#	test/e2e-scenario/framework-tests/e2e-phase-lifecycle.test.ts
#	test/e2e-scenario/framework/phases/lifecycle.ts
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
cv added 2 commits June 9, 2026 16:39
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cv cv marked this pull request as ready for review June 9, 2026 23:51
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Base automatically changed from codex/e2e-fanout-05-inference-runtime-helpers to main June 10, 2026 06:00
@cv

cv commented Jun 10, 2026

Collaborator Author

Cross-linking the post-#5106 migration path from #5098 before this stack is rebased/salvaged: #5098 (comment)

Please treat any work salvaged from this branch as targeting the single Vitest E2E system. No second runner, no revived scenario framework, and no long-lived bash/Vitest split. Shell/system boundaries are fine, but they should be invoked from Vitest tests/fixtures when migrated.

Also: avoid using legacy-inventory.json as the migration source of truth. Replacement evidence should live in the PR and linked issue, with at most a lightweight deletion guard in the repo.

@cv

cv commented Jun 10, 2026

Collaborator Author

Closing as superseded by #5106 and the post-#5098 one-E2E migration plan.

The useful inference/runtime fixture work is still worth salvaging, but it should come back as a fresh, smaller PR from current main: Vitest as the only harness, GitHub Actions as the matrix, no revived runner path, and replacement evidence carried in the PR/linked issue rather than expanded legacy-inventory.json metadata.

@cv cv closed this Jun 10, 2026

Labels

area: e2e (End-to-end tests, nightly failures, or validation infrastructure); area: inference (Inference routing, serving, model selection, or outputs); chore (Build, CI, dependency, or tooling maintenance)

3 participants