fix(onboard): resolve OpenShell binaries from Apple Silicon Homebrew prefix #5461

Open
yimoj wants to merge 1 commit into NVIDIA:main from yimoj:fix/5334-openshell-homebrew-gateway

Conversation

@yimoj yimoj commented Jun 15, 2026

Contributor

Summary

On Apple Silicon macOS, Homebrew installs OpenShell under /opt/homebrew/bin, which is frequently missing from the non-interactive onboarding shell's PATH. NemoClaw's binary resolvers did not probe that prefix, so onboarding reported openshell CLI not found (and could fail the Docker-driver gateway step) even though OpenShell was installed. This adds /opt/homebrew/bin to the CLI, gateway, and sandbox resolvers and clarifies the gateway start-failure diagnostic.

Related Issue

Fixes #5334

Changes

  • src/lib/adapters/openshell/resolve.ts: add /opt/homebrew/bin/openshell to the CLI resolver fallbacks (Apple Silicon Homebrew prefix).
  • src/lib/onboard/docker-driver-gateway-runtime.ts: add /opt/homebrew/bin/openshell-gateway and /opt/homebrew/bin/openshell-sandbox to the standalone gateway/sandbox fallbacks, keeping all three resolvers coherent.
  • src/lib/onboard/docker-driver-gateway-failure.ts: distinguish a crashed gateway from one that did not become healthy in time (state-agnostic wording), and add openshell status / openshell gateway info to the troubleshooting footer so the real cause is discoverable instead of only "failed to start".
  • Tests for the /opt/homebrew/bin fallback (CLI + gateway + sandbox) and the gateway start-failure diagnostic.
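The fallback behaviour described in the list above can be sketched roughly as follows. This is an illustration only, not the project's code: `resolveBinary`, `IsExecutable`, and `isExecutableOnDisk` are hypothetical names, and the position of `/opt/homebrew/bin` among the other fallbacks is an assumption (the PR only states that the prefix was added). The existence check is injectable so a test can simulate a machine where the binary lives only under the Homebrew prefix.

```typescript
import { accessSync, constants, statSync } from "node:fs";
import { homedir } from "node:os";
import { delimiter, join } from "node:path";

// Hypothetical predicate type: true when the candidate can be run as a binary.
type IsExecutable = (candidate: string) => boolean;

// Default check: the path passes an X_OK access check and is a regular file.
const isExecutableOnDisk: IsExecutable = (candidate) => {
  try {
    accessSync(candidate, constants.X_OK);
    return statSync(candidate).isFile();
  } catch {
    return false;
  }
};

// Hypothetical resolver sketch: try PATH entries first, then fixed fallbacks.
export function resolveBinary(
  name: string,
  pathEnv: string = process.env.PATH ?? "",
  isExecutable: IsExecutable = isExecutableOnDisk,
  home: string = homedir(),
): string | null {
  const fromPath = pathEnv
    .split(delimiter)
    .filter((dir) => dir.length > 0)
    .map((dir) => join(dir, name));

  // Fallback order is an assumption made for this sketch.
  const fallbacks = [
    join(home, ".local", "bin", name),
    `/opt/homebrew/bin/${name}`, // Apple Silicon Homebrew prefix
    `/usr/local/bin/${name}`,
    `/usr/bin/${name}`,
  ];

  return [...fromPath, ...fallbacks].find(isExecutable) ?? null;
}
```

With an injected predicate that accepts only `/opt/homebrew/bin/openshell`, a call such as `resolveBinary("openshell", "/usr/bin:/bin", predicate, "/Users/test")` returns the Homebrew path even though that directory is not on the supplied PATH.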

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • Git hooks passed during commit and push
  • Targeted tests pass for changed behavior (resolve.test.ts, docker-driver-gateway-runtime.test.ts, docker-driver-gateway-failure.test.ts — 31 tests)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed

src/lib/onboard.ts is untouched (net-neutral, no growth-guardrail impact).

End-to-end on the reporter's platform (real worktree CLI)

Run on Apple Silicon macOS 26.5.1 (arm64), Docker via colima, against the built worktree CLI (node ./bin/nemoclaw.js). The reporter's condition was recreated exactly: OpenShell 0.0.44 installed only at /opt/homebrew/bin, with that prefix absent from the onboarding shell's PATH (so command -v openshell fails, as in a non-interactive Homebrew shell).

Before the fix — onboarding cannot find the installed OpenShell:

$ export PATH="$HOME/repro-bin:/usr/bin:/bin:/usr/sbin:/sbin"   # /opt/homebrew/bin intentionally NOT on PATH
$ command -v openshell
# (not found — openshell lives only at /opt/homebrew/bin)
$ node ./bin/nemoclaw.js onboard --non-interactive --name repro-5334
  [1/8] Preflight checks
  ✓ Container runtime: colima
  openshell CLI not found. Installing...
  [install] Detected macOS (aarch64)
  [install] Installed openshell to /Users/yimoj/.local/bin/openshell (user-local path)

After the fix — identical setup; the resolver finds the Homebrew binary and the gateway step (the one that failed for the reporter) passes:

$ command -v openshell
# (still not found on PATH — unchanged)
$ node ./bin/nemoclaw.js onboard --non-interactive --name repro-5334
  [1/8] Preflight checks
  ✓ Container runtime: colima
  ✓ openshell CLI: openshell 0.0.44
  [2/8] Starting OpenShell gateway
  Starting OpenShell Docker-driver gateway...
  ✓ Docker-driver gateway is healthy
  [3/8] Configuring inference provider
  NVIDIA_INFERENCE_API_KEY (or NEMOCLAW_PROVIDER_KEY) is required for NVIDIA Endpoints in non-interactive mode.

After the fix, onboarding resolves OpenShell directly at /opt/homebrew/bin (no "not found", no redundant reinstall) and reaches ✓ Docker-driver gateway is healthy, advancing past the failing step. It then stops only on a missing NVIDIA_INFERENCE_API_KEY, which is unrelated to this bug.

Note: GitHub-hosted macOS runners (.github/workflows/macos-e2e.yaml) have no Docker, so this Docker-driver flow can only be validated on a Docker-capable Apple Silicon Mac, as done above.


Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai Bot commented Jun 15, 2026

Contributor

Walkthrough

Three openshell binary resolvers (resolveOpenshell, resolveOpenShellGatewayBinary, resolveOpenShellSandboxBinary) each gain /opt/homebrew/bin as an explicit fallback candidate for Apple Silicon. reportDockerDriverGatewayStartFailure adds a branch that emits a dedicated stderr message when the gateway process is still running but never became healthy, and the troubleshooting footer adds openshell status and openshell gateway info commands. Tests cover all new paths.

Changes

Apple Silicon Homebrew Fallbacks and Gateway Failure Detection

| Layer / File(s) | Summary |
| --- | --- |
| /opt/homebrew/bin fallbacks for all three openshell resolvers: src/lib/adapters/openshell/resolve.ts, src/lib/adapters/openshell/resolve.test.ts, src/lib/onboard/docker-driver-gateway-runtime.ts, src/lib/onboard/docker-driver-gateway-runtime.test.ts | resolveOpenshell, resolveOpenShellGatewayBinary, and resolveOpenShellSandboxBinary each add /opt/homebrew/bin/openshell* as an explicit fallback candidate with inline comments explaining the Apple Silicon prefix issue; tests verify Homebrew paths are selected when only they exist on disk. |
| Gateway still-running branch and expanded troubleshooting output: src/lib/onboard/docker-driver-gateway-failure.ts, src/lib/onboard/docker-driver-gateway-failure.test.ts | reportDockerDriverGatewayStartFailure adds an else branch that emits a "still running" stderr message when childExit.exited is false; the troubleshooting footer gains openshell status and openshell gateway info entries; two new tests assert the distinct output for still-running vs already-exited gateway processes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

bug-fix, area: onboarding, v0.0.65

Suggested reviewers

  • cv

Poem

🐰 Hop hop, the Silicon M3 sighs,
/opt/homebrew was hiding from our eyes.
Now gateway checks if it's still alive,
Three resolvers updated, let the bunny thrive!
With status and info printed clear,
Apple Silicon onboarding has nothing to fear. 🍎

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding Apple Silicon Homebrew prefix support to OpenShell binary resolution during onboarding. |
| Linked Issues check | ✅ Passed | The PR successfully addresses issue #5334 by fixing OpenShell binary resolution for Apple Silicon Homebrew prefix and improving gateway failure diagnostics. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the stated objectives: adding /opt/homebrew/bin fallback paths to resolvers and improving gateway failure messaging. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-driver-gateway-failure.ts`:
- Around lines 64-75: The error message in the docker-driver-gateway-failure.ts
function assumes the process is still running whenever childExit.exited is
false, but the caller in src/lib/onboard.ts may have
already determined via isPidAlive() that the process is actually not alive. To
fix this, either pass the actual liveness check result from the call site in
src/lib/onboard.ts as a parameter to the function handling this branch, or
re-check the process liveness immediately before reporting the error message.
Use the actual liveness state to report an accurate message that doesn't claim
the gateway is running when it's actually dead.
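
The distinction this finding draws can be sketched as below. The code is a hypothetical illustration, not taken from the repository: `isPidAlive` here is a stand-in for the helper of the same name mentioned above (the real one may be implemented differently), and `classifyGateway` and `GatewayState` are invented for the example. It relies on the documented Node.js behaviour that `process.kill(pid, 0)` tests for a process without sending a signal.

```typescript
// Hypothetical stand-in for the isPidAlive() helper named in the review.
function isPidAlive(pid: number): boolean {
  try {
    // Signal 0 only checks that the process exists and can be signalled.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it. Anything else
    // (typically ESRCH) is treated as "no such process".
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Hypothetical classification of the three states discussed in the finding.
type GatewayState = "exited" | "dead-exit-not-observed" | "alive-unhealthy";

function classifyGateway(
  exitObserved: boolean,
  pid: number,
  alive: (pid: number) => boolean = isPidAlive,
): GatewayState {
  if (exitObserved) return "exited";
  // No exit event yet: only a fresh liveness check can tell the two cases apart.
  return alive(pid) ? "alive-unhealthy" : "dead-exit-not-observed";
}
```

The middle state is the one the finding is about: the exit event has not fired, yet the process is already gone, so a "still running" message would be wrong.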
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 86989a3e-1378-48de-a1d9-e54a2cb89532

📥 Commits

Reviewing files that changed from the base of the PR and between f4f3c58 and 8115ed5.

📒 Files selected for processing (6)
  • src/lib/adapters/openshell/resolve.test.ts
  • src/lib/adapters/openshell/resolve.ts
  • src/lib/onboard/docker-driver-gateway-failure.test.ts
  • src/lib/onboard/docker-driver-gateway-failure.ts
  • src/lib/onboard/docker-driver-gateway-runtime.test.ts
  • src/lib/onboard/docker-driver-gateway-runtime.ts

Comment thread src/lib/onboard/docker-driver-gateway-failure.ts Outdated
@yimoj yimoj force-pushed the fix/5334-openshell-homebrew-gateway branch from 8115ed5 to b78dcb7 Compare June 15, 2026 10:50
…prefix

On Apple Silicon macOS, Homebrew installs OpenShell under /opt/homebrew/bin,
which is frequently absent from the non-interactive onboarding shell's PATH
(Homebrew only adds it via `brew shellenv`). The CLI resolver
(resolveOpenshell) and the standalone gateway/sandbox resolvers fell back to
~/.local/bin, /usr/local/bin and /usr/bin only, so onboarding reported
'openshell CLI not found' — and could fail the Docker-driver gateway step —
even though OpenShell was installed (NVIDIA#5334).

- Add /opt/homebrew/bin to the CLI, gateway, and sandbox binary fallbacks so
  all three resolvers stay coherent.
- In the Docker-driver gateway start-failure report, distinguish a crashed
  gateway from one that is still running but never became healthy, and add
  `openshell status` / `openshell gateway info` to the troubleshooting
  footer so the actual cause is discoverable.

Validated end-to-end on Apple Silicon macOS 26.5.1 (arm64, Docker via colima)
with OpenShell installed only at /opt/homebrew/bin and off PATH: pre-fix
`nemoclaw onboard` printed 'openshell CLI not found' and reinstalled; post-fix
it resolved the binary and reached '✓ Docker-driver gateway is healthy'.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/5334-openshell-homebrew-gateway branch from b78dcb7 to 8b0a10d Compare June 15, 2026 10:56
yimoj commented Jun 15, 2026

Contributor Author

@coderabbitai Addressed in 8b0a10da9. You're right that !childExit.exited can't be equated with "still running" — the start loop also breaks on !isPidAlive(childPid) before the exit event fires, and reaches the reporter when the poll budget is exhausted.

Rather than thread a processStillAlive boolean from the call site (the call site is in src/lib/onboard.ts, which a CI guardrail requires to stay net-neutral — adding a parameter there would fail codebase-growth-guardrails), I dropped the liveness assertion entirely. The else-branch now states only the observable fact:

The gateway process did not become healthy within the timeout.

which is accurate whether the process is alive-but-unhealthy or already dead-but-exit-not-yet-observed. The openshell status / openshell gateway info troubleshooting lines let the user see the real state. Tests updated to assert the message no longer claims "still running".
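
A minimal sketch of the state-agnostic wording described in this comment, under stated assumptions: `ChildExit` and `formatGatewayStartFailure` are hypothetical names, the `code` field and the text of the exited branch are invented for illustration, and only the timeout sentence and the two troubleshooting commands come from the discussion above.

```typescript
// Hypothetical shape: the discussion only confirms a childExit.exited flag.
interface ChildExit {
  exited: boolean;
  code?: number | null; // assumed field, for illustration only
}

// Builds the failure lines without asserting whether the process is alive.
function formatGatewayStartFailure(childExit: ChildExit): string[] {
  const lines: string[] = [];
  if (childExit.exited) {
    // Wording of this branch is an assumption made for the sketch.
    lines.push(
      `The gateway process exited before becoming healthy (exit code: ${childExit.code ?? "unknown"}).`,
    );
  } else {
    // Observable fact only: true whether the process is alive but unhealthy
    // or already dead with the exit event not yet observed.
    lines.push("The gateway process did not become healthy within the timeout.");
  }
  lines.push("Troubleshooting:", "  openshell status", "  openshell gateway info");
  return lines;
}
```

Because the else branch makes no liveness claim, the caller does not need to pass any extra state, which matches the constraint that src/lib/onboard.ts stays net-neutral.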

coderabbitai Bot commented Jun 15, 2026

Contributor

@yimoj Thanks for the detailed explanation. The approach is sound — "The gateway process did not become healthy within the timeout" is a precise, state-agnostic statement that holds true whether the process is alive-but-unhealthy or already dead with the exit event not yet fired. Avoiding the processStillAlive threading given the CI guardrail constraint makes sense, and offloading actual state discovery to openshell status / openshell gateway info is a better user experience anyway. The fix looks good.

@yimoj yimoj added the v0.0.66 Release target label Jun 16, 2026

Labels

v0.0.66 Release target

Development

Successfully merging this pull request may close these issues.

OpenShell Docker-driver gateway fails on Apple Silicon M3 - onboard wizard times out

1 participant