Draft
11 changes: 11 additions & 0 deletions bin/lib/onboard.js
@@ -148,6 +148,17 @@ async function startGateway(gpu) {
// Give DNS a moment to propagate
require("child_process").spawnSync("sleep", ["5"]);

// WSL2 GPU fix — CDI mode + libdxcore.so + node label
if (gpu && gpu.nimCapable && fs.existsSync("/dev/dxg")) {
console.log(" WSL2 detected — applying GPU CDI fixes...");
const fixScript = path.join(ROOT, "wsl2-gpu-fix.sh");
if (fs.existsSync(fixScript)) {
run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
Comment on lines +153 to +156 (Contributor)
⚠️ Potential issue | 🟠 Major

Surface failures from the WSL2 fix helper.

The legacy setup path fails fast on this helper, but onboarding ignores a non-zero exit here and keeps going. That makes later WSL2 GPU failures look unrelated to the actual root cause.

Suggested fix
     if (fs.existsSync(fixScript)) {
-      run(`bash "${fixScript}" nemoclaw`, { ignoreError: true });
+      try {
+        run(`bash "${fixScript}" nemoclaw`, { ignoreError: false });
+      } catch {
+        console.log("  Warning: WSL2 GPU fix failed; GPU sandbox creation may fail on WSL2.");
+      }
     } else {
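For readers porting the same guard to a shell setup path, here is a minimal, hypothetical sketch of the warn-but-continue pattern the suggestion implements (the `wsl2_fix` stub stands in for the real helper call and is not part of the PR):

```shell
# Hypothetical stand-in for `bash wsl2-gpu-fix.sh nemoclaw`; it always fails
# here so the warning branch is exercised.
wsl2_fix() { return 1; }

# Surface the failure as a visible warning instead of swallowing it.
if wsl2_fix; then
    msg="WSL2 GPU fixes applied"
else
    msg="Warning: WSL2 GPU fix failed; GPU sandbox creation may fail on WSL2."
fi
echo "$msg"
```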
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 153 - 156, the WSL2 GPU fix helper is run
with `ignoreError: true`, so failures are swallowed. Change the
run(`bash "${fixScript}" nemoclaw`, { ignoreError: true }) call to propagate
errors instead of ignoring them: remove ignoreError or set it to false, and
make the caller surface a non-zero exit (throw or return an error) so
onboarding fails fast when the wsl2-gpu-fix.sh (fixScript) step fails.

} else {
console.log(" Warning: wsl2-gpu-fix.sh not found at " + fixScript);
console.log(" GPU sandbox creation may fail on WSL2. See: https://github.com/NVIDIA/OpenShell/issues/404");
}
}
}

// ── Step 3: Sandbox ──────────────────────────────────────────────
38 changes: 27 additions & 11 deletions scripts/setup.sh
@@ -88,6 +88,17 @@ for i in 1 2 3 4 5; do
done
info "Gateway is healthy"

# 1b. WSL2 GPU fix — CDI mode + libdxcore.so + node label
if [ -c /dev/dxg ] && command -v nvidia-smi > /dev/null 2>&1; then
info "WSL2 detected — applying GPU CDI fixes..."
WSL2_FIX="${REPO_DIR}/wsl2-gpu-fix.sh"
if [ -x "$WSL2_FIX" ]; then
bash "$WSL2_FIX" nemoclaw
else
warn "wsl2-gpu-fix.sh not found at $WSL2_FIX — GPU sandbox may fail on WSL2"
fi
Comment on lines +94 to +99 (Contributor)
⚠️ Potential issue | 🟠 Major

Don't require the helper to be executable here.

This block already runs the file with bash, so -x is stricter than needed. On WSL2 checkouts from /mnt/c, execute bits are often not preserved, which would skip the workaround on the exact platform this PR targets.

Suggested fix
-  if [ -x "$WSL2_FIX" ]; then
+  if [ -f "$WSL2_FIX" ]; then
     bash "$WSL2_FIX" nemoclaw
   else
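A quick self-contained demonstration of why `-f` is the safer guard (temp file only, no real checkout involved): a file with no execute bit fails `-x` yet runs fine under `bash`, which is exactly what happens on `/mnt/c` checkouts where mode bits are lost.

```shell
tmp=$(mktemp)
printf 'echo ran-anyway\n' > "$tmp"
chmod 644 "$tmp"               # strip execute bits, as /mnt/c mounts often do

# -x would skip the file; -f finds it, and bash runs it regardless of mode bits.
if [ -x "$tmp" ]; then x_guard=runs; else x_guard=skipped; fi
if [ -f "$tmp" ]; then f_guard=$(bash "$tmp"); else f_guard=skipped; fi
echo "-x guard: $x_guard"
echo "-f guard: $f_guard"
rm -f "$tmp"
```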
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 94 - 99, the script guards the helper with
[ -x "$WSL2_FIX" ], which skips files that lack execute bits even though they
are run via bash. Change the guard to an existence check (e.g., [ -f
"$WSL2_FIX" ] or [ -e "$WSL2_FIX" ]) so the block calls bash "$WSL2_FIX"
nemoclaw whenever the file is present, and otherwise emits the same warn
message.

fi

# 2. CoreDNS fix (Colima only)
if [ -S "$HOME/.colima/default/docker.sock" ]; then
info "Patching CoreDNS for Colima..."
@@ -113,19 +124,24 @@ if curl -s http://localhost:8000/v1/models > /dev/null 2>&1 || python3 -c "impor
"OPENAI_BASE_URL=http://host.openshell.internal:8000/v1"
fi

# 4a. Ollama (macOS local inference)
if [ "$(uname -s)" = "Darwin" ]; then
if ! command -v ollama > /dev/null 2>&1; then
info "Installing Ollama..."
brew install ollama 2>/dev/null || warn "Ollama install failed (brew required). Install manually: https://ollama.com"
# 4a. Ollama (local inference — macOS or Linux)
if command -v ollama > /dev/null 2>&1 || curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
if ! curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
info "Starting Ollama service..."
OLLAMA_HOST=0.0.0.0:11434 ollama serve > /dev/null 2>&1 &
sleep 2
fi
upsert_provider \
"ollama-local" \
"openai" \
"OPENAI_API_KEY=ollama" \
"OPENAI_BASE_URL=http://host.openshell.internal:11434/v1"
elif [ "$(uname -s)" = "Darwin" ] && ! command -v ollama > /dev/null 2>&1; then
info "Installing Ollama..."
brew install ollama 2>/dev/null || warn "Ollama install failed. Install manually: https://ollama.com"
if command -v ollama > /dev/null 2>&1; then
# Start Ollama service if not running
if ! curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
info "Starting Ollama service..."
OLLAMA_HOST=0.0.0.0:11434 ollama serve > /dev/null 2>&1 &
sleep 2
fi
OLLAMA_HOST=0.0.0.0:11434 ollama serve > /dev/null 2>&1 &
sleep 2
upsert_provider \
"ollama-local" \
"openai" \
145 changes: 145 additions & 0 deletions wsl2-gpu-fix.sh
@@ -0,0 +1,145 @@
#!/bin/bash
# wsl2-gpu-fix.sh — Apply WSL2 GPU fixes to an OpenShell gateway
# Run after: openshell gateway start --gpu
# Usage: ./wsl2-gpu-fix.sh [gateway-name]
#
# This script applies the same fixes as the PR (NVIDIA/OpenShell#411)
# at runtime, until the upstream image ships with WSL2 support.

set -euo pipefail

GATEWAY="${1:-nemoclaw}"
echo "Applying WSL2 GPU fixes to gateway '$GATEWAY'..."

# Check gateway is up
if ! openshell status 2>&1 | grep -q "Connected"; then
echo "Error: gateway not connected. Start it first: openshell gateway start --gpu --name $GATEWAY"
exit 1
fi

# Check we're on WSL2
if [ ! -c /dev/dxg ] 2>/dev/null; then
echo "Not WSL2 (/dev/dxg absent) — no fixes needed"
exit 0
fi

echo "[1/4] Generating CDI spec with GPU UUIDs and libdxcore.so..."
openshell doctor exec -- sh -c '
mkdir -p /var/run/cdi

# Gather info
GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)
Comment on lines +31 to +34 (Contributor)
⚠️ Potential issue | 🔴 Critical

Fail fast when libdxcore.so is not discovered.

DXCORE_PATH is written into the CDI mounts without any validation. If discovery returns empty, the generated spec contains blank mount paths and the script still flips the runtime to cdi, which can leave the gateway in a worse state than before.

Suggested fix
 GPU_UUID=$(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader 2>/dev/null | tr -d " " | head -1)
 DXCORE_PATH=$(find /usr/lib -name "libdxcore.so" 2>/dev/null | head -1)
-DXCORE_DIR=$(dirname "$DXCORE_PATH" 2>/dev/null || echo "/usr/lib/x86_64-linux-gnu")
+if [ -z "$DXCORE_PATH" ]; then
+    echo "Error: libdxcore.so not found inside gateway"
+    exit 1
+fi
+DXCORE_DIR=$(dirname "$DXCORE_PATH")
 DRIVER_DIR=$(ls -d /usr/lib/wsl/drivers/nv*.inf_amd64_* 2>/dev/null | head -1)

Also applies to: 83-86
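The fail-fast shape being asked for can be sketched in isolation (the search directory below is deliberately nonexistent so the empty-result branch fires; it is not the real discovery path):

```shell
# Discovery that finds nothing yields an empty string, not an error.
DXCORE_PATH=$(find /nonexistent-lib-dir -name 'libdxcore.so' 2>/dev/null | head -1)

# Validate the result before it is ever templated into the CDI spec.
if [ -z "$DXCORE_PATH" ]; then
    status="aborted: libdxcore.so not found"
else
    status="ok: $DXCORE_PATH"
fi
echo "$status"
```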

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wsl2-gpu-fix.sh` around lines 31 - 34, DXCORE_PATH (and the derived
DXCORE_DIR) is used without validation, which can produce blank CDI mounts and
still switch the runtime. Right after discovery, check that DXCORE_PATH is
non-empty (and readable); if it is not, print a clear error naming what was
searched for and exit non-zero before any CDI mount generation or runtime
change runs (the code that later references DXCORE_DIR/DRIVER_DIR and flips
to cdi must not run). Apply the same validation and early-exit pattern to the
later discovery block around lines 83-86 so the script fails fast instead of
producing invalid mounts.


if [ -z "$DRIVER_DIR" ]; then
echo "Error: no NVIDIA WSL driver store found"
exit 1
fi

# Write complete CDI spec from scratch (avoids fragile sed patching)
cat > /var/run/cdi/nvidia.yaml << CDIEOF
---
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
- name: all
containerEdits:
deviceNodes:
- path: /dev/dxg
- name: "${GPU_UUID}"
containerEdits:
deviceNodes:
- path: /dev/dxg
- name: "0"
containerEdits:
deviceNodes:
- path: /dev/dxg
containerEdits:
env:
- NVIDIA_VISIBLE_DEVICES=void
hooks:
- hookName: createContainer
path: /usr/bin/nvidia-cdi-hook
args:
- nvidia-cdi-hook
- create-symlinks
- --link
- ${DRIVER_DIR}/nvidia-smi::/usr/bin/nvidia-smi
env:
- NVIDIA_CTK_DEBUG=false
- hookName: createContainer
path: /usr/bin/nvidia-cdi-hook
args:
- nvidia-cdi-hook
- update-ldcache
- --folder
- ${DRIVER_DIR}
- --folder
- ${DXCORE_DIR}
env:
- NVIDIA_CTK_DEBUG=false
mounts:
- hostPath: ${DXCORE_PATH}
containerPath: ${DXCORE_PATH}
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/libcuda.so.1.1
containerPath: ${DRIVER_DIR}/libcuda.so.1.1
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/libcuda_loader.so
containerPath: ${DRIVER_DIR}/libcuda_loader.so
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/libnvdxgdmal.so.1
containerPath: ${DRIVER_DIR}/libnvdxgdmal.so.1
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/libnvidia-ml.so.1
containerPath: ${DRIVER_DIR}/libnvidia-ml.so.1
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/libnvidia-ml_loader.so
containerPath: ${DRIVER_DIR}/libnvidia-ml_loader.so
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/libnvidia-ptxjitcompiler.so.1
containerPath: ${DRIVER_DIR}/libnvidia-ptxjitcompiler.so.1
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/nvcubins.bin
containerPath: ${DRIVER_DIR}/nvcubins.bin
options: [ro, nosuid, nodev, rbind, rprivate]
- hostPath: ${DRIVER_DIR}/nvidia-smi
containerPath: ${DRIVER_DIR}/nvidia-smi
options: [ro, nosuid, nodev, rbind, rprivate]
CDIEOF

nvidia-ctk cdi list 2>&1
'

echo "[2/4] Switching nvidia runtime to CDI mode..."
openshell doctor exec -- sed -i 's/mode = "auto"/mode = "cdi"/' /etc/nvidia-container-runtime/config.toml

echo "[3/4] Labeling node with NVIDIA PCI vendor..."
openshell doctor exec -- sh -c '
NODE=$(kubectl get nodes -o jsonpath="{.items[0].metadata.name}")
kubectl label node $NODE feature.node.kubernetes.io/pci-10de.present=true --overwrite
' 2>&1

echo "[4/4] Waiting for nvidia-device-plugin..."
for i in $(seq 1 60); do
GPU=$(openshell doctor exec -- kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}' 2>/dev/null || true)
if [ "$GPU" = "1" ]; then
echo "GPU ready: nvidia.com/gpu=$GPU"
break
fi
[ "$((i % 10))" = "0" ] && echo " still waiting ($i/60)..."
sleep 2
done

if [ "$GPU" != "1" ]; then
echo "Warning: GPU resource not yet advertised after 120s"
echo "Checking device plugin pods..."
openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
exit 1
fi
Comment on lines +125 to +141 (Contributor)
⚠️ Potential issue | 🟠 Major

Treat any positive GPU count as ready.

The success check is hard-coded to "1". On WSL2 hosts that expose 2+ GPUs, this loop will hit the timeout and fail even though nvidia.com/gpu is already advertised.

Suggested fix
-    if [ "$GPU" = "1" ]; then
+    if [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
         echo "GPU ready: nvidia.com/gpu=$GPU"
         break
     fi
@@
-if [ "$GPU" != "1" ]; then
+if ! [[ "$GPU" =~ ^[1-9][0-9]*$ ]]; then
     echo "Warning: GPU resource not yet advertised after 120s"
     echo "Checking device plugin pods..."
     openshell doctor exec -- kubectl -n nvidia-device-plugin get pods 2>&1
     exit 1
 fi
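The suggested regex can be checked on its own; this sketch accepts any positive decimal count and rejects empty, zero, and non-numeric values (the sample inputs are illustrative, not real kubectl output):

```shell
# Ready iff the allocatable count is a positive integer (no leading zeros).
is_gpu_ready() { [[ "$1" =~ ^[1-9][0-9]*$ ]]; }

for v in "" 0 1 2 16 "garbage"; do
    if is_gpu_ready "$v"; then echo "ready: '$v'"; else echo "not ready: '$v'"; fi
done
```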
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wsl2-gpu-fix.sh` around lines 125 - 141, the readiness check treats only
the exact string "1" as ready. Update the loop that assigns GPU (from the
openshell doctor exec -- kubectl command) to consider any positive integer as
ready by using a numeric greater-than-zero comparison instead of string
equality with "1"; keep the existing success path that echoes "GPU ready:
nvidia.com/gpu=$GPU" and breaks, and leave the failure branch unchanged.


echo ""
echo "WSL2 GPU fixes applied successfully."
echo "Sandbox creation with --gpu should now work."