Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 62 additions & 0 deletions bin/lib/onboard.js
Original file line number Diff line number Diff line change
Expand Up @@ -415,6 +415,66 @@ async function preflight() {
return gpu;
}


// ── TLS Fix: inject missing secrets on WSL2/Linux ────────────────
function injectTlsSecrets() {
const { execSync } = require("child_process");
console.log(" Injecting TLS secrets for OpenShell gateway...");
const tmpDir = "/tmp/openshell-tls-fix";
fs.mkdirSync(tmpDir, { recursive: true });

const ext = [
"[req]",
"req_extensions = v3_req",
"distinguished_name = req_distinguished_name",
"[req_distinguished_name]",
"[v3_req]",
"basicConstraints = CA:FALSE",
"keyUsage = nonRepudiation, digitalSignature, keyEncipherment",
"subjectAltName = @alt_names",
"[alt_names]",
"DNS.1 = openshell",
"DNS.2 = openshell.openshell.svc",
"DNS.3 = openshell.openshell.svc.cluster.local",
"DNS.4 = localhost",
"IP.1 = 127.0.0.1",
"[v3_ca]",
"subjectKeyIdentifier = hash",
"authorityKeyIdentifier = keyid:always,issuer",
"basicConstraints = CA:TRUE",
"keyUsage = cRLSign, keyCertSign",
].join("\n");
fs.writeFileSync(tmpDir + "/v3.ext", ext);
Comment on lines +423 to +447
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check the file structure and locate the file
git ls-files bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 77


🏁 Script executed:

# Read the relevant section with context
head -n 500 bin/lib/onboard.js | tail -n 100

Repository: NVIDIA/NemoClaw

Length of output: 4718


🏁 Script executed:

# Get a more precise view of the function around lines 420-480
sed -n '410,490p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 4349


🏁 Script executed:

# Check if tmpDir or /tmp/openshell-tls-fix is referenced elsewhere in the file
rg "openshell-tls-fix|tmpDir" bin/lib/onboard.js -n

Repository: NVIDIA/NemoClaw

Length of output: 1243


🏁 Script executed:

# Check imports at the top of the file to see what modules are available
head -n 50 bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2131


Remove private keys from /tmp after use and prevent concurrent invocation collisions.

Line 423 creates a shared fixed temp directory that persists across invocations, leaving generated private keys (ca.key, server.key) in /tmp indefinitely. This enables stale key material to accumulate and allows concurrent runs to collide and overwrite each other's key material.

Suggested fix
 function injectTlsSecrets() {
   const { execSync } = require("child_process");
   console.log("  Injecting TLS secrets for OpenShell gateway...");
-  const tmpDir = "/tmp/openshell-tls-fix";
-  fs.mkdirSync(tmpDir, { recursive: true });
+  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "openshell-tls-"));
+  fs.chmodSync(tmpDir, 0o700);
@@
-  fs.writeFileSync(tmpDir + "/v3.ext", ext);
+  fs.writeFileSync(path.join(tmpDir, "v3.ext"), ext, { mode: 0o600 });
@@
   } catch (e) {
     console.error("  !! TLS injection failed:", e.message);
+  } finally {
+    fs.rmSync(tmpDir, { recursive: true, force: true });
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 423 - 447, The tmpDir is fixed (tmpDir =
"/tmp/openshell-tls-fix") causing stale private keys and race conditions; change
to create a unique temp directory (use fs.mkdtemp or append a secure random
suffix to tmpDir) for each invocation, write v3.ext and generated private keys
(ca.key, server.key) into that per-run directory instead of a shared path, and
ensure you securely remove private keys and the temp directory after use
(unlink/remove) to prevent accumulation and collisions; update any code
referencing tmpDir (the tmpDir const and places writing v3.ext, ca.key,
server.key) to use the new per-run directory and add try/finally or equivalent
cleanup to guarantee deletion on success or error.


try {
execSync("openssl req -x509 -newkey rsa:4096 -keyout " + tmpDir + "/ca.key -out " + tmpDir + "/ca.crt -days 365 -nodes -subj '/CN=openshell-ca' -extensions v3_ca -config " + tmpDir + "/v3.ext 2>/dev/null");
execSync("openssl req -newkey rsa:4096 -keyout " + tmpDir + "/server.key -out " + tmpDir + "/server.csr -nodes -subj '/CN=openshell' 2>/dev/null");
execSync("openssl x509 -req -in " + tmpDir + "/server.csr -CA " + tmpDir + "/ca.crt -CAkey " + tmpDir + "/ca.key -CAcreateserial -out " + tmpDir + "/server.crt -days 365 -extensions v3_req -extfile " + tmpDir + "/v3.ext 2>/dev/null");

console.log(" Waiting for k3s namespace...");
const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim();
for (let i = 0; i < 30; i++) {
try {
execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null");
console.log(" ✓ k3s namespace ready");
break;
} catch {
require("child_process").spawnSync("sleep", ["3"]);
}
}
Comment on lines +454 to +464
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 82


🏁 Script executed:

sed -n '440,480p' bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 2339


Fail fast if the openshell namespace never becomes ready.

The readiness loop (lines 454-464) waits up to 90 seconds but continues regardless of success. If readiness never occurs, subsequent docker cp and kubectl create secret operations will fail with unclear errors. Add an explicit gate that throws an error immediately if the namespace doesn't become ready within the timeout.

Suggested fix
     console.log("  Waiting for k3s namespace...");
     const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim();
+    let namespaceReady = false;
     for (let i = 0; i < 30; i++) {
       try {
         execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null");
         console.log("  ✓ k3s namespace ready");
+        namespaceReady = true;
         break;
       } catch {
         require("child_process").spawnSync("sleep", ["3"]);
       }
     }
+    if (!namespaceReady) {
+      throw new Error("k3s namespace 'openshell' did not become ready within 90s");
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 454 - 464, The readiness loop that runs
"kubectl get namespace openshell" against the container variable may never
succeed but currently continues silently; update the loop in onboard.js (the
block using execSync, container and the for (let i = 0; i < 30; i++) { ... }
loop) to track success (e.g., set a ready boolean when the execSync succeeds)
and after the loop throw an Error or call process.exit(1) with a clear message
if ready is false (e.g., "openshell namespace did not become ready within
timeout") so downstream docker cp / kubectl commands fail fast with a clear
failure reason.


execSync("docker cp " + tmpDir + "/server.crt " + container + ":/tmp/server.crt");
execSync("docker cp " + tmpDir + "/server.key " + container + ":/tmp/server.key");
execSync("docker cp " + tmpDir + "/ca.crt " + container + ":/tmp/ca.crt");
execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true");
execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
Comment on lines +469 to +471
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file exists and read the relevant section
wc -l bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 82


🏁 Script executed:

# Read lines around 469-471 with context
sed -n '450,490p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2862


🏁 Script executed:

# Search for cert-manager references to understand if it's used in this project
rg -i "cert-manager" --type js --type yaml --type json

Repository: NVIDIA/NemoClaw

Length of output: 41


🏁 Script executed:

# Search for other TLS secret creation patterns
rg "kubectl.*secret" -A 2 -B 2

Repository: NVIDIA/NemoClaw

Length of output: 868


🏁 Script executed:

# Look at the broader context of the onboard.js file to understand the purpose
head -100 bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 4128


🏁 Script executed:

# Read more context around these lines to see the function name and purpose
sed -n '440,475p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2483


🏁 Script executed:

# Get the function name and surrounding context to confirm this is onboarding
sed -n '400,445p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 1942


🏁 Script executed:

# Search for where injectTlsSecrets is called
rg "injectTlsSecrets" -B 3 -A 3

Repository: NVIDIA/NemoClaw

Length of output: 689


Avoid unconditional delete/recreate of TLS secrets.

Lines 469-471 delete then recreate secrets without checking if they exist first. If the delete succeeds but creation fails, the secrets are left missing, creating a transient failure window. Check for secret existence before attempting creation.

Suggested fix
-    execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true");
-    execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
-    execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
+    const tlsSecret = execSync(
+      "docker exec " + container + " kubectl get secret openshell-server-tls -n openshell --ignore-not-found -o name"
+    ).toString().trim();
+    const caSecret = execSync(
+      "docker exec " + container + " kubectl get secret openshell-server-client-ca -n openshell --ignore-not-found -o name"
+    ).toString().trim();
+
+    if (!tlsSecret) {
+      execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
+    }
+    if (!caSecret) {
+      execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 469 - 471, The current sequence
unconditionally deletes then recreates TLS secrets using the three execSync
calls, leaving a race where creation failure removes secrets; change the logic
for the execSync invocations that reference the container variable so you first
check for secret existence and only create if missing (e.g., run "kubectl get
secret openshell-server-tls -n openshell" and if it fails run the create-tls
command, and similarly check for openshell-server-client-ca before creating it)
instead of deleting up-front; also remove the unconditional kubectl delete call
and ensure execSync errors from creation are propagated or logged so failures do
not silently leave secrets absent.

console.log(" ✓ TLS secrets injected");
} catch (e) {
console.error(" !! TLS injection failed:", e.message);
}
Comment on lines +473 to +475
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '460,520p'

Repository: NVIDIA/NemoClaw

Length of output: 3045


🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '420,476p'

Repository: NVIDIA/NemoClaw

Length of output: 3325


🏁 Script executed:

rg "injectTlsSecrets" bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 110


🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '511,525p'

Repository: NVIDIA/NemoClaw

Length of output: 647


Propagate TLS injection failures to fail immediately instead of triggering delayed health check timeouts.

The injectTlsSecrets() function at lines 473-475 catches errors and only logs them, allowing execution to continue silently. When called unconditionally at line 508, a failed injection causes the subsequent health check loop (lines 511-522) to retry 5 times with 2-second delays, adding ~8-10 seconds of confusion before the final timeout failure. Since TLS secrets are critical for the gateway to function, injection failures should exit immediately with a clear error message rather than masking the root cause behind a retry loop.

Suggested fix
   } catch (e) {
-    console.error("  !! TLS injection failed:", e.message);
+    throw new Error(`TLS injection failed: ${e.message}`);
   }
 }
@@
-  injectTlsSecrets();
+  try {
+    injectTlsSecrets();
+  } catch (error) {
+    console.error(`  ${error.message}`);
+    console.error("  Required OpenShell TLS secrets were not created. Aborting onboarding.");
+    process.exit(1);
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 473 - 475, The catch block in
injectTlsSecrets currently swallows errors (console.error only), allowing
execution to continue and causing the health-check retry delay; update the catch
in injectTlsSecrets (the function named injectTlsSecrets and its call site where
it runs unconditionally) to propagate failure immediately by either rethrowing
the caught error or exiting with a non-zero status after logging (e.g.,
process.exit(1)), so that TLS injection failures abort onboarding immediately
with a clear error message rather than letting the health-check loop mask the
root cause.

}

// ── Step 2: Gateway ──────────────────────────────────────────────

async function startGateway(gpu) {
Expand Down Expand Up @@ -445,6 +505,8 @@ async function startGateway(gpu) {
env: gatewayEnv,
});

injectTlsSecrets();

// Verify health
for (let i = 0; i < 5; i++) {
const status = runCapture("openshell status 2>&1", { ignoreError: true });
Expand Down