fix: inject TLS secrets after gateway start on fresh install by vishalp-dev24 · Pull Request #217 · NVIDIA/NemoClaw

vishalp-dev24 · 2026-03-17T15:53:59Z

The Helm chart deploys openshell-0 but never creates the two TLS secrets it requires to initialize:

openshell-server-tls
openshell-server-client-ca

On systems without cert-manager pre-installed (all default setups), the pod waits indefinitely causing the onboard wizard to time out.

Added injectTlsSecrets() in bin/lib/onboard.js which:

Generates a self-signed X.509 v3 CA and server cert via openssl
Copies them into the k3s container via docker cp
Creates both secrets via kubectl create secret

Called immediately after openshell gateway start in startGateway().

Tested on: Ubuntu 22.04 WSL2 + Windows 11, Docker 27, RTX 4050.

Summary by CodeRabbit

New Features
- Added automatic TLS certificate generation and injection capability
- Self-signed CA and server certificates are now generated and configured during gateway initialization
- Kubernetes TLS and CA secrets are automatically created and managed as needed
- Certificate injection failures are caught and logged to prevent interruption of the onboarding workflow

cv · 2026-03-21T19:28:50Z

Thanks for looking into the TLS secret injection issue, @vishalp-dev24! Fresh installs tripping over gateway timing is exactly the kind of thing that's painful to debug. Since you opened this, we've been shipping features and adding CI checks at a pretty steady clip, so main has drifted. Would you be able to rebase onto the latest main? We'd really like to give this a proper review once it's up to date. Thanks!

coderabbitai · 2026-03-23T09:39:32Z

📝 Walkthrough

Walkthrough

A new injectTlsSecrets() helper function was added to generate self-signed certificates and manage Kubernetes TLS secrets. The function executes OpenSSL commands, waits for a namespace, and recreates secret objects via kubectl within the onboarding flow.

Changes

Cohort / File(s)	Summary
TLS Secret Injection `bin/lib/onboard.js`	Added `injectTlsSecrets()` helper that generates self-signed CA/server certificates via OpenSSL, waits for the openshell namespace, and recreates Kubernetes TLS/CA secrets. Invoked from `startGateway(gpu)` after gateway startup with error logging but no flow abortion.

Sequence Diagram(s)

sequenceDiagram
    participant Gateway as OpenShell Gateway
    participant TLS as injectTlsSecrets()
    participant OpenSSL as OpenSSL
    participant Docker as Docker Container
    participant K8s as Kubernetes API

    Gateway->>TLS: Invoke after startup
    TLS->>OpenSSL: Generate self-signed CA certificate
    OpenSSL-->>TLS: CA cert + key files
    TLS->>OpenSSL: Generate server certificate
    OpenSSL-->>TLS: Server cert + key files
    TLS->>Docker: Wait for openshell namespace
    Docker-->>TLS: Namespace ready
    TLS->>K8s: Delete openshell-server-tls secret
    K8s-->>TLS: Secret deleted
    TLS->>K8s: Delete openshell-server-client-ca secret
    K8s-->>TLS: Secret deleted
    TLS->>Docker: Copy cert/key files to container
    Docker-->>TLS: Files copied
    TLS->>K8s: Create openshell-server-tls secret
    K8s-->>TLS: Secret created
    TLS->>K8s: Create openshell-server-client-ca secret
    K8s-->>TLS: Secret created
    TLS-->>Gateway: TLS injection complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Secrets in certificates, hopping through the cloud,
OpenSSL whispers, Kubernetes bows proud,
TLS dancing with Docker so grand,
Certificates spinning with a flick of my paw's hand!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: inject TLS secrets after gateway start on fresh install' directly and accurately describes the main change: adding TLS secret injection logic after gateway startup to resolve fresh install issues.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 454-464: The readiness loop that runs "kubectl get namespace
openshell" against the container variable may never succeed but currently
continues silently; update the loop in onboard.js (the block using execSync,
container and the for (let i = 0; i < 30; i++) { ... } loop) to track success
(e.g., set a ready boolean when the execSync succeeds) and after the loop throw
an Error or call process.exit(1) with a clear message if ready is false (e.g.,
"openshell namespace did not become ready within timeout") so downstream docker
cp / kubectl commands fail fast with a clear failure reason.
- Around line 423-447: The tmpDir is fixed (tmpDir = "/tmp/openshell-tls-fix")
causing stale private keys and race conditions; change to create a unique temp
directory (use fs.mkdtemp or append a secure random suffix to tmpDir) for each
invocation, write v3.ext and generated private keys (ca.key, server.key) into
that per-run directory instead of a shared path, and ensure you securely remove
private keys and the temp directory after use (unlink/remove) to prevent
accumulation and collisions; update any code referencing tmpDir (the tmpDir
const and places writing v3.ext, ca.key, server.key) to use the new per-run
directory and add try/finally or equivalent cleanup to guarantee deletion on
success or error.
- Around line 473-475: The catch block in injectTlsSecrets currently swallows
errors (console.error only), allowing execution to continue and causing the
health-check retry delay; update the catch in injectTlsSecrets (the function
named injectTlsSecrets and its call site where it runs unconditionally) to
propagate failure immediately by either rethrowing the caught error or exiting
with a non-zero status after logging (e.g., process.exit(1)), so that TLS
injection failures abort onboarding immediately with a clear error message
rather than letting the health-check loop mask the root cause.
- Around line 469-471: The current sequence unconditionally deletes then
recreates TLS secrets using the three execSync calls, leaving a race where
creation failure removes secrets; change the logic for the execSync invocations
that reference the container variable so you first check for secret existence
and only create if missing (e.g., run "kubectl get secret openshell-server-tls
-n openshell" and if it fails run the create-tls command, and similarly check
for openshell-server-client-ca before creating it) instead of deleting up-front;
also remove the unconditional kubectl delete call and ensure execSync errors
from creation are propagated or logged so failures do not silently leave secrets
absent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 37199909-f80d-4209-9518-cb8cab4a3a65

📥 Commits

Reviewing files that changed from the base of the PR and between c55a309 and 3984308.

📒 Files selected for processing (1)

bin/lib/onboard.js

coderabbitai · 2026-03-23T09:43:01Z

bin/lib/onboard.js

+  const tmpDir = "/tmp/openshell-tls-fix";
+  fs.mkdirSync(tmpDir, { recursive: true });
+
+  const ext = [
+    "[req]",
+    "req_extensions = v3_req",
+    "distinguished_name = req_distinguished_name",
+    "[req_distinguished_name]",
+    "[v3_req]",
+    "basicConstraints = CA:FALSE",
+    "keyUsage = nonRepudiation, digitalSignature, keyEncipherment",
+    "subjectAltName = @alt_names",
+    "[alt_names]",
+    "DNS.1 = openshell",
+    "DNS.2 = openshell.openshell.svc",
+    "DNS.3 = openshell.openshell.svc.cluster.local",
+    "DNS.4 = localhost",
+    "IP.1 = 127.0.0.1",
+    "[v3_ca]",
+    "subjectKeyIdentifier = hash",
+    "authorityKeyIdentifier = keyid:always,issuer",
+    "basicConstraints = CA:TRUE",
+    "keyUsage = cRLSign, keyCertSign",
+  ].join("\n");
+  fs.writeFileSync(tmpDir + "/v3.ext", ext);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check the file structure and locate the file git ls-files bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 77

🏁 Script executed:

# Read the relevant section with context head -n 500 bin/lib/onboard.js | tail -n 100

Repository: NVIDIA/NemoClaw

Length of output: 4718

🏁 Script executed:

# Get a more precise view of the function around lines 420-480 sed -n '410,490p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 4349

🏁 Script executed:

# Check if tmpDir or /tmp/openshell-tls-fix is referenced elsewhere in the file rg "openshell-tls-fix|tmpDir" bin/lib/onboard.js -n

Repository: NVIDIA/NemoClaw

Length of output: 1243

🏁 Script executed:

# Check imports at the top of the file to see what modules are available head -n 50 bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2131

Remove private keys from /tmp after use and prevent concurrent invocation collisions.

Line 423 creates a shared fixed temp directory that persists across invocations, leaving generated private keys (ca.key, server.key) in /tmp indefinitely. This enables stale key material to accumulate and allows concurrent runs to collide and overwrite each other's key material.

Suggested fix

function injectTlsSecrets() { const { execSync } = require("child_process"); console.log(" Injecting TLS secrets for OpenShell gateway..."); - const tmpDir = "/tmp/openshell-tls-fix"; - fs.mkdirSync(tmpDir, { recursive: true }); + const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "openshell-tls-")); + fs.chmodSync(tmpDir, 0o700); @@ - fs.writeFileSync(tmpDir + "/v3.ext", ext); + fs.writeFileSync(path.join(tmpDir, "v3.ext"), ext, { mode: 0o600 }); @@ } catch (e) { console.error(" !! TLS injection failed:", e.message); + } finally { + fs.rmSync(tmpDir, { recursive: true, force: true }); } }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 423 - 447, The tmpDir is fixed (tmpDir = "/tmp/openshell-tls-fix") causing stale private keys and race conditions; change to create a unique temp directory (use fs.mkdtemp or append a secure random suffix to tmpDir) for each invocation, write v3.ext and generated private keys (ca.key, server.key) into that per-run directory instead of a shared path, and ensure you securely remove private keys and the temp directory after use (unlink/remove) to prevent accumulation and collisions; update any code referencing tmpDir (the tmpDir const and places writing v3.ext, ca.key, server.key) to use the new per-run directory and add try/finally or equivalent cleanup to guarantee deletion on success or error.

coderabbitai · 2026-03-23T09:43:01Z

bin/lib/onboard.js

+    console.log("  Waiting for k3s namespace...");
+    const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim();
+    for (let i = 0; i < 30; i++) {
+      try {
+        execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null");
+        console.log("  ✓ k3s namespace ready");
+        break;
+      } catch {
+        require("child_process").spawnSync("sleep", ["3"]);
+      }
+    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 82

🏁 Script executed:

sed -n '440,480p' bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 2339

Fail fast if the openshell namespace never becomes ready.

The readiness loop (lines 454-464) waits up to 90 seconds but continues regardless of success. If readiness never occurs, subsequent docker cp and kubectl create secret operations will fail with unclear errors. Add an explicit gate that throws an error immediately if the namespace doesn't become ready within the timeout.

Suggested fix

console.log(" Waiting for k3s namespace..."); const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim(); + let namespaceReady = false; for (let i = 0; i < 30; i++) { try { execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null"); console.log(" ✓ k3s namespace ready"); + namespaceReady = true; break; } catch { require("child_process").spawnSync("sleep", ["3"]); } } + if (!namespaceReady) { + throw new Error("k3s namespace 'openshell' did not become ready within 90s"); + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 454 - 464, The readiness loop that runs "kubectl get namespace openshell" against the container variable may never succeed but currently continues silently; update the loop in onboard.js (the block using execSync, container and the for (let i = 0; i < 30; i++) { ... } loop) to track success (e.g., set a ready boolean when the execSync succeeds) and after the loop throw an Error or call process.exit(1) with a clear message if ready is false (e.g., "openshell namespace did not become ready within timeout") so downstream docker cp / kubectl commands fail fast with a clear failure reason.

coderabbitai · 2026-03-23T09:43:01Z

bin/lib/onboard.js

+    execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true");
+    execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
+    execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file exists and read the relevant section wc -l bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 82

🏁 Script executed:

# Read lines around 469-471 with context sed -n '450,490p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2862

🏁 Script executed:

# Search for cert-manager references to understand if it's used in this project rg -i "cert-manager" --type js --type yaml --type json

Repository: NVIDIA/NemoClaw

Length of output: 41

🏁 Script executed:

# Search for other TLS secret creation patterns rg "kubectl.*secret" -A 2 -B 2

Repository: NVIDIA/NemoClaw

Length of output: 868

🏁 Script executed:

# Look at the broader context of the onboard.js file to understand the purpose head -100 bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 4128

🏁 Script executed:

# Read more context around these lines to see the function name and purpose sed -n '440,475p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2483

🏁 Script executed:

# Get the function name and surrounding context to confirm this is onboarding sed -n '400,445p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 1942

🏁 Script executed:

# Search for where injectTlsSecrets is called rg "injectTlsSecrets" -B 3 -A 3

Repository: NVIDIA/NemoClaw

Length of output: 689

Avoid unconditional delete/recreate of TLS secrets.

Lines 469-471 delete then recreate secrets without checking if they exist first. If the delete succeeds but creation fails, the secrets are left missing, creating a transient failure window. Check for secret existence before attempting creation.

Suggested fix

- execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true"); - execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key"); - execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt"); + const tlsSecret = execSync( + "docker exec " + container + " kubectl get secret openshell-server-tls -n openshell --ignore-not-found -o name" + ).toString().trim(); + const caSecret = execSync( + "docker exec " + container + " kubectl get secret openshell-server-client-ca -n openshell --ignore-not-found -o name" + ).toString().trim(); + + if (!tlsSecret) { + execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key"); + } + if (!caSecret) { + execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt"); + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 469 - 471, The current sequence unconditionally deletes then recreates TLS secrets using the three execSync calls, leaving a race where creation failure removes secrets; change the logic for the execSync invocations that reference the container variable so you first check for secret existence and only create if missing (e.g., run "kubectl get secret openshell-server-tls -n openshell" and if it fails run the create-tls command, and similarly check for openshell-server-client-ca before creating it) instead of deleting up-front; also remove the unconditional kubectl delete call and ensure execSync errors from creation are propagated or logged so failures do not silently leave secrets absent.

coderabbitai · 2026-03-23T09:43:01Z

bin/lib/onboard.js

+  } catch (e) {
+    console.error("  !! TLS injection failed:", e.message);
+  }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '460,520p'

Repository: NVIDIA/NemoClaw

Length of output: 3045

🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '420,476p'

Repository: NVIDIA/NemoClaw

Length of output: 3325

🏁 Script executed:

rg "injectTlsSecrets" bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 110

🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '511,525p'

Repository: NVIDIA/NemoClaw

Length of output: 647

Propagate TLS injection failures to fail immediately instead of triggering delayed health check timeouts.

The injectTlsSecrets() function at lines 473-475 catches errors and only logs them, allowing execution to continue silently. When called unconditionally at line 508, a failed injection causes the subsequent health check loop (lines 511-522) to retry 5 times with 2-second delays, adding ~8-10 seconds of confusion before the final timeout failure. Since TLS secrets are critical for the gateway to function, injection failures should exit immediately with a clear error message rather than masking the root cause behind a retry loop.

Suggested fix

} catch (e) { - console.error(" !! TLS injection failed:", e.message); + throw new Error(`TLS injection failed: ${e.message}`); } } @@ - injectTlsSecrets(); + try { + injectTlsSecrets(); + } catch (error) { + console.error(` ${error.message}`); + console.error(" Required OpenShell TLS secrets were not created. Aborting onboarding."); + process.exit(1); + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 473 - 475, The catch block in injectTlsSecrets currently swallows errors (console.error only), allowing execution to continue and causing the health-check retry delay; update the catch in injectTlsSecrets (the function named injectTlsSecrets and its call site where it runs unconditionally) to propagate failure immediately by either rethrowing the caught error or exiting with a non-zero status after logging (e.g., process.exit(1)), so that TLS injection failures abort onboarding immediately with a clear error message rather than letting the health-check loop mask the root cause.

wscurran added Integration: OpenClaw Support for OpenClaw Platform: Windows/WSL Support for Windows Subsystem for Linux labels Mar 18, 2026

fix: inject TLS secrets after gateway start on fresh install

3984308

vishalp-dev24 force-pushed the fix/tls-secrets-missing-on-fresh-install branch from 6eb0ba9 to 3984308 Compare March 23, 2026 09:39

coderabbitai bot reviewed Mar 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: inject TLS secrets after gateway start on fresh install#217

fix: inject TLS secrets after gateway start on fresh install#217
vishalp-dev24 wants to merge 1 commit intoNVIDIA:mainfrom
vishalp-dev24:fix/tls-secrets-missing-on-fresh-install

vishalp-dev24 commented Mar 17, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

cv commented Mar 21, 2026

Uh oh!

coderabbitai bot commented Mar 23, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 23, 2026

Uh oh!

coderabbitai bot Mar 23, 2026

Uh oh!

coderabbitai bot Mar 23, 2026

Uh oh!

coderabbitai bot Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vishalp-dev24 commented Mar 17, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

cv commented Mar 21, 2026

Uh oh!

coderabbitai bot commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vishalp-dev24 commented Mar 17, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 23, 2026 •

edited

Loading