Skip to content

fix: inject TLS secrets after gateway start on fresh install#217

Open
vishalp-dev24 wants to merge 1 commit intoNVIDIA:mainfrom
vishalp-dev24:fix/tls-secrets-missing-on-fresh-install
Open

fix: inject TLS secrets after gateway start on fresh install#217
vishalp-dev24 wants to merge 1 commit intoNVIDIA:mainfrom
vishalp-dev24:fix/tls-secrets-missing-on-fresh-install

Conversation

@vishalp-dev24
Copy link

@vishalp-dev24 vishalp-dev24 commented Mar 17, 2026

The Helm chart deploys openshell-0 but never creates the two TLS secrets it requires to initialize:

  • openshell-server-tls
  • openshell-server-client-ca

On systems without cert-manager pre-installed (all default setups), the pod waits indefinitely causing the onboard wizard to time out.

Added injectTlsSecrets() in bin/lib/onboard.js which:

  1. Generates a self-signed X.509 v3 CA and server cert via openssl
  2. Copies them into the k3s container via docker cp
  3. Creates both secrets via kubectl create secret

Called immediately after openshell gateway start in startGateway().

Tested on: Ubuntu 22.04 WSL2 + Windows 11, Docker 27, RTX 4050.

Summary by CodeRabbit

  • New Features
    • Added automatic TLS certificate generation and injection capability
    • Self-signed CA and server certificates are now generated and configured during gateway initialization
    • Kubernetes TLS and CA secrets are automatically created and managed as needed
    • Certificate injection failures are caught and logged to prevent interruption of the onboarding workflow

@wscurran wscurran added Integration: OpenClaw Support for OpenClaw Platform: Windows/WSL Support for Windows Subsystem for Linux labels Mar 18, 2026
@cv
Copy link
Contributor

cv commented Mar 21, 2026

Thanks for looking into the TLS secret injection issue, @vishalp-dev24! Fresh installs tripping over gateway timing is exactly the kind of thing that's painful to debug. Since you opened this, we've been shipping features and adding CI checks at a pretty steady clip, so main has drifted. Would you be able to rebase onto the latest main? We'd really like to give this a proper review once it's up to date. Thanks!

@vishalp-dev24 vishalp-dev24 force-pushed the fix/tls-secrets-missing-on-fresh-install branch from 6eb0ba9 to 3984308 Compare March 23, 2026 09:39
@coderabbitai
Copy link

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough

Walkthrough

A new injectTlsSecrets() helper function was added to generate self-signed certificates and manage Kubernetes TLS secrets. The function executes OpenSSL commands, waits for a namespace, and recreates secret objects via kubectl within the onboarding flow.

Changes

Cohort / File(s) Summary
TLS Secret Injection
bin/lib/onboard.js
Added injectTlsSecrets() helper that generates self-signed CA/server certificates via OpenSSL, waits for the openshell namespace, and recreates Kubernetes TLS/CA secrets. Invoked from startGateway(gpu) after gateway startup with error logging but no flow abortion.

Sequence Diagram(s)

sequenceDiagram
    participant Gateway as OpenShell Gateway
    participant TLS as injectTlsSecrets()
    participant OpenSSL as OpenSSL
    participant Docker as Docker Container
    participant K8s as Kubernetes API

    Gateway->>TLS: Invoke after startup
    TLS->>OpenSSL: Generate self-signed CA certificate
    OpenSSL-->>TLS: CA cert + key files
    TLS->>OpenSSL: Generate server certificate
    OpenSSL-->>TLS: Server cert + key files
    TLS->>Docker: Wait for openshell namespace
    Docker-->>TLS: Namespace ready
    TLS->>K8s: Delete openshell-server-tls secret
    K8s-->>TLS: Secret deleted
    TLS->>K8s: Delete openshell-server-client-ca secret
    K8s-->>TLS: Secret deleted
    TLS->>Docker: Copy cert/key files to container
    Docker-->>TLS: Files copied
    TLS->>K8s: Create openshell-server-tls secret
    K8s-->>TLS: Secret created
    TLS->>K8s: Create openshell-server-client-ca secret
    K8s-->>TLS: Secret created
    TLS-->>Gateway: TLS injection complete
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Secrets in certificates, hopping through the cloud,
OpenSSL whispers, Kubernetes bows proud,
TLS dancing with Docker so grand,
Certificates spinning with a flick of my paw's hand!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: inject TLS secrets after gateway start on fresh install' directly and accurately describes the main change: adding TLS secret injection logic after gateway startup to resolve fresh install issues.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 454-464: The readiness loop that runs "kubectl get namespace
openshell" against the container variable may never succeed but currently
continues silently; update the loop in onboard.js (the block using execSync,
container and the for (let i = 0; i < 30; i++) { ... } loop) to track success
(e.g., set a ready boolean when the execSync succeeds) and after the loop throw
an Error or call process.exit(1) with a clear message if ready is false (e.g.,
"openshell namespace did not become ready within timeout") so downstream docker
cp / kubectl commands fail fast with a clear failure reason.
- Around line 423-447: The tmpDir is fixed (tmpDir = "/tmp/openshell-tls-fix")
causing stale private keys and race conditions; change to create a unique temp
directory (use fs.mkdtemp or append a secure random suffix to tmpDir) for each
invocation, write v3.ext and generated private keys (ca.key, server.key) into
that per-run directory instead of a shared path, and ensure you securely remove
private keys and the temp directory after use (unlink/remove) to prevent
accumulation and collisions; update any code referencing tmpDir (the tmpDir
const and places writing v3.ext, ca.key, server.key) to use the new per-run
directory and add try/finally or equivalent cleanup to guarantee deletion on
success or error.
- Around line 473-475: The catch block in injectTlsSecrets currently swallows
errors (console.error only), allowing execution to continue and causing the
health-check retry delay; update the catch in injectTlsSecrets (the function
named injectTlsSecrets and its call site where it runs unconditionally) to
propagate failure immediately by either rethrowing the caught error or exiting
with a non-zero status after logging (e.g., process.exit(1)), so that TLS
injection failures abort onboarding immediately with a clear error message
rather than letting the health-check loop mask the root cause.
- Around line 469-471: The current sequence unconditionally deletes then
recreates TLS secrets using the three execSync calls, leaving a race where
creation failure removes secrets; change the logic for the execSync invocations
that reference the container variable so you first check for secret existence
and only create if missing (e.g., run "kubectl get secret openshell-server-tls
-n openshell" and if it fails run the create-tls command, and similarly check
for openshell-server-client-ca before creating it) instead of deleting up-front;
also remove the unconditional kubectl delete call and ensure execSync errors
from creation are propagated or logged so failures do not silently leave secrets
absent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 37199909-f80d-4209-9518-cb8cab4a3a65

📥 Commits

Reviewing files that changed from the base of the PR and between c55a309 and 3984308.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

Comment on lines +423 to +447
const tmpDir = "/tmp/openshell-tls-fix";
fs.mkdirSync(tmpDir, { recursive: true });

const ext = [
"[req]",
"req_extensions = v3_req",
"distinguished_name = req_distinguished_name",
"[req_distinguished_name]",
"[v3_req]",
"basicConstraints = CA:FALSE",
"keyUsage = nonRepudiation, digitalSignature, keyEncipherment",
"subjectAltName = @alt_names",
"[alt_names]",
"DNS.1 = openshell",
"DNS.2 = openshell.openshell.svc",
"DNS.3 = openshell.openshell.svc.cluster.local",
"DNS.4 = localhost",
"IP.1 = 127.0.0.1",
"[v3_ca]",
"subjectKeyIdentifier = hash",
"authorityKeyIdentifier = keyid:always,issuer",
"basicConstraints = CA:TRUE",
"keyUsage = cRLSign, keyCertSign",
].join("\n");
fs.writeFileSync(tmpDir + "/v3.ext", ext);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check the file structure and locate the file
git ls-files bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 77


🏁 Script executed:

# Read the relevant section with context
head -n 500 bin/lib/onboard.js | tail -n 100

Repository: NVIDIA/NemoClaw

Length of output: 4718


🏁 Script executed:

# Get a more precise view of the function around lines 420-480
sed -n '410,490p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 4349


🏁 Script executed:

# Check if tmpDir or /tmp/openshell-tls-fix is referenced elsewhere in the file
rg "openshell-tls-fix|tmpDir" bin/lib/onboard.js -n

Repository: NVIDIA/NemoClaw

Length of output: 1243


🏁 Script executed:

# Check imports at the top of the file to see what modules are available
head -n 50 bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2131


Remove private keys from /tmp after use and prevent concurrent invocation collisions.

Line 423 creates a shared fixed temp directory that persists across invocations, leaving generated private keys (ca.key, server.key) in /tmp indefinitely. This enables stale key material to accumulate and allows concurrent runs to collide and overwrite each other's key material.

Suggested fix
 function injectTlsSecrets() {
   const { execSync } = require("child_process");
   console.log("  Injecting TLS secrets for OpenShell gateway...");
-  const tmpDir = "/tmp/openshell-tls-fix";
-  fs.mkdirSync(tmpDir, { recursive: true });
+  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "openshell-tls-"));
+  fs.chmodSync(tmpDir, 0o700);
@@
-  fs.writeFileSync(tmpDir + "/v3.ext", ext);
+  fs.writeFileSync(path.join(tmpDir, "v3.ext"), ext, { mode: 0o600 });
@@
   } catch (e) {
     console.error("  !! TLS injection failed:", e.message);
+  } finally {
+    fs.rmSync(tmpDir, { recursive: true, force: true });
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 423 - 447, The tmpDir is fixed (tmpDir =
"/tmp/openshell-tls-fix") causing stale private keys and race conditions; change
to create a unique temp directory (use fs.mkdtemp or append a secure random
suffix to tmpDir) for each invocation, write v3.ext and generated private keys
(ca.key, server.key) into that per-run directory instead of a shared path, and
ensure you securely remove private keys and the temp directory after use
(unlink/remove) to prevent accumulation and collisions; update any code
referencing tmpDir (the tmpDir const and places writing v3.ext, ca.key,
server.key) to use the new per-run directory and add try/finally or equivalent
cleanup to guarantee deletion on success or error.

Comment on lines +454 to +464
console.log(" Waiting for k3s namespace...");
const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim();
for (let i = 0; i < 30; i++) {
try {
execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null");
console.log(" ✓ k3s namespace ready");
break;
} catch {
require("child_process").spawnSync("sleep", ["3"]);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 82


🏁 Script executed:

sed -n '440,480p' bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 2339


Fail fast if the openshell namespace never becomes ready.

The readiness loop (lines 454-464) waits up to 90 seconds but continues regardless of success. If readiness never occurs, subsequent docker cp and kubectl create secret operations will fail with unclear errors. Add an explicit gate that throws an error immediately if the namespace doesn't become ready within the timeout.

Suggested fix
     console.log("  Waiting for k3s namespace...");
     const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim();
+    let namespaceReady = false;
     for (let i = 0; i < 30; i++) {
       try {
         execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null");
         console.log("  ✓ k3s namespace ready");
+        namespaceReady = true;
         break;
       } catch {
         require("child_process").spawnSync("sleep", ["3"]);
       }
     }
+    if (!namespaceReady) {
+      throw new Error("k3s namespace 'openshell' did not become ready within 90s");
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 454 - 464, The readiness loop that runs
"kubectl get namespace openshell" against the container variable may never
succeed but currently continues silently; update the loop in onboard.js (the
block using execSync, container and the for (let i = 0; i < 30; i++) { ... }
loop) to track success (e.g., set a ready boolean when the execSync succeeds)
and after the loop throw an Error or call process.exit(1) with a clear message
if ready is false (e.g., "openshell namespace did not become ready within
timeout") so downstream docker cp / kubectl commands fail fast with a clear
failure reason.

Comment on lines +469 to +471
execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true");
execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file exists and read the relevant section
wc -l bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 82


🏁 Script executed:

# Read lines around 469-471 with context
sed -n '450,490p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2862


🏁 Script executed:

# Search for cert-manager references to understand if it's used in this project
rg -i "cert-manager" --type js --type yaml --type json

Repository: NVIDIA/NemoClaw

Length of output: 41


🏁 Script executed:

# Search for other TLS secret creation patterns
rg "kubectl.*secret" -A 2 -B 2

Repository: NVIDIA/NemoClaw

Length of output: 868


🏁 Script executed:

# Look at the broader context of the onboard.js file to understand the purpose
head -100 bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 4128


🏁 Script executed:

# Read more context around these lines to see the function name and purpose
sed -n '440,475p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 2483


🏁 Script executed:

# Get the function name and surrounding context to confirm this is onboarding
sed -n '400,445p' bin/lib/onboard.js | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 1942


🏁 Script executed:

# Search for where injectTlsSecrets is called
rg "injectTlsSecrets" -B 3 -A 3

Repository: NVIDIA/NemoClaw

Length of output: 689


Avoid unconditional delete/recreate of TLS secrets.

Lines 469-471 delete then recreate secrets without checking if they exist first. If the delete succeeds but creation fails, the secrets are left missing, creating a transient failure window. Check for secret existence before attempting creation.

Suggested fix
-    execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true");
-    execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
-    execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
+    const tlsSecret = execSync(
+      "docker exec " + container + " kubectl get secret openshell-server-tls -n openshell --ignore-not-found -o name"
+    ).toString().trim();
+    const caSecret = execSync(
+      "docker exec " + container + " kubectl get secret openshell-server-client-ca -n openshell --ignore-not-found -o name"
+    ).toString().trim();
+
+    if (!tlsSecret) {
+      execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
+    }
+    if (!caSecret) {
+      execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 469 - 471, The current sequence
unconditionally deletes then recreates TLS secrets using the three execSync
calls, leaving a race where creation failure removes secrets; change the logic
for the execSync invocations that reference the container variable so you first
check for secret existence and only create if missing (e.g., run "kubectl get
secret openshell-server-tls -n openshell" and if it fails run the create-tls
command, and similarly check for openshell-server-client-ca before creating it)
instead of deleting up-front; also remove the unconditional kubectl delete call
and ensure execSync errors from creation are propagated or logged so failures do
not silently leave secrets absent.

Comment on lines +473 to +475
} catch (e) {
console.error(" !! TLS injection failed:", e.message);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '460,520p'

Repository: NVIDIA/NemoClaw

Length of output: 3045


🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '420,476p'

Repository: NVIDIA/NemoClaw

Length of output: 3325


🏁 Script executed:

rg "injectTlsSecrets" bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 110


🏁 Script executed:

cat -n bin/lib/onboard.js | sed -n '511,525p'

Repository: NVIDIA/NemoClaw

Length of output: 647


Propagate TLS injection failures to fail immediately instead of triggering delayed health check timeouts.

The injectTlsSecrets() function at lines 473-475 catches errors and only logs them, allowing execution to continue silently. When called unconditionally at line 508, a failed injection causes the subsequent health check loop (lines 511-522) to retry 5 times with 2-second delays, adding ~8-10 seconds of confusion before the final timeout failure. Since TLS secrets are critical for the gateway to function, injection failures should exit immediately with a clear error message rather than masking the root cause behind a retry loop.

Suggested fix
   } catch (e) {
-    console.error("  !! TLS injection failed:", e.message);
+    throw new Error(`TLS injection failed: ${e.message}`);
   }
 }
@@
-  injectTlsSecrets();
+  try {
+    injectTlsSecrets();
+  } catch (error) {
+    console.error(`  ${error.message}`);
+    console.error("  Required OpenShell TLS secrets were not created. Aborting onboarding.");
+    process.exit(1);
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 473 - 475, The catch block in
injectTlsSecrets currently swallows errors (console.error only), allowing
execution to continue and causing the health-check retry delay; update the catch
in injectTlsSecrets (the function named injectTlsSecrets and its call site where
it runs unconditionally) to propagate failure immediately by either rethrowing
the caught error or exiting with a non-zero status after logging (e.g.,
process.exit(1)), so that TLS injection failures abort onboarding immediately
with a clear error message rather than letting the health-check loop mask the
root cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Integration: OpenClaw Support for OpenClaw Platform: Windows/WSL Support for Windows Subsystem for Linux

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants