fix: inject TLS secrets after gateway start on fresh install#217
fix: inject TLS secrets after gateway start on fresh install#217vishalp-dev24 wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
|
Thanks for looking into the TLS secret injection issue, @vishalp-dev24! Fresh installs tripping over gateway timing is exactly the kind of thing that's painful to debug. Since you opened this, we've been shipping features and adding CI checks at a pretty steady clip, so main has drifted. Would you be able to rebase onto the latest main? We'd really like to give this a proper review once it's up to date. Thanks! |
6eb0ba9 to
3984308
Compare
📝 WalkthroughWalkthroughA new Changes
Sequence Diagram(s)sequenceDiagram
participant Gateway as OpenShell Gateway
participant TLS as injectTlsSecrets()
participant OpenSSL as OpenSSL
participant Docker as Docker Container
participant K8s as Kubernetes API
Gateway->>TLS: Invoke after startup
TLS->>OpenSSL: Generate self-signed CA certificate
OpenSSL-->>TLS: CA cert + key files
TLS->>OpenSSL: Generate server certificate
OpenSSL-->>TLS: Server cert + key files
TLS->>Docker: Wait for openshell namespace
Docker-->>TLS: Namespace ready
TLS->>K8s: Delete openshell-server-tls secret
K8s-->>TLS: Secret deleted
TLS->>K8s: Delete openshell-server-client-ca secret
K8s-->>TLS: Secret deleted
TLS->>Docker: Copy cert/key files to container
Docker-->>TLS: Files copied
TLS->>K8s: Create openshell-server-tls secret
K8s-->>TLS: Secret created
TLS->>K8s: Create openshell-server-client-ca secret
K8s-->>TLS: Secret created
TLS-->>Gateway: TLS injection complete
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 454-464: The readiness loop that runs "kubectl get namespace
openshell" against the container variable may never succeed but currently
continues silently; update the loop in onboard.js (the block using execSync,
container and the for (let i = 0; i < 30; i++) { ... } loop) to track success
(e.g., set a ready boolean when the execSync succeeds) and after the loop throw
an Error or call process.exit(1) with a clear message if ready is false (e.g.,
"openshell namespace did not become ready within timeout") so downstream docker
cp / kubectl commands fail fast with a clear failure reason.
- Around line 423-447: The tmpDir is fixed (tmpDir = "/tmp/openshell-tls-fix")
causing stale private keys and race conditions; change to create a unique temp
directory (use fs.mkdtemp or append a secure random suffix to tmpDir) for each
invocation, write v3.ext and generated private keys (ca.key, server.key) into
that per-run directory instead of a shared path, and ensure you securely remove
private keys and the temp directory after use (unlink/remove) to prevent
accumulation and collisions; update any code referencing tmpDir (the tmpDir
const and places writing v3.ext, ca.key, server.key) to use the new per-run
directory and add try/finally or equivalent cleanup to guarantee deletion on
success or error.
- Around line 473-475: The catch block in injectTlsSecrets currently swallows
errors (console.error only), allowing execution to continue and causing the
health-check retry delay; update the catch in injectTlsSecrets (the function
named injectTlsSecrets and its call site where it runs unconditionally) to
propagate failure immediately by either rethrowing the caught error or exiting
with a non-zero status after logging (e.g., process.exit(1)), so that TLS
injection failures abort onboarding immediately with a clear error message
rather than letting the health-check loop mask the root cause.
- Around line 469-471: The current sequence unconditionally deletes then
recreates TLS secrets using the three execSync calls, leaving a race where
creation failure removes secrets; change the logic for the execSync invocations
that reference the container variable so you first check for secret existence
and only create if missing (e.g., run "kubectl get secret openshell-server-tls
-n openshell" and if it fails run the create-tls command, and similarly check
for openshell-server-client-ca before creating it) instead of deleting up-front;
also remove the unconditional kubectl delete call and ensure execSync errors
from creation are propagated or logged so failures do not silently leave secrets
absent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 37199909-f80d-4209-9518-cb8cab4a3a65
📒 Files selected for processing (1)
bin/lib/onboard.js
| const tmpDir = "/tmp/openshell-tls-fix"; | ||
| fs.mkdirSync(tmpDir, { recursive: true }); | ||
|
|
||
| const ext = [ | ||
| "[req]", | ||
| "req_extensions = v3_req", | ||
| "distinguished_name = req_distinguished_name", | ||
| "[req_distinguished_name]", | ||
| "[v3_req]", | ||
| "basicConstraints = CA:FALSE", | ||
| "keyUsage = nonRepudiation, digitalSignature, keyEncipherment", | ||
| "subjectAltName = @alt_names", | ||
| "[alt_names]", | ||
| "DNS.1 = openshell", | ||
| "DNS.2 = openshell.openshell.svc", | ||
| "DNS.3 = openshell.openshell.svc.cluster.local", | ||
| "DNS.4 = localhost", | ||
| "IP.1 = 127.0.0.1", | ||
| "[v3_ca]", | ||
| "subjectKeyIdentifier = hash", | ||
| "authorityKeyIdentifier = keyid:always,issuer", | ||
| "basicConstraints = CA:TRUE", | ||
| "keyUsage = cRLSign, keyCertSign", | ||
| ].join("\n"); | ||
| fs.writeFileSync(tmpDir + "/v3.ext", ext); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check the file structure and locate the file
git ls-files bin/lib/onboard.jsRepository: NVIDIA/NemoClaw
Length of output: 77
🏁 Script executed:
# Read the relevant section with context
head -n 500 bin/lib/onboard.js | tail -n 100Repository: NVIDIA/NemoClaw
Length of output: 4718
🏁 Script executed:
# Get a more precise view of the function around lines 420-480
sed -n '410,490p' bin/lib/onboard.js | cat -nRepository: NVIDIA/NemoClaw
Length of output: 4349
🏁 Script executed:
# Check if tmpDir or /tmp/openshell-tls-fix is referenced elsewhere in the file
rg "openshell-tls-fix|tmpDir" bin/lib/onboard.js -nRepository: NVIDIA/NemoClaw
Length of output: 1243
🏁 Script executed:
# Check imports at the top of the file to see what modules are available
head -n 50 bin/lib/onboard.js | cat -nRepository: NVIDIA/NemoClaw
Length of output: 2131
Remove private keys from /tmp after use and prevent concurrent invocation collisions.
Line 423 creates a shared fixed temp directory that persists across invocations, leaving generated private keys (ca.key, server.key) in /tmp indefinitely. This enables stale key material to accumulate and allows concurrent runs to collide and overwrite each other's key material.
Suggested fix
function injectTlsSecrets() {
const { execSync } = require("child_process");
console.log(" Injecting TLS secrets for OpenShell gateway...");
- const tmpDir = "/tmp/openshell-tls-fix";
- fs.mkdirSync(tmpDir, { recursive: true });
+ const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "openshell-tls-"));
+ fs.chmodSync(tmpDir, 0o700);
@@
- fs.writeFileSync(tmpDir + "/v3.ext", ext);
+ fs.writeFileSync(path.join(tmpDir, "v3.ext"), ext, { mode: 0o600 });
@@
} catch (e) {
console.error(" !! TLS injection failed:", e.message);
+ } finally {
+ fs.rmSync(tmpDir, { recursive: true, force: true });
}
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 423 - 447, The tmpDir is fixed (tmpDir =
"/tmp/openshell-tls-fix") causing stale private keys and race conditions; change
to create a unique temp directory (use fs.mkdtemp or append a secure random
suffix to tmpDir) for each invocation, write v3.ext and generated private keys
(ca.key, server.key) into that per-run directory instead of a shared path, and
ensure you securely remove private keys and the temp directory after use
(unlink/remove) to prevent accumulation and collisions; update any code
referencing tmpDir (the tmpDir const and places writing v3.ext, ca.key,
server.key) to use the new per-run directory and add try/finally or equivalent
cleanup to guarantee deletion on success or error.
| console.log(" Waiting for k3s namespace..."); | ||
| const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim(); | ||
| for (let i = 0; i < 30; i++) { | ||
| try { | ||
| execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null"); | ||
| console.log(" ✓ k3s namespace ready"); | ||
| break; | ||
| } catch { | ||
| require("child_process").spawnSync("sleep", ["3"]); | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
wc -l bin/lib/onboard.jsRepository: NVIDIA/NemoClaw
Length of output: 82
🏁 Script executed:
sed -n '440,480p' bin/lib/onboard.jsRepository: NVIDIA/NemoClaw
Length of output: 2339
Fail fast if the openshell namespace never becomes ready.
The readiness loop (lines 454-464) waits up to 90 seconds but continues regardless of success. If readiness never occurs, subsequent docker cp and kubectl create secret operations will fail with unclear errors. Add an explicit gate that throws an error immediately if the namespace doesn't become ready within the timeout.
Suggested fix
console.log(" Waiting for k3s namespace...");
const container = execSync("docker ps --filter name=openshell-cluster-nemoclaw --format '{{.Names}}'").toString().trim();
+ let namespaceReady = false;
for (let i = 0; i < 30; i++) {
try {
execSync("docker exec " + container + " kubectl get namespace openshell 2>/dev/null");
console.log(" ✓ k3s namespace ready");
+ namespaceReady = true;
break;
} catch {
require("child_process").spawnSync("sleep", ["3"]);
}
}
+ if (!namespaceReady) {
+ throw new Error("k3s namespace 'openshell' did not become ready within 90s");
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 454 - 464, The readiness loop that runs
"kubectl get namespace openshell" against the container variable may never
succeed but currently continues silently; update the loop in onboard.js (the
block using execSync, container and the for (let i = 0; i < 30; i++) { ... }
loop) to track success (e.g., set a ready boolean when the execSync succeeds)
and after the loop throw an Error or call process.exit(1) with a clear message
if ready is false (e.g., "openshell namespace did not become ready within
timeout") so downstream docker cp / kubectl commands fail fast with a clear
failure reason.
| execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true"); | ||
| execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key"); | ||
| execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt"); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the file exists and read the relevant section
wc -l bin/lib/onboard.jsRepository: NVIDIA/NemoClaw
Length of output: 82
🏁 Script executed:
# Read lines around 469-471 with context
sed -n '450,490p' bin/lib/onboard.js | cat -nRepository: NVIDIA/NemoClaw
Length of output: 2862
🏁 Script executed:
# Search for cert-manager references to understand if it's used in this project
rg -i "cert-manager" --type js --type yaml --type jsonRepository: NVIDIA/NemoClaw
Length of output: 41
🏁 Script executed:
# Search for other TLS secret creation patterns
rg "kubectl.*secret" -A 2 -B 2Repository: NVIDIA/NemoClaw
Length of output: 868
🏁 Script executed:
# Look at the broader context of the onboard.js file to understand the purpose
head -100 bin/lib/onboard.js | cat -nRepository: NVIDIA/NemoClaw
Length of output: 4128
🏁 Script executed:
# Read more context around these lines to see the function name and purpose
sed -n '440,475p' bin/lib/onboard.js | cat -nRepository: NVIDIA/NemoClaw
Length of output: 2483
🏁 Script executed:
# Get the function name and surrounding context to confirm this is onboarding
sed -n '400,445p' bin/lib/onboard.js | cat -nRepository: NVIDIA/NemoClaw
Length of output: 1942
🏁 Script executed:
# Search for where injectTlsSecrets is called
rg "injectTlsSecrets" -B 3 -A 3Repository: NVIDIA/NemoClaw
Length of output: 689
Avoid unconditional delete/recreate of TLS secrets.
Lines 469-471 delete then recreate secrets without checking if they exist first. If the delete succeeds but creation fails, the secrets are left missing, creating a transient failure window. Check for secret existence before attempting creation.
Suggested fix
- execSync("docker exec " + container + " kubectl delete secret openshell-server-tls openshell-server-client-ca -n openshell 2>/dev/null || true");
- execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
- execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
+ const tlsSecret = execSync(
+ "docker exec " + container + " kubectl get secret openshell-server-tls -n openshell --ignore-not-found -o name"
+ ).toString().trim();
+ const caSecret = execSync(
+ "docker exec " + container + " kubectl get secret openshell-server-client-ca -n openshell --ignore-not-found -o name"
+ ).toString().trim();
+
+ if (!tlsSecret) {
+ execSync("docker exec " + container + " kubectl create secret tls openshell-server-tls -n openshell --cert=/tmp/server.crt --key=/tmp/server.key");
+ }
+ if (!caSecret) {
+ execSync("docker exec " + container + " kubectl create secret generic openshell-server-client-ca -n openshell --from-file=ca.crt=/tmp/ca.crt");
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 469 - 471, The current sequence
unconditionally deletes then recreates TLS secrets using the three execSync
calls, leaving a race where creation failure removes secrets; change the logic
for the execSync invocations that reference the container variable so you first
check for secret existence and only create if missing (e.g., run "kubectl get
secret openshell-server-tls -n openshell" and if it fails run the create-tls
command, and similarly check for openshell-server-client-ca before creating it)
instead of deleting up-front; also remove the unconditional kubectl delete call
and ensure execSync errors from creation are propagated or logged so failures do
not silently leave secrets absent.
| } catch (e) { | ||
| console.error(" !! TLS injection failed:", e.message); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n bin/lib/onboard.js | sed -n '460,520p'Repository: NVIDIA/NemoClaw
Length of output: 3045
🏁 Script executed:
cat -n bin/lib/onboard.js | sed -n '420,476p'Repository: NVIDIA/NemoClaw
Length of output: 3325
🏁 Script executed:
rg "injectTlsSecrets" bin/lib/onboard.jsRepository: NVIDIA/NemoClaw
Length of output: 110
🏁 Script executed:
cat -n bin/lib/onboard.js | sed -n '511,525p'Repository: NVIDIA/NemoClaw
Length of output: 647
Propagate TLS injection failures to fail immediately instead of triggering delayed health check timeouts.
The injectTlsSecrets() function at lines 473-475 catches errors and only logs them, allowing execution to continue silently. When called unconditionally at line 508, a failed injection causes the subsequent health check loop (lines 511-522) to retry 5 times with 2-second delays, adding ~8-10 seconds of confusion before the final timeout failure. Since TLS secrets are critical for the gateway to function, injection failures should exit immediately with a clear error message rather than masking the root cause behind a retry loop.
Suggested fix
} catch (e) {
- console.error(" !! TLS injection failed:", e.message);
+ throw new Error(`TLS injection failed: ${e.message}`);
}
}
@@
- injectTlsSecrets();
+ try {
+ injectTlsSecrets();
+ } catch (error) {
+ console.error(` ${error.message}`);
+ console.error(" Required OpenShell TLS secrets were not created. Aborting onboarding.");
+ process.exit(1);
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 473 - 475, The catch block in
injectTlsSecrets currently swallows errors (console.error only), allowing
execution to continue and causing the health-check retry delay; update the catch
in injectTlsSecrets (the function named injectTlsSecrets and its call site where
it runs unconditionally) to propagate failure immediately by either rethrowing
the caught error or exiting with a non-zero status after logging (e.g.,
process.exit(1)), so that TLS injection failures abort onboarding immediately
with a clear error message rather than letting the health-check loop mask the
root cause.
The Helm chart deploys openshell-0 but never creates the two TLS secrets it requires to initialize:
On systems without cert-manager pre-installed (all default setups), the pod waits indefinitely causing the onboard wizard to time out.
Added injectTlsSecrets() in bin/lib/onboard.js which:
Called immediately after openshell gateway start in startGateway().
Tested on: Ubuntu 22.04 WSL2 + Windows 11, Docker 27, RTX 4050.
Summary by CodeRabbit