Skip to content

security: harden credential handling, add observability and CI workflows#390

Closed
100rabhkr wants to merge 8 commits intoNVIDIA:mainfrom
100rabhkr:security/harden-credentials-and-observability
Closed

security: harden credential handling, add observability and CI workflows#390
100rabhkr wants to merge 8 commits intoNVIDIA:mainfrom
100rabhkr:security/harden-credentials-and-observability

Conversation

@100rabhkr
Copy link

@100rabhkr 100rabhkr commented Mar 19, 2026

Summary

Fixes the critical API key exposure vulnerability (#325) and adds hardening across security, observability, reliability, testing, and CI/CD.

Security

  • Fix API key leakage via ps aux — Credentials passed through env var inheritance and stdin (curl -H @-) instead of CLI arguments in onboard.js, nemoclaw.js, and setup.sh. Runner error messages auto-redact known secret patterns.
  • Add API key validation during onboard — Keys validated against NVIDIA API before saving. Invalid/expired keys rejected immediately (fatal flag) instead of causing silent agent failure after 15-minute setup. Network failures warn but don't block.

Observability

  • Add --verbose / --debug flag — Global CLI flag and LOG_LEVEL env var. Logs every shell command (with secrets redacted) to stderr.

Reliability

  • Add onboard rollback on failure/interruptionOnboardPipeline state machine with synchronous SIGINT/SIGTERM handlers. Completed steps are rolled back in reverse order (gateway destroy, sandbox delete, NIM stop).
  • Rollback safety guards — Pre-existing sandboxes the user chose to keep are not destroyed on later failure (compares registry createdAt by returned name). Gateway only rolled back if this run started it (gatewayStarted flag).
  • Registry corruption recovery — Corrupted sandboxes.json backed up with timestamp, user warned with guidance.

Testing

  • 31 new tests (175 total, 0 failures): redactSecrets (9), OnboardPipeline rollback (7), validateApiKey (4), policies.js coverage expansion (11).

CI/CD

  • Release pipelinerelease.yaml: test → npm publish → changelog from conventional commits → GitHub Release.
  • CodeQL scanningcodeql.yaml: JS/TS + Python analysis on PRs and weekly.

Test plan

  • 175 tests pass (0 failures), up from 155
  • redactSecrets: 9 tests covering all secret patterns, empty input, consecutive calls
  • OnboardPipeline: 7 tests covering success, failure rollback order, null rollback, throwing rollback, cleanup
  • validateApiKey: 4 tests covering return shape, fatal invariant, empty key
  • --verbose flag shows in help output and activates debug logging
  • Manual: nemoclaw onboard with Ctrl+C verifies rollback cleanup
  • Manual: invalid API key rejected with clear error during onboard

Files changed (14)

File Change
bin/lib/credentials.js validateApiKey() with { ok, fatal, message } return; integrated into ensureApiKey()
bin/lib/logger.js New — debug-only logger (debug, setLevel, isVerbose)
bin/lib/onboard.js Credential fix, OnboardPipeline with sync rollback, ownership tracking (gatewayStarted, sandboxCreated)
bin/lib/registry.js Corruption backup + user warning on malformed sandboxes.json
bin/lib/runner.js redactSecrets() with per-call regex instances; verbose command logging
bin/nemoclaw.js --verbose/--debug flag parsing, help text, setupSpark credential fix
scripts/setup.sh Credential env var reference, remove key from sandbox create entrypoint
test/runner-redact.test.js New — 9 tests for secret redaction
test/onboard-pipeline.test.js New — 7 tests for pipeline rollback
test/validate-api-key.test.js New — 4 tests for API key validation
test/policies.test.js 11 new tests for extractPresetEntries, parseCurrentPolicy, schema
.github/workflows/release.yaml New — automated release pipeline
.github/workflows/codeql.yaml New — CodeQL security scanning

Summary by CodeRabbit

  • New Features

    • Synchronous API-key validation during setup; global --verbose/--debug flags and LOG_LEVEL support; tag-driven release with changelog and npm publish; scheduled security scans.
  • Bug Fixes

    • Registry now creates timestamped backups and warns on corruption.
  • Chores

    • Safer onboarding with rollback and signal handling; reduced exposure of credentials in subprocesses; improved command/output redaction and debug logging.
  • Tests

    • Added tests for policies, onboarding pipeline, secret redaction, and API-key validation.

@coderabbitai
Copy link

coderabbitai bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds CodeQL and release GitHub workflows; introduces a leveled logger and secret redaction in runner logs; validates NVIDIA API keys before saving; refactors onboarding into an undoable, signal-safe OnboardPipeline and stops leaking credentials on CLI args; adds registry-corruption backups and multiple unit tests.

Changes

Cohort / File(s) Summary
CI/CD Workflows
\.github/workflows/codeql.yaml, \.github/workflows/release.yaml
Add CodeQL scanning (PR + weekly) and a release workflow triggered on semver tags that runs tests, publishes to npm, and creates GitHub Releases with an autogenerated changelog.
Credentials & Validation
bin/lib/credentials.js
Add exported validateApiKey(key) (sync curl via spawnSync) and call it from ensureApiKey() to classify fatal vs non‑fatal API-key failures before persisting; fatal failures exit process.
Logging Module
bin/lib/logger.js
New leveled logger exporting LOG_LEVELS, setLevel, and debug; initializes from LOG_LEVEL/NEMOCLAW_LOG_LEVEL.
Runner & Redaction
bin/lib/runner.js, test/runner-redact.test.js
Add redactSecrets(str) and use it in run/runInteractive/runCapture to redact secret-like patterns in debug/error snippets; export redactSecrets and add unit tests.
Onboarding Pipeline & CLI
bin/lib/onboard.js, bin/nemoclaw.js, test/onboard-pipeline.test.js
Refactor onboarding into OnboardPipeline with ordered execution and reverse rollback on failure or signals; add signal handlers and ownership flags to avoid destroying externally created resources; stop passing NVIDIA key as visible CLI arg, using inherited env or NEMOCLAW_CRED; CLI supports global --verbose/--debug.
Registry Recovery
bin/lib/registry.js
On JSON load/parsing failure, create a timestamped <file>.corrupt.<ts> backup (best‑effort), log warnings referencing the backup, and return empty registry fallback.
Scripts / Provider Env
scripts/setup.sh
Avoid injecting NVIDIA_API_KEY into sandbox CLI env; rely on inherited env and document transient exposure; provider credential still sourced from NVIDIA_API_KEY.
Tests — Policies & Validation
test/policies.test.js, test/validate-api-key.test.js
Add tests for policies.extractPresetEntries, policies.parseCurrentPolicy, policies.getAppliedPresets, and structural/robustness tests for validateApiKey.
Tests — Pipeline & Runner
test/onboard-pipeline.test.js, test/runner-redact.test.js
Add tests verifying OnboardPipeline rollback semantics, signal-handler cleanup, and runner secret redaction behavior.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant CLI as "CLI (bin/nemoclaw.js)"
    participant Pipeline as "OnboardPipeline"
    participant Step as "Onboarding Step"
    participant Runner as "Runner (bin/lib/runner)"
    participant Env as "Subprocess Env (NEMOCLAW_CRED)"
    participant Rollback as "Rollback Handler"

    User->>CLI: run "nemoclaw onboard"
    CLI->>CLI: set log level if --debug/--verbose
    CLI->>Pipeline: pipeline.run(steps[])
    alt step succeeds
        Pipeline->>Step: execute()
        Step->>Runner: run(command)
        Runner->>Env: provide/inherit NEMOCLAW_CRED
        Runner-->>Step: success
        Step-->>Pipeline: mark complete
        Pipeline->>Step: execute(next)
    else step fails
        Step-->>Pipeline: throw/error
        Pipeline->>Rollback: iterate completedSteps (reverse)
        loop each completed step
            Rollback->>Runner: run(rollbackCommand)
            Runner-->>Rollback: rollback done (errors tolerated)
        end
        Pipeline-->>CLI: reject/error (exit or 130 if signal)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • kjw3

Poem

🐰 I hopped through logs and keys tonight,
I clipped the secrets out of sight,
Pipelines march with rollbacks primed,
Corrupt files saved, no data blind,
Burrow bright — the deploy felt right. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main changes: hardening credential handling and adding observability plus CI workflows, which are the primary focus of this security-focused PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (11)
.github/workflows/release.yaml (3)

73-74: Consider waiting for publish-npm before creating the GitHub Release.

Currently, github-release only depends on test, so it runs in parallel with publish-npm. This can result in a GitHub Release being published before the npm package is available, potentially confusing users who try to install immediately after seeing the release notification.

💡 Suggested fix to sequence jobs
 github-release:
-    needs: test
+    needs: [test, publish-npm]
     runs-on: ubuntu-latest
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/release.yaml around lines 73 - 74, The github-release job
currently only depends on test and may run before publish-npm; update the
workflow so github-release depends on both test and publish-npm (add publish-npm
to the needs list for the github-release job) to ensure the npm package is
published before creating the GitHub Release.

85-107: Style: consider grouping redirects (SC2129) and verify --grep pattern.

  1. The static analysis tool flags repeated >> /tmp/changelog.md redirects. Grouping them with { ...; } >> file is cleaner but optional.

  2. The --grep="^${type}" pattern anchors to the start of any line in the commit message, not just the subject line. For conventional commits like feat: add feature, this works, but could also match lines in the body that happen to start with feat. Consider using --grep="^${type}:" or --grep="^${type}(" for stricter matching.

💡 Minor improvements
-            commits=$(git log "${PREV_TAG}..HEAD" --pretty=format:"- %s (%h)" --grep="^${type}" 2>/dev/null || true)
+            commits=$(git log "${PREV_TAG}..HEAD" --pretty=format:"- %s (%h)" --grep="^${type}[:(]" 2>/dev/null || true)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/release.yaml around lines 85 - 107, The changelog
generation loop should avoid repeated redirections and tighten the commit-type
grep: wrap the block that appends each section into a grouped redirect (e.g., {
echo "### ${label}"; echo "$commits"; echo ""; } >> /tmp/changelog.md) to
address the SC2129 warnings, and change the grep pattern in the git log call
inside the for-loop from --grep="^${type}" to a stricter match like
--grep="^${type}:" (or --grep="^${type}(") so it only matches commit subjects;
update the loop variables (type_label, type, label, commits) and the git log
invocation accordingly while keeping the PREV_TAG logic unchanged.

63-66: Add error handling for subdirectory navigation.

If the nemoclaw directory doesn't exist or the build fails, the cd .. will still execute and the step may not fail as expected due to how compound commands work in shell.

💡 Suggested improvement
       - name: Install and build
         run: |
           npm install
-          cd nemoclaw && npm install && npm run build && cd ..
+          pushd nemoclaw && npm install && npm run build && popd

Alternatively, use set -e at the start or split into separate steps with working-directory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/release.yaml around lines 63 - 66, The "Install and build"
step runs a compound shell command that can mask failures when `nemoclaw` is
missing or build fails; fix it by either (A) prefixing the run block with `set
-e` so any command failure aborts the step (e.g., add `set -e` before `npm
install`), or (B) split into two GitHub Actions steps and use
`working-directory: nemoclaw` for the subproject (one step to run `npm
install`/`npm run build` in that directory) and keep the top-level `npm install`
in its own step—choose one approach and update the step named "Install and
build" accordingly.
.github/workflows/codeql.yaml (1)

8-12: Consider adding a push trigger for the default branch.

CodeQL analysis benefits from running on both PRs and pushes to the default branch. Pushes to main establish the baseline for comparison, ensuring new alerts are properly detected on PRs. Without it, the CodeQL dashboard may lack historical context.

💡 Suggested addition for baseline scanning
 on:
   pull_request:
     types: [opened, synchronize, reopened]
+  push:
+    branches: [main]
   schedule:
     - cron: "0 6 * * 1" # Every Monday at 06:00 UTC
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/codeql.yaml around lines 8 - 12, Add a push trigger to the
workflow's top-level on: block so CodeQL runs on pushes to the default branch
(in addition to the existing pull_request and schedule triggers); update the on:
configuration (near the existing pull_request and schedule keys) to include a
push: entry targeting the repository default branch (e.g., branches: [main]) so
baseline analysis is produced for the CodeQL dashboard.
test/policies.test.js (1)

200-207: Consider using path.basename for extension stripping.

replace(".yaml", "") only removes the first occurrence. Since path is already imported, using it is more robust and idiomatic.

♻️ Suggested improvement
     it("every preset name in YAML matches its filename", () => {
       for (const p of policies.listPresets()) {
-        const expectedName = p.file.replace(".yaml", "");
+        const expectedName = path.basename(p.file, ".yaml");
         assert.equal(p.name, expectedName, `${p.file}: YAML name '${p.name}' does not match filename`);
       }
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/policies.test.js` around lines 200 - 207, The test computes expectedName
by stripping ".yaml" via p.file.replace(".yaml", "") which can wrongly remove
the first occurrence; update the logic in the test (inside the "every preset
name in YAML matches its filename" case that iterates policies.listPresets()) to
use path.basename(p.file, ".yaml") (or path.parse/p.ext) to extract the filename
without extension so expectedName reliably matches p.name; ensure you keep the
existing assertion message unchanged.
bin/lib/onboard.js (1)

935-938: step.result is assigned but unused.

The result is stored on step.result but subsequent steps access shared state via the state object instead. Consider removing this dead assignment or documenting its intended use.

💡 Suggested cleanup
         const result = await step.execute();
         this.completedSteps.push(step);
-        if (result !== undefined) {
-          step.result = result;
-        }

If the intent is to preserve results for debugging, consider logging instead:

         const result = await step.execute();
         this.completedSteps.push(step);
+        // Results are tracked in shared `state` object passed to steps
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 935 - 938, The assignment to step.result
after this.completedSteps.push(step) is dead — subsequent code reads from the
shared state object, not from step.result — so either remove the conditional
assignment of step.result to avoid unused state, or if you intended to retain
per-step outputs for debugging/history, replace the assignment with an explicit
log or push the result into the existing state object used by other steps (e.g.,
update the state map) and document the choice; locate the logic around
completedSteps and the code paths that read from state to ensure consistency
(adjust references to step.result or state accordingly).
bin/lib/logger.js (1)

13-18: Consider warning on invalid log level.

setLevel() silently ignores invalid level names. A debug/warning message when an invalid level is passed would help users catch typos in configuration.

💡 Suggested improvement
 function setLevel(level) {
   const resolved = LOG_LEVELS[level];
   if (resolved !== undefined) {
     currentLevel = resolved;
+  } else if (level) {
+    console.error(`  [warn] Unknown log level: ${level}. Valid: ${Object.keys(LOG_LEVELS).join(", ")}`);
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/logger.js` around lines 13 - 18, setLevel currently ignores unknown
level keys (LOG_LEVELS[level]) with no feedback; update setLevel to detect when
resolved is undefined and emit a warning (e.g., console.warn or the module's
warning logger) that includes the invalid level value and a list or example of
valid levels, while keeping the existing behavior of updating currentLevel when
resolved is defined; reference symbols: setLevel, LOG_LEVELS, currentLevel.
bin/lib/runner.js (1)

29-33: Redaction prefix for nvapi- tokens may reveal too much.

Showing the first 8 characters (nvapi-xx***) still exposes part of the key. Consider a shorter prefix or fixed mask for direct token patterns.

💡 Suggested improvement
       const eqIdx = match.indexOf("=");
       if (eqIdx > 0) return match.slice(0, eqIdx + 1) + "***";
-      return match.slice(0, 8) + "***";
+      // For bare tokens like nvapi-xxx, show only the prefix type
+      if (match.startsWith("nvapi-")) return "nvapi-***";
+      return match.slice(0, 8) + "***";
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/runner.js` around lines 29 - 33, The current redaction callback in
result.replace exposes too many characters for keys like those starting with
"nvapi-" by returning match.slice(0, 8) + "***"; update the callback in the
result.replace(...) handler to treat nvapi tokens specially: check if match
startsWith("nvapi-") (or contains "nvapi-" before the value) and return a fixed
mask such as "nvapi-***" or just "***" instead of exposing 8 chars; keep the
existing eqIdx logic for other key=value patterns (use match.slice(0, eqIdx + 1)
+ "***" when eqIdx > 0) and only alter the fallback branch that currently uses
match.slice(0, 8) to use the fixed/shorter mask for nvapi patterns.
bin/lib/credentials.js (2)

104-104: Unnecessary inline require.

child_process is already imported at line 7. Use spawnSync from the existing import instead of require("child_process").spawnSync.

💡 Suggested fix

Update line 7 to include spawnSync:

-const { execSync } = require("child_process");
+const { execSync, spawnSync } = require("child_process");

Then on line 104:

-    const result = require("child_process").spawnSync(
+    const result = spawnSync(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` at line 104, Replace the inline require on line 104
with the already-imported child_process usage: either add spawnSync to the top
import (e.g., const { spawnSync } = require('child_process')) or use the
existing child_process.spawnSync call instead of
require("child_process").spawnSync; update the call site that currently uses
spawnSync to reference the imported symbol (spawnSync or
child_process.spawnSync) so no inline require remains.

84-87: Fragile string matching for error classification.

Checking for "invalid" or "expired" substrings in the message string is brittle—a future message change could break this logic. Consider using a structured field like { ok: false, fatal: true } instead.

💡 Suggested improvement
-    if (httpCode === "401" || httpCode === "403") {
-      return { ok: false, message: "API key is invalid or expired. Check https://build.nvidia.com/settings/api-keys" };
+    if (httpCode === "401" || httpCode === "403") {
+      return { ok: false, fatal: true, message: "API key is invalid or expired. Check https://build.nvidia.com/settings/api-keys" };
     }

Then in ensureApiKey:

-    if (validation.message.includes("invalid") || validation.message.includes("expired")) {
+    if (validation.fatal) {
       console.error(`  ✗ ${validation.message}`);
       process.exit(1);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 84 - 87, The current error
classification uses fragile substring checks on validation.message; update the
validation API to return a structured flag (e.g., validation.fatal or
validation.code) and change the check in credentials.js to test that structured
field instead of string matching: replace the condition that tests
validation.message.includes("invalid") || validation.message.includes("expired")
with a check like validation.fatal (or a specific validation.code ===
"INVALID_KEY"/"EXPIRED_KEY"), and ensure callers such as ensureApiKey use the
new structured field to decide when to log and call process.exit(1).
bin/lib/registry.js (1)

19-21: Silent catch may hide backup failures.

If fs.copyFileSync fails (e.g., disk full, permission denied), the user is told a backup was created when it wasn't. Consider logging when the backup itself fails.

💡 Suggested improvement
     try {
       fs.copyFileSync(REGISTRY_FILE, backupFile);
-    } catch {}
+    } catch (backupErr) {
+      console.error(`  [warn] Could not back up corrupted registry: ${backupErr.message}`);
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/registry.js` around lines 19 - 21, The empty catch around
fs.copyFileSync(REGISTRY_FILE, backupFile) hides backup failures; update the
catch to capture the thrown error and log a clear error message (e.g., using
console.error or the module's logger) that includes the error details and the
backupFile path so users know the backup attempt failed; keep the try/catch
structure around fs.copyFileSync but replace the empty block with a logged error
and, if appropriate, rethrow or handle the failure path consistently with the
rest of registry handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 653-667: The temp credential file (credFile) is still exposed via
command substitution in the run call (`$(cat '${credFile}')`), so change the
invocation of the provider creation to avoid expanding the secret into the
process args: write the credential to credFile as you do, then call run (or use
a child_process spawn) to execute the openshell command without inlining the
file contents—pass the credential via stdin or use a CLI flag that accepts a
credential file (instead of `--credential "$(cat ...)"`), keep the fs.unlinkSync
cleanup in the finally block, and adjust the run invocation (the call site that
builds the `openshell provider create --name nvidia-nim ...` command) to read
the credential from stdin or a protected file path rather than via command
substitution.
- Around line 893-904: The async signal handler in _registerSignalHandlers uses
await this._rollback() but then immediately calls process.exit(130), which can
run before the async rollback completes; change the handler so it does not call
process.exit synchronously — for example, make _signalHandler non-async and call
this._rollback().then(() => process.exit(130)).catch(err => {
processLogger/error handling; process.exit(130); }) or otherwise ensure
process.exit(130) is invoked only after the Promise returned from _rollback
resolves; reference _registerSignalHandlers, _signalHandler and _rollback when
making the change.

In `@scripts/setup.sh`:
- Around line 122-132: The credential is still exposed via command substitution
when calling upsert_provider; instead export the secret into the environment
from the temp file and call upsert_provider without embedding the credential
string (remove the "$(cat "$CRED_FILE")" argument), e.g. read the file into
NVIDIA_API_KEY or export the file contents into the shell environment using
CRED_FILE and then invoke upsert_provider so it inherits the env var; after the
call, explicitly unset the NVIDIA_API_KEY and remove CRED_FILE to clean up;
update the block referencing CRED_FILE and the upsert_provider invocation
accordingly.

---

Nitpick comments:
In @.github/workflows/codeql.yaml:
- Around line 8-12: Add a push trigger to the workflow's top-level on: block so
CodeQL runs on pushes to the default branch (in addition to the existing
pull_request and schedule triggers); update the on: configuration (near the
existing pull_request and schedule keys) to include a push: entry targeting the
repository default branch (e.g., branches: [main]) so baseline analysis is
produced for the CodeQL dashboard.

In @.github/workflows/release.yaml:
- Around line 73-74: The github-release job currently only depends on test and
may run before publish-npm; update the workflow so github-release depends on
both test and publish-npm (add publish-npm to the needs list for the
github-release job) to ensure the npm package is published before creating the
GitHub Release.
- Around line 85-107: The changelog generation loop should avoid repeated
redirections and tighten the commit-type grep: wrap the block that appends each
section into a grouped redirect (e.g., { echo "### ${label}"; echo "$commits";
echo ""; } >> /tmp/changelog.md) to address the SC2129 warnings, and change the
grep pattern in the git log call inside the for-loop from --grep="^${type}" to a
stricter match like --grep="^${type}:" (or --grep="^${type}(") so it only
matches commit subjects; update the loop variables (type_label, type, label,
commits) and the git log invocation accordingly while keeping the PREV_TAG logic
unchanged.
- Around line 63-66: The "Install and build" step runs a compound shell command
that can mask failures when `nemoclaw` is missing or build fails; fix it by
either (A) prefixing the run block with `set -e` so any command failure aborts
the step (e.g., add `set -e` before `npm install`), or (B) split into two GitHub
Actions steps and use `working-directory: nemoclaw` for the subproject (one step
to run `npm install`/`npm run build` in that directory) and keep the top-level
`npm install` in its own step—choose one approach and update the step named
"Install and build" accordingly.

In `@bin/lib/credentials.js`:
- Line 104: Replace the inline require on line 104 with the already-imported
child_process usage: either add spawnSync to the top import (e.g., const {
spawnSync } = require('child_process')) or use the existing
child_process.spawnSync call instead of require("child_process").spawnSync;
update the call site that currently uses spawnSync to reference the imported
symbol (spawnSync or child_process.spawnSync) so no inline require remains.
- Around line 84-87: The current error classification uses fragile substring
checks on validation.message; update the validation API to return a structured
flag (e.g., validation.fatal or validation.code) and change the check in
credentials.js to test that structured field instead of string matching: replace
the condition that tests validation.message.includes("invalid") ||
validation.message.includes("expired") with a check like validation.fatal (or a
specific validation.code === "INVALID_KEY"/"EXPIRED_KEY"), and ensure callers
such as ensureApiKey use the new structured field to decide when to log and call
process.exit(1).

In `@bin/lib/logger.js`:
- Around line 13-18: setLevel currently ignores unknown level keys
(LOG_LEVELS[level]) with no feedback; update setLevel to detect when resolved is
undefined and emit a warning (e.g., console.warn or the module's warning logger)
that includes the invalid level value and a list or example of valid levels,
while keeping the existing behavior of updating currentLevel when resolved is
defined; reference symbols: setLevel, LOG_LEVELS, currentLevel.

In `@bin/lib/onboard.js`:
- Around line 935-938: The assignment to step.result after
this.completedSteps.push(step) is dead — subsequent code reads from the shared
state object, not from step.result — so either remove the conditional assignment
of step.result to avoid unused state, or if you intended to retain per-step
outputs for debugging/history, replace the assignment with an explicit log or
push the result into the existing state object used by other steps (e.g., update
the state map) and document the choice; locate the logic around completedSteps
and the code paths that read from state to ensure consistency (adjust references
to step.result or state accordingly).

In `@bin/lib/registry.js`:
- Around line 19-21: The empty catch around fs.copyFileSync(REGISTRY_FILE,
backupFile) hides backup failures; update the catch to capture the thrown error
and log a clear error message (e.g., using console.error or the module's logger)
that includes the error details and the backupFile path so users know the backup
attempt failed; keep the try/catch structure around fs.copyFileSync but replace
the empty block with a logged error and, if appropriate, rethrow or handle the
failure path consistently with the rest of registry handling.

In `@bin/lib/runner.js`:
- Around line 29-33: The current redaction callback in result.replace exposes
too many characters for keys like those starting with "nvapi-" by returning
match.slice(0, 8) + "***"; update the callback in the result.replace(...)
handler to treat nvapi tokens specially: check if match startsWith("nvapi-") (or
contains "nvapi-" before the value) and return a fixed mask such as "nvapi-***"
or just "***" instead of exposing 8 chars; keep the existing eqIdx logic for
other key=value patterns (use match.slice(0, eqIdx + 1) + "***" when eqIdx > 0)
and only alter the fallback branch that currently uses match.slice(0, 8) to use
the fixed/shorter mask for nvapi patterns.

In `@test/policies.test.js`:
- Around line 200-207: The test computes expectedName by stripping ".yaml" via
p.file.replace(".yaml", "") which can wrongly remove the first occurrence;
update the logic in the test (inside the "every preset name in YAML matches its
filename" case that iterates policies.listPresets()) to use
path.basename(p.file, ".yaml") (or path.parse/p.ext) to extract the filename
without extension so expectedName reliably matches p.name; ensure you keep the
existing assertion message unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a4fb94b-8937-444d-bbf9-7b6466a82cdd

📥 Commits

Reviewing files that changed from the base of the PR and between 3ba517d and e25c5ec.

📒 Files selected for processing (10)
  • .github/workflows/codeql.yaml
  • .github/workflows/release.yaml
  • bin/lib/credentials.js
  • bin/lib/logger.js
  • bin/lib/onboard.js
  • bin/lib/registry.js
  • bin/lib/runner.js
  • bin/nemoclaw.js
  • scripts/setup.sh
  • test/policies.test.js

@100rabhkr 100rabhkr marked this pull request as draft March 19, 2026 05:40
@100rabhkr 100rabhkr marked this pull request as ready for review March 19, 2026 05:46
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
scripts/setup.sh (1)

122-130: ⚠️ Potential issue | 🟠 Major

The key is still exposed on the openshell process argv.

"NVIDIA_API_KEY=$NVIDIA_API_KEY" removes the parent-shell leak, but upsert_provider() still passes the literal secret to openshell provider create/update --credential .... A concurrent ps can still recover it from the short-lived child process, so this hardening is incomplete until OpenShell supports file/stdin/env-based credential input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 122 - 130, The upsert_provider call currently
embeds the secret in the child process argv causing a transient ps leak; update
the upsert_provider implementation (and its call site) so credentials are not
passed as literal args to openshell provider create/update—instead provide them
via the child process environment or via a temp credential file/stdin and invoke
openshell without the secret argument (i.e., stop constructing strings like
"NVIDIA_API_KEY=$NVIDIA_API_KEY" for openshell); modify upsert_provider and the
openshell invocation to export or load the credential into the child environment
or feed it to openshell over stdin/secure temp file and ensure the literal
secret never appears in the command arguments.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 933-955: The run() catch block won't execute if helper functions
call process.exit directly (e.g., setupNim, setupInference, setupPolicies), so
replace direct process.exit() calls inside those helpers with throwing a
descriptive Error (or returning a rejected Promise) so the error propagates up
to run() and triggers _rollback(); alternatively, if immediate termination is
required, ensure those helpers call await this._rollback() and rethrow before
exiting. Update setupNim, setupInference, and setupPolicies to avoid terminating
the process themselves and let run() handle cleanup via _rollback().
- Around line 988-995: createSandbox may return an existing sandbox when the
user opts to keep it, so the rollback currently deletes any non-null
state.sandboxName and can remove a preexisting sandbox; fix by tracking whether
this run actually created/recreated the sandbox (e.g., add state.sandboxCreated
boolean). In the execute callback (where createSandbox is called), set
state.sandboxCreated = true only when the sandbox was newly created (modify
createSandbox to return creation metadata or detect pre-existence), and change
the rollback to delete and call registry.removeSandbox only if
state.sandboxCreated is true (leave existing sandboxes untouched).

---

Duplicate comments:
In `@scripts/setup.sh`:
- Around line 122-130: The upsert_provider call currently embeds the secret in
the child process argv causing a transient ps leak; update the upsert_provider
implementation (and its call site) so credentials are not passed as literal args
to openshell provider create/update—instead provide them via the child process
environment or via a temp credential file/stdin and invoke openshell without the
secret argument (i.e., stop constructing strings like
"NVIDIA_API_KEY=$NVIDIA_API_KEY" for openshell); modify upsert_provider and the
openshell invocation to export or load the credential into the child environment
or feed it to openshell over stdin/secure temp file and ensure the literal
secret never appears in the command arguments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 03f031fd-9ce0-40ea-9446-2036f907c8a7

📥 Commits

Reviewing files that changed from the base of the PR and between e25c5ec and 7cc24c0.

📒 Files selected for processing (3)
  • bin/lib/credentials.js
  • bin/lib/onboard.js
  • scripts/setup.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/credentials.js

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/validate-api-key.test.js (1)

9-23: Consider documenting test flakiness expectations.

These tests hit the real NVIDIA API and handle network variability via conditional assertions (Lines 20-22). While acceptable for integration-style tests, consider adding a // @flaky`` or similar annotation and ensuring CI handles transient failures gracefully (e.g., retry or skip on network errors).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/validate-api-key.test.js` around lines 9 - 23, The test using
validateApiKey calls the real NVIDIA API and contains conditional assertions
that can be flaky; annotate the test (e.g., add a comment like // `@flaky` or use
your test framework's retry/skip mechanism) on the it("returns { ok: false,
fatal: true } for an invalid key" ...) case and update CI/test runner config to
either retry transient network failures or skip the test when network is
unavailable so the validateApiKey integration test does not cause intermittent
CI failures.
bin/lib/credentials.js (1)

124-125: Minor: Missing message field on successful validation.

The docstring (Lines 96-102) states the return includes a message for "human-readable status," but the success case omits it. While callers currently only use message on failure, this inconsistency could cause issues if future code expects message to always exist.

🔧 Proposed fix
     if (httpCode === "200") {
-      return { ok: true, fatal: false };
+      return { ok: true, fatal: false, message: "API key validated successfully" };
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/credentials.js` around lines 124 - 125, The success branch that
returns { ok: true, fatal: false } when httpCode === "200" is missing the
human-readable message described in the function docstring; update that return
to include a message (e.g., "Validation successful" or similar) so the shape
matches failure results and callers can always rely on a message field while
keeping ok: true and fatal: false; modify the object returned in the httpCode
=== "200" branch accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@bin/lib/credentials.js`:
- Around line 124-125: The success branch that returns { ok: true, fatal: false
} when httpCode === "200" is missing the human-readable message described in the
function docstring; update that return to include a message (e.g., "Validation
successful" or similar) so the shape matches failure results and callers can
always rely on a message field while keeping ok: true and fatal: false; modify
the object returned in the httpCode === "200" branch accordingly.

In `@test/validate-api-key.test.js`:
- Around line 9-23: The test using validateApiKey calls the real NVIDIA API and
contains conditional assertions that can be flaky; annotate the test (e.g., add
a comment like // `@flaky` or use your test framework's retry/skip mechanism) on
the it("returns { ok: false, fatal: true } for an invalid key" ...) case and
update CI/test runner config to either retry transient network failures or skip
the test when network is unavailable so the validateApiKey integration test does
not cause intermittent CI failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5e76f0d4-7138-4a57-969b-cfd53f5da3d5

📥 Commits

Reviewing files that changed from the base of the PR and between 7cc24c0 and e4b6621.

📒 Files selected for processing (7)
  • bin/lib/credentials.js
  • bin/lib/logger.js
  • bin/lib/onboard.js
  • bin/lib/runner.js
  • test/onboard-pipeline.test.js
  • test/runner-redact.test.js
  • test/validate-api-key.test.js
🚧 Files skipped from review as they are similar to previous changes (2)
  • bin/lib/runner.js
  • bin/lib/logger.js

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/credentials.js`:
- Around line 77-87: The validation for the NVIDIA API key must run regardless
of whether the key came from a prompt, environment, or credentials.json; move
the validateApiKey(key) call and its handling (the console messages and
process.exit(1) for fatal) into the shared path immediately after the code
resolves the 'key' variable (i.e., before any early return that returns a cached
NVIDIA_API_KEY), so that validateApiKey, the 'validation.ok' /
'validation.fatal' branches, and the associated console logs are executed for
all sources of the key.
- Around line 103-138: The validateApiKey function currently only checks
result.stdout and treats empty output as a network/NVIDIA issue; update it to
explicitly inspect the spawnSync result object for execution errors
(result.error) and non-zero process exit (result.status) before interpreting
stdout: if result.error exists (e.g., ENOENT/curl missing or timeout), return a
clear failure object that includes the error.message and marks the validation as
a local/fatal execution problem; if result.status is non-zero but no stdout,
surface the exit code and stderr in the returned message; only fall back to the
previous network/NVIDIA logic when no execution error and stdout is present.
Ensure the returned shapes (ok, fatal, message) remain consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f8e2fb41-97dc-492f-b687-57ed60f6ab74

📥 Commits

Reviewing files that changed from the base of the PR and between e4b6621 and 1859e09.

📒 Files selected for processing (2)
  • bin/lib/credentials.js
  • test/validate-api-key.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/validate-api-key.test.js

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/credentials.js`:
- Around line 124-127: The error handling in the block returning { ok: false,
fatal: false, message: `Could not validate key (${reason}). Proceeding with
saved key.` } should special-case ENOENT to give a clearer message; update the
logic that computes reason (in the same scope where result.error is checked) to
first check if result.error.code === "ENOENT" and set reason to "curl is not
installed", otherwise keep the existing timeout check (result.error.code ===
"ETIMEDOUT" ? "timed out" : result.error.message || "unknown error"); ensure
this change is applied in the function/method that contains the result.error
handling in credentials.js so the returned message explicitly mentions curl is
missing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cc29d3cb-0f7c-41cf-840b-a2af19d39abb

📥 Commits

Reviewing files that changed from the base of the PR and between 1859e09 and 5394242.

📒 Files selected for processing (1)
  • bin/lib/credentials.js

@wscurran wscurran added enhancement: feature Use this label to identify requests for new capabilities in NemoClaw. security Something isn't secure NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). observability Use this label to improve NemoClaw logging, metrics, and tracing. priority: high Important issue that should be resolved in the next release labels Mar 19, 2026
- Fix API key exposure in process arguments visible via ps aux (NVIDIA#325).
  Credentials are now passed via temp files (mode 0600) or environment
  inheritance instead of CLI args in onboard.js, nemoclaw.js, and setup.sh.
- Add secret redaction to runner error output to prevent accidental
  leakage of API keys in logs and terminal output.
- Validate NVIDIA API key against the API during onboard before saving,
  catching invalid or expired keys early instead of silent agent failure.
- Add --verbose / --debug CLI flag and LOG_LEVEL env var for diagnostic
  output. All shell commands are logged in debug mode with redacted secrets.
- Add OnboardPipeline state machine with SIGINT/SIGTERM signal handlers
  that automatically roll back completed steps (gateway, sandbox, NIM)
  on interruption or failure, preventing orphaned resources.
- Improve registry corruption recovery: corrupted sandboxes.json is
  backed up with a timestamp and the user is warned with actionable guidance.
- Expand policies.js test coverage from 14 to 25 tests, adding coverage
  for extractPresetEntries, parseCurrentPolicy, and YAML schema validation.
- Add GitHub Actions release pipeline (release.yaml) triggered on semver
  tags: runs tests, publishes to npm, generates changelog from conventional
  commits, and creates a GitHub Release.
- Add CodeQL security scanning workflow (codeql.yaml) for JavaScript/
  TypeScript and Python on PRs and a weekly schedule.
- credentials.js: pass curl auth header via stdin (-H @-) instead of
  CLI argument to prevent API key visibility in ps aux.
- onboard.js: replace temp-file command substitution with env var
  reference for openshell --credential flag. The literal value still
  appears in the transient openshell subprocess args (unavoidable
  without upstream --credential-file support), but no longer in the
  parent bash process args or shell history.
- onboard.js: make signal handler synchronous to guarantee rollback
  completes before process.exit(). All rollback actions use spawnSync
  so no async work is needed.
- setup.sh: remove unnecessary temp file indirection for credential
  passing — use direct env var reference instead.
- Remove redundant sandboxEnv; NVIDIA_API_KEY inherits via process.env.
- Fix sandbox rollback name mismatch: snapshot all registry entries by
  name before createSandbox, then compare the returned name's createdAt
  to detect "kept existing" vs "newly created" (interactive mode bug).
- Track state.gatewayStarted flag so rollback only destroys gateways
  this run created, not pre-existing ones serving other sandboxes.
- Strip logger.js to debug-only (remove unused info/warn/error exports
  whose prefix format conflicts with existing console output style).
- Move redactSecrets regex patterns inside the function to avoid shared
  /g lastIndex state across calls.
- Add fatal flag to validateApiKey return value; ensureApiKey checks
  result.fatal instead of fragile string matching on message text.
- Remove dead async _rollback wrapper — all rollback actions are
  synchronous (spawnSync), so the async indirection was misleading.
- Add 20 new tests: redactSecrets (9 tests), OnboardPipeline (7 tests),
  validateApiKey (4 tests). Total test count: 175, 0 failures.
- Include message field in validateApiKey success return for consistent
  { ok, fatal, message } shape across all code paths.
- Annotate real-API test as @flaky with explanation of network-dependent
  conditional assertions.
Check result.error before inspecting stdout so curl-missing and
timeout failures are reported accurately instead of being mislabeled
as NVIDIA API network issues.
- Special-case ENOENT in validateApiKey to show "curl is not installed"
  instead of the raw "spawnSync curl ENOENT" message.
- Import spawnSync at module top alongside execSync instead of inline
  require("child_process").spawnSync inside the function.
- Add explanatory comment to empty catch block in registry backup to
  clarify the best-effort intent.
@100rabhkr 100rabhkr force-pushed the security/harden-credentials-and-observability branch from f3cded8 to 37bb222 Compare March 22, 2026 06:57
- Remove unused isVerbose export from logger.js (nothing imports it).
- Drop unused err binding in registry.js catch block.
- Restore inline comments stripped during merge conflict resolution:
  streamSandboxCreate rationale, sandbox ready-wait context, orphan
  cleanup explanation.
- Inline rawArgs variable in nemoclaw.js (used only once).
@100rabhkr
Copy link
Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Mar 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

1082-1104: ⚠️ Potential issue | 🟠 Major

Rollback still bypassed by process.exit() calls in step helpers.

The pipeline's catch block and _rollback() mechanism won't execute when step helpers terminate via process.exit(). Several helpers still call process.exit(1) directly:

  • setupNim() (Lines 666, 704, 779)
  • setupInference() (Lines 819, 839, 859)
  • setupPolicies() (Lines 929, 939, 947, 951)
  • createSandbox() (Lines 489, 499, 561, 592)

These failures will skip rollback entirely, leaving partial onboarding state behind.

Consider throwing errors instead of calling process.exit() directly, allowing the pipeline to catch them and perform proper cleanup.

,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1082 - 1104, Step helper functions call
process.exit(1) which bypasses the run() catch/_rollback flow; update each
helper (createSandbox, setupNim, setupInference, setupPolicies) to stop calling
process.exit and instead throw an Error (or return a rejected Promise) with a
descriptive message so the error propagates back to Onboard.run's try/catch;
ensure any callers that relied on process.exit now either rethrow or let the
error bubble so Onboard.run can call _rollback() and then handle termination (if
desired) after cleanup.
🧹 Nitpick comments (1)
bin/nemoclaw.js (1)

183-200: Verify remote .env file permissions.

Credentials are securely written to a local temp file (mode 0o600) and cleaned up after scp, which is good. However, the destination /home/ubuntu/nemoclaw/.env file permissions depend on the remote umask.

Consider adding an explicit chmod 600 on the remote after the scp to ensure the .env file has restrictive permissions regardless of the remote system's default umask.

♻️ Proposed fix
   try {
     run(`scp -q -o StrictHostKeyChecking=no -o LogLevel=ERROR ${shellQuote(envTmp)} ${qname}:/home/ubuntu/nemoclaw/.env`);
+    run(`ssh -o StrictHostKeyChecking=no -o LogLevel=ERROR ${qname} 'chmod 600 /home/ubuntu/nemoclaw/.env'`);
   } finally {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 183 - 200, After copying the temp env file with
run(`scp ... ${qname}:/home/ubuntu/nemoclaw/.env`), ensure the remote file has
restrictive permissions by invoking a remote chmod; update the try block around
the scp to run an additional command like run(`ssh -o StrictHostKeyChecking=no
-o LogLevel=ERROR ${qname} 'chmod 600 /home/ubuntu/nemoclaw/.env'`) right after
the scp call so the remote .env permissions are explicitly set regardless of
remote umask, while keeping the existing cleanup of envTmp and envDir intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/release.yaml:
- Around line 73-75: The github-release job currently lists only needs: test,
which allows releases even if npm publishing fails; update the github-release
job definition to depend on publish-npm as well (e.g., change needs: test to
needs: [test, publish-npm] or add publish-npm alongside test) so the
github-release job waits for the publish-npm job to succeed before running; edit
the job block named github-release to reference the publish-npm job by its job
name.
- Line 14: The workflow's tag pattern string "v[0-9]+.[0-9]+.[0-9]+*" uses
regex-like syntax which glob matching in GitHub Actions doesn't support; replace
that pattern with a glob that matches semver tags such as "v*.*.*" (i.e., update
the tag pattern value used in the workflow trigger so tags like v1.2.3 will
correctly trigger the workflow).

In `@test/validate-api-key.test.js`:
- Around line 10-25: The test is flaky and misleading: instead of allowing any
outcome, change the test to be deterministic by stubbing the underlying system
call used by validateApiKey (e.g., stub/spyon the child_process.spawnSync or the
internal http/curl wrapper used by validateApiKey) to simulate a 401/403
response for the invalid key, then assert exactly that
validateApiKey("nvapi-INVALID_TEST_KEY_000000") returns { ok: false, fatal: true
} (and a message containing "invalid" or "expired"); alternatively, gate the
existing live-network test behind an environment flag (e.g., RUN_LIVE_TESTS) and
update the test name to reflect the mocked behavior so CI only runs the
deterministic assertion when not running live.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1082-1104: Step helper functions call process.exit(1) which
bypasses the run() catch/_rollback flow; update each helper (createSandbox,
setupNim, setupInference, setupPolicies) to stop calling process.exit and
instead throw an Error (or return a rejected Promise) with a descriptive message
so the error propagates back to Onboard.run's try/catch; ensure any callers that
relied on process.exit now either rethrow or let the error bubble so Onboard.run
can call _rollback() and then handle termination (if desired) after cleanup.

---

Nitpick comments:
In `@bin/nemoclaw.js`:
- Around line 183-200: After copying the temp env file with run(`scp ...
${qname}:/home/ubuntu/nemoclaw/.env`), ensure the remote file has restrictive
permissions by invoking a remote chmod; update the try block around the scp to
run an additional command like run(`ssh -o StrictHostKeyChecking=no -o
LogLevel=ERROR ${qname} 'chmod 600 /home/ubuntu/nemoclaw/.env'`) right after the
scp call so the remote .env permissions are explicitly set regardless of remote
umask, while keeping the existing cleanup of envTmp and envDir intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bed89b33-26ad-46ca-b2d7-e5ff698f16a7

📥 Commits

Reviewing files that changed from the base of the PR and between 1859e09 and 44970dc.

📒 Files selected for processing (13)
  • .github/workflows/codeql.yaml
  • .github/workflows/release.yaml
  • bin/lib/credentials.js
  • bin/lib/logger.js
  • bin/lib/onboard.js
  • bin/lib/registry.js
  • bin/lib/runner.js
  • bin/nemoclaw.js
  • scripts/setup.sh
  • test/onboard-pipeline.test.js
  • test/policies.test.js
  • test/runner-redact.test.js
  • test/validate-api-key.test.js
✅ Files skipped from review due to trivial changes (7)
  • test/runner-redact.test.js
  • bin/lib/registry.js
  • .github/workflows/codeql.yaml
  • bin/lib/logger.js
  • test/onboard-pipeline.test.js
  • test/policies.test.js
  • bin/lib/runner.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/credentials.js

- release.yaml: fix tag pattern from regex-style v[0-9]+.[0-9]+.[0-9]+*
  to glob-style v*.*.* — GitHub Actions uses globs, not regex.
- release.yaml: chain github-release after publish-npm so releases are
  not created when npm publish fails.
- validate-api-key.test.js: test return shape and invariants only, not
  specific ok/fatal values, since the /v1/models endpoint may return
  200 for any key (public listing). Removes false-pass risk.
@100rabhkr
Copy link
Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Mar 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/validate-api-key.test.js (1)

15-27: Consolidate duplicate tests.

These two tests verify essentially the same thing—that the return value has the { ok, fatal, message } shape. Test 1 (lines 15-20) already checks that ok and fatal are booleans and message exists; Test 2 (lines 22-27) only adds weaker "in" checks that are implied by Test 1 passing.

Consider removing Test 2 or merging them into a single, more comprehensive test.

♻️ Suggested consolidation
-  it("returns a well-formed result for any key", () => {
+  it("returns a well-formed { ok, fatal, message } result", () => {
     const result = validateApiKey("nvapi-INVALID_TEST_KEY_000000");
     assert.equal(typeof result.ok, "boolean");
     assert.equal(typeof result.fatal, "boolean");
-    assert.ok("message" in result, "must have message field");
-  });
-
-  it("always returns the { ok, fatal, message } shape", () => {
-    const result = validateApiKey("nvapi-test");
-    assert.ok("ok" in result, "must have ok field");
-    assert.ok("fatal" in result, "must have fatal field");
-    assert.ok("message" in result, "must have message field");
+    assert.equal(typeof result.message, "string", "message must be a string");
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/validate-api-key.test.js` around lines 15 - 27, The two tests both
assert the same return shape from validateApiKey; remove the redundant second
test or merge its weaker checks into the first to consolidate. Specifically,
keep the more comprehensive test that calls validateApiKey (function name
validateApiKey) and asserts typeof result.ok === "boolean", typeof result.fatal
=== "boolean", and that "message" in result; delete the second it block that
only checks existence of keys, or replace it by adding any additional input
cases into the single test so all shape assertions are covered in one it block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/validate-api-key.test.js`:
- Around line 29-34: The test currently passes vacuously when
validateApiKey("nvapi-anything") returns ok: false; change the assertion to be
unconditional and assert the invariant directly by checking that not (result.ok
&& result.fatal) so the test fails if validateApiKey ever returns ok: true with
fatal: true; locate the test case using validateApiKey in the "it" block and
replace the conditional assertion with a single assert that !(result.ok &&
result.fatal).

---

Nitpick comments:
In `@test/validate-api-key.test.js`:
- Around line 15-27: The two tests both assert the same return shape from
validateApiKey; remove the redundant second test or merge its weaker checks into
the first to consolidate. Specifically, keep the more comprehensive test that
calls validateApiKey (function name validateApiKey) and asserts typeof result.ok
=== "boolean", typeof result.fatal === "boolean", and that "message" in result;
delete the second it block that only checks existence of keys, or replace it by
adding any additional input cases into the single test so all shape assertions
are covered in one it block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 61e3056a-da01-48ce-94e8-9ff07153a933

📥 Commits

Reviewing files that changed from the base of the PR and between 44970dc and bb38862.

📒 Files selected for processing (2)
  • .github/workflows/release.yaml
  • test/validate-api-key.test.js
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/release.yaml

Comment on lines +29 to +34
it("never returns fatal: true when ok: true", () => {
const result = validateApiKey("nvapi-anything");
if (result.ok) {
assert.equal(result.fatal, false, "ok: true must have fatal: false");
}
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Test passes vacuously when ok: false.

If validateApiKey("nvapi-anything") returns { ok: false, ... } (which is likely for an invalid-looking key), the assertion inside the if block never runs and the test passes trivially. This doesn't actually verify the invariant.

To make this test meaningful, either:

  1. Mock validateApiKey to return { ok: true, fatal: false, message: "..." } and verify the invariant holds, or
  2. Invert the logic to assert !(result.ok && result.fatal) unconditionally.
🔧 Suggested fix using unconditional assertion
   it("never returns fatal: true when ok: true", () => {
     const result = validateApiKey("nvapi-anything");
-    if (result.ok) {
-      assert.equal(result.fatal, false, "ok: true must have fatal: false");
-    }
+    // Invariant: ok and fatal cannot both be true
+    assert.ok(!(result.ok && result.fatal), "ok: true must have fatal: false");
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/validate-api-key.test.js` around lines 29 - 34, The test currently
passes vacuously when validateApiKey("nvapi-anything") returns ok: false; change
the assertion to be unconditional and assert the invariant directly by checking
that not (result.ok && result.fatal) so the test fails if validateApiKey ever
returns ok: true with fatal: true; locate the test case using validateApiKey in
the "it" block and replace the conditional assertion with a single assert that
!(result.ok && result.fatal).

@ericksoa
Copy link
Contributor

Thanks for the comprehensive work here @100rabhkr. The API key exposure fix is covered by #330 (landing now). We've split the remaining security and feature pieces into separate issues so they can land independently:

@ericksoa ericksoa closed this Mar 22, 2026
@ericksoa
Copy link
Contributor

If you're interested in continuing, the CodeQL workflow and the onboard rollback pipeline would both be great as standalone PRs — easier to review and merge when the concerns are separate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement: feature Use this label to identify requests for new capabilities in NemoClaw. NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). observability Use this label to improve NemoClaw logging, metrics, and tracing. priority: high Important issue that should be resolved in the next release security Something isn't secure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants