security: harden credential handling, add observability and CI workflows #390
100rabhkr wants to merge 8 commits into NVIDIA:main from
Conversation
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough

Adds CodeQL and release GitHub workflows; introduces a leveled logger and secret redaction in runner logs; validates NVIDIA API keys before saving; refactors onboarding into an undoable, signal-safe OnboardPipeline and stops leaking credentials on CLI args; adds registry-corruption backups and multiple unit tests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant CLI as "CLI (bin/nemoclaw.js)"
    participant Pipeline as "OnboardPipeline"
    participant Step as "Onboarding Step"
    participant Runner as "Runner (bin/lib/runner)"
    participant Env as "Subprocess Env (NEMOCLAW_CRED)"
    participant Rollback as "Rollback Handler"
    User->>CLI: run "nemoclaw onboard"
    CLI->>CLI: set log level if --debug/--verbose
    CLI->>Pipeline: pipeline.run(steps[])
    alt step succeeds
        Pipeline->>Step: execute()
        Step->>Runner: run(command)
        Runner->>Env: provide/inherit NEMOCLAW_CRED
        Runner-->>Step: success
        Step-->>Pipeline: mark complete
        Pipeline->>Step: execute(next)
    else step fails
        Step-->>Pipeline: throw/error
        Pipeline->>Rollback: iterate completedSteps (reverse)
        loop each completed step
            Rollback->>Runner: run(rollbackCommand)
            Runner-->>Rollback: rollback done (errors tolerated)
        end
        Pipeline-->>CLI: reject/error (exit or 130 if signal)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches: 🧪 Generate unit tests (beta)
Actionable comments posted: 4
🧹 Nitpick comments (11)
.github/workflows/release.yaml (3)
73-74: Consider waiting for `publish-npm` before creating the GitHub Release.

Currently, `github-release` only depends on `test`, so it runs in parallel with `publish-npm`. This can result in a GitHub Release being published before the npm package is available, potentially confusing users who try to install immediately after seeing the release notification.

💡 Suggested fix to sequence jobs

```diff
   github-release:
-    needs: test
+    needs: [test, publish-npm]
     runs-on: ubuntu-latest
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/release.yaml around lines 73 - 74, The github-release job currently only depends on test and may run before publish-npm; update the workflow so github-release depends on both test and publish-npm (add publish-npm to the needs list for the github-release job) to ensure the npm package is published before creating the GitHub Release.
85-107: Style: consider grouping redirects (SC2129) and verify the `--grep` pattern.

The static analysis tool flags repeated `>> /tmp/changelog.md` redirects. Grouping them with `{ ...; } >> file` is cleaner but optional.

The `--grep="^${type}"` pattern anchors to the start of any line in the commit message, not just the subject line. For conventional commits like `feat: add feature`, this works, but it could also match lines in the body that happen to start with `feat`. Consider using `--grep="^${type}:"` or `--grep="^${type}("` for stricter matching.

💡 Minor improvements

```diff
-          commits=$(git log "${PREV_TAG}..HEAD" --pretty=format:"- %s (%h)" --grep="^${type}" 2>/dev/null || true)
+          commits=$(git log "${PREV_TAG}..HEAD" --pretty=format:"- %s (%h)" --grep="^${type}[:(]" 2>/dev/null || true)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/release.yaml around lines 85 - 107, The changelog generation loop should avoid repeated redirections and tighten the commit-type grep: wrap the block that appends each section into a grouped redirect (e.g., { echo "### ${label}"; echo "$commits"; echo ""; } >> /tmp/changelog.md) to address the SC2129 warnings, and change the grep pattern in the git log call inside the for-loop from --grep="^${type}" to a stricter match like --grep="^${type}:" (or --grep="^${type}(") so it only matches commit subjects; update the loop variables (type_label, type, label, commits) and the git log invocation accordingly while keeping the PREV_TAG logic unchanged.
63-66: Add error handling for subdirectory navigation.

If the `nemoclaw` directory doesn't exist or the build fails, the `cd ..` will still execute and the step may not fail as expected due to how compound commands work in shell.

💡 Suggested improvement

```diff
       - name: Install and build
         run: |
           npm install
-          cd nemoclaw && npm install && npm run build && cd ..
+          pushd nemoclaw && npm install && npm run build && popd
```

Alternatively, use `set -e` at the start or split into separate steps with `working-directory`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/release.yaml around lines 63 - 66, The "Install and build" step runs a compound shell command that can mask failures when `nemoclaw` is missing or build fails; fix it by either (A) prefixing the run block with `set -e` so any command failure aborts the step (e.g., add `set -e` before `npm install`), or (B) split into two GitHub Actions steps and use `working-directory: nemoclaw` for the subproject (one step to run `npm install`/`npm run build` in that directory) and keep the top-level `npm install` in its own step—choose one approach and update the step named "Install and build" accordingly.

.github/workflows/codeql.yaml (1)
8-12: Consider adding a `push` trigger for the default branch.

CodeQL analysis benefits from running on both PRs and pushes to the default branch. Pushes to `main` establish the baseline for comparison, ensuring new alerts are properly detected on PRs. Without it, the CodeQL dashboard may lack historical context.

💡 Suggested addition for baseline scanning

```diff
 on:
   pull_request:
     types: [opened, synchronize, reopened]
+  push:
+    branches: [main]
   schedule:
     - cron: "0 6 * * 1" # Every Monday at 06:00 UTC
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/codeql.yaml around lines 8 - 12, Add a push trigger to the workflow's top-level on: block so CodeQL runs on pushes to the default branch (in addition to the existing pull_request and schedule triggers); update the on: configuration (near the existing pull_request and schedule keys) to include a push: entry targeting the repository default branch (e.g., branches: [main]) so baseline analysis is produced for the CodeQL dashboard.

test/policies.test.js (1)
200-207: Consider using `path.basename` for extension stripping.

`replace(".yaml", "")` only removes the first occurrence. Since `path` is already imported, using it is more robust and idiomatic.

♻️ Suggested improvement
```diff
   it("every preset name in YAML matches its filename", () => {
     for (const p of policies.listPresets()) {
-      const expectedName = p.file.replace(".yaml", "");
+      const expectedName = path.basename(p.file, ".yaml");
       assert.equal(p.name, expectedName, `${p.file}: YAML name '${p.name}' does not match filename`);
     }
   });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/policies.test.js` around lines 200 - 207, The test computes expectedName by stripping ".yaml" via p.file.replace(".yaml", "") which can wrongly remove the first occurrence; update the logic in the test (inside the "every preset name in YAML matches its filename" case that iterates policies.listPresets()) to use path.basename(p.file, ".yaml") (or path.parse/p.ext) to extract the filename without extension so expectedName reliably matches p.name; ensure you keep the existing assertion message unchanged.

bin/lib/onboard.js (1)
935-938: `step.result` is assigned but unused.

The result is stored on `step.result`, but subsequent steps access shared state via the `state` object instead. Consider removing this dead assignment or documenting its intended use.

💡 Suggested cleanup

```diff
       const result = await step.execute();
       this.completedSteps.push(step);
-      if (result !== undefined) {
-        step.result = result;
-      }
```

If the intent is to preserve results for debugging, consider logging instead:

```diff
       const result = await step.execute();
       this.completedSteps.push(step);
+      // Results are tracked in shared `state` object passed to steps
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 935 - 938, The assignment to step.result after this.completedSteps.push(step) is dead — subsequent code reads from the shared state object, not from step.result — so either remove the conditional assignment of step.result to avoid unused state, or if you intended to retain per-step outputs for debugging/history, replace the assignment with an explicit log or push the result into the existing state object used by other steps (e.g., update the state map) and document the choice; locate the logic around completedSteps and the code paths that read from state to ensure consistency (adjust references to step.result or state accordingly).

bin/lib/logger.js (1)
13-18: Consider warning on invalid log level.

`setLevel()` silently ignores invalid level names. A debug/warning message when an invalid level is passed would help users catch typos in configuration.

💡 Suggested improvement
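A minimal runnable sketch of this suggestion (the `LOG_LEVELS`/`currentLevel` names follow the review; the warning sink is an array here purely so the behavior is observable — the real logger would write to stderr):

```javascript
// Leveled-logger sketch: unknown level names produce a warning instead
// of being silently ignored; known names still update the level.
const LOG_LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };
let currentLevel = LOG_LEVELS.info;
const warnings = [];

function setLevel(level) {
  const resolved = LOG_LEVELS[level];
  if (resolved !== undefined) {
    currentLevel = resolved;
  } else if (level) {
    warnings.push(`Unknown log level: ${level}. Valid: ${Object.keys(LOG_LEVELS).join(", ")}`);
  }
}

setLevel("debug"); // valid: level becomes 3
setLevel("trace"); // typo: warning recorded, level unchanged
console.log(currentLevel, warnings[0]);
```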
```diff
 function setLevel(level) {
   const resolved = LOG_LEVELS[level];
   if (resolved !== undefined) {
     currentLevel = resolved;
+  } else if (level) {
+    console.error(`  [warn] Unknown log level: ${level}. Valid: ${Object.keys(LOG_LEVELS).join(", ")}`);
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/logger.js` around lines 13 - 18, setLevel currently ignores unknown level keys (LOG_LEVELS[level]) with no feedback; update setLevel to detect when resolved is undefined and emit a warning (e.g., console.warn or the module's warning logger) that includes the invalid level value and a list or example of valid levels, while keeping the existing behavior of updating currentLevel when resolved is defined; reference symbols: setLevel, LOG_LEVELS, currentLevel.

bin/lib/runner.js (1)
29-33: Redaction prefix for `nvapi-` tokens may reveal too much.

Showing the first 8 characters (`nvapi-xx***`) still exposes part of the key. Consider a shorter prefix or fixed mask for direct token patterns.

💡 Suggested improvement
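The masking rule can be sketched standalone (the regex below is an assumption for illustration; the actual pattern in `bin/lib/runner.js` may differ):

```javascript
// Redaction sketch: key=value secrets keep only the key, and bare
// nvapi- tokens collapse to a fixed "nvapi-***" mask instead of
// leaking their first 8 characters.
const SECRET_RE = /[A-Z_]*API_KEY=\S+|nvapi-[A-Za-z0-9_-]+/g;

function redact(line) {
  return line.replace(SECRET_RE, (match) => {
    const eqIdx = match.indexOf("=");
    if (eqIdx > 0) return match.slice(0, eqIdx + 1) + "***";
    if (match.startsWith("nvapi-")) return "nvapi-***";
    return match.slice(0, 8) + "***";
  });
}

console.log(redact("NVIDIA_API_KEY=nvapi-abc123 --token nvapi-deadbeef"));
// NVIDIA_API_KEY=*** --token nvapi-***
```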
```diff
     const eqIdx = match.indexOf("=");
     if (eqIdx > 0) return match.slice(0, eqIdx + 1) + "***";
-    return match.slice(0, 8) + "***";
+    // For bare tokens like nvapi-xxx, show only the prefix type
+    if (match.startsWith("nvapi-")) return "nvapi-***";
+    return match.slice(0, 8) + "***";
   });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/runner.js` around lines 29 - 33, The current redaction callback in result.replace exposes too many characters for keys like those starting with "nvapi-" by returning match.slice(0, 8) + "***"; update the callback in the result.replace(...) handler to treat nvapi tokens specially: check if match startsWith("nvapi-") (or contains "nvapi-" before the value) and return a fixed mask such as "nvapi-***" or just "***" instead of exposing 8 chars; keep the existing eqIdx logic for other key=value patterns (use match.slice(0, eqIdx + 1) + "***" when eqIdx > 0) and only alter the fallback branch that currently uses match.slice(0, 8) to use the fixed/shorter mask for nvapi patterns.

bin/lib/credentials.js (2)
104-104: Unnecessary inline require.

`child_process` is already imported at line 7. Use `spawnSync` from the existing import instead of `require("child_process").spawnSync`.

💡 Suggested fix

Update line 7 to include `spawnSync`:

```diff
-const { execSync } = require("child_process");
+const { execSync, spawnSync } = require("child_process");
```

Then on line 104:

```diff
-  const result = require("child_process").spawnSync(
+  const result = spawnSync(
```
Verify each finding against the current code and only fix it if needed. In `@bin/lib/credentials.js` at line 104, Replace the inline require on line 104 with the already-imported child_process usage: either add spawnSync to the top import (e.g., const { spawnSync } = require('child_process')) or use the existing child_process.spawnSync call instead of require("child_process").spawnSync; update the call site that currently uses spawnSync to reference the imported symbol (spawnSync or child_process.spawnSync) so no inline require remains.
84-87: Fragile string matching for error classification.

Checking for `"invalid"` or `"expired"` substrings in the message string is brittle—a future message change could break this logic. Consider using a structured field like `{ ok: false, fatal: true }` instead.

💡 Suggested improvement
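The structured shape can be sketched standalone (an illustration under the review's proposed `{ ok, fatal, message }` contract; the function body is an assumption, not the actual credentials.js code):

```javascript
// Classification rides on a `fatal` flag, so callers never grep the
// human-readable message text.
function classifyValidation(httpCode) {
  if (httpCode === "200") {
    return { ok: true, fatal: false, message: "API key validated" };
  }
  if (httpCode === "401" || httpCode === "403") {
    return { ok: false, fatal: true, message: "API key is invalid or expired." };
  }
  return { ok: false, fatal: false, message: `Unexpected HTTP ${httpCode}; possibly a transient network issue.` };
}

const v = classifyValidation("401");
console.log(v.fatal ? "abort onboarding" : "retry later"); // abort onboarding
```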
```diff
-  if (httpCode === "401" || httpCode === "403") {
-    return { ok: false, message: "API key is invalid or expired. Check https://build.nvidia.com/settings/api-keys" };
+  if (httpCode === "401" || httpCode === "403") {
+    return { ok: false, fatal: true, message: "API key is invalid or expired. Check https://build.nvidia.com/settings/api-keys" };
   }
```

Then in `ensureApiKey`:

```diff
-  if (validation.message.includes("invalid") || validation.message.includes("expired")) {
+  if (validation.fatal) {
     console.error(`   ✗ ${validation.message}`);
     process.exit(1);
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/credentials.js` around lines 84 - 87, The current error classification uses fragile substring checks on validation.message; update the validation API to return a structured flag (e.g., validation.fatal or validation.code) and change the check in credentials.js to test that structured field instead of string matching: replace the condition that tests validation.message.includes("invalid") || validation.message.includes("expired") with a check like validation.fatal (or a specific validation.code === "INVALID_KEY"/"EXPIRED_KEY"), and ensure callers such as ensureApiKey use the new structured field to decide when to log and call process.exit(1).

bin/lib/registry.js (1)
19-21: Silent catch may hide backup failures.

If `fs.copyFileSync` fails (e.g., disk full, permission denied), the user is told a backup was created when it wasn't. Consider logging when the backup itself fails.

💡 Suggested improvement
```diff
   try {
     fs.copyFileSync(REGISTRY_FILE, backupFile);
-  } catch {}
+  } catch (backupErr) {
+    console.error(`   [warn] Could not back up corrupted registry: ${backupErr.message}`);
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/registry.js` around lines 19 - 21, The empty catch around fs.copyFileSync(REGISTRY_FILE, backupFile) hides backup failures; update the catch to capture the thrown error and log a clear error message (e.g., using console.error or the module's logger) that includes the error details and the backupFile path so users know the backup attempt failed; keep the try/catch structure around fs.copyFileSync but replace the empty block with a logged error and, if appropriate, rethrow or handle the failure path consistently with the rest of registry handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 653-667: The temp credential file (credFile) is still exposed via
command substitution in the run call (`$(cat '${credFile}')`), so change the
invocation of the provider creation to avoid expanding the secret into the
process args: write the credential to credFile as you do, then call run (or use
a child_process spawn) to execute the openshell command without inlining the
file contents—pass the credential via stdin or use a CLI flag that accepts a
credential file (instead of `--credential "$(cat ...)"`), keep the fs.unlinkSync
cleanup in the finally block, and adjust the run invocation (the call site that
builds the `openshell provider create --name nvidia-nim ...` command) to read
the credential from stdin or a protected file path rather than via command
substitution.
- Around line 893-904: The async signal handler in _registerSignalHandlers uses
await this._rollback() but then immediately calls process.exit(130), which can
run before the async rollback completes; change the handler so it does not call
process.exit synchronously — for example, make _signalHandler non-async and call
this._rollback().then(() => process.exit(130)).catch(err => {
processLogger/error handling; process.exit(130); }) or otherwise ensure
process.exit(130) is invoked only after the Promise returned from _rollback
resolves; reference _registerSignalHandlers, _signalHandler and _rollback when
making the change.
In `@scripts/setup.sh`:
- Around line 122-132: The credential is still exposed via command substitution
when calling upsert_provider; instead export the secret into the environment
from the temp file and call upsert_provider without embedding the credential
string (remove the "$(cat "$CRED_FILE")" argument), e.g. read the file into
NVIDIA_API_KEY or export the file contents into the shell environment using
CRED_FILE and then invoke upsert_provider so it inherits the env var; after the
call, explicitly unset the NVIDIA_API_KEY and remove CRED_FILE to clean up;
update the block referencing CRED_FILE and the upsert_provider invocation
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6a4fb94b-8937-444d-bbf9-7b6466a82cdd
📒 Files selected for processing (10)
- .github/workflows/codeql.yaml
- .github/workflows/release.yaml
- bin/lib/credentials.js
- bin/lib/logger.js
- bin/lib/onboard.js
- bin/lib/registry.js
- bin/lib/runner.js
- bin/nemoclaw.js
- scripts/setup.sh
- test/policies.test.js
Actionable comments posted: 2
♻️ Duplicate comments (1)
scripts/setup.sh (1)
122-130: ⚠️ Potential issue | 🟠 Major

The key is still exposed on the `openshell` process argv.

`"NVIDIA_API_KEY=$NVIDIA_API_KEY"` removes the parent-shell leak, but `upsert_provider()` still passes the literal secret to `openshell provider create/update --credential ...`. A concurrent `ps` can still recover it from the short-lived child process, so this hardening is incomplete until OpenShell supports file/stdin/env-based credential input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup.sh` around lines 122 - 130, The upsert_provider call currently embeds the secret in the child process argv causing a transient ps leak; update the upsert_provider implementation (and its call site) so credentials are not passed as literal args to openshell provider create/update—instead provide them via the child process environment or via a temp credential file/stdin and invoke openshell without the secret argument (i.e., stop constructing strings like "NVIDIA_API_KEY=$NVIDIA_API_KEY" for openshell); modify upsert_provider and the openshell invocation to export or load the credential into the child environment or feed it to openshell over stdin/secure temp file and ensure the literal secret never appears in the command arguments.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 933-955: The run() catch block won't execute if helper functions
call process.exit directly (e.g., setupNim, setupInference, setupPolicies), so
replace direct process.exit() calls inside those helpers with throwing a
descriptive Error (or returning a rejected Promise) so the error propagates up
to run() and triggers _rollback(); alternatively, if immediate termination is
required, ensure those helpers call await this._rollback() and rethrow before
exiting. Update setupNim, setupInference, and setupPolicies to avoid terminating
the process themselves and let run() handle cleanup via _rollback().
- Around line 988-995: createSandbox may return an existing sandbox when the
user opts to keep it, so the rollback currently deletes any non-null
state.sandboxName and can remove a preexisting sandbox; fix by tracking whether
this run actually created/recreated the sandbox (e.g., add state.sandboxCreated
boolean). In the execute callback (where createSandbox is called), set
state.sandboxCreated = true only when the sandbox was newly created (modify
createSandbox to return creation metadata or detect pre-existence), and change
the rollback to delete and call registry.removeSandbox only if
state.sandboxCreated is true (leave existing sandboxes untouched).
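The first inline comment above — helpers throw rather than call `process.exit`, so `run()` can drive `_rollback()` — can be sketched standalone (class and method names follow the review; the bodies are assumptions, not the PR's actual code):

```javascript
// Pipeline sketch: a failing step's error propagates to run(), which
// rolls back completed steps in reverse order before re-throwing.
class OnboardPipeline {
  constructor() {
    this.completedSteps = [];
    this.rolledBack = [];
  }

  async _rollback() {
    for (const step of [...this.completedSteps].reverse()) {
      try {
        if (step.rollback) await step.rollback();
      } catch {
        // rollback errors tolerated, mirroring the runner behavior
      }
      this.rolledBack.push(step.name);
    }
  }

  async run(steps) {
    try {
      for (const step of steps) {
        await step.execute(); // helpers throw instead of calling process.exit
        this.completedSteps.push(step);
      }
    } catch (err) {
      await this._rollback(); // cleanup completes before the error surfaces
      throw err;
    }
  }
}

const pipeline = new OnboardPipeline();
pipeline
  .run([
    { name: "nim", execute: async () => {}, rollback: async () => {} },
    { name: "inference", execute: async () => { throw new Error("setup failed"); } },
  ])
  .catch((err) => console.log(`${err.message}; rolled back: ${pipeline.rolledBack.join(", ")}`));
```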
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 03f031fd-9ce0-40ea-9446-2036f907c8a7
📒 Files selected for processing (3)
- bin/lib/credentials.js
- bin/lib/onboard.js
- scripts/setup.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- bin/lib/credentials.js
🧹 Nitpick comments (2)
test/validate-api-key.test.js (1)
9-23: Consider documenting test flakiness expectations.

These tests hit the real NVIDIA API and handle network variability via conditional assertions (Lines 20-22). While acceptable for integration-style tests, consider adding a `// @flaky` or similar annotation and ensuring CI handles transient failures gracefully (e.g., retry or skip on network errors).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/validate-api-key.test.js` around lines 9 - 23, The test using validateApiKey calls the real NVIDIA API and contains conditional assertions that can be flaky; annotate the test (e.g., add a comment like // `@flaky` or use your test framework's retry/skip mechanism) on the it("returns { ok: false, fatal: true } for an invalid key" ...) case and update CI/test runner config to either retry transient network failures or skip the test when network is unavailable so the validateApiKey integration test does not cause intermittent CI failures.

bin/lib/credentials.js (1)
124-125: Minor: Missing `message` field on successful validation.

The docstring (Lines 96-102) states the return includes a `message` for "human-readable status," but the success case omits it. While callers currently only use `message` on failure, this inconsistency could cause issues if future code expects `message` to always exist.

🔧 Proposed fix

```diff
   if (httpCode === "200") {
-    return { ok: true, fatal: false };
+    return { ok: true, fatal: false, message: "API key validated successfully" };
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/credentials.js` around lines 124 - 125, The success branch that returns { ok: true, fatal: false } when httpCode === "200" is missing the human-readable message described in the function docstring; update that return to include a message (e.g., "Validation successful" or similar) so the shape matches failure results and callers can always rely on a message field while keeping ok: true and fatal: false; modify the object returned in the httpCode === "200" branch accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5e76f0d4-7138-4a57-969b-cfd53f5da3d5
📒 Files selected for processing (7)
- bin/lib/credentials.js
- bin/lib/logger.js
- bin/lib/onboard.js
- bin/lib/runner.js
- test/onboard-pipeline.test.js
- test/runner-redact.test.js
- test/validate-api-key.test.js
🚧 Files skipped from review as they are similar to previous changes (2)
- bin/lib/runner.js
- bin/lib/logger.js
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/credentials.js`:
- Around line 77-87: The validation for the NVIDIA API key must run regardless
of whether the key came from a prompt, environment, or credentials.json; move
the validateApiKey(key) call and its handling (the console messages and
process.exit(1) for fatal) into the shared path immediately after the code
resolves the 'key' variable (i.e., before any early return that returns a cached
NVIDIA_API_KEY), so that validateApiKey, the 'validation.ok' /
'validation.fatal' branches, and the associated console logs are executed for
all sources of the key.
- Around line 103-138: The validateApiKey function currently only checks
result.stdout and treats empty output as a network/NVIDIA issue; update it to
explicitly inspect the spawnSync result object for execution errors
(result.error) and non-zero process exit (result.status) before interpreting
stdout: if result.error exists (e.g., ENOENT/curl missing or timeout), return a
clear failure object that includes the error.message and marks the validation as
a local/fatal execution problem; if result.status is non-zero but no stdout,
surface the exit code and stderr in the returned message; only fall back to the
previous network/NVIDIA logic when no execution error and stdout is present.
Ensure the returned shapes (ok, fatal, message) remain consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f8e2fb41-97dc-492f-b687-57ed60f6ab74
📒 Files selected for processing (2)
bin/lib/credentials.js
test/validate-api-key.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/validate-api-key.test.js
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/credentials.js`:
- Around line 124-127: The error handling in the block returning { ok: false,
fatal: false, message: `Could not validate key (${reason}). Proceeding with
saved key.` } should special-case ENOENT to give a clearer message; update the
logic that computes reason (in the same scope where result.error is checked) to
first check if result.error.code === "ENOENT" and set reason to "curl is not
installed", otherwise keep the existing timeout check (result.error.code ===
"ETIMEDOUT" ? "timed out" : result.error.message || "unknown error"); ensure
this change is applied in the function/method that contains the result.error
handling in credentials.js so the returned message explicitly mentions curl is
missing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cc29d3cb-0f7c-41cf-840b-a2af19d39abb
📒 Files selected for processing (1)
bin/lib/credentials.js
- Fix API key exposure in process arguments visible via ps aux (NVIDIA#325). Credentials are now passed via temp files (mode 0600) or environment inheritance instead of CLI args in onboard.js, nemoclaw.js, and setup.sh.
- Add secret redaction to runner error output to prevent accidental leakage of API keys in logs and terminal output.
- Validate NVIDIA API key against the API during onboard before saving, catching invalid or expired keys early instead of silent agent failure.
- Add --verbose / --debug CLI flag and LOG_LEVEL env var for diagnostic output. All shell commands are logged in debug mode with redacted secrets.
- Add OnboardPipeline state machine with SIGINT/SIGTERM signal handlers that automatically roll back completed steps (gateway, sandbox, NIM) on interruption or failure, preventing orphaned resources.
- Improve registry corruption recovery: corrupted sandboxes.json is backed up with a timestamp and the user is warned with actionable guidance.
- Expand policies.js test coverage from 14 to 25 tests, adding coverage for extractPresetEntries, parseCurrentPolicy, and YAML schema validation.
- Add GitHub Actions release pipeline (release.yaml) triggered on semver tags: runs tests, publishes to npm, generates changelog from conventional commits, and creates a GitHub Release.
- Add CodeQL security scanning workflow (codeql.yaml) for JavaScript/TypeScript and Python on PRs and a weekly schedule.
- credentials.js: pass curl auth header via stdin (-H @-) instead of CLI argument to prevent API key visibility in ps aux.
- onboard.js: replace temp-file command substitution with env var reference for the openshell --credential flag. The literal value still appears in the transient openshell subprocess args (unavoidable without upstream --credential-file support), but no longer in the parent bash process args or shell history.
- onboard.js: make signal handler synchronous to guarantee rollback completes before process.exit(). All rollback actions use spawnSync so no async work is needed.
- setup.sh: remove unnecessary temp file indirection for credential passing — use direct env var reference instead.
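The `-H @-` technique from the first bullet can be sketched as follows; `buildCurlInvocation` and the exact curl flags are illustrative assumptions, not the repository's code. The point is that the secret travels over stdin, so it never enters the argument vector that `ps aux` shows:

```javascript
// Sketch: build a curl invocation whose Authorization header is read from
// stdin via `-H @-` (curl >= 7.55), keeping the key out of argv entirely.
function buildCurlInvocation(apiKey) {
  const args = [
    "-s", "-o", "/dev/null",
    "-w", "%{http_code}",
    "--max-time", "10",
    "-H", "@-", // read header line(s) from stdin
    "https://integrate.api.nvidia.com/v1/models",
  ];
  return { args, input: `Authorization: Bearer ${apiKey}\n` };
}

const { args, input } = buildCurlInvocation("nvapi-example");
console.log(args.includes("nvapi-example") ? "leaked" : "safe"); // safe
// To actually run it:
//   require("child_process").spawnSync("curl", args, { input, encoding: "utf8" })
```

A process listing during the request would show only the literal `-H @-`, never the key.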
- Remove redundant sandboxEnv; NVIDIA_API_KEY inherits via process.env.
- Fix sandbox rollback name mismatch: snapshot all registry entries by name before createSandbox, then compare the returned name's createdAt to detect "kept existing" vs "newly created" (interactive mode bug).
- Track state.gatewayStarted flag so rollback only destroys gateways this run created, not pre-existing ones serving other sandboxes.
- Strip logger.js to debug-only (remove unused info/warn/error exports whose prefix format conflicts with existing console output style).
- Move redactSecrets regex patterns inside the function to avoid shared /g lastIndex state across calls.
- Add fatal flag to validateApiKey return value; ensureApiKey checks result.fatal instead of fragile string matching on message text.
- Remove dead async _rollback wrapper — all rollback actions are synchronous (spawnSync), so the async indirection was misleading.
- Add 20 new tests: redactSecrets (9 tests), OnboardPipeline (7 tests), validateApiKey (4 tests). Total test count: 175, 0 failures.
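The `/g` `lastIndex` hazard behind the redactSecrets change is easy to reproduce; `redact` below is a simplified stand-in for the real function, with a single illustrative pattern:

```javascript
// A module-level /g regex carries lastIndex between calls, so repeated
// .test() calls on identical input alternate true/false.
const shared = /nvapi-[A-Za-z0-9_-]+/g;
const first = shared.test("key: nvapi-abc123");  // true; lastIndex advances
const second = shared.test("key: nvapi-abc123"); // false: stale lastIndex

// Constructing the regex per call, as the fix does, removes the shared state.
function redact(text) {
  const pattern = /nvapi-[A-Za-z0-9_-]+/g; // fresh lastIndex on every call
  return text.replace(pattern, "[REDACTED]");
}

console.log(first, second);               // true false
console.log(redact("key: nvapi-abc123")); // key: [REDACTED]
```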
- Include message field in validateApiKey success return for consistent
{ ok, fatal, message } shape across all code paths.
- Annotate real-API test as @flaky with explanation of network-dependent
conditional assertions.
Check result.error before inspecting stdout so curl-missing and timeout failures are reported accurately instead of being mislabeled as NVIDIA API network issues.
- Special-case ENOENT in validateApiKey to show "curl is not installed"
instead of the raw "spawnSync curl ENOENT" message.
- Import spawnSync at module top alongside execSync instead of inline
require("child_process").spawnSync inside the function.
- Add explanatory comment to empty catch block in registry backup to
clarify the best-effort intent.
Force-pushed f3cded8 to 37bb222
- Remove unused isVerbose export from logger.js (nothing imports it).
- Drop unused err binding in registry.js catch block.
- Restore inline comments stripped during merge conflict resolution: streamSandboxCreate rationale, sandbox ready-wait context, orphan cleanup explanation.
- Inline rawArgs variable in nemoclaw.js (used only once).
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
♻️ Duplicate comments (1)
bin/lib/onboard.js (1)
1082-1104: ⚠️ Potential issue | 🟠 Major

Rollback still bypassed by `process.exit()` calls in step helpers.

The pipeline's catch block and `_rollback()` mechanism won't execute when step helpers terminate via `process.exit()`. Several helpers still call `process.exit(1)` directly:

- `setupNim()` (lines 666, 704, 779)
- `setupInference()` (lines 819, 839, 859)
- `setupPolicies()` (lines 929, 939, 947, 951)
- `createSandbox()` (lines 489, 499, 561, 592)

These failures will skip rollback entirely, leaving partial onboarding state behind.

Consider throwing errors instead of calling `process.exit()` directly, allowing the pipeline to catch them and perform proper cleanup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1082 - 1104, Step helper functions call process.exit(1) which bypasses the run() catch/_rollback flow; update each helper (createSandbox, setupNim, setupInference, setupPolicies) to stop calling process.exit and instead throw an Error (or return a rejected Promise) with a descriptive message so the error propagates back to Onboard.run's try/catch; ensure any callers that relied on process.exit now either rethrow or let the error bubble so Onboard.run can call _rollback() and then handle termination (if desired) after cleanup.
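The suggested pattern can be sketched with hypothetical step and pipeline shapes (not the project's actual `OnboardPipeline` API): a thrown Error reaches the pipeline's catch block and triggers reverse-order rollback, while `process.exit(1)` would skip it entirely.

```javascript
// Minimal sketch: steps throw on failure so the pipeline can unwind.
class Pipeline {
  constructor() { this.completed = []; }
  run(steps) {
    try {
      for (const step of steps) {
        step.execute();
        this.completed.push(step);
      }
    } catch (err) {
      // Reverse-order, best-effort rollback; a failing rollback is tolerated.
      for (const step of [...this.completed].reverse()) {
        try { if (step.rollback) step.rollback(); } catch (_) {}
      }
      throw err; // propagate after cleanup instead of exiting mid-flight
    }
  }
}

const order = [];
try {
  new Pipeline().run([
    { execute: () => order.push("sandbox"), rollback: () => order.push("undo-sandbox") },
    { execute: () => order.push("gateway"), rollback: () => order.push("undo-gateway") },
    { execute: () => { throw new Error("NIM setup failed"); } }, // not process.exit(1)
  ]);
} catch (err) {
  console.log(order.join(" -> ")); // sandbox -> gateway -> undo-gateway -> undo-sandbox
}
```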
🧹 Nitpick comments (1)
bin/nemoclaw.js (1)
183-200: Verify remote `.env` file permissions.

Credentials are securely written to a local temp file (mode `0o600`) and cleaned up after scp, which is good. However, the destination `/home/ubuntu/nemoclaw/.env` file's permissions depend on the remote umask.

Consider adding an explicit `chmod 600` on the remote after the scp to ensure the `.env` file has restrictive permissions regardless of the remote system's default umask.

♻️ Proposed fix

```diff
   try {
     run(`scp -q -o StrictHostKeyChecking=no -o LogLevel=ERROR ${shellQuote(envTmp)} ${qname}:/home/ubuntu/nemoclaw/.env`);
+    run(`ssh -o StrictHostKeyChecking=no -o LogLevel=ERROR ${qname} 'chmod 600 /home/ubuntu/nemoclaw/.env'`);
   } finally {
```
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 183 - 200, After copying the temp env file with run(`scp ... ${qname}:/home/ubuntu/nemoclaw/.env`), ensure the remote file has restrictive permissions by invoking a remote chmod; update the try block around the scp to run an additional command like run(`ssh -o StrictHostKeyChecking=no -o LogLevel=ERROR ${qname} 'chmod 600 /home/ubuntu/nemoclaw/.env'`) right after the scp call so the remote .env permissions are explicitly set regardless of remote umask, while keeping the existing cleanup of envTmp and envDir intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/release.yaml:
- Around line 73-75: The github-release job currently lists only needs: test,
which allows releases even if npm publishing fails; update the github-release
job definition to depend on publish-npm as well (e.g., change needs: test to
needs: [test, publish-npm] or add publish-npm alongside test) so the
github-release job waits for the publish-npm job to succeed before running; edit
the job block named github-release to reference the publish-npm job by its job
name.
- Line 14: The workflow's tag pattern string "v[0-9]+.[0-9]+.[0-9]+*" uses
regex-like syntax which glob matching in GitHub Actions doesn't support; replace
that pattern with a glob that matches semver tags such as "v*.*.*" (i.e., update
the tag pattern value used in the workflow trigger so tags like v1.2.3 will
correctly trigger the workflow).
In `@test/validate-api-key.test.js`:
- Around line 10-25: The test is flaky and misleading: instead of allowing any
outcome, change the test to be deterministic by stubbing the underlying system
call used by validateApiKey (e.g., stub/spyon the child_process.spawnSync or the
internal http/curl wrapper used by validateApiKey) to simulate a 401/403
response for the invalid key, then assert exactly that
validateApiKey("nvapi-INVALID_TEST_KEY_000000") returns { ok: false, fatal: true
} (and a message containing "invalid" or "expired"); alternatively, gate the
existing live-network test behind an environment flag (e.g., RUN_LIVE_TESTS) and
update the test name to reflect the mocked behavior so CI only runs the
deterministic assertion when not running live.
---
Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1082-1104: Step helper functions call process.exit(1) which
bypasses the run() catch/_rollback flow; update each helper (createSandbox,
setupNim, setupInference, setupPolicies) to stop calling process.exit and
instead throw an Error (or return a rejected Promise) with a descriptive message
so the error propagates back to Onboard.run's try/catch; ensure any callers that
relied on process.exit now either rethrow or let the error bubble so Onboard.run
can call _rollback() and then handle termination (if desired) after cleanup.
---
Nitpick comments:
In `@bin/nemoclaw.js`:
- Around line 183-200: After copying the temp env file with run(`scp ...
${qname}:/home/ubuntu/nemoclaw/.env`), ensure the remote file has restrictive
permissions by invoking a remote chmod; update the try block around the scp to
run an additional command like run(`ssh -o StrictHostKeyChecking=no -o
LogLevel=ERROR ${qname} 'chmod 600 /home/ubuntu/nemoclaw/.env'`) right after the
scp call so the remote .env permissions are explicitly set regardless of remote
umask, while keeping the existing cleanup of envTmp and envDir intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bed89b33-26ad-46ca-b2d7-e5ff698f16a7
📒 Files selected for processing (13)
.github/workflows/codeql.yaml
.github/workflows/release.yaml
bin/lib/credentials.js
bin/lib/logger.js
bin/lib/onboard.js
bin/lib/registry.js
bin/lib/runner.js
bin/nemoclaw.js
scripts/setup.sh
test/onboard-pipeline.test.js
test/policies.test.js
test/runner-redact.test.js
test/validate-api-key.test.js
✅ Files skipped from review due to trivial changes (7)
- test/runner-redact.test.js
- bin/lib/registry.js
- .github/workflows/codeql.yaml
- bin/lib/logger.js
- test/onboard-pipeline.test.js
- test/policies.test.js
- bin/lib/runner.js
🚧 Files skipped from review as they are similar to previous changes (1)
- bin/lib/credentials.js
- release.yaml: fix tag pattern from regex-style v[0-9]+.[0-9]+.[0-9]+* to glob-style v*.*.* — GitHub Actions uses globs, not regex.
- release.yaml: chain github-release after publish-npm so releases are not created when npm publish fails.
- validate-api-key.test.js: test return shape and invariants only, not specific ok/fatal values, since the /v1/models endpoint may return 200 for any key (public listing). Removes false-pass risk.
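The first two fixes can be sketched as a workflow fragment; the job names `test` and `publish-npm` come from the review discussion, and the rest is illustrative:

```yaml
on:
  push:
    tags:
      - "v*.*.*"   # glob syntax: matches v1.2.3; a regex-style "+" never matches

jobs:
  github-release:
    needs: [test, publish-npm]   # wait for npm publish before creating the Release
```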
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/validate-api-key.test.js (1)
15-27: Consolidate duplicate tests.

These two tests verify essentially the same thing: that the return value has the `{ ok, fatal, message }` shape. Test 1 (lines 15-20) already checks that `ok` and `fatal` are booleans and `message` exists; Test 2 (lines 22-27) only adds weaker `"in"` checks that are implied by Test 1 passing.

Consider removing Test 2 or merging them into a single, more comprehensive test.

♻️ Suggested consolidation

```diff
-  it("returns a well-formed result for any key", () => {
+  it("returns a well-formed { ok, fatal, message } result", () => {
     const result = validateApiKey("nvapi-INVALID_TEST_KEY_000000");
     assert.equal(typeof result.ok, "boolean");
     assert.equal(typeof result.fatal, "boolean");
-    assert.ok("message" in result, "must have message field");
-  });
-
-  it("always returns the { ok, fatal, message } shape", () => {
-    const result = validateApiKey("nvapi-test");
-    assert.ok("ok" in result, "must have ok field");
-    assert.ok("fatal" in result, "must have fatal field");
-    assert.ok("message" in result, "must have message field");
+    assert.equal(typeof result.message, "string", "message must be a string");
   });
```
Verify each finding against the current code and only fix it if needed. In `@test/validate-api-key.test.js` around lines 15 - 27, The two tests both assert the same return shape from validateApiKey; remove the redundant second test or merge its weaker checks into the first to consolidate. Specifically, keep the more comprehensive test that calls validateApiKey (function name validateApiKey) and asserts typeof result.ok === "boolean", typeof result.fatal === "boolean", and that "message" in result; delete the second it block that only checks existence of keys, or replace it by adding any additional input cases into the single test so all shape assertions are covered in one it block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/validate-api-key.test.js`:
- Around line 29-34: The test currently passes vacuously when
validateApiKey("nvapi-anything") returns ok: false; change the assertion to be
unconditional and assert the invariant directly by checking that not (result.ok
&& result.fatal) so the test fails if validateApiKey ever returns ok: true with
fatal: true; locate the test case using validateApiKey in the "it" block and
replace the conditional assertion with a single assert that !(result.ok &&
result.fatal).
---
Nitpick comments:
In `@test/validate-api-key.test.js`:
- Around line 15-27: The two tests both assert the same return shape from
validateApiKey; remove the redundant second test or merge its weaker checks into
the first to consolidate. Specifically, keep the more comprehensive test that
calls validateApiKey (function name validateApiKey) and asserts typeof result.ok
=== "boolean", typeof result.fatal === "boolean", and that "message" in result;
delete the second it block that only checks existence of keys, or replace it by
adding any additional input cases into the single test so all shape assertions
are covered in one it block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 61e3056a-da01-48ce-94e8-9ff07153a933
📒 Files selected for processing (2)
.github/workflows/release.yaml
test/validate-api-key.test.js
✅ Files skipped from review due to trivial changes (1)
- .github/workflows/release.yaml
```js
it("never returns fatal: true when ok: true", () => {
  const result = validateApiKey("nvapi-anything");
  if (result.ok) {
    assert.equal(result.fatal, false, "ok: true must have fatal: false");
  }
});
```
Test passes vacuously when ok: false.

If `validateApiKey("nvapi-anything")` returns `{ ok: false, ... }` (which is likely for an invalid-looking key), the assertion inside the if block never runs and the test passes trivially. This doesn't actually verify the invariant.

To make this test meaningful, either:

- Mock `validateApiKey` to return `{ ok: true, fatal: false, message: "..." }` and verify the invariant holds, or
- Invert the logic to assert `!(result.ok && result.fatal)` unconditionally.
🔧 Suggested fix using unconditional assertion

```diff
 it("never returns fatal: true when ok: true", () => {
   const result = validateApiKey("nvapi-anything");
-  if (result.ok) {
-    assert.equal(result.fatal, false, "ok: true must have fatal: false");
-  }
+  // Invariant: ok and fatal cannot both be true
+  assert.ok(!(result.ok && result.fatal), "ok: true must have fatal: false");
 });
```
});🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/validate-api-key.test.js` around lines 29 - 34, The test currently
passes vacuously when validateApiKey("nvapi-anything") returns ok: false; change
the assertion to be unconditional and assert the invariant directly by checking
that not (result.ok && result.fatal) so the test fails if validateApiKey ever
returns ok: true with fatal: true; locate the test case using validateApiKey in
the "it" block and replace the conditional assertion with a single assert that
!(result.ok && result.fatal).
Thanks for the comprehensive work here @100rabhkr. The API key exposure fix is covered by #330 (landing now). We've split the remaining security and feature pieces into separate issues so they can land independently:
If you're interested in continuing, the CodeQL workflow and the onboard rollback pipeline would both be great as standalone PRs — easier to review and merge when the concerns are separate. |
Summary
Fixes the critical API key exposure vulnerability (#325) and adds hardening across security, observability, reliability, testing, and CI/CD.
Security
- No secrets in ps aux — Credentials passed through env var inheritance and stdin (curl -H @-) instead of CLI arguments in onboard.js, nemoclaw.js, and setup.sh. Runner error messages auto-redact known secret patterns.
- API key validated before saving (fatal flag) instead of causing silent agent failure after 15-minute setup. Network failures warn but don't block.
- --verbose / --debug flag — Global CLI flag and LOG_LEVEL env var. Logs every shell command (with secrets redacted) to stderr.
- OnboardPipeline state machine with synchronous SIGINT/SIGTERM handlers. Completed steps are rolled back in reverse order (gateway destroy, sandbox delete, NIM stop).
- Sandbox rollback distinguishes "kept existing" from "newly created" (comparing createdAt by returned name). Gateway only rolled back if this run started it (gatewayStarted flag).
- Corrupted sandboxes.json backed up with timestamp, user warned with guidance.
New tests: redactSecrets (9), OnboardPipeline rollback (7), validateApiKey (4), policies.js coverage expansion (11).
- release.yaml: test → npm publish → changelog from conventional commits → GitHub Release.
- codeql.yaml: JS/TS + Python analysis on PRs and weekly.
- redactSecrets: 9 tests covering all secret patterns, empty input, consecutive calls
- OnboardPipeline: 7 tests covering success, failure rollback order, null rollback, throwing rollback, cleanup
- validateApiKey: 4 tests covering return shape, fatal invariant, empty key
- --verbose flag shows in help output and activates debug logging
- nemoclaw onboard with Ctrl+C verifies rollback cleanup

Files changed (14)
- bin/lib/credentials.js: validateApiKey() with { ok, fatal, message } return; integrated into ensureApiKey()
- bin/lib/logger.js: leveled logger (debug, setLevel, isVerbose)
- bin/lib/onboard.js: OnboardPipeline with sync rollback, ownership tracking (gatewayStarted, sandboxCreated)
- bin/lib/registry.js: timestamped backup of corrupted sandboxes.json
- bin/lib/runner.js: redactSecrets() with per-call regex instances; verbose command logging
- bin/nemoclaw.js: --verbose/--debug flag parsing, help text, setupSpark credential fix
- scripts/setup.sh
- test/runner-redact.test.js
- test/onboard-pipeline.test.js
- test/validate-api-key.test.js
- test/policies.test.js: extractPresetEntries, parseCurrentPolicy, schema
- .github/workflows/release.yaml
- .github/workflows/codeql.yaml

Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests