
Feat/compute warning skill #160

Merged
bbsngg merged 2 commits into main from feat/compute-warning-skill
Apr 23, 2026

Conversation

@zli12321
Collaborator

  • Add a new aris-compute-guard skill that performs a mandatory pre-flight check of GPU/compute resources (CUDA, MPS, remote SSH, Vast.ai, Modal) before any experiment execution
  • Inject a compute guard block into the system prompt of all AI providers (Claude, OpenRouter, Gemini CLI/API, Local GPU) so the agent always checks resources before running experiments — even for natural-language requests
  • Add GET /api/compute/check-available REST endpoint and compute_check_available MCP tool for programmatic compute availability checks
  • Gate aris-run-experiment (Step 0) and aris-experiment-bridge (Phase 0) with the mandatory compute guard, and update all project templates and skill registries

@zli12321 zli12321 requested review from Zhang-Henry and bbsngg April 11, 2026 02:55
Collaborator

@Zhang-Henry Zhang-Henry left a comment


Code Review

Direction is good — pre-flight compute check + consistent system-prompt injection across all 5 providers + clean skill registry updates (170 → 171, correctly slotted into experiment: base). The 14-line COMPUTE_GUARD_BLOCK keeps token cost low. A few things worth addressing:

Should-fix

  1. DRY violation: server/routes/compute.js:498-622 and server/mcp-compute.js:203-315 duplicate ~200 lines of near-identical CUDA/MPS detection logic (including the hard-coded 500 MiB threshold, GPU parsing, and remote/local branching). Any future change has to happen twice. Extract the logic into server/utils/computeCheck.js and import it from both.

  2. Error-message info leakage in server/routes/compute.js:616 — raw SSH err.message is returned to the client (Remote node "${node.name}" unreachable: ${err.message}). This can expose private key paths, host fingerprints, or network topology. Return a generic message in the JSON response and keep details in server logs only.

  3. Hard-coded 500 MiB "free GPU" threshold in both files. Make it configurable, e.g. COMPUTE_MIN_FREE_GPU_MB env var. 500 MB is too low on an H100 and may be too high on a small dev GPU.

  4. Template drift risk — the 19-line compute-guard block is copy-pasted verbatim into AGENTS.md, CLAUDE.md, GEMINI.md, and cursor-project.md. Any future tweak must touch all four files. Extract into a shared snippet (e.g. server/templates/_compute-guard-snippet.md) and include it, or generate templates from a single source.

  5. Vast.ai / Modal checks are documented in skills/aris-compute-guard/SKILL.md but not implemented in the route or MCP tool (only CUDA, MPS, and remote SSH are actually checked). Either implement them or remove from SKILL.md to avoid misleading users.
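Points 1 and 3 together suggest a shape like the following. This is a hypothetical sketch of what an extracted server/utils/computeCheck.js could export, with the nvidia-smi CSV parsing kept as a pure function and the threshold read from COMPUTE_MIN_FREE_GPU_MB; the function names and exported shape are assumptions, not the actual code.

```javascript
// Hypothetical sketch of an extracted server/utils/computeCheck.js.
// Threshold is configurable via env instead of hard-coded to 500.
const MIN_FREE_GPU_MB = parseInt(process.env.COMPUTE_MIN_FREE_GPU_MB || '500', 10);

// Parse output of:
//   nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv,noheader,nounits
function parseNvidiaSmiGpuList(output) {
  return output
    .trim()
    .split('\n')
    .filter(Boolean)
    .map((line) => {
      const [name, usedMb, totalMb] = line.split(',').map((s) => s.trim());
      return { name, usedMb: Number(usedMb), totalMb: Number(totalMb) };
    });
}

// A node is usable if at least one GPU has enough free memory.
function hasFreeGpu(gpus, minFreeMb = MIN_FREE_GPU_MB) {
  return gpus.some((g) => g.totalMb - g.usedMb >= minFreeMb);
}

module.exports = { parseNvidiaSmiGpuList, hasFreeGpu, MIN_FREE_GPU_MB };
```

Keeping the parser pure also addresses the testability nit below: it can be unit-tested without mocking a process spawn.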

Security (low/medium)

  • Python MPS probe swallows stderr: python3 -c "..." 2>/dev/null — if torch isn't installed, users get a misleading "no GPU" instead of "torch not installed". Log stderr separately for diagnosis.
  • SSH host-key handling — verify ComputeNode.run() sets -o StrictHostKeyChecking=accept-new (or similar), otherwise first-time remote checks will hang on TOFU prompts.

Nits

  • The guard prompt is concise but pretty polite. Consider a stronger marker like **CRITICAL SAFETY RULE:** to make it harder for the model to skip under heavy context.
  • No tests. Worth adding a small unit suite that mocks nvidia-smi / python / SSH outputs and exercises error paths (parse failures, timeouts, mixed CUDA+MPS environments).
  • GPU names are interpolated into markdown-ish strings in the MCP tool response without escaping. Unlikely to bite, but a GPU name containing a backtick (e.g. H100`something) would break the output formatting.
  • All 5 providers import and append the block, but the concatenation style varies slightly (some chain via .trim(), others raw +). Cosmetic, but worth unifying.

What's good

  • Registry updates are internally consistent across skill-tag-mapping.json, skills-catalog-v2.json, and stage-skill-map.json.
  • Route is correctly mounted behind authenticateToken.
  • Injection is uniform across all 5 providers — no provider skipped.
  • The skill's Step 0 gating in aris-run-experiment and aris-experiment-bridge is the right place to enforce this.

Overall: approve in direction, but would recommend extracting the shared check utility and sanitizing error responses before merge.

@zli12321
Collaborator Author

I have pushed changes with the following addressed:

  1. DRY violation fixed — Created server/utils/computeCheck.js as the single source of truth for CUDA/MPS detection, GPU parsing, and remote SSH checking. Both server/routes/compute.js (/check-available endpoint) and server/mcp-compute.js (compute_check_available tool) now delegate to it instead of duplicating ~200 lines.

  2. SSH error sanitization — checkRemoteCompute() in the shared utility logs the raw err.message to console.error but returns a generic "unreachable — check SSH connectivity and try again" to the client. The route's catch block also returns "Compute check failed — see server logs for details" instead of leaking internal error text.

  3. Configurable GPU threshold — The hard-coded 500 MiB is replaced with MIN_FREE_GPU_MB, read from process.env.COMPUTE_MIN_FREE_GPU_MB (defaults to 500). The "all GPUs occupied" message now dynamically includes the threshold value.

  4. Template drift eliminated — The 19-line compute-guard block is now a single file server/templates/_compute-guard-snippet.md. All 4 templates (CLAUDE.md, AGENTS.md, GEMINI.md, cursor-project.md) use a {{COMPUTE-GUARD}} placeholder. server/templates/index.js resolves snippets at write time.

  5. Vast.ai / Modal checks implemented — checkVastCompute() checks vast-instances.json and vastai show instances --raw. checkModalCompute() runs modal token verify. Both are accessible via checkComputeAvailability('vast') / checkComputeAvailability('modal'), aligning the code with what was documented in SKILL.md.

  6. Python MPS probe stderr — runLocal() in the shared utility captures err.stderr and logs it via console.warn instead of silently swallowing with 2>/dev/null. This surfaces "torch not installed" vs. "no MPS" properly.

  7. Stronger guard prompt marker — Changed from # Compute Resource Guard to CRITICAL SAFETY RULE — Compute Resource Guard in computeGuardPrompt.js, making it harder for models to skip under heavy context.

  8. Unified concatenation — All 5 providers now use the same pattern: [basePrompt, memoryBlock, COMPUTE_GUARD_BLOCK].filter(Boolean).join('\n\n').trim() instead of mixed raw + and .trim() approaches.

  9. GPU name escaping — escapeGpuName() in the shared utility strips backticks and backslashes from GPU names before interpolating into output strings, preventing formatting breaks.
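For reference, items 8 and 9 could reduce to two small helpers along these lines. This is a hypothetical sketch, not the code actually pushed; the names follow the descriptions above but the bodies are assumptions.

```javascript
// Hypothetical helper for item 9: strip characters that would break
// markdown-ish interpolation of GPU names into tool responses.
function escapeGpuName(name) {
  return String(name).replace(/[`\\]/g, '');
}

// Hypothetical helper for item 8: unified system-prompt assembly
// (drop empty blocks, join with blank lines, trim).
function buildSystemPrompt(...blocks) {
  return blocks.filter(Boolean).join('\n\n').trim();
}
```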

…esults

When the AI agent is asked to run experiments but local/remote compute
resources (GPU) are unavailable or insufficient, the agent now stops
immediately and reports the issue instead of proceeding and fabricating
fake results.

Changes:
- New skill: aris-compute-guard — mandatory pre-flight compute check
- New API: GET /api/compute/check-available — structured availability check
- New MCP tool: compute_check_available — agents can call directly
- Modified aris-run-experiment and aris-experiment-bridge with Step/Phase 0
- Injected compute guard into system prompts for all providers
  (Claude, OpenRouter, Gemini CLI/API, Local GPU)
- Updated all project templates (CLAUDE.md, AGENTS.md, GEMINI.md, cursor)
- Registered skill in catalog, stage-skill-map, and tag mappings

Made-with: Cursor
Contributor

@bbsngg bbsngg left a comment


Thanks for the thorough work on the compute guard! I rebased this locally onto current main and wanted to flag a few risks worth considering before merge.

Key concerns

1. Enforcement is prompt-only, not hard-gated (highest concern)

The PR's core value prop is "prevent hallucinated results when no GPU is available," but the guard is implemented entirely as:

  • A system-prompt append block
  • A skill-level "Step 0" reminder in aris-run-experiment and aris-experiment-bridge

There is no tool-layer gate — Bash/execute tools don't actually consult compute_check_available before running python train.py or similar commands. An agent that misreads or deprioritizes the prompt can still hallucinate results. Consider either (a) softening the promise in the PR description, or (b) adding a real pre-execution hook in aris-experiment-bridge that refuses to proceed when /api/compute/check-available returns available: false.
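Option (b) might look roughly like this. It is a minimal hypothetical sketch of a pre-execution gate; the gatedRunExperiment wrapper and the { available, reason } result shape are assumptions, with only the /api/compute/check-available endpoint coming from this PR.

```javascript
// Hypothetical hard gate (option b): refuse to launch an experiment
// command unless the compute check reports availability.
// `checkAvailability` stands in for a GET to /api/compute/check-available.
async function gatedRunExperiment(runCommand, checkAvailability) {
  const result = await checkAvailability();
  if (!result.available) {
    // Refuse instead of letting the agent proceed and fabricate output.
    throw new Error(`Compute guard: ${result.reason || 'no usable compute'}; refusing to run.`);
  }
  return runCommand();
}
```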

2. Always-on token cost across all sessions

COMPUTE_GUARD_BLOCK (~200 tokens, ~924 chars) is unconditionally appended to every LLM call — including workspace_qa mode, /btw side questions, and the local-gpu provider (where "check if GPU exists" is semantically tautological). Claude SDK's prompt caching absorbs this, but OpenRouter and Gemini-API pay on every message. Suggest gating on session-mode or pipeline stage (only inject for Experiment stage or research sessions).
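The suggested gating could be as small as a stage check before the append. A hypothetical sketch; the stage names and the maybeInjectGuard helper are illustrative assumptions.

```javascript
// Hypothetical stage gating: only append the guard block for sessions
// that can actually launch experiments. Stage names are illustrative.
const GUARDED_STAGES = new Set(['experiment', 'research']);

function maybeInjectGuard(basePrompt, stage, guardBlock) {
  if (!GUARDED_STAGES.has(stage)) return basePrompt; // e.g. workspace_qa, /btw
  return `${basePrompt}\n\n${guardBlock}`;
}
```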

3. runLocal conflates errors with "no GPU"

server/utils/computeCheck.js:16-26 catches all exceptions and returns ''. Downstream logic treats empty output as "nvidia-smi not found," which means a transient driver error or a 15s timeout falsely reports "no GPU" and blocks legitimate experiments on machines that actually have GPUs. Recommend distinguishing not-found / error / timeout, and surfacing the error cause in the response.

4. MPS check depends on the server's python3, not the project env

python3 -c \"import torch; ...\" runs in the server process's Python, not the user's pyenv/conda. Users with torch installed in their project venv will be told "no MPS available" and prompted to pip install torch — into an ambiguous interpreter. Consider system_profiler SPDisplaysDataType as a hardware-side fallback on macOS, with torch as a soft-confirm.

Smaller issues

  • Template snippet only applies to new projects. writeProjectTemplates skips files that already exist (server/templates/index.js:56-57), so existing CLAUDE.md / AGENTS.md / GEMINI.md won't gain the {{COMPUTE-GUARD}} section. The prompt-layer still fires for existing projects, but template-layer discoverability is zero for current users.
  • Dead cwd plumbing. checkVastCompute(cwd) reads vast-instances.json from cwd, but neither /api/compute/check-available nor the MCP compute_check_available tool passes cwd. The file-lookup branch is unreachable from the current callers.
  • environment query param isn't validated. /api/compute/check-available?environment=garbage silently falls through to the auto branch. Should return 400 on unknown values, or at minimum log a warning.
  • HTTP status inconsistency. check-available returns res.json({ success: false }) with HTTP 200 on errors, while most sibling routes return res.status(500) — makes client-side error handling inconsistent.
  • No tests. parseNvidiaSmiGpuList is a pure function — trivial to unit-test and would catch future regressions in the CSV parsing path.
  • MIN_FREE_GPU_MB = 500 is an absolute threshold. On an A100 (80GB), 500MB used looks "occupied" even at idle (display compositor overhead); on an 8GB consumer card, 500MB is 6% — inconsistent semantics. Consider a ratio-based threshold (e.g., memory.used / memory.total < 0.1).
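The ratio idea from the last bullet is essentially a one-liner. A hypothetical sketch; the 10% default is illustrative.

```javascript
// Hypothetical ratio-based occupancy check: a GPU counts as free when
// under 10% of its memory is in use, regardless of card size.
function isGpuFree(usedMb, totalMb, maxUsedRatio = 0.1) {
  return totalMb > 0 && usedMb / totalMb < maxUsedRatio;
}
```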

Rebase note

When I rebased onto current main, the compute warning skill commit (82a56265) became empty — its only change was a package-lock.json version bump from 1.1.2 to 1.1.3, and main is now at 1.1.4. I skipped it with git rebase --skip. If you re-push after rebasing, the commit count will shrink from 3 to 2. Conflicts in server/claude-sdk.js and server/gemini-api.js were purely import-line overlaps with the BTW work on main and merged cleanly.

Happy to help implement any of these if useful.

@bbsngg bbsngg force-pushed the feat/compute-warning-skill branch from 3344933 to de2d74e Compare April 23, 2026 15:57
@bbsngg bbsngg merged commit 1bec91a into main Apr 23, 2026
3 checks passed
