Conversation
zli12321
commented
Apr 11, 2026
- Add a new aris-compute-guard skill that performs a mandatory pre-flight check of GPU/compute resources (CUDA, MPS, remote SSH, Vast.ai, Modal) before any experiment execution
- Inject a compute guard block into the system prompt of all AI providers (Claude, OpenRouter, Gemini CLI/API, Local GPU) so the agent always checks resources before running experiments — even for natural-language requests
- Add GET /api/compute/check-available REST endpoint and compute_check_available MCP tool for programmatic compute availability checks
- Gate aris-run-experiment (Step 0) and aris-experiment-bridge (Phase 0) with the mandatory compute guard, and update all project templates and skill registries
Zhang-Henry
left a comment
Code Review
Direction is good — pre-flight compute check + consistent system-prompt injection across all 5 providers + clean skill registry updates (170 → 171, correctly slotted into experiment: base). The 14-line COMPUTE_GUARD_BLOCK keeps token cost low. A few things worth addressing:
Should-fix
- DRY violation — `server/routes/compute.js:498-622` and `server/mcp-compute.js:203-315` duplicate ~200 lines of near-identical CUDA/MPS detection logic (including the hard-coded 500 MiB threshold, GPU parsing, and remote/local branching). Any future change has to happen twice. Extract into `server/utils/computeCheck.js` and import from both (see the sketch after this list).
- Error-message info leakage in `server/routes/compute.js:616` — the raw SSH `err.message` is returned to the client (`Remote node "${node.name}" unreachable: ${err.message}`). This can expose private key paths, host fingerprints, or network topology. Return a generic message in the JSON response and keep details in server logs only.
- Hard-coded 500 MiB "free GPU" threshold in both files. Make it configurable, e.g. a `COMPUTE_MIN_FREE_GPU_MB` env var. 500 MB is too low on an H100 and may be too high on a small dev GPU.
- Template drift risk — the 19-line compute-guard block is copy-pasted verbatim into `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, and `cursor-project.md`. Any future tweak must touch all four files. Extract it into a shared snippet (e.g. `server/templates/_compute-guard-snippet.md`) and include it, or generate the templates from a single source.
- Vast.ai / Modal checks are documented in `skills/aris-compute-guard/SKILL.md` but not implemented in the route or MCP tool (only CUDA, MPS, and remote SSH are actually checked). Either implement them or remove them from SKILL.md to avoid misleading users.
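A minimal sketch of what the extracted utility could look like. The file path follows the suggestion above; the function names, parsed-field names, and availability logic are illustrative rather than the PR's actual code:

```js
// server/utils/computeCheck.js (hypothetical sketch; names and shapes are illustrative)
const { execFile } = require('child_process');
const { promisify } = require('util');
const execFileAsync = promisify(execFile);

// Configurable threshold instead of the hard-coded 500 MiB
const MIN_FREE_GPU_MB = parseInt(process.env.COMPUTE_MIN_FREE_GPU_MB || '500', 10);

// Parse `nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv,noheader,nounits`
function parseNvidiaSmiGpuList(csv) {
  return csv
    .trim()
    .split('\n')
    .filter(Boolean)
    .map((line) => {
      const [name, used, total] = line.split(',').map((s) => s.trim());
      return { name, memUsedMb: Number(used), memTotalMb: Number(total) };
    });
}

async function checkCudaLocal() {
  try {
    const { stdout } = await execFileAsync(
      'nvidia-smi',
      ['--query-gpu=name,memory.used,memory.total', '--format=csv,noheader,nounits'],
      { timeout: 15000 },
    );
    const gpus = parseNvidiaSmiGpuList(stdout);
    return { available: gpus.some((g) => g.memTotalMb - g.memUsedMb >= MIN_FREE_GPU_MB), gpus };
  } catch (err) {
    // Keep failure details in server logs; route/MCP callers should return a generic message
    console.error('[computeCheck] nvidia-smi check failed:', err.message);
    return { available: false, gpus: [], error: 'cuda_check_failed' };
  }
}

module.exports = { MIN_FREE_GPU_MB, parseNvidiaSmiGpuList, checkCudaLocal };
```

Both the route and the MCP tool would then import from this module, so the threshold, parsing, and error handling only live in one place.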
Security (low/medium)
- Python MPS probe swallows stderr: `python3 -c "..." 2>/dev/null` — if torch isn't installed, users get a misleading "no GPU" instead of "torch not installed". Log stderr separately for diagnosis.
- SSH host-key handling — verify `ComputeNode.run()` sets `-o StrictHostKeyChecking=accept-new` (or similar), otherwise first-time remote checks will hang on TOFU prompts.
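Roughly what I have in mind for both points (illustrative only; the actual `ComputeNode.run()` wiring and probe code may differ):

```js
// Illustrative sketch, not the PR's code.
const { execFile } = require('child_process');

// Remote CUDA probe: accept-new avoids hanging on first-connection host-key prompts
const remoteHost = 'user@gpu-node'; // placeholder
const sshArgs = [
  '-o', 'StrictHostKeyChecking=accept-new',
  '-o', 'BatchMode=yes',       // fail fast instead of prompting for a password
  '-o', 'ConnectTimeout=10',
  remoteHost,
  'nvidia-smi --query-gpu=name --format=csv,noheader',
];
execFile('ssh', sshArgs, (err, stdout, stderr) => {
  if (err) console.error('[computeCheck] remote probe failed:', err.message);
});

// MPS probe: capture stderr instead of piping to /dev/null, so "torch not installed"
// is distinguishable from "no GPU"
execFile('python3', ['-c', 'import torch; print(torch.backends.mps.is_available())'],
  (err, stdout, stderr) => {
    if (stderr) console.warn('[computeCheck] MPS probe stderr:', stderr.trim());
  });
```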
Nits
- The guard prompt is concise but pretty polite. Consider a stronger marker like **CRITICAL SAFETY RULE:** to make it harder for the model to skip under heavy context.
- No tests. Worth adding a small unit suite that mocks `nvidia-smi` / python / SSH outputs and exercises error paths (parse failures, timeouts, mixed CUDA+MPS environments).
- GPU names are interpolated into markdown-ish strings in the MCP tool response without escaping. Unlikely to bite, but a name like ``H100`something`` would break the output formatting.
- All 5 providers import and append the block, but the concatenation style varies slightly (some chain via `.trim()`, others raw `+`). Cosmetic, but worth unifying.
What's good
- Registry updates are internally consistent across `skill-tag-mapping.json`, `skills-catalog-v2.json`, and `stage-skill-map.json`.
- Route is correctly mounted behind `authenticateToken`.
- Injection is uniform across all 5 providers — no provider skipped.
- The skill's Step 0 gating in `aris-run-experiment` and `aris-experiment-bridge` is the right place to enforce this.
Overall: approve in direction, but would recommend extracting the shared check utility and sanitizing error responses before merge.
I have pushed changes with the following points addressed:
…esults

When the AI agent is asked to run experiments but local/remote compute resources (GPU) are unavailable or insufficient, the agent now stops immediately and reports the issue instead of proceeding and fabricating fake results.

Changes:
- New skill: aris-compute-guard — mandatory pre-flight compute check
- New API: GET /api/compute/check-available — structured availability check
- New MCP tool: compute_check_available — agents can call directly
- Modified aris-run-experiment and aris-experiment-bridge with Step/Phase 0
- Injected compute guard into system prompts for all providers (Claude, OpenRouter, Gemini CLI/API, Local GPU)
- Updated all project templates (CLAUDE.md, AGENTS.md, GEMINI.md, cursor)
- Registered skill in catalog, stage-skill-map, and tag mappings

Made-with: Cursor
bbsngg
left a comment
Thanks for the thorough work on the compute guard! I rebased this locally onto current main and wanted to flag a few risks worth considering before merge.
Key concerns
1. Enforcement is prompt-only, not hard-gated (highest concern)
The PR's core value prop is "prevent hallucinated results when no GPU is available," but the guard is implemented entirely as:
- A system-prompt `append` block
- A skill-level "Step 0" reminder in `aris-run-experiment` and `aris-experiment-bridge`
There is no tool-layer gate — Bash/execute tools don't actually consult `compute_check_available` before running `python train.py` or similar commands. An agent that misreads or deprioritizes the prompt can still hallucinate results. Consider either (a) softening the promise in the PR description, or (b) adding a real pre-execution hook in `aris-experiment-bridge` that refuses to proceed when `/api/compute/check-available` returns `available: false`.
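For (b), here is a rough sketch of the shape I mean. The hook name and where it gets wired in are hypothetical; only the endpoint and its `available` field come from this PR:

```js
// Hypothetical pre-execution hook for aris-experiment-bridge (requires Node 18+ for global fetch)
async function assertComputeAvailable(baseUrl, token) {
  const res = await fetch(`${baseUrl}/api/compute/check-available`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  const body = await res.json();
  if (!res.ok || body.available === false) {
    // Hard refusal: don't let the agent "run" the experiment and improvise results
    throw new Error('Compute pre-flight failed: no usable GPU/compute resource detected');
  }
  return body;
}

// Call this before any Bash/execute tool invocation that launches training, e.g.:
//   await assertComputeAvailable(SERVER_URL, AUTH_TOKEN);
```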
2. Always-on token cost across all sessions
COMPUTE_GUARD_BLOCK (~200 tokens, ~924 chars) is unconditionally appended to every LLM call — including workspace_qa mode, /btw side questions, and the local-gpu provider (where "check if GPU exists" is semantically tautological). Claude SDK's prompt caching absorbs this, but OpenRouter and Gemini-API pay on every message. Suggest gating on session-mode or pipeline stage (only inject for Experiment stage or research sessions).
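One possible gating shape (illustrative; the real session object, mode names, and import path will differ):

```js
// Illustrative gating: only pay the ~200-token block where it can matter.
const { COMPUTE_GUARD_BLOCK } = require('./prompts/computeGuard'); // wherever the PR defines it
const EXPERIMENT_MODES = new Set(['experiment', 'research']);

function buildSystemPrompt(basePrompt, session) {
  const needsGuard = EXPERIMENT_MODES.has(session.mode) || session.stage === 'experiment';
  return needsGuard ? `${basePrompt}\n\n${COMPUTE_GUARD_BLOCK}` : basePrompt;
}
```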
3. runLocal conflates errors with "no GPU"
server/utils/computeCheck.js:16-26 catches all exceptions and returns ''. Downstream logic treats empty output as "nvidia-smi not found," which means a transient driver error or a 15s timeout falsely reports "no GPU" and blocks legitimate experiments on machines that actually have GPUs. Recommend distinguishing not-found / error / timeout, and surfacing the error cause in the response.
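Sketch of the distinction I mean (status values and field names are suggestions, not the current code):

```js
// Illustrative: classify failures instead of collapsing everything to ''.
const { execFile } = require('child_process');
const { promisify } = require('util');
const execFileAsync = promisify(execFile);

async function runNvidiaSmi() {
  try {
    const { stdout } = await execFileAsync(
      'nvidia-smi',
      ['--query-gpu=name', '--format=csv,noheader'],
      { timeout: 15000 },
    );
    return { status: 'ok', stdout };
  } catch (err) {
    if (err.code === 'ENOENT') return { status: 'not_found', stdout: '' };                // no nvidia-smi on PATH
    if (err.killed || err.signal === 'SIGTERM') return { status: 'timeout', stdout: '' }; // hit the 15 s limit
    return { status: 'error', stdout: '', cause: err.message };                           // driver or transient failure
  }
}
```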
4. MPS check depends on the server's python3, not the project env
python3 -c \"import torch; ...\" runs in the server process's Python, not the user's pyenv/conda. Users with torch installed in their project venv will be told "no MPS available" and prompted to pip install torch — into an ambiguous interpreter. Consider system_profiler SPDisplaysDataType as a hardware-side fallback on macOS, with torch as a soft-confirm.
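A possible hardware-side fallback, sketched here as a soft signal only (the exact `system_profiler` output varies across macOS versions, so the torch probe would remain the confirming check):

```js
// Illustrative macOS fallback; a "Metal" match means Metal-capable hardware, not that torch can use it.
const { execFile } = require('child_process');
const { promisify } = require('util');
const execFileAsync = promisify(execFile);

async function macHasMetalGpu() {
  if (process.platform !== 'darwin') return false;
  try {
    const { stdout } = await execFileAsync('system_profiler', ['SPDisplaysDataType'], { timeout: 15000 });
    return /Metal/i.test(stdout); // e.g. "Metal Support: Metal 3" on Apple Silicon
  } catch {
    return false;
  }
}
```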
Smaller issues
- Template snippet only applies to new projects. `writeProjectTemplates` skips files that already exist (server/templates/index.js:56-57), so existing `CLAUDE.md` / `AGENTS.md` / `GEMINI.md` won't gain the `{{COMPUTE-GUARD}}` section. The prompt layer still fires for existing projects, but template-layer discoverability is zero for current users.
- Dead `cwd` plumbing. `checkVastCompute(cwd)` reads `vast-instances.json` from `cwd`, but neither `/api/compute/check-available` nor the MCP `compute_check_available` tool passes `cwd`. The file-lookup branch is unreachable from the current callers.
- `environment` query param isn't validated. `/api/compute/check-available?environment=garbage` silently falls through to the `auto` branch. Should return 400 on unknown values, or at minimum log a warning.
- HTTP status inconsistency. `check-available` returns `res.json({ success: false })` with HTTP 200 on errors, while most sibling routes return `res.status(500)` — makes client-side error handling inconsistent.
- No tests. `parseNvidiaSmiGpuList` is a pure function — trivial to unit-test and would catch future regressions in the CSV parsing path (a sketch follows this list).
- `MIN_FREE_GPU_MB = 500` is an absolute threshold. On an A100 (80 GB), 500 MB used looks "occupied" even at idle (display compositor overhead); on an 8 GB consumer card, 500 MB is 6% — inconsistent semantics. Consider a ratio-based threshold (e.g., `memory.used / memory.total < 0.1`).
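For instance, something like the following, covering both the parser and the ratio idea (using `node:test` purely as an example runner; the import path, field names, and `isMostlyFree` helper are illustrative and should be adjusted to the real module shape):

```js
// Illustrative unit test; adjust import path and per-GPU field names to the real parser output.
const test = require('node:test');
const assert = require('node:assert');
const { parseNvidiaSmiGpuList } = require('../server/utils/computeCheck');

// Ratio-based "free" check instead of the absolute 500 MiB threshold
const isMostlyFree = (gpu) => gpu.memUsedMb / gpu.memTotalMb < 0.1;

test('parses nvidia-smi CSV and applies a ratio threshold', () => {
  const csv = 'NVIDIA A100-SXM4-80GB, 600, 81920\nNVIDIA GeForce RTX 3060, 3000, 12288\n';
  const gpus = parseNvidiaSmiGpuList(csv);
  assert.strictEqual(gpus.length, 2);
  assert.ok(isMostlyFree(gpus[0]));   // 600 MB of 80 GB: idle-level usage, should count as free
  assert.ok(!isMostlyFree(gpus[1]));  // 3 GB of 12 GB: clearly in use
});
```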
Rebase note
When I rebased onto current main, the compute warning skill commit (82a56265) became empty — its only change was a package-lock.json version bump from 1.1.2 → 1.1.3, and main is now at 1.1.4. I skipped it with git rebase --skip. If you re-push after rebasing, the commit count will shrink from 3 to 2. Conflicts in server/claude-sdk.js and server/gemini-api.js were purely import-line overlaps with the BTW work on main and merged cleanly.
Happy to help implement any of these if useful.
3344933 to de2d74e