feat(health-check): add Proactive Sentinel for continuous health monitoring#566
feat(health-check): add Proactive Sentinel for continuous health monitoring#566nikolasdehor wants to merge 2 commits intoSynkraAI:mainfrom
Conversation
|
@nikolasdehor is attempting to deploy a commit to the Pedro Valério Lopez's projects Team on Vercel. A member of the Team first needs to authorize it. |
WalkthroughAdds a ProactiveSentinel Node.js module implementing watchpoint registration, periodic evaluation, alerting with persistence, health-score computation, built‑in checks, pruning, and a comprehensive test suite plus manifest updates to include the new file. Changes
Sequence Diagram(s)sequenceDiagram
actor Caller
participant Sentinel as ProactiveSentinel
participant Watchpoint as Watchpoint.check()
participant Alerts as AlertStore
participant FS as FileSystem
Caller->>Sentinel: start()
activate Sentinel
Sentinel->>Sentinel: schedule periodic evaluateAll()
Sentinel-->>Caller: emit SENTINEL_STARTED
deactivate Sentinel
loop every interval
Sentinel->>Watchpoint: evaluate (async)
activate Watchpoint
Watchpoint-->>Sentinel: status/result
deactivate Watchpoint
alt non-HEALTHY or status change
Sentinel->>Alerts: _fireAlert(alert)
activate Alerts
Alerts->>FS: persist alerts.json
Alerts-->>Sentinel: persisted
deactivate Alerts
Sentinel-->>Caller: emit ALERT_FIRED
end
Sentinel->>Sentinel: _updateHealthScore()
Sentinel-->>Caller: emit WATCHPOINT_EVALUATED
Sentinel-->>Caller: emit HEALTH_SCORE_CHANGED (if changed)
end
Caller->>Sentinel: stop()
activate Sentinel
Sentinel->>Sentinel: clear timers
Sentinel-->>Caller: emit SENTINEL_STOPPED
deactivate Sentinel
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR introduces the ProactiveSentinel module for continuous proactive health monitoring of the AIOX system (Story HCS-3). It complements the existing HealerManager by providing scheduled watchpoint evaluations with a health score, alert history, and integration hooks for auto-remediation.
Changes:
- New
ProactiveSentinelclass (~650 lines) with customizable watchpoints, alert management, health scoring (0-100), persistent alert history, and 5 built-in watchpoints (config-integrity, memory-integrity, stale-locks, disk-space, workspace-structure). - Comprehensive test suite with 50 unit tests covering constructor, load/save, registration, evaluation, alerts, health score, start/stop, status/stats, pruning, and all built-in watchpoints.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
.aios-core/core/health-check/proactive-sentinel.js |
New ProactiveSentinel class with watchpoint registration, evaluation, alerting, health scoring, persistence, and 5 built-in checks |
tests/core/health-check/proactive-sentinel.test.js |
50 unit tests covering all public APIs and built-in watchpoints |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| description: 'Checks that core config files are valid JSON/YAML', | ||
| severity: AlertSeverity.CRITICAL, | ||
| check: async () => { | ||
| const configFiles = [ | ||
| path.join(projectRoot, 'core-config.yaml'), | ||
| path.join(projectRoot, 'package.json'), | ||
| ]; | ||
|
|
||
| for (const file of configFiles) { | ||
| if (!fs.existsSync(file)) continue; | ||
| try { | ||
| const content = fs.readFileSync(file, 'utf8'); | ||
| if (file.endsWith('.json')) { | ||
| JSON.parse(content); | ||
| } |
There was a problem hiding this comment.
The config-integrity description claims it "Checks that core config files are valid JSON/YAML", but the YAML file (core-config.yaml) is only read with fs.readFileSync — its content is never actually parsed or validated as YAML. A file with completely invalid YAML content will still pass this check as long as it's readable from disk. Either add actual YAML validation (e.g., using a YAML parser) or update the description to accurately reflect that only JSON files are validated and YAML files are only checked for readability.
| let totalSize = 0; | ||
| const files = fs.readdirSync(memDir); | ||
| for (const file of files) { | ||
| try { | ||
| const stat = fs.statSync(path.join(memDir, file)); | ||
| totalSize += stat.size; | ||
| } catch { | ||
| // ignore | ||
| } | ||
| } |
There was a problem hiding this comment.
The disk-space check only sums file sizes of direct children of .aiox/ using fs.readdirSync, but .aiox/ contains subdirectories (e.g., audit/, cache/, dashboard/, session-digests/) whose contents won't be counted. Also, calling fs.statSync on a directory entry returns the directory metadata size, not the size of its contents. This means the disk usage reported will significantly undercount the actual space used. Consider using a recursive directory walk to compute the total size accurately.
| // Track consecutive failures | ||
| if (wp.lastStatus === WatchpointStatus.FAILING) { | ||
| wp.consecutiveFailures++; | ||
| } else if (wp.lastStatus === WatchpointStatus.HEALTHY) { |
There was a problem hiding this comment.
consecutiveFailures is only incremented for FAILING status and reset for HEALTHY status, but DEGRADED status is not handled — it neither increments nor resets the counter. This means that if a watchpoint transitions from FAILING to DEGRADED, the consecutiveFailures count will retain its stale value from the previous FAILING runs, which could be misleading for consumers (e.g., auto-healing decisions or alert patterns). Consider either incrementing for DEGRADED as well or resetting it, depending on the intended semantics.
| } else if (wp.lastStatus === WatchpointStatus.HEALTHY) { | |
| } else if ( | |
| wp.lastStatus === WatchpointStatus.HEALTHY || | |
| wp.lastStatus === WatchpointStatus.DEGRADED | |
| ) { |
| const lockPatterns = [ | ||
| path.join(projectRoot, '.aiox', '*.lock'), | ||
| path.join(projectRoot, '.aiox', 'locks', '*'), | ||
| ]; | ||
|
|
||
| const staleLocks = []; | ||
| const maxAge = 30 * 60 * 1000; // 30 minutes | ||
|
|
||
| for (const pattern of lockPatterns) { | ||
| const dir = path.dirname(pattern); |
There was a problem hiding this comment.
The lockPatterns array contains glob-like strings ('*.lock', '*'), but these patterns are never actually used for glob matching. Instead, path.dirname(pattern) is called to extract the directory, and then fs.readdirSync with a .endsWith('.lock') filter is used. The glob strings are misleading to readers. Consider replacing lockPatterns with a simple array of directories to scan, which would make the intent clearer.
| const lockPatterns = [ | |
| path.join(projectRoot, '.aiox', '*.lock'), | |
| path.join(projectRoot, '.aiox', 'locks', '*'), | |
| ]; | |
| const staleLocks = []; | |
| const maxAge = 30 * 60 * 1000; // 30 minutes | |
| for (const pattern of lockPatterns) { | |
| const dir = path.dirname(pattern); | |
| const lockDirs = [ | |
| path.join(projectRoot, '.aiox'), | |
| path.join(projectRoot, '.aiox', 'locks'), | |
| ]; | |
| const staleLocks = []; | |
| const maxAge = 30 * 60 * 1000; // 30 minutes | |
| for (const dir of lockDirs) { |
| consecutiveFailures: 0, | ||
| }; | ||
|
|
||
| this.watchpoints.set(watchpoint.id, entry); |
There was a problem hiding this comment.
Watchpoints registered after start() has been called will not get periodic evaluation timers. The start() method only creates setInterval timers for watchpoints that exist at the time it's called. If a consumer calls registerWatchpoint() while the sentinel is already running, the new watchpoint will exist in the map but will never be automatically evaluated until stop()/start() is called again. Consider adding timer setup in registerWatchpoint when this.running is true, to ensure newly registered watchpoints are immediately included in the monitoring loop.
| this.watchpoints.set(watchpoint.id, entry); | |
| this.watchpoints.set(watchpoint.id, entry); | |
| // If the sentinel is already running, immediately start periodic evaluation | |
| if (this.running) { | |
| const timer = setInterval(() => { | |
| // Ensure we don't create unhandled promise rejections | |
| this.evaluateWatchpoint(entry.id).catch(() => {}); | |
| }, entry.intervalMs); | |
| this._timers.set(entry.id, timer); | |
| } |
| // Fire alert on status change to degraded/failing | ||
| if ( | ||
| wp.lastStatus !== WatchpointStatus.HEALTHY && | ||
| wp.lastStatus !== WatchpointStatus.UNKNOWN && | ||
| (prevStatus !== wp.lastStatus || wp.consecutiveFailures > 1) | ||
| ) { | ||
| this._fireAlert(wp, evaluation); | ||
| } |
There was a problem hiding this comment.
The comment says "Fire alert on status change to degraded/failing" but the condition also fires an alert on every subsequent evaluation of a FAILING watchpoint (via wp.consecutiveFailures > 1), not just on status changes. This means a continuously failing watchpoint generates an alert on every evaluation cycle, which can rapidly fill the alert history. The comment should be updated to accurately describe the behavior, and you may want to consider throttling repeated alerts for the same ongoing failure (e.g., only alerting on every Nth consecutive failure, or after a cooldown period).
| name: watchpoint.name, | ||
| description: watchpoint.description || '', | ||
| check: watchpoint.check, | ||
| intervalMs: watchpoint.intervalMs || this.config.defaultIntervalMs, |
There was a problem hiding this comment.
Using || for default values on line 172-174 will treat falsy values like 0 as missing. For example, intervalMs: 0 would silently fall back to the default interval (60000ms) instead of being treated as an explicit value. Consider using nullish coalescing (??) instead of || for intervalMs to correctly handle the 0 case, e.g., watchpoint.intervalMs ?? this.config.defaultIntervalMs.
| intervalMs: watchpoint.intervalMs || this.config.defaultIntervalMs, | |
| intervalMs: watchpoint.intervalMs ?? this.config.defaultIntervalMs, |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (4)
tests/core/health-check/proactive-sentinel.test.js (2)
707-722: Test validates current threshold behavior.This test uses a 31-minute-old lock to exceed the hardcoded 30-minute threshold. If the stale-lock detection is updated to respect LockManager's per-lock TTL, this test will need adjustment to create a lock file with YAML content including the TTL field.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/health-check/proactive-sentinel.test.js` around lines 707 - 722, The test "stale-locks detects old lock files" currently creates a plain lock file older than 30 minutes to trigger the stale-locks watchpoint; if stale-lock detection is changed to respect per-lock TTL from LockManager, update the test to write a YAML lock file that includes the TTL field (and other expected lock metadata) instead of plain 'locked', and ensure the created lock's TTL/mtime combination will cause s.evaluateWatchpoint('stale-locks') to return WatchpointStatus.DEGRADED; locate the test by the test name and the createSentinel() / evaluateWatchpoint('stale-locks') calls and change the lock file creation to emit the proper YAML structure including the TTL key.
673-692: Test aligns with implementation but note path concern.These tests create
core-config.yamlatTEST_DIRroot to match the current implementation. If the implementation is updated to check.aiox-core/core-config.yaml(as noted in the implementation review), these tests will need corresponding updates.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/health-check/proactive-sentinel.test.js` around lines 673 - 692, The tests for the "config-integrity" watchpoint write core-config.yaml to TEST_DIR root but the implementation may look for .aiox-core/core-config.yaml; update the tests so they create the expected file path (e.g., write core-config.yaml into path.join(TEST_DIR, '.aiox-core', 'core-config.yaml') or adjust the sentinel fixture created by createSentinel() to point to the current lookup location before calling registerBuiltInWatchpoints() and evaluateWatchpoint('config-integrity') so the tests match the implementation..aios-core/core/health-check/proactive-sentinel.js (2)
103-122: Consider using async file operations for consistency.The
load()method is declaredasyncbut uses synchronousfs.existsSyncandfs.readFileSync. For consistency with the method signature and to avoid blocking the event loop, consider usingfs.promisesor callback-based async alternatives.♻️ Suggested refactor using fs.promises
async load() { const filePath = this._getFilePath(); try { - if (fs.existsSync(filePath)) { - const raw = fs.readFileSync(filePath, 'utf8'); + const raw = await fs.promises.readFile(filePath, 'utf8'); - const data = JSON.parse(raw); + const data = JSON.parse(raw); - if (data.schemaVersion !== this.config.schemaVersion) { - this.alerts = []; - this._loaded = true; - return; - } + if (data.schemaVersion !== this.config.schemaVersion) { + this.alerts = []; + this._loaded = true; + return; + } - this.alerts = Array.isArray(data.alerts) ? data.alerts : []; - } - } catch { + this.alerts = Array.isArray(data.alerts) ? data.alerts : []; + } catch (err) { + // File doesn't exist or is invalid - start fresh this.alerts = []; } this._loaded = true; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aios-core/core/health-check/proactive-sentinel.js around lines 103 - 122, The asynchronous load() method currently uses blocking fs.existsSync and fs.readFileSync which can block the event loop; update load() to use fs.promises (or await fs.promises.stat/readFile) with await this._getFilePath() usage as needed: call this._getFilePath() to get filePath, use fs.promises.stat or try/catch on fs.promises.readFile(filePath, 'utf8') to read and JSON.parse the contents, preserve the existing logic that compares data.schemaVersion to this.config.schemaVersion and sets this.alerts and this._loaded, and keep the same error handling branch (on any read/parse error set this.alerts = [] and ensure this._loaded = true). Ensure to reference the same identifiers (load, _getFilePath, this.config.schemaVersion, this.alerts, this._loaded) and remove the synchronous fs.* calls.
127-143: Use async operations and consider atomic write pattern.The
save()method uses synchronous file operations in an async function. Additionally, the codebase has anatomic-write.jsutility (at.aiox-core/core/synapse/utils/atomic-write.js) that prevents file corruption on unexpected exit by using a write-to-tmp-then-rename pattern. Since this file is monitored by thememory-integritywatchpoint, a partial write during crash could trigger a false corruption alert.♻️ Suggested refactor
+ const { atomicWriteSync } = require('../synapse/utils/atomic-write'); + async save() { const filePath = this._getFilePath(); const dir = path.dirname(filePath); - if (!fs.existsSync(dir)) { - fs.mkdirSync(dir, { recursive: true }); - } + await fs.promises.mkdir(dir, { recursive: true }); const data = { schemaVersion: this.config.schemaVersion, version: this.config.version, savedAt: new Date().toISOString(), alerts: this.alerts, }; - fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf8'); + atomicWriteSync(filePath, JSON.stringify(data, null, 2)); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aios-core/core/health-check/proactive-sentinel.js around lines 127 - 143, The save() method currently uses blocking fs calls; replace fs.existsSync/fs.mkdirSync/fs.writeFileSync with async equivalents and perform an atomic write using the project's atomic-write utility: compute const filePath = this._getFilePath(), ensure the parent directory exists with await fs.promises.mkdir(path.dirname(filePath), { recursive: true }), build the data object (schemaVersion, version, savedAt, alerts) and then await the atomic-write helper (imported from the atomic-write module) to write JSON.stringify(data, null, 2) to filePath so the write is non-blocking and uses the write-to-temp-then-rename pattern to avoid partial writes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.aios-core/core/health-check/proactive-sentinel.js:
- Around line 515-551: The disk-size watchpoint's check() currently only sums
files directly under memDir and ignores subdirectories; update the check
implementation in the registerWatchpoint call (the async check function
referencing memDir, totalSize, maxSize, and WatchpointStatus) to compute
totalSize recursively: walk the directory tree under memDir, for each entry use
fs.statSync (or lstatSync) and if stat.isDirectory() recurse into it, otherwise
add stat.size; keep the same error handling (catch and ignore per-entry errors),
then apply the existing threshold logic and return the same
WatchpointStatus.DEGRADED/HEALTHY responses with data.totalSizeBytes populated.
- Around line 474-513: The stale-locks watchpoint currently uses a fixed
30-minute mtime check and autoHeals, which conflicts with LockManager semantics;
update the watchpoint (id: 'stale-locks') to either delegate to
LockManager.cleanupStaleLocks() or replicate LockManager._isLockStale() logic by
reading each lock file's YAML (ttl_seconds and pid) and only marking a lock
stale if ttl has expired AND the PID is not alive (using process.kill(pid,0)),
and disable autoHeal (or ensure it defers deletion to LockManager) so we don't
auto-delete locks that LockManager would consider valid.
- Around line 409-438: The config-integrity watchpoint (id 'config-integrity'
registered via this.registerWatchpoint, check async function) is using the wrong
path and only validates JSON; update the configFiles entry to point to
'.aiox-core/core-config.yaml' instead of 'core-config.yaml' and add YAML parsing
for that file inside the check: when encountering a .yaml/.yml file, parse its
content (e.g., via your project's YAML parser utility or js-yaml's
safeLoad/safeLoadAll) in the try block so syntax errors are caught; keep
existing JSON.parse for .json files and return the same WatchpointStatus.FAILING
payload on parse errors.
In `@tests/core/health-check/proactive-sentinel.test.js`:
- Around line 7-10: Replace the relative require of
'../../../.aios-core/core/health-check/proactive-sentinel' with your project's
configured absolute module path (e.g., the alias used in
moduleNameMapper/tsconfig paths) so the test imports ProactiveSentinel via an
absolute import; update the require that defines ProactiveSentinel (and the
destructured AlertSeverity, WatchpointStatus, Events, CONFIG) to use that
absolute module name so the test follows the coding guideline for absolute
imports.
---
Nitpick comments:
In @.aios-core/core/health-check/proactive-sentinel.js:
- Around line 103-122: The asynchronous load() method currently uses blocking
fs.existsSync and fs.readFileSync which can block the event loop; update load()
to use fs.promises (or await fs.promises.stat/readFile) with await
this._getFilePath() usage as needed: call this._getFilePath() to get filePath,
use fs.promises.stat or try/catch on fs.promises.readFile(filePath, 'utf8') to
read and JSON.parse the contents, preserve the existing logic that compares
data.schemaVersion to this.config.schemaVersion and sets this.alerts and
this._loaded, and keep the same error handling branch (on any read/parse error
set this.alerts = [] and ensure this._loaded = true). Ensure to reference the
same identifiers (load, _getFilePath, this.config.schemaVersion, this.alerts,
this._loaded) and remove the synchronous fs.* calls.
- Around line 127-143: The save() method currently uses blocking fs calls;
replace fs.existsSync/fs.mkdirSync/fs.writeFileSync with async equivalents and
perform an atomic write using the project's atomic-write utility: compute const
filePath = this._getFilePath(), ensure the parent directory exists with await
fs.promises.mkdir(path.dirname(filePath), { recursive: true }), build the data
object (schemaVersion, version, savedAt, alerts) and then await the atomic-write
helper (imported from the atomic-write module) to write JSON.stringify(data,
null, 2) to filePath so the write is non-blocking and uses the
write-to-temp-then-rename pattern to avoid partial writes.
In `@tests/core/health-check/proactive-sentinel.test.js`:
- Around line 707-722: The test "stale-locks detects old lock files" currently
creates a plain lock file older than 30 minutes to trigger the stale-locks
watchpoint; if stale-lock detection is changed to respect per-lock TTL from
LockManager, update the test to write a YAML lock file that includes the TTL
field (and other expected lock metadata) instead of plain 'locked', and ensure
the created lock's TTL/mtime combination will cause
s.evaluateWatchpoint('stale-locks') to return WatchpointStatus.DEGRADED; locate
the test by the test name and the createSentinel() /
evaluateWatchpoint('stale-locks') calls and change the lock file creation to
emit the proper YAML structure including the TTL key.
- Around line 673-692: The tests for the "config-integrity" watchpoint write
core-config.yaml to TEST_DIR root but the implementation may look for
.aiox-core/core-config.yaml; update the tests so they create the expected file
path (e.g., write core-config.yaml into path.join(TEST_DIR, '.aiox-core',
'core-config.yaml') or adjust the sentinel fixture created by createSentinel()
to point to the current lookup location before calling
registerBuiltInWatchpoints() and evaluateWatchpoint('config-integrity') so the
tests match the implementation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 13eba410-f3b4-4a25-b37a-31cdf18532af
📒 Files selected for processing (2)
.aios-core/core/health-check/proactive-sentinel.jstests/core/health-check/proactive-sentinel.test.js
| // 1. Config file integrity | ||
| this.registerWatchpoint({ | ||
| id: 'config-integrity', | ||
| name: 'Config File Integrity', | ||
| description: 'Checks that core config files are valid JSON/YAML', | ||
| severity: AlertSeverity.CRITICAL, | ||
| check: async () => { | ||
| const configFiles = [ | ||
| path.join(projectRoot, 'core-config.yaml'), | ||
| path.join(projectRoot, 'package.json'), | ||
| ]; | ||
|
|
||
| for (const file of configFiles) { | ||
| if (!fs.existsSync(file)) continue; | ||
| try { | ||
| const content = fs.readFileSync(file, 'utf8'); | ||
| if (file.endsWith('.json')) { | ||
| JSON.parse(content); | ||
| } | ||
| } catch { | ||
| return { | ||
| status: WatchpointStatus.FAILING, | ||
| message: `Corrupted config: ${path.basename(file)}`, | ||
| data: { file }, | ||
| }; | ||
| } | ||
| } | ||
| return { status: WatchpointStatus.HEALTHY, message: 'All configs valid' }; | ||
| }, | ||
| }); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the actual location of core-config.yaml in the codebase
fd -t f 'core-config.yaml' --max-depth 3Repository: SynkraAI/aiox-core
Length of output: 44
🏁 Script executed:
# Search for references to core-config in the codebase
rg "core-config" --max-count=20Repository: SynkraAI/aiox-core
Length of output: 50374
🏁 Script executed:
# Find the proactive-sentinel.js file and check its actual location
fd -t f "proactive-sentinel.js"Repository: SynkraAI/aiox-core
Length of output: 44
🏁 Script executed:
# Look for env-config.js file
fd -t f "env-config.js"Repository: SynkraAI/aiox-core
Length of output: 87
🏁 Script executed:
# Check project root structure
ls -la | head -20Repository: SynkraAI/aiox-core
Length of output: 1232
🏁 Script executed:
# Check if .aiox-core directory exists
fd -t d "\.aiox-core" --max-depth 2Repository: SynkraAI/aiox-core
Length of output: 44
🏁 Script executed:
# Check package.json for js-yaml dependency
cat package.json | grep -A5 -B5 "js-yaml" || echo "js-yaml not found in package.json"Repository: SynkraAI/aiox-core
Length of output: 352
🏁 Script executed:
# Find proactive-sentinel.js with correct directory
fd -t f "proactive-sentinel.js" --path "*aios-core*"Repository: SynkraAI/aiox-core
Length of output: 295
🏁 Script executed:
# Try finding it in .aios-core (with 's')
find .aios-core -name "proactive-sentinel.js" 2>/dev/nullRepository: SynkraAI/aiox-core
Length of output: 112
🏁 Script executed:
# List what's in .aios-core
ls -la .aios-core/Repository: SynkraAI/aiox-core
Length of output: 226
🏁 Script executed:
# Read the proactive-sentinel.js file, focusing on lines 409-438
sed -n '400,450p' .aios-core/core/health-check/proactive-sentinel.jsRepository: SynkraAI/aiox-core
Length of output: 1774
🏁 Script executed:
# Check for yaml parsing in the file
rg "yaml|YAML" .aios-core/core/health-check/proactive-sentinel.js -A 2 -B 2Repository: SynkraAI/aiox-core
Length of output: 412
🏁 Script executed:
# Check imports at top of file
head -50 .aios-core/core/health-check/proactive-sentinel.jsRepository: SynkraAI/aiox-core
Length of output: 1663
🏁 Script executed:
# Verify the test file references to confirm path expectation
rg "core-config.yaml" tests/core/health-check/ -B 1 -A 1Repository: SynkraAI/aiox-core
Length of output: 385
Incorrect path and missing YAML validation in config-integrity watchpoint.
The watchpoint has two functional issues:
-
Wrong file path: Line 417 looks for
core-config.yamlat the project root, but the actual file is located at.aiox-core/core-config.yamlthroughout the codebase. -
No YAML parsing: The check only validates JSON files. The YAML file is read but never parsed, so invalid YAML syntax won't be detected despite the description claiming to validate "JSON/YAML".
Suggested fix
+const yaml = require('js-yaml'); // Add with other requires
// 1. Config file integrity
this.registerWatchpoint({
id: 'config-integrity',
name: 'Config File Integrity',
description: 'Checks that core config files are valid JSON/YAML',
severity: AlertSeverity.CRITICAL,
check: async () => {
const configFiles = [
- path.join(projectRoot, 'core-config.yaml'),
+ path.join(projectRoot, '.aiox-core', 'core-config.yaml'),
path.join(projectRoot, 'package.json'),
];
for (const file of configFiles) {
if (!fs.existsSync(file)) continue;
try {
const content = fs.readFileSync(file, 'utf8');
if (file.endsWith('.json')) {
JSON.parse(content);
+ } else if (file.endsWith('.yaml') || file.endsWith('.yml')) {
+ yaml.load(content);
}
} catch {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aios-core/core/health-check/proactive-sentinel.js around lines 409 - 438,
The config-integrity watchpoint (id 'config-integrity' registered via
this.registerWatchpoint, check async function) is using the wrong path and only
validates JSON; update the configFiles entry to point to
'.aiox-core/core-config.yaml' instead of 'core-config.yaml' and add YAML parsing
for that file inside the check: when encountering a .yaml/.yml file, parse its
content (e.g., via your project's YAML parser utility or js-yaml's
safeLoad/safeLoadAll) in the try block so syntax errors are caught; keep
existing JSON.parse for .json files and return the same WatchpointStatus.FAILING
payload on parse errors.
| // 3. Stale lock files | ||
| this.registerWatchpoint({ | ||
| id: 'stale-locks', | ||
| name: 'Stale Lock File Detection', | ||
| description: 'Detects orphaned .lock files that may block operations', | ||
| severity: AlertSeverity.WARNING, | ||
| autoHeal: true, | ||
| check: async () => { | ||
| const lockPatterns = [ | ||
| path.join(projectRoot, '.aiox', '*.lock'), | ||
| path.join(projectRoot, '.aiox', 'locks', '*'), | ||
| ]; | ||
|
|
||
| const staleLocks = []; | ||
| const maxAge = 30 * 60 * 1000; // 30 minutes | ||
|
|
||
| for (const pattern of lockPatterns) { | ||
| const dir = path.dirname(pattern); | ||
| if (!fs.existsSync(dir)) continue; | ||
|
|
||
| const files = fs.readdirSync(dir).filter((f) => f.endsWith('.lock')); | ||
| for (const file of files) { | ||
| const filePath = path.join(dir, file); | ||
| const stat = fs.statSync(filePath); | ||
| if (Date.now() - stat.mtimeMs > maxAge) { | ||
| staleLocks.push(filePath); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (staleLocks.length > 0) { | ||
| return { | ||
| status: WatchpointStatus.DEGRADED, | ||
| message: `Found ${staleLocks.length} stale lock file(s)`, | ||
| data: { locks: staleLocks }, | ||
| }; | ||
| } | ||
| return { status: WatchpointStatus.HEALTHY, message: 'No stale locks' }; | ||
| }, | ||
| }); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check LockManager's TTL handling and lock file format
rg -n -A5 'DEFAULT_TTL|ttl_seconds|_isLockStale' --type jsRepository: SynkraAI/aiox-core
Length of output: 7561
🏁 Script executed:
fd -t f "lock-manager.js" --type f | grep -v test | head -5Repository: SynkraAI/aiox-core
Length of output: 44
🏁 Script executed:
cat -n .aiox-core/core/orchestration/lock-manager.js | head -100Repository: SynkraAI/aiox-core
Length of output: 3734
🏁 Script executed:
cat -n .aiox-core/core/health-check/proactive-sentinel.js | sed -n '470,520p'Repository: SynkraAI/aiox-core
Length of output: 144
🏁 Script executed:
fd -t f "proactive-sentinel.js" 2>/dev/null | head -5Repository: SynkraAI/aiox-core
Length of output: 44
🏁 Script executed:
git ls-files | grep -i "sentinel\|health" | head -10Repository: SynkraAI/aiox-core
Length of output: 620
🏁 Script executed:
cat -n .aios-core/core/health-check/proactive-sentinel.js | sed -n '474,513p'Repository: SynkraAI/aiox-core
Length of output: 1687
🏁 Script executed:
cat -n .aios-core/core/health-check/proactive-sentinel.js | sed -n '1,50p'Repository: SynkraAI/aiox-core
Length of output: 2013
🏁 Script executed:
cat -n .aiox-core/core/orchestration/lock-manager.js | grep -i "stale\|islock" -A 10Repository: SynkraAI/aiox-core
Length of output: 3081
🏁 Script executed:
cat -n .aiox-core/core/orchestration/lock-manager.js | sed -n '100,200p'Repository: SynkraAI/aiox-core
Length of output: 3655
🏁 Script executed:
cat -n .aiox-core/core/orchestration/lock-manager.js | sed -n '246,280p'Repository: SynkraAI/aiox-core
Length of output: 1200
🏁 Script executed:
rg -n "process.kill\|isPidAlive\|_isLockStale" .aiox-core/core/orchestration/lock-manager.js -A 5Repository: SynkraAI/aiox-core
Length of output: 44
🏁 Script executed:
cat -n .aiox-core/core/orchestration/lock-manager.js | sed -n '280,295p'Repository: SynkraAI/aiox-core
Length of output: 433
Stale lock detection contradicts LockManager semantics and may auto-delete valid locks.
The stale-locks watchpoint (lines 474-513) uses a hardcoded 30-minute threshold based solely on file modification time, but this conflicts with how LockManager (.aiox-core/core/orchestration/lock-manager.js) determines lock validity:
-
TTL mismatch: LockManager reads
ttl_secondsfrom the lock file (defaults to 300 seconds, ~5 minutes, but configurable per-lock), whereas the watchpoint ignores this and uses a fixed 30-minute threshold. A lock withttl_seconds: 3600(1 hour) would be flagged as stale after 30 minutes. -
Missing PID check: LockManager's
_isLockStale()method (lines 251-269) verifies both TTL expiration and that the owning process is still alive viaprocess.kill(pid, 0). The watchpoint performs no PID validation. -
Auto-deletion risk: With
autoHeal: true(line 480), the watchpoint will automatically delete locks that are valid according to LockManager, causing race conditions in the orchestration layer.
To align with LockManager's logic, parse the lock file YAML to read ttl_seconds and pid, then validate staleness using the same criteria LockManager does—or call LockManager's cleanupStaleLocks() method if available for this use case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aios-core/core/health-check/proactive-sentinel.js around lines 474 - 513,
The stale-locks watchpoint currently uses a fixed 30-minute mtime check and
autoHeals, which conflicts with LockManager semantics; update the watchpoint
(id: 'stale-locks') to either delegate to LockManager.cleanupStaleLocks() or
replicate LockManager._isLockStale() logic by reading each lock file's YAML
(ttl_seconds and pid) and only marking a lock stale if ttl has expired AND the
PID is not alive (using process.kill(pid,0)), and disable autoHeal (or ensure it
defers deletion to LockManager) so we don't auto-delete locks that LockManager
would consider valid.
| const fs = require('fs'); | ||
| const path = require('path'); | ||
| const ProactiveSentinel = require('../../../.aios-core/core/health-check/proactive-sentinel'); | ||
| const { AlertSeverity, WatchpointStatus, Events, CONFIG } = ProactiveSentinel; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Use absolute imports per coding guidelines.
Line 9 uses a relative import path. As per coding guidelines, **/*.{js,jsx,ts,tsx} files should use absolute imports instead of relative imports.
♻️ Suggested fix
const fs = require('fs');
const path = require('path');
-const ProactiveSentinel = require('../../../.aios-core/core/health-check/proactive-sentinel');
+const ProactiveSentinel = require('.aios-core/core/health-check/proactive-sentinel');
const { AlertSeverity, WatchpointStatus, Events, CONFIG } = ProactiveSentinel;Note: The exact absolute path depends on your project's module resolution configuration (e.g., moduleNameMapper in Jest config, paths in jsconfig/tsconfig, or custom resolvers).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const fs = require('fs'); | |
| const path = require('path'); | |
| const ProactiveSentinel = require('../../../.aios-core/core/health-check/proactive-sentinel'); | |
| const { AlertSeverity, WatchpointStatus, Events, CONFIG } = ProactiveSentinel; | |
| const fs = require('fs'); | |
| const path = require('path'); | |
| const ProactiveSentinel = require('.aios-core/core/health-check/proactive-sentinel'); | |
| const { AlertSeverity, WatchpointStatus, Events, CONFIG } = ProactiveSentinel; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/core/health-check/proactive-sentinel.test.js` around lines 7 - 10,
Replace the relative require of
'../../../.aios-core/core/health-check/proactive-sentinel' with your project's
configured absolute module path (e.g., the alias used in
moduleNameMapper/tsconfig paths) so the test imports ProactiveSentinel via an
absolute import; update the require that defines ProactiveSentinel (and the
destructured AlertSeverity, WatchpointStatus, Events, CONFIG) to use that
absolute module name so the test follows the coding guideline for absolute
imports.
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
.aiox-core/core/health-check/proactive-sentinel.js (2)
421-455:⚠️ Potential issue | 🟠 Major
config-integrityskips the real framework config and never parses YAML.In this repo layout the framework config lives at
.aiox-core/core-config.yaml, so the current path misses it. Even if found, any non-empty YAML with broken syntax is reported healthy because the check never parses it.As per coding guidelines, "Flag any hardcoded paths, credentials, or environment-specific values."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aiox-core/core/health-check/proactive-sentinel.js around lines 421 - 455, The 'config-integrity' watchpoint's check function currently misses the framework config at .aiox-core/core-config.yaml and doesn't actually parse YAML, so update the configFiles array to include path.join(projectRoot, '.aiox-core', 'core-config.yaml') (or glob both locations) and replace the YAML readability check with real parsing using a YAML parser (e.g., js-yaml or yaml) inside the existing try block; ensure YAML.parse (or yaml.safeLoad) is called and that parse errors are caught and returned with WatchpointStatus.FAILING and the same message/data shape so malformed YAML is reported as corrupted.
495-525:⚠️ Potential issue | 🟠 MajorReuse lock-manager staleness rules here.
This check treats any
.lockolder than 30 minutes as stale, but the orchestration layer uses lock metadata like TTL and owner liveness. Valid long-lived locks can be marked degraded here, andautoHeal: truemakes the mismatch risky.As per coding guidelines, "Check for race conditions in orchestration modules (lock-manager, session-state)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aiox-core/core/health-check/proactive-sentinel.js around lines 495 - 525, The current stale-locks check in the check() function (variables: lockDir, staleLocks, maxAge) naively treats any .lock older than 30 minutes as stale; replace the age-only logic by delegating to the lock-manager's staleness API (e.g., call lockManager.isLockStale or lockManager.getLockMeta for each lock) to read TTL/owner metadata and verify owner liveness before classifying a lock as stale, handle and log errors from lock-manager calls, and do not perform auto-heal on locks that the lock-manager deems valid (either set autoHeal: false for this watchpoint or only auto-heal locks confirmed stale by lock-manager) to avoid race conditions with orchestration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.aiox-core/core/health-check/proactive-sentinel.js:
- Around line 396-407: The prune() and _fireAlert() methods currently only
mutate this.alerts in memory, so ensure any mutation persists by calling and
awaiting this.save() (or this.save().catch(...)) immediately after modifying
this.alerts; update prune() to call await this.save() after emitting
Events.ALERTS_PRUNED and update _fireAlert() (and the other mutating block
around lines 624-647) to persist new/removed alerts the same way, handling/save
errors (log or emit) so state isn’t lost across restarts.
- Around line 158-189: registerWatchpoint currently overwrites
this.watchpoints[id] without clearing an existing timer and schedules raw
setInterval calls that can overlap; fix by validating duplicates and preventing
overlapping runs: if a watchpoint with the same id exists, clear its existing
timer from this._timers (clearInterval) and cancel any in-flight run before
replacing; when scheduling (in registerWatchpoint and the other timer-creation
paths), wrap evaluateWatchpoint with an in-flight guard (e.g., entry.inFlight or
a this._inFlight map) or use a self-rescheduling setTimeout pattern so a slow
check() cannot start a second concurrent run and so
lastStatus/consecutiveFailures are updated only by one runner; ensure you also
initialize/reset entry.lastStatus and entry.consecutiveFailures appropriately
when replacing an existing watchpoint.
- Around line 465-489: The check function currently parses every .json in memDir
and can falsely flag files that are mid-write; update check (the async check in
proactive-sentinel.js) to ignore very recently modified files or retry reads:
before attempting JSON.parse for each file (the loop that builds corrupted),
stat the file (fs.statSync or async) and skip parsing if (Date.now() - mtimeMs)
< 2000 (or configurable freshnessWindow), or alternatively wrap the read+parse
in a small retry/backoff (e.g., 3 attempts with 100–200ms delays) and only push
into corrupted after retries fail; ensure the returned message/data (status,
message, data: { corrupted }) remains unchanged.
---
Duplicate comments:
In @.aiox-core/core/health-check/proactive-sentinel.js:
- Around line 421-455: The 'config-integrity' watchpoint's check function
currently misses the framework config at .aiox-core/core-config.yaml and doesn't
actually parse YAML, so update the configFiles array to include
path.join(projectRoot, '.aiox-core', 'core-config.yaml') (or glob both
locations) and replace the YAML readability check with real parsing using a YAML
parser (e.g., js-yaml or yaml) inside the existing try block; ensure YAML.parse
(or yaml.safeLoad) is called and that parse errors are caught and returned with
WatchpointStatus.FAILING and the same message/data shape so malformed YAML is
reported as corrupted.
- Around line 495-525: The current stale-locks check in the check() function
(variables: lockDir, staleLocks, maxAge) naively treats any .lock older than 30
minutes as stale; replace the age-only logic by delegating to the lock-manager's
staleness API (e.g., call lockManager.isLockStale or lockManager.getLockMeta for
each lock) to read TTL/owner metadata and verify owner liveness before
classifying a lock as stale, handle and log errors from lock-manager calls, and
do not perform auto-heal on locks that the lock-manager deems valid (either set
autoHeal: false for this watchpoint or only auto-heal locks confirmed stale by
lock-manager) to avoid race conditions with orchestration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 51d469a9-e175-47e0-83c5-99afd7cc35e7
📒 Files selected for processing (3)
.aios-core/core/health-check/proactive-sentinel.js.aiox-core/core/health-check/proactive-sentinel.js.aiox-core/install-manifest.yaml
| registerWatchpoint(watchpoint) { | ||
| if (!watchpoint.id || !watchpoint.name || !watchpoint.check) { | ||
| throw new Error('Required fields: id, name, check'); | ||
| } | ||
|
|
||
| if (typeof watchpoint.check !== 'function') { | ||
| throw new Error('check must be a function'); | ||
| } | ||
|
|
||
| const entry = { | ||
| id: watchpoint.id, | ||
| name: watchpoint.name, | ||
| description: watchpoint.description || '', | ||
| check: watchpoint.check, | ||
| intervalMs: watchpoint.intervalMs ?? this.config.defaultIntervalMs, | ||
| severity: watchpoint.severity ?? AlertSeverity.WARNING, | ||
| autoHeal: watchpoint.autoHeal ?? false, | ||
| healerId: watchpoint.healerId || null, | ||
| lastStatus: WatchpointStatus.UNKNOWN, | ||
| lastCheck: null, | ||
| consecutiveFailures: 0, | ||
| }; | ||
|
|
||
| this.watchpoints.set(watchpoint.id, entry); | ||
|
|
||
| // If sentinel is already running, create a timer for the new watchpoint immediately | ||
| if (this.running) { | ||
| const timer = setInterval(() => { | ||
| this.evaluateWatchpoint(entry.id).catch(() => {}); | ||
| }, entry.intervalMs); | ||
| if (timer.unref) timer.unref(); | ||
| this._timers.set(entry.id, timer); |
There was a problem hiding this comment.
Prevent duplicate and overlapping watchpoint runs.
registerWatchpoint() can replace an existing id without clearing its old interval, and both timer paths use raw setInterval with no in-flight guard. A duplicate registration or a slow check() will race on lastStatus / consecutiveFailures and can double-fire alerts.
As per coding guidelines, "Check for race conditions in orchestration modules (lock-manager, session-state)." and "Check for proper input validation on public API methods."
Also applies to: 232-240, 291-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/core/health-check/proactive-sentinel.js around lines 158 - 189,
registerWatchpoint currently overwrites this.watchpoints[id] without clearing an
existing timer and schedules raw setInterval calls that can overlap; fix by
validating duplicates and preventing overlapping runs: if a watchpoint with the
same id exists, clear its existing timer from this._timers (clearInterval) and
cancel any in-flight run before replacing; when scheduling (in
registerWatchpoint and the other timer-creation paths), wrap evaluateWatchpoint
with an in-flight guard (e.g., entry.inFlight or a this._inFlight map) or use a
self-rescheduling setTimeout pattern so a slow check() cannot start a second
concurrent run and so lastStatus/consecutiveFailures are updated only by one
runner; ensure you also initialize/reset entry.lastStatus and
entry.consecutiveFailures appropriately when replacing an existing watchpoint.
| prune() { | ||
| const cutoff = Date.now() - this.config.retentionDays * 24 * 60 * 60 * 1000; | ||
| const before = this.alerts.length; | ||
|
|
||
| this.alerts = this.alerts.filter((a) => new Date(a.timestamp).getTime() >= cutoff); | ||
|
|
||
| const pruned = before - this.alerts.length; | ||
| if (pruned > 0) { | ||
| this.emit(Events.ALERTS_PRUNED, { count: pruned }); | ||
| } | ||
| return pruned; | ||
| } |
There was a problem hiding this comment.
Persist alert history when it changes.
prune() and _fireAlert() only mutate this.alerts in memory. New alerts are dropped and pruned alerts come back after restart unless every caller remembers to invoke save() immediately afterward.
Also applies to: 624-647
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/core/health-check/proactive-sentinel.js around lines 396 - 407,
The prune() and _fireAlert() methods currently only mutate this.alerts in
memory, so ensure any mutation persists by calling and awaiting this.save() (or
this.save().catch(...)) immediately after modifying this.alerts; update prune()
to call await this.save() after emitting Events.ALERTS_PRUNED and update
_fireAlert() (and the other mutating block around lines 624-647) to persist
new/removed alerts the same way, handling/save errors (log or emit) so state
isn’t lost across restarts.
| check: async () => { | ||
| const memDir = path.join(projectRoot, '.aiox'); | ||
| if (!fs.existsSync(memDir)) { | ||
| return { status: WatchpointStatus.HEALTHY, message: 'No memory dir yet' }; | ||
| } | ||
|
|
||
| const files = fs.readdirSync(memDir).filter((f) => f.endsWith('.json')); | ||
| const corrupted = []; | ||
|
|
||
| for (const file of files) { | ||
| try { | ||
| JSON.parse(fs.readFileSync(path.join(memDir, file), 'utf8')); | ||
| } catch { | ||
| corrupted.push(file); | ||
| } | ||
| } | ||
|
|
||
| if (corrupted.length > 0) { | ||
| return { | ||
| status: WatchpointStatus.FAILING, | ||
| message: `Corrupted memory files: ${corrupted.join(', ')}`, | ||
| data: { corrupted }, | ||
| }; | ||
| } | ||
| return { status: WatchpointStatus.HEALTHY, message: 'All memory files valid' }; |
There was a problem hiding this comment.
memory-integrity can raise false corruption alerts on active writes.
.aiox-core/workflow-intelligence/learning/gotcha-registry.js:108-131 writes .aiox/gotchas.json directly to its final path. Parsing immediately with no retry or freshness window can catch the file mid-write and mark a healthy workspace as FAILING.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/core/health-check/proactive-sentinel.js around lines 465 - 489,
The check function currently parses every .json in memDir and can falsely flag
files that are mid-write; update check (the async check in
proactive-sentinel.js) to ignore very recently modified files or retry reads:
before attempting JSON.parse for each file (the loop that builds corrupted),
stat the file (fs.statSync or async) and skip parsing if (Date.now() - mtimeMs)
< 2000 (or configurable freshnessWindow), or alternatively wrap the read+parse
in a small retry/backoff (e.g., 3 attempts with 100–200ms delays) and only push
into corrupted after retries fail; ensure the returned message/data (status,
message, data: { corrupted }) remains unchanged.
…toring Story HCS-3 (Proactive Health Monitoring). The Proactive Sentinel monitors system health continuously and fires alerts before failures cascade: - Register custom watchpoints with configurable check intervals - 5 built-in watchpoints: config integrity, memory files, stale locks, disk space, workspace structure - Alert system with severity levels (info, warning, critical) - Aggregated health score (0-100) from all watchpoints - Consecutive failure tracking for escalation - Alert history persistence in .aiox/sentinel-alerts.json - Start/stop continuous monitoring with timer management - Integration point for HealerManager auto-remediation 50 unit tests covering all features and built-in watchpoints.
- Fix config-integrity to not falsely claim YAML validation - Make disk-space check recursive for subdirectories - Handle DEGRADED status in consecutiveFailures tracking - Simplify stale-locks to direct scan instead of glob patterns - Auto-register timers for watchpoints added after start() - Use ?? instead of || for optional numeric defaults - Copy module to .aiox-core/ for package shipping
c4e2be2 to
e34ed95
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (6)
.aiox-core/core/health-check/proactive-sentinel.js (3)
396-406:⚠️ Potential issue | 🟠 MajorPersist alert history inside the mutators.
prune()and_fireAlert()only changethis.alertsin memory. Fired alerts disappear after restart and pruned alerts can come back unless every caller remembers to invokesave()immediately afterward.Also applies to: 624-647
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aiox-core/core/health-check/proactive-sentinel.js around lines 396 - 406, prune() and _fireAlert() only mutate this.alerts in memory so changes get lost on restart; update both mutators (prune and _fireAlert) to persist the updated alerts state by invoking the existing save() persistence method after modifying this.alerts (await if save is async), and handle/save failures by logging/emitting an error (use this.emit or process logger in the same methods) so callers don't have to remember to call save() externally. Ensure you reference and update the methods named prune and _fireAlert to include the save call and proper error handling.
181-189:⚠️ Potential issue | 🟠 MajorPrevent duplicate and overlapping watchpoint runs.
Re-registering an existing id while the sentinel is running overwrites the
Mapentry but leaves the old interval alive, and both timer paths use rawsetInterval. A slowcheck()can therefore overlap the next tick, racelastStatus/consecutiveFailures, and double-fire alerts.As per coding guidelines, "Check for race conditions in orchestration modules (lock-manager, session-state)." and "Check for proper input validation on public API methods."
Also applies to: 215-239, 291-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aiox-core/core/health-check/proactive-sentinel.js around lines 181 - 189, The current registration overwrites this.watchpoints.set(entry.id, entry) but can leave an old timer running and uses raw setInterval which allows overlap; fix by first validating entry.intervalMs > 0 and then if this._timers has an existing timer for entry.id clear it (clearInterval/clearTimeout and remove from this._timers) before creating a new schedule; replace setInterval with a non-overlapping scheduling pattern that calls evaluateWatchpoint(entry.id) and on its completion schedules the next tick via setTimeout (or store a per-watchpoint running flag) so slow check() / evaluateWatchpoint calls cannot overlap and you still keep a reference in this._timers for cancellation. Ensure cancellation/cleanup logic used in other places (the same this._timers usage at evaluate/remove paths) honors the new stored timer/flag.
465-489:⚠️ Potential issue | 🟠 MajorTreat freshly written JSON as transient before flagging corruption.
This loop parses every
.jsonin.aioximmediately. If another process is mid-write, a temporary truncation becomes a FAILING corruption alert for an otherwise healthy workspace.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aiox-core/core/health-check/proactive-sentinel.js around lines 465 - 489, The check function currently flags any JSON parse error as corruption even if a file is mid-write; update the check (the async check closure that reads memDir) to treat files modified recently as transient: for each file in memDir, stat it (using fs.statSync or async equivalent) and skip parsing if mtime is younger than a short threshold (e.g., 3–5 seconds); only attempt JSON.parse on older files and collect those into corrupted; if corrupted (from older files) is non-empty return WatchpointStatus.FAILING as before, but if parse errors only occurred on skipped/recent files return a non-failing status (e.g., WatchpointStatus.HEALTHY with a message like "Recent writes detected; recheck later" or a transient/degraded status if you have one) so transient mid-writes aren’t reported as permanent corruption..aios-core/core/health-check/proactive-sentinel.js (2)
425-455:⚠️ Potential issue | 🟠 MajorValidate the real framework config, and actually parse YAML.
This built-in check looks for
projectRoot/core-config.yaml, but the framework config lives under.aiox-core/core-config.yaml. The YAML branch also only checks for a non-empty string, so invalid YAML syntax still reports healthy.As per coding guidelines, "Ensure backwards compatibility — core modules are consumed by all agents."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aios-core/core/health-check/proactive-sentinel.js around lines 425 - 455, The check() currently inspects projectRoot/core-config.yaml and only verifies YAML is non-empty; update the configFiles list in the check function to include the framework location '.aiox-core/core-config.yaml' (keep the existing projectRoot paths for backwards compatibility), and replace the YAML branch that only trims content with actual parsing using the project's YAML parser (e.g., js-yaml or existing yaml util) so parsing errors throw into the catch block; ensure you reference WatchpointStatus and the same return shape on parse failure so corrupted YAML produces a FAILING result.
499-525:⚠️ Potential issue | 🟠 MajorDon't auto-heal locks from an mtime-only guess.
A lock older than 30 minutes is not necessarily orphaned; long-running workers can still own it. With
autoHeal: true, this watchpoint can feed false stale-lock cleanup into the orchestration layer unless it uses the same ownership/TTL semantics asLockManager.As per coding guidelines, "Check for race conditions in orchestration modules (lock-manager, session-state)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.aios-core/core/health-check/proactive-sentinel.js around lines 499 - 525, The watchpoint currently sets autoHeal: true and treats mtime-only stale locks as safe to remove; change this to avoid unsafe auto-heal by setting autoHeal: false (in the same object where autoHeal is declared) and update the check() logic to only mark DEGRADED rather than auto-remove. If you need automatic cleanup, integrate with the project LockManager semantics instead of mtime-only: call the LockManager API (e.g., LockManager.verifyOwnership or equivalent helper) or validate owner/heartbeat/TTL before considering a lock orphan, and only perform repair actions when LockManager confirms the lock is truly orphaned; keep references to the same symbols used here (autoHeal, check, lockDir, staleLocks) when implementing the change.tests/core/health-check/proactive-sentinel.test.js (1)
7-10:⚠️ Potential issue | 🟠 MajorTest the shipped/public module instead of the
.aios-coresource twin.The packaged copy lives under
.aiox-core, but this suite imports../../../.aios-core/...directly. That lets the source and shipped artifacts drift without CI noticing, and it also bypasses the public entrypoints that consumers are supposed to use.As per coding guidelines, "Use absolute imports instead of relative imports in all code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/health-check/proactive-sentinel.test.js` around lines 7 - 10, The test imports the source twin (.aios-core) instead of the packaged public module; change the require to import the shipped package (the .aiox-core packaged copy) via the public/absolute entrypoint rather than the relative '../../../.aios-core/...' path so the test exercises the published module API; ensure you still pull the exported symbols (ProactiveSentinel and its members AlertSeverity, WatchpointStatus, Events, CONFIG) from the package's public export surface rather than internal source files.
🧹 Nitpick comments (1)
tests/core/health-check/proactive-sentinel.test.js (1)
156-202: Add regression coverage for the scheduler and persistence edge cases.The suite still doesn't cover duplicate watchpoint ids while running, slow
check()calls overlapping an interval, or whether alerts fired/pruned in-memory survive a reload. Those are the paths most likely to regress in this module.As per coding guidelines, "Verify test coverage exists for new/modified functions."
Also applies to: 336-432, 532-574, 620-654
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/health-check/proactive-sentinel.test.js` around lines 156 - 202, Add tests to cover scheduler and persistence edge cases: extend the ProactiveSentinel test suite to include scenarios for duplicate watchpoint ids during runtime (call registerWatchpoint with an existing id and assert proper error or no-op), overlapping slow check() executions (use a check that delays longer than intervalMs and assert scheduler does not start concurrent runs and respects intervalMs), and persistence across reloads (simulate saving/restoring sentinel state and verify in-memory alerts/alerts pruning survive reload). Target the registerWatchpoint, ProactiveSentinel scheduling logic, watchpoints map, Events emissions, and persistence/load routines so these regression paths are exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.aios-core/core/health-check/proactive-sentinel.js:
- Around line 567-579: The code currently hardcodes requiredDirs =
['.aiox-core','tests','bin'] causing false degradations; update the check to
import/require the canonical required directories from framework-config.js (use
the exported constant or accessor in
.aios-core/core/health-check/checks/project/framework-config.js) and use that
array when computing missing (keep the existing backward-compat logic that
treats '.aiox-core' specially), then return WatchpointStatus.DEGRADED with the
same message/data shape only when the canonical list yields missing entries;
reference the variables requiredDirs, missing, projectRoot,
WatchpointStatus.DEGRADED and framework-config.js export in your change.
- Around line 158-175: registerWatchpoint currently accepts any numeric-ish
intervalMs and any severity string which can produce hot timer loops or invalid
alert semantics; validate intervalMs and severity up front in
registerWatchpoint: ensure the resolved intervalMs (watchpoint.intervalMs ??
this.config.defaultIntervalMs) is a finite number > 0 (reject 0, negative, NaN,
±Infinity) and throw a clear Error if invalid, and ensure severity
(watchpoint.severity ?? AlertSeverity.WARNING) is one of the known AlertSeverity
values (e.g., check membership in the AlertSeverity enum/object) and throw a
clear Error for unknown values; reference the registerWatchpoint method, the
intervalMs and severity fields, AlertSeverity, and this.config.defaultIntervalMs
when adding these validations.
In @.aiox-core/core/health-check/proactive-sentinel.js:
- Around line 682-687: The new ProactiveSentinel API (ProactiveSentinel,
AlertSeverity, WatchpointStatus, Events, CONFIG) is only exported locally and
must be re-exported through the standard barrel modules; add re-exports for
these symbols from the health-check module into the health-check barrel (export
them from the health-check index) and then add corresponding re-exports in the
core barrels (core index.js and core index.esm.js) so consumers using the normal
entrypoints can import ProactiveSentinel and its enums/config; ensure both
CommonJS and ESM barrels include the exact symbol names (ProactiveSentinel,
AlertSeverity, WatchpointStatus, Events, CONFIG) in their export lists.
---
Duplicate comments:
In @.aios-core/core/health-check/proactive-sentinel.js:
- Around line 425-455: The check() currently inspects
projectRoot/core-config.yaml and only verifies YAML is non-empty; update the
configFiles list in the check function to include the framework location
'.aiox-core/core-config.yaml' (keep the existing projectRoot paths for backwards
compatibility), and replace the YAML branch that only trims content with actual
parsing using the project's YAML parser (e.g., js-yaml or existing yaml util) so
parsing errors throw into the catch block; ensure you reference WatchpointStatus
and the same return shape on parse failure so corrupted YAML produces a FAILING
result.
- Around line 499-525: The watchpoint currently sets autoHeal: true and treats
mtime-only stale locks as safe to remove; change this to avoid unsafe auto-heal
by setting autoHeal: false (in the same object where autoHeal is declared) and
update the check() logic to only mark DEGRADED rather than auto-remove. If you
need automatic cleanup, integrate with the project LockManager semantics instead
of mtime-only: call the LockManager API (e.g., LockManager.verifyOwnership or
equivalent helper) or validate owner/heartbeat/TTL before considering a lock
orphan, and only perform repair actions when LockManager confirms the lock is
truly orphaned; keep references to the same symbols used here (autoHeal, check,
lockDir, staleLocks) when implementing the change.
In @.aiox-core/core/health-check/proactive-sentinel.js:
- Around line 396-406: prune() and _fireAlert() only mutate this.alerts in
memory so changes get lost on restart; update both mutators (prune and
_fireAlert) to persist the updated alerts state by invoking the existing save()
persistence method after modifying this.alerts (await if save is async), and
handle/save failures by logging/emitting an error (use this.emit or process
logger in the same methods) so callers don't have to remember to call save()
externally. Ensure you reference and update the methods named prune and
_fireAlert to include the save call and proper error handling.
- Around line 181-189: The current registration overwrites
this.watchpoints.set(entry.id, entry) but can leave an old timer running and
uses raw setInterval which allows overlap; fix by first validating
entry.intervalMs > 0 and then if this._timers has an existing timer for entry.id
clear it (clearInterval/clearTimeout and remove from this._timers) before
creating a new schedule; replace setInterval with a non-overlapping scheduling
pattern that calls evaluateWatchpoint(entry.id) and on its completion schedules
the next tick via setTimeout (or store a per-watchpoint running flag) so slow
check() / evaluateWatchpoint calls cannot overlap and you still keep a reference
in this._timers for cancellation. Ensure cancellation/cleanup logic used in
other places (the same this._timers usage at evaluate/remove paths) honors the
new stored timer/flag.
- Around line 465-489: The check function currently flags any JSON parse error
as corruption even if a file is mid-write; update the check (the async check
closure that reads memDir) to treat files modified recently as transient: for
each file in memDir, stat it (using fs.statSync or async equivalent) and skip
parsing if mtime is younger than a short threshold (e.g., 3–5 seconds); only
attempt JSON.parse on older files and collect those into corrupted; if corrupted
(from older files) is non-empty return WatchpointStatus.FAILING as before, but
if parse errors only occurred on skipped/recent files return a non-failing
status (e.g., WatchpointStatus.HEALTHY with a message like "Recent writes
detected; recheck later" or a transient/degraded status if you have one) so
transient mid-writes aren’t reported as permanent corruption.
In `@tests/core/health-check/proactive-sentinel.test.js`:
- Around line 7-10: The test imports the source twin (.aios-core) instead of the
packaged public module; change the require to import the shipped package (the
.aiox-core packaged copy) via the public/absolute entrypoint rather than the
relative '../../../.aios-core/...' path so the test exercises the published
module API; ensure you still pull the exported symbols (ProactiveSentinel and
its members AlertSeverity, WatchpointStatus, Events, CONFIG) from the package's
public export surface rather than internal source files.
---
Nitpick comments:
In `@tests/core/health-check/proactive-sentinel.test.js`:
- Around line 156-202: Add tests to cover scheduler and persistence edge cases:
extend the ProactiveSentinel test suite to include scenarios for duplicate
watchpoint ids during runtime (call registerWatchpoint with an existing id and
assert proper error or no-op), overlapping slow check() executions (use a check
that delays longer than intervalMs and assert scheduler does not start
concurrent runs and respects intervalMs), and persistence across reloads
(simulate saving/restoring sentinel state and verify in-memory alerts/alerts
pruning survive reload). Target the registerWatchpoint, ProactiveSentinel
scheduling logic, watchpoints map, Events emissions, and persistence/load
routines so these regression paths are exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: df11f2f0-9c7d-4b27-b6a6-f698cd4eddc4
📒 Files selected for processing (4)
.aios-core/core/health-check/proactive-sentinel.js.aiox-core/core/health-check/proactive-sentinel.js.aiox-core/install-manifest.yamltests/core/health-check/proactive-sentinel.test.js
| registerWatchpoint(watchpoint) { | ||
| if (!watchpoint.id || !watchpoint.name || !watchpoint.check) { | ||
| throw new Error('Required fields: id, name, check'); | ||
| } | ||
|
|
||
| if (typeof watchpoint.check !== 'function') { | ||
| throw new Error('check must be a function'); | ||
| } | ||
|
|
||
| const entry = { | ||
| id: watchpoint.id, | ||
| name: watchpoint.name, | ||
| description: watchpoint.description || '', | ||
| check: watchpoint.check, | ||
| intervalMs: watchpoint.intervalMs ?? this.config.defaultIntervalMs, | ||
| severity: watchpoint.severity ?? AlertSeverity.WARNING, | ||
| autoHeal: watchpoint.autoHeal ?? false, | ||
| healerId: watchpoint.healerId || null, |
There was a problem hiding this comment.
Validate intervalMs and severity up front.
registerWatchpoint() accepts any numeric-ish interval and any severity string. 0, negative, or non-finite intervals can create a hot timer loop, and unknown severities break alert weighting/semantics.
🛡️ Proposed validation
if (typeof watchpoint.check !== 'function') {
throw new Error('check must be a function');
}
+
+ const intervalMs = watchpoint.intervalMs ?? this.config.defaultIntervalMs;
+ if (!Number.isFinite(intervalMs) || intervalMs <= 0) {
+ throw new Error('intervalMs must be a positive number');
+ }
+ if (
+ watchpoint.severity != null &&
+ !Object.values(AlertSeverity).includes(watchpoint.severity)
+ ) {
+ throw new Error(`Unknown severity: ${watchpoint.severity}`);
+ }
const entry = {
id: watchpoint.id,
name: watchpoint.name,
description: watchpoint.description || '',
check: watchpoint.check,
- intervalMs: watchpoint.intervalMs ?? this.config.defaultIntervalMs,
+ intervalMs,
severity: watchpoint.severity ?? AlertSeverity.WARNING,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aios-core/core/health-check/proactive-sentinel.js around lines 158 - 175,
registerWatchpoint currently accepts any numeric-ish intervalMs and any severity
string which can produce hot timer loops or invalid alert semantics; validate
intervalMs and severity up front in registerWatchpoint: ensure the resolved
intervalMs (watchpoint.intervalMs ?? this.config.defaultIntervalMs) is a finite
number > 0 (reject 0, negative, NaN, ±Infinity) and throw a clear Error if
invalid, and ensure severity (watchpoint.severity ?? AlertSeverity.WARNING) is
one of the known AlertSeverity values (e.g., check membership in the
AlertSeverity enum/object) and throw a clear Error for unknown values; reference
the registerWatchpoint method, the intervalMs and severity fields,
AlertSeverity, and this.config.defaultIntervalMs when adding these validations.
| const requiredDirs = ['.aiox-core', 'tests', 'bin']; | ||
| const missing = requiredDirs.filter((d) => !fs.existsSync(path.join(projectRoot, d))); | ||
|
|
||
| // Also check .aios-core for backward compat | ||
| if (missing.includes('.aiox-core') && fs.existsSync(path.join(projectRoot, '.aios-core'))) { | ||
| missing.splice(missing.indexOf('.aiox-core'), 1); | ||
| } | ||
|
|
||
| if (missing.length > 0) { | ||
| return { | ||
| status: WatchpointStatus.DEGRADED, | ||
| message: `Missing directories: ${missing.join(', ')}`, | ||
| data: { missing }, |
There was a problem hiding this comment.
Use the framework's canonical required directories here.
.aiox-core/core/health-check/checks/project/framework-config.js treats .aiox-core and .claude as required. Hardcoding tests and bin here will mark valid minimal workspaces as degraded and penalize the health score for no framework issue.
As per coding guidelines, "Ensure backwards compatibility — core modules are consumed by all agents."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aios-core/core/health-check/proactive-sentinel.js around lines 567 - 579,
The code currently hardcodes requiredDirs = ['.aiox-core','tests','bin'] causing
false degradations; update the check to import/require the canonical required
directories from framework-config.js (use the exported constant or accessor in
.aios-core/core/health-check/checks/project/framework-config.js) and use that
array when computing missing (keep the existing backward-compat logic that
treats '.aiox-core' specially), then return WatchpointStatus.DEGRADED with the
same message/data shape only when the canonical list yields missing entries;
reference the variables requiredDirs, missing, projectRoot,
WatchpointStatus.DEGRADED and framework-config.js export in your change.
| module.exports = ProactiveSentinel; | ||
| module.exports.ProactiveSentinel = ProactiveSentinel; | ||
| module.exports.AlertSeverity = AlertSeverity; | ||
| module.exports.WatchpointStatus = WatchpointStatus; | ||
| module.exports.Events = Events; | ||
| module.exports.CONFIG = CONFIG; |
There was a problem hiding this comment.
Expose the new sentinel through the standard barrels.
These local exports are not enough: .aiox-core/core/health-check/index.js, .aiox-core/core/index.js, and .aiox-core/core/index.esm.js still omit ProactiveSentinel and its enums/config. Consumers using the normal entrypoints cannot reach the new API.
As per coding guidelines, "Ensure backwards compatibility — core modules are consumed by all agents."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.aiox-core/core/health-check/proactive-sentinel.js around lines 682 - 687,
The new ProactiveSentinel API (ProactiveSentinel, AlertSeverity,
WatchpointStatus, Events, CONFIG) is only exported locally and must be
re-exported through the standard barrel modules; add re-exports for these
symbols from the health-check module into the health-check barrel (export them
from the health-check index) and then add corresponding re-exports in the core
barrels (core index.js and core index.esm.js) so consumers using the normal
entrypoints can import ProactiveSentinel and its enums/config; ensure both
CommonJS and ESM barrels include the exact symbol names (ProactiveSentinel,
AlertSeverity, WatchpointStatus, Events, CONFIG) in their export lists.
Resumo
Story HCS-3 (Proactive Health Monitoring). Complementa o HealerManager existente com monitoramento proativo.
O Proactive Sentinel detecta problemas antes que causem falhas em cascata, como um sistema imunológico para o AIOX.
Funcionalidades
config-integrity— Verifica se package.json e core-config.yaml são válidosmemory-integrity— Verifica se arquivos JSON em .aiox/ são parseáveisstale-locks— Detecta lock files órfãos (>30min)disk-space— Monitora tamanho do diretório .aiox/workspace-structure— Verifica se diretórios essenciais existemautoHeal+healerIdpara remediação automáticaArquitetura
Arquivos
.aios-core/core/health-check/proactive-sentinel.jstests/core/health-check/proactive-sentinel.test.jsTest plan
npx jest tests/core/health-check/proactive-sentinel.test.js— 50/50 passingIntegrar com(trabalho futuro, issue separada)aiox doctorpara usar watchpoints como checks adicionaisConectar com HealerManager para auto-remediação via(trabalho futuro, issue separada)autoHealflagSummary by CodeRabbit