fix(daemon): resolve FD exhaustion and event loop blocking with thousands of memory artifacts #513
Conversation
Track FD counts, event loop lag, and memory usage at daemon lifecycle points. Expose resource snapshot on /health under a new additive `resources` key. Gate periodic polling behind debug log level. Ref: Signet-AI#512
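Event loop lag tracking of the kind this commit describes can be sketched as follows; the interval, threshold, and callback shape are illustrative assumptions, not the actual `resource-monitor.ts` API.

```typescript
// Sketch of event-loop lag detection: schedule a timer at a fixed
// interval and measure how much later than scheduled it actually fires.
// Interval and threshold values here are illustrative, not the PR's.
export function startLagMonitor(
  onLag: (ms: number) => void,
  intervalMs = 500,
  thresholdMs = 50,
): () => void {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    // Extra delay beyond the scheduled interval means the loop was blocked.
    const lag = now - last - intervalMs;
    last = now;
    if (lag > thresholdMs) onLag(lag);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A snapshot from a monitor like this is the kind of value that could back the additive `resources` key on `/health`.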
…est refs Cold-start cache seeding now queries DB source_path rows so files deleted while the daemon was down are detected and their DB rows cleaned up during the first reindex after restart. syncManifestRefs now tracks previous ledger refs via prevLedgerRefs so manifests that drop out of the clipped ledger have their memory_md_refs cleared even when their disk files did not change. resetProjectionPurgeState clears all module-level caches for proper test isolation. Includes regression tests for both fixes.
Move renderMemoryProjection() to a node:worker_threads worker to eliminate ~1.5s event loop blocking during MEMORY.md synthesis.

- Add synthesis-render-worker with own SQLite + sqlite-vec connection
- Add typed message protocol in synthesis-worker-protocol.ts
- Integrate worker lifecycle in daemon startup/shutdown
- Graceful sync fallback when worker is unavailable
- Extract checkCliAvailable() helper, deduplicate 4 CLI preflight blocks
- Remove dead runSynthesis function and unused imports
- 12 new integration tests, POC test, stress test script
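A typed worker message protocol with type guards, as this commit describes, can be sketched like this; the message shapes and guard names are assumptions, not the actual contents of `synthesis-worker-protocol.ts`.

```typescript
// Hypothetical message shapes for a request/response worker protocol.
// A discriminated union plus a runtime type guard lets the main thread
// validate untrusted postMessage payloads without any `as any` casts.
type RenderRequest = { kind: "render"; id: number; agentId: string };
type RenderResponse = { kind: "rendered"; id: number; prompt: string };
export type WorkerMessage = RenderRequest | RenderResponse;

export function isRenderResponse(msg: unknown): msg is RenderResponse {
  return (
    typeof msg === "object" && msg !== null &&
    (msg as { kind?: unknown }).kind === "rendered" &&
    typeof (msg as { id?: unknown }).id === "number" &&
    typeof (msg as { prompt?: unknown }).prompt === "string"
  );
}
```

The `id` field lets the caller correlate responses with pending requests when multiple renders are in flight.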
Hi @Ostico - I'm an automated code reviewer powered by pr-reviewer. I'm taking a look at the fixes in this PR.
PR-Reviewer-Ant
left a comment
Automated review by pr-reviewer | model: gpt-5.3-codex | commit:
d56e2cad
[Automated Review by pr-reviewer] I found blocking correctness issues in the new synthesis worker fallback path and incremental artifact reindex logic.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The issues are directly visible in changed code paths: worker timeout/error handling in hooks.ts now returns an empty prompt instead of rendering synchronously, and reindexMemoryArtifacts() in memory-lineage.ts skips tombstoned/invalid/out-of-scope files without deleting existing DB rows, which can leave stale artifacts.
```typescript
      indexBlock: rendered.indexBlock,
    };
  }
```
Blocking: on worker timeout/error this resolves to an empty synthesis payload (prompt: "") instead of falling back to synchronous renderMemoryProjection(). That diverges from the PR goal/description of sync fallback when worker is unavailable and can cause lost synthesis output under transient worker failures.
Not a bug — intentional design. The sync fallback is for when the worker is unavailable (worker === null), not for when it fails mid-request.
Three failure modes, three deliberate responses:
| Condition | Behavior | Why |
|---|---|---|
| `worker === null` | Sync `renderMemoryProjection()` fallback | Worker unavailable (startup failure, previous crash) — safe one-off sync call |
| Worker timeout (30s) | Empty prompt, auto-recovery on next request | If we called sync after a 30s hang, we'd block the event loop for another ~1.5s — the exact problem this PR fixes. Plus, whatever caused the 30s hang (e.g., SQLite lock contention) would likely also stall sync rendering |
| Worker error | Empty prompt, auto-recovery on next request | The error/exit handlers in daemon.ts:2478-2485 set `setSynthesisRenderWorker(null)` — so the next synthesis request sees `worker === null` and uses sync fallback automatically |
The auto-recovery path is: worker crashes → onError/onExit sets worker to null → next /api/hooks/synthesis call hits line 3608 (worker === null) → sync fallback fires.
Falling back to sync on timeout would create a dangerous cascade: worker hangs → 30s timeout → sync render blocks event loop for 1.5s → health checks stall → cascading degradation. Empty prompt for a single request is the safer degradation.
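The three-branch policy above can be sketched as a minimal stand-in, assuming hypothetical names (`WorkerHandle`, `renderSync`, `setWorker`); the real code in hooks.ts and daemon.ts differs in detail.

```typescript
// Sketch of the failure-mode policy: sync fallback only when the worker
// is absent; empty prompt plus worker reset on timeout/error, so the
// NEXT request recovers via the sync path. Names are illustrative.
type SynthesisResult = { prompt: string };

interface WorkerHandle {
  render(agentId: string, timeoutMs: number): Promise<SynthesisResult>;
}

let worker: WorkerHandle | null = null;

export function setWorker(w: WorkerHandle | null): void {
  worker = w;
}

// Cheap stand-in for the synchronous renderMemoryProjection() fallback.
function renderSync(agentId: string): SynthesisResult {
  return { prompt: `sync:${agentId}` };
}

export async function handleSynthesis(agentId: string): Promise<SynthesisResult> {
  if (worker === null) {
    // Worker unavailable (startup failure, previous crash): one-off sync call.
    return renderSync(agentId);
  }
  try {
    return await worker.render(agentId, 30_000);
  } catch {
    // Timeout/error: degrade to an empty prompt rather than blocking the
    // event loop with a sync render right after a hang. Resetting the
    // worker mirrors the onError/onExit handlers: auto-recovery next call.
    setWorker(null);
    return { prompt: "" };
  }
}
```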
```typescript
      cache.set(path, mtime);
      continue;
    }
    const match = path.match(/--([a-z2-7]{16})--/);
```
Blocking data-integrity bug: incremental reindex now continues for tombstoned/invalid/out-of-scope files after updating cache, but does not delete pre-existing memory_artifacts rows for those paths. Stale rows can persist in DB and leak deleted/invalid sessions into projection/ledger results.
Partially valid — fixed in incoming push.
The claim overstates the tombstone case but is correct for invalid/scope-changed artifacts.
Tombstone case (NOT a real issue):
`removeCanonicalSession()` at lines 1444-1467 already runs `DELETE FROM memory_artifacts WHERE agent_id = ? AND session_token = ?` and `rmSync()` on the files. By the time the next reindex runs, the DB rows are gone and the files are deleted from disk. No stale row is possible for tombstoned sessions through the normal path.
Invalid artifact / scope-change cases (real gap, now fixed):
If a previously-indexed file later fails `isValidArtifact()` (corrupted hash, missing frontmatter) or changes `agent_id` (scope mismatch), the `continue` skipped the upsert but left the old DB row. The file still exists on disk, so the cleanup loop at lines 639-646 never catches it.
Practical impact is near-zero (artifact files are written once by the pipeline and essentially never modified), but it is an architectural correctness gap. Fixed by adding `DELETE FROM memory_artifacts WHERE source_path = ?` before each `continue` in all three branches (tombstone, invalid, scope-mismatch). The tombstone DELETE is defensive/harmless since `removeCanonicalSession` already cleaned the row.
Fix coming in the next push.
PR-Reviewer-Ant
left a comment
Automated review by pr-reviewer | model: gpt-5.3-codex | commit:
940d5096
[Automated Review by pr-reviewer] I found blocking correctness issues where the new synthesis worker path can silently return empty output and where global ledger-ref state can corrupt cross-agent manifest refs.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The failures are directly visible in changed control flow: worker timeout/error paths now resolve empty synthesis payloads in hooks.ts, and memory-lineage.ts introduces process-global prevLedgerRefs/lastChangedManifests used across agent-scoped renders. I did not run runtime repro here, but the logic-level regressions are concrete in the diff.
```diff
@@ -1257,9 +1333,26 @@ function renderIndexSection(indexBlock: string, base: ReadonlyArray<string>): st
   return "";
```
Blocking: prevLedgerRefs/lastChangedManifests are module-global and not keyed by agentId. syncManifestRefs() now unions previous refs into the current update set, so renders for agent A can influence which manifests are rewritten during a render for agent B. This creates cross-agent state bleed and can clear or rewrite ledger refs in unrelated manifests (data integrity risk).
Not a new issue — pre-existing design, this PR narrows the scope.
prevLedgerRefs and lastChangedManifests are module-global, yes. But the OLD syncManifestRefs(refs) was broader:
```typescript
// Before this PR:
const files = listCanonicalFiles().filter(path => path.endsWith("--manifest.md"));
```

Every render scanned ALL manifest files and set/cleared memory_md_refs based on the current agent's ledger. Sequential renders for different agents already produced the same cross-agent ref thrashing — alice's render sets her refs, bob's render clears them because they're not in bob's set.
The new prevLedgerRefs optimization actually REDUCES this: instead of processing all manifests every time, it only touches changed manifests + previously-referenced ones that dropped out. The cross-agent interaction surface is narrower, not wider.
The root issue is that syncManifestRefs was never agent-scoped — that’s a pre-existing design choice, not a regression from this PR. Scoping it properly would be a separate improvement.
Worker creation had three unhandled failure modes that could crash the daemon process: constructor throw, unhandled error event during module evaluation, and worker exit before init response. Wrap in try/catch and attach error/exit listeners during the init phase so failures fall back to sync rendering instead of process crash.
PR-Reviewer-Ant
left a comment
Automated review by pr-reviewer | model: gpt-5.3-codex | commit:
8dddb952
[Automated Review by pr-reviewer] I found correctness regressions that should be fixed before merge, including a synthesis fallback gap and stale-artifact indexing behavior.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The issues are directly visible in changed logic: worker timeout/error paths in hooks.ts now return empty output instead of rendering, and reindexMemoryArtifacts now short-circuits unchanged files before tombstone checks in memory-lineage.ts, which can leave deleted sessions indexed.
```typescript
});

const fileSet = new Set(files);
for (const path of files) {
```
Incremental reindex checks cache.get(path) === mtime before evaluating tombstones/session validity. If a session is tombstoned while files are unchanged, those rows will be skipped and can remain in memory_artifacts indefinitely, causing stale projection/index state. Tombstone invalidation needs to be evaluated even when file mtime is unchanged.
Not a bug — removeCanonicalSession() already handles cleanup.
The claim assumes tombstoned sessions can have stale memory_artifacts rows. Looking at `removeCanonicalSession()` (lines 1453-1479), it performs three cleanup steps in the same transaction:

1. Inserts a tombstone record into `memory_artifact_tombstones`
2. `DELETE FROM memory_artifacts WHERE agent_id = ? AND session_token = ?` — rows gone
3. `rmSync()` on all artifact files — files gone from disk
After tombstoning, there are no DB rows to become stale. On next reindex, if any cache entries reference deleted files, the cleanup loop at lines 648-655 catches them:
```typescript
for (const path of cache.keys()) {
  if (fileSet.has(path)) continue; // still on disk
  // Not on disk → delete any lingering DB row
  db.prepare("DELETE FROM memory_artifacts WHERE source_path = ?").run(relativePath(path));
  cache.delete(path);
}
```

The mtime cache skip (line 613) only applies to files that exist on disk with unchanged content. Tombstoned files don't exist on disk — they're caught by the cache cleanup, not the mtime check. The scenario described cannot occur through removeCanonicalSession, which is the only tombstone creation path.
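The interaction between the mtime skip and the cleanup sweep can be illustrated with a toy model, using a plain Map and Set as stand-ins for the mtime cache and the memory_artifacts table; this is not the real implementation.

```typescript
// Toy two-phase incremental reindex: phase 1 skips files whose mtime is
// unchanged; phase 2 sweeps cache entries whose files vanished from disk
// (e.g. sessions tombstoned via rmSync) and drops their "DB rows".
function reindex(
  files: Map<string, number>,  // path -> mtime currently on disk
  cache: Map<string, number>,  // path -> mtime at last index
  db: Set<string>,             // indexed source_paths (stand-in table)
): void {
  for (const [path, mtime] of files) {
    if (cache.get(path) === mtime) continue; // unchanged: skip upsert
    db.add(path);
    cache.set(path, mtime);
  }
  // Cleanup sweep: anything cached but no longer on disk loses its row.
  for (const path of [...cache.keys()]) {
    if (files.has(path)) continue; // still on disk
    db.delete(path);
    cache.delete(path);
  }
}
```

The key property argued above: a tombstoned file never reaches the mtime check at all, because it is absent from `files` and is handled by the sweep.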
```diff
@@ -1248,7 +1296,6 @@ function startFileWatcher() {
   join(AGENTS_DIR, "IDENTITY.md"),
```
Removing the memory/ watch path disables runtime ingestion for newly added/modified non-artifact markdown files in memory/ (the watcher handlers still contain ingestion logic, and startup import still filters for non-artifact files). This creates inconsistent behavior: startup imports such files, but later changes are no longer observed.
See detailed reply on the latest review round for this same comment (commit 4f1c2ce0).
initDbAccessorLite() unconditionally called loadExtension("") when
sqlite-vec was unavailable (common in Docker, possible on fresh
installs). This crashed the worker thread during module evaluation,
which in turn crashed the daemon via unhandled error event.
Add truthy guard and try/catch around loadExtension(), matching the
same defensive pattern used in the main daemon loadVecExtension().
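The defensive pattern this commit describes can be sketched as follows; `DbLike` and `tryLoadVec` are illustrative stand-ins, not the daemon's actual `initDbAccessorLite()` or `loadVecExtension()` APIs.

```typescript
// Truthy guard plus try/catch around extension loading: a missing or
// broken sqlite-vec degrades to no-vector mode instead of crashing the
// worker thread (and, via its unhandled error event, the daemon).
export interface DbLike {
  loadExtension(path: string): void;
}

export function tryLoadVec(db: DbLike, vecPath: string | undefined): boolean {
  // Truthy guard: never call loadExtension("") when sqlite-vec is absent.
  if (!vecPath) return false;
  try {
    db.loadExtension(vecPath);
    return true;
  } catch {
    // Load failure is non-fatal; callers check the boolean and skip
    // vector features when it is false.
    return false;
  }
}
```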
PR-Reviewer-Ant
left a comment
Automated review by pr-reviewer | model: gpt-5.3-codex | commit:
4f1c2ce0
[Automated Review by pr-reviewer] I found one blocking behavior regression where the implementation does not fully match the PR’s claim about memory/ contents and watcher coverage.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The diff removes join(AGENTS_DIR, "memory") from chokidar watch roots in daemon.ts, while the same file still contains active ingestion paths (importExistingMemoryFiles and ingestMemoryMarkdown) for non-artifact .md files under memory/. That creates a concrete runtime regression: post-start writes/adds of such files will no longer trigger ingestion events.
```diff
@@ -1248,7 +1296,6 @@ function startFileWatcher() {
   join(AGENTS_DIR, "IDENTITY.md"),
```
Blocking regression: removing memory/ from watcher roots disables runtime ingestion for non-artifact markdown files created/updated inside memory/ after startup. This conflicts with existing logic that still treats such files as ingestible (importExistingMemoryFiles filters for non-artifact .md and ingestMemoryMarkdown explicitly skips only artifact/backup patterns). The PR description asserts everything in memory/ is canonical artifacts/backups, but the codebase behavior still supports non-artifact inputs there.
Technically correct, operationally irrelevant — the FD leak fix justifies this.
The watcher handler at lines 1330-1341 is indeed dead code for memory/ files after this change. But ingestMemoryMarkdown() (line 1019) already filters out ALL file types that the daemon creates in memory/:
- Canonical artifacts (`--summary.md`, `--transcript.md`, etc.) → filtered by `ARTIFACT_FILENAME_RE`
- Backups (`MEMORY.backup-*.md`) → filtered by `MEMORY_BACKUP_FILENAME_RE`
- `MEMORY.md` → explicitly filtered at line 1016
- DB/system files → not `.md`
No daemon code path, API endpoint, CLI command, or documented workflow creates non-artifact markdown files in memory/ at runtime. importExistingMemoryFiles() at startup still handles any pre-existing files via readdirSync() (unaffected by this change).
The memory/ directory contains ~8,340 files in a typical workspace. Bun’s fs.watch() opens one O_RDONLY FD per file and never releases them on close() (confirmed via strace — issue #512 root cause). Removing this watch path eliminates the FD leak entirely.
The codebase assertion is correct: everything in memory/ is canonical artifacts or backups. The ingestion handler was never triggered for these files anyway — the ignored matcher in watcher-ignore.ts:52-58 already filtered artifacts, and ingestMemoryMarkdown filtered the rest. The watcher on memory/ was consuming 8K+ FDs to produce zero useful events.
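The layered filename filtering described above can be sketched like this; the regexes are illustrative approximations, not the daemon's actual patterns in `watcher-ignore.ts` or `ingestMemoryMarkdown()`.

```typescript
// Approximate filters showing why the memory/ watcher produced zero
// useful events: every file class the daemon writes there is rejected.
// Both regexes are assumptions for illustration.
const ARTIFACT_FILENAME_RE = /--(summary|transcript|manifest)\.md$/;
const MEMORY_BACKUP_FILENAME_RE = /^MEMORY\.backup-.*\.md$/;

export function isIngestible(name: string): boolean {
  if (!name.endsWith(".md")) return false;            // DB/system files
  if (name === "MEMORY.md") return false;             // projection output
  if (ARTIFACT_FILENAME_RE.test(name)) return false;  // canonical artifacts
  if (MEMORY_BACKUP_FILENAME_RE.test(name)) return false; // backups
  return true;
}
```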
I think the overall direction is right. The one thing I want fixed before merge is the worker failure path. If the render worker times out or errors, we should not return the same shape as a valid empty render. The manifest ref scoping concern is real, but I'm fine tracking that as a follow-up if this PR is not making the existing behavior worse.
PR-Reviewer-Ant
left a comment
Automated review by pr-reviewer | model: gpt-5.3-codex | commit:
d0e6112a
[Automated Review by pr-reviewer] I found one blocking data-integrity issue in the incremental reindex path; otherwise the PR generally aligns with the stated FD leak and event-loop unblocking goals.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new scoped reindex logic in memory-lineage.ts performs DELETEs by source_path without agent_id constraints. This is directly visible in changed code and can remove rows belonging to other agents when reindexing a single agent.
```typescript
if (scope && nextAgent !== scope) {
  getDbAccessor().withWriteTx((db) => {
    db.prepare("DELETE FROM memory_artifacts WHERE source_path = ?").run(relativePath(path));
  });
```
Blocking: scoped reindex can delete artifacts for other agents. In the scope && nextAgent !== scope branch (and other new delete branches), rows are deleted with DELETE FROM memory_artifacts WHERE source_path = ? and no agent_id predicate. If another agent has the same source_path row, this causes cross-agent data loss/corruption. Use scoped deletes when scope is set (e.g., ... WHERE source_path = ? AND agent_id = ?) and keep global deletes only for unscoped runs.
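The reviewer's suggested fix can be illustrated with an in-memory stand-in for the table; the real code issues SQL, and `deleteArtifact` is a hypothetical helper, not a function in the PR.

```typescript
// Toy stand-in for memory_artifacts showing why the delete needs an
// agent_id predicate when a scope is set: without it, a scoped reindex
// of one agent removes rows belonging to other agents at the same path.
type Row = { source_path: string; agent_id: string };

export function deleteArtifact(rows: Row[], sourcePath: string, scope?: string): Row[] {
  return rows.filter((r) => {
    if (r.source_path !== sourcePath) return true; // unrelated row: keep
    // Scoped runs remove only the reindexed agent's row;
    // unscoped runs remove every row for the path.
    return scope !== undefined && r.agent_id !== scope;
  });
}
```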
PR-Reviewer-Ant
left a comment
Automated review by pr-reviewer | model: gpt-5.3-codex | commit:
5181b943
[Automated Review] I am an automated PR review tool (pr-reviewer). I found one blocking data-integrity issue in the new incremental manifest-ref sync logic.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new state used by syncManifestRefs is module-global (prevLedgerRefs) and not keyed by agentId. In renderMemoryProjection(agentId), refs for one agent are reused in later calls for other agents, which can cause manifest ref updates to touch the wrong agent’s manifests and clear valid refs. This behavior is directly visible in the changed logic paths in memory-lineage.ts.
```diff
@@ -32,6 +33,16 @@ const BASE32 = "abcdefghijklmnopqrstuvwxyz234567";
  let projTok: Tiktoken | null = null;
```
Blocking: prevLedgerRefs is global process state and not scoped by agentId. The new incremental syncManifestRefs(refs, changedManifests) uses prevLedgerRefs from the previous render call, which may belong to a different agent. That can cause cross-agent manifest rewrites and stale-ref clearing in the wrong files (data integrity risk). Please scope this state per-agent (e.g., Map<agentId, Set<...>>) and pass agentId through the sync path.
Thanks Nicholai — appreciate the review and the follow-up commits. The sync fallback on worker failure is a better design than returning the empty shape, agreed. Hoping this PR can help — I've been running it locally against a pretty large memory directory (~8,500 artifact files) and it's been working well. The FD exhaustion is gone and synthesis feels noticeably snappier with the worker thread offload. The prevLedgerRefs scoping concern makes sense as a follow-up. Happy to pick that up in a separate PR if useful.
Convert two process-global singletons in memory-lineage.ts to per-agent Maps so concurrent projections for different agents no longer leak state into each other. Adds regression test verifying cross-agent isolation. Addresses review feedback from PR Signet-AI#513.
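A minimal sketch of the per-agent keying this commit describes, assuming hypothetical shapes; the real memory-lineage.ts state and signatures may differ.

```typescript
// Per-agent previous-refs state: a Map keyed by agentId replaces the
// process-global Set, so renders for different agents cannot bleed into
// each other's manifest updates.
const prevLedgerRefs = new Map<string, Set<string>>();

// Returns the manifests to touch for this render: the current refs plus
// any previously-referenced manifests that dropped out of this agent's
// ledger (so their refs can be cleared).
export function syncManifestRefs(agentId: string, refs: Set<string>): Set<string> {
  const prev = prevLedgerRefs.get(agentId) ?? new Set<string>();
  const toTouch = new Set([...refs, ...prev]);
  prevLedgerRefs.set(agentId, new Set(refs));
  return toTouch;
}
```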
PR-Reviewer-Ant
left a comment
Automated review by pr-reviewer | model: gpt-5.3-codex | commit:
04cae822
[Automated Review by pr-reviewer] I found one correctness issue in the new test changes; core implementation largely matches the PR description, but this should be fixed before relying on test results.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The issue is directly visible in the changed test line where an environment variable is set to undefined instead of being deleted, which can change process state unexpectedly across tests. I did not run the test suite here to confirm downstream failures.
Follow-up fix for the cross-agent state leakage.

All existing tests pass (15/15 memory-lineage, 13/13 synthesis-worker). TSC clean against baseline.
merging this, thank you @Ostico!
You're Welcome!! |
Summary
Fixes #512. The daemon became unresponsive after startup when thousands of memory artifact files existed. Three independent root causes were identified through instrumented profiling and strace analysis, and all have been addressed:

- `fs.watch()` FD leak — Bun opens one `O_RDONLY` FD per file when watching a directory and `close()` never releases them (~8,000 leaked FDs)
- `renderMemoryProjection()` blocks the event loop — synchronous full-scan of ~7,600 files on every synthesis request (~1.5s block)

Changes
FD Leak Fix
- Removed the `memory/` directory from chokidar watch paths — all files inside are canonical artifacts already excluded by the `ignored` matcher, so no watcher events would fire anyway
- `watcher-ignore.ts` as defense-in-depth

Synthesis Worker Thread (event loop unblocking)
- `synthesis-render-worker.ts` — Worker script with independent SQLite + sqlite-vec connection
- `synthesis-worker-protocol.ts` — Shared typed message protocol with type guards (zero `as any` casts)
- `hooks.ts` — `handleSynthesisRequest()` delegates to worker via `postMessage`, sync fallback if worker unavailable
- `daemon.ts` — Worker lifecycle (spawn on startup, terminate on shutdown), logged error handling
- `db-accessor.ts` — `initDbAccessorLite()` for lightweight DB access in worker isolate

Incremental Reindex + Import Short-circuit
- `reindexMemoryArtifacts()` now tracks mtime, skips unchanged files
- `importExistingMemoryFiles()` short-circuits when filtered file list is empty

Code Cleanup
- `checkCliAvailable()` helper — deduplicated 4 identical CLI preflight spawn blocks
- Removed dead `runSynthesis()` function (not exported, zero callers)
- `hooks-routes.ts`

Permanent Diagnostics
- `resource-monitor.ts` — FD snapshot, event loop lag detection, periodic polling
- `/health` endpoint extended with additive `resources` key

Type
- feat — new user-facing feature (bumps minor)
- fix — bug fix
- refactor — restructure without behavior change
- chore — build, deps, config, docs
- perf — performance improvement
- test — test coverage

Packages affected
- `@signet/core`
- `@signet/daemon`
- `@signet/cli`
- dashboard
- `@signet/sdk`
- `@signet/connector-*`
- `@signet/web`
- predictor

Screenshots
N/A — no UI changes.
PR Readiness (MANDATORY)

- INDEX.md + dependencies.yaml

Migration Notes (if applicable)
N/A — no migrations touched.
Testing
- `bun test` passes
- `bun run typecheck` passes (442 errors, 2 fewer than baseline — dead code removal)
- `bun run lint` passes

Stress Test Results
5 concurrent synthesis requests + 10 `/health` probes fired simultaneously:

Test Coverage

- `tests/integration/synthesis-stress-test.sh`

4-Agent Verification
- (`as any`/`@ts-ignore`)

AI disclosure

- (`Assisted-by` tags in commits)

Assisted-by: OpenCode:claude-opus-4-6
Notes
Known Limitations (tracked as tech debt, non-blocking)
- `console.error` instead of structured logger (V8 isolate has no access to daemon's logger instance)
- `configurePragmas` duplicated in POC test (test-only, not production code)

Root Cause: Bun Runtime Bug
The FD leak is a confirmed Bun runtime bug: `fs.watch()` on a directory opens one `O_RDONLY` FD per file, and `close()` does not release them. Node.js uses a single inotify FD for the same operation. Reproduction script available in issue #512. This fix works around the bug by not watching the `memory/` directory at all.