fix(compute-batches): emit knownCrossBatchNodeIds so doc-batches use correct prefixes (#303) by tirth8205 · Pull Request #399 · Egonex-AI/Understand-Anything

tirth8205 · 2026-06-05T16:11:37Z

Summary

Closes #303 — doc-batch file-analyzer subagents were emitting cross-batch edges with the wrong type prefix (e.g. file:CLAUDE.md instead of document:CLAUDE.md) because the dispatch prompt gave them batchImportData[] + neighborMap{} but no map of canonical node IDs allocated by other batches. ~17/19 of those bad edges were getting recovered post-hoc by the assemble-reviewer at extra LLM cost; 2 stayed dangling.

This change pre-computes the authoritative <prefix>:<path> for every file in compute-batches.mjs (Phase 1.5) and ships a per-batch knownCrossBatchNodeIds: string[] field in batches.json. The file-analyzer dispatch prompt now passes this list through with strict instructions: use the IDs verbatim, do not invent IDs, drop the edge if the target is not in the list.
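The "verbatim or drop" rule in the dispatch prompt can be illustrated in code. This is only a sketch of the intended behavior: the file-analyzer is an LLM subagent following prompt instructions, and `lookupCrossBatchId` is a hypothetical helper that does not appear in this PR. It assumes IDs have the `<prefix>:<path>` shape described above.

```javascript
// Hypothetical helper (not part of this PR): resolve a cross-batch target
// path to its canonical node ID, or signal that the edge must be dropped.
function lookupCrossBatchId(targetPath, knownCrossBatchNodeIds) {
  // IDs have the form <prefix>:<path>; compare on the path part only,
  // so the prefix is never guessed from the path.
  const match = knownCrossBatchNodeIds.find(
    (id) => id.slice(id.indexOf(":") + 1) === targetPath
  );
  // null means "not in the list": the caller drops the edge instead of
  // inventing an ID.
  return match ?? null;
}

const exampleKnownIds = ["document:CLAUDE.md", "file:src/index.ts"];
const resolvedId = lookupCrossBatchId("CLAUDE.md", exampleKnownIds);
const missingId = lookupCrossBatchId("docs/missing.md", exampleKnownIds);
```

Here `resolvedId` is the ID taken verbatim from the list, and `missingId` is `null`, which corresponds to dropping the edge.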

- compute-batches.mjs: adds nodePrefixForFile() (mirrors file-analyzer.md "Node type mapping by fileCategory" + merge-batch-graphs.py TYPE_TO_PREFIX) and buildKnownCrossBatchNodeIds() (a sorted, per-batch list of the canonical IDs owned by all other batches). The new field is wired into the second-pass batch enrichment alongside batchImportData and neighborMap.
- agents/file-analyzer.md: adds a "Cross-batch node IDs (knownCrossBatchNodeIds) — STRICT prefix authority" section with rules for all edge types whose target lives in another batch.
- skills/understand/SKILL.md Phase 2: passes knownCrossBatchNodeIds through the dispatch template.
- tests/skill/understand/test_compute_batches.test.mjs: adds a new knownCrossBatchNodeIds (issue #303) suite (6 cases) backed by a new scan-result-known-cross-batch-ids.json fixture that exercises the exact #303 scenario (docs/config/infra batches with cross-references). The suite asserts that the field exists on every batch, excludes the batch's own files, uses the correct prefix per fileCategory (the document:CLAUDE.md vs file:CLAUDE.md invariant), is sorted, and that the union own ∪ knownCrossBatchNodeIds reaches every project file (the invariant the agent relies on for "drop the edge if not in the list").
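The per-batch list described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in compute-batches.mjs: the batch shape (`{ files: [{ path, fileCategory }] }`), the `idFor` callback, and the reduced `prefixByCategory` lookup are all hypothetical, and the real function's signature and sort comparator may differ.

```javascript
// Hypothetical sketch: for each batch, list the canonical IDs owned by
// every OTHER batch, sorted for deterministic output.
function buildKnownCrossBatchNodeIds(batches, idFor) {
  // Canonical IDs of each batch's own files, in batch order.
  const ownIds = batches.map((batch) => batch.files.map(idFor));
  const allIds = ownIds.flat();
  return batches.map((_, i) => {
    const own = new Set(ownIds[i]);
    // Exclude the batch's own files; default string sort keeps it stable.
    return allIds.filter((id) => !own.has(id)).sort();
  });
}

// Assumed, deliberately reduced category -> prefix lookup for the example.
const prefixByCategory = { docs: "document", config: "config", code: "file" };
const idFor = (f) => `${prefixByCategory[f.fileCategory] ?? "file"}:${f.path}`;

const batches = [
  { files: [{ path: "CLAUDE.md", fileCategory: "docs" }] },
  {
    files: [
      { path: "tsconfig.json", fileCategory: "config" },
      { path: "src/index.ts", fileCategory: "code" },
    ],
  },
];
const known = buildKnownCrossBatchNodeIds(batches, idFor);
// known[0] holds the second batch's IDs; known[1] holds the first batch's ID.
```

Taking the ID function as a parameter keeps the prefix authority in one place, so the per-batch lists cannot disagree with the IDs each batch assigns to its own files.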

`fileCategory` → prefix mapping (mirrors file-analyzer.md and merge-batch-graphs.py)

| fileCategory | Prefix | Notes |
| --- | --- | --- |
| `code` / `script` / `markup` | `file:` | per file-analyzer.md "Node type mapping by fileCategory" |
| `config` | `config:` | |
| `docs` | `document:` | |
| `infra` | `pipeline:` | `.github/workflows/`, `.gitlab-ci.yml`, `.circleci/`, `Jenkinsfile` |
| `infra` | `resource:` | `.tf`, `.tfvars`, `Vagrantfile` |
| `infra` | `service:` | default (Dockerfile, docker-compose, K8s, …) |
| `data` | `endpoint:` | `openapi.yaml`/`swagger.json` |
| `data` | `schema:` | `.graphql`/`.gql`/`.proto`/`.prisma` |
| `data` | `table:` | default (SQL) |
| unknown | `file:` | defensive fallback (matches merge-batch-graphs default) |
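The mapping in the table can be read as a small decision function. The sketch below is an approximation written from the table alone: the `{ path, fileCategory }` input shape, the exact matching rules (for example, requiring `.github/workflows/` at the repository root), and the `canonicalNodeId` helper are assumptions, and the real nodePrefixForFile() may use different heuristics.

```javascript
// Hypothetical sketch of the fileCategory -> prefix table; not the PR's code.
function nodePrefixForFile(file) {
  const path = file.path;
  const name = path.split("/").pop(); // basename
  switch (file.fileCategory) {
    case "code":
    case "script":
    case "markup":
      return "file";
    case "config":
      return "config";
    case "docs":
      return "document";
    case "infra":
      // CI definitions map to pipeline nodes.
      if (
        path.startsWith(".github/workflows/") ||
        path.startsWith(".circleci/") ||
        name === ".gitlab-ci.yml" ||
        name === "Jenkinsfile"
      ) {
        return "pipeline";
      }
      // Provisioning files map to resource nodes.
      if (/\.(tf|tfvars)$/.test(name) || name === "Vagrantfile") {
        return "resource";
      }
      return "service"; // default: Dockerfile, docker-compose, K8s, ...
    case "data":
      if (name === "openapi.yaml" || name === "swagger.json") return "endpoint";
      if (/\.(graphql|gql|proto|prisma)$/.test(name)) return "schema";
      return "table"; // default: SQL
    default:
      return "file"; // defensive fallback for unknown categories
  }
}

// Assumed helper: build the <prefix>:<path> ID used in knownCrossBatchNodeIds.
function canonicalNodeId(file) {
  return `${nodePrefixForFile(file)}:${file.path}`;
}
```

For example, `canonicalNodeId({ path: "CLAUDE.md", fileCategory: "docs" })` yields `document:CLAUDE.md`, the ID that doc-batch subagents were previously guessing as `file:CLAUDE.md`.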

Caveats

- The infra and data sub-classifications are path heuristics, mirroring buildNonCodeBatches() Group A–D and the file-analyzer's "Choosing between infra/data sub-types" guidance. Edge cases such as a Makefile (currently → service:) might be mis-classified, but the impact of a wrong prefix is bounded: the merge step's dangling-edge dropper still catches it, and the existing assemble-reviewer pass still has the recovery signal. In the worst case, behavior falls back to what it is today.
- I scoped this to just emitting knownCrossBatchNodeIds from compute-batches.mjs plus the dispatch instructions, per the issue's "feel free to scope down" note. Verifying the full end-to-end gain (the 17/19 → 19/19 dangling-edge recovery rate) requires running /understand against a real multi-batch project, which is out of scope for this PR. The unit tests confirm the data shape and invariants.

Test plan

- pnpm test tests/skill/understand/test_compute_batches.test.mjs: 25 passed (19 existing + 6 new)
- pnpm test (full suite): 224 passed across 17 test files (no regressions in extract-import-map, scan-project, merge-recover-imports, worktree-redirect, dashboard utils, etc.)
- End-to-end: run /understand --full against a project that has multi-batch doc cross-references (e.g. a repo with CLAUDE.md + multi-package src/) and confirm that the merge-batch-graphs.py "Dropped dangling edges" count drops vs. baseline. (Recommended but not blocking.)

🤖 Generated with Claude Code

…correct prefixes (Egonex-AI#303)

During Phase 2, doc/test batch file-analyzer subagents emit edges to nodes they don't own using the wrong type prefix (e.g. file:CLAUDE.md instead of document:CLAUDE.md). The assemble-reviewer recovers most via post-hoc LLM repair but at extra cost, and the residual stays dangling.

Root cause: the dispatch prompt provided batchImportData[] + neighborMap{} but no map of canonical node IDs allocated by other batches, forcing the agent to guess the prefix from the path.

This change:

- Adds nodePrefixForFile() in compute-batches.mjs that mirrors the file category -> node type mapping documented in file-analyzer.md and the TYPE_TO_PREFIX dict in merge-batch-graphs.py (code/script/markup -> file, config -> config, docs -> document, infra -> service/pipeline/resource by path heuristic, data -> table/schema/endpoint by path heuristic).
- Adds buildKnownCrossBatchNodeIds() that pre-computes the canonical <prefix>:<path> ID for every file and, per batch, returns the sorted array of IDs owned by all OTHER batches.
- Emits the new knownCrossBatchNodeIds: string[] field on every batch entry written to batches.json.
- Updates agents/file-analyzer.md with a STRICT prefix-authority section: use the list verbatim, do not invent IDs, drop the edge if the target is not in the list.
- Updates skills/understand/SKILL.md Phase 2 dispatch template to pass knownCrossBatchNodeIds through to the file-analyzer subagent.
- Adds 6 vitest cases covering: presence on every batch, exclusion of own files, correct per-fileCategory prefixes (the exact Egonex-AI#303 scenario with CLAUDE.md, Dockerfile, tsconfig.json, .github/workflows/), sorted determinism, and the reachability invariant the agent relies on ("not in the list -> not in the graph").

Closes Egonex-AI#303.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tirth8205 wants to merge 1 commit into Egonex-AI:main from tirth8205:fix/compute-batches-known-cross-batch-ids
