fix(compute-batches): emit knownCrossBatchNodeIds so doc-batches use correct prefixes (#303) #399
Open
tirth8205 wants to merge 1 commit into
…correct prefixes (Egonex-AI#303)

During Phase 2, doc/test batch file-analyzer subagents emit edges to nodes they don't own using the wrong type prefix (e.g. file:CLAUDE.md instead of document:CLAUDE.md). The assemble-reviewer recovers most via post-hoc LLM repair but at extra cost, and the residual stays dangling.

Root cause: the dispatch prompt provided batchImportData[] + neighborMap{} but no map of canonical node IDs allocated by other batches, forcing the agent to guess the prefix from the path.

This change:

- Adds nodePrefixForFile() in compute-batches.mjs that mirrors the file category -> node type mapping documented in file-analyzer.md and the TYPE_TO_PREFIX dict in merge-batch-graphs.py (code/script/markup -> file, config -> config, docs -> document, infra -> service/pipeline/resource by path heuristic, data -> table/schema/endpoint by path heuristic).
- Adds buildKnownCrossBatchNodeIds() that pre-computes the canonical <prefix>:<path> ID for every file and, per batch, returns the sorted array of IDs owned by all OTHER batches.
- Emits the new knownCrossBatchNodeIds: string[] field on every batch entry written to batches.json.
- Updates agents/file-analyzer.md with a STRICT prefix-authority section: use the list verbatim, do not invent IDs, drop the edge if the target is not in the list.
- Updates skills/understand/SKILL.md Phase 2 dispatch template to pass knownCrossBatchNodeIds through to the file-analyzer subagent.
- Adds 6 vitest cases covering: presence on every batch, exclusion of own files, correct per-fileCategory prefixes (the exact Egonex-AI#303 scenario with CLAUDE.md, Dockerfile, tsconfig.json, .github/workflows/), sorted determinism, and the reachability invariant the agent relies on ("not in the list -> not in the graph").

Closes Egonex-AI#303.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
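For readers who want to see the category-to-prefix mapping as code, here is a minimal sketch. It is not the PR's `nodePrefixForFile()`: the names `sketchNodePrefix` and `sketchNodeId`, the signature, and the regexes are hypothetical, and treating `service`, `table`, and `file` as the fallbacks for unmatched infra, data, and unknown categories is an assumption.

```javascript
// Hypothetical sketch of a fileCategory -> prefix lookup. Function names,
// signature, and regexes are illustrative assumptions, not the PR's code.
const PIPELINE = [/^\.github\/workflows\//, /(^|\/)\.gitlab-ci\.yml$/, /^\.circleci\//, /(^|\/)Jenkinsfile$/];
const RESOURCE = [/\.tf$/, /\.tfvars$/, /(^|\/)Vagrantfile$/];
const ENDPOINT = [/(^|\/)openapi\.yaml$/, /(^|\/)swagger\.json$/];
const SCHEMA = [/\.(graphql|gql|proto|prisma)$/];

const matchesAny = (path, patterns) => patterns.some((re) => re.test(path));

function sketchNodePrefix(path, fileCategory) {
  switch (fileCategory) {
    case "code":
    case "script":
    case "markup":
      return "file";
    case "config":
      return "config";
    case "docs":
      return "document";
    case "infra":
      if (matchesAny(path, PIPELINE)) return "pipeline";
      if (matchesAny(path, RESOURCE)) return "resource";
      return "service"; // assumed fallback for other infra files
    case "data":
      if (matchesAny(path, ENDPOINT)) return "endpoint";
      if (matchesAny(path, SCHEMA)) return "schema";
      return "table"; // assumed fallback for other data files
    default:
      return "file"; // assumed fallback for unknown categories
  }
}

// Canonical ID in the <prefix>:<path> form described in the commit message.
function sketchNodeId(path, fileCategory) {
  return `${sketchNodePrefix(path, fileCategory)}:${path}`;
}

console.log(sketchNodeId("CLAUDE.md", "docs")); // document:CLAUDE.md
console.log(sketchNodeId(".github/workflows/ci.yml", "infra")); // pipeline:.github/workflows/ci.yml
```

The point of centralizing the lookup is that the ID is derived from the scanner's `fileCategory` rather than guessed from the path, which is how a docs file ends up as `document:CLAUDE.md` instead of `file:CLAUDE.md`.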
Summary
Closes #303: doc-batch file-analyzer subagents were emitting cross-batch edges with the wrong type prefix (e.g. `file:CLAUDE.md` instead of `document:CLAUDE.md`) because the dispatch prompt gave them `batchImportData[]` + `neighborMap{}` but no map of canonical node IDs allocated by other batches. ~17/19 of those bad edges were getting recovered post-hoc by the assemble-reviewer at extra LLM cost; 2 stayed dangling.

This change pre-computes the authoritative `<prefix>:<path>` for every file in `compute-batches.mjs` (Phase 1.5) and ships a per-batch `knownCrossBatchNodeIds: string[]` field in `batches.json`. The file-analyzer dispatch prompt now passes this list through with strict instructions: use the IDs verbatim, do not invent IDs, drop the edge if the target is not in the list.

- `compute-batches.mjs`: adds `nodePrefixForFile()` (mirrors `file-analyzer.md` "Node type mapping by fileCategory" + `merge-batch-graphs.py` `TYPE_TO_PREFIX`) and `buildKnownCrossBatchNodeIds()` (sorted per-batch list of all-other-batches' canonical IDs). New field is wired into the second-pass batch enrichment alongside `batchImportData` and `neighborMap`.
- `agents/file-analyzer.md`: adds a "Cross-batch node IDs (`knownCrossBatchNodeIds`) — STRICT prefix authority" section with rules for all edge types whose target lives in another batch.
- `skills/understand/SKILL.md` Phase 2: passes `knownCrossBatchNodeIds` through the dispatch template.
- `tests/skill/understand/test_compute_batches.test.mjs`: adds a new `knownCrossBatchNodeIds (issue #303)` suite (6 cases) backed by a new `scan-result-known-cross-batch-ids.json` fixture that exercises the exact #303 scenario (docs/config/infra batches with cross-references). Asserts the field exists on every batch, excludes own files, uses the correct prefix per `fileCategory` (the `document:CLAUDE.md` vs `file:CLAUDE.md` invariant), is sorted, and that the union `own ∪ knownCrossBatchNodeIds` reaches every project file (the invariant the agent uses for "drop the edge if not in the list").

`fileCategory` → prefix mapping (mirrors file-analyzer.md and merge-batch-graphs.py)

| fileCategory | prefix | path heuristic |
| --- | --- | --- |
| code / script / markup | `file:` | |
| config | `config:` | |
| docs | `document:` | |
| infra | `pipeline:` | `.github/workflows/*`, `.gitlab-ci.yml`, `.circleci/*`, `Jenkinsfile` |
| infra | `resource:` | `*.tf`, `*.tfvars`, `Vagrantfile` |
| infra | `service:` | |
| data | `endpoint:` | `openapi.yaml` / `swagger.json` |
| data | `schema:` | `*.graphql` / `*.gql` / `*.proto` / `*.prisma` |
| data | `table:` | |
| | `file:` | |

Caveats
- `infra` and `data` sub-classifications are path-heuristic, mirroring `buildNonCodeBatches()` Group A–D and the file-analyzer's "Choosing between infra/data sub-types" guidance. Edge cases like a `Makefile` (currently → `service:`) might mis-classify, but the damage from a mis-prefix is bounded: the merge step's dangling-edge dropper still catches it and the existing assemble-reviewer pass still has the recovery signal. Worst case it falls back to today's behavior.
- Scope is limited to `knownCrossBatchNodeIds` from `compute-batches.mjs` + the dispatch instructions, per the issue's "feel free to scope down" note. Verifying the full end-to-end gain (the 17/19 → 19/19 dangling-edge recovery rate) requires running `/understand` against a real multi-batch project, which is out of scope for this PR. The unit tests confirm the data shape and invariants.

Test plan
- `pnpm test tests/skill/understand/test_compute_batches.test.mjs`: 25 passed (19 existing + 6 new)
- `pnpm test` (full suite): 224 passed across 17 test files (no regressions in `extract-import-map`, `scan-project`, `merge-recover-imports`, `worktree-redirect`, dashboard utils, etc.)
- Run `/understand --full` against a project that has multi-batch doc-cross-references (e.g. a repo with `CLAUDE.md` + multi-package `src/`) and confirm the merge-batch-graphs.py "Dropped dangling edges" count drops vs. baseline. (Recommended but not blocking.)

🤖 Generated with Claude Code
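The per-batch computation and the reachability invariant checked by the new suite can be illustrated roughly as follows. This is a sketch under assumptions, not the PR's `buildKnownCrossBatchNodeIds()`: the batch shape (`{ files: [{ path, fileCategory }] }`), the helper names, and the simplified `demoPrefix` lookup are all hypothetical.

```javascript
// Hypothetical sketch: batch shape and all names here are assumptions.
// Simplified prefix lookup covering only the categories used in the demo.
const demoPrefix = (path, fileCategory) =>
  ({ docs: "document", config: "config" })[fileCategory] || "file";

// For each batch, return the sorted canonical IDs owned by all OTHER batches.
function sketchKnownCrossBatchNodeIds(batches, prefixFor) {
  const idsPerBatch = batches.map((batch) =>
    batch.files.map((f) => `${prefixFor(f.path, f.fileCategory)}:${f.path}`)
  );
  return idsPerBatch.map((_, i) =>
    idsPerBatch.filter((_, j) => j !== i).flat().sort()
  );
}

const demoBatches = [
  { files: [{ path: "src/index.ts", fileCategory: "code" }, { path: "src/util.ts", fileCategory: "code" }] },
  { files: [{ path: "CLAUDE.md", fileCategory: "docs" }, { path: "tsconfig.json", fileCategory: "config" }] },
];

const demoKnown = sketchKnownCrossBatchNodeIds(demoBatches, demoPrefix);
console.log(demoKnown[0]); // [ 'config:tsconfig.json', 'document:CLAUDE.md' ]
console.log(demoKnown[1]); // [ 'file:src/index.ts', 'file:src/util.ts' ]

// Reachability invariant: a batch's own IDs plus its known cross-batch IDs
// cover every file in the project, so "not in the list" means "not in the graph".
const demoAllIds = demoBatches.flatMap((b) =>
  b.files.map((f) => `${demoPrefix(f.path, f.fileCategory)}:${f.path}`)
);
const demoInvariantHolds = demoBatches.every((batch, i) => {
  const own = batch.files.map((f) => `${demoPrefix(f.path, f.fileCategory)}:${f.path}`);
  const reachable = new Set([...own, ...demoKnown[i]]);
  return demoAllIds.every((id) => reachable.has(id));
});
console.log(demoInvariantHolds); // true
```

Sorting each list keeps `batches.json` deterministic across runs, which is what the "sorted determinism" test case relies on.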