fix(compute-batches): emit knownCrossBatchNodeIds so doc-batches use correct prefixes (#303) #399

Open
tirth8205 wants to merge 1 commit into Egonex-AI:main from tirth8205:fix/compute-batches-known-cross-batch-ids

@tirth8205

Contributor

Summary

Closes #303. Doc-batch file-analyzer subagents were emitting cross-batch edges with the wrong type prefix (e.g. file:CLAUDE.md instead of document:CLAUDE.md) because the dispatch prompt gave them batchImportData[] and neighborMap{} but no map of the canonical node IDs allocated by other batches. Roughly 17 of 19 of those bad edges were recovered post hoc by the assemble-reviewer at extra LLM cost; the remaining 2 stayed dangling.

This change pre-computes the authoritative <prefix>:<path> for every file in compute-batches.mjs (Phase 1.5) and ships a per-batch knownCrossBatchNodeIds: string[] field in batches.json. The file-analyzer dispatch prompt now passes this list through with strict instructions: use the IDs verbatim, do not invent IDs, drop the edge if the target is not in the list.
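
The three dispatch rules (use IDs verbatim, do not invent IDs, drop unknown targets) are instructions to an LLM subagent, not code. As a rough illustration only, the same rule can be written as a deterministic filter. The function name, the edge shape `{ source, target, type }`, and the `ownNodeIds` parameter are assumptions made for this sketch and are not part of the PR.

```javascript
// Hypothetical illustration of the "drop the edge if the target is not in the
// list" rule. Nothing here is taken from file-analyzer.md; the edge shape and
// parameter names are assumed for the example.
function filterEdgesByKnownIds(candidateEdges, ownNodeIds, knownCrossBatchNodeIds) {
  // A target is acceptable only if it is one of this batch's own nodes or
  // appears verbatim in the list supplied at dispatch time.
  const allowed = new Set([...ownNodeIds, ...knownCrossBatchNodeIds]);
  const kept = [];
  const dropped = [];
  for (const edge of candidateEdges) {
    if (allowed.has(edge.target)) {
      kept.push(edge);
    } else {
      dropped.push(edge);
    }
  }
  return { kept, dropped };
}

// Example: the wrong-prefix edge from issue #303 is dropped rather than guessed.
const exampleResult = filterEdgesByKnownIds(
  [
    { source: "file:src/a.ts", target: "document:CLAUDE.md", type: "documents" },
    { source: "file:src/a.ts", target: "file:CLAUDE.md", type: "documents" },
  ],
  ["file:src/a.ts"],
  ["document:CLAUDE.md"]
);
```

In this example `exampleResult.kept` holds only the edge targeting document:CLAUDE.md, and the file:CLAUDE.md edge lands in `exampleResult.dropped`.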

  • compute-batches.mjs: adds nodePrefixForFile() (mirrors file-analyzer.md "Node type mapping by fileCategory" + merge-batch-graphs.py TYPE_TO_PREFIX) and buildKnownCrossBatchNodeIds() (sorted per-batch list of all-other-batches' canonical IDs). New field is wired into the second-pass batch enrichment alongside batchImportData and neighborMap.
  • agents/file-analyzer.md: adds a "Cross-batch node IDs (knownCrossBatchNodeIds) — STRICT prefix authority" section with rules for all edge types whose target lives in another batch.
  • skills/understand/SKILL.md Phase 2: passes knownCrossBatchNodeIds through the dispatch template.
  • tests/skill/understand/test_compute_batches.test.mjs: adds a new knownCrossBatchNodeIds (issue #303) suite (6 cases) backed by a new scan-result-known-cross-batch-ids.json fixture that exercises the exact #303 scenario (docs/config/infra batches with cross-references). The suite asserts that the field exists on every batch, excludes the batch's own files, uses the correct prefix per fileCategory (the document:CLAUDE.md vs file:CLAUDE.md invariant), is sorted, and that the union own ∪ knownCrossBatchNodeIds reaches every project file (the invariant the agent relies on for "drop the edge if not in the list").
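
For readers who have not opened the diff, the per-batch computation described above can be sketched roughly as follows. This is not the code in compute-batches.mjs: the function name, the assumed batch shape `{ files: [{ path, fileCategory }] }`, and the `prefixFor` callback (standing in for nodePrefixForFile()) are all illustrative assumptions.

```javascript
// Hypothetical sketch of the idea behind buildKnownCrossBatchNodeIds().
// Assumed input: batches = [{ files: [{ path, fileCategory }] }, ...]
// Assumed helper: prefixFor(file) returns a prefix such as "file" or "document".
function buildKnownCrossBatchNodeIdsSketch(batches, prefixFor) {
  // Step 1: compute the canonical <prefix>:<path> ID for every file, per batch.
  const idsPerBatch = batches.map((batch) =>
    batch.files.map((file) => `${prefixFor(file)}:${file.path}`)
  );

  // Step 2: for each batch, collect the IDs owned by all OTHER batches.
  return batches.map((_, index) => {
    const own = new Set(idsPerBatch[index]);
    const others = new Set();
    idsPerBatch.forEach((ids, otherIndex) => {
      if (otherIndex === index) return;
      for (const id of ids) {
        if (!own.has(id)) others.add(id);
      }
    });
    // Sorting keeps the emitted list deterministic across runs.
    return [...others].sort();
  });
}

// Example with a simplified two-category prefix function.
const simplePrefix = (file) => (file.fileCategory === "docs" ? "document" : "file");
const demoBatches = [
  { files: [{ path: "CLAUDE.md", fileCategory: "docs" }] },
  {
    files: [
      { path: "src/b.ts", fileCategory: "code" },
      { path: "src/a.ts", fileCategory: "code" },
    ],
  },
];
const demoKnownIds = buildKnownCrossBatchNodeIdsSketch(demoBatches, simplePrefix);
// demoKnownIds[0] is ["file:src/a.ts", "file:src/b.ts"]
// demoKnownIds[1] is ["document:CLAUDE.md"]
```

Building the full ID map once and subtracting each batch's own IDs keeps the prefix decision in a single place, which is the point of the change: the subagent no longer has to infer a prefix from a path.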

fileCategory → prefix mapping (mirrors file-analyzer.md and merge-batch-graphs.py)

| fileCategory | Prefix | Notes |
| --- | --- | --- |
| code / script / markup | `file:` | per file-analyzer.md "Node type mapping by fileCategory" |
| config | `config:` | |
| docs | `document:` | |
| infra | `pipeline:` | `.github/workflows/*`, `.gitlab-ci.yml`, `.circleci/*`, `Jenkinsfile` |
| infra | `resource:` | `*.tf`, `*.tfvars`, `Vagrantfile` |
| infra | `service:` | default (Dockerfile, docker-compose, K8s, …) |
| data | `endpoint:` | `openapi.yaml`/`swagger.json` |
| data | `schema:` | `*.graphql`/`*.gql`/`*.proto`/`*.prisma` |
| data | `table:` | default (SQL) |
| unknown | `file:` | defensive fallback (matches merge-batch-graphs default) |
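
The mapping above can be read as a small decision function. The sketch below follows the listed rows literally; it is not the nodePrefixForFile() implementation from the PR, and the function name, signature, and exact path checks (for example, matching `.github/workflows/` only at the repository root) are assumptions.

```javascript
// Hypothetical sketch of the fileCategory -> prefix mapping listed above.
// Returns the prefix without the trailing colon.
function nodePrefixForFileSketch(path, fileCategory) {
  const base = path.split("/").pop();
  switch (fileCategory) {
    case "code":
    case "script":
    case "markup":
      return "file";
    case "config":
      return "config";
    case "docs":
      return "document";
    case "infra":
      // CI definitions map to pipeline nodes.
      if (
        path.startsWith(".github/workflows/") ||
        path.startsWith(".circleci/") ||
        base === ".gitlab-ci.yml" ||
        base === "Jenkinsfile"
      ) {
        return "pipeline";
      }
      // Provisioning files map to resource nodes.
      if (/\.(tf|tfvars)$/.test(base) || base === "Vagrantfile") {
        return "resource";
      }
      // Everything else (Dockerfile, docker-compose, K8s, ...) is a service.
      return "service";
    case "data":
      if (base === "openapi.yaml" || base === "swagger.json") return "endpoint";
      if (/\.(graphql|gql|proto|prisma)$/.test(base)) return "schema";
      return "table"; // default, e.g. SQL
    default:
      return "file"; // defensive fallback for unknown categories
  }
}

// The issue #303 invariant: a docs file gets "document", never "file".
const claudePrefix = nodePrefixForFileSketch("CLAUDE.md", "docs");
const workflowPrefix = nodePrefixForFileSketch(".github/workflows/ci.yml", "infra");
```

Here `claudePrefix` is "document" and `workflowPrefix` is "pipeline", so the resulting node IDs would be document:CLAUDE.md and pipeline:.github/workflows/ci.yml.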

Caveats

  • The infra and data sub-classifications are path-heuristic, mirroring buildNonCodeBatches() Group A–D and the file-analyzer's "Choosing between infra/data sub-types" guidance. Edge cases such as a Makefile (currently → service:) might be misclassified, but the impact of a wrong prefix is bounded: the merge step's dangling-edge dropper still catches it, and the existing assemble-reviewer pass still has the recovery signal. In the worst case, behavior falls back to what it is today.
  • I scoped this to JUST emitting knownCrossBatchNodeIds from compute-batches.mjs + the dispatch instructions, per the issue's "feel free to scope down" note. Verifying the full end-to-end gain (the 17/19 → 19/19 dangling-edge recovery rate) requires running /understand against a real multi-batch project, which is out of scope for this PR. The unit tests confirm the data shape and invariants.
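
For anyone who wants to spot-check a generated batches.json by hand before a full end-to-end run, the invariants the unit tests cover (field present, own IDs excluded, sorted, full reachability) could be checked with something like the following. This is a hypothetical helper, not part of the PR; in particular the `ownNodeIds` field is an assumption introduced here so the example is self-contained, and the real batch entries may expose own files differently.

```javascript
// Hypothetical invariant checker for per-batch knownCrossBatchNodeIds.
// Assumed input: batches = [{ ownNodeIds: string[], knownCrossBatchNodeIds: string[] }]
// Returns a list of human-readable problems (empty when all invariants hold).
function checkKnownCrossBatchInvariants(batches, allProjectIds) {
  const problems = [];
  batches.forEach((batch, index) => {
    const known = batch.knownCrossBatchNodeIds;
    if (!Array.isArray(known)) {
      problems.push(`batch ${index}: knownCrossBatchNodeIds is missing`);
      return;
    }
    const own = new Set(batch.ownNodeIds);

    // Invariant: the list never contains the batch's own IDs.
    if (known.some((id) => own.has(id))) {
      problems.push(`batch ${index}: list contains own IDs`);
    }

    // Invariant: the list is sorted (deterministic output).
    const sorted = [...known].sort();
    if (sorted.some((id, position) => id !== known[position])) {
      problems.push(`batch ${index}: list is not sorted`);
    }

    // Invariant: own ∪ known reaches every project file.
    const reachable = new Set([...own, ...known]);
    const missing = allProjectIds.filter((id) => !reachable.has(id));
    if (missing.length > 0) {
      problems.push(`batch ${index}: unreachable IDs: ${missing.join(", ")}`);
    }
  });
  return problems;
}

const invariantDemo = checkKnownCrossBatchInvariants(
  [
    { ownNodeIds: ["document:CLAUDE.md"], knownCrossBatchNodeIds: ["file:src/a.ts"] },
    { ownNodeIds: ["file:src/a.ts"], knownCrossBatchNodeIds: ["document:CLAUDE.md"] },
  ],
  ["document:CLAUDE.md", "file:src/a.ts"]
);
```

For this well-formed input `invariantDemo` is an empty array. The check does not measure the end-to-end effect on dangling edges, which still needs a real /understand run.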

Test plan

  • pnpm test tests/skill/understand/test_compute_batches.test.mjs — 25 passed (19 existing + 6 new)
  • pnpm test (full suite) — 224 passed across 17 test files (no regressions in extract-import-map, scan-project, merge-recover-imports, worktree-redirect, dashboard utils, etc.)
  • End-to-end: run /understand --full against a project that has multi-batch doc-cross-references (e.g. a repo with CLAUDE.md + multi-package src/) and confirm the merge-batch-graphs.py "Dropped dangling edges" count drops vs. baseline. (Recommended but not blocking.)

🤖 Generated with Claude Code

…correct prefixes (Egonex-AI#303)

During Phase 2, doc/test batch file-analyzer subagents emit edges to nodes
they don't own using the wrong type prefix (e.g. file:CLAUDE.md instead of
document:CLAUDE.md). The assemble-reviewer recovers most via post-hoc LLM
repair but at extra cost, and the residual stays dangling. Root cause: the
dispatch prompt provided batchImportData[] + neighborMap{} but no map of
canonical node IDs allocated by other batches, forcing the agent to guess
the prefix from the path.

This change:
- Adds nodePrefixForFile() in compute-batches.mjs that mirrors the file
  category -> node type mapping documented in file-analyzer.md and the
  TYPE_TO_PREFIX dict in merge-batch-graphs.py (code/script/markup -> file,
  config -> config, docs -> document, infra -> service/pipeline/resource
  by path heuristic, data -> table/schema/endpoint by path heuristic).
- Adds buildKnownCrossBatchNodeIds() that pre-computes the canonical
  <prefix>:<path> ID for every file and, per batch, returns the sorted
  array of IDs owned by all OTHER batches.
- Emits the new knownCrossBatchNodeIds: string[] field on every batch
  entry written to batches.json.
- Updates agents/file-analyzer.md with a STRICT prefix-authority section:
  use the list verbatim, do not invent IDs, drop the edge if the target
  is not in the list.
- Updates skills/understand/SKILL.md Phase 2 dispatch template to pass
  knownCrossBatchNodeIds through to the file-analyzer subagent.
- Adds 6 vitest cases covering: presence on every batch, exclusion of own
  files, correct per-fileCategory prefixes (the exact Egonex-AI#303 scenario with
  CLAUDE.md, Dockerfile, tsconfig.json, .github/workflows/), sorted
  determinism, and the reachability invariant the agent relies on
  ("not in the list -> not in the graph").

Closes Egonex-AI#303.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

doc-batch file-analyzers emit wrong-prefix dangling edges to cross-batch nodes (need known-prefix map at dispatch)