feat: index files inside git submodules #43
Open
mschreib28 wants to merge 1 commit into
`git ls-files` (used for both the initial scan and incremental sync)
does not enter submodules — they appear as gitlink entries with their
contents invisible. As a result, source files inside submodules were
silently skipped during indexing.
Both file-discovery paths now recurse into active submodules:
- getGitVisibleFiles (full index) enumerates active submodules via
`git submodule foreach --recursive --quiet 'echo "$displaypath"'`
and runs `git ls-files -co --exclude-standard` inside each, prefixing
the submodule path so files are reported relative to the parent root.
- getGitChangedFiles (sync) was refactored to share its status-parsing
logic between the parent repo and each submodule. Submodule directory
entries that the parent's status emits when a submodule pointer moves
(e.g., " m vendor/sub") are filtered out so we don't try to read a
directory as a file.
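
As a rough illustration of the two steps above, the path handling might look like the following sketch. The helper names (`splitLines`, `prefixSubmoduleFiles`, `dropSubmoduleDirs`) are hypothetical and not taken from the PR; the sketch assumes the git output has already been captured as a string and covers only the path handling, not the git invocations themselves.

```typescript
// Hypothetical helpers for illustration only; the PR's real functions may differ.

/** Split newline-separated command output (e.g. submodule display paths) into entries. */
function splitLines(output: string): string[] {
  return output
    .split("\n")
    .map((line) => line.replace(/\r$/, ""))
    .filter((line) => line.length > 0);
}

/** Prefix files listed inside a submodule so they are relative to the parent root. */
function prefixSubmoduleFiles(submodulePath: string, files: string[]): string[] {
  const base = submodulePath.replace(/\/+$/, "");
  return files.map((file) => `${base}/${file}`);
}

/** Drop changed-path entries that name a submodule directory rather than a file. */
function dropSubmoduleDirs(changed: string[], submodules: string[]): string[] {
  const dirs = new Set(submodules.map((p) => p.replace(/\/+$/, "")));
  return changed.filter((p) => !dirs.has(p.replace(/\/+$/, "")));
}
```

With these, a parent status entry such as `vendor/sub` would be removed from the changed list, while `vendor/sub/src/a.ts` (reported from inside the submodule and then prefixed) would be kept.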
Submodule indexing is on by default and can be disabled via
`indexSubmodules: false` in CodeGraphConfig — useful for repos with
large vendor submodules that should remain unindexed without having to
add a path-based exclude. Uninitialized / missing submodules are
silently skipped (best-effort enhancement on top of the existing scan).
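
A minimal sketch of how the opt-out could be resolved, assuming a simplified stand-in for `CodeGraphConfig` that contains only the new field. The `shouldIndexSubmodules` helper is hypothetical, and the real config type has other fields not shown here.

```typescript
// Simplified stand-in: only `indexSubmodules` comes from the PR description.
interface CodeGraphConfig {
  indexSubmodules?: boolean;
}

// Assumed default, matching "on by default" in the description.
const DEFAULT_CONFIG: Required<CodeGraphConfig> = { indexSubmodules: true };

/** Hypothetical helper: an unset option falls back to the default (true). */
function shouldIndexSubmodules(config: CodeGraphConfig): boolean {
  return config.indexSubmodules ?? DEFAULT_CONFIG.indexSubmodules;
}
```

Under this sketch, an empty config keeps submodule indexing enabled, and only an explicit `indexSubmodules: false` turns it off.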
Status output paths are now C-style-unquoted before being used or
compared against the submodule directory set, so submodule paths
containing spaces or non-ASCII bytes are handled correctly. The parent
status command failing still falls back to the full filesystem scan via
a null return, preserving the prior contract; only submodule-internal
status failures are absorbed silently.
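
For readers unfamiliar with git's C-style quoting, here is one way the unquoting step could be written. This is a sketch under assumptions, not the PR's actual helper: it assumes the quoted form is wrapped in double quotes, uses backslash escapes, and encodes non-ASCII bytes as three-digit octal escapes that together form UTF-8. Decoding goes through percent-encoding so that no Node-specific APIs are needed.

```typescript
// Hypothetical re-implementation for illustration; the PR's helper may differ.
const SIMPLE_ESCAPES: Record<string, number> = {
  a: 0x07, b: 0x08, t: 0x09, n: 0x0a, v: 0x0b, f: 0x0c, r: 0x0d,
  '"': 0x22, "\\": 0x5c,
};

function unquoteGitPath(raw: string): string {
  // Paths that are not wrapped in double quotes are returned unchanged.
  if (raw.length < 2 || raw.charAt(0) !== '"' || raw.charAt(raw.length - 1) !== '"') {
    return raw;
  }
  const inner = raw.slice(1, -1);
  let encoded = ""; // percent-encoded byte sequence, decoded as UTF-8 at the end
  const pushByte = (byte: number): void => {
    encoded += "%" + (byte < 16 ? "0" : "") + byte.toString(16);
  };
  let i = 0;
  while (i < inner.length) {
    if (inner.charAt(i) !== "\\") {
      // Copy a literal run up to the next backslash.
      const end = inner.indexOf("\\", i);
      const stop = end === -1 ? inner.length : end;
      encoded += encodeURIComponent(inner.slice(i, stop));
      i = stop;
      continue;
    }
    const octal = inner.slice(i + 1, i + 4);
    const simple = SIMPLE_ESCAPES[inner.charAt(i + 1)];
    if (/^[0-7]{3}$/.test(octal)) {
      pushByte(parseInt(octal, 8) & 0xff); // \NNN is one raw byte
      i += 4;
    } else if (simple !== undefined) {
      pushByte(simple); // \t, \n, \", \\ and similar
      i += 2;
    } else {
      pushByte(0x5c); // unknown escape: keep the backslash literally
      i += 1;
    }
  }
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    return raw; // bytes were not valid UTF-8; fall back to the raw text
  }
}
```

Comparing unquoted paths against the submodule directory set means an entry like `"vendor/my sub"` is matched in the same form that the submodule listing reports it.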
Closes colbymchenry#86.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

`git ls-files` (used for both the initial scan and incremental sync) doesn't enter submodules: they appear as gitlink entries with their contents invisible. Both file-discovery paths now recurse into active submodules. Closes colbymchenry#86.

- `getGitVisibleFiles` (full index) enumerates active submodules via `git submodule foreach --recursive --quiet` and runs `git ls-files -co --exclude-standard` inside each, prefixing paths so files are reported relative to the parent root.
- `getGitChangedFiles` (sync) was refactored to share its status-parsing logic between the parent repo and each submodule. Submodule directory entries the parent's status emits (e.g. ` m vendor/sub` when the pointer moved) are filtered out so we don't try to read a directory as a file.
- Status output paths are now C-style-unquoted before use, so submodule directories containing spaces or non-ASCII bytes are handled correctly (also fixes an existing latent bug for non-submodule paths with spaces).

## Opt-out

Submodule indexing is on by default. For repos with very large vendor submodules, set `indexSubmodules: false` in `CodeGraphConfig` to skip them. Path-based excludes (e.g. `'**/vendor/**'`) also still work.

## Behavior on failure

- Parent-repo `git status` failure → falls back to the full filesystem scan (preserves the prior `null` contract).
- Submodule-internal `git status` / `ls-files` failure → silently absorbed (e.g. uninitialized or partially fetched submodules).

## Files changed

| File | Change |
|---|---|
| `src/extraction/index.ts` | Add `getGitSubmodules` / `getSubmoduleFiles`; recurse into submodules from `getGitVisibleFiles`; refactor `getGitChangedFiles` to share `readGitStatus`; add `unquoteGitPath` for C-style-quoted porcelain paths |
| `src/types.ts` | Add `indexSubmodules?: boolean` to `CodeGraphConfig`, default `true` in `DEFAULT_CONFIG` |
| `__tests__/sync.test.ts` | 5 new tests: indexAll picks up submodule contents, sync detects modifications inside a submodule, sync detects new untracked files inside a submodule, missing submodule directory doesn't break the scan, `indexSubmodules: false` skips submodule contents |

## Test plan

- [x] `npm test` (serialized): 385/385 pass (was 379, added 6: 5 submodule + 1 internal helper test indirectly)
- [x] `npx tsc --noEmit` clean
- [x] `npm run build` clean
- [ ] Reviewer to spot-check on a real repo with multiple submodules

🤖 Generated with Claude Code

Copied from colbymchenry/codegraph#93