feat: index files inside git submodules #93

Open
andreinknv wants to merge 1 commit into colbymchenry:main from andreinknv:feat/index-submodules

Conversation

@andreinknv
Contributor

Summary

git ls-files (used for both the initial scan and incremental sync) doesn't enter submodules — they appear as gitlink entries with their contents invisible. Both file-discovery paths now recurse into active submodules. Closes #86.

  • getGitVisibleFiles (full index) enumerates active submodules via git submodule foreach --recursive --quiet and runs git ls-files -co --exclude-standard inside each, prefixing paths so files are reported relative to the parent root.
  • getGitChangedFiles (sync) was refactored to share its status-parsing logic between the parent repo and each submodule. Submodule directory entries that the parent's status emits (e.g. " m vendor/sub" when the pointer moved) are filtered out so we don't try to read a directory as a file.
  • Status output paths are now C-style-unquoted before use, so submodule directories containing spaces or non-ASCII bytes are handled correctly (also fixes an existing latent bug for non-submodule paths with spaces).
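
The submodule-directory filtering described in the second bullet can be sketched as below. This is a minimal illustration, not the code in this PR: `parseStatusLines` and its signature are hypothetical, it assumes `git status --porcelain` (v1) input, and it skips path unquoting to stay short.

```typescript
// Hypothetical helper (not from the PR's diff): turns porcelain v1 status
// output into a list of changed file paths, dropping submodule directories.
export function parseStatusLines(
  output: string,
  submoduleDirs: ReadonlySet<string>,
): string[] {
  const files: string[] = [];
  for (const line of output.split("\n")) {
    // Each entry is "XY <path>": two status characters, a space, the path.
    if (line.length < 4) continue;
    const status = line.slice(0, 2);
    let path = line.slice(3);

    // Renames and copies are reported as "old -> new"; keep the new path.
    if (status[0] === "R" || status[0] === "C") {
      const arrow = path.indexOf(" -> ");
      if (arrow !== -1) path = path.slice(arrow + 4);
    }

    // A submodule shows up as a single directory entry in the parent's
    // status. It is a directory, not a file, so skip it here; its contents
    // are expected to come from a status run inside the submodule.
    if (submoduleDirs.has(path.replace(/\/$/, ""))) continue;

    files.push(path);
  }
  return files;
}
```

Filtering by path rather than by status code keeps the sketch independent of which status letters git uses for a given submodule state.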

Opt-out

Submodule indexing is on by default. For repos with very large vendor submodules, set indexSubmodules: false in CodeGraphConfig to skip them. Path-based excludes (e.g. '**/vendor/**') also still work.
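
A sketch of the two opt-out routes, for illustration only. `ConfigSketch` and `resolveConfig` are hypothetical stand-ins for the project's `CodeGraphConfig` and `DEFAULT_CONFIG` handling, and the `exclude` field name is an assumption; only `indexSubmodules` and its default of `true` come from this PR.

```typescript
// Stand-in for the project's CodeGraphConfig. Only indexSubmodules is
// described in this PR; the exclude field name is an assumption.
interface ConfigSketch {
  indexSubmodules?: boolean;
  exclude?: string[];
}

// Mirrors the stated default: submodule indexing is on unless disabled.
function resolveConfig(user: ConfigSketch): Required<ConfigSketch> {
  return {
    indexSubmodules: user.indexSubmodules ?? true,
    exclude: user.exclude ?? [],
  };
}

// Option 1: skip every submodule regardless of where it lives.
const noSubmodules = resolveConfig({ indexSubmodules: false });

// Option 2: keep submodule indexing on, but exclude vendor trees by path.
const noVendor = resolveConfig({ exclude: ["**/vendor/**"] });
```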

Behavior on failure

  • Parent-repo git status failure → falls back to the full filesystem scan (preserves the prior null contract).
  • Submodule-internal git status / ls-files failure → silently absorbed (e.g. uninitialized or partially fetched submodules).
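
The two failure rules above could look roughly like this. It is a sketch under assumptions: `readStatusOutputs` and the injected `GitRunner` are hypothetical names, and the runner is assumed to throw when git exits with an error.

```typescript
// Hypothetical runner: executes git with the given args in cwd and returns
// stdout, throwing on failure. Injected so the sketch stays self-contained.
type GitRunner = (cwd: string, args: string[]) => string;

// Returns raw status output keyed by submodule path ("" for the parent),
// or null when the parent status itself fails.
export function readStatusOutputs(
  root: string,
  submodulePaths: string[],
  run: GitRunner,
): Map<string, string> | null {
  const outputs = new Map<string, string>();

  try {
    outputs.set("", run(root, ["status", "--porcelain"]));
  } catch {
    // Parent failure: signal the caller to fall back to a full scan.
    return null;
  }

  for (const sub of submodulePaths) {
    try {
      outputs.set(sub, run(`${root}/${sub}`, ["status", "--porcelain"]));
    } catch {
      // Submodule failure (e.g. uninitialized): skip it, keep the rest.
    }
  }
  return outputs;
}
```

Keeping `null` reserved for the parent failure means a broken submodule never forces the more expensive filesystem fallback.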

Files changed

| File | Change |
| --- | --- |
| src/extraction/index.ts | Add getGitSubmodules / getSubmoduleFiles; recurse into submodules from getGitVisibleFiles; refactor getGitChangedFiles to share readGitStatus; add unquoteGitPath for C-style-quoted porcelain paths |
| src/types.ts | Add indexSubmodules?: boolean to CodeGraphConfig, default true in DEFAULT_CONFIG |
| __tests__/sync.test.ts | 5 new tests: indexAll picks up submodule contents, sync detects modifications inside a submodule, sync detects new untracked files inside a submodule, missing submodule directory doesn't break the scan, indexSubmodules: false skips submodule contents |

Test plan

  • npm test (serialized): 385/385 pass (was 379; added 6: 5 submodule tests + 1 internal helper test, indirectly)
  • npx tsc --noEmit clean
  • npm run build clean
  • Reviewer to spot-check on a real repo with multiple submodules

🤖 Generated with Claude Code

`git ls-files` (used for both the initial scan and incremental sync)
does not enter submodules — they appear as gitlink entries with their
contents invisible. As a result, source files inside submodules were
silently skipped during indexing.

Both file-discovery paths now recurse into active submodules:

  - getGitVisibleFiles (full index) enumerates active submodules via
    `git submodule foreach --recursive --quiet 'echo "$displaypath"'`
    and runs `git ls-files -co --exclude-standard` inside each, prefixing
    the submodule path so files are reported relative to the parent root.

  - getGitChangedFiles (sync) was refactored to share its status-parsing
    logic between the parent repo and each submodule. Submodule directory
    entries that the parent's status emits when a submodule pointer moves
    (e.g., " m vendor/sub") are filtered out so we don't try to read a
    directory as a file.
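
The enumerate-then-prefix step in the first item can be sketched as two pure functions over command output. Both names are hypothetical and not taken from the diff, and the sketch assumes plain newline-separated output with no quoted paths.

```typescript
// Hypothetical: parses the output of
//   git submodule foreach --recursive --quiet 'echo "$displaypath"'
// which prints one submodule path per line, relative to the parent root.
export function parseSubmodulePaths(foreachOutput: string): string[] {
  return foreachOutput.split("\n").filter((line) => line.length > 0);
}

// Hypothetical: takes the output of `git ls-files -co --exclude-standard`
// run inside a submodule and rewrites each path relative to the parent root.
export function prefixSubmoduleFiles(
  submodulePath: string,
  lsFilesOutput: string,
): string[] {
  const prefix = submodulePath.replace(/\/+$/, ""); // drop trailing slashes
  return lsFilesOutput
    .split("\n")
    .filter((line) => line.length > 0)
    .map((file) => `${prefix}/${file}`);
}
```

Because `--recursive` already lists nested submodules by their full display path, each one can be prefixed independently without tracking nesting depth.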

Submodule indexing is on by default and can be disabled via
`indexSubmodules: false` in CodeGraphConfig — useful for repos with
large vendor submodules that should remain unindexed without having to
add a path-based exclude. Uninitialized / missing submodules are
silently skipped (best-effort enhancement on top of the existing scan).

Status output paths are now C-style-unquoted before being used or
compared against the submodule directory set, so submodule paths
containing spaces or non-ASCII bytes are handled correctly. The parent
status command failing still falls back to the full filesystem scan via
a null return, preserving the prior contract; only submodule-internal
status failures are absorbed silently.
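
For readers unfamiliar with git's C-style quoting, a minimal unquoting sketch follows. It shares the name `unquoteGitPath` with the helper added in this PR, but the body is an independent illustration and may differ from the actual implementation. It assumes quoted paths are wrapped in double quotes, with backslash escapes and three-digit octal escapes for raw UTF-8 bytes.

```typescript
// Single-character escapes mapped to their byte values.
const SIMPLE_ESCAPES: Record<string, number> = {
  a: 7, b: 8, t: 9, n: 10, v: 11, f: 12, r: 13, '"': 34, "\\": 92,
};

// Illustrative sketch: returns the input unchanged when it is not quoted
// or when it contains an escape this sketch does not understand.
export function unquoteGitPath(raw: string): string {
  if (raw.length < 2 || raw[0] !== '"' || raw[raw.length - 1] !== '"') {
    return raw;
  }
  const body = raw.slice(1, -1);
  let encoded = ""; // percent-encoded form, decoded as UTF-8 at the end
  let i = 0;
  while (i < body.length) {
    if (body[i] !== "\\") {
      // Copy the literal run up to the next backslash.
      let next = body.indexOf("\\", i);
      if (next === -1) next = body.length;
      encoded += encodeURIComponent(body.slice(i, next));
      i = next;
      continue;
    }
    let byte: number;
    const octal = body.slice(i + 1, i + 4);
    if (/^[0-3][0-7]{2}$/.test(octal)) {
      byte = parseInt(octal, 8); // one raw byte, e.g. \303 -> 0xC3
      i += 4;
    } else {
      const code = SIMPLE_ESCAPES[body.charAt(i + 1)];
      if (code === undefined) return raw; // unknown or truncated escape
      byte = code;
      i += 2;
    }
    encoded += "%" + (byte < 16 ? "0" : "") + byte.toString(16);
  }
  try {
    return decodeURIComponent(encoded); // reassembles multi-byte UTF-8
  } catch {
    return raw; // byte sequence was not valid UTF-8
  }
}
```

For example, `"caf\303\251/x.ts"` in status output unquotes to `café/x.ts`, which can then be compared against the submodule directory set.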

Closes colbymchenry#86.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@giovannidoni

Stumbled upon this issue. Cool to see there is a PR open for it!

Development

Successfully merging this pull request may close these issues.

Submodules scan

2 participants