Skip to content

fix(shell): use git ls-files for @ file mention in large repos#1588

Open
n-WN wants to merge 15 commits intoMoonshotAI:mainfrom
n-WN:fix/file-mention-git-ls-files
Open

fix(shell): use git ls-files for @ file mention in large repos#1588
n-WN wants to merge 15 commits intoMoonshotAI:mainfrom
n-WN:fix/file-mention-git-ls-files

Conversation

@n-WN
Copy link
Copy Markdown
Collaborator

@n-WN n-WN commented Mar 26, 2026

Related Issue

Fixes #1375, helps #1276, #1339

Description

Root Cause

In large repositories (e.g. apache/superset with 65k+ files), the @ file mention completer fails to discover files in directories that sort late alphabetically. The root cause is os.walk() depth-first traversal with a hardcoded 1000-file limit — early directories exhaust the budget before later ones are visited.

Analysis

Analyzed how professional tools solve this:

Tool File Discovery Matching
Claude Code git ls-files primary → vendored rg --files fallback Custom bitmask fuzzy matcher
Codex CLI ignore crate (ripgrep internals), parallel walker nucleo (Helix editor's matcher)
fzf External process (git ls-files/fd), streaming Smith-Waterman variant, parallel
VS Code Pre-index + file watchers Custom fuzzyScore
kimi-cli (before) os.walk() + 1000 limit prompt_toolkit FuzzyCompleter

Common pattern: all professional tools use git ls-files or equivalent (reads git index, not filesystem). None use os.walk with hard truncation.

Fix

New shared module kimi_cli/utils/file_filter.py — unifies file discovery for Shell and Web:

  • git ls-files -z as primary file source: Reads git's pre-built index with NUL-delimited output (-z) — instant for 100k+ file repos, handles special characters (tab, quotes, backslash) correctly.
  • --recurse-submodules: Includes submodule files.
  • --others --exclude-standard: Discovers untracked files while respecting .gitignore.
  • --deleted exclusion: Filters stale renamed/deleted files (e.g. after mv old.py new.py without staging).
  • is_ignored() post-filtering: Tracked node_modules/, vendor/, etc. are filtered from results for consistency with the walk fallback.
  • Empty directory pruning: Synthesized directory entries with no surviving file children are removed.
  • Path traversal prevention: Rejects .. segments in scope for git paths; uses resolve() + is_relative_to() for walk paths.
  • Scoped search: When fragment contains / (e.g. @src/utils/), both git and walk restrict to that subtree.
  • os.walk fallback: Preserved for non-git repos with identical scoping and filtering.
  • .git/index mtime cache: 5s TTL + mtime change detection (borrowed from Claude Code) avoids redundant subprocess calls.

Verification

# Before: late-alphabetical dirs unreachable in large repos
@zzz_target/ → (no results)

# After: instant via git ls-files index
@zzz_target/ → zzz_target/important.py

Key files changed

  • src/kimi_cli/utils/file_filter.pynew: shared file discovery module
  • src/kimi_cli/ui/shell/prompt.py — delegates to file_filter, adds mtime cache
  • src/kimi_cli/web/api/sessions.py — restored unfiltered iterdir() for general-purpose file browser API
  • tests/utils/test_file_filter.pynew: 46 tests (git/walk parity, ignored dirs, deleted files, special chars, path traversal)
  • tests/ui_and_conv/test_file_completer.py — added git-mode integration tests
  • tests/e2e/test_file_mention_e2e.pynew: 4 PTY E2E tests

Checklist

  • I have read the CONTRIBUTING document.
  • I have linked the related issue, if any.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have run make gen-changelog to update the changelog.
  • I have run make gen-docs to update the user documentation.

@n-WN n-WN requested a review from RealKai42 March 27, 2026 08:37
@n-WN n-WN marked this pull request as ready for review March 27, 2026 09:05
devin-ai-integration[bot]

This comment was marked as resolved.

@n-WN n-WN force-pushed the fix/file-mention-git-ls-files branch from 7c0ff57 to 2da1698 Compare March 27, 2026 09:17
@n-WN
Copy link
Copy Markdown
Collaborator Author

n-WN commented Mar 27, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2da1698ad2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

except Exception:
return None

paths = _parse_ls_files_output(result.stdout)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reapply ignore filters to git-discovered completion paths

list_files_git now returns parsed git ls-files output directly, so deep completion mode (@ queries with length ≥3 or any scoped path) can surface paths under ignored buckets like node_modules/, vendor/, and target/ whenever those directories are tracked. The previous walker filtered these names everywhere, and losing that filter can flood candidates and make file mentions much less usable in repos that commit dependencies or build artifacts.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 976c71225d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

paths: list[str] = []
seen_dirs: set[str] = set()
for line in stdout.splitlines():
line = line.strip()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve filename whitespace when parsing git output

_parse_ls_files_output currently does line.strip(), which removes legitimate leading/trailing spaces from tracked filenames. In repos that contain files like " leading.txt" or "name ".py, completion entries are rewritten to a different path, so selecting the suggestion points to a non-existent file. This regression comes from the new git-based path source; the previous os.walk path handling preserved names exactly.

Useful? React with 👍 / 👎.

devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 409c8c5996

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

except Exception:
return None

paths = _parse_ls_files_output(result.stdout)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reapply ignore rules to git-discovered paths

list_files_git returns parsed git ls-files output without running is_ignored, so deep @ completion now includes tracked artifact/dependency trees like node_modules/, vendor/, and target/. This regresses the previous walker behavior and can flood completions in repos that commit generated assets. Fresh evidence: calling list_files_git on a temp repo with tracked node_modules/a.js returns that path, confirming no post-parse ignore filtering is applied.

Useful? React with 👍 / 👎.

if dir_path not in seen_dirs:
seen_dirs.add(dir_path)
paths.append(dir_path)
paths.append(line)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop deleted tracked files from git completion list

git ls-files includes index entries for files deleted from the working tree but not yet staged, and _parse_ls_files_output appends each line directly, so completions can suggest paths that no longer exist on disk. The old os.walk implementation only returned existing files, so this is a correctness regression that can produce invalid @ suggestions until users run git add -u/commit.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 378c27c93a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +245 to +246
if deleted:
paths = [p for p in paths if p.endswith("/") or p not in deleted]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop synthesized dirs for deleted tracked paths

list_files_git synthesizes directory entries in _parse_ls_files_output before removing git ls-files --deleted entries, but the deletion filter only removes file paths (p not in deleted). In a git repo where the last tracked file in a directory is deleted (and the directory removed on disk) without staging, completion still returns that non-existent directory (e.g., foo/), which regresses correctness versus the previous os.walk behavior that only surfaced existing paths.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8945e7def0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +153 to +154
for line in stdout.splitlines():
if not line:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse git ls-files output in NUL-delimited mode

list_files_git currently relies on newline-delimited git ls-files output and _parse_ls_files_output treats each raw line as a literal path, but Git emits C-style quoted paths for special characters (tabs/newlines/quotes/backslashes), so candidates can become escaped strings (e.g. "tab\\tname.py") that do not exist on disk. I validated this with a temp repo: list_files_git returned quoted/escaped names while list_files_walk returned the real filename. git ls-files -h documents -z (“separate paths with the NUL character”); using -z parsing would prevent this path corruption.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e89cd99e67

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +716 to +718
if self._is_git:
paths = list_files_git(self._root, scope)
self._git_index_mtime = git_index_mtime(self._root)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reapply completion cap for git-backed deep searches

When a repo is detected, deep @ completion now calls list_files_git without applying self._limit, so the candidate set becomes unbounded while the os.walk fallback still enforces the limit. In large repos this can feed tens of thousands of paths into FuzzyCompleter on each completion request, causing noticeable UI latency and negating the guardrail that the limit constructor argument previously provided.

Useful? React with 👍 / 👎.

@n-WN n-WN force-pushed the fix/file-mention-git-ls-files branch from e89cd99 to 68db846 Compare March 27, 2026 13:42
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68db846c12

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

"--recurse-submodules",
]
if scope:
cmd.append(scope + "/")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Insert -- before scoped pathspec in git calls

When scope is appended directly (for example -docs/), git ls-files treats it as an option instead of a pathspec and exits with an error, so deep @ completion for top-level directories/files starting with - silently falls back to list_files_walk and can miss candidates once the walk limit is hit. I verified this with git ls-files (which expects options before positional <file> args): git ls-files ... -docs/ returns an unknown-switch error, while git ls-files ... -- -docs/ works. The same option-termination issue applies to the scoped --deleted/--others calls too.

Useful? React with 👍 / 👎.

devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7d394cfe8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

"""
deleted = _git_ls_files_z(root, "--deleted", *_scope_args(scope))
for sub in _git_submodule_paths(root):
if scope and not sub.startswith(scope + "/") and scope != sub:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match scope against submodule prefix direction

The scope guard is reversed here: with a scoped query inside a submodule (for example scope="sub/pkg" and sub="sub"), not sub.startswith(scope + "/") evaluates true, so the submodule is skipped entirely. That means _git_deleted_files() and the analogous loop in list_files_git() do not apply --deleted/--others to that submodule, so scoped @sub/pkg/ completion can show stale deleted files and miss new untracked files. I reproduced this with a temp repo + submodule where untracked sub/pkg/b.py was present in unscoped results but missing in scoped results.

Useful? React with 👍 / 👎.

Comment on lines +202 to +204
parts = line.strip().split()
if len(parts) >= 2:
paths.append(parts[1])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse submodule paths without whitespace splitting

Using line.strip().split() to parse git submodule status breaks submodule paths that contain spaces (e.g. sub mod), because only the first token after the SHA is kept. When that happens, later deleted/untracked scans run against a non-existent path (root/sub) and dirty-state reconciliation for the real submodule is skipped, so completions can include deleted files and miss untracked ones under spaced submodule directories.

Useful? React with 👍 / 👎.

devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45dd8df406

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

"core.quotepath=false",
"ls-files",
"-z",
"--recurse-submodules",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reconcile dirty submodules in git file listing

list_files_git enables git ls-files --recurse-submodules for tracked paths, but the follow-up deleted/untracked reconciliation only runs in the superproject (_git_deleted_files and --others --exclude-standard without submodule handling). In practice this returns stale candidates for dirty submodules: deleting sub/a.txt and creating sub/new.txt inside a submodule still yields sub/a.txt and misses sub/new.txt. I verified the CLI semantics here: git ls-files --deleted --recurse-submodules and git ls-files --others --exclude-standard --recurse-submodules both fail with fatal: ls-files --recurse-submodules unsupported mode, so the current approach cannot keep submodule results in sync with the working tree.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 16 additional findings in Devin Review.

Open in Devin Review

"core.quotepath=false",
"ls-files",
"-z",
"--recurse-submodules",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 --recurse-submodules in main listing without matching submodule handling in --deleted causes stale submodule files in completions

The main git ls-files command at src/kimi_cli/utils/file_filter.py:230 uses --recurse-submodules, so files inside submodules are included in the listing. However, _git_deleted_files at line 189 does NOT recurse into submodules (because git ls-files --deleted --recurse-submodules is unsupported by git — it errors with "unsupported mode"). Similarly, the --others command at line 252 doesn't recurse into submodules.

This asymmetry means that if a file is deleted (or renamed) inside a submodule's working tree but still exists in its index, it will appear in the main listing but won't be caught by _git_deleted_files, resulting in stale completion candidates for files that no longer exist on disk. The previous commit (f7d394c) attempted to fix this by manually recursing into submodules, but was reverted (45dd8df), re-introducing the inconsistency.

Prompt for agents
In src/kimi_cli/utils/file_filter.py, the main git ls-files command at line 230 uses --recurse-submodules, but _git_deleted_files (line 189) and the --others command (line 252) do not recurse into submodules because git doesn't support combining those flags with --recurse-submodules. This causes stale submodule files to appear in completions.

Two options to fix:
1. Remove --recurse-submodules from the main listing (line 230) to make behavior consistent — submodule files simply won't appear in completions, which is an acceptable trade-off for correctness.
2. Re-implement submodule recursion for --deleted and --others by enumerating submodule paths via git submodule status and running separate git ls-files commands inside each submodule directory (this was attempted in commit f7d394cf but reverted — investigate why and fix the approach).
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

n-WN added 10 commits March 28, 2026 01:22
…#1375)

Replace os.walk with git ls-files as the primary file discovery
mechanism for the @ file mention completer. This fixes large repos
(e.g. apache/superset with 65k+ files) where the 1000-file limit
caused late-alphabetical directories to be unreachable.

Changes:
- Add _get_git_paths(): runs `git ls-files --recurse-submodules`
  with 5s timeout, extracts directory entries from file paths
- Add _walk_paths(): refactored os.walk fallback for non-git repos
- Add scoped search: when fragment contains "/" (e.g. @src/utils/),
  both git and walk paths restrict to that subtree
- Lazy git detection with _detect_git() (cached)
- Scope-aware caching: cache invalidates when search scope changes

Fixes MoonshotAI#1375, helps MoonshotAI#1276
…odule

Extract ignore rules, git ls-files integration, and directory listing
into kimi_cli/utils/file_filter.py. Both the shell completer and the
web backend API now use the same filtering logic.

Changes:
- New kimi_cli/utils/file_filter.py: shared is_ignored(), detect_git(),
  list_files_git(), list_files_walk(), list_directory_filtered()
- Shell prompt.py: delegate to file_filter instead of inline rules
- Web sessions.py: use list_directory_filtered() which filters .git,
  node_modules, etc. (fixes MoonshotAI#1339)
- New e2e tests (4): @ trigger, scoped search, git ls-files deep files,
  gitignore filtering
- Remove section separator banners and module docstring (match stdrc style)
- Make all constants private (_IGNORED_NAMES, _IGNORED_PATTERNS)
- Add untracked file discovery via git ls-files --others --exclude-standard
- Add .git/index mtime-based cache invalidation (vs pure TTL)
- Extract _parse_ls_files_output helper to reduce duplication
- Add git_index_mtime() utility for cache staleness detection
Reject scope values containing '..' to prevent @../ from escaping the
workspace root in both git ls-files and os.walk code paths.
The get_session_file endpoint is a general-purpose file browser, not an
@ mention completer.  Revert to listing all directory entries without
is_ignored() filtering so web UI users can still see node_modules,
build, dist, etc. when browsing their project.
Remove unnecessary strip() from _parse_ls_files_output — splitlines()
already handles line endings, and strip() would silently mangle
filenames with leading/trailing spaces.
Two regressions fixed:

1. Tracked ignored directories (node_modules/, vendor/, etc.) were
   bypassing is_ignored() when discovered via git ls-files, making
   completion inconsistent with top-level and os.walk fallback paths.
   Now _parse_ls_files_output applies is_ignored() to each path segment.

2. Files renamed/deleted on disk but not staged (e.g. mv old.py new.py
   without git add) remained as stale candidates because git ls-files
   reads the index. Now git ls-files --deleted is used to exclude
   working-tree deletions.

Add regression tests for both scenarios.
When all files under a directory are deleted from the working tree
(but not staged), the synthesized directory entry was left as a stale
candidate. Now directory entries are pruned if no surviving file
children exist after deleted-file exclusion.
43 tests covering:
- Git vs walk parity on clean repos (flat, nested, scoped, gitignored)
- Tracked ignored dir filtering (node_modules, vendor, __pycache__, etc.)
- Deleted/renamed file exclusion and empty dir pruning
- Untracked file discovery and gitignore respect
- Path traversal prevention
- is_ignored unit tests
- Non-git fallback behaviour
n-WN added 5 commits March 28, 2026 01:22
Filenames containing tab, quotes, or backslash were C-style quoted by
git even with core.quotepath=false, producing escaped strings like
"tab\there.py" that don't match real paths on disk.  Switch all three
git ls-files invocations to -z (NUL-delimited) mode and parse with
split('\0') instead of splitlines().

Add regression tests for tab, quote, and backslash in filenames.
Directory names starting with `-` (e.g. `-docs/`) are misinterpreted
as git options without `--`.  Extract `_scope_args()` helper to ensure
all three git ls-files invocations use the end-of-options marker.
git ls-files --deleted and --others do not support --recurse-submodules
(fatal: unsupported mode). Enumerate submodules via `git submodule
status` and run per-submodule queries, prefixing results back to the
root repo namespace. Also extract _git_ls_files_z() helper to reduce
subprocess boilerplate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

File mentions (@) can't find the files

1 participant