
Windows: tree-sitter definitions pass fails on ALL files when repo_path contains non-ASCII (CJK / Cyrillic / Arabic / accented Latin / etc.) characters #636

@maxniu1

Bug Description

On Windows, when the repo_path passed to index_repository contains Chinese (CJK) characters in directory or file names, the definitions pass (tree-sitter AST parsing) fails on every file: it reports defs=0, calls=0, imports=0, errors=N, where N is the number of files. The resulting knowledge graph contains only File/Folder/Project nodes, with no Function, Class, Method, or Variable nodes, which makes the index effectively useless for code intelligence.

This is more severe than issue #571 (project name truncation for CJK paths): that issue is cosmetic, while this one prevents all code-level analysis.

Steps to Reproduce

  1. On Windows, create or index a project whose path contains Chinese characters, e.g.:
    F:/项目资产-ProjectAssets/智能体/智能体原型/
    
  2. Run:
    codebase-memory-mcp cli index_repository '{"repo_path":"F:/项目资产-ProjectAssets/智能体/智能体原型"}'
    
  3. Observe the pipeline log — the definitions pass shows:
    pass=definitions defs=0 calls=0 imports=0 errors=41
    
  4. get_architecture returns only File/Folder nodes, zero Function/Class/Method nodes
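
The steps above can be scripted. The following is a minimal sketch, assuming the codebase-memory-mcp binary is on PATH and accepts the JSON argument exactly as shown in step 2; build_index_command and run_index are hypothetical helper names, not part of the tool.

```python
import json
import subprocess


def build_index_command(repo_path: str) -> list[str]:
    """Build the argv for the index_repository call shown in step 2."""
    # ensure_ascii=False keeps the CJK characters literal instead of \uXXXX escapes,
    # so the payload matches the command in step 2 byte for byte.
    payload = json.dumps(
        {"repo_path": repo_path}, ensure_ascii=False, separators=(",", ":")
    )
    return ["codebase-memory-mcp", "cli", "index_repository", payload]


def run_index(repo_path: str) -> subprocess.CompletedProcess:
    """Run the indexer and capture its output (assumes the binary is on PATH)."""
    return subprocess.run(
        build_index_command(repo_path),
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",  # do not crash if the tool emits non-UTF-8 bytes
    )
```

Passing the arguments as a list avoids shell quoting differences between git-bash and other shells.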

Expected Behavior

Files at paths with Chinese characters should be parsed by tree-sitter identically to ASCII-only paths. A 41-file TypeScript/JS/CSS project should produce, e.g., defs=200+ with errors=0.

Actual Behavior (Verbose Pipeline Log)

pass=discover files=41
pass=structure nodes=60 edges=58           ← File/Folder tree works fine
pass=definitions defs=0 calls=0 imports=0 errors=41  ← ALL files fail
pass=calls total=0 resolved=0 unresolved=0 errors=41
pass=semantic inherits=0 decorates=0 implements=0 errors=41
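
The failure signature in this log (zero definitions, one error per discovered file) can be checked mechanically. This is a sketch only, assuming log lines keep the `pass=<name> key=value ...` shape shown above; parse_pass_line and definitions_pass_failed are hypothetical names.

```python
def parse_pass_line(line: str) -> dict[str, str]:
    """Split one pipeline log line into its key=value fields."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore annotation words that carry no '='
            fields[key] = value
    return fields


def definitions_pass_failed(log_text: str) -> bool:
    """True when the definitions pass found nothing and errored on every file."""
    passes = {}
    for line in log_text.splitlines():
        fields = parse_pass_line(line)
        if "pass" in fields:
            passes[fields["pass"]] = fields
    discover = passes.get("discover")
    definitions = passes.get("definitions")
    if discover is None or definitions is None:
        return False  # not enough information in the log to decide
    return (
        int(definitions["defs"]) == 0
        and int(definitions["errors"]) == int(discover["files"])
    )
```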

Root Cause Verified

Identical file content, copied to an ASCII-only path, indexes correctly:

Test          Path                      defs  calls  errors
ASCII path    C:/tmp/core-test.js       30    103    0
Chinese path  F:/...智能体.../core.js   0     0      41

The 417-line JS file is byte-identical in both locations; the only variable is whether the parent path contains CJK characters. Languages tested: JavaScript, TypeScript (.ts/.tsx), Python, and CSS. All fail identically when the path contains CJK characters.
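
One way to confirm that two copies of a file are byte-identical is to compare their digests. This is a minimal sketch; file_digest and same_bytes are hypothetical helper names, and the paths in the usage note are the ones from the table, with the elided segments left as reported.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True when both files have identical content, regardless of their paths."""
    return file_digest(a) == file_digest(b)
```

For example, `same_bytes(Path("C:/tmp/core-test.js"), Path(...))` with the second argument pointing at the copy under the CJK path should return True if only the location differs.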

Environment

  • OS: Windows 10 (64-bit)
  • codebase-memory-mcp version: v0.8.1
  • Filesystem: NTFS (file names are stored as UTF-16)
  • Shell tested: git-bash (MSYS2); same result via a direct Python subprocess call

Suspected Cause

Likely a Windows-specific encoding issue. The structure pass (directory enumeration via FindFirstFileW/FindNextFileW) works because it uses wide-char UTF-16 paths natively. The definitions pass, however, probably handles file paths as plain char* and opens the files with fopen or a similar narrow-character API before handing their contents to tree-sitter. On Windows, such APIs interpret char* paths in the ANSI codepage (e.g., GBK on zh-CN systems) rather than as UTF-8, so the UTF-8 path is misread, the open fails, and tree-sitter never receives the file contents to parse.
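
The suspected mismatch can be illustrated without a Windows machine. The sketch below takes the UTF-8 bytes of a path and decodes them as an ANSI codepage, which approximates what a narrow-character file API would do with a UTF-8 char* on a zh-CN system. This is an illustration of the hypothesis, not the tool's actual code; as_ansi_api_sees is a hypothetical name and the GBK default is an assumption taken from the example above.

```python
def as_ansi_api_sees(utf8_path: str, ansi_codepage: str = "gbk") -> str:
    """Reinterpret a path's UTF-8 bytes as if they were ANSI-codepage text."""
    raw = utf8_path.encode("utf-8")
    # errors="replace" substitutes U+FFFD for byte sequences that are invalid
    # in the target codepage instead of raising UnicodeDecodeError.
    return raw.decode(ansi_codepage, errors="replace")


def survives_ansi_interpretation(utf8_path: str, ansi_codepage: str = "gbk") -> bool:
    """True when the path is read back unchanged, i.e. the file would be found."""
    return as_ansi_api_sees(utf8_path, ansi_codepage) == utf8_path
```

ASCII-only paths survive the round trip because ASCII bytes mean the same thing in UTF-8 and in the common ANSI codepages, which is consistent with the ASCII test path indexing correctly. Paths containing CJK, Cyrillic, or accented Latin characters come back as different text and would name a file that does not exist.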

On Linux/macOS, UTF-8 char* paths work natively because the file APIs take paths as byte strings and the conventional encoding is UTF-8. This would explain why the definitions-pass failure has not been caught there, even though related CJK-path issues such as #571 were reported.
