fix: use _wfopen on Windows to support non-ASCII (CJK/U...) file paths by maxniu1 · Pull Request #637 · DeusData/codebase-memory-mcp

maxniu1 · 2026-06-26T19:13:23Z

Summary

On Windows, fopen() interprets its path argument in the active ANSI code page (e.g. GBK on zh-CN, Shift-JIS on ja-JP, CP1251 on ru-RU) rather than as UTF-8. When a repository path contains non-ASCII characters (Chinese, Japanese, Korean, Cyrillic, Arabic, accented Latin such as é/ü/ñ, Thai, Hebrew, Devanagari, etc.), the tree-sitter definitions pass fails for every file (defs=0, errors=N). The resulting knowledge graph contains only File/Folder nodes, with no code intelligence.

The existing directory-enumeration code in compat_fs.c already handles this correctly via FindFirstFileW/_wmkdir/_wunlink with wide-character paths. This PR applies the same pattern to file reading for tree-sitter parsing.
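As a rough illustration of that pattern, a wrapper along these lines would convert the UTF-8 path to UTF-16 and call `_wfopen` on Windows, and fall through to `fopen` elsewhere. This is a minimal sketch, not the code in this PR: `cbm_fopen_sketch` and `utf8_to_wide_sketch` are hypothetical names, and the conversion calls the Win32 `MultiByteToWideChar` API directly because the signature of the repository's own `cbm_utf8_to_wide` helper is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a
 * heap-allocated UTF-16 string. Returns NULL on failure; caller frees. */
static wchar_t *utf8_to_wide_sketch(const char *s) {
    /* First call asks for the required length, including the terminator. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0) return NULL;
    wchar_t *w = malloc((size_t)n * sizeof(wchar_t));
    if (!w) return NULL;
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        return NULL;
    }
    return w;
}
#endif

/* Hypothetical stand-in for cbm_fopen(): same arguments as fopen(),
 * but the path is always treated as UTF-8. */
FILE *cbm_fopen_sketch(const char *path, const char *mode) {
    if (!path || !mode) return NULL;
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide_sketch(path);
    wchar_t *wmode = utf8_to_wide_sketch(mode);
    FILE *fp = NULL;
    if (wpath && wmode) fp = _wfopen(wpath, wmode);
    free(wpath); /* free(NULL) is a no-op */
    free(wmode);
    return fp;
#else
    /* POSIX file systems take the path bytes as-is, so UTF-8 passes through. */
    return fopen(path, mode);
#endif
}
```

Keeping the signature identical to `fopen` means call sites only need the function name swapped. Rejecting invalid UTF-8 with `MB_ERR_INVALID_CHARS` is a design choice of this sketch; the real implementation may behave differently.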

Changes

| File | Change |
| --- | --- |
| `src/foundation/compat_fs.h` | Declare `cbm_fopen()` |
| `src/foundation/compat_fs.c` | Implement `cbm_fopen()`: `_wfopen` on Windows, `fopen` on POSIX |
| `src/pipeline/pass_parallel.c` | Replace `fopen` → `cbm_fopen` |
| `src/pipeline/pass_calls.c` | Replace `fopen` → `cbm_fopen` |
| `src/pipeline/pass_definitions.c` | Replace `fopen` → `cbm_fopen` |

All three pipeline files had a duplicated read_file() static function. Each now calls cbm_fopen() instead of raw fopen().
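The body of read_file() is not shown in this PR description, so the following is only a guess at its general shape: a helper that opens a file in binary mode and reads it into a heap buffer. `read_file_sketch` and `open_file_standin` are hypothetical names. The stand-in opener simply calls `fopen` so that the sketch compiles on its own; in the repository, that call is the one the PR switches to `cbm_fopen(path, "rb")`.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the opener (assumption): in the repository this would be
 * cbm_fopen() from compat_fs.h, which takes the same arguments. */
static FILE *open_file_standin(const char *path, const char *mode) {
    return fopen(path, mode);
}

/* Read a whole file into a NUL-terminated heap buffer.
 * Returns NULL on any failure; the caller frees the result.
 * If out_len is non-NULL it receives the byte count (excluding the NUL). */
char *read_file_sketch(const char *path, size_t *out_len) {
    FILE *fp = open_file_standin(path, "rb");
    if (!fp) return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0 || fseek(fp, 0, SEEK_SET) != 0) { fclose(fp); return NULL; }

    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(buf); return NULL; }

    buf[got] = '\0';
    if (out_len) *out_len = got;
    return buf;
}
```

With a helper of this shape, the only place the path encoding matters is the single open call, which is consistent with the fix being a one-line substitution in each of the three files.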

Testing

Before (v0.8.1)

pass=definitions defs=0 calls=0 imports=0 errors=41   ← ALL files fail

After (this PR)

pass=definitions defs=215 calls=259 imports=18 errors=0  ← all pass

Tested on a 41-file TypeScript/JS/Python/CSS project with CJK directory names (F:/.../智能体/智能体原型/). Full index: 245 nodes, 347 edges, 197ms.

Related

- Fixes #636: "Windows: tree-sitter definitions pass fails on ALL files when repo_path contains non-ASCII (CJK / Cyrillic / Arabic / accented Latin / etc.) characters" (this specific bug)
- Related to #571: "Project name strips non-ASCII (CJK) characters from path, resulting in truncated/unrecognizable names" (project name truncation for CJK paths; cosmetic, already fixed)
- This is the functional fix that makes the tool work for non-English Windows users

On Windows, fopen() uses the active ANSI codepage (e.g. GBK on zh-CN, Shift-JIS on ja-JP, CP1251 on ru-RU) rather than UTF-8. When a repository path contains non-ASCII characters (Chinese, Japanese, Korean, Cyrillic, Arabic, accented Latin (é, ü, ñ), etc.), every file's tree-sitter definitions pass fails (defs=0, errors=N for all files), producing a knowledge graph with only File/Folder nodes and zero code intelligence.

The codebase already has the infrastructure for UTF-8→wide conversion (win_utf8.h: cbm_utf8_to_wide/cbm_wide_to_utf8) and uses it correctly for directory enumeration (FindFirstFileW), mkdir (_wmkdir), and unlink (_wunlink). But fopen() for reading file content was missed.

Fix:
- Add cbm_fopen() to compat_fs.h / compat_fs.c
  - Windows: uses _wfopen() with wide-char path
  - POSIX: delegates to fopen()
- Replace fopen(path, "rb") with cbm_fopen(path, "rb") in the 3 pipeline files that read source files for tree-sitter parsing: pass_parallel.c, pass_calls.c, pass_definitions.c

Fixes DeusData#636

maxniu1 mentioned this pull request Jun 26, 2026

Windows: tree-sitter definitions pass fails on ALL files when repo_path contains non-ASCII (CJK / Cyrillic / Arabic / accented Latin / etc.) characters #636

Open

maxniu1 wants to merge 1 commit into DeusData:main from maxniu1:fix/win-utf8-path-fopen
