feat: multilingual GEO rule patterns (zh/ja/ko) #11
Open
angziii wants to merge 1 commit into
Replace English-only heuristics with language-aware pattern sets loaded from a new i18n-patterns module. Detects page language via Unicode range heuristics, then applies locale-specific regex for:

- BLUF intro verbs (B1)
- FAQ/Q&A detection (B5, computeFaqStatus)
- Source attribution (B7): e.g. 根据, によると, 에 따르면
- Definitional patterns (B10): e.g. 是指, とは, 란
- Research language (D10): e.g. 研究/調査/연구
- Byline detection: e.g. 作者, 著者, 작성자
- Date signals: e.g. 最后更新, 更新日, 최종 업데이트
- Commercial/technical keyword detection

English pages continue to use the existing patterns unchanged.

Rebased onto latest main (Intl.Segmenter word counting + cjkRatio FK guard already present upstream).
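For readers unfamiliar with the approach, language detection by Unicode range can be sketched roughly as follows. This is not the PR's code: the function name `detectLanguageSketch`, the `minCjkRatio` threshold, and the tie-breaking order are assumptions made for illustration only.

```typescript
// Hypothetical sketch; the real detectLanguage() in lib/i18n-patterns.ts may differ.
type Lang = "en" | "zh" | "ja" | "ko";

function detectLanguageSketch(text: string, minCjkRatio = 0.1): Lang {
  let kana = 0;   // Hiragana + Katakana, U+3040-U+30FF
  let hangul = 0; // Hangul syllables block, U+AC00-U+D7AF
  let han = 0;    // CJK Unified Ideographs, U+4E00-U+9FFF
  let latin = 0;  // ASCII letters, used as the non-CJK baseline

  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if (cp >= 0x3040 && cp <= 0x30ff) kana++;
    else if (cp >= 0xac00 && cp <= 0xd7af) hangul++;
    else if (cp >= 0x4e00 && cp <= 0x9fff) han++;
    else if (/[A-Za-z]/.test(ch)) latin++;
  }

  const cjk = kana + hangul + han;
  const total = cjk + latin;
  // Too little CJK text (assumed threshold): fall back to the English patterns.
  if (total === 0 || cjk / total < minCjkRatio) return "en";

  // Han characters appear in zh and ja (and occasionally ko), so check the
  // scripts that are unique to one language first.
  if (kana > 0 && kana >= hangul) return "ja";
  if (hangul > 0) return "ko";
  return "zh";
}
```

Checking kana and Hangul before Han matters because Japanese text mixes kanji with kana; a Han-first check would misclassify it as Chinese.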
@angziii is attempting to deploy a commit to the ContextDev Team on Vercel. A member of the Team first needs to authorize it.
28d4ac6 to 451dd6a
Contributor
@angziii lmk when i should review this
Summary
Current GEO audit heuristics are heavily English-centric ("according to", English question marks, English month names, English verb patterns) and fail to detect equivalent signals on Chinese, Japanese, and Korean pages.

This PR adds language-aware GEO rule patterns so the audit correctly evaluates non-English content.
- lib/i18n-patterns.ts: Language detection via Unicode range heuristics + full pattern dictionaries for en, zh, ja, ko
- lib/rules.ts: B1/B5/B7/B10/D10 evaluators, isCommercialOfferingLike(), isTechnicalProductLike() all use language-specific patterns
- lib/audit.ts: computeFaqStatus(), detectByline(), extractDateSignals() all language-aware

Rebased cleanly onto latest main (which already has Intl.Segmenter word counting and cjkRatio FK guard).
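A per-language pattern dictionary of the kind described could look roughly like the sketch below. The names `GeoPatterns`, `PATTERNS_SKETCH`, and `hasSourceAttribution` are hypothetical. Only "according to" and the zh/ja/ko phrases come from this PR's description; the English definitional and byline patterns are placeholders, and the real dictionaries are presumably much larger.

```typescript
// Hypothetical shapes for illustration; not the contents of lib/i18n-patterns.ts.
type Lang = "en" | "zh" | "ja" | "ko";

interface GeoPatterns {
  sourceAttribution: RegExp; // B7
  definitional: RegExp;      // B10
  byline: RegExp;
}

const PATTERNS_SKETCH: Record<Lang, GeoPatterns> = {
  en: {
    sourceAttribution: /\baccording to\b/i,
    definitional: /\b(?:is defined as|refers to)\b/i, // assumed placeholder
    byline: /\b(?:written by|author)\b/i,             // assumed placeholder
  },
  // CJK patterns omit \b because word boundaries are not meaningful there.
  zh: { sourceAttribution: /根据/, definitional: /是指/, byline: /作者/ },
  ja: { sourceAttribution: /によると/, definitional: /とは/, byline: /著者/ },
  ko: { sourceAttribution: /에 따르면/, definitional: /란/, byline: /작성자/ },
};

// An evaluator picks the pattern set for the detected language, so an
// English page only ever sees the English regexes.
function hasSourceAttribution(text: string, lang: Lang): boolean {
  return PATTERNS_SKETCH[lang].sourceAttribution.test(text);
}
```

Under these assumptions, a Chinese sentence such as "根据最新研究，市场正在增长" fails the English B7 pattern but passes once the `zh` set is selected.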
Example: what changes for a Chinese page
- "according to" found; "根据" matched
- ? found; "常见问题" + ? matched
- "最后更新" + Chinese date matched
- 作者: matched

Backward compatibility
English pages are unchanged: detectLanguage() returns "en" for English input, and all existing English regex patterns are preserved in EN_PATTERNS.

Test plan
- npm run build passes