
feat: multilingual GEO rule patterns (zh/ja/ko) #11

Open
angziii wants to merge 1 commit into context-dot-dev:main from angziii:feat/multilingual-geo-rules


angziii commented May 13, 2026

Contributor

Summary

Current GEO audit heuristics are heavily English-centric: they look for "according to", English question marks, English month names, and English verb patterns, so they fail to detect the equivalent signals on Chinese, Japanese, and Korean pages.

This PR adds language-aware GEO rule patterns so the audit correctly evaluates non-English content.

  • New lib/i18n-patterns.ts: Language detection via Unicode range heuristics + full pattern dictionaries for en, zh, ja, ko
  • lib/rules.ts: B1/B5/B7/B10/D10 evaluators, isCommercialOfferingLike(), isTechnicalProductLike() all use language-specific patterns
  • lib/audit.ts: computeFaqStatus(), detectByline(), and extractDateSignals() are now language-aware
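
For readers unfamiliar with Unicode range heuristics, here is a minimal sketch of how a `detectLanguage()` function of this kind might work. It is not the code from `lib/i18n-patterns.ts`: the `Lang` type, the `countMatches` helper, the 20% threshold, and the tie-breaking order are all assumptions made for illustration.

```typescript
// Illustrative sketch only; the real detectLanguage() in this PR may differ.
type Lang = "en" | "zh" | "ja" | "ko";

// Hypothetical helper: number of matches of a global regex in the text.
function countMatches(text: string, re: RegExp): number {
  const m = text.match(re);
  return m ? m.length : 0;
}

function detectLanguage(text: string): Lang {
  const hangul = countMatches(text, /[\uAC00-\uD7AF]/g); // Hangul syllables
  const kana = countMatches(text, /[\u3040-\u30FF]/g);   // Hiragana + Katakana
  const han = countMatches(text, /[\u4E00-\u9FFF]/g);    // CJK unified ideographs
  const latin = countMatches(text, /[A-Za-z]/g);

  const cjk = hangul + kana + han;
  const total = cjk + latin;

  // Assumed threshold: mostly-Latin (or empty) text is treated as English.
  if (total === 0 || cjk / total < 0.2) return "en";

  // Hangul dominates Korean text.
  if (hangul >= kana && hangul >= han) return "ko";
  // Japanese mixes kanji with kana, so any kana suggests Japanese.
  if (kana > 0) return "ja";
  // Han characters without kana are treated as Chinese.
  return "zh";
}
```

One known limitation of this approach: a short Japanese string written only in kanji would be classified as `"zh"`, so a real implementation would likely want more text or an `lang` attribute hint before deciding.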

Rebased cleanly onto latest main (which already has Intl.Segmenter word counting and cjkRatio FK guard).

Example: what changes for a Chinese page

| Rule | Before | After |
| --- | --- | --- |
| B7 (source attribution) | fail: no "according to" found | pass: "根据" matched |
| B5 (FAQ detection) | fail: no ? found | pass: "常见问题" + matched |
| B12 (last-updated date) | fail: no English month name | pass: "最后更新" + Chinese date matched |
| byline detection | false | true ("作者:" matched) |
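
To make the last-updated row concrete, the following sketch shows one way a "最后更新" marker followed by a Chinese-format date could be matched and normalized. The `DateSignal` shape, the `extractZhDateSignal` name, and the exact regex are hypothetical and are not taken from `extractDateSignals()` in this PR.

```typescript
// Illustrative sketch; names and the regex are assumptions, not the PR's code.
interface DateSignal {
  label: string; // the marker that matched
  iso: string;   // the date normalized to YYYY-MM-DD
}

// "最后更新" (last updated), an optional ASCII or fullwidth colon,
// then a date written like 2026年5月13日.
const ZH_LAST_UPDATED = /最后更新\s*[:：]?\s*(\d{4})年(\d{1,2})月(\d{1,2})日/;

function extractZhDateSignal(text: string): DateSignal | null {
  const m = ZH_LAST_UPDATED.exec(text);
  if (!m) return null;
  const [, year = "", month = "", day = ""] = m;
  // Left-pad single-digit months and days to two digits.
  const pad = (s: string): string => (s.length === 1 ? "0" + s : s);
  return { label: "最后更新", iso: `${year}-${pad(month)}-${pad(day)}` };
}
```

An English-only month-name check would return nothing for a string such as `最后更新：2026年5月13日`, which is the "before" behaviour the table describes.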

Backward compatibility

English pages are unchanged — detectLanguage() returns "en" for English input, and all existing English regex patterns are preserved in EN_PATTERNS.
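
A rough sketch of how a per-language dictionary can leave English behaviour untouched: the English entry is simply the existing pattern set, and the other languages add their own entries beside it. Only `EN_PATTERNS` and the CJK keywords come from this PR's description; `PatternSet`, `PATTERNS`, `hasSourceAttribution`, and the English regexes shown here are placeholders.

```typescript
// Illustrative sketch; structure and English regexes are assumptions.
type Lang = "en" | "zh" | "ja" | "ko";

interface PatternSet {
  sourceAttribution: RegExp; // rule B7
  byline: RegExp;            // byline detection
}

// Stand-in for the existing English patterns, kept as-is.
const EN_PATTERNS: PatternSet = {
  sourceAttribution: /\baccording to\b/i,
  byline: /\bby\s+[A-Z][a-z]+/,
};

const PATTERNS: Record<Lang, PatternSet> = {
  en: EN_PATTERNS,
  zh: { sourceAttribution: /根据/, byline: /作者\s*[:：]/ },
  ja: { sourceAttribution: /によると/, byline: /著者\s*[:：]/ },
  ko: { sourceAttribution: /에\s?따르면/, byline: /작성자\s*[:：]/ },
};

// Hypothetical evaluator: picks the pattern set for the detected language.
function hasSourceAttribution(text: string, lang: Lang): boolean {
  return PATTERNS[lang].sourceAttribution.test(text);
}
```

Because `PATTERNS.en` is the same object as `EN_PATTERNS`, any page detected as `"en"` is evaluated with exactly the regexes it was evaluated with before.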

Test plan

  • npm run build passes
  • Audit a Chinese page and verify B1/B5/B7/B10/D10 fire
  • Audit a Japanese page and verify rules fire
  • Audit a Korean page and verify rules fire
  • Audit an English page and verify no regressions

Replace English-only heuristics with language-aware pattern sets
loaded from a new i18n-patterns module. Detects page language via
Unicode range heuristics, then applies locale-specific regex for:

- BLUF intro verbs (B1)
- FAQ/Q&A detection (B5, computeFaqStatus)
- Source attribution (B7) - e.g. 根据, によると, 에 따르면
- Definitional patterns (B10) - e.g. 是指, とは, 란
- Research language (D10) - e.g. 研究/調査/연구
- Byline detection - e.g. 作者, 著者, 작성자
- Date signals - e.g. 最后更新, 更新日, 최종 업데이트
- Commercial/technical keyword detection

English pages continue to use the existing patterns unchanged.
Rebased onto latest main (Intl.Segmenter word counting +
cjkRatio FK guard already present upstream).
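
As an illustration of the locale-specific checks listed in the commit message, here is a small sketch covering the definitional (B10) and research-language (D10) rules. The keywords are the ones quoted above; the `CjkLang` type, the lookup tables, and `evaluateContentRules` are hypothetical names, and the real evaluators in `lib/rules.ts` are likely more involved.

```typescript
// Illustrative sketch; names and structure are assumptions, not the PR's code.
type CjkLang = "zh" | "ja" | "ko";

// B10: definitional markers ("X refers to ...", "X is ...").
const DEFINITIONAL: Record<CjkLang, RegExp> = {
  zh: /是指/,
  ja: /とは/,
  ko: /란/,
};

// D10: research language.
const RESEARCH: Record<CjkLang, RegExp> = {
  zh: /研究/,
  ja: /調査/,
  ko: /연구/,
};

function evaluateContentRules(
  text: string,
  lang: CjkLang
): { B10: boolean; D10: boolean } {
  return {
    B10: DEFINITIONAL[lang].test(text),
    D10: RESEARCH[lang].test(text),
  };
}
```

Plain substring patterns like these are cheap but can over-match (for example, 란 also appears inside unrelated Korean words), so a production rule would probably add context around each keyword.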

vercel Bot commented May 13, 2026


@angziii is attempting to deploy a commit to the ContextDev Team on Vercel.

A member of the Team first needs to authorize it.

angziii force-pushed the feat/multilingual-geo-rules branch 2 times, most recently from 28d4ac6 to 451dd6a on May 13, 2026 11:14
@YahiaBakour

Contributor

@angziii lmk when i should review this

