[lib]: Add Thai soundex algorithms#94
Follow codebase convention from tokenizer/tcc_rules.rs. Keep unicode codepoints in comments for reference.
Pull request overview
Adds a new Thai phonetic matching (soundex) module to the Rust library, including multiple algorithms and shared Thai-character utilities, intended to match PyThaiNLP outputs.
Changes:
- Introduces `thai_chars` utilities (Thai character constants + classification helpers).
- Adds 5 soundex algorithm implementations plus a dispatcher and similarity scoring.
- Exposes the new modules from the crate root (`lib.rs`).
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/lib.rs` | Exposes `soundex` and `thai_chars` modules publicly. |
| `src/thai_chars.rs` | Adds Thai character constants and helper predicates used by soundex implementations. |
| `src/soundex/mod.rs` | Adds soundex dispatcher (`soundex`) and `soundex_similarity`, re-exporting algorithm entrypoints. |
| `src/soundex/udom83.rs` | Implements Udom83 with regex preprocessing and fixed-length code output. |
| `src/soundex/lk82.rs` | Implements LK82 with preprocessing and vowel/cluster handling. |
| `src/soundex/metasound.rs` | Implements MetaSound with consonant-only encoding and karan removal. |
| `src/soundex/prayut_and_somchaip.rs` | Implements Thai-English cross-language soundex with right-truncation behavior. |
| `src/soundex/complete_soundex.rs` | Implements Complete Soundex with heuristic syllable splitting and detailed encoding rules. |
| // 0. อัต pattern | ||
| if text.starts_with("อัต") && text.chars().count() > 3 { | ||
| let rest_start = text.char_indices().nth(3).map(|(i, _)| i).unwrap_or(text.len()); | ||
| return vec![ | ||
| (&text[..rest_start], None), | ||
| // Prepend ต to the rest | ||
| ]; | ||
| // This needs special handling - we'll allocate for this case | ||
| } | ||
|
|
heuristic_split has an incomplete special-case: the อัต branch returns a vector containing only the first slice and leaves a comment about prepending ต to the rest, but the rest is never returned/processed. This will silently drop the remainder of the word for inputs starting with "อัต…". Implement the intended split (including the modified remainder, likely requiring allocation) or remove this branch until it is implemented.
| // 0. อัต pattern | |
| if text.starts_with("อัต") && text.chars().count() > 3 { | |
| let rest_start = text.char_indices().nth(3).map(|(i, _)| i).unwrap_or(text.len()); | |
| return vec![ | |
| (&text[..rest_start], None), | |
| // Prepend ต to the rest | |
| ]; | |
| // This needs special handling - we'll allocate for this case | |
| } |
The อัต branch returns a 2-element vec! at lines 89-92: ("อัต", None) and ("ตรา", None). The rest is not dropped. Test test_heuristic_split_at_pattern verifies this.
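The allocation concern raised in the review can be illustrated with a small standalone sketch. It is not the PR's `heuristic_split`; the function name `split_at_prefix` and its `Option<(String, String)>` return type are assumptions made for illustration. It only shows why the remainder needs an owned `String`: the prepended ต does not exist at that position in the input, so the second part cannot be a borrowed slice.

```rust
/// Hypothetical helper (not the PR's `heuristic_split`): split a word that
/// starts with the three-character prefix U+0E2D U+0E31 U+0E15 into the
/// prefix syllable and a remainder with U+0E15 (to tao) prepended.
/// Returns `None` when the prefix is absent or nothing follows it.
fn split_at_prefix(text: &str) -> Option<(String, String)> {
    const PREFIX: &str = "\u{0E2D}\u{0E31}\u{0E15}";
    let rest = text.strip_prefix(PREFIX)?;
    if rest.is_empty() {
        return None; // the bare prefix is left unsplit
    }
    // The remainder gains a leading consonant that is not present at this
    // position in `text`, so it must be built as an owned String.
    let mut tail = String::from('\u{0E15}');
    tail.push_str(rest);
    Some((PREFIX.to_string(), tail))
}
```

For the five-character word discussed in the thread, this yields the same two parts the author describes: the prefix syllable and the remainder starting with ต.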
| 'บ' => "บบ", 'ป' => "ปป", | ||
| 'ผ' => "พผ", 'ฝ' => "ฟฝ", 'พ' | 'ภ' => "พพ", 'ฟ' => "ฟฟ", | ||
| 'ม' => "มม", | ||
| 'ร' | 'ล' | 'ฬ' | 'ฤ' => "รร", 'ว' => "วว", |
initial_code has a mapping for ฤ, but process_syllable only treats characters as initials when is_thai_consonant(chars[idx]) is true. With the current is_thai_consonant definition, ฤ is never considered an initial consonant here, so this case is effectively dead and words starting with ฤ will produce an empty init_code. Handle ฤ explicitly in the initial-detection logic (if the algorithm expects it) or remove the unreachable mapping for clarity.
| 'ร' | 'ล' | 'ฬ' | 'ฤ' => "รร", 'ว' => "วว", | |
| 'ร' | 'ล' | 'ฬ' => "รร", 'ว' => "วว", |
| // Keep only consonants and thanthakhat | ||
| let mut chars: Vec<char> = text | ||
| .chars() | ||
| .filter(|&c| is_thai_consonant(c) || c == THANTHAKHAT) |
metasound_code includes a mapping for ฦ, but the input filtering keeps only is_thai_consonant(c) (U+0E01..U+0E2E) or thanthakhat. Since ฦ is outside that range, it can never reach metasound_code and will be dropped from the encoding. Either extend the filter to include ฦ (and any other supported non-range letters) or remove the unreachable mapping to avoid inconsistent behavior.
| // Keep only consonants and thanthakhat | |
| let mut chars: Vec<char> = text | |
| .chars() | |
| .filter(|&c| is_thai_consonant(c) || c == THANTHAKHAT) | |
| // Keep only consonants, ฦ, and thanthakhat | |
| let mut chars: Vec<char> = text | |
| .chars() | |
| .filter(|&c| is_thai_consonant(c) || c == 'ฦ' || c == THANTHAKHAT) |
Add ฏ ฑ ถ ธ to group 2 per Snae & Bruckner (2009) Table p.507. Upgrade all non-empty assertions to exact-value. Ref: PyThaiNLP/pythainlp#1383
Fix 3 issues from Copilot review:
- complete_soundex: extend initial consonant check to include ฤ/ฦ (was blocked by is_thai_consonant guard, making initial_code dead)
- metasound: extend input filter to include ฦ alongside consonants
- lk82: fix RE_SIGN doc comment to match actual regex (no ์)

2 other comments were incorrect (dismissed with evidence):
- อัต branch already returns both parts correctly (test passes)
- asterisk ญ/ย check matches pythainlp behavior (non-adjacent contains)
|
|
||
| /// Compute the MetaSound Thai soundex for the given text. | ||
| /// | ||
| /// Returns a phonetic code of the specified `length` (default 4), |
The docstring mentions “default 4” for length, but metasound requires length to be passed explicitly and no wrapper supplies a default. Align the docs with the actual API (or add a convenience function using 4).
| /// Returns a phonetic code of the specified `length` (default 4), | |
| /// Returns a phonetic code of the specified `length`, |
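The alternative mentioned in the comment, a convenience function that supplies 4, could look roughly like the sketch below. Everything here is hypothetical: `DEFAULT_LENGTH`, `encode_stub`, and `encode_stub_default` are illustrative names, and `encode_stub` is a placeholder that only pads or truncates. It is not the MetaSound algorithm and only stands in for any encoder with a `(text, length)` signature.

```rust
/// Assumed default, taken from the "default 4" wording in the doc comment.
const DEFAULT_LENGTH: usize = 4;

/// Placeholder encoder with the `(text, length)` shape discussed above.
/// It truncates to `length` characters and pads with '0'; it does NOT
/// implement any real soundex rules.
fn encode_stub(text: &str, length: usize) -> String {
    let mut code: String = text.chars().take(length).collect();
    while code.chars().count() < length {
        code.push('0');
    }
    code
}

/// Hypothetical convenience wrapper: callers that do not care about the
/// length get the documented default, while the explicit API is unchanged.
fn encode_stub_default(text: &str) -> String {
    encode_stub(text, DEFAULT_LENGTH)
}
```

Keeping the explicit-length function as the primary entry point and adding a thin wrapper avoids a breaking change, at the cost of one more public name per algorithm.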
| /// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E). | ||
| pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎"; |
THAI_SIGNS appears to be missing the Phinthu sign ฺ (U+0E3A) while including ์ twice. The doc comment lists U+0E3A, but the constant string does not include it, so sign classification based on this constant would be incorrect. Update the constant to match the documented codepoints (and avoid duplicate entries).
| /// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E). | |
| pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎"; | |
| /// Thai signs: ฯ ฺ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E). | |
| pub const THAI_SIGNS: &str = "ฯฺๆ์ํ๎"; |
| /// Check if a character is a Thai consonant (ก-ฮ, U+0E01..U+0E2E). | ||
| pub fn is_thai_consonant(c: char) -> bool { | ||
| ('\u{0E01}'..='\u{0E2E}').contains(&c) |
is_thai_consonant currently checks the full U+0E01..=U+0E2E range, which includes ฤ (U+0E24) and ฦ (U+0E26). This contradicts the doc comment (“ก-ฮ”) and the THAI_CONSONANTS constant (which excludes ฤ/ฦ), and also makes downstream code that special-cases ฤ/ฦ redundant. Consider implementing this as membership in THAI_CONSONANTS (or adjust the docs/constants to match the intended behavior).
| /// Check if a character is a Thai consonant (ก-ฮ, U+0E01..U+0E2E). | |
| pub fn is_thai_consonant(c: char) -> bool { | |
| ('\u{0E01}'..='\u{0E2E}').contains(&c) | |
| /// Check if a character is a Thai consonant as defined in `THAI_CONSONANTS`. | |
| pub fn is_thai_consonant(c: char) -> bool { | |
| THAI_CONSONANTS.contains(c) |
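The difference between the two definitions can be checked directly. The sketch below is an illustration, not the PR's code: `is_in_consonant_range` mirrors the range check from the snippet, while `is_strict_consonant` is an assumed stand-in for "membership in `THAI_CONSONANTS`", written as the same range minus ฤ (U+0E24) and ฦ (U+0E26) so that it stays self-contained.

```rust
/// Range-based check as in the original snippet: U+0E01..=U+0E2E.
/// This range includes U+0E24 (ru) and U+0E26 (lu).
fn is_in_consonant_range(c: char) -> bool {
    ('\u{0E01}'..='\u{0E2E}').contains(&c)
}

/// Stricter variant (assumption): the same range with U+0E24 and U+0E26
/// excluded, standing in for membership in a 44-letter consonant set.
fn is_strict_consonant(c: char) -> bool {
    is_in_consonant_range(c) && !matches!(c, '\u{0E24}' | '\u{0E26}')
}

/// Count how many characters in U+0E01..=U+0E2E satisfy a predicate.
fn count_in_block(pred: fn(char) -> bool) -> usize {
    ('\u{0E01}'..='\u{0E2E}').filter(|&c| pred(c)).count()
}
```

Under these definitions the range check accepts 46 code points and the strict check 44, which is why callers that special-case ฤ/ฦ behave differently depending on which predicate is in place.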
| // Accept consonants (ก-ฮ) plus ฤ and ฦ which have initial_code mappings | ||
| if idx < chars.len() | ||
| && (is_thai_consonant(chars[idx]) || chars[idx] == 'ฤ' || chars[idx] == 'ฦ') | ||
| { | ||
| init_char = Some(chars[idx]); | ||
|
|
||
| // Special: ทร → ซซ | ||
| if chars[idx] == 'ท' && idx + 1 < chars.len() && chars[idx + 1] == 'ร' { | ||
| init_code = "ซซ".to_string(); | ||
| idx += 2; | ||
| } else { | ||
| init_code = initial_code(chars[idx]).to_string(); | ||
| idx += 1; |
process_syllable explicitly allows ฦ as an initial character, but initial_code has no mapping for ฦ (it will fall through to "xx"). This produces an invalid/placeholder code for words starting with ฦ. Either add the correct ฦ mapping in initial_code (and cover it with a unit test), or stop treating ฦ as a valid initial consonant here.
| /// | ||
| /// # Arguments | ||
| /// * `text` - Thai or English text | ||
| /// * `length` - Desired length of the soundex code (default: 4) |
The docstring says length has a “default: 4”, but prayut_and_somchaip requires an explicit length argument and there is no overload/dispatcher providing a default. Either remove the “default” wording or provide a convenience wrapper that uses 4 when callers don’t specify a length.
| /// * `length` - Desired length of the soundex code (default: 4) | |
| /// * `length` - Desired length of the soundex code |
| pub const THAI_VOWELS: &str = "ฤฦะัาำิีึืุูเแโใไๅํ็"; | ||
|
|
||
| /// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E). | ||
| pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎"; |
| pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎"; | |
| pub const THAI_SIGNS: &str = "\u{0e2f}\u{0e3a}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}"; |
Better to use Unicode code points in the code, especially for non-base characters like these.
It is more explicit, friendlier to non-Thai readers, and makes mistakes easier to spot.
The actual characters are already provided in the comment.
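A duplicate or missing entry of the kind flagged earlier is also easy to catch mechanically once the constant is written with escapes. The sketch below assumes the escaped form from the suggestion; `has_unique_chars` is a hypothetical helper, not something in the PR.

```rust
use std::collections::HashSet;

/// Thai signs by Unicode name: paiyannoi, phinthu, maiyamok, thanthakhat,
/// nikhahit, yamakkan (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
const THAI_SIGNS: &str = "\u{0e2f}\u{0e3a}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}";

/// Hypothetical helper: true when no character in `set` appears twice.
/// `HashSet::insert` returns false on a repeat, which stops `all` early.
fn has_unique_chars(set: &str) -> bool {
    let mut seen = HashSet::new();
    set.chars().all(|c| seen.insert(c))
}
```

A unit test asserting `has_unique_chars(THAI_SIGNS)` and a character count of 6 would have flagged the original string, which repeated U+0E4C and omitted U+0E3A.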
Summary
Add Thai soundex module with 5 phonetic matching algorithms and supporting utilities, ported from pythainlp to native Rust.
What's included
Thai character classification (`thai_chars.rs`)
- `is_thai_character`, `is_thai_consonant`, `is_thai_vowel`, `is_thai_digit`, `remove_tonemarks`
Soundex algorithms (`soundex/`)
Utilities
- `soundex()` dispatcher: select engine by name
- `soundex_similarity()`: character-by-character similarity score
Test coverage
53 unit tests + 7 doctests, all verified against pythainlp output.
Notes
- `lazy_static` for compiled patterns
- `complete_soundex` uses heuristic syllable splitting (no external tokenizer dependency)
- `sound.py` (`word2audio`, `audio_vector`) is not ported: requires `panphon` external library
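The description calls the similarity utility a character-by-character score but does not give its formula. As a rough illustration only, one plausible reading is the fraction of positions at which two codes agree. The function name `positional_similarity`, the normalization by the longer code, and the handling of empty input are all assumptions; the PR's `soundex_similarity()` may score differently.

```rust
/// Illustrative positional score (assumed formula, not the PR's): the number
/// of positions where both codes hold the same character, divided by the
/// length of the longer code.
fn positional_similarity(a: &str, b: &str) -> f64 {
    // Compare by `char`, not by byte, so multi-byte Thai letters count once.
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0; // assumption: two empty codes are treated as identical
    }
    let matching = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    matching as f64 / longest as f64
}
```

Collecting into `Vec<char>` matters for Thai text, because indexing a `&str` by byte would split each letter into three UTF-8 bytes.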