[lib]: Add Thai soundex algorithms #94

Open
phoneee wants to merge 20 commits into PyThaiNLP:main from phoneee:feat/soundex

@phoneee phoneee commented Mar 28, 2026

Summary

Add Thai soundex module with 5 phonetic matching algorithms and supporting utilities, ported from pythainlp to native Rust.

What's included

Thai character classification (thai_chars.rs)

  • Constants: consonants, vowels, tone marks, digits, signs
  • Functions: is_thai_character, is_thai_consonant, is_thai_vowel, is_thai_digit, remove_tonemarks
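
The helpers listed above are not shown in this conversation. As a minimal sketch of what such predicates might look like, assuming tone marks are the four code points U+0E48..=U+0E4B and digits are U+0E50..=U+0E59 (`is_thai_tonemark` is a hypothetical helper, and the bodies here are illustrative rather than the PR's code):

```rust
/// Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..=U+0E4B).
/// Hypothetical helper, not taken from the PR.
fn is_thai_tonemark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Thai digits (U+0E50..=U+0E59).
fn is_thai_digit(c: char) -> bool {
    ('\u{0E50}'..='\u{0E59}').contains(&c)
}

/// Return a copy of `text` with every tone mark removed.
fn remove_tonemarks(text: &str) -> String {
    text.chars().filter(|&c| !is_thai_tonemark(c)).collect()
}

fn main() {
    // U+0E19 U+0E49 U+0E33: a consonant, mai tho, then sara am.
    // Stripping the tone mark leaves U+0E19 U+0E33.
    assert_eq!(
        remove_tonemarks("\u{0E19}\u{0E49}\u{0E33}"),
        "\u{0E19}\u{0E33}"
    );
    assert!(is_thai_digit('\u{0E55}')); // Thai digit five
    assert!(!is_thai_digit('5')); // ASCII digits are not Thai digits
}
```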

Soundex algorithms (soundex/)

  • LK82 — Lorchirachoonkul (1982), 5-char phonetic code
  • Udom83 — Udompanich (1983), 7-char phonetic code with 11 regex preprocessing rules
  • MetaSound — Snae & Brückner (2009), consonant-only variable-length code
  • Prayut & Somchaip — Thai-English cross-language soundex (1998), right-truncation
  • Complete Soundex — Tapsai, Meesad & Haruechaiyasak (2020), multi-syllable with heuristic splitting

Utilities

  • soundex() dispatcher — select engine by name
  • soundex_similarity() — character-by-character similarity score
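
The body of soundex_similarity() is not quoted in this thread. One plausible reading of "character-by-character" is a positional comparison normalised by the longer code; the sketch below assumes exactly that, and `positional_similarity` is a hypothetical name rather than the PR's API.

```rust
/// Hypothetical positional similarity between two phonetic codes.
/// Compares characters (not bytes) at the same index and divides the
/// number of matches by the length of the longer code.
fn positional_similarity(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let longest = a.len().max(b.len());
    if longest == 0 {
        // Assumption: two empty codes count as identical.
        return 1.0;
    }
    let matches = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    matches as f64 / longest as f64
}

fn main() {
    assert_eq!(positional_similarity("ก1234", "ก1234"), 1.0);
    // Only the first character differs: 4 of 5 positions match.
    assert_eq!(positional_similarity("ก1234", "ข1234"), 0.8);
    // The shorter code can match at most its own length.
    assert_eq!(positional_similarity("abcd", "ab"), 0.5);
}
```

Collecting into `Vec<char>` matters for Thai input: each Thai character is three bytes in UTF-8, so a byte-wise comparison would weight positions incorrectly.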

Test coverage

53 unit tests + 7 doctests, all verified against pythainlp output.

Notes

  • Follows existing code style: Thai literals in regex (not unicode escapes), lazy_static for compiled patterns
  • complete_soundex uses heuristic syllable splitting (no external tokenizer dependency)
  • sound.py (word2audio, audio_vector) is not ported — requires panphon external library

bact added the enhancement (New feature or request) label Mar 28, 2026
bact requested a review from Copilot March 29, 2026 09:51
Contributor

Copilot AI left a comment

Pull request overview

Adds a new Thai phonetic matching (soundex) module to the Rust library, including multiple algorithms and shared Thai-character utilities, intended to match PyThaiNLP outputs.

Changes:

  • Introduces thai_chars utilities (Thai character constants + classification helpers).
  • Adds 5 soundex algorithm implementations plus a dispatcher and similarity scoring.
  • Exposes the new modules from the crate root (lib.rs).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/lib.rs | Exposes soundex and thai_chars modules publicly. |
| src/thai_chars.rs | Adds Thai character constants and helper predicates used by soundex implementations. |
| src/soundex/mod.rs | Adds soundex dispatcher (soundex) and soundex_similarity, re-exporting algorithm entrypoints. |
| src/soundex/udom83.rs | Implements Udom83 with regex preprocessing and fixed-length code output. |
| src/soundex/lk82.rs | Implements LK82 with preprocessing and vowel/cluster handling. |
| src/soundex/metasound.rs | Implements MetaSound with consonant-only encoding and karan removal. |
| src/soundex/prayut_and_somchaip.rs | Implements Thai-English cross-language soundex with right-truncation behavior. |
| src/soundex/complete_soundex.rs | Implements Complete Soundex with heuristic syllable splitting and detailed encoding rules. |

Comment thread src/soundex/complete_soundex.rs Outdated
Comment on lines +86 to +95
// 0. อัต pattern
if text.starts_with("อัต") && text.chars().count() > 3 {
    let rest_start = text.char_indices().nth(3).map(|(i, _)| i).unwrap_or(text.len());
    return vec![
        (&text[..rest_start], None),
        // Prepend ต to the rest
    ];
    // This needs special handling - we'll allocate for this case
}

Copilot AI Mar 29, 2026

heuristic_split has an incomplete special-case: the อัต branch returns a vector containing only the first slice and leaves a comment about prepending to the rest, but the rest is never returned/processed. This will silently drop the remainder of the word for inputs starting with "อัต…". Implement the intended split (including the modified remainder, likely requiring allocation) or remove this branch until it is implemented.

Suggested change
// 0. อัต pattern
if text.starts_with("อัต") && text.chars().count() > 3 {
let rest_start = text.char_indices().nth(3).map(|(i, _)| i).unwrap_or(text.len());
return vec![
(&text[..rest_start], None),
// Prepend ต to the rest
];
// This needs special handling - we'll allocate for this case
}

Author

@phoneee phoneee Mar 29, 2026

The อัต branch returns a 2-element vec! at lines 89-92: ("อัต", None) and ("ตรา", None). The rest is not dropped. Test test_heuristic_split_at_pattern verifies this.
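
For readers following this thread, the split being discussed can be sketched in isolation. The snippet below is an assumption-laden illustration, not the PR's code: `split_at_prefix` is a hypothetical helper, it returns owned `String`s instead of the `(&str, Option<_>)` tuples used in heuristic_split, and it only reproduces the ("อัต", "ตรา") result described in the reply above.

```rust
/// Hypothetical helper: split off a leading "อัต" and start the remainder
/// with an extra ต. Returns `None` when the pattern does not apply.
fn split_at_prefix(text: &str) -> Option<(String, String)> {
    const PREFIX: &str = "อัต";
    let rest = text.strip_prefix(PREFIX)?;
    if rest.is_empty() {
        // Nothing follows the prefix, so there is nothing to split.
        return None;
    }
    // The remainder gains a leading ต, so it cannot be a borrowed slice
    // of `text`; this is why the second part has to be allocated.
    let mut second = String::from("ต");
    second.push_str(rest);
    Some((PREFIX.to_string(), second))
}

fn main() {
    assert_eq!(
        split_at_prefix("อัตรา"),
        Some(("อัต".to_string(), "ตรา".to_string()))
    );
    assert_eq!(split_at_prefix("อัต"), None);
    assert_eq!(split_at_prefix("กา"), None);
}
```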

Comment thread src/soundex/complete_soundex.rs
'บ' => "บบ", 'ป' => "ปป",
'ผ' => "พผ", 'ฝ' => "ฟฝ", 'พ' | 'ภ' => "พพ", 'ฟ' => "ฟฟ",
'ม' => "มม",
'ร' | 'ล' | 'ฬ' | 'ฤ' => "รร", 'ว' => "วว",

Copilot AI Mar 29, 2026

initial_code has a mapping for ฤ, but process_syllable only treats characters as initials when is_thai_consonant(chars[idx]) is true. With the current is_thai_consonant definition, ฤ is never considered an initial consonant here, so this case is effectively dead and words starting with ฤ will produce an empty init_code. Handle ฤ explicitly in the initial-detection logic (if the algorithm expects it) or remove the unreachable mapping for clarity.

Suggested change
- 'ร' | 'ล' | 'ฬ' | 'ฤ' => "รร", 'ว' => "วว",
+ 'ร' | 'ล' | 'ฬ' => "รร", 'ว' => "วว",

Comment thread src/soundex/metasound.rs Outdated
Comment on lines +58 to +61
// Keep only consonants and thanthakhat
let mut chars: Vec<char> = text
.chars()
.filter(|&c| is_thai_consonant(c) || c == THANTHAKHAT)

Copilot AI Mar 29, 2026

metasound_code includes a mapping for ฦ, but the input filtering keeps only is_thai_consonant(c) (U+0E01..U+0E2E) or thanthakhat. Since ฦ is outside that range, it can never reach metasound_code and will be dropped from the encoding. Either extend the filter to include ฦ (and any other supported non-range letters) or remove the unreachable mapping to avoid inconsistent behavior.

Suggested change
- // Keep only consonants and thanthakhat
- let mut chars: Vec<char> = text
-     .chars()
-     .filter(|&c| is_thai_consonant(c) || c == THANTHAKHAT)
+ // Keep only consonants, ฦ, and thanthakhat
+ let mut chars: Vec<char> = text
+     .chars()
+     .filter(|&c| is_thai_consonant(c) || c == 'ฦ' || c == THANTHAKHAT)

Comment thread src/soundex/lk82.rs
phoneee added 4 commits March 29, 2026 18:38
Add ฏ ฑ ถ ธ to group 2 per Snae & Bruckner (2009) Table p.507.
Upgrade all non-empty assertions to exact-value.
Ref: PyThaiNLP/pythainlp#1383
Fix 3 issues from Copilot review:
- complete_soundex: extend initial consonant check to include ฤ/ฦ
  (was blocked by is_thai_consonant guard, making initial_code dead)
- metasound: extend input filter to include ฦ alongside consonants
- lk82: fix RE_SIGN doc comment to match actual regex (no ์)

2 other comments were incorrect (dismissed with evidence):
- อัต branch already returns both parts correctly (test passes)
- asterisk ญ/ย check matches pythainlp behavior (non-adjacent contains)

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment thread src/soundex/metasound.rs

/// Compute the MetaSound Thai soundex for the given text.
///
/// Returns a phonetic code of the specified `length` (default 4),

Copilot AI Mar 29, 2026

The docstring mentions “default 4” for length, but metasound requires length to be passed explicitly and no wrapper supplies a default. Align the docs with the actual API (or add a convenience function using 4).

Suggested change
- /// Returns a phonetic code of the specified `length` (default 4),
+ /// Returns a phonetic code of the specified `length`,
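
The alternative mentioned in this comment, a convenience function that supplies 4, could look roughly like the sketch below. Everything here is hypothetical: `encode_with_length` is a stand-in that only truncates or zero-pads and is not the MetaSound algorithm, and `encode` and `DEFAULT_LENGTH` are illustrative names.

```rust
/// Default code length named in the doc comment under review.
const DEFAULT_LENGTH: usize = 4;

/// Stand-in for an encoder with an explicit `length` parameter.
/// NOT the real algorithm: it only truncates or zero-pads the input
/// so that the wrapper below has something to call.
fn encode_with_length(text: &str, length: usize) -> String {
    let mut code: String = text.chars().take(length).collect();
    while code.chars().count() < length {
        code.push('0');
    }
    code
}

/// Hypothetical convenience wrapper that applies the documented default.
fn encode(text: &str) -> String {
    encode_with_length(text, DEFAULT_LENGTH)
}

fn main() {
    assert_eq!(encode("abcdef"), "abcd");
    assert_eq!(encode("ab"), "ab00");
    // Callers that need another length still use the explicit form.
    assert_eq!(encode_with_length("abcdef", 2), "ab");
}
```

Rust has no default arguments, so a doc comment can only truthfully say "default 4" if a wrapper like this (or a dispatcher) actually provides it.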

Comment thread src/thai_chars.rs
Comment on lines +31 to +32
/// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";

Copilot AI Mar 29, 2026

THAI_SIGNS appears to be missing the Phinthu sign (U+0E3A) while including thanthakhat (U+0E4C) twice. The doc comment lists U+0E3A, but the constant string does not include it, so sign classification based on this constant would be incorrect. Update the constant to match the documented codepoints (and avoid duplicate entries).

Suggested change
/// Thai signs: ฯ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";
/// Thai signs: ฯ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
pub const THAI_SIGNS: &str = "ฯฺๆ์ํ๎";

Comment thread src/thai_chars.rs
Comment on lines +51 to +53
/// Check if a character is a Thai consonant (ก-ฮ, U+0E01..U+0E2E).
pub fn is_thai_consonant(c: char) -> bool {
('\u{0E01}'..='\u{0E2E}').contains(&c)

Copilot AI Mar 29, 2026

is_thai_consonant currently checks the full U+0E01..=U+0E2E range, which includes ฤ (U+0E24) and ฦ (U+0E26). This contradicts the doc comment (“ก-ฮ”) and the THAI_CONSONANTS constant (which excludes ฤ/ฦ), and also makes downstream code that special-cases ฤ/ฦ redundant. Consider implementing this as membership in THAI_CONSONANTS (or adjust the docs/constants to match the intended behavior).

Suggested change
- /// Check if a character is a Thai consonant (ก-ฮ, U+0E01..U+0E2E).
- pub fn is_thai_consonant(c: char) -> bool {
-     ('\u{0E01}'..='\u{0E2E}').contains(&c)
+ /// Check if a character is a Thai consonant as defined in `THAI_CONSONANTS`.
+ pub fn is_thai_consonant(c: char) -> bool {
+     THAI_CONSONANTS.contains(c)
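
Another option, if a linear scan over the constant is undesirable, is to keep the range check and exclude the two vowel letters explicitly. This is only a sketch of that variant; it assumes the intended set is the 44 letters from ก to ฮ without ฤ and ฦ, which is how the comment above describes THAI_CONSONANTS.

```rust
/// Range-based consonant check that skips ฤ (U+0E24) and ฦ (U+0E26).
/// Assumes the intended set is U+0E01..=U+0E2E minus those two letters.
fn is_thai_consonant(c: char) -> bool {
    matches!(c, '\u{0E01}'..='\u{0E2E}') && !matches!(c, '\u{0E24}' | '\u{0E26}')
}

fn main() {
    assert!(is_thai_consonant('ก'));
    assert!(is_thai_consonant('ฮ'));
    assert!(!is_thai_consonant('\u{0E24}')); // ฤ
    assert!(!is_thai_consonant('\u{0E26}')); // ฦ
    assert!(!is_thai_consonant('a'));

    // The range holds 46 code points; removing ฤ and ฦ leaves 44 consonants.
    let count = ('\u{0E01}'..='\u{0E2E}')
        .filter(|&c| is_thai_consonant(c))
        .count();
    assert_eq!(count, 44);
}
```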

Comment on lines +196 to +208
// Accept consonants (ก-ฮ) plus ฤ and ฦ which have initial_code mappings
if idx < chars.len()
    && (is_thai_consonant(chars[idx]) || chars[idx] == 'ฤ' || chars[idx] == 'ฦ')
{
    init_char = Some(chars[idx]);

    // Special: ทร → ซซ
    if chars[idx] == 'ท' && idx + 1 < chars.len() && chars[idx + 1] == 'ร' {
        init_code = "ซซ".to_string();
        idx += 2;
    } else {
        init_code = initial_code(chars[idx]).to_string();
        idx += 1;

Copilot AI Mar 29, 2026

process_syllable explicitly allows ฦ as an initial character, but initial_code has no mapping for ฦ (it will fall through to "xx"). This produces an invalid/placeholder code for words starting with ฦ. Either add the correct mapping in initial_code (and cover it with a unit test), or stop treating ฦ as a valid initial consonant here.

///
/// # Arguments
/// * `text` - Thai or English text
/// * `length` - Desired length of the soundex code (default: 4)

Copilot AI Mar 29, 2026

The docstring says length has a “default: 4”, but prayut_and_somchaip requires an explicit length argument and there is no overload/dispatcher providing a default. Either remove the “default” wording or provide a convenience wrapper that uses 4 when callers don’t specify a length.

Suggested change
- /// * `length` - Desired length of the soundex code (default: 4)
+ /// * `length` - Desired length of the soundex code

Comment thread src/thai_chars.rs
pub const THAI_VOWELS: &str = "ฤฦะัาำิีึืุูเแโใไๅํ็";

/// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";

Member

Suggested change
- pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";
+ pub const THAI_SIGNS: &str = "\u{0e2f}\u{0e3a}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}";

Better use Unicode code point in the code, especially in this case of non-base characters.

It is more explicit, more friendly to non-Thai readers, easier to spot mistakes.

Actual characters are already provided in the comment.
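
To illustrate the point about spotting mistakes, here is a small sketch of a check that would catch a duplicated or missing sign. The constant uses the escaped form suggested above; `all_unique` is a hypothetical helper, and the second literal reproduces the duplicated-U+0E4C string from the code under review purely for comparison.

```rust
use std::collections::HashSet;

/// Thai signs written as code points, as in the suggestion above.
const THAI_SIGNS: &str = "\u{0e2f}\u{0e3a}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}";

/// True when no character occurs more than once in `s`.
/// Hypothetical helper for illustration.
fn all_unique(s: &str) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false for a repeated character, which ends `all` early.
    s.chars().all(|c| seen.insert(c))
}

fn main() {
    assert_eq!(THAI_SIGNS.chars().count(), 6);
    assert!(all_unique(THAI_SIGNS));
    assert!(THAI_SIGNS.contains('\u{0e3a}')); // phinthu is present

    // The literal under review repeats U+0E4C and omits U+0E3A.
    let reviewed = "\u{0e2f}\u{0e4c}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}";
    assert!(!all_unique(reviewed));
    assert!(!reviewed.contains('\u{0e3a}'));
}
```

A check like this would fit naturally as a unit test next to the constant.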

3 participants