[lib]: Add Thai soundex algorithms by phoneee · Pull Request #94 · PyThaiNLP/nlpo3

phoneee · 2026-03-28T20:49:58Z

Summary

Add Thai soundex module with 5 phonetic matching algorithms and supporting utilities, ported from pythainlp to native Rust.

What's included

Thai character classification (thai_chars.rs)

Constants: consonants, vowels, tone marks, digits, signs
Functions: is_thai_character, is_thai_consonant, is_thai_vowel, is_thai_digit, remove_tonemarks
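As an illustration of the helpers listed above, here is a minimal sketch of tone-mark removal. It is not the PR's code: it assumes the tone marks are the four code points U+0E48 to U+0E4B, and `is_thai_tonemark` is a hypothetical helper name.

```rust
/// Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..=U+0E4B).
/// ASSUMPTION: hypothetical helper, not a function listed in the PR.
fn is_thai_tonemark(c: char) -> bool {
    ('\u{0E48}'..='\u{0E4B}').contains(&c)
}

/// Return `text` with every tone mark removed; all other characters are kept.
fn remove_tonemarks(text: &str) -> String {
    text.chars().filter(|&c| !is_thai_tonemark(c)).collect()
}
```

Filtering by `char` rather than by byte matters here, because every Thai character occupies three bytes in UTF-8.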

Soundex algorithms (soundex/)

LK82 — Lorchirachoonkul (1982), 5-char phonetic code
Udom83 — Udompanich (1983), 7-char phonetic code with 11 regex preprocessing rules
MetaSound — Snae & Brückner (2009), consonant-only variable-length code
Prayut & Somchaip — Thai-English cross-language soundex (1998), right-truncation
Complete Soundex — Tapsai, Meesad & Haruechaiyasak (2020), multi-syllable with heuristic splitting

Utilities

soundex() dispatcher — select engine by name
soundex_similarity() — character-by-character similarity score
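A character-by-character similarity score could look roughly like the sketch below. The thread does not show the PR's exact formula, so the normalization (matching positions divided by the longer code's length) and the function name are assumptions.

```rust
/// Fraction of positions at which two soundex codes hold the same character.
/// Positions beyond the end of the shorter code count as mismatches.
/// ASSUMPTION: the PR's actual scoring formula is not shown in this thread.
fn soundex_similarity_sketch(code_a: &str, code_b: &str) -> f64 {
    // Compare by `char`, not by byte, so multi-byte Thai letters line up.
    let a: Vec<char> = code_a.chars().collect();
    let b: Vec<char> = code_b.chars().collect();
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 0.0; // two empty codes: nothing to compare
    }
    let matches = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    matches as f64 / longest as f64
}
```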

Test coverage

53 unit tests + 7 doctests, all verified against pythainlp output.

Notes

Follows existing code style: Thai literals in regex (not unicode escapes), lazy_static for compiled patterns
complete_soundex uses heuristic syllable splitting (no external tokenizer dependency)
sound.py (word2audio, audio_vector) is not ported — requires panphon external library

Follow codebase convention from tokenizer/tcc_rules.rs. Keep unicode codepoints in comments for reference.

Copilot

Pull request overview

Adds a new Thai phonetic matching (soundex) module to the Rust library, including multiple algorithms and shared Thai-character utilities, intended to match PyThaiNLP outputs.

Changes:

Introduces thai_chars utilities (Thai character constants + classification helpers).
Adds 5 soundex algorithm implementations plus a dispatcher and similarity scoring.
Exposes the new modules from the crate root (lib.rs).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

File	Description
`src/lib.rs`	Exposes `soundex` and `thai_chars` modules publicly.
`src/thai_chars.rs`	Adds Thai character constants and helper predicates used by soundex implementations.
`src/soundex/mod.rs`	Adds soundex dispatcher (`soundex`) and `soundex_similarity`, re-exporting algorithm entrypoints.
`src/soundex/udom83.rs`	Implements Udom83 with regex preprocessing and fixed-length code output.
`src/soundex/lk82.rs`	Implements LK82 with preprocessing and vowel/cluster handling.
`src/soundex/metasound.rs`	Implements MetaSound with consonant-only encoding and karan removal.
`src/soundex/prayut_and_somchaip.rs`	Implements Thai-English cross-language soundex with right-truncation behavior.
`src/soundex/complete_soundex.rs`	Implements Complete Soundex with heuristic syllable splitting and detailed encoding rules.
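Selecting an engine by name, as the dispatcher in `src/soundex/mod.rs` is described as doing, might be sketched as follows. The `Engine` enum and the accepted name strings are assumptions derived from the file names in the table; the PR's real dispatcher signature is not shown in this thread.

```rust
use std::str::FromStr;

/// One variant per algorithm file listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Engine {
    Lk82,
    Udom83,
    MetaSound,
    PrayutAndSomchaip,
    CompleteSoundex,
}

impl FromStr for Engine {
    type Err = String;

    /// Map an engine name to a variant; unknown names become an error
    /// instead of silently falling back to a default engine.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "lk82" => Ok(Engine::Lk82),
            "udom83" => Ok(Engine::Udom83),
            "metasound" => Ok(Engine::MetaSound),
            "prayut_and_somchaip" => Ok(Engine::PrayutAndSomchaip),
            "complete_soundex" => Ok(Engine::CompleteSoundex),
            other => Err(format!("unknown soundex engine: {other}")),
        }
    }
}
```

Parsing into an enum first keeps the name check in one place, so each algorithm can then be called through an exhaustive `match`.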

Copilot · 2026-03-29T09:54:37Z

+    // 0. อัต pattern
+    if text.starts_with("อัต") && text.chars().count() > 3 {
+        let rest_start = text.char_indices().nth(3).map(|(i, _)| i).unwrap_or(text.len());
+        return vec![
+            (&text[..rest_start], None),
+            // Prepend ต to the rest
+        ];
+        // This needs special handling - we'll allocate for this case
+    }
+


heuristic_split has an incomplete special-case: the อัต branch returns a vector containing only the first slice and leaves a comment about prepending ต to the rest, but the rest is never returned/processed. This will silently drop the remainder of the word for inputs starting with "อัต…". Implement the intended split (including the modified remainder, likely requiring allocation) or remove this branch until it is implemented.

Suggested change

-    // 0. อัต pattern
-    if text.starts_with("อัต") && text.chars().count() > 3 {
-        let rest_start = text.char_indices().nth(3).map(|(i, _)| i).unwrap_or(text.len());
-        return vec![
-            (&text[..rest_start], None),
-            // Prepend ต to the rest
-        ];
-        // This needs special handling - we'll allocate for this case
-    }

phoneee · 2026-03-29 (edited)

The อัต branch returns a 2-element vec! at lines 89-92: ("อัต", None) and ("ตรา", None). The rest is not dropped. The test test_heuristic_split_at_pattern verifies this.
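To make the disputed behavior concrete, the sketch below shows one way to split off the อัต prefix and prepend ต to the remainder using owned strings. It is an illustration only: `split_at_prefix` is a hypothetical name, and the second tuple field from the PR's return type is left out because its meaning is not shown in this thread.

```rust
/// Split a word that starts with "อัต" into the prefix and the remainder
/// with "ต" prepended. Returns `None` when the pattern does not apply.
/// ASSUMPTION: hypothetical helper, not the PR's `heuristic_split`.
fn split_at_prefix(text: &str) -> Option<(String, String)> {
    const PREFIX: &str = "\u{0E2D}\u{0E31}\u{0E15}"; // อัต
    let rest = text.strip_prefix(PREFIX)?;
    if rest.is_empty() {
        return None; // the word is exactly the prefix, nothing to split
    }
    // Prepending needs a new allocation; a borrowed slice of `text`
    // cannot contain a character that is not at that position in the input.
    let mut tail = String::from('\u{0E15}'); // ต
    tail.push_str(rest);
    Some((PREFIX.to_string(), tail))
}
```

For an input such as อัตรา this yields the pair อัต and ตรา, the same two parts the reply above reports.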

Copilot · 2026-03-29T09:54:38Z

+        'บ' => "บบ", 'ป' => "ปป",
+        'ผ' => "พผ", 'ฝ' => "ฟฝ", 'พ' | 'ภ' => "พพ", 'ฟ' => "ฟฟ",
+        'ม' => "มม",
+        'ร' | 'ล' | 'ฬ' | 'ฤ' => "รร", 'ว' => "วว",


initial_code has a mapping for ฤ, but process_syllable only treats characters as initials when is_thai_consonant(chars[idx]) is true. With the current is_thai_consonant definition, ฤ is never considered an initial consonant here, so this case is effectively dead and words starting with ฤ will produce an empty init_code. Handle ฤ explicitly in the initial-detection logic (if the algorithm expects it) or remove the unreachable mapping for clarity.

Suggested change

-        'ร' | 'ล' | 'ฬ' | 'ฤ' => "รร", 'ว' => "วว",
+        'ร' | 'ล' | 'ฬ' => "รร", 'ว' => "วว",

Copilot · 2026-03-29T09:54:38Z

+    // Keep only consonants and thanthakhat
+    let mut chars: Vec<char> = text
+        .chars()
+        .filter(|&c| is_thai_consonant(c) || c == THANTHAKHAT)


metasound_code includes a mapping for ฦ, but the input filtering keeps only is_thai_consonant(c) (U+0E01..U+0E2E) or thanthakhat. Since ฦ is outside that range, it can never reach metasound_code and will be dropped from the encoding. Either extend the filter to include ฦ (and any other supported non-range letters) or remove the unreachable mapping to avoid inconsistent behavior.

Suggested change

-    // Keep only consonants and thanthakhat
-    let mut chars: Vec<char> = text
-        .chars()
-        .filter(|&c| is_thai_consonant(c) || c == THANTHAKHAT)
+    // Keep only consonants, ฦ, and thanthakhat
+    let mut chars: Vec<char> = text
+        .chars()
+        .filter(|&c| is_thai_consonant(c) || c == 'ฦ' || c == THANTHAKHAT)

Add ฏ ฑ ถ ธ to group 2 per Snae & Bruckner (2009) Table p.507. Upgrade all non-empty assertions to exact-value. Ref: PyThaiNLP/pythainlp#1383

Fix 3 issues from Copilot review:
- complete_soundex: extend initial consonant check to include ฤ/ฦ (was blocked by is_thai_consonant guard, making initial_code dead)
- metasound: extend input filter to include ฦ alongside consonants
- lk82: fix RE_SIGN doc comment to match actual regex (no ์)

2 other comments were incorrect (dismissed with evidence):
- อัต branch already returns both parts correctly (test passes)
- asterisk ญ/ย check matches pythainlp behavior (non-adjacent contains)

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Copilot · 2026-03-29T20:27:49Z

+
+/// Compute the MetaSound Thai soundex for the given text.
+///
+/// Returns a phonetic code of the specified `length` (default 4),


The docstring mentions “default 4” for length, but metasound requires length to be passed explicitly and no wrapper supplies a default. Align the docs with the actual API (or add a convenience function using 4).

Suggested change

-/// Returns a phonetic code of the specified `length` (default 4),
+/// Returns a phonetic code of the specified `length`,

Copilot · 2026-03-29T20:27:49Z

+/// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
+pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";


THAI_SIGNS appears to be missing the Phinthu sign ฺ (U+0E3A) while including ์ twice. The doc comment lists U+0E3A, but the constant string does not include it, so sign classification based on this constant would be incorrect. Update the constant to match the documented codepoints (and avoid duplicate entries).

Suggested change

-/// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
-pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";
+/// Thai signs: ฯ ฺ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
+pub const THAI_SIGNS: &str = "ฯฺๆ์ํ๎";

Copilot · 2026-03-29T20:27:49Z

+/// Check if a character is a Thai consonant (ก-ฮ, U+0E01..U+0E2E).
+pub fn is_thai_consonant(c: char) -> bool {
+    ('\u{0E01}'..='\u{0E2E}').contains(&c)


is_thai_consonant currently checks the full U+0E01..=U+0E2E range, which includes ฤ (U+0E24) and ฦ (U+0E26). This contradicts the doc comment (“ก-ฮ”) and the THAI_CONSONANTS constant (which excludes ฤ/ฦ), and also makes downstream code that special-cases ฤ/ฦ redundant. Consider implementing this as membership in THAI_CONSONANTS (or adjust the docs/constants to match the intended behavior).

Suggested change

-/// Check if a character is a Thai consonant (ก-ฮ, U+0E01..U+0E2E).
-pub fn is_thai_consonant(c: char) -> bool {
-    ('\u{0E01}'..='\u{0E2E}').contains(&c)
+/// Check if a character is a Thai consonant as defined in `THAI_CONSONANTS`.
+pub fn is_thai_consonant(c: char) -> bool {
+    THAI_CONSONANTS.contains(c)
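The difference between the two definitions can be checked directly. The sketch below assumes a 44-letter consonant string that excludes ฤ (U+0E24) and ฦ (U+0E26); it stands in for the PR's `THAI_CONSONANTS`, whose actual value is not shown in this thread, and both function names are hypothetical.

```rust
/// The 44 consonant letters, without the vowel letters ฤ (U+0E24) and ฦ (U+0E26).
/// ASSUMPTION: stand-in for the PR's `THAI_CONSONANTS` constant.
const THAI_CONSONANTS_SKETCH: &str = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ";

/// Range-based check, as in the code under review: ฤ and ฦ are accepted
/// because they lie inside U+0E01..=U+0E2E.
fn is_consonant_by_range(c: char) -> bool {
    ('\u{0E01}'..='\u{0E2E}').contains(&c)
}

/// Set-based check, as in the suggested change: ฤ and ฦ are rejected.
fn is_consonant_by_set(c: char) -> bool {
    THAI_CONSONANTS_SKETCH.contains(c)
}
```

The two predicates disagree only on ฤ and ฦ, so callers that special-case those letters behave differently depending on which definition is in use.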

Copilot · 2026-03-29T20:27:49Z

+    // Accept consonants (ก-ฮ) plus ฤ and ฦ which have initial_code mappings
+    if idx < chars.len()
+        && (is_thai_consonant(chars[idx]) || chars[idx] == 'ฤ' || chars[idx] == 'ฦ')
+    {
+        init_char = Some(chars[idx]);
+
+        // Special: ทร → ซซ
+        if chars[idx] == 'ท' && idx + 1 < chars.len() && chars[idx + 1] == 'ร' {
+            init_code = "ซซ".to_string();
+            idx += 2;
+        } else {
+            init_code = initial_code(chars[idx]).to_string();
+            idx += 1;


process_syllable explicitly allows ฦ as an initial character, but initial_code has no mapping for ฦ (it will fall through to "xx"). This produces an invalid/placeholder code for words starting with ฦ. Either add the correct ฦ mapping in initial_code (and cover it with a unit test), or stop treating ฦ as a valid initial consonant here.
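A small consistency check can surface this kind of gap. The sketch below uses a trimmed stand-in for `initial_code` that contains only the arms quoted in this review plus the "xx" fall-through; `unmapped_initials` is a hypothetical helper, and no mapping for ฦ is proposed here.

```rust
/// Trimmed stand-in for `initial_code`: only the arms quoted in this review,
/// with the "xx" placeholder as the fall-through.
fn initial_code(c: char) -> &'static str {
    match c {
        'ร' | 'ล' | 'ฬ' | 'ฤ' => "รร",
        'ว' => "วว",
        _ => "xx",
    }
}

/// Return every accepted initial that still maps to the placeholder code.
/// ASSUMPTION: hypothetical helper for tests, not part of the PR.
fn unmapped_initials(accepted: &[char]) -> Vec<char> {
    accepted
        .iter()
        .copied()
        .filter(|&c| initial_code(c) == "xx")
        .collect()
}
```

Run against the list of characters that `process_syllable` accepts as initials, a check like this should return an empty vector once every accepted initial has a real mapping.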

Copilot · 2026-03-29T20:27:50Z

+///
+/// # Arguments
+/// * `text` - Thai or English text
+/// * `length` - Desired length of the soundex code (default: 4)


The docstring says length has a “default: 4”, but prayut_and_somchaip requires an explicit length argument and there is no overload/dispatcher providing a default. Either remove the “default” wording or provide a convenience wrapper that uses 4 when callers don’t specify a length.

Suggested change

-/// * `length` - Desired length of the soundex code (default: 4)
+/// * `length` - Desired length of the soundex code
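The alternative the comment mentions, a convenience wrapper that supplies the default, might look like the sketch below. The encoder here is a deliberately trivial stand-in (truncate and zero-pad), not the Prayut & Somchaip algorithm, and all three names are hypothetical.

```rust
/// Default code length mentioned in the doc comment under review.
const DEFAULT_LENGTH: usize = 4;

/// Stand-in encoder (NOT the real algorithm): keeps the first `length`
/// characters and pads with '0' so the result always has `length` characters.
fn encode_with_length(text: &str, length: usize) -> String {
    let mut code: String = text.chars().take(length).collect();
    // `take(length)` guarantees the count is at most `length`, so no underflow.
    let missing = length - code.chars().count();
    code.extend(std::iter::repeat('0').take(missing));
    code
}

/// Convenience wrapper that applies the documented default length.
fn encode_default(text: &str) -> String {
    encode_with_length(text, DEFAULT_LENGTH)
}
```

With a wrapper like this in place, the "default: 4" wording in the doc comment would describe real behavior instead of an argument the caller must always pass.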

bact · 2026-03-29T20:35:09Z

+pub const THAI_VOWELS: &str = "ฤฦะัาำิีึืุูเแโใไๅํ็";
+
+/// Thai signs: ฯ ์ ๆ ์ ํ ๎ (U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E).
+pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";


Suggested change

-pub const THAI_SIGNS: &str = "ฯ์ๆ์ํ๎";
+pub const THAI_SIGNS: &str = "\u{0e2f}\u{0e3a}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}";

Better use Unicode code point in the code, especially in this case of non-base characters.

It is more explicit, more friendly to non-Thai readers, easier to spot mistakes.

Actual characters are already provided in the comment.
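The point about spotting mistakes can be illustrated with a duplicate check. In the sketch below, the second constant spells out the submitted literal as the earlier review comment describes it (U+0E4C twice, U+0E3A missing); the constant name `THAI_SIGNS_AS_SUBMITTED` and the helper `duplicated_chars` are hypothetical.

```rust
use std::collections::HashSet;

/// Escaped form, matching the documented code points
/// U+0E2F, U+0E3A, U+0E46, U+0E4C, U+0E4D, U+0E4E.
const THAI_SIGNS: &str = "\u{0e2f}\u{0e3a}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}";

/// The literal under review, spelled out as the review comment describes it:
/// U+0E4C appears twice and U+0E3A is absent.
const THAI_SIGNS_AS_SUBMITTED: &str = "\u{0e2f}\u{0e4c}\u{0e46}\u{0e4c}\u{0e4d}\u{0e4e}";

/// Return each character occurrence that repeats an earlier one.
fn duplicated_chars(s: &str) -> Vec<char> {
    let mut seen = HashSet::new();
    // `insert` returns false when the character was already present.
    s.chars().filter(|&c| !seen.insert(c)).collect()
}
```

Written as escapes, the repeated `\u{0e4c}` is visible at a glance, whereas the two combining marks are hard to tell apart when rendered as Thai literals.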

phoneee added 16 commits March 28, 2026 17:32

Add thai_chars module

2ae08c0

Add lk82 soundex

2d94115

Add udom83 soundex

7938ab6

Add metasound soundex

2258d23

Add prayut_and_somchaip soundex

0fd0c18

Simplify is_thai_character to single range

601befa

Upgrade lk82 tests to exact-value assertions

4ebfa3a

Upgrade soundex tests to exact-value assertions

0f2cb10

Use Thai literals instead of unicode escapes

2fd0f93

Follow codebase convention from tokenizer/tcc_rules.rs. Keep unicode codepoints in comments for reference.

Add soundex_similarity function

e3a01d4

Add soundex dispatcher function

1049f1d

Add complete_soundex lookup tables

279bd9a

Add clean_text and heuristic_split for complete_soundex

424135c

Add process_syllable for complete_soundex

7cb957f

Add complete_soundex encoder with heuristic split

d6bba52

Register complete_soundex in dispatcher

93e9033

bact added the enhancement New feature or request label Mar 28, 2026

bact requested a review from Copilot March 29, 2026 09:51

Copilot started reviewing on behalf of bact March 29, 2026 09:51

Copilot AI reviewed Mar 29, 2026

phoneee added 4 commits March 29, 2026 18:38

Fix metasound: add 4 missing consonants per paper, upgrade tests

6c4ce7d

Add ฏ ฑ ถ ธ to group 2 per Snae & Bruckner (2009) Table p.507. Upgrade all non-empty assertions to exact-value. Ref: PyThaiNLP/pythainlp#1383

Add edge case and common word tests

4109830

Fix heuristic_split อัต pattern, fix RE_SIGN comment

697f76d

bact requested a review from Copilot March 29, 2026 20:25

Copilot started reviewing on behalf of bact March 29, 2026 20:25

Copilot AI reviewed Mar 29, 2026

bact requested changes Mar 29, 2026
