fix: correct 6 diacritization bugs found in 83K poem corpus by lesmartiepants · Pull Request #74 · linuxscout/mishkal

lesmartiepants · 2026-03-07T10:26:45Z

Context

Assalamu alaikum Dr. Zerrouki,

Thank you for building and maintaining Mishkal — it is an invaluable tool for Arabic NLP. We used it to diacritize a corpus of 83,377 Arabic poems and, through systematic analysis of the output, identified several reproducible bugs. This PR contains fixes for 6 issues that fall within the Mishkal codebase itself.

Summary of Fixes

1. Duplicate diacritical marks (35,928 poems affected)

Doubled diacritics (e.g. ِِ double kasra, ْْ double sukun) appear in the output. The root cause is in alyahmor affix constant tables, but we added a post-processing safety net in _ajust_vocalized_result() using re.sub(u'([\u064B-\u0652])\\1+', u'\\1', text) to collapse duplicates.
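For reference, a minimal standalone sketch of that collapse rule. The regex is the one quoted above; the function name `collapse_duplicate_marks` is ours for illustration and is not part of Mishkal.

```python
import re

# U+064B (fathatan) through U+0652 (sukun): the tashkeel range used in the PR's regex.
_DUPLICATE_MARK = re.compile(u'([\u064B-\u0652])\\1+')


def collapse_duplicate_marks(text):
    """Collapse a run of the *same* diacritic into a single occurrence."""
    return _DUPLICATE_MARK.sub(u'\\1', text)


# noon + kasra + kasra becomes noon + kasra
doubled_kasra = u'\u0646\u0650\u0650'
cleaned = collapse_duplicate_marks(doubled_kasra)

# shadda followed by kasra is two different marks, so it is left alone
shadda_kasra = u'\u0646\u0651\u0650'
untouched = collapse_duplicate_marks(shadda_kasra)
```

Because the backreference only matches a repeat of the identical mark, legitimate pairs such as shadda plus a vowel are not affected.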

2. Impossible hamza + sukun sequences (23,745 poems affected)

Hamza characters receive vowel+sukun, which is phonologically impossible. Added a post-processing strip in _ajust_vocalized_result().
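A rough sketch of what such a strip could look like, not the exact rule in the PR. The set of hamza carriers (ء أ ؤ إ ئ), the treatment of tanween as a vowel mark, and the name `strip_hamza_vowel_sukun` are assumptions made for this example.

```python
import re

# Assumed hamza carriers: U+0621 (ء) and U+0623..U+0626 (أ ؤ إ ئ).
# Assumed vowel marks: U+064B..U+0650 (tanween and short vowels).
_HAMZA_VOWEL_SUKUN = re.compile(u'([\u0621\u0623-\u0626])([\u064B-\u0650])\u0652')


def strip_hamza_vowel_sukun(text):
    """Drop a sukun that directly follows a vowel mark on a hamza, keeping the vowel."""
    return _HAMZA_VOWEL_SUKUN.sub(u'\\1\\2', text)


# alef-with-hamza + fatha + sukun becomes alef-with-hamza + fatha
fixed = strip_hamza_vowel_sukun(u'\u0623\u064E\u0652')
```

The pattern is deliberately limited to hamza carriers, so other letters pass through unchanged.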

3. Missing diacritics for unknown/foreign words (related to #2)

In unknown_tashkeel.py:vocalize_foreign(), a missing else branch caused a length mismatch between characters and marks arrays, making araby.joint() return empty strings for many common words (e.g. إبراهيم, إسماعيل, أمريكا). Added the missing branch.
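The failure mode can be illustrated with a simplified stand-in. This is not the real `vocalize_foreign()` code: `joint` below only mimics the behaviour described above (empty string on a length mismatch), and `assign_marks`, its branch layout, the default fatha, and the placeholder value are hypothetical.

```python
ALEF, WAW, YEH = u'\u0627', u'\u0648', u'\u064A'
YEH_HAMZA = u'\u0626'
FATHA, KASRA = u'\u064E', u'\u0650'
NOT_DEF_HARAKA = u'\u0640'  # placeholder meaning "no mark" (value assumed for this sketch)
LONG_VOWELS = (ALEF, WAW, YEH)


def joint(letters, marks):
    """Stand-in for araby.joint(): returns '' when the two sequences differ in length."""
    if len(letters) != len(marks):
        return u''
    parts = []
    for letter, mark in zip(letters, marks):
        parts.append(letter)
        if mark != NOT_DEF_HARAKA:
            parts.append(mark)
    return u''.join(parts)


def assign_marks(word, with_else_branch):
    """Hypothetical mark assignment with the same branch shape as the bug."""
    marks = []
    previous = u''
    for char in word:
        if char in LONG_VOWELS:
            marks.append(NOT_DEF_HARAKA)
        elif previous in LONG_VOWELS:
            if char == YEH_HAMZA:
                marks.append(KASRA)
            elif with_else_branch:
                marks.append(NOT_DEF_HARAKA)  # the added branch: one entry per character
            # without that branch nothing is appended, so marks ends up one short
        else:
            marks.append(FATHA)
        previous = char
    return marks


word = u'\u0646\u0627\u0631'  # noon, alef, reh
before = joint(word, assign_marks(word, with_else_branch=False))
after = joint(word, assign_marks(word, with_else_branch=True))
```

With the branch missing, `before` is empty because the marks list has two entries for three letters; with it, every letter has exactly one entry and the word is returned.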

4. Spurious shadda on function words (19,836 poems affected)

Common prepositions like عَلَى and إِلَى incorrectly received shadda (عَلَّى, إِلَّى), producing non-existent Arabic forms. Added targeted post-processing corrections and two unambiguous entries to CorrectedTashkeel (الى → إِلَى, أيضا → أَيْضًا).
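A sketch of the post-processing half of this fix, under our own assumptions: it only touches standalone words (no attached prefixes), accepts either ordering of shadda and fatha on the lam, and uses the illustrative name `fix_function_word_shadda`. The actual rules in the PR may be written differently.

```python
import re

_ARABIC = u'\u0621-\u0652'                      # Arabic letters and tashkeel marks
_SHADDA_FATHA = u'(?:\u0651\u064E|\u064E\u0651)'  # shadda+fatha in either order


def _standalone(pattern):
    """Match `pattern` only when it is not attached to other Arabic letters or marks."""
    return re.compile(u'(?<![%s])%s(?![%s])' % (_ARABIC, pattern, _ARABIC))


_RULES = [
    # ain + fatha + lam + (shadda, fatha) + alef maksura -> ain + fatha + lam + fatha + alef maksura
    (_standalone(u'\u0639\u064E\u0644' + _SHADDA_FATHA + u'\u0649'),
     u'\u0639\u064E\u0644\u064E\u0649'),
    # alef-hamza-below + kasra + lam + (shadda, fatha) + alef maksura -> same word without shadda
    (_standalone(u'\u0625\u0650\u0644' + _SHADDA_FATHA + u'\u0649'),
     u'\u0625\u0650\u0644\u064E\u0649'),
]


def fix_function_word_shadda(text):
    """Remove the spurious shadda from the two prepositions when they appear on their own."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


corrected = fix_function_word_shadda(u'\u0639\u064E\u0644\u0651\u064E\u0649')
```

Matching only the impossible output form, rather than overriding every input spelling, is what keeps other readings of the same letters intact.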

5. Missing sun letter assimilation (2,226 poems affected)

After the definite article ال, sun letters were missing the assimilation shadda (e.g. الْنُور instead of النُّور). Added post-processing to detect الْ + sun letter and insert the shadda. Relates to existing issue #39.
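A possible shape for this rule, sketched under stated assumptions: the article is word-initial (prefixed forms such as وَال are not handled here), the sukun on the lam is dropped to match the النُّور target form, and a negative lookahead skips words where the shadda is already present. The name `add_sun_letter_shadda` is illustrative only.

```python
import re

SHADDA = u'\u0651'
# The 14 sun letters: ت ث د ذ ر ز س ش ص ض ط ظ ل ن
SUN_LETTERS = (u'\u062A\u062B\u062F\u0630\u0631\u0632\u0633'
               u'\u0634\u0635\u0636\u0637\u0638\u0644\u0646')

_MISSING_ASSIMILATION = re.compile(
    u'(?<![\u0621-\u0652])'        # not preceded by an Arabic letter or mark
    u'\u0627\u0644\u0652'          # alef + lam + sukun
    u'([%s])'                      # a sun letter
    u'(?![\u064B-\u0650]?\u0651)'  # shadda not already present (before or after a vowel)
    % SUN_LETTERS
)


def add_sun_letter_shadda(text):
    """Insert the assimilation shadda after a word-initial article plus sun letter."""
    return _MISSING_ASSIMILATION.sub(u'\u0627\u0644\\1' + SHADDA, text)


# alef, lam, sukun, noon, damma, waw, reh
unassimilated = u'\u0627\u0644\u0652\u0646\u064F\u0648\u0631'
assimilated = add_sun_letter_shadda(unassimilated)
```

The lookahead accepts the shadda in either position relative to the vowel, so already-correct words are never given a second shadda, and moon letters do not match the character class at all.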

6. Default word limit too low (structural)

Changed the default limit from 1,000 to 20,000 words, which covers virtually all realistic single-text inputs without requiring an explicit set_limit() call.

Files Changed

| File | Changes |
| --- | --- |
| `mishkal/tashkeel.py` | Default limit 1000→20000; 5 new post-processing rules in `_ajust_vocalized_result()` |
| `mishkal/tashkeel_const.py` | Added 2 entries to `CorrectedTashkeel` dictionary |
| `mishkal/unknown_tashkeel.py` | Fixed missing `else` clause in `vocalize_foreign()` |

+64 / -23 lines across 3 files.

What is NOT included

We also found bugs in the alyahmor and qalsadi dependencies:

alyahmor: Doubled diacritics in aly_stem_noun_const.py and aly_stem_stopword_const.py prefix/suffix tables; sun letter shadda logic issue in noun_affixer.py
qalsadi: Ya possessive/nisba confusion (ياء المتكلم vs ياء النسبة) — affects 44,594 poems but requires deeper morphological analysis changes

These will be submitted as separate PRs/issues on the respective repositories.

Approach

Fixes 1, 2, 4, and 5 use conservative post-processing in _ajust_vocalized_result() to avoid changing the core analysis pipeline; fix 3 is a single added branch in vocalize_foreign(), and fix 6 only changes a default value. This minimizes regression risk while catching the most impactful output errors. We intentionally avoided aggressive pre-corrections to words that could have legitimate alternative readings.

Notes

Happy to split this into individual PRs per fix if that would be easier to review.
We have a detailed investigation report documenting the root cause analysis for each issue if that would be useful.
These fixes relate to existing issues #39 ("Add the respect of rule of لا تقف العرب الا على ساكن"), #49 ("Problem in tanween"), and #68 ("Wrong Haraka").

Thank you very much for your excellent work in service of the Arabic language.

The default limit of 1000 words silently truncates longer texts, which are common in poetry collections and book passages. Users had to explicitly call set_limit() to process texts longer than ~1000 words. Increase the default to 20000 to handle most use cases without requiring manual configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ukun

Add two post-processing corrections in _ajust_vocalized_result():

1. Collapse duplicate consecutive tashkeel marks (e.g., doubled sukun ْْ, doubled kasra ِِ) into single marks. Root cause is in the alyahmor dependency's affix constant tables which contain doubled diacritics in entries like كَالْْ (double sukun on prefix 'kal'), َتَانِِ (double kasra on suffix 'taan'), and others. Affects 35,928 poems + 4,574 poems (al-lam corruption) in a corpus of 83,377 Arabic poems.
2. Remove impossible vowel+sukun sequences on hamza characters where a letter has both a vowel mark and sukun simultaneously. Keeps only the vowel mark. Affects 23,745 poems in the corpus.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nown words

The vocalize_foreign() function had a bug where the elif branch handling characters after long vowels (ALEF/YEH/WAW) only appended a mark when the current character was YEH_HAMZA. For all other characters, no mark was appended, causing a length mismatch between the word and marks arrays. This made araby.joint() return an empty string, leaving many unknown/foreign words completely unvocalized.

Words that previously failed and now work correctly:
- إنسان, أولئك, إبراهيم, إسماعيل, أمريكا, أين, أول, etc.

Fix: add an else clause to append NOT_DEF_HARAKA when the current character is not YEH_HAMZA, ensuring every character gets exactly one mark entry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two changes to address wrongly diacritized function words (19,836 poems):

1. Add post-processing corrections in _ajust_vocalized_result() for function words that never take shadda in Arabic:
   - عَلَّى -> عَلَى (the preposition 'ala never has shadda)
   - إِلَّى -> إِلَى (the preposition ila never has shadda)
2. Expand CorrectedTashkeel dictionary in tashkeel_const.py with unambiguous pre-corrections:
   - الى (without hamza) -> إِلَى (no valid alternative reading)
   - أيضا -> أَيْضًا (unambiguous adverb)

The post-processing approach is safer than pre-processing for words like على which could be the name Ali in some contexts, because it targets specific impossible output forms (عَلَّى) rather than overriding all input forms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After the definite article ال, sun letters (ت ث د ذ ر ز س ش ص ض ط ظ ل ن) must carry shadda due to assimilation (الإدغام الشمسي). For example, الشَمس should be الشَّمس, and النَاس should be النَّاس.

Add post-processing in _ajust_vocalized_result() to detect الْ followed by a sun letter without shadda, and insert the missing shadda. Uses negative lookahead to avoid double-shadda when assimilation is already correctly marked.

Affects 2,226 poems in a corpus of 83,377 Arabic poems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

lesmartiepants · 2026-03-07T10:38:40Z

Test Results

We ran a 25-test suite against both master and fix/tashkeel-bugs to verify these fixes:

| Branch | Passed | Failed | Total |
| --- | --- | --- | --- |
| `master` | 23 | 2 | 25 |
| `fix/tashkeel-bugs` | 25 | 0 | 25 |

Per-Fix Results

Fix 1 (word limit 1000 → 20000; item 6 in the PR summary), clear improvement:

master: 1260-word input truncated to 863 words (68%)
fix: all 1260 words diacritized (100%)

Fix 2 (duplicate diacritics; item 1 in the PR summary), clear improvement:

master: الْمُجْتَهِدَتَانِِ has doubled kasra on dual feminine
fix: single kasra الْمُجْتَهِدَتَانِ

Fixes 3-5 (hamza-sukun, function word shadda, sun letter assimilation; items 2, 4, and 5 in the PR summary), defensive safety nets:

Both branches pass for common vocabulary (analyzer handles them correctly)
Fixes provide post-processing guards for edge cases with unknown/foreign words
These were observed at scale in our 83,377-poem corpus where rare words and unusual contexts triggered the bugs

End-to-End Poem Test (Al-Mutanabbi)

Both branches produce identical, correct output for classical poetry: 71 diacritical marks, no duplicates, no impossible combinations, sun letters properly assimilated.
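Checks of this kind can be approximated with a few regular expressions. The sketch below is only an illustration of the idea, not the contents of tests/test_poetry_fixes.py: the function name `audit_tashkeel`, the report keys, and the choice to flag vowel+sukun on any letter (not just hamza) are our assumptions.

```python
import re

_MARK = re.compile(u'[\u064B-\u0652]')                    # any tashkeel mark
_DUPLICATE = re.compile(u'([\u064B-\u0652])\\1')          # same mark twice in a row
_VOWEL_PLUS_SUKUN = re.compile(u'[\u064B-\u0650]\u0652')  # vowel mark followed by sukun


def audit_tashkeel(text):
    """Count diacritical marks and two kinds of suspicious sequences in `text`."""
    return {
        'marks': len(_MARK.findall(text)),
        'duplicates': len(_DUPLICATE.findall(text)),
        'vowel_plus_sukun': len(_VOWEL_PLUS_SUKUN.findall(text)),
    }


# noon + kasra + kasra: two marks, one duplicate pair
report = audit_tashkeel(u'\u0646\u0650\u0650')
```

A clean output is one where `duplicates` and `vowel_plus_sukun` are both zero; the `marks` total can then be compared between branches.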

Verdict

All fixes are safe to merge. No regressions detected. Fixes 1 & 2 have immediately measurable improvements; Fixes 3-5 are preventive guards that improve robustness at scale.

Test script: tests/test_poetry_fixes.py (25 tests covering all 5 fixes)

Tests cover: word limit increase, duplicate diacritics removal, impossible hamza+sukun, spurious shadda on function words, and sun letter assimilation. All 25 pass on fix branch; 2 fail on master.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…n report

Numbered-step pipeline for Arabic poetry diacritization:
- 01_export_poems: DB to parquet
- 02_diacritize: Mishkal batch processing
- 03_postprocess: 8 fix rules for known Mishkal bugs
- 04_audit: quality checks
- 05_upload: DB upload with checkpointing
- run_pipeline: orchestrator entry point
- generate_report: bilingual HTML analysis report

Shared modules: config.py (paths, constants), arabic_utils.py (text helpers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Covers first-time setup, full backfill, incremental updates, dry runs, resume from interruption, report generation, and adding new rules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Siraj Farage and others added 5 commits March 7, 2026 02:16

Siraj Farage and others added 3 commits March 7, 2026 02:38

docs(pipeline): add INSTRUCTIONS.md with setup, usage, and common cases

507f9f0


lesmartiepants mentioned this pull request Mar 7, 2026

feat(quality): add tashkeel diacritization pipeline and Mishkal fixes lesmartiepants/poetry-bil-araby#205

Merged

lesmartiepants wants to merge 8 commits into linuxscout:master from lesmartiepants:fix/tashkeel-bugs
