fix: correct 6 diacritization bugs found in 83K poem corpus#74
Open
lesmartiepants wants to merge 8 commits intolinuxscout:masterfrom
Open
fix: correct 6 diacritization bugs found in 83K poem corpus#74lesmartiepants wants to merge 8 commits intolinuxscout:masterfrom
lesmartiepants wants to merge 8 commits intolinuxscout:masterfrom
Conversation
The default limit of 1000 words silently truncates longer texts, which is common in poetry collections and book passages. Users had to explicitly call set_limit() to process texts longer than ~1000 characters. Increase the default to 20000 to handle most use cases without requiring manual configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ukun Add two post-processing corrections in _ajust_vocalized_result(): 1. Collapse duplicate consecutive tashkeel marks (e.g., doubled sukun ْْ, doubled kasra ِِ) into single marks. Root cause is in the alyahmor dependency's affix constant tables which contain doubled diacritics in entries like كَالْْ (double sukun on prefix 'kal'), َتَانِِ (double kasra on suffix 'taan'), and others. Affects 35,928 poems + 4,574 poems (al-lam corruption) in a corpus of 83,377 Arabic poems. 2. Remove impossible vowel+sukun sequences on hamza characters where a letter has both a vowel mark and sukun simultaneously. Keeps only the vowel mark. Affects 23,745 poems in the corpus. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nown words The vocalize_foreign() function had a bug where the elif branch handling characters after long vowels (ALEF/YEH/WAW) only appended a mark when the current character was YEH_HAMZA. For all other characters, no mark was appended, causing a length mismatch between the word and marks arrays. This made araby.joint() return an empty string, leaving many unknown/foreign words completely unvocalized. Words that previously failed and now work correctly: - إنسان, أولئك, إبراهيم, إسماعيل, أمريكا, أين, أول, etc. Fix: add an else clause to append NOT_DEF_HARAKA when the current character is not YEH_HAMZA, ensuring every character gets exactly one mark entry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two changes to address wrongly diacritized function words (19,836 poems): 1. Add post-processing corrections in _ajust_vocalized_result() for function words that never take shadda in Arabic: - عَلَّى -> عَلَى (the preposition 'ala never has shadda) - إِلَّى -> إِلَى (the preposition ila never has shadda) 2. Expand CorrectedTashkeel dictionary in tashkeel_const.py with unambiguous pre-corrections: - الى (without hamza) -> إِلَى (no valid alternative reading) - أيضا -> أَيْضًا (unambiguous adverb) The post-processing approach is safer than pre-processing for words like على which could be the name Ali in some contexts, because it targets specific impossible output forms (عَلَّى) rather than overriding all input forms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After the definite article ال, sun letters (ت ث د ذ ر ز س ش ص ض ط ظ ل ن) must carry shadda due to assimilation (الإدغام الشمسي). For example, الشَمس should be الشَّمس, and النَاس should be النَّاس. Add post-processing in _ajust_vocalized_result() to detect الْ followed by a sun letter without shadda, and insert the missing shadda. Uses negative lookahead to avoid double-shadda when assimilation is already correctly marked. Affects 2,226 poems in a corpus of 83,377 Arabic poems. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Test ResultsWe ran a 25-test suite against both
Per-Fix ResultsFix 1 (word limit 1000 → 20000) — Clear improvement:
Fix 2 (duplicate diacritics) — Clear improvement:
Fixes 3-5 (hamza-sukun, function word shadda, sun letter assimilation) — Defensive safety nets:
End-to-End Poem Test (Al-Mutanabbi)Both branches produce identical, correct output for classical poetry: 71 diacritical marks, no duplicates, no impossible combinations, sun letters properly assimilated. VerdictAll fixes are safe to merge. No regressions detected. Fixes 1 & 2 have immediately measurable improvements; Fixes 3-5 are preventive guards that improve robustness at scale. Test script: |
Tests cover: word limit increase, duplicate diacritics removal, impossible hamza+sukun, spurious shadda on function words, and sun letter assimilation. All 25 pass on fix branch; 2 fail on master. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n report Numbered-step pipeline for Arabic poetry diacritization: - 01_export_poems: DB to parquet - 02_diacritize: Mishkal batch processing - 03_postprocess: 8 fix rules for known Mishkal bugs - 04_audit: quality checks - 05_upload: DB upload with checkpointing - run_pipeline: orchestrator entry point - generate_report: bilingual HTML analysis report Shared modules: config.py (paths, constants), arabic_utils.py (text helpers) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers first-time setup, full backfill, incremental updates, dry runs, resume from interruption, report generation, and adding new rules. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
Assalamu alaikum Dr. Zerrouki,
Thank you for building and maintaining Mishkal — it is an invaluable tool for Arabic NLP. We used it to diacritize a corpus of 83,377 Arabic poems and, through systematic analysis of the output, identified several reproducible bugs. This PR contains fixes for 6 issues that fall within the Mishkal codebase itself.
Summary of Fixes
1. Duplicate diacritical marks (35,928 poems affected)
Doubled diacritics (e.g.
ِِdouble kasra,ْْdouble sukun) appear in the output. The root cause is inalyahmoraffix constant tables, but we added a post-processing safety net in_ajust_vocalized_result()usingre.sub(u'([\u064B-\u0652])\\1+', u'\\1', text)to collapse duplicates.2. Impossible hamza + sukun sequences (23,745 poems affected)
Hamza characters receive vowel+sukun, which is phonologically impossible. Added a post-processing strip in
_ajust_vocalized_result().3. Missing diacritics for unknown/foreign words (related to #2)
In
unknown_tashkeel.py:vocalize_foreign(), a missingelsebranch caused a length mismatch between characters and marks arrays, makingaraby.joint()return empty strings for many common words (e.g.إبراهيم,إسماعيل,أمريكا). Added the missing branch.4. Spurious shadda on function words (19,836 poems affected)
Common prepositions like
عَلَىandإِلَىincorrectly received shadda (عَلَّى,إِلَّى), producing non-existent Arabic forms. Added targeted post-processing corrections and two unambiguous entries toCorrectedTashkeel(الى→إِلَى,أيضا→أَيْضًا).5. Missing sun letter assimilation (2,226 poems affected)
After the definite article
ال, sun letters were missing the assimilation shadda (e.g.الْنُورinstead ofالنُّور). Added post-processing to detectالْ+ sun letter and insert the shadda. Relates to existing issue #39.6. Default word limit too low (structural)
Changed the default limit from 1,000 to 20,000 words, which covers virtually all realistic single-text inputs without requiring an explicit
set_limit()call.Files Changed
mishkal/tashkeel.py_ajust_vocalized_result()mishkal/tashkeel_const.pyCorrectedTashkeeldictionarymishkal/unknown_tashkeel.pyelseclause invocalize_foreign()+64 / -23 lines across 3 files.
What is NOT included
We also found bugs in the
alyahmorandqalsadidependencies:aly_stem_noun_const.pyandaly_stem_stopword_const.pyprefix/suffix tables; sun letter shadda logic issue innoun_affixer.pyThese will be submitted as separate PRs/issues on the respective repositories.
Approach
All fixes use conservative post-processing in
_ajust_vocalized_result()to avoid changing the core analysis pipeline. This minimizes regression risk while catching the most impactful output errors. We intentionally avoided aggressive pre-corrections to words that could have legitimate alternative readings.Notes
شكراً جزيلاً على عملك الممتاز في خدمة اللغة العربية.