ParsTranslit

This repository contains the code, data, and models for the ParsTranslit project.

transliterate.py contains example inference code that runs our transliteration model with CTranslate2.
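For orientation, here is a minimal sketch of what CTranslate2 inference for a transliteration model can look like. It is not the script shipped in this repository: the character-level tokenization with a "▁" space marker, the helper names `to_tokens` and `from_tokens`, and the `model_dir` argument are all assumptions made for illustration. The released model may instead expect a subword tokenizer or direction tags, so check transliterate.py for the actual preprocessing.

```python
from typing import List


def to_tokens(text: str) -> List[str]:
    """Split text into character tokens, marking spaces with '▁'.

    ASSUMPTION: character-level tokenization is illustrative only;
    the released model may use a different scheme.
    """
    return ["▁" if ch == " " else ch for ch in text.strip()]


def from_tokens(tokens: List[str]) -> str:
    """Invert to_tokens: join the characters and restore spaces."""
    return "".join(tokens).replace("▁", " ")


def transliterate(text: str, model_dir: str, device: str = "cpu") -> str:
    """Run one sentence through a converted CTranslate2 model.

    model_dir is a hypothetical path to a CTranslate2 model directory.
    """
    import ctranslate2  # imported lazily so the helpers work without it

    translator = ctranslate2.Translator(model_dir, device=device)
    results = translator.translate_batch([to_tokens(text)], beam_size=4)
    # Each result holds a list of hypotheses; each hypothesis is a token list.
    return from_tokens(results[0].hypotheses[0])
```

Calling `transliterate("салом дунё", "path/to/model")` would load the model on each call; for batch work, create the `Translator` once and reuse it.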

Bi-directional Tajik-Farsi Transliteration Model

ParsText (filtered)

The released data will include a filtered version of the corpus described in the paper "ParsText: A Digraphic Corpus for Tajik-Farsi Transliteration".

This filtered version includes only the manually collected blog posts. The BBC articles that were previously included have been removed, because another dataset was found to contain a more complete set of articles that overlaps with ours.

Cite Us!

If you use or reference ParsTranslit, please cite:

@inproceedings{merchant-tang-2026-parstranslit,
    title = "{P}ars{T}ranslit: Truly Versatile {T}ajik-{F}arsi Transliteration",
    author = "Merchant, Rayyan  and
      Tang, Kevin",
    editor = "Demberg, Vera  and
      Inui, Kentaro  and
      Marquez, Llu{\'i}s",
    booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.findings-eacl.73/",
    doi = "10.18653/v1/2026.findings-eacl.73",
    pages = "1431--1443",
    ISBN = "979-8-89176-386-9",
    abstract = "As a digraphic language, the Persian language utilizes two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Despite the significant similarity between the dialects of each country, script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking ``siblings''. To overcome this, previously-published efforts have investigated machine transliteration models to convert between the two scripts. Unfortunately, most efforts did not use datasets other than those they created, limiting these models to certain domains of text such as archaic poetry or word lists. A truly usable transliteration system must be capable of handling varied domains, meaning that such models lack the versatility required for real-world usage. The contrast in domain between data also obscures the task{'}s true difficulty. We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets, and present two datasets of our own. Our results across domains provide clearer understanding of the task, and set comprehensive comparable leading benchmarks. Overall, our model achieves chrF++ and Normalized CER scores of 87.91 and 0.05 from Farsi to Tajik and 92.28 and 0.04 from Tajik to Farsi. Our model, data, and code are available at https://github.com/merchantrayyan/ParsTranslit."
}

If you use or reference ParsText, please cite our paper:

@InProceedings{MerchantTang_CAWL_Accepted_2024,
author = {Rayyan Merchant and Kevin Tang},
title = {{P}ars{T}ext: A Digraphic Corpus for {T}ajik-{F}arsi Transliteration},
booktitle = {{Proceedings of the Second Workshop on Computation and Written Language (CAWL 2024)}},
year = {2024},
editor = {Gorman, Kyle and Prud’hommeaux, Emily and Sproat, Richard and Roark, Brian},
note = {accepted (Preprint: \url{https://doi.org/10.31234/osf.io/xdf2w})},
publisher = {Association for Computational Linguistics},
month = {05},
pubstate = {forthcoming},
address = {Torino, Italia},
}
