Comparative Analysis of the Intrinsic Metrics for Tokenizers and their effect on Downstream Tasks for Hindi and Marathi
Shagun Dwivedi · Kaushik Gopalan
The 64th Annual Meeting of the Association for Computational Linguistics, 2026
In this paper, we compare the performance of five existing tokenizers that use UTF-8 inputs, and we study how, ceteris paribus, different tokenization schemes affect the ability of language models to perform question answering, transliteration, and grapheme-to-phoneme conversion, as well as their robustness to noise. We also propose a novel grapheme cluster tokenizer, a form of visual character unit level tokenizer for Devanagari. We assess whether the performance of tokenizers on intrinsic evaluation metrics translates to the downstream performance of models trained using those tokenizers.
- auto_eval/ contains code for the automated evaluation framework for the question answering tasks
- eval/ contains code for model inference for QA, and inference + evaluation for word-level tasks
- train_llm/ contains code for training T5 for QA tasks and the word-level tasks
- training_tokenizers/ contains code for training and implementation of all tokenizers used in the study
We assess the performance of six different tokenizers: Byte Pair Encoding (BPE), Unigram, WordPiece, the grapheme cluster tokenizer, Grapheme Pair Encoding (GPE), and Byte Pair Encoding with transliteration during the pretokenization step (ITR+BPE).
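To illustrate what tokenizing at the grapheme cluster level means for Devanagari, here is a minimal sketch that groups a base character with the combining marks that follow it (vowel signs, nukta, anusvara, virama). It is not the implementation in training_tokenizers/, which may use different rules. The function name `grapheme_clusters`, the `join_conjuncts` flag, and the decision to merge consonants across a virama are assumptions made for this example.

```python
import unicodedata

VIRAMA = "\u094D"                # Devanagari sign virama (halant)
JOINERS = {"\u200C", "\u200D"}   # zero-width non-joiner / joiner


def is_extender(ch):
    """True for characters that attach to the preceding base character."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"} or ch in JOINERS


def grapheme_clusters(text, join_conjuncts=True):
    """Split text into approximate Devanagari grapheme clusters.

    Assumption: a cluster is a base character plus any following
    combining marks. With join_conjuncts=True, a letter that directly
    follows a virama is merged into the same cluster (conjuncts).
    """
    clusters = []
    for ch in text:
        if clusters and is_extender(ch):
            # Vowel signs, nukta, anusvara, virama, etc. extend the cluster.
            clusters[-1] += ch
        elif (clusters and join_conjuncts
              and clusters[-1].endswith(VIRAMA)
              and unicodedata.category(ch) == "Lo"):
            # Consonant following a virama: part of the same conjunct.
            clusters[-1] += ch
        else:
            # Any other character starts a new cluster.
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    print(grapheme_clusters("नमस्ते"))
    print(grapheme_clusters("नमस्ते", join_conjuncts=False))
```

Under these assumptions, "नमस्ते" is split into three clusters (न, म, स्ते) when conjuncts are joined and four (न, म, स्, ते) when they are not. A production tokenizer would more likely follow the extended grapheme cluster rules of the Unicode text segmentation standard.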
The tokenizers are assessed using information-theoretic intrinsic evaluation metrics, as well as through extrinsic evaluation of a language model pre-trained with each tokenizer. We also report the correlation between the intrinsic and extrinsic metrics.
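As a rough illustration of two commonly used intrinsic metrics, fertility and Rényi efficiency, the sketch below computes both from a tokenized corpus. Fertility is taken here as the average number of tokens per word, and Rényi efficiency as the Rényi entropy of the unigram token distribution divided by the log of the vocabulary size. The function names, the default `alpha=2.5`, and the use of the tokenizer's full vocabulary size as the normalizer are assumptions for this example and may differ from the settings used in the study.

```python
import math
from collections import Counter


def fertility(words, tokenize):
    """Average number of tokens produced per word.

    `tokenize` is any callable mapping a word to a list of tokens
    (a hypothetical stand-in for a trained tokenizer).
    """
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


def renyi_efficiency(tokens, vocab_size, alpha=2.5):
    """Renyi entropy of the token distribution, normalized by log(vocab_size).

    alpha=2.5 is an assumed default, not a value taken from this repository.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [count / total for count in counts.values()]
    if alpha == 1.0:
        # The limit alpha -> 1 is the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


if __name__ == "__main__":
    # Character-level splitting used as a toy tokenizer.
    print(fertility(["ab", "cde"], list))
    print(renyi_efficiency(["a", "b", "c", "d"], vocab_size=4))
```

A uniform distribution over the whole vocabulary gives an efficiency of 1, while a distribution dominated by a few very frequent tokens gives a lower value.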
- Intrinsic Evaluation
  - Rényi Efficiency
  - Fertility
  - Percentile Frequency
- Extrinsic Evaluation
  - Generation Task:
    - QA - Accuracy
    - Noise Robust QA - Accuracy
  - Word-Level Tasks
    - G2P - PER, WER
    - Transliteration - CER
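The word-level metrics listed above (CER, PER, WER) are usually built on edit distance. The following is a minimal sketch under stated assumptions, not the code in eval/: error rates are computed as total edit distance over total reference length, characters are Unicode code points, and WER for G2P is taken as the fraction of words whose predicted output differs from the reference. All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit distance divided by total reference length.

    Pass strings for CER, or lists of phoneme symbols for PER.
    """
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def word_error_rate(refs, hyps):
    """Fraction of words whose prediction is not an exact match (assumed WER)."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)


if __name__ == "__main__":
    print(error_rate(["abc"], ["abd"]))                   # CER on strings
    print(error_rate([["k", "a", "t"]], [["k", "a"]]))    # PER on phoneme lists
    print(word_error_rate(["ab", "cd"], ["ab", "ce"]))
```

Because `edit_distance` works on any sequence, the same `error_rate` function covers both CER and PER. Whether errors are averaged over the whole corpus, as here, or per word is a design choice the source does not specify.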
@inproceedings{dwivedi-gopalan-2026-comparative,
    title = "Comparative Analysis of the Intrinsic Metrics for Tokenizers
      and their effect on Downstream Tasks for Hindi and Marathi",
    author = "Dwivedi, Shagun and
      Gopalan, Kaushik",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = july,
    year = "2026",
    publisher = "Association for Computational Linguistics",
    url = "https://github.com/flame-cai/tokenizer-comparative-analysis",
}



