flame-cai/tokenizer-comparative-analysis

Comparative Analysis of the Intrinsic Metrics for Tokenizers and their effect on Downstream Tasks for Hindi and Marathi

Shagun Dwivedi · Kaushik Gopalan
The 64th Annual Meeting of the Association for Computational Linguistics, 2026

Introduction

In this paper, we compare five existing tokenizers that operate on UTF-8 input and study how, ceteris paribus, the choice of tokenization scheme affects language model performance on question answering, transliteration, and grapheme-to-phoneme conversion, as well as robustness to noise. We also propose a novel grapheme cluster tokenizer, a form of visual-character-unit-level tokenizer for Devanagari. Finally, we assess whether a tokenizer's performance on intrinsic evaluation metrics translates into downstream performance of models trained with that tokenizer.

Repo Structure

  • auto_eval/ contains code for the automated evaluation framework for the QA tasks
  • eval/ contains code for model inference for QA, and for inference and evaluation on the word-level tasks
  • train_llm/ contains code for training T5 on the QA tasks and the word-level tasks
  • training_tokenizers/ contains the training code and implementations of all tokenizers used in the study

Methodology

We assess the performance of six tokenizers: Byte Pair Encoding (BPE), Unigram, WordPiece, the grapheme cluster tokenizer, Grapheme Pair Encoding (GPE), and Byte Pair Encoding with transliteration during the pretokenization step (ITR+BPE).

Grapheme Cluster Tokenization Algorithm
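
The general idea of splitting Devanagari text into grapheme clusters can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation in training_tokenizers/: the function names are hypothetical, and the rule used here (combining marks attach to the preceding character, and a letter that follows a virama joins the conjunct) is a simplification of full Unicode grapheme segmentation.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari halant, joins consonants into a conjunct
ZWJ = "\u200D"     # zero-width joiner
ZWNJ = "\u200C"    # zero-width non-joiner


def _extends(cluster: str, ch: str) -> bool:
    """Return True if ch should be appended to the current cluster (simplified rule)."""
    # Combining marks (matras, anusvara, nukta, virama) and joiners stay
    # with the character they modify.
    if unicodedata.category(ch) in ("Mn", "Mc") or ch in (ZWJ, ZWNJ):
        return True
    # Assumption: any letter directly after a virama (optionally followed
    # by ZWJ) continues the conjunct.
    return unicodedata.category(ch) == "Lo" and cluster.endswith((VIRAMA, VIRAMA + ZWJ))


def grapheme_clusters(text: str) -> list[str]:
    """Split text into visual character units, one list entry per cluster."""
    clusters: list[str] = []
    for ch in text:
        if clusters and _extends(clusters[-1], ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    # The word "namaste" splits into three clusters: na, ma, ste.
    print(grapheme_clusters("\u0928\u092E\u0938\u094D\u0924\u0947"))
```

Each cluster would then be treated as one vocabulary item, so a conjunct with its vowel sign is never split across tokens.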

The tokenizers are assessed using information-theoretic intrinsic evaluation metrics, as well as through extrinsic evaluation of language models pre-trained with each tokenizer. We also report the correlation between the intrinsic and extrinsic metrics.
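
As an illustration of what intrinsic metrics measure, the sketch below computes two common ones, fertility and Rényi efficiency, from tokenizer output. It is a hedged example rather than the code used in this study: the function names are hypothetical, fertility is taken to be the mean number of tokens per word, Rényi efficiency is taken to be the Rényi entropy of the unigram token distribution divided by the log of the vocabulary size, and the default alpha = 2.5 is an assumption, since the value used in the paper is not stated on this page.

```python
import math
from collections import Counter


def fertility(tokenized_words: list[list[str]]) -> float:
    """Mean number of tokens per word; one inner list of tokens per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def renyi_efficiency(tokens: list[str], vocab_size: int, alpha: float = 2.5) -> float:
    """Renyi entropy of the token distribution, normalised by log(vocab_size)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1.0:
        # The limit alpha -> 1 is the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


if __name__ == "__main__":
    # Hypothetical tokenizer output for three words.
    words = [["na", "ma", "ste"], ["bha", "rat"], ["hin", "di"]]
    flat = [tok for word in words for tok in word]
    print(fertility(words))
    print(renyi_efficiency(flat, vocab_size=16))
```

A uniform distribution over the whole vocabulary gives an efficiency of 1, while a distribution dominated by a few very frequent tokens gives a value closer to 0.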

  • Intrinsic Evaluation
    • Rényi Efficiency
    • Fertility
    • Percentile Frequency
  • Extrinsic Evaluation
    • Generation Tasks
      • QA - Accuracy
      • Noise Robust QA - Accuracy
    • Word-Level Tasks
      • G2P - PER, WER
      • Transliteration - CER
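
The word-level error rates listed above are conventionally built on edit distance. The following is a minimal sketch of the usual definitions, not the evaluation code in eval/: the function names are hypothetical, CER and PER are assumed to be total edit distance divided by total reference length (over characters and phonemes respectively), and WER for G2P is assumed to be the fraction of words whose predicted sequence is not an exact match.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[Sequence], hyps: Sequence[Sequence]) -> float:
    """Total edits over total reference length: CER for strings, PER for phoneme lists."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def word_error_rate(refs: Sequence[Sequence], hyps: Sequence[Sequence]) -> float:
    """Fraction of items whose prediction differs from the reference."""
    return sum(list(r) != list(h) for r, h in zip(refs, hyps)) / len(refs)


if __name__ == "__main__":
    # Hypothetical phoneme sequences for two words.
    refs = [["n", "a", "m"], ["d", "i"]]
    hyps = [["n", "a", "n"], ["d", "i"]]
    print(error_rate(refs, hyps))       # PER
    print(word_error_rate(refs, hyps))  # WER
```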

Results

Intrinsic Results

QA Results

Word Level Results

Citation

@inproceedings{dwivedi-gopalan-2026-comparative,
    title = "Comparative Analysis of the Intrinsic Metrics for Tokenizers and their effect on Downstream Tasks for Hindi and Marathi",
    author = "Dwivedi, Shagun  and
      Gopalan, Kaushik",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = july,
    year = "2026",
    publisher = "Association for Computational Linguistics",
    url = "https://github.com/flame-cai/tokenizer-comparative-analysis",
}
