- We propose TVSIP, a novel explainable framework that combines low-level visual artifact detection with high-level semantic analysis for tampered text verification. It aims to leverage MLLMs to enhance the pixel-level localization ability of expert models while providing detailed and reliable tampering analysis, including image description, tampered text detection, localization, and explanation.
- We present TextDDLE, a meticulously curated benchmark that facilitates both training and evaluation of tampered text analysis capabilities. Created through a systematic pipeline utilizing GPT-4o with expert verification, TextDDLE supports the four fundamental tasks of tampering analysis.
- Extensive experiments demonstrate that semantic clues notably improve model performance and robustness in the TTD task. TVSIP offers strong robustness to image degradation and excellent generalization to unseen scenarios.
Dataset
Because the TextDDLE-PT subset is very large, we distribute it separately from the other subsets.
| Dataset | Link |
|---|---|
| TextDDLE-PT | BaiduYun:p8q6 |
| TextDDLE w/o PT | BaiduYun:5yre |
Note:
- The TextDDLE dataset may only be used for non-commercial research purposes. Any scholar or organization that wants to use the TextDDLE dataset should first fill in this Application Form, sign the Legal Commitment, and email both to us (eelwjin@scut.edu.cn, cc: eegtxu@mail.scut.edu.cn). When submitting the application form, please list or attach 1-2 of your publications from the recent 6 years to show that you (or your team) do research in related fields such as OCR, image forgery detection and localization, and document image processing.
- We will give you the decompression password after your application has been received and approved.
- All users must follow all use conditions; otherwise, the authorization will be revoked.
Model Zoo
| Model | Checkpoint |
|---|---|
| Locator | BaiduYun:4ake |
| Pretrained Interpreter | BaiduYun:ibv9 |
| Fine-tuned Interpreter | BaiduYun:avw5 |
Inference Results of TVSIP
You can download all inference results of TVSIP from BaiduYun:j3jb.
```bash
git clone https://github.com/SCUT-DLVCLab/TVSIP.git
cd TVSIP
conda create --name tvsip --file requirements.txt
conda activate tvsip
```
Data preparation
- Download the TextDDLE dataset into the datasets folder.
- Move JSON files in TextDDLE to the data folder.
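The two steps above can be sketched as follows. This is a minimal illustration only: the `TextDDLE` subfolder name and the location of the JSON files inside the extracted release are assumptions, so adjust the paths to match what you actually download.

```python
# Sketch of the data-preparation steps above. The "datasets/TextDDLE"
# path and JSON layout are assumptions about the extracted release.
import glob
import os
import shutil

os.makedirs("datasets", exist_ok=True)  # place the downloaded dataset here
os.makedirs("data", exist_ok=True)      # JSON annotation files go here

# Move every JSON file found under the extracted dataset into data/.
for json_path in glob.glob("datasets/TextDDLE/**/*.json", recursive=True):
    shutil.move(json_path, os.path.join("data", os.path.basename(json_path)))
```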
For Locator:
```bash
bash tools/train_locator.sh
```
For Interpreter:
```bash
bash tools/train_interpreter_stage1.sh
```
You can also skip the pretraining step and fine-tune directly:
```bash
bash tools/train_interpreter_stage2.sh
```
Note: since visual expert models (i.e., the low-level vision clue branch of the Locator) are not the focus of this work, we directly use the results produced by SegFormer. You can download the inference results of the expert models from BaiduYun:j3jb.
For the high-level semantic clue branch of Locator:
```bash
bash tools/infer_locator.sh
```
For Interpreter:
```bash
bash tools/infer_interpreter.sh
```
For Locator:
```bash
bash tools/evaluation_for_locator.sh
```
You can also obtain the final fusion results from the Locator.
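This README does not spell out how the low-level and high-level clues are combined into the final fusion result. As a purely hypothetical sketch (not the repository's actual fusion code, and `fuse_masks` is an invented helper name), fusing a pixel-level mask from the expert model with a region-level mask from the semantic branch might look like:

```python
# Hypothetical illustration of fusing two binary tampering masks.
# Not the repository's actual fusion logic.
import numpy as np

def fuse_masks(low_level, high_level, mode="intersection"):
    """Combine a pixel-level mask (visual expert branch) with a
    region-level mask (semantic branch) of the same shape."""
    low = np.asarray(low_level, dtype=bool)
    high = np.asarray(high_level, dtype=bool)
    if mode == "intersection":
        return low & high  # keep pixels flagged by both branches
    return low | high      # union: pixels flagged by either branch

# Toy example: 2x3 masks from the two branches.
low = np.array([[1, 1, 0], [0, 1, 0]])
high = np.array([[1, 0, 0], [0, 1, 1]])
fused = fuse_masks(low, high)
```

Intersection suppresses false positives that only one branch raises; union maximizes recall. Which (if either) the Locator uses is determined by the evaluation script above.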
For Interpreter:
```bash
bash tools/evaluation_for_interpreter.sh
```
If you have any questions, feel free to contact me at eegtxu@mail.scut.edu.cn.
The code and dataset may be used and distributed under the CC BY-NC-ND 4.0 license for non-commercial research purposes only.
- This repository can only be used for non-commercial research purposes.
- For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
- Copyright 2025, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.
If you find this paper helpful, please consider giving this repo a ⭐ and citing:
```bibtex
@inproceedings{xu2025pixels,
  title={From Pixels to Semantics: A Novel MLLM-Driven Approach for Explainable Tampered Text Detection},
  author={Xu, Guitao and Yi, Ziqi and Zhang, Peirong and Cao, Jiahuan and Wu, Shihang and Jin, Lianwen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={757--766},
  year={2025}
}
```