Haoji Zhang*, Xin Gu*, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang†
*Equal contributions, †Correspondence
We propose VITAL, a tool-augmented framework that enables advanced long video reasoning and temporal grounding.
We also introduce MTVR, a high-quality multi-task video reasoning training dataset.
The dataset is available here.
This project is based on verl 0.4.0.dev, vllm 0.8.5.post1, transformers 4.51.1, torch 2.6.0 and python 3.10. Install dependencies:
bash setup.sh

VITAL/
├── data/
│ ├── actnet/
│ ├── charades/
│ ├── longvideo-reason/
│ ├── mmvu/
│ ├── nextgqa/
│ ├── rextime/
│ ├── vidchapters/
│ ├── Video-R1-data/
│ ├── videomme/
│ ├── videommmu/
│ ├── vidi/
│ ├── vsibench/
├── models/
│ ├── Qwen2.5-VL-3B-Instruct/
│ ├── Qwen2.5-VL-7B-Instruct/
├── outputs/ # an empty folder to store the outputs

- Prepare Qwen2.5-VL-7B-Instruct model
mkdir -p models
huggingface-cli download --resume-download --local-dir-use-symlinks False Qwen/Qwen2.5-VL-7B-Instruct --revision main --local-dir ./models/Qwen2.5-VL-7B-Instruct

- Prepare dataset for training and evaluation. You can download the following datasets from their official websites:
| Folder | Dataset | Source |
|---|---|---|
| data/actnet | ActivityNet-MR | https://cs.stanford.edu/people/ranjaykrishna/densevid/ |
| data/charades | Charades-STA | https://github.com/jiyanggao/TALL |
| data/longvideo-reason | LongVideo-Reason | https://github.com/NVlabs/Long-RL/tree/main/longvideo-reason |
| data/mmvu | MMVU | https://github.com/yale-nlp/MMVU |
| data/nextgqa | NExT-GQA | https://github.com/doc-doc/NExT-GQA |
| data/rextime | ReXTime | https://huggingface.co/datasets/ReXTime/ReXTime |
| data/vidchapters | VidChapters-7M | https://github.com/antoyang/VidChapters |
| data/Video-R1-data | Video-R1 | https://huggingface.co/datasets/Video-R1/Video-R1-data |
| data/videomme | Video-MME | https://github.com/MME-Benchmarks/Video-MME |
| data/videommmu | Video-MMMU | https://github.com/EvolvingLMMs-Lab/VideoMMMU |
| data/vidi | VIDI / VUE-TR | https://github.com/bytedance/vidi |
| data/vsibench | VSI-Bench | https://github.com/vision-x-nyu/thinking-in-space |
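
Before launching training, it can help to confirm that every folder listed in the table above is in place. The following is a minimal sketch and not part of the repository: the function name `check_data_dirs` is hypothetical, and it only checks that each directory exists under the data root, not that its contents are complete.

```shell
# Hypothetical helper (not part of VITAL): report which dataset folders are missing.
# Usage: check_data_dirs [DATA_ROOT]   (DATA_ROOT defaults to ./data)
check_data_dirs() {
  local root="${1:-data}"
  local missing=0
  local name
  # Folder names are taken from the dataset table above.
  for name in actnet charades longvideo-reason mmvu nextgqa rextime \
              vidchapters Video-R1-data videomme videommmu vidi vsibench; do
    if [ ! -d "$root/$name" ]; then
      echo "missing: $root/$name"
      missing=$((missing + 1))
    fi
  done
  echo "$missing folder(s) missing"
  # Succeed only when every folder is present.
  [ "$missing" -eq 0 ]
}
```

Calling `check_data_dirs data` from the `VITAL/` root prints one line per missing folder and returns a non-zero status if any folder is absent.
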
Run the following scripts sequentially to train and evaluate the model:
bash train_stage_1_sft.sh
bash train_stage_2_dgrpo.sh
bash train_stage_3_sft.sh
bash train_stage_4_dgrpo.sh

Note:
- You need to set some configuration values in the corresponding script, such as PRETRAINED_CKPT and YOUR_WANDB_API_KEY.
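
Once the configuration is set, the four stage scripts can be chained so that the pipeline stops as soon as one stage fails. This is a minimal sketch under two assumptions: the scripts sit in a single directory, and a failed stage exits with a non-zero status. The wrapper name `run_stages` is hypothetical and not part of the repository.

```shell
# Hypothetical wrapper (not part of VITAL): run the stage scripts in order
# and stop at the first one that fails.
# Usage: run_stages [SCRIPT_DIR]   (SCRIPT_DIR defaults to the current directory)
run_stages() {
  local dir="${1:-.}"
  local script
  for script in train_stage_1_sft.sh train_stage_2_dgrpo.sh \
                train_stage_3_sft.sh train_stage_4_dgrpo.sh; do
    echo "running: $script"
    # Assumption: a failed stage exits with a non-zero status.
    if ! bash "$dir/$script"; then
      echo "failed: $script" >&2
      return 1
    fi
  done
  echo "all stages finished"
}
```

Stopping early avoids launching a later stage when an earlier one did not finish; whether each stage actually consumes the previous stage's checkpoint depends on how PRETRAINED_CKPT is set in each script.
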
If you find this project useful in your research, please consider citing:
@article{zhang2025thinking,
title={Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning},
author={Zhang, Haoji and Gu, Xin and Li, Jiawen and Ma, Chixiang and Bai, Sule and Zhang, Chubin and Zhang, Bowen and Zhou, Zhichao and He, Dongliang and Tang, Yansong},
journal={arXiv preprint arXiv:2508.04416},
year={2025}
}
