Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Haoji Zhang^*, Xin Gu^*, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang^†

^*Equal contributions, ^†Correspondence

We propose VITAL, a tool-augmented framework that enables advanced long video reasoning and temporal grounding.

We also introduce MTVR, a high-quality multi-task video reasoning training dataset.

MTVR Dataset

The dataset is available here.

Training and Evaluation

1. Prepare coding environment:

This project is based on verl 0.4.0.dev, vllm 0.8.5.post1, transformers 4.51.1, torch 2.6.0, and Python 3.10. Install the dependencies:

bash setup.sh
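
After `setup.sh` finishes, it can help to confirm that the installed versions match the ones listed above. The following is a minimal sketch and not part of the repository: the function name `report_versions` is hypothetical, and it assumes a `python3` interpreter on the PATH and that each library is installed under the distribution name shown. It only prints the versions side by side and does not enforce a match.

```shell
# Hypothetical helper (not part of the repository): print the installed
# version of each pinned library next to the version this README lists.
report_versions() {
  python3 - <<'PY'
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in this README.
expected = {
    "verl": "0.4.0.dev",
    "vllm": "0.8.5.post1",
    "transformers": "4.51.1",
    "torch": "2.6.0",
}

for name, want in expected.items():
    try:
        have = version(name)
    except PackageNotFoundError:
        have = "not installed"
    print(f"{name}: installed={have} expected={want}")
PY
}

report_versions
```

Installed version strings can carry a local build suffix, so a visual comparison is safer than a strict string match.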

2. Prepare model and data:

VITAL/
├── data/
│   ├── actnet/
│   ├── charades/
│   ├── longvideo-reason/
│   ├── mmvu/
│   ├── nextgqa/
│   ├── rextime/
│   ├── vidchapters/
│   ├── Video-R1-data/
│   ├── videomme/
│   ├── videommmu/
│   ├── vidi/
│   ├── vsibench/
├── models/
│   ├── Qwen2.5-VL-3B-Instruct/
│   ├── Qwen2.5-VL-7B-Instruct/
├── outputs/ # an empty folder to store the outputs
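
The layout above can be created up front so that the model and dataset downloads have a place to land. This is a minimal sketch, assuming it is run from the directory that should become the project root; the function name `make_layout` and its optional root argument are illustrative and not part of the repository.

```shell
# Hypothetical helper: create the empty folder skeleton shown in the tree.
make_layout() {
  local root="${1:-.}"   # project root; defaults to the current directory
  local d
  for d in actnet charades longvideo-reason mmvu nextgqa rextime \
           vidchapters Video-R1-data videomme videommmu vidi vsibench; do
    mkdir -p "$root/data/$d"
  done
  # Model checkpoints and run outputs live next to data/.
  mkdir -p "$root/models" "$root/outputs"
}

make_layout .
```

Because `mkdir -p` is used throughout, running the helper again on an existing tree leaves it unchanged.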

Prepare the Qwen2.5-VL-7B-Instruct model:

mkdir -p models
huggingface-cli download --resume-download --local-dir-use-symlinks False  Qwen/Qwen2.5-VL-7B-Instruct --revision main --local-dir ./models/Qwen2.5-VL-7B-Instruct

Prepare the datasets for training and evaluation. You can download the following datasets from their official websites:

Folder	Dataset	Source
data/actnet	ActivityNet-MR	https://cs.stanford.edu/people/ranjaykrishna/densevid/
data/charades	Charades-STA	https://github.com/jiyanggao/TALL
data/longvideo-reason	LongVideo-Reason	https://github.com/NVlabs/Long-RL/tree/main/longvideo-reason
data/mmvu	MMVU	https://github.com/yale-nlp/MMVU
data/nextgqa	NExT-GQA	https://github.com/doc-doc/NExT-GQA
data/rextime	ReXTime	https://huggingface.co/datasets/ReXTime/ReXTime
data/vidchapters	VidChapters-7M	https://github.com/antoyang/VidChapters
data/Video-R1-data	Video-R1	https://huggingface.co/datasets/Video-R1/Video-R1-data
data/videomme	Video-MME	https://github.com/MME-Benchmarks/Video-MME
data/videommmu	Video-MMMU	https://github.com/EvolvingLMMs-Lab/VideoMMMU
data/vidi	VIDI / VUE-TR	https://github.com/bytedance/vidi
data/vsibench	VSI-Bench	https://github.com/vision-x-nyu/thinking-in-space
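
Once the downloads are in place, a quick check can confirm that every folder in the table is populated. The sketch below is an illustration rather than repository code: `check_data` is a hypothetical name, and "populated" here only means that the folder exists and contains at least one entry. It does not validate the file contents or the layout each dataset expects.

```shell
# Hypothetical helper: report which dataset folders exist and are non-empty.
# Returns non-zero if at least one folder is missing or empty.
check_data() {
  local root="${1:-.}"   # project root; defaults to the current directory
  local missing=0
  local d
  for d in actnet charades longvideo-reason mmvu nextgqa rextime \
           vidchapters Video-R1-data videomme videommmu vidi vsibench; do
    if [ -d "$root/data/$d" ] && [ -n "$(ls -A "$root/data/$d" 2>/dev/null)" ]; then
      echo "ok       data/$d"
    else
      echo "missing  data/$d"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

check_data . || echo "some dataset folders are still missing or empty"
```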

3. Training and evaluation script

Run the following scripts in order to train and evaluate the model:

bash train_stage_1_sft.sh
bash train_stage_2_dgrpo.sh
bash train_stage_3_sft.sh
bash train_stage_4_dgrpo.sh

Note:

You need to set some configuration values in the corresponding scripts, such as PRETRAINED_CKPT and YOUR_WANDB_API_KEY.
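
Since the four stages are long-running, it may be convenient to launch them from a small wrapper that stops at the first failure and keeps one log per stage. The following is a sketch under assumptions, not the authors' workflow: `run_stages` and the `outputs/logs` location are hypothetical, and the wrapper does not pass checkpoints between stages, so PRETRAINED_CKPT and the other values must already be set inside each script.

```shell
# Hypothetical wrapper: run each given script in order, stop at the first
# failure, and keep one log file per stage under outputs/logs/.
run_stages() {
  local log_dir="outputs/logs"
  local stage
  mkdir -p "$log_dir"
  for stage in "$@"; do
    echo "==> $stage"
    if ! bash "$stage" > "$log_dir/$(basename "$stage" .sh).log" 2>&1; then
      echo "stage failed: $stage (see $log_dir)" >&2
      return 1
    fi
  done
}

# Example call, once the configuration values are set in the scripts:
# run_stages train_stage_1_sft.sh train_stage_2_dgrpo.sh \
#            train_stage_3_sft.sh train_stage_4_dgrpo.sh
```

Stopping at the first failure avoids starting a later stage from a checkpoint that was never written.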

Citation

If you find this project useful in your research, please consider citing:

@article{zhang2025thinking,
  title={Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning},
  author={Zhang, Haoji and Gu, Xin and Li, Jiawen and Ma, Chixiang and Bai, Sule and Zhang, Chubin and Zhang, Bowen and Zhou, Zhichao and He, Dongliang and Tang, Yansong},
  journal={arXiv preprint arXiv:2508.04416},
  year={2025}
}

