On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
Yannic Neuhaus1, Nicolas Flammarion2, Matthias Hein1, Francesco Croce3
1University of Tübingen 2EPFL 3ELLIS Institute Finland, Aalto University
Download the datasets here and unzip the file in ./data.
We provide our datasets with four different input representations; the corresponding jsonl files contain the substring "image", "desc", "grid", or "table" in their names.
For the training data, we also provide reasoning traces:
- desc: simple descriptive reasoning
- table: ASCII visualization of the grid after each step
- grid: a more concise ASCII visualization of the grid after each step
- table_desc / grid_desc: combination of the descriptive reasoning with the respective grid visualization
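As a quick sanity check after unzipping, the jsonl files can be inspected with a few lines of Python. This is a minimal sketch; the helper name is ours, and the field names in each record depend on the file, so the sketch just reports whatever keys it finds:

```python
import json
from pathlib import Path

def inspect_jsonl(path):
    """Count the records in a .jsonl file and return the keys of the first one."""
    n, first_keys = 0, None
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if first_keys is None:
                first_keys = sorted(record)
            n += 1
    return n, first_keys

# Substitute any of the provided jsonls, e.g.:
# n, keys = inspect_jsonl("./data/train/train_grid.jsonl")
# print(n, keys)
```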
```
git clone https://github.com/YanNeu/frozen_ood.git
cd frozen_ood
conda env create -f environment.yml
```
Use task=sft_text for text-based inputs and task=sft_image for image inputs. For example,

```
python src/main.py epochs=10 task=sft_text data_path="./data/train/train_grid.jsonl" run_name="sft_grid"
```

fine-tunes the models on grid input without reasoning traces, and

```
python src/main.py epochs=10 task=sft_text data_path="./data/train/train_grid_reasoning_grid_desc.jsonl" run_name="sft_grid_reas_grid_desc"
```

uses the combined descriptive and grid-based reasoning traces.
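To sweep over several reasoning-trace variants, the fine-tuning commands above can be generated programmatically. A minimal sketch: only the no-reasoning and grid_desc file names appear verbatim in this README, so the other suffixes produced by the helper are assumptions based on the naming scheme described above:

```python
def sft_command(representation="grid", reasoning=""):
    """Build the fine-tuning command for one input representation and
    reasoning-trace variant, following this repo's naming scheme.
    Note: only "" and "grid_desc" are confirmed by the README; other
    suffixes ("desc", "grid", "table_desc") are assumed to follow it."""
    suffix = f"_reasoning_{reasoning}" if reasoning else ""
    data = f"./data/train/train_{representation}{suffix}.jsonl"
    run = f"sft_{representation}" + (f"_reas_{reasoning}" if reasoning else "")
    return [
        "python", "src/main.py", "epochs=10", "task=sft_text",
        f"data_path={data}", f"run_name={run}",
    ]

# To actually launch the runs (uncomment):
# import subprocess
# for reasoning in ["", "desc", "grid", "grid_desc"]:
#     subprocess.run(sft_command("grid", reasoning), check=True)
```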
After fine-tuning a model, you can evaluate it on all ID test sets via

```
python src/eval.py load_model_path="./checkpoints/sft_grid" data_path="./data/test_id/test_level3_4_5_6_grid.jsonl" save_dir="./results_id"
```
or on one of the OOD test sets:

```
python src/eval.py load_model_path="./checkpoints/sft_grid" data_path="./data/test_ood/test_level7_grid.jsonl" save_dir="./results_ood"
```
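To evaluate several checkpoints on both splits, the two invocations above can be wrapped in a small helper. This sketch uses only the arguments shown in this README; the helper name is ours:

```python
def eval_command(checkpoint, data_path, save_dir):
    """Build an eval command from the three arguments used in this README."""
    return [
        "python", "src/eval.py",
        f"load_model_path={checkpoint}",
        f"data_path={data_path}",
        f"save_dir={save_dir}",
    ]

# The ID and OOD evaluations above, as command lists (run via subprocess.run):
# cmds = [
#     eval_command("./checkpoints/sft_grid",
#                  "./data/test_id/test_level3_4_5_6_grid.jsonl", "./results_id"),
#     eval_command("./checkpoints/sft_grid",
#                  "./data/test_ood/test_level7_grid.jsonl", "./results_ood"),
# ]
```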
```bibtex
@article{neuhaus2026oodreasoning,
  title={On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks},
  author={Yannic Neuhaus and Nicolas Flammarion and Matthias Hein and Francesco Croce},
  journal={arXiv preprint arXiv:2602.15460},
  year={2026},
}
```

