On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
Yannic Neuhaus1, Nicolas Flammarion2, Matthias Hein1, Francesco Croce3
1University of Tübingen 2EPFL 3ELLIS Institute Finland, Aalto University
Download the datasets here and unzip the file in ./data.
We provide our datasets with four different input representations; the corresponding jsonl files contain the substring "image", "desc", "grid", or "table" in their names.
For the training data, we also provide reasoning traces:
- desc: simple descriptive reasoning
- table: ASCII visualization of the grid after each step
- grid: a more concise ASCII visualization of the grid after each step
- table_desc / grid_desc: combination of the descriptive reasoning with the respective grid visualization
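As a quick sanity check after unzipping, the jsonl files can be inspected with a few lines of Python. This is a minimal sketch; the helper name is ours, and the field names in each record depend on the file, so the sketch just reports whatever keys it finds:

```python
import json
from pathlib import Path

def inspect_jsonl(path):
    """Count the records in a .jsonl file and return the keys of the first one."""
    n, first_keys = 0, None
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if first_keys is None:
                first_keys = sorted(record)
            n += 1
    return n, first_keys

# Substitute any of the provided jsonls, e.g.:
# n, keys = inspect_jsonl("./data/train/train_grid.jsonl")
# print(n, keys)
```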
```
git clone https://github.com/YanNeu/frozen_ood.git
cd frozen_ood
conda env create -f environment.yml
```
Use task=sft_text for text-based inputs and task=sft_image for image inputs. For example,

```
python src/main.py epochs=10 task=sft_text data_path="./data/train/train_grid.jsonl" run_name="sft_grid"
```

fine-tunes the models on grid input without reasoning traces, and

```
python src/main.py epochs=10 task=sft_text data_path="./data/train/train_grid_reasoning_grid_desc.jsonl" run_name="sft_grid_reas_grid_desc"
```

uses the combined descriptive and grid-based reasoning traces.
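To sweep over several reasoning-trace variants, the fine-tuning commands above can be generated programmatically. A minimal sketch: only the no-reasoning and grid_desc file names appear verbatim in this README, so the other suffixes produced by the helper are assumptions based on the naming scheme described above:

```python
def sft_command(representation="grid", reasoning=""):
    """Build the fine-tuning command for one input representation and
    reasoning-trace variant, following this repo's naming scheme.
    Note: only "" and "grid_desc" are confirmed by the README; other
    suffixes ("desc", "grid", "table_desc") are assumed to follow it."""
    suffix = f"_reasoning_{reasoning}" if reasoning else ""
    data = f"./data/train/train_{representation}{suffix}.jsonl"
    run = f"sft_{representation}" + (f"_reas_{reasoning}" if reasoning else "")
    return [
        "python", "src/main.py", "epochs=10", "task=sft_text",
        f"data_path={data}", f"run_name={run}",
    ]

# To actually launch the runs (uncomment):
# import subprocess
# for reasoning in ["", "desc", "grid", "grid_desc"]:
#     subprocess.run(sft_command("grid", reasoning), check=True)
```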
After fine-tuning a model, you can evaluate it on all ID test sets via

```
python src/eval.py load_model_path="./checkpoints/sft_grid" data_path="./data/test_id/test_level3_4_5_6_grid.jsonl" save_dir="./results_id"
```
or on one of the OOD test sets:

```
python src/eval.py load_model_path="./checkpoints/sft_grid" data_path="./data/test_ood/test_level7_grid.jsonl" save_dir="./results_ood"
```
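To evaluate several checkpoints on both splits, the two invocations above can be wrapped in a small helper. This sketch uses only the arguments shown in this README; the helper name is ours:

```python
def eval_command(checkpoint, data_path, save_dir):
    """Build an eval command from the three arguments used in this README."""
    return [
        "python", "src/eval.py",
        f"load_model_path={checkpoint}",
        f"data_path={data_path}",
        f"save_dir={save_dir}",
    ]

# The ID and OOD evaluations above, as command lists (run via subprocess.run):
# cmds = [
#     eval_command("./checkpoints/sft_grid",
#                  "./data/test_id/test_level3_4_5_6_grid.jsonl", "./results_id"),
#     eval_command("./checkpoints/sft_grid",
#                  "./data/test_ood/test_level7_grid.jsonl", "./results_ood"),
# ]
```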
```bibtex
@article{neuhaus2026oodreasoning,
  title={On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks},
  author={Yannic Neuhaus and Nicolas Flammarion and Matthias Hein and Francesco Croce},
  journal={arXiv preprint arXiv:2602.15460},
  year={2026},
}
```

