
Spatial Reasoning Probing Study

Does visual training help language models understand space?

One known weakness of language models is spatial reasoning. Having learned from text alone, they have no perceptual grounding for concepts like above, left of, or inside — they've only ever seen these words in context, never experienced what they refer to.

CLIP changes the training regime: instead of predicting masked tokens, it learns to align image-caption pairs. The hypothesis here is that this contrastive visual training leaves a residue in the text embedding space — making spatial relations more linearly decodable, even when no image is present at inference time.

This study tests that hypothesis using probing classifiers: lightweight logistic regression models trained on frozen embeddings to ask whether spatial relation labels are linearly decodable from each representation.
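
The setup can be sketched as a minimal, self-contained toy; the data and feature dimension here are illustrative stand-ins, and the study's actual implementation lives in src/probing.py:

```python
# Minimal linear-probe sketch: can a label be linearly decoded from
# frozen embeddings? (Toy data; stands in for the real pipeline.)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                               # stand-in for frozen embeddings
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)   # toy binary relation label

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="f1")
print(f"mean F1 across folds: {scores.mean():.3f}")
```

The point of using such a low-capacity probe is that a high F1 indicates the information is already linearly present in the frozen embedding, rather than being computed by the probe itself.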

N.B. I am disclosing that I used AI (Claude Code) to assist with coding in this project, as encouraged in the course. Research design, experimental decisions, and interpretation of results are my own.


Research Question

Does visual contrastive training (CLIP) produce text embeddings that encode spatial relations better than purely distributional training (SBERT)? And which specific relation types benefit — or remain resistant?


Main Deliverable

Gully_Probing_Study.ipynb

Results Summary

Mean F1 by Model (36 relations, min 50 examples)

| Model | Mean F1 |
|---|---|
| SBERT (all-mpnet-base-v2) | 0.710 |
| CLIP Text encoder | 0.589 |
| CLIP Concat (image + text, 1024d) | 0.514 |
| CLIP Image encoder | 0.493 |

Main Results Heatmap

[Figure: per-relation F1 heatmap across the four models; see results/figures/]


Models

| Model | Type | Training Objective | Visual Signal |
|---|---|---|---|
| sentence-transformers/all-mpnet-base-v2 | SBERT | NLI + semantic similarity | None |
| openai/clip-vit-base-patch32 (text encoder) | CLIP Text | Contrastive image-text | Indirect |
| openai/clip-vit-base-patch32 (image encoder) | CLIP Image | Contrastive image-text | Direct |
| CLIP text + image (concat, 1024d) | CLIP Concat | Contrastive image-text | Both |
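
As a sketch of how the 1024-d concatenated representation might be built; whether each modality is L2-normalized before concatenation is an assumption here, and the real code lives in src/embedders.py:

```python
import numpy as np

def concat_embed(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """L2-normalize each modality, then concatenate along the feature axis."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return np.concatenate([t, v], axis=1)

# CLIP ViT-B/32 projects both modalities to 512 dims, giving 1024 after concat.
text = np.random.randn(8, 512)    # stand-in for CLIP text embeddings
image = np.random.randn(8, 512)   # stand-in for CLIP image embeddings
both = concat_embed(text, image)
print(both.shape)  # (8, 1024)
```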

Dataset

VSR — Visual Spatial Reasoning (Liu et al., 2022)

  • ~7,680 training examples (image, caption, relation type, True/False label)
  • 64 relation types; 36 retained after filtering relations with fewer than 50 examples
  • Images are COCO images fetched at runtime; only filenames stored in the HuggingFace dataset
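
The minimum-count filter described above can be sketched as follows; the "relation" field name and the toy counts are illustrative, not the dataset's exact schema:

```python
# Keep only relation types with at least MIN_EXAMPLES examples.
from collections import Counter

MIN_EXAMPLES = 50

examples = ([{"relation": "left of"}] * 60
            + [{"relation": "inside"}] * 55
            + [{"relation": "astride"}] * 12)   # toy data

counts = Counter(ex["relation"] for ex in examples)
kept = {rel for rel, n in counts.items() if n >= MIN_EXAMPLES}
filtered = [ex for ex in examples if ex["relation"] in kept]
print(sorted(kept))  # ['inside', 'left of']
```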

Reproducing the Results

Environment

```
git clone https://github.com/nogully/spatial_probing.git
cd spatial_probing
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Embedding Extraction (Google Colab — GPU required)

You may need:

  • A GitHub Personal Access Token (to clone the repo in Colab; likely unnecessary for a public repository)
  • A HuggingFace token (optional; suppresses rate-limit warnings when downloading models)

Open notebooks/02_embedding_extraction.ipynb, run the Colab setup cell (mounts Drive, clones repo, installs deps), then run all cells. Embeddings are cached to Google Drive as .npy files.

Sync the following files to results/embeddings/ locally before running probing:

  • sbert_vsr_train.npy
  • clip_text_vsr_train.npy
  • clip_image_vsr_train.npy
  • clip_concat_vsr_train.npy
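
A quick sanity check after syncing can be sketched with a dummy cache round-trip; the (7680, 768) shape is an assumption based on ~7,680 VSR training examples and SBERT's 768-d output, not a documented fact:

```python
import os
import tempfile

import numpy as np

# Round-trip a dummy array to show the expected .npy save/load pattern.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sbert_vsr_train.npy")  # same name as the real cache
    np.save(path, np.zeros((7680, 768), dtype=np.float32))
    emb = np.load(path)

print(emb.shape, emb.dtype)  # (7680, 768) float32
```

For the real caches, replacing the temporary path with results/embeddings/ and checking that all four arrays share the same row count is a cheap way to catch a partial sync.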

Probing and Visualization (local)

Run notebooks 03 + 05 locally:

```
notebooks/03_probing_experiments.ipynb   # trains probes, saves CSVs
notebooks/05_visualization.ipynb         # all graphics (I like data viz)
```

Project Structure

```
spatial_probing/
├── README.md
├── requirements.txt
├── .env                         # gitignored — set CACHE_DIR here
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_embedding_extraction.ipynb    # run on Colab
│   ├── 03_probing_experiments.ipynb
│   └── 05_visualization.ipynb
│
├── src/
│   ├── datasets.py              # VSR loader
│   ├── embedders.py             # SBERT, CLIP text, image, concat embedders
│   └── probing.py               # logistic regression probe + CV
│
└── results/
    ├── embeddings/              # cached .npy files — gitignored
    └── figures/                 # output plots
```

References

  • Liu, F., Emerson, G., & Collier, N. (2022). Visual Spatial Reasoning. arXiv:2205.00363.

Author

Nora Gully — University of Colorado Boulder, CSCI 4622 Machine Learning, Spring 2026
