Small, targeted probes that isolate what a multimodal model actually perceives — beyond aggregate VQA scores.
- Overview
- Architecture
- Installation
- Quick Start
- Probes
- Adding a New Probe
- Benchmarks
- Citation
- License
Most multimodal benchmarks bundle perception, reasoning, language, and world knowledge into a single accuracy number. When a model gets 62%, you have no idea why. MM-Probe-Suite collects a growing set of minimal probes — each focused on a single, controllable aspect — so you can decompose model behavior into something diagnosable.
Each probe is a small generator that produces images and questions where the ground truth is controlled by construction, plus an evaluator that decides whether the model's natural-language answer is correct.
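To make "ground truth controlled by construction" concrete, here is a minimal sketch of a counting-style generator. It picks the answer first and then builds a scene to match it. `ToySample`, `generate_counting`, and the 224-pixel canvas are hypothetical names and values chosen for illustration; they are not mmprobe's actual classes, and rendering the scene to an image is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class ToySample:
    """Hypothetical stand-in for a probe sample (not mmprobe's Sample)."""
    scene: list[tuple[str, int, int]]  # (shape, x, y) placements to render
    prompt: str
    answer: str


def generate_counting(n: int, seed: int, max_count: int = 10):
    """Yield n samples whose answer is fixed before any image exists."""
    rng = random.Random(seed)  # seeded, so the same seed gives the same set
    shapes = ["circle", "square", "triangle"]
    for _ in range(n):
        target = rng.choice(shapes)
        count = rng.randint(1, max_count)  # the ground truth, chosen first
        distractors = [s for s in shapes if s != target]
        scene = [(target, rng.randrange(224), rng.randrange(224))
                 for _ in range(count)]
        # Clutter: a few non-target shapes that must not change the answer.
        scene += [(rng.choice(distractors), rng.randrange(224), rng.randrange(224))
                  for _ in range(rng.randint(0, 3))]
        rng.shuffle(scene)
        yield ToySample(scene, f"How many {target}s are in the image?", str(count))
```

Because the count is drawn before the scene is built, no annotation step is needed, and a fixed seed reproduces the same sample set.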
We currently ship eight probes covering perception, binding, spatial reasoning, and reading. They are designed to be cheap to run (under 1000 samples each) so you can sweep models without burning a week on inference.
```text
              ┌──────────────────┐
              │   Probe config   │ (yaml)
              └────────┬─────────┘
                       │
        ┌──────────────┴──────────────┐
        ▼                             ▼
┌───────────────┐             ┌──────────────────┐
│   Generator   │── images ──▶│   ModelRunner    │── answers ──▶ Evaluator
│  (synthetic)  │  + prompts  │  (HF / OpenAI)   │
└───────────────┘             └──────────────────┘                   │
                                                                     ▼
                                                              ┌──────────────┐
                                                              │ JSONL report │
                                                              └──────────────┘
```
Each probe is independently runnable; runners and evaluators are interchangeable.
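The flow in the diagram can be sketched as a loop in which the runner and the scoring function are both passed in, which is what makes them swappable. This is an illustrative sketch only: `Runner`, `EchoRunner`, `run_probe`, and the JSONL field names are assumptions made for the example, not mmprobe's real interfaces or report schema.

```python
import io
import json
from typing import Callable, Iterable, Protocol, TextIO, Tuple


class Runner(Protocol):
    """Anything that maps (image, prompt) to a text answer."""
    def answer(self, image: object, prompt: str) -> str: ...


class EchoRunner:
    """Toy runner that always answers 'yes'; stands in for a real model."""
    def answer(self, image: object, prompt: str) -> str:
        return "yes"


def run_probe(
    samples: Iterable[Tuple[object, str, str]],  # (image, prompt, ground truth)
    runner: Runner,
    score: Callable[[str, str], bool],
    out: TextIO,
) -> float:
    """Query the runner, score each reply, and write one JSON object per line."""
    correct = 0
    total = 0
    for image, prompt, truth in samples:
        reply = runner.answer(image, prompt)
        ok = score(truth, reply)
        record = {"prompt": prompt, "truth": truth, "answer": reply, "correct": ok}
        out.write(json.dumps(record) + "\n")
        correct += int(ok)
        total += 1
    return correct / total if total else 0.0


samples = [(None, "Is there a circle?", "yes"), (None, "Is there a square?", "no")]
buf = io.StringIO()
acc = run_probe(samples, EchoRunner(), lambda t, a: t == a.strip().lower(), buf)
```

Swapping in a different backend only requires another object with an `answer` method; the loop and the report format stay the same.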
```bash
git clone https://github.com/pulgog/mm-probe-suite
cd mm-probe-suite
pip install -e .
```

For development:

```bash
pip install -e ".[dev]"
pre-commit install  # optional
```

Run a single probe against a local Hugging Face model:
```python
from mmprobe import load_probe
from mmprobe.runners import HFModelRunner

probe = load_probe("counting")
runner = HFModelRunner("llava-hf/llava-1.5-7b-hf", device="cuda")
report = probe.evaluate(runner, n=200)

print(f"counting accuracy: {report.accuracy:.3f}")
print(f"per-bucket: {report.per_bucket}")
```

Or from the command line:
```bash
python -m mmprobe.cli run \
    --probe counting \
    --model llava-hf/llava-1.5-7b-hf \
    --n 200 \
    --out reports/llava-counting.jsonl
```

| Probe | What it tests | # configs |
|---|---|---|
| `counting` | counts of 1-10 simple shapes with controlled clutter | 24 |
| `color_binding` | "which shape is red?" with multiple colored distractors | 18 |
| `spatial_rel` | above / below / left / right between two objects | 16 |
| `size_compare` | which of two objects is larger | 8 |
| `text_read` | synthetic single-word OCR on textured backgrounds | 12 |
| `temporal_order` | 2-frame side-by-side, "which happened first" | 10 |
| `presence` | is object X present, yes/no | 14 |
| `negation` | "is there NO X", easy to fail by ignoring negation | 9 |
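For the yes/no probes (`presence` and `negation`), the interesting part of evaluation is that negating the question flips the expected answer. The sketch below shows one way that scoring could work. `parse_yes_no` and `score_presence` are hypothetical helpers written for this example, not functions from mmprobe, and the first-token heuristic is a deliberate simplification.

```python
import re

_YES = {"yes", "yeah", "yep"}
_NO = {"no", "nope"}


def parse_yes_no(reply: str):
    """Return True/False for the first yes/no-like token, or None if absent."""
    for token in re.findall(r"[a-z]+", reply.lower()):
        if token in _YES:
            return True
        if token in _NO:
            return False
    return None


def score_presence(present: bool, reply: str, negated: bool = False) -> bool:
    """Score a reply to 'is X present?' (or 'is there NO X?' when negated)."""
    verdict = parse_yes_no(reply)
    if verdict is None:
        return False  # unparseable replies count as wrong
    expected = present != negated  # negation flips the expected answer
    return verdict == expected
```

A model that ignores the negation answers "yes" whenever X is visible, so `score_presence(True, "Yes.", negated=True)` is `False`. Replies that lead with a hedge or bury the verdict mid-sentence would need a more careful parser than this one.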
A probe is just a class that implements two methods. See
docs/adding_probes.md for a tutorial.
```python
from mmprobe.base import Probe, Sample


class MyProbe(Probe):
    name = "myprobe"

    def generate(self, n, seed):
        # yield Sample(image=..., prompt=..., answer=...)
        ...

    def score(self, sample, model_answer):
        # Case-insensitive substring match. Note that this is lenient:
        # a ground truth of "1" also matches a reply containing "10".
        return sample.answer.lower() in model_answer.lower()
```

Preliminary results on a few open multimodal models (April 2025 snapshot):
| Model | counting | color | spatial | text | avg |
|---|---|---|---|---|---|
| LLaVA-1.5-7B | 0.42 | 0.71 | 0.55 | 0.18 | 0.46 |
| LLaVA-1.5-13B | 0.48 | 0.76 | 0.61 | 0.22 | 0.52 |
| Qwen-VL-Chat | 0.51 | 0.74 | 0.58 | 0.34 | 0.54 |
| InstructBLIP-7B | 0.37 | 0.63 | 0.49 | 0.12 | 0.40 |
Caveat: numbers will shift as probe configs evolve. Re-run with the version tag pinned for any comparison you publish.
```bibtex
@software{linzihan_mmprobe_2024,
  author = {Lin, Zihan},
  title  = {MM-Probe-Suite: Probing Tasks for Multimodal LLMs},
  year   = {2024},
  url    = {https://github.com/pulgog/mm-probe-suite}
}
```

Apache License 2.0. See LICENSE.