A Multimodal Reasoning Agent with Stateful Experiences

If you find this project helpful, please give us a star ⭐

🎬 Demo

MuSEAgent_demo.mp4

💡 Overview

MuSEAgent enhances multimodal agent reasoning by leveraging fine-grained stateful experiences rather than coarse trajectory-level retrieval. The framework consists of two phases: (1) Experience Abstraction, which extracts state-level experiences via hindsight evaluation and builds multi-viewpoint embeddings for each experience; (2) Experience Exploitation, where the agent performs a deep-and-wide search over the experience bank to determine the next action at inference time.

Phase 1: Experience Abstraction

Decompose historical trajectories into atomic state-action transitions;
Evaluate each transition via a hindsight reasoning model (e.g., GPT-4o), producing a quality score and decision guidance;
Retain high-quality transitions into a stateful experience bank, filtering out noisy or uninformative steps;
Index each experience under multiple semantic viewpoints — different combinations of state components (e.g., Query+Image, Query+Task, Image+Task) — so a single experience can be discovered through different contextual cues.
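The four steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the `Transition` and `Experience` classes, the `evaluate` callback (standing in for the hindsight reasoning model), and the `min_score` threshold are all hypothetical names, and the embedding step is left out so that only the decompose, evaluate, filter, and index logic is shown.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, Dict, List, Tuple


@dataclass
class Transition:
    """One atomic state-action step cut out of a trajectory (hypothetical schema)."""
    state: Dict[str, str]  # state components, e.g. {"query": ..., "image": ..., "task": ...}
    action: str


@dataclass
class Experience:
    """A retained transition plus the evaluator's verdict (hypothetical schema)."""
    transition: Transition
    score: float           # quality score from the hindsight evaluator
    guidance: str          # decision guidance from the hindsight evaluator
    viewpoints: List[Tuple[str, ...]] = field(default_factory=list)


def build_experience_bank(
    trajectories: List[List[Transition]],
    evaluate: Callable[[Transition], Tuple[float, str]],
    min_score: float,
) -> List[Experience]:
    """Decompose, evaluate, filter, and tag each transition with its viewpoints."""
    bank: List[Experience] = []
    for trajectory in trajectories:
        for transition in trajectory:
            score, guidance = evaluate(transition)
            if score < min_score:
                continue  # drop noisy or uninformative steps
            # Viewpoints are modelled here as pairs of state components.
            keys = sorted(transition.state)
            views = list(combinations(keys, 2))
            bank.append(Experience(transition, score, guidance, views))
    return bank
```

In a real pipeline each viewpoint would also get its own embedding; here the viewpoint is only recorded as a tuple of component names.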

Phase 2: Experience Exploitation via Deep-and-Wide Search

Before each action, the agent selects a semantic viewpoint based on its current state and queries the experience bank through the corresponding embedding index;
Wide Search retrieves Top-K diverse experiences under the selected viewpoint for broad coverage;
Deep Search iterates across different viewpoints (e.g., first Query+Image, then Query+Task) to progressively gather complementary guidance;
Retrieved decision guidance is injected into the agent's context to inform the next action.
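As a rough sketch of the search loop described above, assuming each viewpoint has its own index mapping experience IDs to embedding vectors: wide search ranks one index by cosine similarity, and deep search repeats that over several viewpoints while skipping experiences already collected. The function names and the viewpoint keys are hypothetical. Two simplifications are deliberate: the viewpoint order is passed in as a fixed list rather than chosen by the agent, and no diversity constraint is applied to the Top-K.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def wide_search(query_vec: Vector, index: Dict[str, Vector], top_k: int) -> List[Tuple[str, float]]:
    """Return the top_k (experience_id, similarity) pairs under one viewpoint."""
    scored = [(exp_id, cosine(query_vec, vec)) for exp_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def deep_and_wide_search(
    query_vecs: Dict[str, Vector],
    bank: Dict[str, Dict[str, Vector]],
    viewpoints: List[str],
    top_k: int,
) -> List[str]:
    """Run a wide search per viewpoint, in order, and merge results without duplicates."""
    seen = set()
    ordered: List[str] = []
    for view in viewpoints:
        for exp_id, _score in wide_search(query_vecs[view], bank[view], top_k):
            if exp_id not in seen:
                seen.add(exp_id)
                ordered.append(exp_id)
    return ordered
```

The guidance attached to the returned experience IDs is what would then be injected into the agent's context.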

🌈 Check the case for better understanding!

📊 Key Results

Overall Performance

State-level experiences consistently outperform trajectory-level baselines. On Qwen3-VL-32B, MuSEAgent reaches 65.30% average accuracy, surpassing the strongest trajectory-level baseline (Reflexion, 57.33%) by nearly 8 percentage points. Decomposing trajectories into granular state-action pairs mitigates the noise of monolithic trajectory retrieval.
Stateful experiences benefit compact models most, while absolute performance scales with model size. MuSEAgent improves on the best baseline by about 8 percentage points on Qwen3-VL-32B, which suggests that compact models with limited intrinsic reasoning capacity gain the most from fine-grained stateful experiences. Absolute accuracy still rises with model size.
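The gap over Reflexion quoted above is an absolute difference in accuracy. For readers comparing against papers that report relative gains, the short calculation below works out both figures from the two numbers given in this section; nothing else is assumed.

```python
# Accuracies on Qwen3-VL-32B, as reported above (in percent).
muse_agent_acc = 65.30
reflexion_acc = 57.33

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain = muse_agent_acc - reflexion_acc
relative_gain = absolute_gain / reflexion_acc * 100

print(f"{absolute_gain:.2f} points absolute, {relative_gain:.1f}% relative")
```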

Deep-and-Wide Search

Scaling search depth and breadth consistently improves over the experience-free ReAct baseline. Even at minimal search scales, querying diverse semantic viewpoints provides effective state-aware guidance.
Performance peaks at moderate search scales before slightly declining. Excessive historical experiences introduce redundant information into the context window, mildly diluting guidance for the current state. A balanced search scope optimizes experience utilization without overwhelming the model's instruction-following capacity.

🛠️ Tool Bank

MuSEAgent is equipped with 13 multimodal tools:

Category	Tool	Description	Dependency
Vision	`ocr`	Extract text from images	PaddleOCR-VL model
	`localize_objects`	Detect and localize objects	SAM3 model
	`zoom_in`	Zoom into specific regions	—
	`crop`	Crop image regions	—
	`visualize_regions`	Draw bounding boxes on images	—
Depth	`estimate_region_depth`	Estimate depth of a region	Depth-Anything model
	`estimate_object_depth`	Estimate depth of an object	Depth-Anything model
Similarity	`get_image2images_similarity`	Compare image-to-image similarity	CLIP model
	`get_image2texts_similarity`	Compare image-to-text similarity	CLIP model
	`get_text2images_similarity`	Compare text-to-image similarity	CLIP model
Utility	`calculator`	Arithmetic operations	—
	`solve_math_equation`	Solve equations via Wolfram Alpha	API key
	`web_search`	Web search via Google	API key

⚙️ Configuration

1. Model Services

Start API services for the VLM and embedding model:

Vision-Language Model (Required)

vllm serve Qwen/Qwen3-VL-32B-Instruct \
    --port 8000 \
    --dtype auto
# → http://localhost:8000/v1

Embedding Model (For retrieval)

vllm serve Qwen/Qwen3-VL-Embedding-8B \
    --runner pooling \
    --port 8001 \
    --dtype float16 \
    --trust-remote-code \
    --served-model-name qwen3vl-embed \
    --allowed-local-media-path /
# → http://localhost:8001/v1
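Before moving on, it can help to confirm that both services are reachable and to see which model ID each one reports. The sketch below queries the OpenAI-compatible `/v1/models` route using only the standard library. The helper names are hypothetical and are not part of this repository.

```python
import json
import urllib.request
from typing import List


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a base URL such as http://localhost:8000/v1."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: str) -> List[str]:
    """Extract model IDs from an OpenAI-style model list response body."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> List[str]:
    """Ask a running server which model IDs it serves (raises if unreachable)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))


# Example usage once both servers are running:
#   print(list_served_models("http://localhost:8000/v1"))
#   print(list_served_models("http://localhost:8001/v1"))
```

Because the embedding server above is started with `--served-model-name qwen3vl-embed`, that is the ID it should report; it is worth checking that this matches the model name your client sends.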

2. Tool Models

Some tools require local models. Edit tool/model_config.py and set the paths to your downloaded models:

# tool/model_config.py

CLIP_VERSION = "ViT-H-14-378-quickgelu"
CLIP_PRETRAINED = "/path/to/your/clip/model.bin"        # for similarity tools

DEPTH_ANYTHING_PATH = "/path/to/your/depth-anything"     # for depth tools

PADDLE_OCR_VL_MODEL_PATH = "/path/to/your/paddleocr"     # for ocr tool

SAM3_MODEL_PATH = "/path/to/your/sam3"                    # for localize_objects tool

Model	Required By	Download
CLIP (ViT-H-14-378)	`get_image2*_similarity`, `get_text2images_similarity`	laion/CLIP-ViT-H-14-378-quickgelu
Depth-Anything-V2	`estimate_region_depth`, `estimate_object_depth`	depth-anything/Depth-Anything-V2-Large-hf
PaddleOCR-VL	`ocr`	PaddlePaddle/PaddleOCR-VL-1.5
SAM3	`localize_objects`	facebook/sam3
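A quick sanity check for the paths above might look like the sketch below, which reports any configured path that does not exist on disk. `missing_model_paths` is a hypothetical helper, and the commented import of `tool.model_config` is an assumption about how the module can be imported from the repository root.

```python
from pathlib import Path
from typing import Dict, List


def missing_model_paths(paths: Dict[str, str]) -> List[str]:
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, location in paths.items() if not Path(location).exists()]


# Example usage (assumed import path, run from the repository root):
#   from tool import model_config
#   print(missing_model_paths({
#       "CLIP_PRETRAINED": model_config.CLIP_PRETRAINED,
#       "DEPTH_ANYTHING_PATH": model_config.DEPTH_ANYTHING_PATH,
#       "PADDLE_OCR_VL_MODEL_PATH": model_config.PADDLE_OCR_VL_MODEL_PATH,
#       "SAM3_MODEL_PATH": model_config.SAM3_MODEL_PATH,
#   }))
```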

3. Tool API Keys

Edit api/utils/keys.env for tools that require external APIs:

# api/utils/keys.env

GOOGLE_SEARCH_API_KEY=your-serper-api-key        # web_search tool
WOLFRAM_ALPHA_API_KEYS=your-key-1,your-key-2     # solve_math_equation tool

Get your Google Search API key from https://serper.dev/ (the web_search tool expects a Serper key)
Get your Wolfram Alpha API key from https://products.wolframalpha.com/api/
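To illustrate the file format, here is a small parser for a `keys.env`-style file that strips trailing comments and splits the comma-separated Wolfram Alpha keys. It is only a sketch: how the repository actually loads this file is not shown here, and `parse_env` and `split_keys` are hypothetical names. One limitation to note is that a `#` inside a value would be treated as the start of a comment.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, ignoring blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def split_keys(value: str) -> List[str]:
    """Split a comma-separated key list, dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

For the example file above, `split_keys(parse_env(text)["WOLFRAM_ALPHA_API_KEYS"])` yields the two placeholder keys as a list.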

🔧 Installation

# 1. Install uv (package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and enter repo
git clone https://github.com/DeepExperience/MuSEAgent.git
cd MuSEAgent

# 3. Create environment and install all dependencies
uv venv && source .venv/bin/activate
uv sync --all-extras

🏃 Quick Start

Step 1 — Prepare Data

Organize your data in the following format:

data/
└── <dataset_name>/
    ├── exploration/             # For collecting experiences
    │   ├── dataset.jsonl
    │   └── images/
    └── evaluation/              # For final evaluation
        ├── dataset.jsonl
        └── images/

Each line in dataset.jsonl:

{"id": "001", "question": "What color is the umbrella?", "images": ["img_001.jpg"], "answer": "red"}

🤗 Our evaluation datasets are available at ShijianW01/MuSEAgent-Eval.
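If you prepare your own data, a small check along these lines can catch malformed records before a long run. It assumes that all four fields in the example record (`id`, `question`, `images`, `answer`) are required and that `images` is a list of file names; the README only shows one example, so treat those rules as assumptions. `validate_record` is a hypothetical helper.

```python
import json
from typing import List

# Assumed to be required, based on the single example record above.
REQUIRED_FIELDS = ("id", "question", "images", "answer")


def validate_record(line: str) -> List[str]:
    """Return a list of problems found in one dataset.jsonl line (empty if none)."""
    record = json.loads(line)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    images = record.get("images")
    if images is not None:
        is_name_list = isinstance(images, list) and all(isinstance(i, str) for i in images)
        if not is_name_list:
            problems.append("images must be a list of file names")
    return problems
```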

Step 2 — Run Exploration

Generate trajectories on the exploration set using ReAct mode:

python run_react.py

Configure data paths and agent settings in run_react.py:

dataset_jsonl = "data/mmstar/exploration/dataset.jsonl"
dataset_image_dir = "data/mmstar/exploration/images"

agent_config = {
    "model_name": "Qwen/Qwen3-VL-32B-Instruct",
    "base_url": "http://localhost:8000/v1",
    "api_keys": ["your-api-key"], # or dummy key(s)
    "tool_bank": [
        "ocr", "localize_objects", "zoom_in", "crop", "visualize_regions",
        "estimate_region_depth", "estimate_object_depth",
        "get_image2images_similarity", "get_image2texts_similarity",
        "get_text2images_similarity",
        "calculator", "solve_math_equation", "web_search",
    ],
    "memory_dir": "memory/exploration",   # Trajectories saved here
    "max_iterations": 20,
}

Step 3 — Build Experience Bank

Extract state-level experiences and compute multi-view embeddings. We provide two scripts depending on your LLM setup:

Option A: vLLM / OpenAI-compatible endpoint (uses the openai SDK)

python scripts/build_state_bank_vllm.py \
    --memory_dir memory/exploration \
    --llm_model Qwen/Qwen3-VL-32B-Instruct \
    --llm_base_url http://localhost:8000/v1 \
    --llm_api_key your-api-key \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1 \
    --bank_dir_name state_bank

Option B: Azure / AWS / other cloud APIs (uses httpx directly, no SDK dependency)

# OpenAI API
python scripts/build_state_bank_aws.py \
    --memory_dir memory/exploration \
    --llm_api_key sk-xxx \
    --llm_base_url https://api.openai.com/v1/chat/completions \
    --llm_model gpt-4o \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1

# Azure OpenAI (auto-detected from URL, uses api-key header)
python scripts/build_state_bank_aws.py \
    --memory_dir memory/exploration \
    --llm_api_key YOUR_AZURE_KEY \
    --llm_base_url "https://YOUR_RESOURCE.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-12-01-preview" \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1

Both scripts produce identical output. Option B auto-detects Azure endpoints (from azure or api-version in the URL) and switches to api-key header auth; all other URLs use standard Bearer token auth.

This produces memory/exploration/state_bank/ containing:

state_meta.json — experience metadata with Q-values and guidance
views/ — multi-view embedding indices (question, task, images, observations, and their compositions)
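The Azure auto-detection described for Option B can be sketched as below: the URL is checked for `azure` or `api-version`, and the matching header style is returned. This mirrors the behaviour stated above but is not copied from `build_state_bank_aws.py`; the function names are hypothetical, and the case-insensitive match is an assumption.

```python
from typing import Dict


def is_azure_endpoint(url: str) -> bool:
    """Treat a URL as Azure if it mentions 'azure' or carries an 'api-version' parameter."""
    lowered = url.lower()  # case-insensitive matching is an assumption
    return "azure" in lowered or "api-version" in lowered


def auth_headers(url: str, api_key: str) -> Dict[str, str]:
    """Azure endpoints use an api-key header; everything else uses a Bearer token."""
    if is_azure_endpoint(url):
        return {"api-key": api_key}
    return {"Authorization": f"Bearer {api_key}"}
```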

Step 4 — Evaluate with Experience Search

Run full MuSEAgent on the evaluation set:

python run_muse_agent.py

Configure evaluation paths and retrieval settings in run_muse_agent.py:

dataset_jsonl = "data/mmstar/evaluation/dataset.jsonl"
dataset_image_dir = "data/mmstar/evaluation/images"

agent_config = {
    "model_name": "Qwen/Qwen3-VL-32B-Instruct",
    "base_url": "http://localhost:8000/v1",
    "api_keys": ["your-api-key"],
    "tool_bank": [
        "ocr", "localize_objects", "zoom_in", "crop", "visualize_regions",
        "estimate_region_depth", "estimate_object_depth",
        "get_image2images_similarity", "get_image2texts_similarity",
        "get_text2images_similarity",
        "calculator", "solve_math_equation", "web_search",
    ],
    "memory_dir": "memory/evaluation",
    "max_iterations": 20,

    "retrieval": {
        "enable": True,
        "bank_memory_dir": "memory/exploration",
        "bank_dir_name": "state_bank",
        "embedding_model": "Qwen/Qwen3-VL-Embedding-8B",
        "embedding_base_url": "http://localhost:8001/v1",
        "min_score": 0.1,
        "min_q_value": 5,
        "experience_top_n": 3,
        "max_epoch": 3,
    },
}
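The retrieval thresholds are not documented in detail here, so the sketch below shows one plausible reading rather than the actual behaviour: `min_score` as a floor on embedding similarity, `min_q_value` as a floor on the stored Q-value, and `experience_top_n` as the number of experiences kept per query. These semantics, the `score` and `q_value` field names, and the `filter_candidates` helper are all assumptions; check `config.py` and `mm_memory/state_bank.py` for the real logic.

```python
from typing import Dict, List


def filter_candidates(
    candidates: List[Dict],
    min_score: float,
    min_q_value: float,
    top_n: int,
) -> List[Dict]:
    """Keep candidates above both thresholds, then return the top_n by similarity."""
    kept = [
        c for c in candidates
        if c["score"] >= min_score and c["q_value"] >= min_q_value
    ]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:top_n]
```

Under this reading, raising `min_q_value` trades coverage for higher-quality guidance, while `experience_top_n` bounds how much retrieved text enters the context.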

Output Structure

Each run produces two types of outputs:

Evaluation results are saved to the output_dir specified in the Runner (e.g., results/muse_agent/):

results/muse_agent/
├── results.jsonl         # Per-sample predictions and correctness
└── stats.json            # Aggregated accuracy (overall, by type, by subtask)

Trajectory memory is saved to the memory_dir specified in agent_config (e.g., memory/exploration/). Each task gets its own folder with a full interaction trace and copied input images:

memory/exploration/
└── tasks/
    ├── 00001/
    │   ├── trace.json    # Full interaction trace (states, actions, observations)
    │   ├── img_0.jpg     # Copied input image
    │   └── img_1.jpg     # Tool-generated image (e.g., crop, zoom_in)
    ├── 00002/
    │   ├── trace.json
    │   ├── img_0.jpg
    │   └── img_1.jpg
    └── ...

After building the experience bank (Step 3), the state bank is saved under the memory directory:

memory/exploration/
├── tasks/                # Raw trajectories (from Step 2)
└── state_bank/           # Experience bank (from Step 3)
    ├── state_meta.json   # Experience metadata with Q-values and guidance
    └── views/            # Multi-view embedding indices
        ├── question.npy
        ├── question_mask.npy
        ├── question+images.npy
        ├── question+images_mask.npy
        └── ...           # One .npy + mask per semantic viewpoint
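To see which semantic viewpoints a built bank contains, the file names in `views/` can be listed directly. The sketch below relies only on the naming shown in the tree above (one `<viewpoint>.npy` plus one `<viewpoint>_mask.npy` per viewpoint); `list_viewpoints` is a hypothetical helper and does not load the arrays themselves.

```python
from pathlib import Path
from typing import List


def list_viewpoints(views_dir: str) -> List[str]:
    """Return viewpoint names found in views_dir, skipping the *_mask.npy files."""
    names = [
        path.stem
        for path in Path(views_dir).glob("*.npy")
        if not path.stem.endswith("_mask")
    ]
    return sorted(names)


# Example usage:
#   print(list_viewpoints("memory/exploration/state_bank/views"))
```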

📖 Agent Modes

Mode	Script	Tools	Retrieval	Description
Vanilla CoT	`run_vanilla_cot.py`	—	—	Direct VLM reasoning without tools
ReAct	`run_react.py`	✅	—	Tool-augmented reasoning loop
MuSEAgent	`run_muse_agent.py`	✅	✅	Full system with stateful experience search

📂 Project Structure

MuSEAgent/
├── agent/
│   ├── base_agent.py              # Base agent with ReAct loop and tool dispatch
│   └── mm_agent.py                # MuSE Agent with stateful experience search
├── api/
│   ├── wrapper.py                 # Async OpenAI-compatible LLM wrapper
│   ├── async_pool.py              # Multi-key API pool with load balancing
│   ├── vision_utils.py            # Multimodal message builder (image encoding)
│   └── utils/keys.env             #   ← API keys for web_search, solve_math_equation
├── tool/                          # 13 multimodal tools (see Tool Bank above)
│   ├── base_tool.py               # Abstract tool interface
│   ├── model_cache.py             # Shared model loading with GPU management
│   ├── model_config.py            #   ← Local model paths (CLIP, Depth, OCR, SAM3)
│   └── ...
├── mm_memory/
│   ├── memory.py                  # Trajectory memory (trace recording and persistence)
│   ├── state_bank.py              # State bank (multi-view indexing and retrieval)
│   └── embedder.py                # Multimodal embedding client (vLLM endpoint)
├── runner/
│   └── runner.py                  # Batch evaluation runner (async, with scoring)
├── template/
│   └── prompt_en.jinja2           # Jinja2 system prompt template
├── scripts/
│   ├── build_state_bank_vllm.py   # Build experience bank via vLLM / OpenAI SDK
│   └── build_state_bank_aws.py    # Build experience bank via Azure / AWS / httpx
├── config.py                      # Dataclass-based configuration (Agent, API, Retrieval)
├── run_vanilla_cot.py             # Entry: Vanilla CoT mode
├── run_react.py                   # Entry: ReAct mode
└── run_muse_agent.py              # Entry: Full MuSE Agent mode

📄 License

This project is released under the MIT License.

📞 Contact

For questions or feedback, please reach out at shijian@seu.edu.cn or open an issue on GitHub.
