MuSEAgent enhances multimodal agent reasoning by leveraging fine-grained stateful experiences rather than coarse trajectory-level retrieval. The framework consists of two phases: (1) Experience Abstraction, which extracts state-level experiences via hindsight evaluation and builds multi-viewpoint embeddings for each experience; (2) Experience Exploitation, where the agent performs a deep-and-wide search over the experience bank to determine the next action at inference time.
Phase 1: Experience Abstraction
- Decompose historical trajectories into atomic state-action transitions;
- Evaluate each transition via a hindsight reasoning model (e.g., GPT-4o), producing a quality score and decision guidance;
- Retain high-quality transitions into a stateful experience bank, filtering out noisy or uninformative steps;
- Index each experience under multiple semantic viewpoints, i.e., different combinations of state components (e.g., Query+Image, Query+Task, Image+Task), so a single experience can be discovered through different contextual cues.
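The four steps above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the names `Transition`, `decompose`, `viewpoints`, `view_key`, and `build_bank` are hypothetical, the `judge` callable stands in for the hindsight reasoning model, and the `min_score` threshold is an assumed value. In the real system each view key would be embedded; here it is kept as a plain string.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Transition:
    """One atomic state-action step (field names are illustrative)."""
    state: dict          # e.g. {"query": ..., "image": ..., "task": ...}
    action: str
    score: float = 0.0   # quality score from hindsight evaluation
    guidance: str = ""   # decision guidance from hindsight evaluation


def decompose(trajectory):
    """Split a trajectory, given as (state, action) pairs, into transitions."""
    return [Transition(state=s, action=a) for s, a in trajectory]


def viewpoints(components, max_size=2):
    """Enumerate viewpoints as combinations of state components."""
    views = []
    for size in range(1, max_size + 1):
        views.extend("+".join(combo) for combo in combinations(components, size))
    return views


def view_key(transition, view):
    """Text that would be embedded for this transition under one viewpoint."""
    return " | ".join(str(transition.state[name]) for name in view.split("+"))


def build_bank(trajectories, judge, min_score=5.0,
               components=("query", "image", "task")):
    """Keep high-scoring transitions and index each under every viewpoint."""
    bank = {view: [] for view in viewpoints(components)}
    for trajectory in trajectories:
        for transition in decompose(trajectory):
            # Hindsight evaluation (stubbed): returns (score, guidance).
            transition.score, transition.guidance = judge(transition)
            if transition.score < min_score:
                continue  # drop noisy or uninformative steps
            for view, entries in bank.items():
                entries.append((view_key(transition, view), transition))
    return bank
```

Because every retained transition is appended under each viewpoint, the same experience can later be reached from a Query+Image lookup or a Query+Task lookup.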
Phase 2: Experience Exploitation via Deep-and-Wide Search
- Before each action, the agent selects a semantic viewpoint based on its current state and queries the experience bank through the corresponding embedding index;
- Wide Search retrieves Top-K diverse experiences under the selected viewpoint for broad coverage;
- Deep Search iterates across different viewpoints (e.g., first Query+Image, then Query+Task) to progressively gather complementary guidance;
- Retrieved decision guidance is injected into the agent's context to inform the next action.
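As a rough sketch of the search loop described above, the following uses plain cosine similarity over toy vectors. It is not the project's code: `cosine`, `wide_search`, and `deep_and_wide_search` are hypothetical names, the viewpoint order is passed in rather than chosen by the agent, and the wide step takes a plain Top-K without any diversity re-ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def wide_search(query_vec, index, top_k=3):
    """Return the top_k (score, experience_id) pairs under one viewpoint."""
    scored = [(cosine(query_vec, vec), exp_id) for exp_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def deep_and_wide_search(state_vecs, bank, view_order, top_k=3):
    """Visit viewpoints in order, collecting experiences not seen before."""
    seen, results = set(), []
    for view in view_order:  # deep: one iteration per viewpoint
        for score, exp_id in wide_search(state_vecs[view], bank[view], top_k):
            if exp_id not in seen:  # keep only complementary experiences
                seen.add(exp_id)
                results.append((view, exp_id, round(score, 3)))
    return results
```

The guidance attached to the returned experience IDs is what would be injected into the agent's context before it picks the next action.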
- State-level experiences consistently outperform trajectory-level baselines. On Qwen3-VL-32B, MuSEAgent reaches 65.30% average accuracy, surpassing the strongest trajectory-level baseline (Reflexion, 57.33%) by nearly 8 percentage points. Decomposing trajectories into granular state-action pairs mitigates the noise of monolithic trajectory retrieval.
- Stateful experiences benefit compact models most, while absolute performance scales with model size. MuSEAgent improves on the best baseline by about 8 percentage points on Qwen3-VL-32B, indicating that compact models with limited intrinsic reasoning capacity benefit most from fine-grained stateful experiences. Meanwhile, absolute performance continues to scale with model size.
- Scaling search depth and breadth consistently improves over the experience-free ReAct baseline. Even at minimal search scales, querying diverse semantic viewpoints provides effective state-aware guidance.
- Performance peaks at moderate search scales before slightly declining. Excessive historical experiences introduce redundant information into the context window, mildly diluting guidance for the current state. A balanced search scope optimizes experience utilization without overwhelming the model's instruction-following capacity.
MuSEAgent is equipped with 13 multimodal tools:
| Category | Tool | Description | Dependency |
|---|---|---|---|
| Vision | ocr | Extract text from images | PaddleOCR-VL model |
| Vision | localize_objects | Detect and localize objects | SAM3 model |
| Vision | zoom_in | Zoom into specific regions | - |
| Vision | crop | Crop image regions | - |
| Vision | visualize_regions | Draw bounding boxes on images | - |
| Depth | estimate_region_depth | Estimate depth of a region | Depth-Anything model |
| Depth | estimate_object_depth | Estimate depth of an object | Depth-Anything model |
| Similarity | get_image2images_similarity | Compare image-to-image similarity | CLIP model |
| Similarity | get_image2texts_similarity | Compare image-to-text similarity | CLIP model |
| Similarity | get_text2images_similarity | Compare text-to-image similarity | CLIP model |
| Utility | calculator | Arithmetic operations | - |
| Utility | solve_math_equation | Solve equations via Wolfram Alpha | API key |
| Utility | web_search | Web search via Google | API key |
Start API services for the VLM and embedding model:
Vision-Language Model (Required)
vllm serve Qwen/Qwen3-VL-32B-Instruct \
    --port 8000 \
    --dtype auto
# → http://localhost:8000/v1
Embedding Model (For retrieval)
vllm serve Qwen/Qwen3-VL-Embedding-8B \
    --runner pooling \
    --port 8001 \
    --dtype float16 \
    --trust-remote-code \
    --served-model-name qwen3vl-embed \
    --allowed-local-media-path /
# → http://localhost:8001/v1
Some tools require local models. Edit tool/model_config.py and set the paths to your downloaded models:
# tool/model_config.py
CLIP_VERSION = "ViT-H-14-378-quickgelu"
CLIP_PRETRAINED = "/path/to/your/clip/model.bin" # for similarity tools
DEPTH_ANYTHING_PATH = "/path/to/your/depth-anything" # for depth tools
PADDLE_OCR_VL_MODEL_PATH = "/path/to/your/paddleocr" # for ocr tool
SAM3_MODEL_PATH = "/path/to/your/sam3" # for localize_objects tool

| Model | Required By | Download |
|---|---|---|
| CLIP (ViT-H-14-378) | get_image2*_similarity, get_text2images_similarity | laion/CLIP-ViT-H-14-378-quickgelu |
| Depth-Anything-V2 | estimate_region_depth, estimate_object_depth | depth-anything/Depth-Anything-V2-Large-hf |
| PaddleOCR-VL | ocr | PaddlePaddle/PaddleOCR-VL-1.5 |
| SAM3 | localize_objects | facebook/sam3 |
Edit api/utils/keys.env for tools that require external APIs:
# api/utils/keys.env
GOOGLE_SEARCH_API_KEY=your-serper-api-key # web_search tool
WOLFRAM_ALPHA_API_KEYS=your-key-1,your-key-2 # solve_math_equation tool
- Get your Google Search API key from https://serper.dev/
- Get your Wolfram Alpha API key from https://products.wolframalpha.com/api/
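For orientation, here is one way a file in this format could be read, with comma-separated values split into a list of keys. This is only a sketch under assumptions: `parse_keys_env` is a hypothetical helper, not the loader the repository uses, and it assumes that everything after a `#` is a comment.

```python
def parse_keys_env(text):
    """Parse NAME=value lines; comma-separated values become a list of keys."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        values[name.strip()] = [key.strip() for key in value.split(",") if key.strip()]
    return values
```

Calling `parse_keys_env(open("api/utils/keys.env").read())` on the example above would give a one-element list for `GOOGLE_SEARCH_API_KEY` and a two-element list for `WOLFRAM_ALPHA_API_KEYS`.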
# 1. Install uv (package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and enter repo
git clone https://github.com/DeepExperience/MuSEAgent.git
cd MuSEAgent
# 3. Create environment and install all dependencies
uv venv && source .venv/bin/activate
uv sync --all-extras
Organize your data in the following format:
data/
└── <dataset_name>/
    ├── exploration/          # For collecting experiences
    │   ├── dataset.jsonl
    │   └── images/
    └── evaluation/           # For final evaluation
        ├── dataset.jsonl
        └── images/
Each line in dataset.jsonl:
{"id": "001", "question": "What color is the umbrella?", "images": ["img_001.jpg"], "answer": "red"}
🤗 Our evaluation datasets are available at ShijianW01/MuSEAgent-Eval.
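Before running the agent, it can help to sanity-check a dataset.jsonl file against the record format shown above. The snippet below is a small sketch, not part of the repository: `parse_record` and `load_dataset` are hypothetical helpers, and treating all four fields as required is an assumption based on the single example line.

```python
import json

# Assumption: every record carries the four fields seen in the example line.
REQUIRED_FIELDS = ("id", "question", "images", "answer")


def parse_record(line):
    """Parse one JSONL line and check it has the expected fields."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"record {record.get('id', '?')} is missing fields: {missing}")
    if not isinstance(record["images"], list):
        raise ValueError(f"record {record['id']}: 'images' must be a list of file names")
    return record


def load_dataset(path):
    """Load every non-empty line of a dataset.jsonl file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_record(line) for line in handle if line.strip()]
```

For example, `load_dataset("data/mmstar/exploration/dataset.jsonl")` would return a list of dictionaries or raise on the first malformed record.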
Generate trajectories on the exploration set using ReAct mode:
python run_react.py
Configure data paths and agent settings in run_react.py:
dataset_jsonl = "data/mmstar/exploration/dataset.jsonl"
dataset_image_dir = "data/mmstar/exploration/images"
agent_config = {
    "model_name": "Qwen/Qwen3-VL-32B-Instruct",
    "base_url": "http://localhost:8000/v1",
    "api_keys": ["your-api-key"],  # or dummy key(s)
    "tool_bank": [
        "ocr", "localize_objects", "zoom_in", "crop", "visualize_regions",
        "estimate_region_depth", "estimate_object_depth",
        "get_image2images_similarity", "get_image2texts_similarity",
        "get_text2images_similarity",
        "calculator", "solve_math_equation", "web_search",
    ],
    "memory_dir": "memory/exploration",  # Trajectories saved here
    "max_iterations": 20,
}
Extract state-level experiences and compute multi-view embeddings. We provide two scripts depending on your LLM setup:
Option A: vLLM / OpenAI-compatible endpoint (uses the openai SDK)
python scripts/build_state_bank_vllm.py \
    --memory_dir memory/exploration \
    --llm_model Qwen/Qwen3-VL-32B-Instruct \
    --llm_base_url http://localhost:8000/v1 \
    --llm_api_key your-api-key \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1 \
    --bank_dir_name state_bank
Option B: Azure / AWS / other cloud APIs (uses httpx directly, no SDK dependency)
# OpenAI API
python scripts/build_state_bank_aws.py \
    --memory_dir memory/exploration \
    --llm_api_key sk-xxx \
    --llm_base_url https://api.openai.com/v1/chat/completions \
    --llm_model gpt-4o \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1
# Azure OpenAI (auto-detected from URL, uses api-key header)
python scripts/build_state_bank_aws.py \
    --memory_dir memory/exploration \
    --llm_api_key YOUR_AZURE_KEY \
    --llm_base_url "https://YOUR_RESOURCE.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-12-01-preview" \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1
Both scripts produce identical output. Option B auto-detects Azure endpoints (from "azure" or "api-version" in the URL) and switches to api-key header auth; all other URLs use standard Bearer token auth.
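The header selection rule for Option B can be pictured as follows. This is a sketch of the described behavior, not the script's actual code: `build_auth_headers` is a hypothetical name, and the case-insensitive match is an assumption.

```python
def build_auth_headers(url, api_key):
    """Pick the auth header from the endpoint URL, per the rule described above."""
    lowered = url.lower()  # assumption: detection ignores letter case
    if "azure" in lowered or "api-version" in lowered:
        # Azure OpenAI endpoints expect the key in an 'api-key' header.
        return {"api-key": api_key}
    # Every other endpoint gets standard Bearer token auth.
    return {"Authorization": f"Bearer {api_key}"}
```

With the two example URLs above, the OpenAI endpoint would receive an `Authorization: Bearer sk-xxx` header and the Azure endpoint an `api-key: YOUR_AZURE_KEY` header.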
This produces memory/exploration/state_bank/ containing:
- state_meta.json: experience metadata with Q-values and guidance
- views/: multi-view embedding indices (question, task, images, observations, and their compositions)
Run full MuSEAgent on the evaluation set:
python run_muse_agent.py
Configure evaluation paths and retrieval settings in run_muse_agent.py:
dataset_jsonl = "data/mmstar/evaluation/dataset.jsonl"
dataset_image_dir = "data/mmstar/evaluation/images"
agent_config = {
    "model_name": "Qwen/Qwen3-VL-32B-Instruct",
    "base_url": "http://localhost:8000/v1",
    "api_keys": ["your-api-key"],
    "tool_bank": [
        "ocr", "localize_objects", "zoom_in", "crop", "visualize_regions",
        "estimate_region_depth", "estimate_object_depth",
        "get_image2images_similarity", "get_image2texts_similarity",
        "get_text2images_similarity",
        "calculator", "solve_math_equation", "web_search",
    ],
    "memory_dir": "memory/evaluation",
    "max_iterations": 20,
    "retrieval": {
        "enable": True,
        "bank_memory_dir": "memory/exploration",
        "bank_dir_name": "state_bank",
        "embedding_model": "Qwen/Qwen3-VL-Embedding-8B",
        "embedding_base_url": "http://localhost:8001/v1",
        "min_score": 0.1,
        "min_q_value": 5,
        "experience_top_n": 3,
        "max_epoch": 3,
    },
}
Each run produces two types of outputs:
Evaluation results are saved to the output_dir specified in the Runner (e.g., results/muse_agent/):
results/muse_agent/
├── results.jsonl    # Per-sample predictions and correctness
└── stats.json       # Aggregated accuracy (overall, by type, by subtask)
Trajectory memory is saved to the memory_dir specified in agent_config (e.g., memory/exploration/). Each task gets its own folder with a full interaction trace and copied input images:
memory/exploration/
└── tasks/
    ├── 00001/
    │   ├── trace.json    # Full interaction trace (states, actions, observations)
    │   ├── img_0.jpg     # Copied input image
    │   └── img_1.jpg     # Tool-generated image (e.g., crop, zoom_in)
    ├── 00002/
    │   ├── trace.json
    │   ├── img_0.jpg
    │   └── img_1.jpg
    └── ...
After building the experience bank (Step 3), the state bank is saved under the memory directory:
memory/exploration/
├── tasks/                    # Raw trajectories (from Step 2)
└── state_bank/               # Experience bank (from Step 3)
    ├── state_meta.json       # Experience metadata with Q-values and guidance
    └── views/                # Multi-view embedding indices
        ├── question.npy
        ├── question_mask.npy
        ├── question+images.npy
        ├── question+images_mask.npy
        └── ...               # One .npy + mask per semantic viewpoint
| Mode | Script | Tools | Retrieval | Description |
|---|---|---|---|---|
| Vanilla CoT | run_vanilla_cot.py | No | No | Direct VLM reasoning without tools |
| ReAct | run_react.py | Yes | No | Tool-augmented reasoning loop |
| MuSEAgent | run_muse_agent.py | Yes | Yes | Full system with stateful experience search |
MuSEAgent/
├── agent/
│   ├── base_agent.py               # Base agent with ReAct loop and tool dispatch
│   └── mm_agent.py                 # MuSE Agent with stateful experience search
├── api/
│   ├── wrapper.py                  # Async OpenAI-compatible LLM wrapper
│   ├── async_pool.py               # Multi-key API pool with load balancing
│   ├── vision_utils.py             # Multimodal message builder (image encoding)
│   └── utils/keys.env              # API keys for web_search, solve_math_equation
├── tool/                           # 13 multimodal tools (see Tool Bank above)
│   ├── base_tool.py                # Abstract tool interface
│   ├── model_cache.py              # Shared model loading with GPU management
│   ├── model_config.py             # Local model paths (CLIP, Depth, OCR, SAM3)
│   └── ...
├── mm_memory/
│   ├── memory.py                   # Trajectory memory (trace recording and persistence)
│   ├── state_bank.py               # State bank (multi-view indexing and retrieval)
│   └── embedder.py                 # Multimodal embedding client (vLLM endpoint)
├── runner/
│   └── runner.py                   # Batch evaluation runner (async, with scoring)
├── template/
│   └── prompt_en.jinja2            # Jinja2 system prompt template
├── scripts/
│   ├── build_state_bank_vllm.py    # Build experience bank via vLLM / OpenAI SDK
│   └── build_state_bank_aws.py     # Build experience bank via Azure / AWS / httpx
├── config.py                       # Dataclass-based configuration (Agent, API, Retrieval)
├── run_vanilla_cot.py              # Entry: Vanilla CoT mode
├── run_react.py                    # Entry: ReAct mode
└── run_muse_agent.py               # Entry: Full MuSE Agent mode
This project is released under the MIT License.
For questions or feedback, please reach out at shijian@seu.edu.cn or open an issue on GitHub.



