Skip to content

DeepExperience/MuSEAgent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

21 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

MuSEAgent
A Multimodal Reasoning Agent with Stateful Experiences

Paper HF Paper HF Dataset Python 3.10+ License

If you find this project helpful, please give us a star ⭐
Typing Animation

🎬 Demo

MuSEAgent_demo.mp4

πŸ’‘ Overview

MuSEAgent enhances multimodal agent reasoning by leveraging fine-grained stateful experiences rather than coarse trajectory-level retrieval. The framework consists of two phases: (1) Experience Abstraction, which extracts state-level experiences via hindsight evaluation and builds multi-viewpoint embeddings for each experience; (2) Experience Exploitation, where the agent performs a deep-and-wide search over the experience bank to determine the next action at inference time.

Phase 1: Experience Abstraction

  • Decompose historical trajectories into atomic state-action transitions;
  • Evaluate each transition via a hindsight reasoning model (e.g., GPT-4o), producing a quality score and decision guidance;
  • Retain high-quality transitions into a stateful experience bank, filtering out noisy or uninformative steps;
  • Index each experience under multiple semantic viewpoints β€” different combinations of state components (e.g., Query+Image, Query+Task, Image+Task) β€” so a single experience can be discovered through different contextual cues.

Phase 2: Experience Exploitation via Deep-and-Wide Search

  • Before each action, the agent selects a semantic viewpoint based on its current state and queries the experience bank through the corresponding embedding index;
  • Wide Search retrieves Top-K diverse experiences under the selected viewpoint for broad coverage;
  • Deep Search iterates across different viewpoints (e.g., first Query+Image, then Query+Task) to progressively gather complementary guidance;
  • Retrieved decision guidance is injected into the agent's context to inform the next action.
🌈 Check the case for better understanding!

πŸ“Š Key Results

Overall Performance

  • State-level experiences consistently outperform trajectory-level baselines. On Qwen3-VL-32B, MuSEAgent reaches 65.30% average accuracy, surpassing the strongest trajectory-level baseline (Reflexion, 57.33%) by nearly 8%. Decomposing trajectories into granular state-action pairs effectively mitigates the noise of monolithic trajectory retrieval.
  • Stateful experiences benefit compact models most, while absolute performance scales with model size. MuSEAgent improves the best baseline by +8% on Qwen3-VL-32B, indicating that compact models with limited intrinsic reasoning capacity benefit most from fine-grained stateful experiences. Meanwhile, absolute performance continues to scale with model size.

Deep-and-Wide Search

  • Scaling search depth and breadth consistently improves over the experience-free ReAct baseline. Even at minimal search scales, querying diverse semantic viewpoints provides effective state-aware guidance.
  • Performance peaks at moderate search scales before slightly declining. Excessive historical experiences introduce redundant information into the context window, mildly diluting guidance for the current state. A balanced search scope optimizes experience utilization without overwhelming the model's instruction-following capacity.

πŸ› οΈ Tool Bank

MuSEAgent is equipped with 13 multimodal tools:

Category Tool Description Dependency
Vision ocr Extract text from images PaddleOCR-VL model
localize_objects Detect and localize objects SAM3 model
zoom_in Zoom into specific regions β€”
crop Crop image regions β€”
visualize_regions Draw bounding boxes on images β€”
Depth estimate_region_depth Estimate depth of a region Depth-Anything model
estimate_object_depth Estimate depth of an object Depth-Anything model
Similarity get_image2images_similarity Compare image-to-image similarity CLIP model
get_image2texts_similarity Compare image-to-text similarity CLIP model
get_text2images_similarity Compare text-to-image similarity CLIP model
Utility calculator Arithmetic operations β€”
solve_math_equation Solve equations via Wolfram Alpha API key
web_search Web search via Google API key

βš™οΈ Configuration

1. Model Services

Start API services for the VLM and embedding model:

Vision-Language Model (Required)

vllm serve Qwen/Qwen3-VL-32B-Instruct \
    --port 8000 \
    --dtype auto
# β†’ http://localhost:8000/v1

Embedding Model (For retrieval)

vllm serve Qwen/Qwen3-VL-Embedding-8B \
    --runner pooling \
    --port 8001 \
    --dtype float16 \
    --trust-remote-code \
    --served-model-name qwen3vl-embed \
    --allowed-local-media-path /
# β†’ http://localhost:8001/v1

2. Tool Models

Some tools require local models. Edit tool/model_config.py and set the paths to your downloaded models:

# tool/model_config.py

CLIP_VERSION = "ViT-H-14-378-quickgelu"
CLIP_PRETRAINED = "/path/to/your/clip/model.bin"        # for similarity tools

DEPTH_ANYTHING_PATH = "/path/to/your/depth-anything"     # for depth tools

PADDLE_OCR_VL_MODEL_PATH = "/path/to/your/paddleocr"     # for ocr tool

SAM3_MODEL_PATH = "/path/to/your/sam3"                    # for localize_objects tool
Model Required By Download
CLIP (ViT-H-14-378) get_image2*_similarity, get_text2images_similarity laion/CLIP-ViT-H-14-378-quickgelu
Depth-Anything-V2 estimate_region_depth, estimate_object_depth depth-anything/Depth-Anything-V2-Large-hf
PaddleOCR-VL ocr PaddlePaddle/PaddleOCR-VL-1.5
SAM3 localize_objects facebook/sam3

3. Tool API Keys

Edit api/utils/keys.env for tools that require external APIs:

# api/utils/keys.env

GOOGLE_SEARCH_API_KEY=your-serper-api-key        # web_search tool
WOLFRAM_ALPHA_API_KEYS=your-key-1,your-key-2     # solve_math_equation tool

πŸ”§ Installation

# 1. Install uv (package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and enter repo
git clone https://github.com/DeepExperience/MuSEAgent.git
cd MuSEAgent

# 3. Create environment and install all dependencies
uv venv && source .venv/bin/activate
uv sync --all-extras

πŸƒ Quick Start

Step 1 β€” Prepare Data

Organize your data in the following format:

data/
└── <dataset_name>/
    β”œβ”€β”€ exploration/             # For collecting experiences
    β”‚   β”œβ”€β”€ dataset.jsonl
    β”‚   └── images/
    └── evaluation/              # For final evaluation
        β”œβ”€β”€ dataset.jsonl
        └── images/

Each line in dataset.jsonl:

{"id": "001", "question": "What color is the umbrella?", "images": ["img_001.jpg"], "answer": "red"}

πŸ€— Our evaluation datasets are available at ShijianW01/MuSEAgent-Eval.

Step 2 β€” Run Exploration

Generate trajectories on the exploration set using ReAct mode:

python run_react.py

Configure data paths and agent settings in run_react.py:

dataset_jsonl = "data/mmstar/exploration/dataset.jsonl"
dataset_image_dir = "data/mmstar/exploration/images"

agent_config = {
    "model_name": "Qwen/Qwen3-VL-32B-Instruct",
    "base_url": "http://localhost:8000/v1",
    "api_keys": ["your-api-key"], # or dummy key(s)
    "tool_bank": [
        "ocr", "localize_objects", "zoom_in", "crop", "visualize_regions",
        "estimate_region_depth", "estimate_object_depth",
        "get_image2images_similarity", "get_image2texts_similarity",
        "get_text2images_similarity",
        "calculator", "solve_math_equation", "web_search",
    ],
    "memory_dir": "memory/exploration",   # Trajectories saved here
    "max_iterations": 20,
}

Step 3 β€” Build Experience Bank

Extract state-level experiences and compute multi-view embeddings. We provide two scripts depending on your LLM setup:

Option A: vLLM / OpenAI-compatible endpoint (uses the openai SDK)

python scripts/build_state_bank_vllm.py \
    --memory_dir memory/exploration \
    --llm_model Qwen/Qwen3-VL-32B-Instruct \
    --llm_base_url http://localhost:8000/v1 \
    --llm_api_key your-api-key \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1 \
    --bank_dir_name state_bank

Option B: Azure / AWS / other cloud APIs (uses httpx directly, no SDK dependency)

# OpenAI API
python scripts/build_state_bank_aws.py \
    --memory_dir memory/exploration \
    --llm_api_key sk-xxx \
    --llm_base_url https://api.openai.com/v1/chat/completions \
    --llm_model gpt-4o \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1

# Azure OpenAI (auto-detected from URL, uses api-key header)
python scripts/build_state_bank_aws.py \
    --memory_dir memory/exploration \
    --llm_api_key YOUR_AZURE_KEY \
    --llm_base_url "https://YOUR_RESOURCE.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-12-01-preview" \
    --embedding_model Qwen/Qwen3-VL-Embedding-8B \
    --embedding_base_url http://localhost:8001/v1

Both scripts produce identical output. Option B auto-detects Azure endpoints (from azure or api-version in the URL) and switches to api-key header auth; all other URLs use standard Bearer token auth.

This produces memory/exploration/state_bank/ containing:

  • state_meta.json β€” experience metadata with Q-values and guidance
  • views/ β€” multi-view embedding indices (question, task, images, observations, and their compositions)

Step 4 β€” Evaluate with Experience Search

Run full MuSEAgent on the evaluation set:

python run_muse_agent.py

Configure evaluation paths and retrieval settings in run_muse_agent.py:

dataset_jsonl = "data/mmstar/evaluation/dataset.jsonl"
dataset_image_dir = "data/mmstar/evaluation/images"

agent_config = {
    "model_name": "Qwen/Qwen3-VL-32B-Instruct",
    "base_url": "http://localhost:8000/v1",
    "api_keys": ["your-api-key"],
    "tool_bank": [
        "ocr", "localize_objects", "zoom_in", "crop", "visualize_regions",
        "estimate_region_depth", "estimate_object_depth",
        "get_image2images_similarity", "get_image2texts_similarity",
        "get_text2images_similarity",
        "calculator", "solve_math_equation", "web_search",
    ],
    "memory_dir": "memory/evaluation",
    "max_iterations": 20,

    "retrieval": {
        "enable": True,
        "bank_memory_dir": "memory/exploration",
        "bank_dir_name": "state_bank",
        "embedding_model": "Qwen/Qwen3-VL-Embedding-8B",
        "embedding_base_url": "http://localhost:8001/v1",
        "min_score": 0.1,
        "min_q_value": 5,
        "experience_top_n": 3,
        "max_epoch": 3,
    },
}

Output Structure

Each run produces two types of outputs:

Evaluation results are saved to the output_dir specified in the Runner (e.g., results/muse_agent/):

results/muse_agent/
β”œβ”€β”€ results.jsonl         # Per-sample predictions and correctness
└── stats.json            # Aggregated accuracy (overall, by type, by subtask)

Trajectory memory is saved to the memory_dir specified in agent_config (e.g., memory/exploration/). Each task gets its own folder with a full interaction trace and copied input images:

memory/exploration/
└── tasks/
    β”œβ”€β”€ 00001/
    β”‚   β”œβ”€β”€ trace.json    # Full interaction trace (states, actions, observations)
    β”‚   β”œβ”€β”€ img_0.jpg     # Copied input image
    β”‚   └── img_1.jpg     # Tool-generated image (e.g., crop, zoom_in)
    β”œβ”€β”€ 00002/
    β”‚   β”œβ”€β”€ trace.json
    β”‚   β”œβ”€β”€ img_0.jpg
    β”‚   └── img_1.jpg
    └── ...

After building the experience bank (Step 3), the state bank is saved under the memory directory:

memory/exploration/
β”œβ”€β”€ tasks/                # Raw trajectories (from Step 2)
└── state_bank/           # Experience bank (from Step 3)
    β”œβ”€β”€ state_meta.json   # Experience metadata with Q-values and guidance
    └── views/            # Multi-view embedding indices
        β”œβ”€β”€ question.npy
        β”œβ”€β”€ question_mask.npy
        β”œβ”€β”€ question+images.npy
        β”œβ”€β”€ question+images_mask.npy
        └── ...           # One .npy + mask per semantic viewpoint

πŸ“– Agent Modes

Mode Script Tools Retrieval Description
Vanilla CoT run_vanilla_cot.py β€” β€” Direct VLM reasoning without tools
ReAct run_react.py βœ… β€” Tool-augmented reasoning loop
MuSEAgent run_muse_agent.py βœ… βœ… Full system with stateful experience search

πŸ“‚ Project Structure

MuSEAgent/
β”œβ”€β”€ agent/
β”‚   β”œβ”€β”€ base_agent.py              # Base agent with ReAct loop and tool dispatch
β”‚   └── mm_agent.py                # MuSE Agent with stateful experience search
β”œβ”€β”€ api/
β”‚   β”œβ”€β”€ wrapper.py                 # Async OpenAI-compatible LLM wrapper
β”‚   β”œβ”€β”€ async_pool.py              # Multi-key API pool with load balancing
β”‚   β”œβ”€β”€ vision_utils.py            # Multimodal message builder (image encoding)
β”‚   └── utils/keys.env             #   ← API keys for web_search, solve_math_equation
β”œβ”€β”€ tool/                          # 13 multimodal tools (see Tool Bank above)
β”‚   β”œβ”€β”€ base_tool.py               # Abstract tool interface
β”‚   β”œβ”€β”€ model_cache.py             # Shared model loading with GPU management
β”‚   β”œβ”€β”€ model_config.py            #   ← Local model paths (CLIP, Depth, OCR, SAM3)
β”‚   └── ...
β”œβ”€β”€ mm_memory/
β”‚   β”œβ”€β”€ memory.py                  # Trajectory memory (trace recording and persistence)
β”‚   β”œβ”€β”€ state_bank.py              # State bank (multi-view indexing and retrieval)
β”‚   └── embedder.py                # Multimodal embedding client (vLLM endpoint)
β”œβ”€β”€ runner/
β”‚   └── runner.py                  # Batch evaluation runner (async, with scoring)
β”œβ”€β”€ template/
β”‚   └── prompt_en.jinja2           # Jinja2 system prompt template
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ build_state_bank_vllm.py   # Build experience bank via vLLM / OpenAI SDK
β”‚   └── build_state_bank_aws.py    # Build experience bank via Azure / AWS / httpx
β”œβ”€β”€ config.py                      # Dataclass-based configuration (Agent, API, Retrieval)
β”œβ”€β”€ run_vanilla_cot.py             # Entry: Vanilla CoT mode
β”œβ”€β”€ run_react.py                   # Entry: ReAct mode
└── run_muse_agent.py              # Entry: Full MuSE Agent mode

πŸ“„ License

This project is released under the MIT License.

πŸ“ž Contact

For questions or feedback, please reach out at shijian@seu.edu.cn or open an issue on GitHub.

⭐ Star History

Star History Chart

About

A Multimodal Reasoning Agent with Stateful Experiences

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors