News
- [2025-02] 🥳 MemOCR code and checkpoint released!
- [2025-01] 🔥 MemOCR paper released on arXiv and Hugging Face 🤗
- [2025-01] 🚀 We introduce MemOCR, a memory agent capable of forming and utilizing its memory in visual form.
MemOCR is a visual memory agent that dynamically adapts information density during memory drafting and reading. By incorporating budget-aware training objectives and adaptive information density mechanisms, MemOCR achieves superior performance in multi-hop question answering tasks while maintaining efficient token usage.
- Adaptive Information Density: Dynamically adjusts memory content richness based on task requirements
- Budget-Aware Training: Optimizes memory usage with explicit token budget constraints
- Dual-Domain Architecture: Separate memory drafting (text domain) and reading (vision domain) processes
- State-of-the-Art Performance: Superior results on HotpotQA, 2WikiMultihopQA, NaturalQuestions, and TriviaQA benchmarks
MemOCR consists of two main components:
- Memory Drafting in Text Domain: An LLM agent iteratively refines rich-text memory content based on question-answering feedback
- Memory Reading in Vision Domain: A vision-language model processes rendered visual memory with optimized information density
The framework employs budget-aware training objectives to balance memory informativeness and token efficiency.
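The two-component control flow above can be sketched as a draft-render-read loop. Everything below is a minimal illustration under assumptions: `llm_draft`, `render`, and `vlm_read` are placeholder callables standing in for the LLM agent, the rendering server, and the vision-language model; the real control flow lives in the recurrent module.

```python
def memocr_step(llm_draft, render, vlm_read, context, question, n_rounds=3):
    """Toy sketch of MemOCR's draft -> render -> read loop.

    llm_draft, render, and vlm_read are hypothetical stand-ins for the
    actual LLM agent, rendering server, and vision-language model.
    """
    memory_md, feedback, answer = None, None, None
    for _ in range(n_rounds):
        # Memory drafting in the text domain: the LLM agent proposes or
        # refines rich-text (Markdown) memory, guided by QA feedback.
        memory_md = llm_draft(context, feedback)
        # Memory reading in the vision domain: render the rich-text memory
        # to an image, then answer the question with the VLM.
        image = render(memory_md)
        answer, feedback = vlm_read(image, question)
    return answer
```

The number of refinement rounds and the exact feedback signal are illustrative choices, not values from the paper.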
MemOCR achieves state-of-the-art performance across multiple multi-hop QA benchmarks while maintaining efficient memory budgets (number of tokens for memory at inference time).
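The memory-budget trade-off can be illustrated with a toy budget-aware reward that discounts answer quality by memory length. The linear over-budget penalty and the coefficient below are illustrative assumptions, not the paper's exact training objective:

```python
def budget_aware_reward(answer_correct: bool, memory_tokens: int,
                        token_budget: int, penalty: float = 0.001) -> float:
    """Toy reward: task reward minus a penalty for exceeding the token budget.

    The linear penalty form is an assumption for illustration; see the
    paper for MemOCR's actual budget-aware objective.
    """
    task_reward = 1.0 if answer_correct else 0.0
    over_budget = max(0, memory_tokens - token_budget)
    return task_reward - penalty * over_budget

# A correct answer within budget keeps the full reward; exceeding the
# budget is penalized in proportion to the overshoot.
print(budget_aware_reward(True, 900, 1024))
print(budget_aware_reward(True, 2048, 1024))
```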
- Python 3.10
- NVIDIA GPU with CUDA support (we use CUDA 12.1)
- Compatible NVIDIA driver + CUDA runtime
conda create -n memocr -y python=3.10
conda activate memocr
python -m pip install --upgrade pip
pip3 install -r requirements.txt
wget -O flash_attn.whl https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install --no-cache-dir flash_attn.whl
MemOCR provides two rich-text-to-image rendering servers for training and evaluation:
- Markdown Rendering (by default)
- HTML Rendering (beta version)
cd md2img
pip3 install -r requirements_api.txt
playwright install
python3 markdown_api_server.py
The server will be available at http://localhost:9000 for rendering Markdown content to images.
The interaction between the agent and the rendering server is implemented in recurrent/impls/call_md_renderer.py.
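The agent-to-server interaction can be approximated as an HTTP POST to the server above. The /render endpoint path and the JSON payload fields below are assumptions for illustration; check markdown_api_server.py and recurrent/impls/call_md_renderer.py for the real interface:

```python
import json
import urllib.request

RENDER_URL = "http://localhost:9000/render"  # endpoint path is an assumption


def build_render_payload(markdown_text: str, width: int = 800) -> bytes:
    # Field names here are illustrative; see markdown_api_server.py
    # for the actual request schema.
    return json.dumps({"markdown": markdown_text, "width": width}).encode("utf-8")


def render_markdown(markdown_text: str) -> bytes:
    """POST Markdown to the rendering server and return the image bytes."""
    req = urllib.request.Request(
        RENDER_URL,
        data=build_render_payload(markdown_text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Calling render_markdown requires the server from the previous step to be running on localhost:9000.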
We use the same training data as MemAgent, which can be downloaded from HuggingFace.
Place the downloaded files in the following structure:
MemOCR/
├── data/
│   └── hotpotqa/
│       ├── hotpotqa_dev.parquet
│       └── hotpotqa_train_32k.parquet
└── scripts/
    └── train.sh
On the head node:
ray stop --force
ray start --head \
--port=8278 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265
On worker nodes (replace <HEAD_IP> with the head node's IP):
ray stop --force
ray start --address="<HEAD_IP>:8278"
Run the training script from the head node:
bash scripts/train.sh
Training logs will be saved to:
./log/<EXP_LOG_NAME>.log
./results/<EXP>/...
In our paper, we use 8×8 H800 GPUs (8 nodes with 8 GPUs each) for training MemOCR.
Our evaluation benchmark includes HotpotQA, 2WikiMultihopQA, NaturalQuestions, and TriviaQA. Pre-process the datasets using:
cd taskutils/memory_data
bash cmd-process.sh
Our evaluation is conducted on 8 × H800 GPUs. Reproduce the results with:
bash scripts/eval.sh
If you find MemOCR useful in your research, please consider citing:
@article{shi2026memocr,
  title={MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning},
  author={Yaorui Shi and Shugui Liu and Yu Yang and Wenyu Mao and Yuxin Chen and Qi GU and Hui Su and Xunliang Cai and Xiang Wang and An Zhang},
  journal={arXiv preprint arXiv:2601.21468},
  year={2026},
}
This project is developed by the Meituan LongCat team, building upon several excellent open-source projects:
- veRL as the reinforcement learning training framework;
- MemAgent for the recurrent module and training dataset;
- Playwright for rendering rich text to visual memory;
- Qwen2.5-VL as the vision-language model backbone.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Our training and evaluation datasets are derived from open-source resources:
- HotpotQA, 2WikiMultihopQA, Natural Questions, TriviaQA: These datasets are sourced from Wikipedia and other publicly available corpora. The Wikipedia-derived content is licensed under CC BY-SA 4.0 License.