# AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning
This repository contains the official implementation of AgentPO (ICLR 2026). It trains a dedicated Collaborator agent with reinforcement learning under a fixed multi-agent topology, optimizing collaboration with an Actor agent to improve end-to-end performance on reasoning tasks.
Paper: *AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning*. Code: [github.com/sunlin-ai/agentpo](https://github.com/sunlin-ai/agentpo).
## Repository Structure

```text
.
├── agentpo/                       # Core training, rewards, and related logic
│   ├── main_dapo.py               # Training entry point (DAPO / verl)
│   ├── reward_manager_agentpo.py  # AgentPO reward manager
│   ├── rl_dataset.py              # Dataset pipeline (referenced by scripts)
│   └── evaluation/                # Math benchmarks (includes latex2sympy, etc.)
├── scripts/
│   ├── train.sh                   # Example training script (edit paths & hyperparameters)
│   └── test.sh                    # Example evaluation script (edit paths & checkpoints)
└── verl/                          # Upstream verl (editable install)
```

The `verl/` directory bundles upstream [verl](https://github.com/volcengine/verl) as an editable install.
The evaluation pipeline is adapted from math-evaluation-harness. Training builds on the verl stack.
## Setup

- **Python:** 3.10+ recommended, with a CUDA-enabled PyTorch build.
- **Install verl (required)** from the repository root:

  ```shell
  cd verl
  pip install -e .
  ```

  See the verl documentation for optional components (e.g., vLLM, FSDP).
- **PYTHONPATH:** add the repository root to `PYTHONPATH` so `python -m agentpo.main_dapo` resolves the `agentpo` package:

  ```shell
  export PYTHONPATH="/path/to/this/repo:${PYTHONPATH}"
  ```
- **Evaluation dependencies:** when running scripts under `agentpo/evaluation`, see `agentpo/evaluation/README.md` for the local `latex2sympy` install and `requirements.txt`.
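Putting the steps above together, a fresh setup might look like the following. This is a sketch, not a verified install recipe: the clone URL matches the code link above, `pip install -e ./verl` restates the `cd verl && pip install -e .` step, and the final import is only a sanity check.

```shell
# Sketch of a fresh setup; adjust paths for your environment.
git clone https://github.com/sunlin-ai/agentpo.git
cd agentpo

# Editable install of the bundled verl stack (required for training).
pip install -e ./verl

# Put the repository root on PYTHONPATH so the agentpo package resolves.
export PYTHONPATH="$(pwd):${PYTHONPATH}"
python -c "import agentpo"   # sanity check: should exit cleanly
```

Prepending `$(pwd)` rather than appending ensures the checkout shadows any previously installed `agentpo` package on the path.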
## Training

Example (matches the bundled script; edit model paths, data paths, `HOME`, etc. in `scripts/train.sh` for your environment):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/train.sh
```

The main entry point is `python -m agentpo.main_dapo` with Hydra configuration; `train.sh` illustrates typical settings for the AgentPO reward manager and cooperation mode. For algorithm and implementation details, refer to the paper (Section 2 and appendices).
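Because the entry point uses Hydra, individual settings can also be overridden on the command line instead of editing the script. The override keys below follow upstream verl's naming convention and are assumptions — check `scripts/train.sh` for the exact keys AgentPO uses:

```shell
# Sketch of a direct invocation with Hydra overrides (key names assumed
# from upstream verl; all /path values are placeholders).
export PYTHONPATH="/path/to/this/repo:${PYTHONPATH}"
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m agentpo.main_dapo \
    data.train_files=/path/to/train.parquet \
    actor_rollout_ref.model.path=/path/to/base-model \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1
```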
## Evaluation

Example (update `scripts/test.sh` for your checkpoint, `MERGE_MODEL_PATH`, datasets, and GPUs):

```shell
bash scripts/test.sh
```

The script invokes `agentpo/evaluation/math_eval.py` with multi-dataset and vLLM options; see `agentpo/evaluation/README.md`.
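For a single ad-hoc run, `math_eval.py` can also be called directly. The flag names below follow the upstream math-evaluation-harness and are assumptions — consult `scripts/test.sh` and `agentpo/evaluation/README.md` for the interface this adaptation actually exposes:

```shell
# Sketch of a direct evaluation call (flag names assumed from the upstream
# harness; checkpoint path and dataset names are placeholders).
python agentpo/evaluation/math_eval.py \
    --model_name_or_path /path/to/checkpoint \
    --data_names math,gsm8k \
    --use_vllm
```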
## Citation

If you use this code or the paper, please cite:

```bibtex
@inproceedings{sun2026agentpo,
  title     = {AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning},
  author    = {Sun, Lin and Liu, Chuang and Zhang, Can and Wu, Yubin and Lu, Weijia and Wu, Ning},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}
```

## Acknowledgements

- Training stack: [verl](https://github.com/volcengine/verl) (Apache License 2.0).
- Math evaluation: math-evaluation-harness.