# AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning
This repository contains the official implementation of AgentPO (ICLR 2026). It trains a dedicated Collaborator agent with reinforcement learning under a fixed multi-agent topology, optimizing collaboration with an Actor agent to improve end-to-end performance on reasoning tasks.
Paper: *AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning*. Code: [github.com/sunlin-ai/agentpo](https://github.com/sunlin-ai/agentpo).
## Repository Structure

```text
.
├── agentpo/                       # Core training, rewards, and related logic
│   ├── main_dapo.py               # Training entry point (DAPO / verl)
│   ├── reward_manager_agentpo.py  # AgentPO reward manager
│   ├── rl_dataset.py              # Dataset pipeline (referenced by scripts)
│   └── evaluation/                # Math benchmarks (includes latex2sympy, etc.)
├── scripts/
│   ├── train.sh                   # Example training script (edit paths & hyperparameters)
│   └── test.sh                    # Example evaluation script (edit paths & checkpoints)
└── verl/                          # Upstream verl (editable install)
```

The `verl/` directory bundles upstream [verl](https://github.com/volcengine/verl) as an editable install.
The evaluation pipeline is adapted from math-evaluation-harness. Training builds on the verl stack.
## Setup

- **Python:** 3.10+ recommended, with a CUDA-enabled PyTorch build.
- **Install verl (required)** from the repository root:

  ```shell
  cd verl
  pip install -e .
  ```

  See the verl documentation for optional components (e.g., vLLM, FSDP).
- **PYTHONPATH:** add the repository root to `PYTHONPATH` so `python -m agentpo.main_dapo` resolves the `agentpo` package:

  ```shell
  export PYTHONPATH="/path/to/this/repo:${PYTHONPATH}"
  ```
- **Evaluation dependencies:** when running scripts under `agentpo/evaluation`, see `agentpo/evaluation/README.md` for the local `latex2sympy` install and `requirements.txt`.
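Putting the steps above together, a fresh setup might look like the following. This is a sketch, not a verified install recipe: the clone URL matches the code link above, `pip install -e ./verl` restates the `cd verl && pip install -e .` step, and the final import is only a sanity check.

```shell
# Sketch of a fresh setup; adjust paths for your environment.
git clone https://github.com/sunlin-ai/agentpo.git
cd agentpo

# Editable install of the bundled verl stack (required for training).
pip install -e ./verl

# Put the repository root on PYTHONPATH so the agentpo package resolves.
export PYTHONPATH="$(pwd):${PYTHONPATH}"
python -c "import agentpo"   # sanity check: should exit cleanly
```

Prepending `$(pwd)` rather than appending ensures the checkout shadows any previously installed `agentpo` package on the path.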
## Training

Example (matches the bundled script; edit model paths, data paths, `HOME`, etc. in `scripts/train.sh` for your environment):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/train.sh
```

The main entry point is `python -m agentpo.main_dapo` with Hydra configuration; `train.sh` illustrates typical settings for the AgentPO reward manager and cooperation mode. For algorithm and implementation details, refer to the paper (Section 2 and appendices).
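Because the entry point uses Hydra, individual settings can also be overridden on the command line instead of editing the script. The override keys below follow upstream verl's naming convention and are assumptions — check `scripts/train.sh` for the exact keys AgentPO uses:

```shell
# Sketch of a direct invocation with Hydra overrides (key names assumed
# from upstream verl; all /path values are placeholders).
export PYTHONPATH="/path/to/this/repo:${PYTHONPATH}"
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m agentpo.main_dapo \
    data.train_files=/path/to/train.parquet \
    actor_rollout_ref.model.path=/path/to/base-model \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1
```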
## Evaluation

Example (update `scripts/test.sh` for your checkpoint, `MERGE_MODEL_PATH`, datasets, and GPUs):

```shell
bash scripts/test.sh
```

The script invokes `agentpo/evaluation/math_eval.py` with multi-dataset and vLLM options; see `agentpo/evaluation/README.md`.
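For a single ad-hoc run, `math_eval.py` can also be called directly. The flag names below follow the upstream math-evaluation-harness and are assumptions — consult `scripts/test.sh` and `agentpo/evaluation/README.md` for the interface this adaptation actually exposes:

```shell
# Sketch of a direct evaluation call (flag names assumed from the upstream
# harness; checkpoint path and dataset names are placeholders).
python agentpo/evaluation/math_eval.py \
    --model_name_or_path /path/to/checkpoint \
    --data_names math,gsm8k \
    --use_vllm
```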
## Citation

If you use this code or the paper, please cite:

```bibtex
@inproceedings{sun2026agentpo,
  title     = {AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning},
  author    = {Sun, Lin and Liu, Chuang and Zhang, Can and Wu, Yubin and Lu, Weijia and Wu, Ning},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}
```

## Acknowledgements

- Training stack: [verl](https://github.com/volcengine/verl) (Apache License 2.0).
- Math evaluation: math-evaluation-harness.