AgentPO

AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning

This repository contains the official implementation of AgentPO (ICLR 2026). It trains a dedicated Collaborator agent with reinforcement learning under a fixed multi-agent topology, optimizing collaboration with an Actor agent to improve end-to-end performance on reasoning tasks.

Paper: AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning. Code: github.com/sunlin-ai/agentpo.
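As a purely hypothetical sketch of the fixed topology described above (the real interaction protocol is defined in the paper and in agentpo/main_dapo.py; all names below are illustrative stand-ins, not this repository's API):

```python
# Hypothetical sketch: a trainable Collaborator drafts guidance and a
# fixed Actor produces the final answer. Only the Collaborator's policy
# is optimized with RL; the Actor consumes its hint.

def collaborate(question: str, collaborator, actor) -> str:
    hint = collaborator(question)              # policy trained by RL
    return actor(f"{question}\nHint: {hint}")  # fixed agent uses the hint

# Toy stand-ins for the two language-model agents:
toy_collaborator = lambda q: "try small cases first"
toy_actor = lambda prompt: "42" if "Hint:" in prompt else "unsure"
```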

Repository layout

.
├── agentpo/                 # Core training, rewards, and related logic
│   ├── main_dapo.py         # Training entry point (DAPO / verl)
│   ├── reward_manager_agentpo.py
│   ├── rl_dataset.py        # Dataset pipeline (referenced by scripts)
│   └── evaluation/          # Math benchmarks (includes latex2sympy, etc.)
├── scripts/
│   ├── train.sh             # Example training script (edit paths & hyperparameters)
│   └── test.sh              # Example evaluation script (edit paths & checkpoints)
└── verl/                    # Upstream [verl](https://github.com/volcengine/verl) (editable install)

The evaluation pipeline is adapted from math-evaluation-harness. Training builds on the verl stack.


Setup

  1. Python: Python 3.10+ is recommended, with a CUDA-enabled PyTorch build.

  2. Install verl (required) from the repository root:

    cd verl
    pip install -e .

    See the verl documentation for optional components (e.g., vLLM, FSDP).

  3. Add the repository root to PYTHONPATH so that python -m agentpo.main_dapo can resolve the agentpo package:

    export PYTHONPATH="/path/to/this/repo:${PYTHONPATH}"

  4. Evaluation extras: when running scripts under agentpo/evaluation, see agentpo/evaluation/README.md for the local latex2sympy install and its requirements.txt.
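Steps 2 and 3 above can be combined into one shell session (the repository path is a placeholder; the final line is only a sanity check that the package resolves):

```shell
# From the repository root: install the bundled verl in editable mode
cd verl
pip install -e .
cd ..

# Make the agentpo package importable from anywhere
export PYTHONPATH="$(pwd):${PYTHONPATH}"

# Sanity check: the training entry point's package should now import
python -c "import agentpo" && echo "agentpo is importable"
```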


Training

Example (matches the bundled script; edit model paths, data paths, HOME, etc. in scripts/train.sh for your environment):

CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/train.sh

The main entry point is python -m agentpo.main_dapo with Hydra configuration; train.sh illustrates typical settings for the AgentPO reward manager and cooperation mode. For algorithm and implementation details, refer to the paper (Section 2 and appendices).


Evaluation

Example (update scripts/test.sh for your checkpoint, MERGE_MODEL_PATH, datasets, and GPUs):

bash scripts/test.sh

The script invokes agentpo/evaluation/math_eval.py with multi-dataset and vLLM options; see agentpo/evaluation/README.md.
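For intuition, math evaluation typically compares answers symbolically rather than as raw strings. A minimal sketch with sympy (the harness's real checker, built on latex2sympy, additionally parses LaTeX and handles many more edge cases):

```python
import sympy

def answers_equivalent(pred: str, ref: str) -> bool:
    """Treat two answer strings as equal if their difference simplifies
    to zero; fall back to string comparison when parsing fails."""
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(ref)) == 0
    except (sympy.SympifyError, TypeError, SyntaxError):
        return pred.strip() == ref.strip()
```

This is why "1/2" and "0.5" should score as the same answer even though they differ as strings.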


Citation

If you use this code or the paper, please cite:

@inproceedings{sun2026agentpo,
  title     = {AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning},
  author    = {Sun, Lin and Liu, Chuang and Zhang, Can and Wu, Yubin and Lu, Weijia and Wu, Ning},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}

Acknowledgements

Training builds on the verl stack; the evaluation pipeline is adapted from math-evaluation-harness.