MARL Phase 1: add per-agent reward group and examples #1129
Open
luzai wants to merge 7 commits into inclusionAI:main
Conversation
Contributor
Code Review
This pull request introduces support for the MATH-500 and AMC12 datasets and enables multi-agent reinforcement learning (MARL) by incorporating normalization group logic into the inference engine. It also adds specialized reward functions and verifiers for mathematical tasks, including a new multiple-choice verifier. Review feedback identifies several critical improvements for the verifiers, such as catching TimeoutException to prevent crashes, fixing case-sensitivity bugs in regex-based answer extraction, and ensuring the multiple-choice verifier dynamically uses the configured choice set rather than hardcoding values.
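The extraction fixes called out in the review (case-insensitive matching, a configurable choice set instead of hardcoded letters) can be illustrated with a minimal sketch. This is a hypothetical helper, not the PR's actual `MathMultipleChoiceVerifyWorker`; the `TimeoutException` handling applies to the heavier symbolic-verification path and is not shown here.

```python
import re


def extract_choice(response: str, choices: str = "ABCDE"):
    """Extract a multiple-choice answer letter from a model response.

    `choices` is a parameter rather than a hardcoded set, and matching is
    case-insensitive, per the review feedback. Illustrative only.
    """
    # Prefer an explicit \boxed{X} LaTeX answer if one is present.
    m = re.search(r"\\boxed\{([%s])\}" % choices, response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fall back to the last standalone choice letter in the text.
    letters = re.findall(r"\b([%s])\b" % choices, response, re.IGNORECASE)
    return letters[-1].upper() if letters else None


def verify_choice(response: str, answer: str, choices: str = "ABCDE") -> float:
    """Binary reward: 1.0 if the extracted choice matches the reference."""
    pred = extract_choice(response, choices)
    return 1.0 if pred is not None and pred == answer.upper() else 0.0
```

Normalizing both sides with `.upper()` is what avoids the case-sensitivity bug: a model emitting `\boxed{c}` still matches a reference answer of `C`.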
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Description
This PR implements Phase 1 of the Reasoning & MARL Infrastructure roadmap as outlined in #1114. It establishes the data pipelines and specialized verifiers for math reasoning tasks, and provides infrastructure support for Multi-Agent Reinforcement Learning (MARL) workflows where agents share a single inference backend.
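The per-agent reward group from the PR title amounts to normalizing each reward against the statistics of its own group rather than the whole batch, so agents sharing one inference backend do not skew each other's baselines. A minimal sketch, assuming GRPO-style group-wise mean/std normalization (the function name and signature are illustrative; the PR's actual logic lives in the inference engine):

```python
from collections import defaultdict


def normalize_rewards_by_group(rewards, group_ids, eps=1e-6):
    """Normalize each reward by the mean/std of its own group.

    `group_ids[i]` identifies the (agent, prompt) group of `rewards[i]`.
    Illustrative only, not the PR's implementation.
    """
    # Bucket rewards by group.
    groups = defaultdict(list)
    for r, g in zip(rewards, group_ids):
        groups[g].append(r)
    # Per-group mean and standard deviation.
    stats = {}
    for g, rs in groups.items():
        mean = sum(rs) / len(rs)
        var = sum((r - mean) ** 2 for r in rs) / len(rs)
        stats[g] = (mean, var ** 0.5)
    # Standardize each reward within its group.
    return [
        (r - stats[g][0]) / (stats[g][1] + eps)
        for r, g in zip(rewards, group_ids)
    ]
```

With per-group statistics, an agent whose verifier rewards are systematically lower than another's still receives a centered advantage signal within its own group.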
Key Changes:

Dataset Integration & RL Support:
- Added the MATH-500 and amc12 datasets in areal/dataset, with rewards targeting the \boxed{} output format.

Infrastructure for MARL:
- Normalization group logic in the inference engine to support per-agent reward groups.

Math Verification Support:
- MathMultipleChoiceVerifyWorker: a verifier for multiple-choice datasets (like AMC12) that uses regex for LaTeX extraction and fallback string matching.
- MathVerifyWorker: added verify_for_math500 with automated canonicalization for \boxed answers and <think> tag stripping.

Reference Implementations:
- Example scripts (train_math_marti_shared.py and train_math_marti_single.py) demonstrating a Generator-Verifier-Refiner (Marti) multi-agent reasoning loop.

Setup
Hardware Environment: 1x Node equipped with 16x Huawei Ascend 910C NPUs.
Methods (CoA vs. Single-Agent):
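The Chain-of-Agents (Marti) method compared here follows a Generator-Verifier-Refiner loop. A minimal control-flow sketch with caller-supplied agent callables (illustrative structure only, not the example scripts' code, which drive a shared inference backend):

```python
def marti_loop(problem, generate, verify, refine, max_rounds=3):
    """One Chain-of-Agents pass: generate, then verify/refine until accepted.

    `generate`, `verify`, and `refine` are agent callables supplied by the
    caller; `verify` returns None to accept an answer, or feedback text.
    Hypothetical control flow for illustration.
    """
    answer = generate(problem)
    for _ in range(max_rounds):
        feedback = verify(problem, answer)
        if feedback is None:  # verifier accepts the answer
            return answer
        answer = refine(problem, answer, feedback)
    return answer  # best effort after max_rounds refinements
```

The Single-Agent baseline corresponds to calling `generate` alone, with the reward computed directly by the math verifier.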
Benchmarking Datasets:
How to Run
Training (GSM8K):
Training (MATH-500):
Training (AMC12):
Evaluation Results
On the GSM8K dataset, the Chain-of-Agents (CoA) approach is significantly more stable: the Single-Agent baseline frequently suffers reward collapse during training.
On the MATH-500 dataset, CoA outperforms the Single-Agent baseline in reward on both training and evaluation samples.
Rewards on training samples:

Rewards on evaluation samples:

Related Issue
Fixes #1114
Type of Change
Checklist
- Pre-commit checks pass (pre-commit run --all-files).
- Docs build successfully (./docs/build_all.sh).
- Branch is up to date with main.
- The /review-pr command was run.
- The PR was created with /create-pr.

Breaking Change Details (if applicable):
N/A; this change is backward compatible.
Additional Context
Phase 1 focuses on establishing the evaluation and data pipeline for reasoning tasks. Future phases will build upon this infrastructure for heterogeneous MARL.