
MARL Phase 1: add per-agent reward group and examples#1129

Open
luzai wants to merge 7 commits into inclusionAI:main from luzai:marl-phase-1

Conversation


@luzai luzai commented Apr 1, 2026

Description

This PR implements Phase 1 of the Reasoning & MARL Infrastructure roadmap as outlined in #1114. It establishes the data pipelines and specialized verifiers for math reasoning tasks, and provides infrastructure support for Multi-Agent Reinforcement Learning (MARL) workflows where agents share a single inference backend.

Key Changes:

  1. Dataset Integration & RL Support:

    • Added support for MATH-500 and amc12 datasets in areal/dataset.
    • Special thanks to the edev2000/amc12-full dataset on Hugging Face for providing the processed AMC12 data.
    • Implemented RL-specific dataloaders that format prompts for reasoning tasks (e.g., enforcing \boxed{} output format).
  2. Infrastructure for MARL:

    • Introduced the norm_group field in InteractionWithTokenLogpReward to support agent-wise group-based reward normalization.
    • Updated GroupedRolloutWorkflow to merge interaction results deterministically based on norm_group. This guarantees that the turns generated by each agent in the multi-agent system are correctly ordered in the shared-backend MAS setting, so that agent-wise GRPO proceeds correctly.
  3. Math Verification Support:

    • MathMultipleChoiceVerifyWorker: A verifier for multiple-choice datasets (like AMC12) that uses regex for LaTeX extraction and fallback string matching.
    • MathVerifyWorker: Added verify_for_math500 with automated canonicalization for \boxed answers and <think> tag stripping.
  4. Reference Implementations:

    • Added example training scripts (train_math_marti_shared.py and train_math_marti_single.py) demonstrating a Generator-Verifier-Refiner (Marti) multi-agent reasoning loop.
    • Provided corresponding YAML configurations for GSM8K, MATH-500, and AMC12.
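The agent-wise, group-based reward normalization described in item 2 can be sketched as follows. This is a minimal illustration, assuming interactions are plain dicts with `norm_group` and `reward` keys; in the PR itself, `norm_group` is a field of `InteractionWithTokenLogpReward`, and the exact normalization code may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_rewards_per_group(interactions):
    """Compute GRPO-style advantages by normalizing rewards within
    each norm_group, so each agent's turns are normalized only
    against turns produced by the same agent."""
    # Bucket rewards by their normalization group (one group per agent).
    groups = defaultdict(list)
    for item in interactions:
        groups[item["norm_group"]].append(item["reward"])
    # Per-group mean and std; fall back to 1.0 for constant groups
    # to avoid division by zero.
    stats = {g: (mean(rs), pstdev(rs) or 1.0) for g, rs in groups.items()}
    return [
        {
            **item,
            "advantage": (item["reward"] - stats[item["norm_group"]][0])
            / stats[item["norm_group"]][1],
        }
        for item in interactions
    ]
```

With this grouping, a low-reward verifier turn is not penalized merely for sitting in the same batch as high-reward generator turns; each agent's advantages are centered on its own baseline.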

Setup

Hardware Environment: 1x Node equipped with 16x Huawei Ascend 910C NPUs.

Methods (CoA vs. Single-Agent):

  • Baseline: A standard Single-Agent GRPO setup.
  • Proposed (CoA): A Chain-of-Agents architecture using Per-Agent Reward Grouping.

Benchmarking Datasets:

  • GSM8K: Basic multi-step grade school math.
  • MATH-500: A representative challenging subset of the MATH dataset.
  • AMC12: High-school competition-level multiple-choice problems, utilizing the MathMultipleChoiceVerifyWorker.
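The verification behavior described above (boxed-answer extraction with `<think>` stripping, plus a choice-letter fallback for AMC12) can be sketched roughly as below. The function names, regexes, and default choice set are illustrative assumptions, not the PR's actual implementation.

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a completion,
    after stripping any <think>...</think> reasoning block."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verify_multiple_choice(completion, answer, choices="ABCDE"):
    """Check an AMC12-style answer: prefer \\boxed{}, then fall back
    to the last standalone choice letter (case-insensitive)."""
    pred = extract_boxed(completion)
    if pred is None:
        # Fallback: last bare choice letter in the completion.
        letters = re.findall(
            rf"\b([{choices}{choices.lower()}])\b", completion
        )
        pred = letters[-1] if letters else ""
    return pred.upper() == answer.upper()
```

Note the case-insensitive comparison at the end; as the bot review below points out, case sensitivity and hardcoded choice sets are exactly the pitfalls such a verifier needs to avoid.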

How to Run

Training (GSM8K):

python -m examples.openai_agents.train_math_marti_shared --config examples/openai_agents/config_marti_grpo-gsm8k.yaml 

Training (MATH-500):

python -m examples.openai_agents.train_math_marti_shared --config examples/openai_agents/config_marti_grpo-math500.yaml

Training (AMC12):

python -m examples.openai_agents.train_math_marti_shared --config examples/openai_agents/config_marti_grpo-amc.yaml

Evaluation Results

On the GSM8K dataset, the Chain-of-Agents (CoA) approach is significantly more stable. In contrast, the Single-Agent baseline frequently suffers reward collapse during training.

[image: GSM8K reward curves]

On the MATH-500 dataset, CoA outperforms the Single-Agent baseline on both training and evaluation rewards.

Rewards on training samples:
[image]

Rewards on evaluation samples:
[image]


Related Issue

Fixes #1114

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):
N/A; this change is backward compatible.


Additional Context

Phase 1 focuses on establishing the evaluation and data pipeline for reasoning tasks. Future phases will build upon this infrastructure for heterogeneous MARL.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the MATH-500 and AMC12 datasets and enables multi-agent reinforcement learning (MARL) by incorporating normalization group logic into the inference engine. It also adds specialized reward functions and verifiers for mathematical tasks, including a new multiple-choice verifier. Review feedback identifies several critical improvements for the verifiers, such as catching TimeoutException to prevent crashes, fixing case-sensitivity bugs in regex-based answer extraction, and ensuring the multiple-choice verifier dynamically uses the configured choice set rather than hardcoding values.

Comment threads: areal/reward/__init__.py (5 outdated, 1 current), areal/reward/amc12.py (1 outdated)
@luzai luzai force-pushed the marl-phase-1 branch 5 times, most recently from 4aca2a9 to 0eae11f Compare April 6, 2026 16:33
@luzai luzai marked this pull request as ready for review April 6, 2026 17:21
@luzai luzai force-pushed the marl-phase-1 branch 6 times, most recently from f551138 to 41c5b5d Compare April 10, 2026 21:33
@luzai luzai requested a review from garrett4wade as a code owner April 14, 2026 20:31
luzai and others added 7 commits April 15, 2026 10:35
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

[Feature] Feature Proposal: Multi-Agent Training Framework (Dr. MAS Integration)
