MARL Phase 1: add per-agent reward group and examples #1129
Open
luzai wants to merge 7 commits into inclusionAI:main
Conversation
Contributor
Code Review
This pull request introduces support for the MATH-500 and AMC12 datasets and enables multi-agent reinforcement learning (MARL) by incorporating normalization group logic into the inference engine. It also adds specialized reward functions and verifiers for mathematical tasks, including a new multiple-choice verifier. Review feedback identifies several critical improvements for the verifiers, such as catching TimeoutException to prevent crashes, fixing case-sensitivity bugs in regex-based answer extraction, and ensuring the multiple-choice verifier dynamically uses the configured choice set rather than hardcoding values.
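The extraction fixes called out in the review (case-insensitive matching, a configurable choice set instead of hardcoded letters) can be illustrated with a minimal sketch. This is a hypothetical helper, not the PR's actual `MathMultipleChoiceVerifyWorker`; the `TimeoutException` handling applies to the heavier symbolic-verification path and is not shown here.

```python
import re


def extract_choice(response: str, choices: str = "ABCDE"):
    """Extract a multiple-choice answer letter from a model response.

    `choices` is a parameter rather than a hardcoded set, and matching is
    case-insensitive, per the review feedback. Illustrative only.
    """
    # Prefer an explicit \boxed{X} LaTeX answer if one is present.
    m = re.search(r"\\boxed\{([%s])\}" % choices, response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fall back to the last standalone choice letter in the text.
    letters = re.findall(r"\b([%s])\b" % choices, response, re.IGNORECASE)
    return letters[-1].upper() if letters else None


def verify_choice(response: str, answer: str, choices: str = "ABCDE") -> float:
    """Binary reward: 1.0 if the extracted choice matches the reference."""
    pred = extract_choice(response, choices)
    return 1.0 if pred is not None and pred == answer.upper() else 0.0
```

Normalizing both sides with `.upper()` is what avoids the case-sensitivity bug: a model emitting `\boxed{c}` still matches a reference answer of `C`.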
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Description
This PR implements Phase 1 of the Reasoning & MARL Infrastructure roadmap as outlined in #1114. It establishes the data pipelines and specialized verifiers for math reasoning tasks, and provides infrastructure support for Multi-Agent Reinforcement Learning (MARL) workflows where agents share a single inference backend.
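The per-agent reward group from the PR title amounts to normalizing each reward against the statistics of its own group rather than the whole batch, so agents sharing one inference backend do not skew each other's baselines. A minimal sketch, assuming GRPO-style group-wise mean/std normalization (the function name and signature are illustrative; the PR's actual logic lives in the inference engine):

```python
from collections import defaultdict


def normalize_rewards_by_group(rewards, group_ids, eps=1e-6):
    """Normalize each reward by the mean/std of its own group.

    `group_ids[i]` identifies the (agent, prompt) group of `rewards[i]`.
    Illustrative only, not the PR's implementation.
    """
    # Bucket rewards by group.
    groups = defaultdict(list)
    for r, g in zip(rewards, group_ids):
        groups[g].append(r)
    # Per-group mean and standard deviation.
    stats = {}
    for g, rs in groups.items():
        mean = sum(rs) / len(rs)
        var = sum((r - mean) ** 2 for r in rs) / len(rs)
        stats[g] = (mean, var ** 0.5)
    # Standardize each reward within its group.
    return [
        (r - stats[g][0]) / (stats[g][1] + eps)
        for r, g in zip(rewards, group_ids)
    ]
```

With per-group statistics, an agent whose verifier rewards are systematically lower than another's still receives a centered advantage signal within its own group.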
Key Changes:

Dataset Integration & RL Support:
- Added the MATH-500 and amc12 datasets in areal/dataset, with rewards targeting the \boxed{} output format.

Infrastructure for MARL:
- Normalization group logic in the inference engine to support per-agent reward groups.

Math Verification Support:
- MathMultipleChoiceVerifyWorker: a verifier for multiple-choice datasets (like AMC12) that uses regex for LaTeX extraction and fallback string matching.
- MathVerifyWorker: added verify_for_math500 with automated canonicalization for \boxed answers and <think> tag stripping.

Reference Implementations:
- Example scripts (train_math_marti_shared.py and train_math_marti_single.py) demonstrating a Generator-Verifier-Refiner (Marti) multi-agent reasoning loop.

Setup
Hardware Environment: 1x Node equipped with 16x Huawei Ascend 910C NPUs.
Methods (CoA vs. Single-Agent):
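The Chain-of-Agents (Marti) method compared here follows a Generator-Verifier-Refiner loop. A minimal control-flow sketch with caller-supplied agent callables (illustrative structure only, not the example scripts' code, which drive a shared inference backend):

```python
def marti_loop(problem, generate, verify, refine, max_rounds=3):
    """One Chain-of-Agents pass: generate, then verify/refine until accepted.

    `generate`, `verify`, and `refine` are agent callables supplied by the
    caller; `verify` returns None to accept an answer, or feedback text.
    Hypothetical control flow for illustration.
    """
    answer = generate(problem)
    for _ in range(max_rounds):
        feedback = verify(problem, answer)
        if feedback is None:  # verifier accepts the answer
            return answer
        answer = refine(problem, answer, feedback)
    return answer  # best effort after max_rounds refinements
```

The Single-Agent baseline corresponds to calling `generate` alone, with the reward computed directly by the math verifier.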
Benchmarking Datasets:
How to Run
Training (GSM8K):
Training (MATH-500):
Training (AMC12):
Evaluation Results
On the GSM8K dataset, the Chain-of-Agents (CoA) approach is significantly more stable: the Single-Agent baseline frequently suffers reward collapse during training.
On the MATH-500 dataset, CoA outperforms the Single-Agent baseline in reward on both training and evaluation samples.
Rewards on training samples:

Rewards on evaluation samples:

Related Issue
Fixes #1114
Type of Change
Checklist
- Pre-commit checks pass (pre-commit run --all-files).
- Docs build successfully (./docs/build_all.sh).
- Branch is up to date with main.
- The /review-pr command was run.
- The PR was created with /create-pr.

Breaking Change Details (if applicable):
N/A; this change is backward compatible.
Additional Context
Phase 1 focuses on establishing the evaluation and data pipeline for reasoning tasks. Future phases will build upon this infrastructure for heterogeneous MARL.