Support GRPO Training Recipe for Speech LLM #117

Merged: robin1001 merged 21 commits into wenet-e2e:main from yuekaizhang:rl on Feb 3, 2026
@yuekaizhang (Contributor) commented Jan 30, 2026

Support Matrix

Models

| Model | HuggingFace |
| --- | --- |
| Qwen2.5-Omni-3B | Qwen/Qwen2.5-Omni-3B |
| Qwen2.5-Omni-7B | Qwen/Qwen2.5-Omni-7B |
| Qwen2-Audio-7B-Instruct | Qwen/Qwen2-Audio-7B-Instruct |

Results

| Model | MMAU (v05.15.25) | MMSU |
| --- | --- | --- |
| Qwen2.5-Omni-3B | 69.8 | 59.1 |
| + GRPO | 71.6 | 60.46 |
| Qwen2.5-Omni-7B | 72.1 | 58.56 |
| + GRPO | 73.4 | 65.38 |
| Qwen2-Audio-7B | 56.9 | 30.38 |
| + GRPO | 67.2 | 54.12 |
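
The absolute gains implied by the table can be tallied with a short script. This is only a convenience sketch: the scores are copied from the table above, while the dictionary layout and the `gains` helper are illustrative names, not part of the recipe.

```python
# Scores copied from the Results table: (baseline, + GRPO) per benchmark.
results = {
    "Qwen2.5-Omni-3B": {"MMAU": (69.8, 71.6), "MMSU": (59.1, 60.46)},
    "Qwen2.5-Omni-7B": {"MMAU": (72.1, 73.4), "MMSU": (58.56, 65.38)},
    "Qwen2-Audio-7B": {"MMAU": (56.9, 67.2), "MMSU": (30.38, 54.12)},
}


def gains(table):
    """Return the absolute improvement (GRPO minus baseline) per model and benchmark."""
    return {
        model: {bench: round(after - before, 2) for bench, (before, after) in scores.items()}
        for model, scores in table.items()
    }


for model, delta in gains(results).items():
    print(f"{model}: MMAU +{delta['MMAU']}, MMSU +{delta['MMSU']}")
```

On these numbers the largest improvement is on Qwen2-Audio-7B, and the smallest is on Qwen2.5-Omni-7B for MMAU.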

Copilot AI review requested due to automatic review settings January 30, 2026 08:02

Copilot AI left a comment

Pull request overview

Adds a GRPO (Group Relative Policy Optimization) training/eval recipe for speech/audio-capable LLMs (Qwen2-Audio / Qwen2.5-Omni), including dataset adapters, reward functions, trainer implementation, and runnable example scripts.
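
For readers unfamiliar with GRPO, the core idea can be sketched in a few lines of plain Python. This is a minimal illustration of the commonly published formulation (group-normalized advantages, a clipped probability-ratio surrogate, and a KL penalty against a reference policy), not the code in `west/trainer/grpo_trainer.py`, which may differ. The function names and the default values for `eps`, `clip_eps`, and `beta` are assumptions chosen for the example.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's sampled completions within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage, clip_eps=0.2, beta=0.04):
    """Per-token GRPO loss: clipped surrogate minus a KL penalty (assumed hyperparameters)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate between the current and reference policies.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return -(surrogate - beta * kl)


# Four completions sampled for one prompt, two of them rewarded.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because advantages are computed relative to the other completions for the same prompt, no separate value model is needed; a group in which every completion earns the same reward contributes zero advantage.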

Changes:

  • Introduces a custom GRPOTrainer implementing rollout, reward computation, KL penalty, and GRPO loss.
  • Adds HuggingFace audio dataset wrapper + prompt templates + reward functions for <answer> / <think> formatting.
  • Adds training and vLLM-based evaluation scripts plus an example recipe (DeepSpeed configs, README, helper scripts).
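
As an illustration of what rule-based rewards for `<think>` / `<answer>` formatting can look like, here is a minimal sketch. The exact rules in `west/utils/rewards.py` are not shown on this page, so the function names, the regular expressions, and the case-insensitive exact-match comparison are all assumptions.

```python
import re

# Assumed output layout: a <think> block followed by an <answer> block.
FORMAT_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the whole completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the first <answer> span matches the reference (case-insensitive), else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == reference.strip().lower() else 0.0


sample = "<think>The clip contains barking.</think>\n<answer>Dog</answer>"
print(format_reward(sample), accuracy_reward(sample, "dog"))
```

Keeping the format and accuracy signals as separate functions lets a trainer weight or sum them per completion before computing group advantages.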

Reviewed changes

Copilot reviewed 10 out of 14 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| west/utils/rewards.py | Adds reward functions used for GRPO training. |
| west/utils/constants.py | Adds prompt templates and a template map. |
| west/trainer/grpo_trainer.py | Implements GRPO training loop on top of transformers.Trainer. |
| west/trainer/__init__.py | Package marker (currently empty). |
| west/dataset/hf_dataset.py | Adds HF dataset loader + collator for audio QA training/eval. |
| west/bin/train_grpo.py | Adds GRPO training entrypoint for Qwen audio/omni models. |
| west/bin/decode_mmsu.py | Adds vLLM-based MMSU decoding + accuracy reporting. |
| west/bin/decode_mmau.py | Adds vLLM-based MMAU decoding. |
| examples/grpo/scripts/download_mmau_test.sh | Adds helper to download MMAU test-mini audio set. |
| examples/grpo/run.sh | Adds end-to-end example runner (prepare/train/eval stages). |
| examples/grpo/requirements.txt | Adds example-specific Python dependencies. |
| examples/grpo/conf/ds_zero3.json | Adds DS ZeRO-3 config for the example. |
| examples/grpo/conf/ds_zero1.json | Adds DS ZeRO-1 config for the example. |
| examples/grpo/README.md | Documents the GRPO recipe and reported results. |

Comment thread west/utils/constants.py
Comment thread examples/grpo/README.md
Comment thread examples/grpo/scripts/download_mmau_test.sh
Comment thread examples/grpo/run.sh Outdated
Comment thread west/dataset/hf_dataset.py Outdated
Comment thread west/utils/rewards.py
Comment thread west/utils/rewards.py Outdated
Comment thread west/utils/rewards.py
Comment thread examples/grpo/scripts/download_mmau_test.sh
Comment thread examples/grpo/run.sh
@robin1001 robin1001 merged commit 65edd9d into wenet-e2e:main Feb 3, 2026
1 check passed