support on policy distillation by yuekaizhang · Pull Request #118 · wenet-e2e/west

yuekaizhang · 2026-02-27T04:22:23Z

This PR supports on policy distillation training. See on policy distillation.

Task1. Audio Caption

Model	MMAU (v05.15.25)
Omni-Captioner	72.8
Step-Audio-R1.1	72.3
Qwen2.5-Omni-7B	68.7
Qwen2.5-Omni-3B	65.6
OPD (Student: Omni-7B, Teacher: Omni-Captioner)	70.1
OPD (Student: Omni-3B, Teacher: Omni-Captioner)	69.1
OPD (Student: Omni-3B, Teacher: Step-Audio-R1.1)	69.6

Note: Used cascade evaluation where the model first generates detailed captions, which are subsequently used to perform downstream question-answering tasks

Example: Caption Quality Comparison

mmau_test.wav

Before OPD: Qwen2.5-Omni-3B (Baseline)

The audio features a man speaking in a casual tone, expressing excitement about successfully defeating two opponents in a game with a single hit. The background includes music and sound effects, suggesting an action-packed video game environment.

After OPD: Qwen2.5-Omni-3B + On-Policy Distillation (Omni-captioner teacher)

The audio clip is a brief, high-energy, and immersive gaming session, likely from a first-person shooter or action game, set in a virtual environment that mimics a modern, urban battlefield. The recording begins with a male voice, likely the player, speaking in a casual, conversational tone. The voice is clear, with a neutral to slightly excited inflection, and is delivered in a standard American accent. The player's speech is punctuated by a brief, sharp electronic beep, which is characteristic of a digital interface or game alert.

As the player speaks, the audio transitions into a chaotic, action-packed sequence. The first sound is a sharp, metallic clank, followed by a rapid, high-pitched electronic beep, which is immediately accompanied by a loud, resonant explosion. The explosion is accompanied by a deep, rumbling bass tone, and the sound is layered with a gritty, metallic texture, suggesting a heavy, industrial environment. The player's voice is heard in the background, with a brief, excited exclamation: "I smashed both of them with one hit, very nice."

The next moment is marked by a sudden, intense burst of gunfire and a series of rapid, overlapping electronic beeps and clicks, indicating the use of a weapon or device. The player's voice is heard again, this time with a more urgent, excited tone: "And there's Roy's second life down." The audio is filled with the sounds of gunfire, explosions, and electronic beeps, all layered with a sense of urgency and excitement.

The clip ends abruptly, with a final, sharp electronic beep and a brief, high-pitched electronic tone, followed by a sudden, loud, and distorted electronic sound. The audio is heavily distorted, with a sense of chaos and disorientation, and the clip ends with a brief, high-pitched electronic beep.

Throughout the clip, the player's voice is clear and audible, and the background is filled with the sounds of gunfire, explosions, and electronic beeps, all layered with a sense of urgency and excitement. The overall atmosphere is one of intense action and high stakes, with the player's voice and the game's soundscape creating a vivid, immersive experience. The clip is likely from a first-person shooter or action game, and the player's excitement and commentary suggest a high level of engagement and enjoyment.

Task2. Audio QA

Model	MMAU (v05.15.25)	MMSU
Qwen2-Audio-7B	56.9	30.38
+ On-Policy Distillation (Used Qwen-omni-3b-grpo as the teacher)	67.9	53.30

Copilot

Pull request overview

This PR implements on-policy knowledge distillation training for audio language models, where a student model learns to match a teacher model's distribution on its own generated samples to avoid distribution shift issues.

Changes:

Adds two trainer classes for on-policy distillation: local teacher mode and remote API-based teacher mode
Extends datasets to include question and choices metadata for reward computation
Adds training script and complete example with evaluation pipeline

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
west/utils/constants.py	Adds audio caption template for captioning tasks
west/trainer/kd_trainer.py	Implements KnowledgeDistillationTrainer and RemoteKnowledgeDistillationTrainer classes
west/dataset/hf_dataset.py	Extends dataset to include question/choices metadata in collate function
west/bin/train_knowledge_distillation.py	Main training script supporting both local and remote teacher modes
west/bin/decode_mmau.py	Adds caption template to available choices
examples/on_policy_distillation/run.sh	Complete training/evaluation pipeline script
examples/on_policy_distillation/cascaded_audio_capiton_llm_eval.py	Cascaded evaluation using LLM to answer questions from captions
examples/on_policy_distillation/README.md	Documentation with results and usage instructions
examples/grpo/run.sh	Updates default deepspeed config reference
examples/grpo/conf/ds_zero3_omni.json	New DeepSpeed ZeRO-3 configuration file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

yuekaizhang added 8 commits February 3, 2026 16:19

add back opd

3e1549c

add remote kd trainer

ce5f566

fix kd metrics

f979db1

add opd audio capiton task

2772025

support rollout > 1

f1587d4

add opd

290e349

clean code

665be3d

add opd results

69f9468

Copilot AI review requested due to automatic review settings February 27, 2026 04:22

Copilot AI reviewed Feb 27, 2026

View reviewed changes

fix wrong name

2a5087a

robin1001 merged commit b68af95 into wenet-e2e:main Mar 2, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support on policy distillation#118

support on policy distillation#118
robin1001 merged 9 commits into
wenet-e2e:mainfrom
yuekaizhang:on_policy

yuekaizhang commented Feb 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yuekaizhang commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Before OPD: Qwen2.5-Omni-3B (Baseline)

After OPD: Qwen2.5-Omni-3B + On-Policy Distillation (Omni-captioner teacher)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yuekaizhang commented Feb 27, 2026 •

edited

Loading