support on policy distillation #118

Merged: robin1001 merged 9 commits into wenet-e2e:main from yuekaizhang:on_policy on Mar 2, 2026

@yuekaizhang yuekaizhang commented Feb 27, 2026

This PR adds support for on-policy distillation (OPD) training. See on policy distillation.

Task 1. Audio Caption

| Model | MMAU (v05.15.25) |
| --- | --- |
| Omni-Captioner | 72.8 |
| Step-Audio-R1.1 | 72.3 |
| Qwen2.5-Omni-7B | 68.7 |
| Qwen2.5-Omni-3B | 65.6 |
| OPD (Student: Omni-7B, Teacher: Omni-Captioner) | 70.1 |
| OPD (Student: Omni-3B, Teacher: Omni-Captioner) | 69.1 |
| OPD (Student: Omni-3B, Teacher: Step-Audio-R1.1) | 69.6 |

Note: Results use cascaded evaluation, in which the model first generates a detailed caption and that caption is then used to answer the downstream questions.
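
The cascaded evaluation described in the note can be sketched roughly as below. This is only an illustration, not the PR's evaluation script: `cascaded_accuracy`, `caption_fn`, `answer_fn`, and the sample field names (`audio`, `question`, `choices`, `answer`) are all hypothetical, and exact string matching is assumed as the scoring rule.

```python
from typing import Callable, Dict, List


def cascaded_accuracy(
    samples: List[Dict],
    caption_fn: Callable[[str], str],
    answer_fn: Callable[[str, str, List[str]], str],
) -> float:
    """Return the accuracy (in percent) of a caption-then-answer pipeline.

    Each sample is assumed to carry the keys "audio", "question",
    "choices" and "answer" (hypothetical field names).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    for sample in samples:
        # Stage 1: the audio model listens to the clip and writes a caption.
        caption = caption_fn(sample["audio"])
        # Stage 2: a text-only model answers from the caption alone;
        # it never sees the audio.
        prediction = answer_fn(caption, sample["question"], sample["choices"])
        # Assumed scoring rule: case-insensitive exact match on the choice text.
        if prediction.strip().lower() == sample["answer"].strip().lower():
            correct += 1
    return 100.0 * correct / len(samples)
```

Under this setup the score depends on how much answerable detail the caption carries, since the second stage has no other source of information about the audio.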

Example: Caption Quality Comparison

mmau_test.wav

Before OPD: Qwen2.5-Omni-3B (Baseline)

The audio features a man speaking in a casual tone, expressing excitement about successfully defeating two opponents in a game with a single hit. The background includes music and sound effects, suggesting an action-packed video game environment.

After OPD: Qwen2.5-Omni-3B + On-Policy Distillation (Omni-Captioner teacher)

The audio clip is a brief, high-energy, and immersive gaming session, likely from a first-person shooter or action game, set in a virtual environment that mimics a modern, urban battlefield. The recording begins with a male voice, likely the player, speaking in a casual, conversational tone. The voice is clear, with a neutral to slightly excited inflection, and is delivered in a standard American accent. The player's speech is punctuated by a brief, sharp electronic beep, which is characteristic of a digital interface or game alert.

As the player speaks, the audio transitions into a chaotic, action-packed sequence. The first sound is a sharp, metallic clank, followed by a rapid, high-pitched electronic beep, which is immediately accompanied by a loud, resonant explosion. The explosion is accompanied by a deep, rumbling bass tone, and the sound is layered with a gritty, metallic texture, suggesting a heavy, industrial environment. The player's voice is heard in the background, with a brief, excited exclamation: "I smashed both of them with one hit, very nice."

The next moment is marked by a sudden, intense burst of gunfire and a series of rapid, overlapping electronic beeps and clicks, indicating the use of a weapon or device. The player's voice is heard again, this time with a more urgent, excited tone: "And there's Roy's second life down." The audio is filled with the sounds of gunfire, explosions, and electronic beeps, all layered with a sense of urgency and excitement.

The clip ends abruptly, with a final, sharp electronic beep and a brief, high-pitched electronic tone, followed by a sudden, loud, and distorted electronic sound. The audio is heavily distorted, with a sense of chaos and disorientation, and the clip ends with a brief, high-pitched electronic beep.

Throughout the clip, the player's voice is clear and audible, and the background is filled with the sounds of gunfire, explosions, and electronic beeps, all layered with a sense of urgency and excitement. The overall atmosphere is one of intense action and high stakes, with the player's voice and the game's soundscape creating a vivid, immersive experience. The clip is likely from a first-person shooter or action game, and the player's excitement and commentary suggest a high level of engagement and enjoyment.

Task 2. Audio QA

| Model | MMAU (v05.15.25) | MMSU |
| --- | --- | --- |
| Qwen2-Audio-7B | 56.9 | 30.38 |
| + On-Policy Distillation (teacher: Qwen-omni-3b-grpo) | 67.9 | 53.30 |

Copilot AI review requested due to automatic review settings February 27, 2026 04:22

Copilot AI left a comment

Pull request overview

This PR implements on-policy knowledge distillation training for audio language models, where a student model learns to match a teacher model's distribution on its own generated samples to avoid distribution shift issues.
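
As a rough illustration of "matching the teacher's distribution on the student's own samples", the sketch below computes a per-token reverse KL, KL(student || teacher), over a sequence the student is assumed to have generated. The page does not state which divergence the trainers actually minimize, so reverse KL is an assumption here. Plain Python lists stand in for tensors, and the function names are illustrative rather than taken from `west/trainer/kd_trainer.py`.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> float:
    """Mean over token positions of KL(student || teacher).

    Both inputs have shape [num_tokens][vocab_size] and are assumed to be
    scored on the same sequence, which the student itself sampled.
    """
    if len(student_logits) != len(teacher_logits) or not student_logits:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
        total += sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
    return total / len(student_logits)
```

Because the sequence comes from the student rather than from a fixed dataset, the loss is evaluated on the states the student actually visits at inference time, which is the distribution-shift point made in the overview.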

Changes:

  • Adds two trainer classes for on-policy distillation: local teacher mode and remote API-based teacher mode
  • Extends datasets to include question and choices metadata for reward computation
  • Adds training script and complete example with evaluation pipeline

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| west/utils/constants.py | Adds audio caption template for captioning tasks |
| west/trainer/kd_trainer.py | Implements KnowledgeDistillationTrainer and RemoteKnowledgeDistillationTrainer classes |
| west/dataset/hf_dataset.py | Extends dataset to include question/choices metadata in collate function |
| west/bin/train_knowledge_distillation.py | Main training script supporting both local and remote teacher modes |
| west/bin/decode_mmau.py | Adds caption template to available choices |
| examples/on_policy_distillation/run.sh | Complete training/evaluation pipeline script |
| examples/on_policy_distillation/cascaded_audio_capiton_llm_eval.py | Cascaded evaluation using LLM to answer questions from captions |
| examples/on_policy_distillation/README.md | Documentation with results and usage instructions |
| examples/grpo/run.sh | Updates default deepspeed config reference |
| examples/grpo/conf/ds_zero3_omni.json | New DeepSpeed ZeRO-3 configuration file |

Comment thread west/trainer/kd_trainer.py
Comment thread west/trainer/kd_trainer.py
Comment thread west/trainer/kd_trainer.py
Comment thread west/trainer/kd_trainer.py
Comment thread examples/on_policy_distillation/cascaded_audio_caption_llm_eval.py
Comment thread examples/on_policy_distillation/cascaded_audio_caption_llm_eval.py
Comment thread examples/on_policy_distillation/run.sh
@robin1001 robin1001 merged commit b68af95 into wenet-e2e:main Mar 2, 2026
1 check passed