I fine-tuned Fun-CosyVoice3-0.5B, and at inference time the generated audio has significant distortion, noise, and vibration.
To isolate the issue, I performed the following tests:
1. HiFiGAN-only test
- Regenerated audio directly from an input audio chunk using HiFiGAN (no tokenizer or Flow)
- Output matches the original clean audio
- Suggests HiFiGAN is not the source of the issue
2. Full pipeline test (tokenizer → Flow → HiFiGAN)
- Passed clean audio samples from my dataset through the full pipeline
- Output contains noticeable vibration and distortion, despite clean input
3. Base vs fine-tuned Flow
- Ran the full pipeline with both the base Flow model and the fine-tuned Flow model
- Both produce similar vibration artifacts
- Suggests the fine-tuning itself is not the source of the issue
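To put a number on the distortion seen in tests 2 and 3 instead of judging by ear, something like the following could be used. This is only a minimal sketch: it assumes the input and output are mono float arrays at the same sample rate and roughly time-aligned, and `log_spectral_distance` and its parameters are illustrative names, not part of CosyVoice.

```python
import numpy as np


def log_spectral_distance(ref, test, n_fft=1024, hop=256, eps=1e-8):
    """Mean log-spectral distance (in dB) between two mono signals.

    Assumes both signals share a sample rate and are roughly aligned;
    the longer one is trimmed to the length of the shorter one.
    """
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    n = min(len(ref), len(test))
    if n < n_fft:
        raise ValueError("signals are shorter than one FFT frame")
    window = np.hanning(n_fft)

    def log_power(x):
        # Slice into overlapping windowed frames, then take log power spectra.
        starts = range(0, n - n_fft + 1, hop)
        frames = np.stack([x[s:s + n_fft] * window for s in starts])
        power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
        return 10.0 * np.log10(power + eps)

    diff = log_power(ref[:n]) - log_power(test[:n])
    # RMS over frequency bins per frame, then averaged over frames.
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))
```

Running it on (input, HiFiGAN-only output) and on (input, full-pipeline output) would give two numbers to compare; a much larger value for the full pipeline would be consistent with the artifacts coming from the tokenizer or Flow stage.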
Additional observation:
- A clicking sound (like a mouse click) appears at the start and end of the generated audio
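One way to check for the start/end clicks numerically is to compare the energy in the first and last few milliseconds with the energy in the rest of the clip. The sketch below is an assumption about how that could be done; `edge_energy_ratio` and the 20 ms window are hypothetical choices, not anything from CosyVoice.

```python
import numpy as np


def edge_energy_ratio(audio, sample_rate, edge_ms=20.0):
    """Return (start_ratio, end_ratio): RMS of each edge divided by RMS of the body.

    Ratios well above 1 hint at a transient (click) at that edge.
    """
    audio = np.asarray(audio, dtype=np.float64)
    edge = int(sample_rate * edge_ms / 1000)
    if edge <= 0 or len(audio) < 3 * edge:
        raise ValueError("audio is too short for the chosen edge window")

    def rms(x):
        return float(np.sqrt(np.mean(np.square(x))))

    body = rms(audio[edge:-edge]) + 1e-12  # avoid division by zero on silence
    return rms(audio[:edge]) / body, rms(audio[-edge:]) / body
```

Comparing the ratios for the original clip and the generated clip should show whether the clicks are introduced by the pipeline or are already present at the edges of the input.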
What I’ve tried:
- Multiple audio normalization techniques before feeding data to the tokenizer
- No improvement
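For context, peak normalization is one example of the kind of preprocessing meant here. The snippet is an illustrative sketch only: it is not the exact code used, and the function name and the -1 dBFS target are assumptions.

```python
import numpy as np


def peak_normalize(audio, target_dbfs=-1.0):
    """Scale a float waveform so its absolute peak sits at target_dbfs."""
    audio = np.asarray(audio, dtype=np.float64)
    peak = np.max(np.abs(audio))
    if peak == 0.0:
        return audio.copy()  # pure silence: nothing to scale
    target_peak = 10.0 ** (target_dbfs / 20.0)
    return audio * (target_peak / peak)
```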
Questions:
- Has anyone encountered similar vibration/distortion artifacts in the tokenizer → Flow → HiFiGAN pipeline?
- Could this be related to a tokenizer encoding/decoding mismatch or to preprocessing?
- Any suggestions on debugging?