Distortion and vibration in output after tokenizer → Flow → HiFiGAN #1879

I fine-tuned Fun-CosyVoice3-0.5B, and at inference time the generated audio contains significant distortion, noise, and vibration.

To isolate the issue, I performed the following tests:

**1. HiFiGAN-only test**

- Regenerated audio directly from an input audio chunk using HiFiGAN (no tokenizer or Flow)
- Output matches the original clean audio
- Suggests HiFiGAN is not the source of the issue

**2. Full pipeline test (tokenizer → Flow → HiFiGAN)**

- Passed clean audio samples from my dataset through the full pipeline
- Output contains noticeable vibration and distortion, despite clean input
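To put a number on the "vibration" rather than judging by ear, one option is to compare how much the loudness envelope fluctuates in the input and in the pipeline output. The sketch below is only an illustration and makes several assumptions: audio is already loaded as a list of mono float samples, the frame and hop sizes are arbitrary, and the function names `frame_rms_db` and `envelope_roughness` are hypothetical helpers, not part of CosyVoice.

```python
import math


def frame_rms_db(samples, frame_len=480, hop=240, floor_db=-100.0):
    """Return the RMS level (in dB) of each frame of a mono signal."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp silent frames to a floor instead of taking log10(0).
        level = 20.0 * math.log10(rms) if rms > 0.0 else floor_db
        levels.append(max(level, floor_db))
    return levels


def envelope_roughness(samples, frame_len=480, hop=240):
    """Mean absolute frame-to-frame change of the RMS envelope, in dB.

    A rapidly fluctuating (tremolo-like) signal scores higher than a
    steady one with the same content.
    """
    levels = frame_rms_db(samples, frame_len=frame_len, hop=hop)
    if len(levels) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(levels, levels[1:])]
    return sum(diffs) / len(diffs)
```

Running `envelope_roughness` on a clean input clip and on the corresponding pipeline output (at the same sample rate) gives two numbers to compare; a clearly higher value for the output would be consistent with the audible vibration. This is a rough, phase-insensitive indicator, not a perceptual metric.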

**3. Base vs fine-tuned Flow**

- Ran the same test with both the base Flow model and the fine-tuned Flow model
- Both produce similar vibration artifacts
- Suggests the artifacts are not introduced by fine-tuning

**Additional observation:**

- A clicking sound (similar to a mouse click) appears at the start and end of the generated audio
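As a possible workaround for the edge clicks, a short fade-in and fade-out can be applied to the generated waveform. This is a minimal sketch that only masks the symptom and does not address its cause; it assumes the audio is a list of mono float samples, and `apply_edge_fades` and the 10 ms default are hypothetical choices.

```python
def apply_edge_fades(samples, sample_rate, fade_ms=10.0):
    """Return a copy of the signal with linear fade-in and fade-out."""
    # Never let the two fades overlap on very short clips.
    n = min(int(sample_rate * fade_ms / 1000.0), len(samples) // 2)
    out = list(samples)
    for i in range(n):
        gain = i / n  # ramps from 0.0 towards 1.0
        out[i] *= gain        # fade-in at the start
        out[-1 - i] *= gain   # fade-out at the end
    return out
```

If the clicks are still audible well inside the clip after fading, they are probably not limited to the first and last few milliseconds, which would be worth noting separately.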

**What I’ve tried:**

- Multiple audio normalization techniques before feeding data to the tokenizer
- No improvement
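For reference, peak normalization is one common technique of this kind (the exact methods tried are not listed above, so this is only an illustrative assumption). A minimal sketch on a list of mono float samples, with `peak_normalize` and the 0.95 target as hypothetical choices:

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty input: nothing to scale.
        return list(samples)
    scale = target_peak / peak
    return [x * scale for x in samples]
```

Peak normalization only changes the overall gain, so it would not be expected to remove modulation-like artifacts that arise inside the pipeline.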

**Questions:**

- Has anyone encountered similar vibration/distortion artifacts in the tokenizer → Flow → HiFiGAN pipeline?
- Could this be related to a mismatch between tokenizer encoding and decoding, or to preprocessing?
- Any suggestions on debugging?
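One preprocessing check that is cheap to rule out is a sample-rate or channel mismatch between the dataset files and what each stage expects. The sketch below uses only Python's standard `wave` module, so it handles PCM WAV files only; `check_wav` is a hypothetical helper, and the expected rate is deliberately left as a parameter because the value each stage requires should be taken from the model configuration, not from this example.

```python
import wave


def wav_format(path):
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "num_frames": wf.getnframes(),
        }


def check_wav(path, expected_rate, expected_channels=1):
    """Return a list of human-readable mismatches (empty if none)."""
    info = wav_format(path)
    problems = []
    if info["sample_rate"] != expected_rate:
        problems.append(
            f"sample rate is {info['sample_rate']} Hz, expected {expected_rate} Hz"
        )
    if info["channels"] != expected_channels:
        problems.append(
            f"{info['channels']} channel(s), expected {expected_channels}"
        )
    return problems
```

Calling `check_wav(path, expected_rate=...)` over the dataset and over the files fed to the tokenizer would show whether any clip needs resampling or downmixing before it enters the pipeline.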
