How to finetune audio duplex ability?


This work and demo are very impressive. 😎 👍 👍

I'm interested in fine-tuning the model, particularly to improve full-duplex speech interaction (listening and speaking simultaneously) and interruption handling. At the same time, I want to avoid catastrophic forgetting that would degrade the duplex capabilities the model already has.

I would appreciate some clarification about the training design:

1. Interruption training
   How is barge-in trained in this model? Is it implemented similarly to Moshi-style streaming speech interaction, or to FLM-Audio-style duplex conversational modeling?

2. Duplex interaction (listen while speaking)
   How is the model trained to listen while speaking? Does the training data contain overlapping speech segments, or a special interaction format, that enables duplex behavior and monologue generation?

3. Fine-tuning details
   If we want to fine-tune while preserving the duplex capabilities, how should the data be fed to the model? (I assume the training data follows a format similar to the Hugging Face chat template, but I'm not sure how barge-in events or interruption labels are encoded in the dataset.)
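
To make question 3 more concrete, here is a purely hypothetical sketch of the kind of time-aligned format I have in mind. `Frame`, `find_barge_ins`, and the field names are my own illustration and are not taken from this repository; the real format may look nothing like this. Each frame is one fixed-length time slice holding what the user channel is doing and what the assistant is expected to emit, and a barge-in is the frame where the user starts speaking while the assistant is still mid-utterance.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """One fixed-length time slice of a two-channel conversation (hypothetical format)."""
    user_speaking: bool            # True if the user channel contains speech in this frame
    assistant_text: Optional[str]  # assistant target for this frame; None means silence


def find_barge_ins(frames: List[Frame]) -> List[int]:
    """Return frame indices where the user starts speaking while the assistant is talking."""
    hits = []
    for i in range(1, len(frames)):
        user_onset = frames[i].user_speaking and not frames[i - 1].user_speaking
        assistant_active = frames[i].assistant_text is not None
        if user_onset and assistant_active:
            hits.append(i)
    return hits


sample = [
    Frame(True, None),            # user asks a question
    Frame(False, "Sure,"),        # assistant starts answering
    Frame(False, "the weather"),
    Frame(True, "today is"),      # user interrupts while the assistant is speaking
    Frame(True, None),            # assistant yields and goes silent
]
print(find_barge_ins(sample))  # [3]
```

If the actual data looks something like this, is the interruption point labeled explicitly, or does the model only learn it implicitly from the assistant channel going silent after the user's onset?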

Thank you very much  😍 
    

How to finetune audio duplex ability? #1084
