How to finetune audio duplex ability? #1084

@edward15241

This work and demo are very impressive. 😎 👍 👍

I'm interested in fine-tuning the model, particularly to improve speech duplex interaction (simultaneous listening and speaking) and interruption handling. At the same time, I want to avoid catastrophic forgetting that might degrade these existing capabilities.

I would appreciate some clarification about the training design:

  1. Interruption training
    How is barge-in (user interruption) trained in this model? Is it implemented in a way similar to Moshi-style streaming speech interaction, or to FLM-Audio-style duplex conversational modeling?

  2. Duplex interaction (listen while speaking)
    How is the model trained to listen while speaking? Does the training data contain overlapping speech segments or a special interaction format that enables duplex behavior and monologue generation?

  3. Fine-tuning details
    If we want to fine-tune while keeping the duplex capabilities, how should the data be fed to the model? (I assume the training data follows a format similar to the Hugging Face chat template, but I'm not sure how barge-in events or interruption labels are encoded in the dataset.)
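To make question 3 more concrete, here is a minimal sketch of the kind of time-aligned record I have in mind. Everything in it is my own assumption for illustration: the `Segment` fields, the speaker labels, and the `find_barge_ins` helper are hypothetical and not taken from this repository or from any documented data format. It only shows how overlapping user/assistant segments could be turned into explicit barge-in events.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    # All field names are hypothetical, chosen only for illustration.
    speaker: str  # "user" or "assistant"
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcript of the segment


def find_barge_ins(segments: List[Segment]) -> List[dict]:
    """Return one event for each user segment that starts while an
    assistant segment is still being spoken (i.e. overlapping speech)."""
    events = []
    for user_seg in segments:
        if user_seg.speaker != "user":
            continue
        for asst_seg in segments:
            if asst_seg.speaker != "assistant":
                continue
            # A barge-in: the user starts strictly inside the assistant's turn.
            if asst_seg.start < user_seg.start < asst_seg.end:
                overlap = min(asst_seg.end, user_seg.end) - user_seg.start
                events.append({
                    "time": user_seg.start,
                    "interrupted_text": asst_seg.text,
                    "overlap": round(overlap, 3),
                })
    return events


sample = [
    Segment("assistant", 0.0, 4.0, "The weather today is sunny and"),
    Segment("user", 2.5, 3.5, "Wait, what about tomorrow?"),
    Segment("assistant", 4.2, 6.0, "Tomorrow looks rainy."),
]
events = find_barge_ins(sample)
```

Is the actual training data closer to this kind of timestamped two-channel layout, or are interruptions marked with special tokens inside a single chat-style sequence?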

Thank you very much 😍
