README.md (2 changes: 1 addition & 1 deletion)
@@ -69,7 +69,7 @@ MOSS‑TTS Family is an open‑source **speech and sound generation model family
When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.

- **MOSS‑TTS**: The flagship production model, featuring high fidelity and best-in-class zero-shot voice cloning. It supports **long-speech generation**, **fine-grained control over Pinyin, phonemes, and duration**, and **multilingual/code-switched synthesis**.
- **MOSS‑TTSD**: A spoken dialogue generation model for expressive, multi-speaker, and ultra-long dialogues. The new **v1.0 version** achieves **industry-leading performance on objective metrics** and **outperforms top closed-source models such as Doubao and Gemini 2.5 Pro** in subjective evaluations.
- **MOSS‑TTSD**: A spoken dialogue generation model for expressive, multi-speaker, and ultra-long dialogues. The new **v1.0 version** achieves **industry-leading performance on objective metrics** and **outperforms top closed-source models such as Doubao and Gemini 2.5 Pro** in subjective evaluations. See the [MOSS-TTSD repository](https://github.com/OpenMOSS/MOSS-TTSD) for details, and the script-format sketch after this list.
- **MOSS‑VoiceGenerator**: An open-source voice design model capable of generating diverse voices and styles directly from text prompts, **without any reference speech**. It unifies voice design, style control, and synthesis, functioning independently or as a design layer for downstream TTS. Its performance **surpasses other top-tier voice design models in arena ratings**.
- **MOSS‑TTS‑Realtime**: A multi-turn context-aware model for real-time voice agents. It uses incremental synthesis to ensure natural and coherent replies, making it **ideal for building low-latency voice agents when paired with text models**.
- **MOSS‑SoundEffect**: A content creation model specialized in **sound effect generation** with wide category coverage and controllable duration. It generates audio for natural environments, urban scenes, biological sounds, human actions, and musical fragments, suitable for film, games, and interactive experiences.
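
MOSS‑TTSD consumes dialogue scripts in which each turn is prefixed with a speaker tag (`[S1]`, `[S2]`, ...), the same convention used by the Gradio demo's preset example further down in this diff. Below is a minimal sketch of assembling such a script; the helper `format_dialogue` is a hypothetical name for illustration, not an API from this repo.

```python
# Hypothetical helper (not part of the repo): build a dialogue script in the
# [S1]/[S2] tag format that the demo below feeds to MOSS-TTSD.
def format_dialogue(turns: list[tuple[int, str]]) -> str:
    """Prefix each (speaker, text) turn with its [S<n>] tag, one turn per line."""
    return "\n".join(f"[S{speaker}] {text}" for speaker, text in turns)

script = format_dialogue([
    (1, "Listen, let's talk business."),
    (2, "Well, the pace of innovation there is extraordinary, honestly."),
])
# script == "[S1] Listen, let's talk business.\n[S2] Well, the pace of innovation there is extraordinary, honestly."
```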
README_zh.md (2 changes: 1 addition & 1 deletion)
@@ -72,7 +72,7 @@ The MOSS‑TTS Family is developed by [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](h
When a single piece of audio needs to **sound like a real human**, **pronounce every word accurately**, **switch speaking styles across content**, **stay stable for tens of minutes**, and **support dialogue, role-play, and real-time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** splits the workflow into five production-grade models that can be used independently or composed into a complete pipeline.

- **MOSS‑TTS**: MOSS‑TTS is the family's flagship production-grade TTS foundation model. **Its core strengths are high fidelity and best-in-class zero-shot voice cloning.** It supports **long-text, long-speech generation**, **fine-grained control over Pinyin, phonetic notation, and duration**, and **multilingual/Chinese-English code-switched synthesis**, and can serve as the core foundation for large-scale narration, dubbing, and voice products.
- **MOSS‑TTSD**: MOSS‑TTSD is a spoken dialogue generation model for expressive, multi-speaker, ultra-long continuous dialogue audio. This release introduces the new **v1.0 version**: compared with v0.7, it achieves **industry-leading performance on objective metrics** such as timbre similarity, speaker-switching accuracy, and word error rate, and in arena-style subjective evaluations it also **beats top closed-source models such as Doubao and Gemini 2.5 Pro**.
- **MOSS‑TTSD**: MOSS‑TTSD is a spoken dialogue generation model for expressive, multi-speaker, ultra-long continuous dialogue audio. This release introduces the new **v1.0 version**: compared with v0.7, it achieves **industry-leading performance on objective metrics** such as timbre similarity, speaker-switching accuracy, and word error rate, and in arena-style subjective evaluations it also **beats top closed-source models such as Doubao and Gemini 2.5 Pro**. For details, see the [MOSS-TTSD repository](https://github.com/OpenMOSS/MOSS-TTSD).
- **MOSS‑VoiceGenerator**: MOSS‑VoiceGenerator is an open-source voice design model that generates diverse speaker timbres and styles directly from text style instructions, **without any reference audio**. It unifies voice design, style control, and content synthesis, working standalone or as the voice design layer for downstream TTS. Its performance **surpasses other top-tier voice design models in arena ratings**.
- **MOSS‑TTS‑Realtime**: MOSS‑TTS‑Realtime is a multi-turn, context-aware real-time TTS model for real-time voice agents. It combines the text and the speech history of a multi-turn conversation for low-latency incremental synthesis, keeping replies coherent, natural, and consistent in timbre across turns. **It is ideal for building low-latency voice agents when paired with text models.**
- **MOSS‑SoundEffect**: MOSS‑SoundEffect is a **sound effect generation** model for content production, offering broad category coverage and controllable duration. From text instructions it generates audio of natural environments, urban scenes, animal sounds, human actions, and music-like fragments, suitable for film and TV, games, interactive experiences, and data synthesis.
clis/moss_ttsd_app.py (137 changes: 136 additions & 1 deletion)
@@ -25,6 +25,66 @@
DEFAULT_MAX_NEW_TOKENS = 2000
MIN_SPEAKERS = 1
MAX_SPEAKERS = 5
PRESET_REF_AUDIO_S1 = "assets/audio/reference_02_s1.wav"
PRESET_REF_AUDIO_S2 = "assets/audio/reference_02_s2.wav"
PRESET_PROMPT_TEXT_S1 = (
"[S1] In short, we embarked on a mission to make America great again for all Americans."
)
PRESET_PROMPT_TEXT_S2 = (
"[S2] NVIDIA reinvented computing for the first time after 60 years. In fact, Erwin at IBM knows quite "
"well that the computer has largely been the same since the 60s."
)
PRESET_DIALOGUE_TEXT = (
"[S1] Listen, let's talk business. China. I'm hearing things.\n"
"People are saying they're catching up. Fast. What's the real scoop?\n"
"Their AI, is it a threat?\n"
"[S2] Well, the pace of innovation there is extraordinary, honestly.\n"
"They have the researchers, and they have the drive.\n"
"[S1] Extraordinary? I don't like that. I want us to be extraordinary.\n"
"Are they winning?\n"
"[S2] I wouldn't say winning, but their progress is very promising.\n"
"They are building massive clusters. They're very determined.\n"
"[S1] Promising. There it is. I hate that word.\n"
"When China is promising, it means we're losing.\n"
"It's a disaster, Jensen. A total disaster."
)
PRESET_EXAMPLES = [
{
"name": "Quick Start | reference_02_s1/s2",
"speaker_count": 2,
"s1_audio": PRESET_REF_AUDIO_S1,
"s1_prompt": PRESET_PROMPT_TEXT_S1,
"s2_audio": PRESET_REF_AUDIO_S2,
"s2_prompt": PRESET_PROMPT_TEXT_S2,
"dialogue_text": PRESET_DIALOGUE_TEXT,
}
]
PRESET_DISPLAY_FIELDS = [
("Speaker Count", "speaker_count"),
("S1 Reference Audio (Optional)", "s1_audio"),
("S1 Prompt Text (Required with reference audio)", "s1_prompt"),
("S2 Reference Audio (Optional)", "s2_audio"),
("S2 Prompt Text (Required with reference audio)", "s2_prompt"),
("Dialogue Text", "dialogue_text"),
]


def _build_preset_table_rows():
    """Flatten each preset into [field, value] display rows for the preset table."""
    rows = []
row_to_preset = []
for preset_idx, preset in enumerate(PRESET_EXAMPLES):
for field_name, field_key in PRESET_DISPLAY_FIELDS:
value = str(preset.get(field_key, ""))
if field_key == "dialogue_text":
value = value.replace("\n", " ").strip()
if len(value) > 120:
value = value[:120] + " ..."
rows.append([field_name, value])
row_to_preset.append(preset_idx)
return rows, row_to_preset


PRESET_TABLE_ROWS, PRESET_TABLE_ROW_TO_PRESET = _build_preset_table_rows()
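# With the single preset above, PRESET_TABLE_ROWS holds six [field, value] rows
# (e.g. ["Speaker Count", "2"]); the "Dialogue Text" value is flattened to one
# line and truncated to 120 characters. PRESET_TABLE_ROW_TO_PRESET maps every
# row back to preset index 0, so clicking any row resolves to the same preset.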


def resolve_attn_implementation(requested: str, device: torch.device, dtype: torch.dtype) -> str | None:
@@ -197,6 +257,59 @@ def update_speaker_panels(speaker_count: int):
return [gr.update(visible=(idx < count)) for idx in range(MAX_SPEAKERS)]


def apply_preset_selection(evt: gr.SelectData):
    """Fill the demo inputs from the preset row the user clicked; no-op if the selection is invalid."""
    # One gr.update() per output: six input widgets plus one visibility update per speaker panel.
    noop = (
        *[gr.update() for _ in range(6)],
        *[gr.update() for _ in range(MAX_SPEAKERS)],
    )

    if evt is None or evt.index is None:
        return noop

    # Gradio may report the selected cell as (row, col) or as a flat row index.
    if isinstance(evt.index, (tuple, list)):
        row_idx = int(evt.index[0])
    else:
        row_idx = int(evt.index)

    if row_idx < 0 or row_idx >= len(PRESET_TABLE_ROW_TO_PRESET):
        return noop

    preset_idx = PRESET_TABLE_ROW_TO_PRESET[row_idx]
    if preset_idx < 0 or preset_idx >= len(PRESET_EXAMPLES):
        return noop

    preset = PRESET_EXAMPLES[preset_idx]
    panel_updates = update_speaker_panels(int(preset["speaker_count"]))
    return (
        gr.update(value=int(preset["speaker_count"])),
        gr.update(value=str(preset["s1_audio"])),
        gr.update(value=str(preset["s1_prompt"])),
        gr.update(value=str(preset["s2_audio"])),
        gr.update(value=str(preset["s2_prompt"])),
        gr.update(value=str(preset["dialogue_text"])),
        *panel_updates,
    )


def _merge_consecutive_speaker_tags(text: str) -> str:
    # Split with a lookahead so each segment keeps its leading [S<n>] tag.
    segments = re.split(r"(?=\[S\d+\])", text)
    if not segments:
@@ -593,12 +706,34 @@ def build_demo(args: argparse.Namespace):
output_audio = gr.Audio(label="Output Audio", type="numpy", elem_id="output_audio")
gr.HTML("", elem_id="output_audio_spacer")
status = gr.Textbox(label="Status", lines=4, interactive=False, elem_id="output_status")
preset_examples = gr.Dataframe(
headers=["Field", "Value (click any row to fill inputs)"],
value=PRESET_TABLE_ROWS,
datatype=["str", "str"],
row_count=(len(PRESET_TABLE_ROWS), "fixed"),
col_count=(2, "fixed"),
interactive=False,
wrap=True,
label="Preset Examples",
)

speaker_count.change(
fn=update_speaker_panels,
inputs=[speaker_count],
outputs=speaker_panels,
)
    # Gradio injects gr.SelectData from the click event itself, so no `inputs` list is needed.
    preset_examples.select(
        fn=apply_preset_selection,
outputs=[
speaker_count,
speaker_refs[0],
speaker_prompts[0],
speaker_refs[1],
speaker_prompts[1],
dialogue_text,
*speaker_panels,
],
)

run_btn.click(
fn=lambda speaker_count, *inputs: run_inference(
@@ -673,4 +808,4 @@ def main() -> None:


if __name__ == "__main__":
main()