The CosyVoice API provides a Text-to-Speech (TTS) service with multilingual voice cloning.
- Base URL: http://localhost:8012
- API Versions: v1, v2 (CosyVoice2), v3 (CosyVoice3 - Recommended)
| Version | Model | Description |
|---|---|---|
| v1 | CosyVoice2-0.5B | Backward compatibility |
| v2 | CosyVoice2-0.5B | Legacy support |
| v3 | CosyVoice3-0.5B | Latest - Recommended |
- 9+ languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
- 18+ Chinese dialects
- Instruction-based voice control
- ~150ms streaming latency
- Better content consistency & speaker similarity
The API currently does not require authentication.
{
"success": true,
"message": "Operation completed",
"data": { ... }
}

{
"detail": "Error message"
}

Check the status of the server and its models.
Response:
{
"status": "healthy",
"v2": {
"ready": true,
"model": "CosyVoice2-0.5B"
},
"v3": {
"ready": true,
"model": "CosyVoice3-0.5B"
}
}

Create a new voice from an audio sample.
Content-Type: multipart/form-data
Request Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| voice_id | string | Yes | Unique voice identifier |
| name | string | Yes | Human-readable voice name |
| description | string | No | Voice description |
| voice_type | enum | Yes | sft, zero_shot, cross_lingual, instruct |
| language | string | No | Primary language (e.g., "zh", "en", "ja") |
| prompt_text | string | No | Text matching the audio sample |
| audio_format | enum | No | wav, mp3, flac (default: wav) |
| audio_file | file | Yes | Audio file for voice cloning |
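
The enum and required-field constraints in the table can be checked on the client before uploading. The sketch below is only an illustration under stated assumptions: `build_voice_form` is a hypothetical helper, not part of the API, and it returns only the text fields of the multipart form (the audio file would be attached separately).

```python
VOICE_TYPES = {"sft", "zero_shot", "cross_lingual", "instruct"}
AUDIO_FORMATS = {"wav", "mp3", "flac"}


def build_voice_form(voice_id, name, voice_type, audio_format="wav",
                     description=None, language=None, prompt_text=None):
    """Return the text fields of the multipart form (hypothetical helper)."""
    if not voice_id or not name:
        raise ValueError("voice_id and name are required")
    if voice_type not in VOICE_TYPES:
        raise ValueError(f"unknown voice_type: {voice_type!r}")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"unknown audio_format: {audio_format!r}")
    fields = {
        "voice_id": voice_id,
        "name": name,
        "voice_type": voice_type,
        "audio_format": audio_format,
        "description": description,
        "language": language,
        "prompt_text": prompt_text,
    }
    # Optional fields that were not supplied are left out of the form.
    return {key: value for key, value in fields.items() if value is not None}
```

The returned dictionary could be passed as the `data` argument of an HTTP client call, alongside the file upload.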
Response:
{
"voice_id": "my_voice_01",
"name": "My Custom Voice",
"description": "A custom voice for testing",
"voice_type": "cross_lingual",
"language": "zh",
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-15T10:30:00Z",
"audio_format": "wav",
"file_size": 245760,
"duration": 5.2,
"sample_rate": 22050,
"is_active": true
}

List all voices.
Query Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| voice_type | enum | No | Filter by voice type |
| language | string | No | Filter by language |
| page | int | No | Page number (default: 1) |
| page_size | int | No | Items per page (default: 50, max: 100) |
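
As a rough illustration of the paging parameters, the sketch below builds a listing URL and keeps `page_size` at or below the documented maximum of 100. It rests on assumptions: the listing path is taken to be the same `/api/v3/voices/` path used for voice creation, the lower bound of 1 is not stated in this document, and `voices_list_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8012"
VOICES_PATH = "/api/v3/voices/"  # assumption: listing shares the creation path


def voices_list_url(page=1, page_size=50, voice_type=None, language=None):
    """Build a listing URL with page_size clamped to the range 1-100."""
    params = {
        "page": max(1, page),
        "page_size": min(max(1, page_size), 100),
    }
    # Filters are only added when the caller supplies them.
    if voice_type is not None:
        params["voice_type"] = voice_type
    if language is not None:
        params["language"] = language
    return f"{BASE_URL}{VOICES_PATH}?{urlencode(params)}"
```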
Response:
{
"voices": [...],
"total": 10,
"page": 1,
"page_size": 50
}

Get voice information by ID.
Update voice information.
Delete a voice.
Synthesize speech using a reference audio file.
Content-Type: multipart/form-data
Request Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize (max 2000 chars) |
| format | enum | No | Output format: wav, mp3, flac (default: wav) |
| speed | float | No | Speech speed 0.5-2.0 (default: 1.0) |
| stream | bool | No | Enable streaming (default: false) |
| instruct_text | string | No | V3 only: Instruction for voice control |
| prompt_audio | file | Yes | Reference audio file |
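
The limits stated in the table (2000 characters of text, speed between 0.5 and 2.0, three output formats) can be verified before a request is sent. This is a minimal client-side sketch; `check_synthesis_params` is a hypothetical name and the server may apply further checks not described here.

```python
MAX_TEXT_CHARS = 2000
SPEED_RANGE = (0.5, 2.0)
OUTPUT_FORMATS = ("wav", "mp3", "flac")


def check_synthesis_params(text, speed=1.0, fmt="wav"):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if not text:
        problems.append("text is required")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(text)} chars, limit is {MAX_TEXT_CHARS}")
    if not SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]:
        problems.append(f"speed {speed} is outside {SPEED_RANGE[0]}-{SPEED_RANGE[1]}")
    if fmt not in OUTPUT_FORMATS:
        problems.append(f"unsupported format: {fmt}")
    return problems
```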
instruct_text Examples (v3):
- "Use Cantonese dialect"
- "Speak slowly with a happy tone"
- "Use formal and professional tone"
- "Whisper softly"
Response:
{
"success": true,
"message": "CosyVoice3 synthesis completed",
"audio_url": "/api/v3/audio/v3_cross_lingual_abc12345.wav",
"file_path": "outputs/v3_cross_lingual_abc12345.wav",
"duration": 3.5,
"format": "wav",
"synthesis_time": 1.2
}

Synthesize speech using a cached voice.
Content-Type: application/json
Request Body:
{
"text": "Hello, this is a test.",
"voice_id": "my_voice_01",
"format": "wav",
"speed": 1.0,
"stream": false
}

Response:
{
"success": true,
"message": "CosyVoice3 synthesis completed",
"audio_url": "/api/v3/audio/v3_cache_abc12345.wav",
"file_path": "outputs/v3_cache_abc12345.wav",
"duration": 2.8,
"format": "wav",
"synthesis_time": 0.9
}

Synthesize speech with instruction control.
Content-Type: multipart/form-data
Request Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize |
| instruct_text | string | Yes | Instruction for voice control |
| format | enum | No | Output format (default: wav) |
| speed | float | No | Speech speed (default: 1.0) |
| stream | bool | No | Enable streaming (default: false) |
| prompt_audio | file | Yes | Reference audio file |
Get the capabilities of CosyVoice3.
Response:
{
"version": "3.0",
"model": "Fun-CosyVoice3-0.5B-2512",
"features": {
"languages": ["Chinese (Mandarin)", "English", "Japanese", "Korean", "German", "Spanish", "French", "Italian", "Russian"],
"chinese_dialects": ["Cantonese", "Sichuan", "Shanghai", "Hokkien", "Hakka", "..."],
"instruct_support": true,
"phoneme_support": {
"chinese_pinyin": true,
"english_cmu": true
},
"streaming": {
"enabled": true,
"latency_ms": 150
}
}
}

Create a synthesis task that runs in the background.
Content-Type: application/json
Request Body:
{
"text": "Text to synthesize",
"voice_id": "my_voice_01",
"format": "wav",
"speed": 1.0,
"instruct_text": "Optional instruction"
}

Response:
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "registered",
"file_path": "outputs/v3_task_abc12345.wav",
"audio_url": "/api/v3/audio/v3_task_abc12345.wav",
"estimated_duration": 1.8,
"created_at": "2024-01-15T10:30:00Z"
}

Get the status of a task.
Response:
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"file_path": "outputs/v3_task_abc12345.wav",
"audio_url": "/api/v3/audio/v3_task_abc12345.wav",
"progress": 1.0,
"duration": 3.2,
"synthesis_time": 1.1,
"created_at": "2024-01-15T10:30:00Z",
"completed_at": "2024-01-15T10:30:02Z",
"error_message": null
}

Task Status Values:
- registered: Task created, waiting to be processed
- processing: Currently synthesizing
- completed: Done, audio ready
- failed: Error occurred
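
One way a client might act on these four status values is sketched below. It assumes the decoded status response has the shape shown in the example above; `audio_url_if_ready` is a hypothetical helper name.

```python
def audio_url_if_ready(task):
    """Return the audio URL of a completed task, or None while it is pending.

    `task` is a decoded status response; a failed task raises RuntimeError.
    """
    status = task["status"]
    if status == "failed":
        raise RuntimeError(task.get("error_message") or "synthesis failed")
    if status == "completed":
        return task["audio_url"]
    if status in ("registered", "processing"):
        return None  # not finished yet, the caller should check again later
    raise ValueError(f"unexpected status: {status!r}")
```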
List all tasks.
Delete a task.
The scheduled rendering API allows you to register tasks for later background processing. This is useful for batch processing or when you want to control when the rendering starts.
1. POST /api/v3/schedule/register → Returns task_id (status: pending)
2. POST /api/v3/schedule/render/{id} → Starts background rendering (status: scheduled, then processing)
3. GET /api/v3/schedule/status/{id} → Check status, get audio_url when completed
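
The three steps above can be sketched as a single function. To keep the example independent of any HTTP library, the transport is passed in: `post(path, body)` and `get(path)` are caller-supplied callables that return the decoded JSON response. The function name, the polling interval, and the poll limit are assumptions made for illustration.

```python
import time


def run_scheduled_task(post, get, payload, poll_interval=1.0, max_polls=60):
    """Register a task, start rendering, and poll until it reaches a final state."""
    # Step 1: register the task (it stays pending until rendering is requested).
    task_id = post("/api/v3/schedule/register", payload)["task_id"]
    # Step 2: start background rendering for that task.
    post(f"/api/v3/schedule/render/{task_id}", None)
    # Step 3: poll the status endpoint until the task is finished.
    for _ in range(max_polls):
        status = get(f"/api/v3/schedule/status/{task_id}")
        if status["status"] in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

With the `requests` library, `post` could wrap `requests.post(base_url + path, json=body).json()` and `get` could wrap `requests.get(base_url + path).json()`.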
Register a new task for background rendering (does NOT start rendering).
Content-Type: application/json
Request Body:
{
"text": "Text to synthesize",
"voice_id": "my_voice_01",
"format": "wav",
"speed": 1.0,
"instruct_text": "Optional instruction for voice control",
"priority": "normal",
"callback_url": "https://your-server.com/webhook",
"metadata": {"custom_field": "custom_value"}
}

Priority Values: low, normal, high, urgent
Response:
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"message": "Task registered successfully. Call POST /schedule/render/{task_id} to start rendering.",
"queue_position": 1,
"estimated_wait_time": 2.4,
"created_at": "2024-01-15T10:30:00Z",
"priority": "normal"
}

Start background rendering for a registered task.
Response:
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "scheduled",
"message": "Background rendering started. Use GET /schedule/status/{task_id} to check progress."
}

Get the status of a scheduled task.
Response:
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"file_path": "outputs/v3_task_abc12345.wav",
"audio_url": "/api/v3/audio/v3_task_abc12345.wav",
"progress": 1.0,
"duration": 3.2,
"synthesis_time": 1.1,
"created_at": "2024-01-15T10:30:00Z",
"started_at": "2024-01-15T10:30:01Z",
"completed_at": "2024-01-15T10:30:02Z",
"priority": "normal"
}

Status Values:
- pending: Task registered, waiting for render to start
- scheduled: Task added to render queue
- processing: Currently rendering
- completed: Rendering done, audio ready for download
- failed: Rendering failed (check error_message)
- cancelled: Task was cancelled
List all scheduled tasks with optional filters.
Query Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| limit | int | No | Max tasks to return (default: 50) |
| status | string | No | Filter by status |
| priority | string | No | Filter by priority |
Cancel or delete a scheduled task.
Start rendering for all pending tasks (batch processing).
Query Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| max_tasks | int | No | Max tasks to render (default: 10) |
Get queue statistics.
Response:
{
"total_tasks": 15,
"status_counts": {
"pending": 5,
"processing": 2,
"completed": 8
},
"priority_counts": {
"normal": 10,
"high": 5
},
"average_synthesis_time": 1.234,
"completed_tasks": 8
}

HTTP streaming synthesis.
Content-Type: multipart/form-data
Request Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize |
| voice_id | string | Yes | Cached voice ID |
| format | enum | No | Audio format (default: wav) |
| speed | float | No | Speech speed (default: 1.0) |
| quality | enum | No | low, medium, high (default: medium) |
| chunk_size | int | No | Chunk size 256-8192 bytes (default: 1024) |
Response: Streaming audio data with chunked transfer encoding.
Quality Settings:
| Quality | Sample Rate | Chunk Duration |
|---|---|---|
| low | 16000 Hz | 0.5s |
| medium | 22050 Hz | 0.3s |
| high | 44100 Hz | 0.2s |
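
To make the table concrete, the number of samples in one chunk is the sample rate multiplied by the chunk duration. The sketch below works this out with integer arithmetic. The byte sizes assume uncompressed 16-bit mono PCM, which this document does not confirm, and how these figures relate to the `chunk_size` parameter is not specified here.

```python
# (sample_rate_hz, chunk_duration_ms) taken from the quality table
QUALITY_SETTINGS = {
    "low": (16000, 500),
    "medium": (22050, 300),
    "high": (44100, 200),
}


def samples_per_chunk(quality):
    """Number of audio samples in one chunk at the given quality."""
    sample_rate, duration_ms = QUALITY_SETTINGS[quality]
    return sample_rate * duration_ms // 1000


def pcm_bytes_per_chunk(quality, bytes_per_sample=2, channels=1):
    """Chunk size in bytes, assuming uncompressed PCM (16-bit mono by default)."""
    return samples_per_chunk(quality) * bytes_per_sample * channels
```

Under these assumptions the chunks hold 8000, 6615, and 8820 samples for low, medium, and high quality respectively.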
Server-Sent Events streaming.
Query Parameters: Same as above
Response: SSE stream with events:
- start: Synthesis started
- audio_chunk: Audio data (base64 encoded)
- complete: Synthesis finished
- error: Error occurred
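
A client consuming this stream needs to split it into events and decode the audio. The sketch below parses the standard SSE line format (`event:` and `data:` fields, with a blank line ending each event) and collects the audio bytes. The payload layout is an assumption: each `audio_chunk` event is taken to carry JSON with a base64 `audio_data` field, mirroring the WebSocket message format, and the function names are hypothetical.

```python
import base64
import json


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line ends the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        # comment lines (empty field name) and other fields are ignored


def collect_audio(lines):
    """Concatenate the decoded audio of all audio_chunk events (assumed layout)."""
    audio = bytearray()
    for event, data in parse_sse(lines):
        if event == "error":
            raise RuntimeError(data)
        if event == "audio_chunk":
            audio += base64.b64decode(json.loads(data)["audio_data"])
        if event == "complete":
            break
    return bytes(audio)
```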
Streaming service health check.
Bidirectional WebSocket streaming.
Connection URL: ws://localhost:8012/api/v3/ws/stream?client_id=optional_id
Message Types:
{
"message_type": "text_request",
"request_id": "unique_id",
"text": "Text to synthesize",
"voice_id": "my_voice_01",
"format": "wav",
"speed": 1.0,
"quality": "medium"
}

{
"message_type": "synthesis_start",
"request_id": "unique_id",
"timestamp": 1705312200.123,
"text": "Text to synthesize",
"voice_id": "my_voice_01"
}

{
"message_type": "audio_chunk",
"request_id": "unique_id",
"timestamp": 1705312200.456,
"audio_data": "base64_encoded_audio_data",
"metadata": {
"chunk_index": 1,
"chunk_size": 1024,
"sample_rate": 22050,
"channels": 1,
"is_final": false
}
}

{
"message_type": "synthesis_complete",
"request_id": "unique_id",
"timestamp": 1705312201.789,
"total_chunks": 15,
"total_duration": 3.2,
"synthesis_time": 1.1
}

// Client sends
{"message_type": "ping", "request_id": "ping_1"}
// Server responds
{"message_type": "pong", "request_id": "ping_1", "timestamp": 1705312200.123}

List active WebSocket sessions.
Download generated audio file.
Response: Audio file with appropriate Content-Type.
| HTTP Code | Description |
|---|---|
| 400 | Bad Request - Invalid parameters |
| 404 | Not Found - Voice/Task not found |
| 409 | Conflict - Voice already exists |
| 422 | Unprocessable Entity - Invalid audio file |
| 500 | Internal Server Error |
| 503 | Service Unavailable - Model not ready |
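
A client can combine these status codes with the `detail` field of the error body. The sketch below is one possible shape for that, not the API's own client: `CosyVoiceError` and `error_from_response` are hypothetical names, and treating only 503 as retryable is an assumption.

```python
RETRYABLE_CODES = {503}  # assumption: only "model not ready" is worth retrying

ERROR_MEANINGS = {
    400: "Bad Request",
    404: "Not Found",
    409: "Conflict",
    422: "Unprocessable Entity",
    500: "Internal Server Error",
    503: "Service Unavailable",
}


class CosyVoiceError(Exception):
    """Hypothetical client-side exception wrapping an API error response."""

    def __init__(self, status_code, detail):
        self.status_code = status_code
        self.detail = detail
        self.retryable = status_code in RETRYABLE_CODES
        label = ERROR_MEANINGS.get(status_code, "Unexpected Error")
        super().__init__(f"{status_code} {label}: {detail}")


def error_from_response(status_code, body):
    """Build an exception from a status code and the decoded JSON error body."""
    detail = body.get("detail", "") if isinstance(body, dict) else str(body)
    return CosyVoiceError(status_code, detail)
```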
import requests

BASE_URL = 'http://localhost:8012'

# Create voice (the context manager closes the sample file after the upload)
data = {
    'voice_id': 'my_voice',
    'name': 'My Voice',
    'voice_type': 'cross_lingual'
}
with open('sample.wav', 'rb') as sample:
    response = requests.post(f'{BASE_URL}/api/v3/voices/', files={'audio_file': sample}, data=data)
response.raise_for_status()

# Synthesize with cached voice
response = requests.post(f'{BASE_URL}/api/v3/cross-lingual/with-cache', json={
    'text': 'Hello world',
    'voice_id': 'my_voice'
})
response.raise_for_status()
audio_url = response.json()['audio_url']

# Download audio
audio = requests.get(f'{BASE_URL}{audio_url}')
audio.raise_for_status()
with open('output.wav', 'wb') as f:
    f.write(audio.content)

// Create voice
const formData = new FormData();
formData.append('voice_id', 'my_voice');
formData.append('name', 'My Voice');
formData.append('voice_type', 'cross_lingual');
formData.append('audio_file', audioFile);
const response = await fetch('http://localhost:8012/api/v3/voices/', {
  method: 'POST',
  body: formData
});
// Synthesize
const synthesisResponse = await fetch('http://localhost:8012/api/v3/cross-lingual/with-cache', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: 'Hello world',
    voice_id: 'my_voice'
  })
});
const { audio_url } = await synthesisResponse.json();
// WebSocket streaming
const ws = new WebSocket('ws://localhost:8012/api/v3/ws/stream');
ws.onopen = () => {
  ws.send(JSON.stringify({
    message_type: 'text_request',
    request_id: 'req_1',
    text: 'Hello world',
    voice_id: 'my_voice'
  }));
};
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.message_type === 'audio_chunk') {
    // Process audio chunk
  }
};

# Health check
curl http://localhost:8012/health
# Create voice
curl -X POST http://localhost:8012/api/v3/voices/ \
  -F "voice_id=my_voice" \
  -F "name=My Voice" \
  -F "voice_type=cross_lingual" \
  -F "audio_file=@sample.wav"
# Synthesize
curl -X POST http://localhost:8012/api/v3/cross-lingual/with-cache \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "my_voice"}'
# Download audio
curl -O http://localhost:8012/api/v3/audio/v3_cache_abc12345.wav

| Variable | Default | Description |
|---|---|---|
| HOST | 0.0.0.0 | Server host |
| PORT | 8012 | Server port |
| DEBUG | false | Debug mode |
| MODEL_DIR | models/CosyVoice2-0.5B | CosyVoice2 model path |
| MODEL_DIR_V3 | models/Fun-CosyVoice3-0.5B | CosyVoice3 model path |
| AUTO_DOWNLOAD_MODELS | true | Auto-download from HuggingFace |
| VOICE_CACHE_DIR | voice_cache | Voice cache directory |
| MAX_TEXT_LENGTH | 1000 | Max text length |
| MAX_AUDIO_DURATION | 30 | Max audio duration (seconds) |
| SAMPLE_RATE | 22050 | Default sample rate |
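
For illustration, the sketch below reads a subset of these variables with the documented defaults. It is not the server's actual configuration loader: the `Settings` class and `load_settings` function are hypothetical, and accepting `1`, `true`, or `yes` for `DEBUG` is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str = "0.0.0.0"
    port: int = 8012
    debug: bool = False
    max_text_length: int = 1000
    max_audio_duration: int = 30
    sample_rate: int = 22050


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), using table defaults."""
    env = os.environ if env is None else env
    defaults = Settings()
    return Settings(
        host=env.get("HOST", defaults.host),
        port=int(env.get("PORT", defaults.port)),
        debug=str(env.get("DEBUG", defaults.debug)).lower() in ("1", "true", "yes"),
        max_text_length=int(env.get("MAX_TEXT_LENGTH", defaults.max_text_length)),
        max_audio_duration=int(env.get("MAX_AUDIO_DURATION", defaults.max_audio_duration)),
        sample_rate=int(env.get("SAMPLE_RATE", defaults.sample_rate)),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.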
Full OpenAPI 3.0 schema available at:
- Swagger UI: http://localhost:8012/docs
- ReDoc: http://localhost:8012/redoc
- OpenAPI JSON: http://localhost:8012/openapi.json