Mkulakow/cherry pick #4000
Pull request overview
This PR updates the speech demos and docs to support (1) optional word-level timestamps for speech-to-text and (2) custom speaker embeddings/voices for text-to-speech, including corresponding export script flags and usage examples.
Changes:
- Document and expose `enable_word_timestamps` for S2T (graph option + export flag) and add timestamp usage examples in the audio demo docs.
- Add a speaker embedding creation script + requirements, and extend the export script/docs to register custom voices via `speaker_name`/`speaker_path`.
- Refine REST API documentation notes around `timestamp_granularities` / partial support indicators.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| docs/speech_recognition/reference.md | Adds enable_word_timestamps option to S2T graph docs and updates example config. |
| docs/model_server_rest_api_speech_to_text.md | Updates S2T REST API table entries for temperature/timestamps and adds enabling note. |
| demos/common/export_models/export_model.py | Adds CLI flags + graph template rendering for T2S voices and S2T word timestamps. |
| demos/audio/requirements.txt | Introduces Python deps for speaker embedding creation and OpenAI client usage. |
| demos/audio/create_speaker_embedding.py | New utility script to generate a speaker embedding binary from an audio file. |
| demos/audio/README.md | Updates demo instructions for speaker embeddings/voices and word/segment timestamp requests. |
| @@ -53,6 +54,7 @@ Above node configuration should be used as a template since user is not expected | |||
| The calculator supports the following `node_options` for tuning the pipeline configuration: | |||
| - `required string models_path` - location of the models and scheduler directory (can be relative); | |||
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |||
In the node_options list, the option is documented as optional string device, but the actual proto uses target_device (see src/audio/speech_to_text/s2t_calculator.proto). This mismatch makes it unclear which key users should set in graph.pbtxt. Update the docs to use target_device consistently (and keep the supported values/default there).
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |
| - `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] |
| | stream | ❌ | ✅ | boolean | Generate the response in streaming mode. | | ||
| | temperature | ❌ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ❌ | ✅ | array | The timestamp granularities to populate for this transcription. | | ||
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | |
The documentation says temperature is between 0 and 1, but the implementation accepts values up to 2.0 (and rejects outside 0.0–2.0). Update this range in the API table (and/or add an explicit note explaining the difference vs OpenAI) to prevent client-side validation issues.
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | | |
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 2. OpenVINO Model Server accepts values in the range 0.0–2.0 (note: OpenAI’s documentation typically states a range of 0.0–1.0). | |
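A minimal client-side check matching the range described in this comment could look as follows (a sketch; the function name is illustrative and not part of any demo or the server API):

```python
def validate_temperature(temperature: float) -> float:
    # Mirror the server-side behavior described above: values outside
    # the 0.0-2.0 range are rejected before the request is sent.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in [0.0, 2.0], got {temperature}")
    return temperature

validate_temperature(1.5)  # accepted by the server, above OpenAI's documented 1.0
```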
| parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name') | ||
| parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path') |
--speaker_name and --speaker_path are intended to be used together, but the graph template silently omits the voices block unless both are set. Add argparse validation (or an explicit error) when only one of these flags is provided to avoid creating a graph that ignores the user's input.
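A hedged sketch of such validation (the flag names mirror the export script's arguments; the surrounding parser setup and the helper name are simplified/illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
parser.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')

def validate_speaker_args(args):
    # Fail loudly when exactly one of the two flags was provided,
    # instead of silently omitting the voices block from the graph.
    if (args.speaker_name is None) != (args.speaker_path is None):
        parser.error('--speaker_name and --speaker_path must be provided together')
    return args

args = validate_speaker_args(parser.parse_args(
    ['--speaker_name', 'voice1', '--speaker_path', 'voice1.bin']))
```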
| python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin | ||
| ``` | ||
| > **Note:** Change the `--weight-format` to quantize the model to `int8` precision to reduce memory consumption and improve performance. | ||
| > **Note:** `speaker_name` and `speaker_path` may be omitted if the default model voice is sufficient |
The example passes --speaker_path /models/speakers/voice1.bin, but earlier you create the file at models/speakers/voice1.bin and the README also documents bare-metal runs (where /models/... may not exist). Consider using a path that is valid in both scenarios (e.g., relative to the model repository) or clearly scope this example to the Docker deployment with /models bind-mounted.
| The calculator supports the following `node_options` for tuning the pipeline configuration: | ||
| - `required string models_path` - location of the models and scheduler directory (can be relative); | ||
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | ||
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] |
Minor grammar: the option description says "word timestamp" (singular) but the feature and API use plural "word timestamps" (e.g., timestamp_granularities=["word"]). Consider updating the wording for consistency.
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] | |
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamps. [default = false] |
| | temperature | ❌ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ❌ | ✅ | array | The timestamp granularities to populate for this transcription. | | ||
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | |
Grammar: "To enable word timestamps ... need to be set" should be "needs to be set".
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | | |
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) | |
demos/audio/README.md
| ### Prepare speaker embeddings | ||
| When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well: | ||
| ```bash | ||
| pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt |
The README installs demos/audio/requirements.txt from the main branch. This makes the instructions non-reproducible for release branches (and can drift over time). Consider pointing to the same release branch/tag used elsewhere in this README, or reference the local file path instead.
| pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt | |
| pip3 install -r requirements.txt |
| with client.audio.speech.with_streaming_response.create( | ||
| model="microsoft/speecht5_tts", | ||
| voice="unused", | ||
| voice=None, |
In the "default voice" OpenAI Python example, voice=None sends JSON null and may not satisfy client-side type validation (and it’s inconsistent with other docs that use a placeholder like voice="unused"). Prefer omitting the voice argument entirely if you want the server default, or use a documented placeholder consistently across docs.
| voice=None, |
| audio_file = open(filename, "rb") | ||
| transcript = client.audio.transcriptions.create( | ||
| model="openai/whisper-large-v3-turbo", | ||
| language="en", | ||
| response_format="verbose_json", | ||
| timestamp_granularities=["segment", "word"], | ||
| file=audio_file | ||
| ) |
The Python example opens the audio file without closing it. Use a context manager (with open(...) as audio_file:) in the snippet to avoid leaking file descriptors in copy/pasted code.
| audio_file = open(filename, "rb") | |
| transcript = client.audio.transcriptions.create( | |
| model="openai/whisper-large-v3-turbo", | |
| language="en", | |
| response_format="verbose_json", | |
| timestamp_granularities=["segment", "word"], | |
| file=audio_file | |
| ) | |
| with open(filename, "rb") as audio_file: | |
| transcript = client.audio.transcriptions.create( | |
| model="openai/whisper-large-v3-turbo", | |
| language="en", | |
| response_format="verbose_json", | |
| timestamp_granularities=["segment", "word"], | |
| file=audio_file | |
| ) |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| target_device: "CPU", | ||
| enable_word_timestamps: true |
In the example graph snippet, target_device now ends with a comma. Other graph.pbtxt examples in the docs use protobuf text format without commas (e.g., speech_generation/reference.md), so this is inconsistent and could confuse users copying the config. Consider removing the comma to keep formatting consistent across docs.
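Protobuf text format separates fields with newlines, so the snippet can simply drop the comma (field names taken from the example above):

```
target_device: "CPU"
enable_word_timestamps: true
```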
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | ||
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] |
The node_options list documents optional string device, but the proto field and the example graph use target_device. This mismatch makes it unclear which key users should set. Update the bullet to optional string target_device and consider tweaking the new enable_word_timestamps description grammar ("word timestamps").
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] | |
| - `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |
| - `optional bool enable_word_timestamps` - set to true if the model should support user requests for word timestamps. [default = false] |
| embedding = classifier.encode_batch(signal) | ||
| embedding = torch.nn.functional.normalize(embedding, dim=2) |
This script runs speaker embedding extraction under normal autograd tracking, which can add unnecessary memory/overhead during inference because model parameters require gradients by default. Consider wrapping the embedding computation in torch.inference_mode() (or torch.no_grad()) to make it explicitly inference-only.
| embedding = classifier.encode_batch(signal) | |
| embedding = torch.nn.functional.normalize(embedding, dim=2) | |
| with torch.inference_mode(): | |
| embedding = classifier.encode_batch(signal) | |
| embedding = torch.nn.functional.normalize(embedding, dim=2) |
| | temperature | ❌ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ❌ | ✅ | array | The timestamp granularities to populate for this transcription. | | ||
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | |
The server parses timestamp granularities from multipart field name timestamp_granularities[] (array-style), but the parameter name in the table is listed as timestamp_granularities and doesn’t mention the required [] suffix. Please align the documented form field name with what the server actually expects, and fix the grammar in the note ("needs to be set").
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | | |
| | timestamp_granularities[] | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) | |
| models_path: "{{model_path}}", | ||
| plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }', | ||
| target_device: "{{target_device|default("CPU", true)}}" | ||
| target_device: "{{target_device|default("CPU", true)}}", | ||
| {%- if speaker_name and speaker_path %} | ||
| voices: [ | ||
| { | ||
| name: "{{speaker_name}}", | ||
| path: "{{speaker_path}}" | ||
| } | ||
| ]{% endif %} |
In t2s_graph_template, target_device is always followed by a comma, but voices is conditional. When no speaker is provided, the rendered graph.pbtxt will end up with a dangling trailing comma after target_device. Please move the comma into the conditional block (or otherwise avoid emitting a trailing comma) to keep the generated graph formatting consistent with the other templates in this script.
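One way to avoid the dangling comma is to emit it only inside the conditional block (a sketch of the Jinja fragment, not the verbatim template; the `{%- %}` whitespace control keeps the rendering tight):

```
target_device: "{{target_device|default("CPU", true)}}"
{%- if speaker_name and speaker_path %},
voices: [
  {
    name: "{{speaker_name}}",
    path: "{{speaker_path}}"
  }
]
{%- endif %}
```

With both variables set this renders `target_device: "CPU",` followed by the voices block; with neither set, the comma and the block are omitted entirely.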
| plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }', | ||
| target_device: "{{target_device|default("CPU", true)}}" | ||
| target_device: "{{target_device|default("CPU", true)}}", | ||
| enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%}, |
In s2t_graph_template, enable_word_timestamps is the last field but is rendered with a trailing comma. Other templates in this file generally avoid trailing commas on the last field, so this inconsistency makes the generated graph harder to compare/debug. Consider rendering the boolean without a trailing comma (and you can simplify the Jinja to emit true/false directly).
| enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%}, | |
| enable_word_timestamps: {{ 'true' if enable_word_timestamps else 'false' }} |
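Protobuf text format expects lowercase booleans; a plain-Python check of what the simplified expression emits (a sketch independent of the actual template engine, with an illustrative function name):

```python
def render_word_timestamps_flag(enabled: bool) -> str:
    # Equivalent to the Jinja expression
    # {{ 'true' if enable_word_timestamps else 'false' }} — no trailing comma.
    return f"enable_word_timestamps: {'true' if enabled else 'false'}"

print(render_word_timestamps_flag(False))  # → enable_word_timestamps: false
```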
| torch==2.9.1+cpu | ||
| torchaudio==2.9.1+cpu | ||
| speechbrain==1.0.3 | ||
| openai==2.21.0 No newline at end of file |
openai==2.21.0 is inconsistent with the OpenAI client version pinned in other demos (e.g., openai==1.107.0). Unless there’s a strong reason to require a different major version here, consider aligning the version (or loosening it to a compatible range) so users don’t run into API differences between demos.
| openai==2.21.0 | |
| openai==1.107.0 |
| python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin | ||
| ``` |
The example passes --speaker_path /models/speakers/voice1.bin, which matches the Docker volume mount shown later, but it won’t exist for the Bare Metal deployment section (where the model repo is models/...). Please clarify that speaker_path must match the runtime filesystem (or suggest using a path relative to the model repository) to avoid a broken bare-metal setup.
No description provided.