
Mkulakow/cherry pick #4000

Merged

michalkulakowski merged 4 commits into releases/2026/0 from mkulakow/cherry_pick on Feb 23, 2026
Conversation

@michalkulakowski
Collaborator

No description provided.

@mzegla mzegla requested a review from Copilot February 20, 2026 08:57
### Prepare speaker embeddings
When generating speech, you can use the default speaker voice or prepare your own speaker embedding file. The steps below use a file downloaded from an online repository, but you can use your own speech recording as well:
```bash
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
```
Collaborator


main?

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the speech demos and docs to support (1) optional word-level timestamps for speech-to-text and (2) custom speaker embeddings/voices for text-to-speech, including corresponding export script flags and usage examples.

Changes:

  • Document and expose enable_word_timestamps for S2T (graph option + export flag) and add timestamp usage examples in the audio demo docs.
  • Add speaker embedding creation script + requirements, and extend the export script/docs to register custom voices via speaker_name/speaker_path.
  • Refine REST API documentation notes around timestamp_granularities / partial support indicators.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
|------|-------------|
| docs/speech_recognition/reference.md | Adds `enable_word_timestamps` option to S2T graph docs and updates example config. |
| docs/model_server_rest_api_speech_to_text.md | Updates S2T REST API table entries for temperature/timestamps and adds enabling note. |
| demos/common/export_models/export_model.py | Adds CLI flags + graph template rendering for T2S voices and S2T word timestamps. |
| demos/audio/requirements.txt | Introduces Python deps for speaker embedding creation and OpenAI client usage. |
| demos/audio/create_speaker_embedding.py | New utility script to generate a speaker embedding binary from an audio file. |
| demos/audio/README.md | Updates demo instructions for speaker embeddings/voices and word/segment timestamp requests. |


@@ -53,6 +54,7 @@ Above node configuration should be used as a template since user is not expected
The calculator supports the following `node_options` for tuning the pipeline configuration:
- `required string models_path` - location of the models and scheduler directory (can be relative);
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
Copilot AI Feb 20, 2026

In the node_options list, the option is documented as optional string device, but the actual proto uses target_device (see src/audio/speech_to_text/s2t_calculator.proto). This mismatch makes it unclear which key users should set in graph.pbtxt. Update the docs to use target_device consistently (and keep the supported values/default there).

Suggested change
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]

| stream | ❌ | ✅ | boolean | Generate the response in streaming mode. |
| temperature | | ✅ | number | The sampling temperature, between 0 and 1. |
| timestamp_granularities | | ✅ | array | The timestamp granularities to populate for this transcription. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |
Copilot AI Feb 20, 2026

The documentation says temperature is between 0 and 1, but the implementation accepts values up to 2.0 (and rejects outside 0.0–2.0). Update this range in the API table (and/or add an explicit note explaining the difference vs OpenAI) to prevent client-side validation issues.

Suggested change
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 2. OpenVINO Model Server accepts values in the range 0.0–2.0 (note: OpenAI’s documentation typically states a range of 0.0–1.0). |
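The range check the comment describes can be sketched as a small client-side guard. This is a minimal illustration only; the helper name and error message are hypothetical and not part of OpenVINO Model Server or any SDK:

```python
def validate_temperature(temperature: float) -> float:
    # Hypothetical guard mirroring the 0.0-2.0 range the review describes;
    # values outside this range are rejected by the server implementation.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(
            f"temperature must be between 0.0 and 2.0, got {temperature}"
        )
    return temperature
```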

Comment on lines +94 to +95
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')
Copilot AI Feb 20, 2026

--speaker_name and --speaker_path are intended to be used together, but the graph template silently omits the voices block unless both are set. Add argparse validation (or an explicit error) when only one of these flags is provided to avoid creating a graph that ignores the user's input.
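A minimal sketch of the argparse validation being suggested. The parser wiring here is a simplified stand-in, not the real `export_model.py` subparser:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Simplified stand-in for the text2speech subparser in export_model.py.
    parser = argparse.ArgumentParser()
    parser.add_argument('--speaker_name', type=str, dest='speaker_name')
    parser.add_argument('--speaker_path', type=str, dest='speaker_path')
    return parser

def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Reject the half-configured case: require both flags or neither,
    # so the graph template never silently drops the voices block.
    if (args.speaker_name is None) != (args.speaker_path is None):
        parser.error('--speaker_name and --speaker_path must be provided together')
    return args
```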

Comment on lines +54 to +58
python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin
```

> **Note:** Change the `--weight-format` to quantize the model to `int8` precision to reduce memory consumption and improve performance.
> **Note:** `speaker_name` and `speaker_path` may be omitted if the default model voice is sufficient.
Copilot AI Feb 20, 2026

The example passes --speaker_path /models/speakers/voice1.bin, but earlier you create the file at models/speakers/voice1.bin and the README also documents bare-metal runs (where /models/... may not exist). Consider using a path that is valid in both scenarios (e.g., relative to the model repository) or clearly scope this example to the Docker deployment with /models bind-mounted.

The calculator supports the following `node_options` for tuning the pipeline configuration:
- `required string models_path` - location of the models and scheduler directory (can be relative);
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
Copilot AI Feb 20, 2026

Minor grammar: the option description says "word timestamp" (singular) but the feature and API use plural "word timestamps" (e.g., timestamp_granularities=["word"]). Consider updating the wording for consistency.

Suggested change
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamps. [default = false]

| temperature | | ✅ | number | The sampling temperature, between 0 and 1. |
| timestamp_granularities | | ✅ | array | The timestamp granularities to populate for this transcription. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
Copilot AI Feb 20, 2026

Grammar: "To enable word timestamps ... need to be set" should be "needs to be set".

Suggested change
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) |

### Prepare speaker embeddings
When generating speech, you can use the default speaker voice or prepare your own speaker embedding file. The steps below use a file downloaded from an online repository, but you can use your own speech recording as well:
```bash
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
```
Copilot AI Feb 20, 2026

The README installs demos/audio/requirements.txt from the main branch. This makes the instructions non-reproducible for release branches (and can drift over time). Consider pointing to the same release branch/tag used elsewhere in this README, or reference the local file path instead.

Suggested change
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
pip3 install -r requirements.txt

with client.audio.speech.with_streaming_response.create(
model="microsoft/speecht5_tts",
voice="unused",
voice=None,
Copilot AI Feb 20, 2026

In the "default voice" OpenAI Python example, voice=None sends JSON null and may not satisfy client-side type validation (and it’s inconsistent with other docs that use a placeholder like voice="unused"). Prefer omitting the voice argument entirely if you want the server default, or use a documented placeholder consistently across docs.

Suggested change
voice=None,

Comment on lines +302 to +309
audio_file = open(filename, "rb")
transcript = client.audio.transcriptions.create(
model="openai/whisper-large-v3-turbo",
language="en",
response_format="verbose_json",
timestamp_granularities=["segment", "word"],
file=audio_file
)
Copilot AI Feb 20, 2026

The Python example opens the audio file without closing it. Use a context manager (with open(...) as audio_file:) in the snippet to avoid leaking file descriptors in copy/pasted code.

Suggested change
audio_file = open(filename, "rb")
transcript = client.audio.transcriptions.create(
model="openai/whisper-large-v3-turbo",
language="en",
response_format="verbose_json",
timestamp_granularities=["segment", "word"],
file=audio_file
)
with open(filename, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        language="en",
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],
        file=audio_file
    )

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.



Comment on lines +45 to +46
target_device: "CPU",
enable_word_timestamps: true
Copilot AI Feb 20, 2026

In the example graph snippet, target_device now ends with a comma. Other graph.pbtxt examples in the docs use protobuf text format without commas (e.g., speech_generation/reference.md), so this is inconsistent and could confuse users copying the config. Consider removing the comma to keep formatting consistent across docs.

Comment on lines 56 to +57
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
Copilot AI Feb 20, 2026

The node_options list documents optional string device, but the proto field and the example graph use target_device. This mismatch makes it unclear which key users should set. Update the bullet to optional string target_device and consider tweaking the new enable_word_timestamps description grammar ("word timestamps").

Suggested change
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
- `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if the model should support user requests for word timestamps. [default = false]

Comment on lines +31 to +32
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
Copilot AI Feb 20, 2026

This script runs speaker embedding extraction under normal autograd tracking, which can add unnecessary memory/overhead during inference because model parameters require gradients by default. Consider wrapping the embedding computation in torch.inference_mode() (or torch.no_grad()) to make it explicitly inference-only.

Suggested change
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
with torch.inference_mode():
    embedding = classifier.encode_batch(signal)
    embedding = torch.nn.functional.normalize(embedding, dim=2)

| temperature | | ✅ | number | The sampling temperature, between 0 and 1. |
| timestamp_granularities | | ✅ | array | The timestamp granularities to populate for this transcription. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
Copilot AI Feb 20, 2026

The server parses timestamp granularities from multipart field name timestamp_granularities[] (array-style), but the parameter name in the table is listed as timestamp_granularities and doesn’t mention the required [] suffix. Please align the documented form field name with what the server actually expects, and fix the grammar in the note ("needs to be set").

Suggested change
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
| timestamp_granularities[] | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) |

Comment on lines 115 to +124
models_path: "{{model_path}}",
plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
target_device: "{{target_device|default("CPU", true)}}"
target_device: "{{target_device|default("CPU", true)}}",
{%- if speaker_name and speaker_path %}
voices: [
{
name: "{{speaker_name}}",
path: "{{speaker_path}}"
}
]{% endif %}
Copilot AI Feb 20, 2026

In t2s_graph_template, target_device is always followed by a comma, but voices is conditional. When no speaker is provided, the rendered graph.pbtxt will end up with a dangling trailing comma after target_device. Please move the comma into the conditional block (or otherwise avoid emitting a trailing comma) to keep the generated graph formatting consistent with the other templates in this script.
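One way to implement this (a sketch only; indentation and the surrounding template text are simplified relative to the real `t2s_graph_template`) is to move the comma inside the conditional so it is only emitted when the voices block follows:

```jinja
target_device: "{{target_device|default("CPU", true)}}"{%- if speaker_name and speaker_path %},
    voices: [
        {
            name: "{{speaker_name}}",
            path: "{{speaker_path}}"
        }
    ]{% endif %}
```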

plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
target_device: "{{target_device|default("CPU", true)}}"
target_device: "{{target_device|default("CPU", true)}}",
enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%},
Copilot AI Feb 20, 2026

In s2t_graph_template, enable_word_timestamps is the last field but is rendered with a trailing comma. Other templates in this file generally avoid trailing commas on the last field, so this inconsistency makes the generated graph harder to compare/debug. Consider rendering the boolean without a trailing comma (and you can simplify the Jinja to emit true/false directly).

Suggested change
enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%},
enable_word_timestamps: {{ 'true' if enable_word_timestamps else 'false' }}

torch==2.9.1+cpu
torchaudio==2.9.1+cpu
speechbrain==1.0.3
openai==2.21.0
Copilot AI Feb 20, 2026

openai==2.21.0 is inconsistent with the OpenAI client version pinned in other demos (e.g., openai==1.107.0). Unless there’s a strong reason to require a different major version here, consider aligning the version (or loosening it to a compatible range) so users don’t run into API differences between demos.

Suggested change
openai==2.21.0
openai==1.107.0

Comment on lines +54 to 55
python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin
```
Copilot AI Feb 20, 2026

The example passes --speaker_path /models/speakers/voice1.bin, which matches the Docker volume mount shown later, but it won’t exist for the Bare Metal deployment section (where the model repo is models/...). Please clarify that speaker_path must match the runtime filesystem (or suggest using a path relative to the model repository) to avoid a broken bare-metal setup.

@michalkulakowski michalkulakowski merged commit faf4e20 into releases/2026/0 Feb 23, 2026
1 check passed

3 participants