Mkulakow/cherry pick #4000
Pull request overview
This PR updates the speech demos and docs to support (1) optional word-level timestamps for speech-to-text and (2) custom speaker embeddings/voices for text-to-speech, including corresponding export script flags and usage examples.
Changes:
- Document and expose `enable_word_timestamps` for S2T (graph option + export flag) and add timestamp usage examples in the audio demo docs.
- Add a speaker embedding creation script + requirements, and extend the export script/docs to register custom voices via `speaker_name`/`speaker_path`.
- Refine REST API documentation notes around `timestamp_granularities` / partial support indicators.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| docs/speech_recognition/reference.md | Adds enable_word_timestamps option to S2T graph docs and updates example config. |
| docs/model_server_rest_api_speech_to_text.md | Updates S2T REST API table entries for temperature/timestamps and adds enabling note. |
| demos/common/export_models/export_model.py | Adds CLI flags + graph template rendering for T2S voices and S2T word timestamps. |
| demos/audio/requirements.txt | Introduces Python deps for speaker embedding creation and OpenAI client usage. |
| demos/audio/create_speaker_embedding.py | New utility script to generate a speaker embedding binary from an audio file. |
| demos/audio/README.md | Updates demo instructions for speaker embeddings/voices and word/segment timestamp requests. |
| @@ -53,6 +54,7 @@ Above node configuration should be used as a template since user is not expected | |||
| The calculator supports the following `node_options` for tuning the pipeline configuration: | |||
| - `required string models_path` - location of the models and scheduler directory (can be relative); | |||
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |||
In the node_options list, the option is documented as optional string device, but the actual proto uses target_device (see src/audio/speech_to_text/s2t_calculator.proto). This mismatch makes it unclear which key users should set in graph.pbtxt. Update the docs to use target_device consistently (and keep the supported values/default there).
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |
| - `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] |
| | stream | ❌ | ✅ | boolean | Generate the response in streaming mode. | | ||
| | temperature | ❌ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ❌ | ✅ | array | The timestamp granularities to populate for this transcription. | | ||
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | |
The documentation says temperature is between 0 and 1, but the implementation accepts values up to 2.0 (and rejects outside 0.0–2.0). Update this range in the API table (and/or add an explicit note explaining the difference vs OpenAI) to prevent client-side validation issues.
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | | |
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 2. OpenVINO Model Server accepts values in the range 0.0–2.0 (note: OpenAI’s documentation typically states a range of 0.0–1.0). | |
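A minimal client-side check matching the range described in this comment could look as follows (a sketch; the function name is illustrative and not part of any demo or the server API):

```python
def validate_temperature(temperature: float) -> float:
    # Mirror the server-side behavior described above: values outside
    # the 0.0-2.0 range are rejected before the request is sent.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in [0.0, 2.0], got {temperature}")
    return temperature

validate_temperature(1.5)  # accepted by the server, above OpenAI's documented 1.0
```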
| parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name') | ||
| parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path') |
--speaker_name and --speaker_path are intended to be used together, but the graph template silently omits the voices block unless both are set. Add argparse validation (or an explicit error) when only one of these flags is provided to avoid creating a graph that ignores the user's input.
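A hedged sketch of such validation (the flag names mirror the export script's arguments; the surrounding parser setup and the helper name are simplified/illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
parser.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')

def validate_speaker_args(args):
    # Fail loudly when exactly one of the two flags was provided,
    # instead of silently omitting the voices block from the graph.
    if (args.speaker_name is None) != (args.speaker_path is None):
        parser.error('--speaker_name and --speaker_path must be provided together')
    return args

args = validate_speaker_args(parser.parse_args(
    ['--speaker_name', 'voice1', '--speaker_path', 'voice1.bin']))
```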
| python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin | ||
| ``` | ||
| > **Note:** Change the `--weight-format` to quantize the model to `int8` precision to reduce memory consumption and improve performance. | ||
| > **Note:** `speaker_name` and `speaker_path` may be omitted if the default model voice is sufficient |
The example passes --speaker_path /models/speakers/voice1.bin, but earlier you create the file at models/speakers/voice1.bin and the README also documents bare-metal runs (where /models/... may not exist). Consider using a path that is valid in both scenarios (e.g., relative to the model repository) or clearly scope this example to the Docker deployment with /models bind-mounted.
| The calculator supports the following `node_options` for tuning the pipeline configuration: | ||
| - `required string models_path` - location of the models and scheduler directory (can be relative); | ||
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | ||
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] |
Minor grammar: the option description says "word timestamp" (singular) but the feature and API use plural "word timestamps" (e.g., timestamp_granularities=["word"]). Consider updating the wording for consistency.
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] | |
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamps. [default = false] |
| | temperature | ❌ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ❌ | ✅ | array | The timestamp granularities to populate for this transcription. | | ||
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | |
Grammar: "To enable word timestamps ... need to be set" should be "needs to be set".
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | | |
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) | |
demos/audio/README.md
| ### Prepare speaker embeddings | ||
| When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well: | ||
| ```bash | ||
| pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt |
The README installs demos/audio/requirements.txt from the main branch. This makes the instructions non-reproducible for release branches (and can drift over time). Consider pointing to the same release branch/tag used elsewhere in this README, or reference the local file path instead.
| pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt | |
| pip3 install -r requirements.txt |
| with client.audio.speech.with_streaming_response.create( | ||
| model="microsoft/speecht5_tts", | ||
| voice="unused", | ||
| voice=None, |
In the "default voice" OpenAI Python example, voice=None sends JSON null and may not satisfy client-side type validation (and it’s inconsistent with other docs that use a placeholder like voice="unused"). Prefer omitting the voice argument entirely if you want the server default, or use a documented placeholder consistently across docs.
| voice=None, |
| audio_file = open(filename, "rb") | ||
| transcript = client.audio.transcriptions.create( | ||
| model="openai/whisper-large-v3-turbo", | ||
| language="en", | ||
| response_format="verbose_json", | ||
| timestamp_granularities=["segment", "word"], | ||
| file=audio_file | ||
| ) |
The Python example opens the audio file without closing it. Use a context manager (with open(...) as audio_file:) in the snippet to avoid leaking file descriptors in copy/pasted code.
| audio_file = open(filename, "rb") | |
| transcript = client.audio.transcriptions.create( | |
| model="openai/whisper-large-v3-turbo", | |
| language="en", | |
| response_format="verbose_json", | |
| timestamp_granularities=["segment", "word"], | |
| file=audio_file | |
| ) | |
| with open(filename, "rb") as audio_file: | |
| transcript = client.audio.transcriptions.create( | |
| model="openai/whisper-large-v3-turbo", | |
| language="en", | |
| response_format="verbose_json", | |
| timestamp_granularities=["segment", "word"], | |
| file=audio_file | |
| ) |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| target_device: "CPU", | ||
| enable_word_timestamps: true |
In the example graph snippet, target_device now ends with a comma. Other graph.pbtxt examples in the docs use protobuf text format without commas (e.g., speech_generation/reference.md), so this is inconsistent and could confuse users copying the config. Consider removing the comma to keep formatting consistent across docs.
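Protobuf text format separates fields with newlines, so the snippet can simply drop the comma (field names taken from the example above):

```
target_device: "CPU"
enable_word_timestamps: true
```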
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | ||
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] |
The node_options list documents optional string device, but the proto field and the example graph use target_device. This mismatch makes it unclear which key users should set. Update the bullet to optional string target_device and consider tweaking the new enable_word_timestamps description grammar ("word timestamps").
| - `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |
| - `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false] | |
| - `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"] | |
| - `optional bool enable_word_timestamps` - set to true if the model should support user requests for word timestamps. [default = false] |
| embedding = classifier.encode_batch(signal) | ||
| embedding = torch.nn.functional.normalize(embedding, dim=2) |
This script runs speaker embedding extraction under normal autograd tracking, which can add unnecessary memory/overhead during inference because model parameters require gradients by default. Consider wrapping the embedding computation in torch.inference_mode() (or torch.no_grad()) to make it explicitly inference-only.
| embedding = classifier.encode_batch(signal) | |
| embedding = torch.nn.functional.normalize(embedding, dim=2) | |
| with torch.inference_mode(): | |
| embedding = classifier.encode_batch(signal) | |
| embedding = torch.nn.functional.normalize(embedding, dim=2) |
| | temperature | ❌ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ❌ | ✅ | array | The timestamp granularities to populate for this transcription. | | ||
| | temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. | | ||
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | |
The server parses timestamp granularities from multipart field name timestamp_granularities[] (array-style), but the parameter name in the table is listed as timestamp_granularities and doesn’t mention the required [] suffix. Please align the documented form field name with what the server actually expects, and fix the grammar in the note ("needs to be set").
| | timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) | | |
| | timestamp_granularities[] | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) | |
| models_path: "{{model_path}}", | ||
| plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }', | ||
| target_device: "{{target_device|default("CPU", true)}}" | ||
| target_device: "{{target_device|default("CPU", true)}}", | ||
| {%- if speaker_name and speaker_path %} | ||
| voices: [ | ||
| { | ||
| name: "{{speaker_name}}", | ||
| path: "{{speaker_path}}" | ||
| } | ||
| ]{% endif %} |
In t2s_graph_template, target_device is always followed by a comma, but voices is conditional. When no speaker is provided, the rendered graph.pbtxt will end up with a dangling trailing comma after target_device. Please move the comma into the conditional block (or otherwise avoid emitting a trailing comma) to keep the generated graph formatting consistent with the other templates in this script.
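One way to avoid the dangling comma is to emit it only inside the conditional block (a sketch of the Jinja fragment, not the verbatim template; the `{%- %}` whitespace control keeps the rendering tight):

```
target_device: "{{target_device|default("CPU", true)}}"
{%- if speaker_name and speaker_path %},
voices: [
  {
    name: "{{speaker_name}}",
    path: "{{speaker_path}}"
  }
]
{%- endif %}
```

With both variables set this renders `target_device: "CPU",` followed by the voices block; with neither set, the comma and the block are omitted entirely.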
| plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }', | ||
| target_device: "{{target_device|default("CPU", true)}}" | ||
| target_device: "{{target_device|default("CPU", true)}}", | ||
| enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%}, |
In s2t_graph_template, enable_word_timestamps is the last field but is rendered with a trailing comma. Other templates in this file generally avoid trailing commas on the last field, so this inconsistency makes the generated graph harder to compare/debug. Consider rendering the boolean without a trailing comma (and you can simplify the Jinja to emit true/false directly).
| enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%}, | |
| enable_word_timestamps: {{ 'true' if enable_word_timestamps else 'false' }} |
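Protobuf text format expects lowercase booleans; a plain-Python check of what the simplified expression emits (a sketch independent of the actual template engine, with an illustrative function name):

```python
def render_word_timestamps_flag(enabled: bool) -> str:
    # Equivalent to the Jinja expression
    # {{ 'true' if enable_word_timestamps else 'false' }} — no trailing comma.
    return f"enable_word_timestamps: {'true' if enabled else 'false'}"

print(render_word_timestamps_flag(False))  # → enable_word_timestamps: false
```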
| torch==2.9.1+cpu | ||
| torchaudio==2.9.1+cpu | ||
| speechbrain==1.0.3 | ||
| openai==2.21.0 No newline at end of file |
openai==2.21.0 is inconsistent with the OpenAI client version pinned in other demos (e.g., openai==1.107.0). Unless there’s a strong reason to require a different major version here, consider aligning the version (or loosening it to a compatible range) so users don’t run into API differences between demos.
| openai==2.21.0 | |
| openai==1.107.0 |
| python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin | ||
| ``` |
The example passes --speaker_path /models/speakers/voice1.bin, which matches the Docker volume mount shown later, but it won’t exist for the Bare Metal deployment section (where the model repo is models/...). Please clarify that speaker_path must match the runtime filesystem (or suggest using a path relative to the model repository) to avoid a broken bare-metal setup.
No description provided.