Mkulakow/cherry pick #4000
@@ -16,6 +16,17 @@ Check supported [Speech Recognition Models](https://openvinotoolkit.github.io/op
**Client**: curl or Python for using OpenAI client package

## Speech generation
### Prepare speaker embeddings
When generating speech you can use the default speaker voice, or you can prepare your own speaker embedding file. The example below uses a recording downloaded from an online repository, but you can use your own speech recording as well:
```bash
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2026/0/demos/audio/requirements.txt
mkdir -p audio_samples
curl --output audio_samples/audio.wav "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0032_8k.wav"
mkdir -p models/speakers
python create_speaker_embedding.py audio_samples/audio.wav models/speakers/voice1.bin
```
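The downloaded sample appears to be an 8 kHz recording (per its filename); the embedding script resamples to 16 kHz internally, but you can check a recording's sample rate beforehand with the stdlib `wave` module. A minimal sketch — the tiny generated file here is only a stand-in for your own recording:

```python
import struct
import wave

# Create a tiny 8 kHz mono WAV as a stand-in for audio_samples/audio.wav
with wave.open("audio.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                                # 16-bit PCM
    w.setframerate(8000)
    w.writeframes(struct.pack("<80h", *([0] * 80)))  # 10 ms of silence

# Inspect the file before feeding it to create_speaker_embedding.py
with wave.open("audio.wav", "rb") as w:
    rate = w.getframerate()
    channels = w.getnchannels()

print(rate, channels)  # 8000 1
```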
### Model preparation
Supported models should use the topology of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts), which needs to be converted to IR format before use in OVMS.
@@ -40,48 +51,14 @@ Run `export_model.py` script to download and quantize the model:

**CPU**
```console
python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan
python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin
```

> **Note:** Change the `--weight-format` to quantize the model to `int8` precision to reduce memory consumption and improve performance.
> **Note:** `speaker_name` and `speaker_path` may be omitted if the default model voice is sufficient.
The default configuration should work in most cases, but the parameters can be tuned via `export_model.py` script arguments. Run the script with the `--help` argument to check available parameters, and see the [T2s calculator documentation](../../docs/speech_generation/reference.md) to learn more about configuration options and limitations.
### Speaker embeddings

Instead of generating speech with the default model voice, you can create speaker embeddings with [this script](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/speech_generation/create_speaker_embedding.py):
```bash
curl --output create_speaker_embedding.py "https://raw.githubusercontent.com/openvinotoolkit/openvino.genai/refs/heads/master/samples/python/speech_generation/create_speaker_embedding.py"
python create_speaker_embedding.py
mv speaker_embedding.bin models/
```
The script records your speech for 5 seconds (you can adjust the recording duration to achieve better results) and then, using the speechbrain/spkrec-xvect-voxceleb model, creates a `speaker_embedding.bin` file that contains your speaker embedding.
Now you need to add the speaker embedding path to the `graph.pbtxt` file of the text2speech graph:
```
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"
node {
  name: "T2sExecutor"
  input_side_packet: "TTS_NODE_RESOURCES:t2s_servable"
  calculator: "T2sCalculator"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  node_options: {
    [type.googleapis.com / mediapipe.T2sCalculatorOptions]: {
      models_path: "./",
      plugin_config: '{ "NUM_STREAMS": "1" }',
      target_device: "CPU",
      voices: [
        {
          name: "voice",
          path: "/models/speaker_embedding.bin",
        }
      ]
    }
  }
}
```
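The on-disk layout of the embedding file is not documented in this demo. Assuming it is stored as raw little-endian float32 values (the spkrec-xvect-voxceleb model produces a 512-dimensional x-vector), a quick sanity check of a `.bin` file could look like the sketch below — `load_embedding` is a hypothetical helper, not part of OVMS:

```python
import struct

def load_embedding(path):
    """Read a file of raw little-endian float32 values into a list."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % 4 != 0:
        raise ValueError(f"file size {len(data)} is not a multiple of 4 bytes")
    count = len(data) // 4
    return list(struct.unpack(f"<{count}f", data))

# Round-trip a small example vector (a real x-vector would have 512 values)
values = [0.25, -0.5, 1.0]
with open("speaker_embedding.bin", "wb") as f:
    f.write(struct.pack(f"<{len(values)}f", *values))

embedding = load_embedding("speaker_embedding.bin")
print(len(embedding))  # 3
```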

### Deployment

**CPU**

@@ -95,21 +72,21 @@ docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw
**Deploying on Bare Metal**

```bat
mkdir models
ovms --rest_port 8000 --source_model microsoft/speecht5_tts --model_repository_path models --model_name microsoft/speecht5_tts --task text2speech --target_device CPU
mkdir -p models
ovms --rest_port 8000 --model_path models/microsoft/speecht5_tts --model_name microsoft/speecht5_tts
```

### Request Generation

:::{dropdown} **Unary call with curl**
:::{dropdown} **Unary call with curl with default voice**

```bash
curl http://localhost:8000/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"microsoft/speecht5_tts\", \"input\": \"The quick brown fox jumped over the lazy dog\"}" -o speech.wav
```
:::
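After the call you can verify that the server returned a playable WAV file. A minimal sketch using the stdlib `wave` module — the file below is generated locally as a stand-in for the actual server response:

```python
import struct
import wave

# Stand-in for the speech.wav produced by the curl call above
with wave.open("speech.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(struct.pack("<160h", *([0] * 160)))

# Basic sanity checks on the response file
with wave.open("speech.wav", "rb") as w:
    duration_s = w.getnframes() / w.getframerate()
    print(f"{w.getnchannels()} channel(s), {w.getframerate()} Hz, {duration_s:.3f} s")
```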

:::{dropdown} **Unary call with OpenAi python library**
:::{dropdown} **Unary call with OpenAI python library with default voice**

```python
from pathlib import Path

# @@ -125,7 +102,41 @@ client = OpenAI(base_url=url, api_key="not_used")

with client.audio.speech.with_streaming_response.create(
    model="microsoft/speecht5_tts",
    voice="unused",
    voice=None,
```
Copilot AI commented on Feb 20, 2026:

The Python example opens the audio file without closing it. Use a context manager (`with open(...) as audio_file:`) in the snippet to avoid leaking file descriptors in copy/pasted code.

Suggested change:

    # before
    audio_file = open(filename, "rb")
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        language="en",
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],
        file=audio_file
    )

    # after
    with open(filename, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="openai/whisper-large-v3-turbo",
            language="en",
            response_format="verbose_json",
            timestamp_granularities=["segment", "word"],
            file=audio_file
        )
@@ -0,0 +1,36 @@
    #!/usr/bin/env python3
    # Copyright (C) 2026 Intel Corporation
    # SPDX-License-Identifier: Apache-2.0

    import torch
    import torchaudio
    from speechbrain.inference.speaker import EncoderClassifier
    import sys

    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} <input_audio_file> <output_embedding_file>")
        sys.exit(1)

    file = sys.argv[1]
    signal, fs = torchaudio.load(file)
    if signal.shape[0] > 1:
        signal = torch.mean(signal, dim=0, keepdim=True)
    expected_sample_rate = 16000
    if fs != expected_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=fs, new_freq=expected_sample_rate)
        signal = resampler(signal)

    if signal.ndim != 2 or signal.shape[0] != 1:
        print(f"Error: expected signal shape [1, num_samples], got {list(signal.shape)}")
        sys.exit(1)
    if signal.shape[1] == 0:
        print("Error: audio file contains no samples")
        sys.exit(1)

    classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
    embedding = classifier.encode_batch(signal)
    embedding = torch.nn.functional.normalize(embedding, dim=2)
Suggested change on lines +31 to +32:

    # before
    embedding = classifier.encode_batch(signal)
    embedding = torch.nn.functional.normalize(embedding, dim=2)

    # after
    with torch.inference_mode():
        embedding = classifier.encode_batch(signal)
        embedding = torch.nn.functional.normalize(embedding, dim=2)
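The tail of the script (writing the embedding to `sys.argv[2]`) falls outside this hunk. A hypothetical sketch of such a save step, assuming the normalized embedding is flattened and serialized as raw little-endian float32 — this is an illustration, not the script's actual code:

```python
import struct

def save_embedding(values, path):
    # Hypothetical save step: flatten the embedding and write it as raw
    # little-endian float32 so a loader reading 4-byte floats can parse it.
    flat = [float(v) for v in values]
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(flat)}f", *flat))

# With torch this might be: save_embedding(embedding.flatten().tolist(), sys.argv[2])
save_embedding([0.1, 0.2, 0.3], "voice1.bin")

with open("voice1.bin", "rb") as f:
    size = len(f.read())
print(size)  # 12 bytes = 3 float32 values
```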
@@ -0,0 +1,5 @@

    --extra-index-url "https://download.pytorch.org/whl/cpu"
    torch==2.9.1+cpu
    torchaudio==2.9.1+cpu
    speechbrain==1.0.3
    openai==2.21.0

Suggested change: replace `openai==2.21.0` with `openai==1.107.0`.
@@ -91,10 +91,14 @@ def add_common_arguments(parser):

    add_common_arguments(parser_text2speech)
    parser_text2speech.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
    parser_text2speech.add_argument('--vocoder', type=str, help='The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan', dest='vocoder')
    parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
    parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')

    parser_speech2text = subparsers.add_parser('speech2text', help='export model for speech2text endpoint')
    add_common_arguments(parser_speech2text)
    parser_speech2text.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
    parser_speech2text.add_argument('--enable_word_timestamps', default=False, action='store_true', help='Load model with word timestamps support.', dest='enable_word_timestamps')
    args = vars(parser.parse_args())

    t2s_graph_template = """
@@ -110,7 +114,14 @@ def add_common_arguments(parser):

    [type.googleapis.com / mediapipe.T2sCalculatorOptions]: {
      models_path: "{{model_path}}",
      plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
      target_device: "{{target_device|default("CPU", true)}}"
      target_device: "{{target_device|default("CPU", true)}}",
      {%- if speaker_name and speaker_path %}
      voices: [
        {
          name: "{{speaker_name}}",
          path: "{{speaker_path}}"
        }
      ]{% endif %}
    }
    }
    }
@@ -129,7 +140,8 @@ def add_common_arguments(parser):

    [type.googleapis.com / mediapipe.S2tCalculatorOptions]: {
      models_path: "{{model_path}}",
      plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
      target_device: "{{target_device|default("CPU", true)}}"
      target_device: "{{target_device|default("CPU", true)}}",
      enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%},

Suggested change:

    enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%},
    enable_word_timestamps: {{ 'true' if enable_word_timestamps else 'false' }}
@@ -47,8 +47,8 @@ curl -X POST http://localhost:8000/v3/audio/translations \

| prompt | ❌ | ✅ | string | An optional text to guide the model's style or continue a previous audio segment. |
| response_format | ❌ | ✅ | string | The format of the output. |
| stream | ❌ | ✅ | boolean | Generate the response in streaming mode. |
| temperature | ❌ | ✅ | number | The sampling temperature, between 0 and 1. |
| timestamp_granularities | ❌ | ✅ | array | The timestamp granularities to populate for this transcription. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |

Suggested change:

| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 2. OpenVINO Model Server accepts values in the range 0.0–2.0 (note: OpenAI's documentation typically states a range of 0.0–1.0). |

Copilot AI commented on Feb 20, 2026:

Grammar: "To enable word timestamps ... need to be set" should be "needs to be set".

Suggested change:

| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) |

Copilot AI commented on Feb 20, 2026:

The server parses timestamp granularities from the multipart field name timestamp_granularities[] (array-style), but the parameter name in the table is listed as timestamp_granularities and doesn't mention the required [] suffix. Please align the documented form field name with what the server actually expects, and fix the grammar in the note ("needs to be set").

Suggested change:

| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
| timestamp_granularities[] | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) |
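For context, a verbose_json transcription with word granularity carries a words array alongside the text. A sketch of pulling word timings out of such a response — the payload below is a hand-written stand-in, with field names following the OpenAI verbose_json schema:

```python
import json

# Hand-written stand-in for a verbose_json transcription response body
response_body = """
{
  "text": "the quick brown fox",
  "words": [
    {"word": "the",   "start": 0.00, "end": 0.12},
    {"word": "quick", "start": 0.12, "end": 0.40},
    {"word": "brown", "start": 0.40, "end": 0.72},
    {"word": "fox",   "start": 0.72, "end": 1.05}
  ]
}
"""
transcript = json.loads(response_body)
timings = [(w["word"], w["start"], w["end"]) for w in transcript["words"]]
for word, start, end in timings:
    print(f"{start:5.2f}-{end:5.2f}  {word}")
```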
@@ -42,7 +42,8 @@ node {

      node_options: {
        [type.googleapis.com / mediapipe.S2tCalculatorOptions]: {
          models_path: "./",
          target_device: "CPU"
          target_device: "CPU",
          enable_word_timestamps: true
        }
      }
    }

@@ -53,6 +54,7 @@ Above node configuration should be used as a template since user is not expected
The calculator supports the following `node_options` for tuning the pipeline configuration:
- `required string models_path` - location of the models and scheduler directory (can be relative);
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]

Suggested change:

- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]

Copilot AI commented on Feb 20, 2026:

Minor grammar: the option description says "word timestamp" (singular) but the feature and API use plural "word timestamps" (e.g., timestamp_granularities=["word"]). Consider updating the wording for consistency.

Suggested change:

- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamps. [default = false]

Copilot AI commented on Feb 20, 2026:

The node_options list documents `optional string device`, but the proto field and the example graph use `target_device`. This mismatch makes it unclear which key users should set. Update the bullet to `optional string target_device` and consider tweaking the new enable_word_timestamps description grammar ("word timestamps").

Suggested change:

- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
- `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if the model should support user requests for word timestamps. [default = false]
Review comment:

The example passes `--speaker_path /models/speakers/voice1.bin`, which matches the Docker volume mount shown later, but it won't exist for the Bare Metal deployment section (where the model repo is `models/...`). Please clarify that `speaker_path` must match the runtime filesystem (or suggest using a path relative to the model repository) to avoid a broken bare-metal setup.