Merged
150 changes: 103 additions & 47 deletions demos/audio/README.md
@@ -16,6 +16,17 @@ Check supported [Speech Recognition Models](https://openvinotoolkit.github.io/op
**Client**: curl or Python with the OpenAI client package

## Speech generation
### Prepare speaker embeddings
When generating speech you can use the default speaker voice or prepare your own speaker embedding file. The example below uses a file downloaded from an online repository, but you can try it with your own speech recording as well:
```bash
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2026/0/demos/audio/requirements.txt
mkdir -p audio_samples
curl --output audio_samples/audio.wav "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0032_8k.wav"
mkdir -p models
mkdir -p models/speakers
python create_speaker_embedding.py audio_samples/audio.wav models/speakers/voice1.bin
```

### Model preparation
Supported models should use the topology of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts), which needs to be converted to IR format before use in OVMS.

@@ -40,48 +51,14 @@ Run `export_model.py` script to download and quantize the model:

**CPU**
```console
python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan
python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp16 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan --speaker_name voice1 --speaker_path /models/speakers/voice1.bin
```
Comment on lines +54 to 55

Copilot AI Feb 20, 2026

The example passes --speaker_path /models/speakers/voice1.bin, which matches the Docker volume mount shown later, but it won’t exist for the Bare Metal deployment section (where the model repo is models/...). Please clarify that speaker_path must match the runtime filesystem (or suggest using a path relative to the model repository) to avoid a broken bare-metal setup.

> **Note:** Change the `--weight-format` to quantize the model to `int8` precision to reduce memory consumption and improve performance.
> **Note:** `speaker_name` and `speaker_path` may be omitted if the default model voice is sufficient.
Comment on lines +54 to +58

Copilot AI Feb 20, 2026

The example passes --speaker_path /models/speakers/voice1.bin, but earlier you create the file at models/speakers/voice1.bin and the README also documents bare-metal runs (where /models/... may not exist). Consider using a path that is valid in both scenarios (e.g., relative to the model repository) or clearly scope this example to the Docker deployment with /models bind-mounted.

The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [T2s calculator documentation](../../docs/speech_generation/reference.md) to learn more about configuration options and limitations.

### Speaker embeddings

Instead of generating speech with the default model voice, you can create speaker embeddings with [this script](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/speech_generation/create_speaker_embedding.py)
```bash
curl --output create_speaker_embedding.py "https://raw.githubusercontent.com/openvinotoolkit/openvino.genai/refs/heads/master/samples/python/speech_generation/create_speaker_embedding.py"
python create_speaker_embedding.py
mv speaker_embedding.bin models/
```
The script records your speech for 5 seconds (you can adjust the duration of the recording to achieve better results) and then, using the speechbrain/spkrec-xvect-voxceleb model, creates a `speaker_embedding.bin` file that contains your speaker embedding.
Now you need to add the speaker embedding path to the `graph.pbtxt` file of the text2speech graph:
```
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"
node {
  name: "T2sExecutor"
  input_side_packet: "TTS_NODE_RESOURCES:t2s_servable"
  calculator: "T2sCalculator"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  node_options: {
    [type.googleapis.com / mediapipe.T2sCalculatorOptions]: {
      models_path: "./",
      plugin_config: '{ "NUM_STREAMS": "1" }',
      target_device: "CPU",
      voices: [
        {
          name: "voice",
          path: "/models/speaker_embedding.bin",
        }
      ]
    }
  }
}
```

### Deployment

**CPU**
@@ -95,21 +72,21 @@ docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw
**Deploying on Bare Metal**

```bat
mkdir models
ovms --rest_port 8000 --source_model microsoft/speecht5_tts --model_repository_path models --model_name microsoft/speecht5_tts --task text2speech --target_device CPU
mkdir -p models
ovms --rest_port 8000 --model_path models/microsoft/speecht5_tts --model_name microsoft/speecht5_tts
```

### Request Generation

:::{dropdown} **Unary call with curl**
:::{dropdown} **Unary call with curl with default voice**


```bash
curl http://localhost:8000/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"microsoft/speecht5_tts\", \"input\": \"The quick brown fox jumped over the lazy dog\"}" -o speech.wav
```
:::

:::{dropdown} **Unary call with OpenAi python library**
:::{dropdown} **Unary call with OpenAI python library with default voice**

```python
from pathlib import Path
@@ -125,7 +102,41 @@ client = OpenAI(base_url=url, api_key="not_used")

with client.audio.speech.with_streaming_response.create(
    model="microsoft/speecht5_tts",
    voice="unused",
    voice=None,
Copilot AI Feb 20, 2026

In the "default voice" OpenAI Python example, voice=None sends JSON null and may not satisfy client-side type validation (and it’s inconsistent with other docs that use a placeholder like voice="unused"). Prefer omitting the voice argument entirely if you want the server default, or use a documented placeholder consistently across docs.

Suggested change
voice=None,
    input=prompt
) as response:
    response.stream_to_file(speech_file_path)


print("Generation finished")
```
:::

:::{dropdown} **Unary call with curl with custom voice**


```bash
curl http://localhost:8000/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"microsoft/speecht5_tts\", \"voice\":\"voice1\", \"input\": \"The quick brown fox jumped over the lazy dog\"}" -o speech.wav
```
:::

:::{dropdown} **Unary call with OpenAI python library with custom voice**

```python
from pathlib import Path
from openai import OpenAI

prompt = "The quick brown fox jumped over the lazy dog"
filename = "speech.wav"
url="http://localhost:8000/v3"


speech_file_path = Path(__file__).parent / "speech.wav"
client = OpenAI(base_url=url, api_key="not_used")

with client.audio.speech.with_streaming_response.create(
    model="microsoft/speecht5_tts",
    voice="voice1",
    input=prompt
) as response:
    response.stream_to_file(speech_file_path)
@@ -144,7 +155,7 @@ An asynchronous benchmarking client can be used to access the model server perfo
git clone https://github.com/openvinotoolkit/model_server
cd model_server/demos/benchmark/v3/
pip install -r requirements.txt
python benchmark.py --api_url http://localhost:8122/v3/audio/speech --model microsoft/speecht5_tts --batch_size 1 --limit 100 --request_rate inf --backend text2speech --dataset edinburghcstr/ami --hf-subset 'ihm' --tokenizer openai/whisper-large-v3-turbo --trust-remote-code True
python benchmark.py --api_url http://localhost:8000/v3/audio/speech --model microsoft/speecht5_tts --batch_size 1 --limit 100 --request_rate inf --backend text2speech --dataset edinburghcstr/ami --hf-subset 'ihm' --tokenizer openai/whisper-large-v3-turbo --trust-remote-code True
Number of documents: 100
100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [01:58<00:00, 1.19s/it]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
@@ -178,10 +189,11 @@ Run `export_model.py` script to download and quantize the model:

**CPU**
```console
python export_model.py speech2text --source_model openai/whisper-large-v3-turbo --weight-format fp16 --model_name openai/whisper-large-v3-turbo --config_file_path models/config.json --model_repository_path models --overwrite_models
python export_model.py speech2text --source_model openai/whisper-large-v3-turbo --weight-format fp16 --model_name openai/whisper-large-v3-turbo --config_file_path models/config.json --model_repository_path models --overwrite_models --enable_word_timestamps
```

> **Note:** Change the `--weight-format` to quantize the model to `int8` precision to reduce memory consumption and improve performance.
> **Note:** `--enable_word_timestamps` can be omitted if there is no need for word timestamps support.

### Deployment

@@ -221,16 +233,16 @@ ovms --rest_port 8000 --source_model openai/whisper-large-v3-turbo --model_repos
```
:::

The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [S2t calculator documentation](../../docs/speech_recognition/reference.md) to learn more about configuration options and limitations.
The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [s2t calculator documentation](../../docs/speech_recognition/reference.md) to learn more about configuration options and limitations.

### Request Generation
Transcribe the file that was previously generated with the audio/speech endpoint.

:::{dropdown} **Unary call with curl**
:::{dropdown} **Unary call with cURL**


```bash
curl http://localhost:8000/v3/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@speech.wav" -F model="openai/whisper-large-v3-turbo"
curl http://localhost:8000/v3/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@speech.wav" -F model="openai/whisper-large-v3-turbo" -F language="en"
```
```json
{"text": " The quick brown fox jumped over the lazy dog."}
@@ -253,6 +265,7 @@ client = OpenAI(base_url=url, api_key="not_used")
audio_file = open(filename, "rb")
transcript = client.audio.transcriptions.create(
    model="openai/whisper-large-v3-turbo",
    language="en",
    file=audio_file
)

@@ -262,6 +275,49 @@ print(transcript.text)
The quick brown fox jumped over the lazy dog.
```
:::
:::{dropdown} **Unary call with timestamps**


```bash
curl http://localhost:8000/v3/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@speech.wav" -F model="openai/whisper-large-v3-turbo" -F language="en" -F timestamp_granularities[]="segment" -F timestamp_granularities[]="word"
```
```json
{"text":" A quick brown fox jumped over the lazy dog","words":[{"word":" A","start":0.0,"end":0.14000000059604645},{"word":" quick","start":0.14000000059604645,"end":0.3400000035762787},{"word":" brown","start":0.3400000035762787,"end":0.7799999713897705},{"word":" fox","start":0.7799999713897705,"end":1.3199999332427979},{"word":" jumped","start":1.3199999332427979,"end":1.7799999713897705},{"word":" over","start":1.7799999713897705,"end":2.0799999237060547},{"word":" the","start":2.0799999237060547,"end":2.259999990463257},{"word":" lazy","start":2.259999990463257,"end":2.5399999618530273},{"word":" dog","start":2.5399999618530273,"end":2.919999837875366}],"segments":[{"text":" A quick brown fox jumped over the lazy dog","start":0.0,"end":3.1399998664855957}]}
```
:::
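
The JSON above can be post-processed on the client side. As a small sketch (field names taken from the response shown, with a trimmed example payload), per-word durations can be derived like this:

```python
import json

def word_durations(response_json):
    """Compute (word, duration) pairs from a transcription response
    requested with word-level timestamp granularity."""
    data = json.loads(response_json)
    return [(w["word"].strip(), round(w["end"] - w["start"], 2))
            for w in data.get("words", [])]

# Trimmed payload based on the response above:
sample = ('{"text": " A quick", "words": ['
          '{"word": " A", "start": 0.0, "end": 0.14},'
          '{"word": " quick", "start": 0.14, "end": 0.34}]}')
print(word_durations(sample))  # [('A', 0.14), ('quick', 0.2)]
```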

:::{dropdown} **Unary call with python OpenAI library with timestamps**

```python
from pathlib import Path
from openai import OpenAI

filename = "speech.wav"
url="http://localhost:8000/v3"


speech_file_path = Path(__file__).parent / filename
client = OpenAI(base_url=url, api_key="not_used")

audio_file = open(filename, "rb")
transcript = client.audio.transcriptions.create(
    model="openai/whisper-large-v3-turbo",
    language="en",
    response_format="verbose_json",
    timestamp_granularities=["segment", "word"],
    file=audio_file
)
Comment on lines +302 to +309

Copilot AI Feb 20, 2026

The Python example opens the audio file without closing it. Use a context manager (with open(...) as audio_file:) in the snippet to avoid leaking file descriptors in copy/pasted code.

Suggested change
audio_file = open(filename, "rb")
transcript = client.audio.transcriptions.create(
    model="openai/whisper-large-v3-turbo",
    language="en",
    response_format="verbose_json",
    timestamp_granularities=["segment", "word"],
    file=audio_file
)
with open(filename, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        language="en",
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],
        file=audio_file
    )

print(transcript.text)
print(transcript.segments)
print(transcript.words)
```
```
A quick brown fox jumped over the lazy dog
[TranscriptionSegment(id=None, avg_logprob=None, compression_ratio=None, end=3.1399998664855957, no_speech_prob=None, seek=None, start=0.0, temperature=None, text=' A quick brown fox jumped over the lazy dog', tokens=None)]
[TranscriptionWord(end=0.14000000059604645, start=0.0, word=' A'), TranscriptionWord(end=0.3400000035762787, start=0.14000000059604645, word=' quick'), TranscriptionWord(end=0.7799999713897705, start=0.3400000035762787, word=' brown'), TranscriptionWord(end=1.3199999332427979, start=0.7799999713897705, word=' fox'), TranscriptionWord(end=1.7799999713897705, start=1.3199999332427979, word=' jumped'), TranscriptionWord(end=2.0799999237060547, start=1.7799999713897705, word=' over'), TranscriptionWord(end=2.259999990463257, start=2.0799999237060547, word=' the'), TranscriptionWord(end=2.5399999618530273, start=2.259999990463257, word=' lazy'), TranscriptionWord(end=2.919999837875366, start=2.5399999618530273, word=' dog')]
```
:::

## Benchmarking transcription
An asynchronous benchmarking client can be used to access the model server performance with various load conditions. Below are execution examples captured on Intel(R) Core(TM) Ultra 7 258V.
@@ -336,7 +392,7 @@ ovms --rest_port 8000 --source_model OpenVINO/whisper-large-v3-fp16-ov --model_r
### Request Generation
Transcribe and translate the file that was previously generated with the audio/speech endpoint.

:::{dropdown} **Unary call with curl**
:::{dropdown} **Unary call with cURL**


```bash
36 changes: 36 additions & 0 deletions demos/audio/create_speaker_embedding.py
@@ -0,0 +1,36 @@
#!/usr/bin/env python3
# Copyright (C) 2026 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier
import sys

if len(sys.argv) != 3:
    print(f"Usage: {sys.argv[0]} <input_audio_file> <output_embedding_file>")
    sys.exit(1)

file = sys.argv[1]
signal, fs = torchaudio.load(file)
if signal.shape[0] > 1:
    signal = torch.mean(signal, dim=0, keepdim=True)
expected_sample_rate = 16000
if fs != expected_sample_rate:
    resampler = torchaudio.transforms.Resample(orig_freq=fs, new_freq=expected_sample_rate)
    signal = resampler(signal)

if signal.ndim != 2 or signal.shape[0] != 1:
    print(f"Error: expected signal shape [1, num_samples], got {list(signal.shape)}")
    sys.exit(1)
if signal.shape[1] == 0:
    print("Error: audio file contains no samples")
    sys.exit(1)

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
Comment on lines +31 to +32

Copilot AI Feb 20, 2026

This script runs speaker embedding extraction under normal autograd tracking, which can add unnecessary memory/overhead during inference because model parameters require gradients by default. Consider wrapping the embedding computation in torch.inference_mode() (or torch.no_grad()) to make it explicitly inference-only.

Suggested change
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
with torch.inference_mode():
    embedding = classifier.encode_batch(signal)
    embedding = torch.nn.functional.normalize(embedding, dim=2)
embedding = embedding.squeeze().cpu().numpy().astype("float32")

output_file = sys.argv[2]
embedding.tofile(output_file)
5 changes: 5 additions & 0 deletions demos/audio/requirements.txt
@@ -0,0 +1,5 @@
--extra-index-url "https://download.pytorch.org/whl/cpu"
torch==2.9.1+cpu
torchaudio==2.9.1+cpu
speechbrain==1.0.3
openai==2.21.0
Copilot AI Feb 20, 2026

openai==2.21.0 is inconsistent with the OpenAI client version pinned in other demos (e.g., openai==1.107.0). Unless there’s a strong reason to require a different major version here, consider aligning the version (or loosening it to a compatible range) so users don’t run into API differences between demos.

Suggested change
openai==2.21.0
openai==1.107.0
16 changes: 14 additions & 2 deletions demos/common/export_models/export_model.py
@@ -91,10 +91,14 @@ def add_common_arguments(parser):
add_common_arguments(parser_text2speech)
parser_text2speech.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
parser_text2speech.add_argument('--vocoder', type=str, help='The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan', dest='vocoder')
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')
Comment on lines +94 to +95

Copilot AI Feb 20, 2026

--speaker_name and --speaker_path are intended to be used together, but the graph template silently omits the voices block unless both are set. Add argparse validation (or an explicit error) when only one of these flags is provided to avoid creating a graph that ignores the user's input.


parser_speech2text = subparsers.add_parser('speech2text', help='export model for speech2text endpoint')
add_common_arguments(parser_speech2text)
parser_speech2text.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
parser_speech2text.add_argument('--enable_word_timestamps', default=False, action='store_true', help='Load model with word timestamps support.', dest='enable_word_timestamps')
args = vars(parser.parse_args())

t2s_graph_template = """
@@ -110,7 +114,14 @@ def add_common_arguments(parser):
    [type.googleapis.com / mediapipe.T2sCalculatorOptions]: {
      models_path: "{{model_path}}",
      plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
      target_device: "{{target_device|default("CPU", true)}}"
      target_device: "{{target_device|default("CPU", true)}}",
      {%- if speaker_name and speaker_path %}
      voices: [
        {
          name: "{{speaker_name}}",
          path: "{{speaker_path}}"
        }
      ]{% endif %}
Comment on lines 115 to +124

Copilot AI Feb 20, 2026

In t2s_graph_template, target_device is always followed by a comma, but voices is conditional. When no speaker is provided, the rendered graph.pbtxt will end up with a dangling trailing comma after target_device. Please move the comma into the conditional block (or otherwise avoid emitting a trailing comma) to keep the generated graph formatting consistent with the other templates in this script.
}
}
}
@@ -129,7 +140,8 @@ def add_common_arguments(parser):
    [type.googleapis.com / mediapipe.S2tCalculatorOptions]: {
      models_path: "{{model_path}}",
      plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
      target_device: "{{target_device|default("CPU", true)}}"
      target_device: "{{target_device|default("CPU", true)}}",
      enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%},
Copilot AI Feb 20, 2026

In s2t_graph_template, enable_word_timestamps is the last field but is rendered with a trailing comma. Other templates in this file generally avoid trailing commas on the last field, so this inconsistency makes the generated graph harder to compare/debug. Consider rendering the boolean without a trailing comma (and you can simplify the Jinja to emit true/false directly).

Suggested change
enable_word_timestamps: {% if not enable_word_timestamps %}false{% else %}true{% endif%},
enable_word_timestamps: {{ 'true' if enable_word_timestamps else 'false' }}
}
}
}
4 changes: 2 additions & 2 deletions docs/model_server_rest_api_speech_to_text.md
@@ -47,8 +47,8 @@ curl -X POST http://localhost:8000/v3/audio/translations \
| prompt | ❌ | ✅ | string | An optional text to guide the model's style or continue a previous audio segment. |
| response_format | ❌ | ✅ | string | The format of the output. |
| stream | ❌ | ✅ | boolean | Generate the response in streaming mode. |
| temperature | | ✅ | number | The sampling temperature, between 0 and 1. |
| timestamp_granularities | | ✅ | array | The timestamp granularities to populate for this transcription. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |
Copilot AI Feb 20, 2026

The documentation says temperature is between 0 and 1, but the implementation accepts values up to 2.0 (and rejects outside 0.0–2.0). Update this range in the API table (and/or add an explicit note explaining the difference vs OpenAI) to prevent client-side validation issues.

Suggested change
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 1. |
| temperature | ⚠️ | ✅ | number | The sampling temperature, between 0 and 2. OpenVINO Model Server accepts values in the range 0.0–2.0 (note: OpenAI’s documentation typically states a range of 0.0–1.0). |
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
Copilot AI Feb 20, 2026

Grammar: "To enable word timestamps ... need to be set" should be "needs to be set".

Suggested change
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) |
Copilot AI Feb 20, 2026

The server parses timestamp granularities from multipart field name timestamp_granularities[] (array-style), but the parameter name in the table is listed as timestamp_granularities and doesn’t mention the required [] suffix. Please align the documented form field name with what the server actually expects, and fix the grammar in the note ("needs to be set").

Suggested change
| timestamp_granularities | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` need to be set in graph.pbtxt) |
| timestamp_granularities[] | ⚠️ | ✅ | array | The timestamp granularities to populate for this transcription. Supported values: "word" and "segment" (⚠️**Note**: To enable word timestamps `enable_word_timestamps: true` needs to be set in graph.pbtxt) |


### Translation
4 changes: 3 additions & 1 deletion docs/speech_recognition/reference.md
@@ -42,7 +42,8 @@ node {
  node_options: {
    [type.googleapis.com / mediapipe.S2tCalculatorOptions]: {
      models_path: "./",
      target_device: "CPU"
      target_device: "CPU",
      enable_word_timestamps: true
Comment on lines +45 to +46

Copilot AI Feb 20, 2026

In the example graph snippet, target_device now ends with a comma. Other graph.pbtxt examples in the docs use protobuf text format without commas (e.g., speech_generation/reference.md), so this is inconsistent and could confuse users copying the config. Consider removing the comma to keep formatting consistent across docs.
    }
  }
}
@@ -53,6 +54,7 @@ Above node configuration should be used as a template since user is not expected
The calculator supports the following `node_options` for tuning the pipeline configuration:
- `required string models_path` - location of the models and scheduler directory (can be relative);
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
Copilot AI Feb 20, 2026

In the node_options list, the option is documented as optional string device, but the actual proto uses target_device (see src/audio/speech_to_text/s2t_calculator.proto). This mismatch makes it unclear which key users should set in graph.pbtxt. Update the docs to use target_device consistently (and keep the supported values/default there).

Suggested change
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
Copilot AI Feb 20, 2026

Minor grammar: the option description says "word timestamp" (singular) but the feature and API use plural "word timestamps" (e.g., timestamp_granularities=["word"]). Consider updating the wording for consistency.

Suggested change
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamps. [default = false]
Comment on lines 56 to +57

Copilot AI Feb 20, 2026

The node_options list documents optional string device, but the proto field and the example graph use target_device. This mismatch makes it unclear which key users should set. Update the bullet to optional string target_device and consider tweaking the new enable_word_timestamps description grammar ("word timestamps").

Suggested change
- `optional string device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if model should support user requests for word timestamp. [default = false]
- `optional string target_device` - device to load models to. Supported values: "CPU, GPU, NPU" [default = "CPU"]
- `optional bool enable_word_timestamps` - set to true if the model should support user requests for word timestamps. [default = false]

We recommend using [export script](../../demos/common/export_models/README.md) to prepare models directory structure for serving.
Check [supported models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#speech-recognition-models).