Speech-to-Text Timestamp Stagnation in ElevenLabs API #607
Open
Labels: bug, needs information, stale, upstream
Description
When calling client.speech_to_text.convert with diarization enabled, the returned word-level timestamps occasionally become "stuck": multiple consecutive words are assigned identical start and end timestamps, so each affected word has zero duration and the transcription timeline stops advancing.
Example of Problem:
```json
{
  "text": "休",
  "start": 452.3,
  "end": 452.3
},
{
  "text": "息",
  "start": 452.3,
  "end": 452.3
}
```
Impact:
Steps to Reproduce
1. Use a long audio file (>10 minutes) with multiple speaker changes.
2. Call the API with parameters:

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")
response = client.speech_to_text.convert(
    file=audio_data,  # contents of the long recording
    model_id="scribe_v1",
    diarize=True,
    language_code="zh",  # Also reproducible with other languages
    tag_audio_events=True,
)
```

3. Inspect word-level timestamps in the response.
4. Observe duplicate timestamps for consecutive words, especially after speaker changes.
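To make the inspection step less manual, a small helper along the following lines could flag the affected spans. This is only a sketch: it assumes the words have already been converted to plain dicts with `text`, `start`, and `end` keys, matching the JSON example above. The function name `find_stuck_runs` is hypothetical, and the first and last entries in the sample data are made-up neighbours added purely for illustration.

```python
from typing import List, Tuple


def find_stuck_runs(words: List[dict], min_run: int = 2) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of consecutive words
    that share identical start and end timestamps."""
    runs = []
    run_start = 0
    # Iterate one step past the end so the final run is also closed.
    for i in range(1, len(words) + 1):
        same = (
            i < len(words)
            and words[i]["start"] == words[run_start]["start"]
            and words[i]["end"] == words[run_start]["end"]
        )
        if not same:
            if i - run_start >= min_run:
                runs.append((run_start, i - 1))
            run_start = i
    return runs


# Sample data: the two middle entries come from the report above;
# the first and last entries are hypothetical, for illustration only.
words = [
    {"text": "前", "start": 451.9, "end": 452.1},
    {"text": "休", "start": 452.3, "end": 452.3},
    {"text": "息", "start": 452.3, "end": 452.3},
    {"text": "后", "start": 452.6, "end": 452.8},
]

for first, last in find_stuck_runs(words):
    stuck_text = "".join(w["text"] for w in words[first:last + 1])
    print(f"words {first}-{last} share t={words[first]['start']}: {stuck_text}")
```

Reporting index ranges rather than individual words makes it easier to check whether the stuck spans line up with speaker changes.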
Expected Behavior