fix: skip firstTokenLogProbThreshold when promptTokens are set #438
alan890104 wants to merge 2 commits into argmaxinc:main
Conversation
When promptTokens are provided, the decoder's KV cache state is shifted by the prompt context, causing the first content token's logprob to drop below firstTokenLogProbThreshold (-1.5). This immediately aborts the decoding loop, producing empty transcription results.

This threshold is a WhisperKit-specific quality gate not present in OpenAI's original Whisper or whisper.cpp. The original Whisper relies on avgLogprob (computed over the full segment) for quality filtering, which remains active and serves as a safety net.

The issue is intermittent and particularly affects distilled/turbo model variants (e.g. large-v3-turbo), where the reduced decoder capacity is more sensitive to prompt conditioning.

Fixes argmaxinc#372
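The failure mode described above can be sketched with a small Python model. This is illustrative only: the function and constant names below are stand-ins for WhisperKit's Swift internals, not the actual implementation.

```python
# Illustrative model of the first-token early abort (not actual WhisperKit code).
FIRST_TOKEN_LOGPROB_THRESHOLD = -1.5  # WhisperKit default, per the PR description

def decode(first_token_logprob, prompt_tokens=None):
    """Return decoded tokens, or [] if the quality gate aborts the loop."""
    # Before the fix, the gate fires regardless of prompt conditioning.
    if first_token_logprob < FIRST_TOKEN_LOGPROB_THRESHOLD:
        return []  # loop aborts immediately -> empty transcription
    return ["token"]  # stand-in for a normal decode

# Without a prompt, the first token typically scores well above the gate...
assert decode(-0.087) == ["token"]
# ...but prompt conditioning can push it below -1.5, yielding an empty result.
assert decode(-1.8, prompt_tokens=[50361]) == []
```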
…shold

Regression test for argmaxinc#372. Measured on tiny model + jfk.wav:

- Without prompt tokens: firstToken logprob ≈ -0.087
- With CJK prompt tokens: firstToken logprob ≈ -0.578

Prompt tokens shift the first content token logprob ~6.6x lower. On turbo models (fewer decoder layers), this shift is amplified enough to breach the default threshold (-1.5). We use -0.5 here to reliably reproduce the issue on tiny, simulating the larger shift on turbo.

Without the fix: test fails (empty transcription). With the fix: test passes (normal transcription).
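The "~6.6x" figure quoted in the commit message follows directly from the two measurements:

```python
# Logprob shift measured in the regression test (tiny model + jfk.wav).
without_prompt = -0.087  # first-token logprob, no prompt tokens
with_prompt = -0.578     # first-token logprob, CJK prompt tokens

ratio = with_prompt / without_prompt
print(round(ratio, 1))  # ≈ 6.6, i.e. the first-token logprob is ~6.6x lower
```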
01914ba to 6dafa8a
Pull request overview
This PR addresses intermittent empty transcription results when DecodingOptions.promptTokens is used by disabling WhisperKit’s firstTokenLogProbThreshold early-abort check in that mode (root cause of #372).
Changes:
- Skip the `firstTokenLogProbThreshold` check when `promptTokens` are provided.
- Add a regression unit test ensuring transcription is non-empty with `promptTokens` even under a strict `firstTokenLogProbThreshold`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `Sources/WhisperKit/Core/TextDecoder.swift` | Disables the first-token logprob early-abort when `options.promptTokens` is non-nil to prevent empty outputs. |
| `Tests/WhisperKitTests/UnitTests.swift` | Adds regression coverage for prompting with a strict first-token threshold (Issue #372). |
```diff
 isFirstTokenLogProbTooLow =
-    if isFirstToken, let firstTokenLogProbThreshold = options.firstTokenLogProbThreshold, nextTokenLogProb < firstTokenLogProbThreshold {
+    if isFirstToken, options.promptTokens == nil, let firstTokenLogProbThreshold = options.firstTokenLogProbThreshold, nextTokenLogProb < firstTokenLogProbThreshold {
         true
     } else {
```
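The effect of the one-line change can be modeled in Python (a sketch only; the parameter names mirror the Swift identifiers but this is not WhisperKit code):

```python
# Python stand-in for the Swift condition, before and after the patch.
def is_first_token_logprob_too_low(is_first_token, next_token_logprob,
                                   first_token_logprob_threshold, prompt_tokens,
                                   patched=True):
    if not is_first_token or first_token_logprob_threshold is None:
        return False
    if patched and prompt_tokens is not None:
        # New guard: skip the check entirely when prompt tokens are set.
        return False
    return next_token_logprob < first_token_logprob_threshold

# A prompt-conditioned first token below the -1.5 default:
args = dict(is_first_token=True, next_token_logprob=-1.8,
            first_token_logprob_threshold=-1.5, prompt_tokens=[50361])
assert is_first_token_logprob_too_low(patched=False, **args) is True   # old: aborts
assert is_first_token_logprob_too_low(patched=True, **args) is False   # new: decodes on
```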
This change makes firstTokenLogProbThreshold effectively a no-op whenever options.promptTokens is non-nil. That’s a behavior change for an existing public option, so it should be documented (e.g., in DecodingOptions docs / README) to avoid confusing callers who set a strict threshold expecting it to be enforced.
Problem

Using `promptTokens` in `DecodingOptions` causes intermittent empty transcription results, particularly with distilled/turbo model variants (e.g. `large-v3-turbo`). This is the root cause of #372.

Root Cause Analysis

When `promptTokens` are provided, the decoder input sequence becomes:

The prompt tokens shift the decoder's KV cache state, causing the first content token's logprob to occasionally drop below `firstTokenLogProbThreshold` (default: `-1.5`). When this happens:

- `isFirstTokenLogProbTooLow` is set to `true`
- `avgLogProb` computes to `0.000` (no real tokens to average over)
- `DecodingFallback` triggers with reason `"firstTokenLogProbThreshold"`

Why this only affects WhisperKit
`firstTokenLogProbThreshold` is a WhisperKit-specific quality gate — it does not exist in:

- OpenAI's original Whisper (`whisper/decoding.py`) — only uses `logprob_threshold` (avg over full segment), `no_speech_threshold`, and `compression_ratio_threshold`
- whisper.cpp

The original Whisper design lets the decoder run to completion and evaluates quality over the entire segment via `avgLogprob`. This is robust to prompt-induced shifts in the first token's distribution because subsequent tokens compensate. WhisperKit's early abort on the first token prevents this self-correction.

Why it's intermittent

The first token logprob depends on the interaction between prompt token content and audio content. For certain audio segments, the prompt conditioning pushes the first token just below `-1.5`; for others it stays above. This creates a non-deterministic failure pattern.

Why turbo models are more affected

Distilled/turbo variants (e.g. `large-v3-v20240930`) have fewer decoder layers, making them more sensitive to changes in the conditioning context. The reduced decoder capacity has less room to absorb the distributional shift from prompt tokens.

Fix
Skip the `firstTokenLogProbThreshold` check when `promptTokens` are present (`options.promptTokens == nil` guard). This is a one-line change in `TextDecoder.swift`.

Safety: The existing `logProbThreshold` (avg over full segment, default `-1.0`) and `compressionRatioThreshold` remain active as quality gates, matching the original Whisper behavior. Truly bad segments will still be caught and retried via temperature fallback.

Related