refactor(ai): split STT and Audio-LLM into separate interfaces #5928
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough: STT / AudioLLM Refactor
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Refactors `internal/ai/` from a single `Transcriber` interface (which dressed Gemini's multimodal LLM up as STT) into two honest interfaces, aligned with mainstream OSS conventions (Vercel AI SDK, LiteLLM, Go AI ecosystem):

- `internal/ai/stt/` — deterministic speech-to-text (OpenAI Whisper + gpt-4o-transcribe family, plus any OpenAI-compatible STT endpoint via the `Endpoint` field — Groq Whisper, Azure Whisper, faster-whisper)
- `internal/ai/audiollm/` — multimodal audio LLMs (Gemini generateContent; reserved for future gpt-4o-audio support)

The `Transcribe` gRPC handler now dispatches to the right package based on `provider.Type` (see the sketch below). Transcription instructions are extracted from the Gemini package into the handler, so `audiollm/gemini` is a generic audio-LLM adapter that the handler steers — opening the door to future "audio → structured note" pipelines without further refactoring.

User-visible improvements:

- Gemini failure modes are now informative. A non-Stop finish reason from Gemini (safety filter, max-tokens, recitation, prohibited-content, blocklist, SPII, image-safety variants) surfaces as "transcription incomplete (finish reason: safety)" instead of the previous generic "did not include text".
- Browser-recorded WebM works with Gemini transparently. A pure-Go decoder (at-wat/ebml-go + pion/opus) transcodes to WAV server-side; the misleading "Gemini does not accept WebM" UI warning is gone.
- The model picker UI shows current options. The OpenAI placeholder lists whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize. Gemini lists gemini-2.5-flash and 2.5-pro. Prompt help text explains the Whisper-soft-hint vs Gemini-literal-instruction difference.

Schema and contract:

- No proto field changes. Two comment-only updates on `TranscriptionConfig.model` and `.prompt` to reflect current model families and the cross-provider prompt semantics.
- No public API changes. `TranscribeRequest`/`TranscribeResponse` shapes unchanged.
- No DB migration. `InstanceAISetting` storage unchanged.
- `Duration` dropped from the internal STT response shape — it's a property of the audio file, not the transcription provider. The public `TranscribeResponse` only ever returned `Text`, so no API impact.

Notable internal changes:

- No factory layer between handler and provider implementations. A factory in `stt/` would have imported `stt/openai`, which imports `stt` for types — a Go import cycle. The handler already switches on `provider.Type`, so the factory was redundant indirection. Matches the cloudwego/eino and tmc/langchaingo convention (per-provider constructors, caller-side dispatch).
- New sentinel errors: `ErrSTTNotSupported`, `ErrAudioLLMNotSupported`. Available for callers that want `errors.Is` checks; the handler uses the inline switch instead.
- `audiollm.FinishReason` maps eight genai safety-related reasons (Safety, Recitation, ProhibitedContent, SPII, Blocklist, ImageSafety, ImageProhibitedContent, ImageRecitation) to `FinishSafety` so the handler error surfaces a clear bucket regardless of which content-policy reason Gemini reports.

Background:

Spec and research findings live in:

- docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
- docs/superpowers/plans/2026-05-02-stt-audiollm-split.md

Verified against six OSS reference implementations: Vercel AI SDK, LiteLLM, cloudwego/eino + tmc/langchaingo + sashabaranov/go-openai (SDK layer); Open WebUI, LobeChat, Dify (application layer). All six split STT and multimodal-audio into separate paths; none add STT-specific fields onto a generic provider entity.
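A minimal sketch of that caller-side dispatch, for orientation. The `audiollm` constructor and request shapes mirror the adapter tests quoted later in this review; the STT-side names (`sttopenai.New`, `stt.Request`) and this function's signature are illustrative assumptions, not the PR's literal handler code.

```go
package example

import (
	"context"
	"io"

	"github.com/pkg/errors"

	// Package paths assumed from the PR's layout.
	"github.com/usememos/memos/internal/ai"
	"github.com/usememos/memos/internal/ai/audiollm"
	audiollmgemini "github.com/usememos/memos/internal/ai/audiollm/gemini"
	"github.com/usememos/memos/internal/ai/stt"
	sttopenai "github.com/usememos/memos/internal/ai/stt/openai"
)

// transcribe shows the per-provider-constructor, caller-side-dispatch convention:
// no factory layer, just a switch on provider.Type.
func transcribe(ctx context.Context, provider ai.ProviderConfig, model, contentType, instructions string, audioData io.Reader) (string, error) {
	switch provider.Type {
	case ai.ProviderOpenAI:
		// Assumed STT-side API; the PR pins the package split, not these exact names.
		client, err := sttopenai.New(provider)
		if err != nil {
			return "", err
		}
		resp, err := client.Transcribe(ctx, stt.Request{Model: model, ContentType: contentType, Audio: audioData})
		if err != nil {
			return "", err
		}
		return resp.Text, nil
	case ai.ProviderGemini:
		m, err := audiollmgemini.New(provider, audiollm.ApplyOptions(nil))
		if err != nil {
			return "", err
		}
		resp, err := m.GenerateFromAudio(ctx, audiollm.Request{
			Model:        model,
			ContentType:  contentType,
			Audio:        audioData,
			Instructions: instructions, // built by the handler, not baked into the Gemini package
		})
		if err != nil {
			return "", err
		}
		if resp.FinishReason != audiollm.FinishStop {
			return "", errors.Errorf("transcription incomplete (finish reason: %s)", resp.FinishReason)
		}
		return resp.Text, nil
	default:
		// ErrSTTNotSupported / ErrAudioLLMNotSupported sentinels exist for errors.Is callers.
		return "", errors.Errorf("unsupported provider type: %v", provider.Type)
	}
}
```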
Force-pushed from f5ec03f to e25b04a
Actionable comments posted: 7
🧹 Nitpick comments (2)
internal/ai/audio/webm_test.go (1)
32-48: 🏗️ Heavy lift
Add a success-path fixture for the transcoder.
Right now the new codec path is only covered through failures. A small valid WebM/Opus sample that asserts RIFF/WAVE output and a sane decoded payload would give this feature real regression coverage.
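A minimal sketch of such a success-path test, assuming a checked-in fixture at a hypothetical testdata/sample.webm path and the testify assertions already used in this package:

```go
// Sketch: sits alongside the existing negative cases in internal/ai/audio/webm_test.go.
package audio

import (
	"bytes"
	"os"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestWebMOpusToWAVSuccess(t *testing.T) {
	t.Parallel()

	// Hypothetical fixture path; any small valid WebM/Opus recording works.
	sample, err := os.ReadFile("testdata/sample.webm")
	require.NoError(t, err)

	wav, err := WebMOpusToWAV(sample)
	require.NoError(t, err)

	// A sane decoded payload: more than the 44-byte canonical WAV header.
	require.Greater(t, len(wav), 44)
	// Canonical layout: bytes 0-3 are "RIFF", bytes 8-11 are "WAVE".
	require.True(t, bytes.HasPrefix(wav, []byte("RIFF")))
	require.Equal(t, []byte("WAVE"), wav[8:12])
}
```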
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/ai/audio/webm_test.go` around lines 32 - 48, Add a positive test case to TestWebMOpusToWAV that supplies a small valid WebM/Opus fixture and asserts the successful WAV output: call WebMOpusToWAV with the valid sample bytes, require no error, then assert the returned []byte begins with the "RIFF" header and contains "WAVE" (and optionally a sane decoded payload like a non-zero length or presence of a "fmt " chunk). Place the fixture and assertions in webm_test.go alongside the existing negative cases and reference the WebMOpusToWAV function so the test covers the success path.

server/router/api/v1/test/ai_service_test.go (1)
140-195: ⚡ Quick win
Add a Gemini regression case for the new handler-side guardrails.
This fixture now covers the `STOP` happy path, but it still doesn't pin the new behavior introduced here: rejecting non-`STOP` finish reasons, rejecting blank transcript text, and building transcription instructions in the handler. A small negative Gemini subtest would make this refactor much safer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/superpowers/plans/2026-05-02-stt-audiollm-split.md`:
- Around line 1306-1314: The markdown code fences showing the file list (e.g.,
the block containing "ai.go audio audiollm errors.go models.go resolver.go stt")
are unlabeled and triggering MD040; edit each of the three offending fenced
blocks (including the ones at the other mentioned locations) and add a language
label (use "text") immediately after the opening triple backticks so the fences
read ```text, leaving the block contents unchanged.
- Around line 47-50: The plan's "Untouched" manifest incorrectly lists webm.go
and webm_test.go as untouched; update the staged plan so the audio
implementation files (webm.go and webm_test.go in the internal ai audio area)
are marked as touched/part of this PR (or remove them from the "Untouched" list)
so the manifest matches the actual changes and won't confuse replaying or
reconciling the commit breakdown.
In `@docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md`:
- Around line 71-95: The markdown code fences showing the project file-tree (the
block listing internal/ai/, server/router/api/v1/ai_service.go, proto/store/,
web/src/components/Settings/AISection.tsx, etc.) are unlabeled and triggering
MD040; edit those fenced blocks and add a language tag of "text" (e.g., ```text)
for each occurrence (the two unlabeled blocks around the file-tree sections) so
linters stop flagging them.
In `@internal/ai/audio/webm.go`:
- Around line 53-61: The code validates only length of track.CodecPrivate before
reading header fields, but never checks the OpusHead signature; update the
parsing in the function handling track.CodecPrivate (look for opusHeadMinLength,
track.CodecPrivate, and preSkip) to first verify that the first 8 bytes equal
the ASCII string "OpusHead" and return an error if not, then proceed to validate
channels and read preSkip from bytes 10:12; make the error message descriptive
(e.g., "invalid OpusHead signature") so malformed buffers are rejected
deterministically.
In `@internal/ai/audiollm/gemini/gemini.go`:
- Around line 87-107: The code currently calls io.ReadAll(req.Audio) into
audioBytes before enforcing size limits, which can OOM on large uploads; replace
the unbounded read with a size-capped read (e.g., use io.LimitedReader or read
into a buffer with a cap of maxInlineSize+1) when reading req.Audio so you can
immediately reject inputs that exceed maxInlineSize, then proceed to check
contentType and call audio.WebMOpusToWAV as before but also re-check the
resulting wav size against maxInlineSize (reject if transcoded WAV exceeds the
limit); update the logic around audioBytes, req.Audio, audio.IsWebMContentType,
audio.WebMOpusToWAV and the maxInlineSize checks accordingly.
In `@proto/store/instance_setting.proto`:
- Around line 181-187: The docs block listing transcription model examples
currently promises speaker labels for "gpt-4o-transcribe-diarize"; change that
wording to say it is "diarization-capable" or "supports diarization (speaker
segmentation)" instead of claiming it "includes speaker labels" so you don't
imply the public transcription response includes speaker labels; update the two
relevant comments containing "gpt-4o-transcribe-diarize" and any similar
phrasing in the OPENAI/GEMINI examples to the new wording.
In `@server/router/api/v1/ai_service.go`:
- Around line 117-118: The current error handling at the call site collapses all
errors into codes.Internal; instead preserve helper gRPC statuses by checking
status.FromError(err) and returning the original status if present (or fallback
to codes.Internal only if no gRPC status exists), or alternatively make the
helper transcribeViaAudioLLM return a gRPC status error directly so callers can
return it unchanged; update the call in ai_service.go where
transcribeViaAudioLLM is invoked to use status.FromError(err) and return that
status when available rather than unconditionally wrapping with codes.Internal.
---
Nitpick comments:
In `@internal/ai/audio/webm_test.go`:
- Around line 32-48: Add a positive test case to TestWebMOpusToWAV that supplies
a small valid WebM/Opus fixture and asserts the successful WAV output: call
WebMOpusToWAV with the valid sample bytes, require no error, then assert the
returned []byte begins with the "RIFF" header and contains "WAVE" (and
optionally a sane decoded payload like a non-zero length or presence of a "fmt "
chunk). Place the fixture and assertions in webm_test.go alongside the existing
negative cases and reference the WebMOpusToWAV function so the test covers the
success path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f7857b82-8703-45ad-ae30-484e925687ce
⛔ Files ignored due to path filters (2)
- `go.sum` is excluded by `!**/*.sum`
- `proto/gen/store/instance_setting.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
📒 Files selected for processing (23)
- docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
- docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
- go.mod
- internal/ai/audio/webm.go
- internal/ai/audio/webm_test.go
- internal/ai/audiollm/audiollm.go
- internal/ai/audiollm/gemini/gemini.go
- internal/ai/audiollm/gemini/gemini_test.go
- internal/ai/audiollm/options.go
- internal/ai/client.go
- internal/ai/errors.go
- internal/ai/gemini.go
- internal/ai/openai.go
- internal/ai/stt/openai/openai.go
- internal/ai/stt/openai/openai_test.go
- internal/ai/stt/options.go
- internal/ai/stt/stt.go
- internal/ai/transcription.go
- proto/store/instance_setting.proto
- server/router/api/v1/ai_service.go
- server/router/api/v1/test/ai_service_test.go
- web/src/components/Settings/AISection.tsx
- web/src/locales/en.json
💤 Files with no reviewable changes (5)
- web/src/components/Settings/AISection.tsx
- internal/ai/client.go
- internal/ai/transcription.go
- internal/ai/gemini.go
- internal/ai/openai.go
```
ai.go
audio
audiollm
errors.go
models.go
resolver.go
stt
```
Label these expected-output fences to satisfy markdownlint.
markdownlint is already flagging these blocks with MD040. Adding `text` to the fences keeps the plan lint-clean without changing any content.
Also applies to: 1435-1439, 1580-1588
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 1306-1306: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/plans/2026-05-02-stt-audiollm-split.md` around lines 1306 -
1314, The markdown code fences showing the file list (e.g., the block containing
"ai.go audio audiollm errors.go models.go resolver.go stt") are unlabeled and
triggering MD040; edit each of the three offending fenced blocks (including the
ones at the other mentioned locations) and add a language label (use "text")
immediately after the opening triple backticks so the fences read ```text,
leaving the block contents unchanged.
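For reference, the labeled fence this prompt asks for looks like the following (content unchanged, only the info string added after the opening backticks):

````markdown
```text
ai.go
audio
audiollm
```
````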
```
internal/ai/
  ai.go             # ProviderType, ProviderConfig, errors
  client.go         # NewTranscriber factory, transcriberOptions, normalizeEndpoint, requireAPIKey
  transcription.go  # Transcriber interface, TranscribeRequest, TranscribeResponse
  openai.go         # openAITranscriber → /audio/transcriptions
  openai_test.go
  gemini.go         # geminiTranscriber → generateContent (multimodal)
  gemini_test.go
  models.go         # DefaultOpenAITranscriptionModel, DefaultGeminiTranscriptionModel
  resolver.go       # FindProvider
  errors.go         # ErrProviderNotFound, ErrCapabilityUnsupported
  audio/
    webm.go         # IsWebMContentType, WebMOpusToWAV (used by Gemini path)
    webm_test.go

server/router/api/v1/
  ai_service.go     # Transcribe handler (lines 42–123)

proto/store/
  instance_setting.proto  # InstanceAISetting, AIProviderConfig, AIProviderType, TranscriptionConfig

web/src/components/Settings/
  AISection.tsx     # Provider list UI, TranscriptionForm
```
Add language tags to the file-tree fences.
These two unlabeled blocks are already tripping MD040; `text` is enough here.
Also applies to: 103-128
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 71-71: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md` around lines
71 - 95, The markdown code fences showing the project file-tree (the block
listing internal/ai/, server/router/api/v1/ai_service.go, proto/store/,
web/src/components/Settings/AISection.tsx, etc.) are unlabeled and triggering
MD040; edit those fenced blocks and add a language tag of "text" (e.g., ```text)
for each occurrence (the two unlabeled blocks around the file-tree sections) so
linters stop flagging them.
```go
if len(track.CodecPrivate) < opusHeadMinLength {
	return nil, errors.Errorf("invalid OpusHead: expected at least %d bytes, got %d", opusHeadMinLength, len(track.CodecPrivate))
}

channels := int(track.Audio.Channels)
if channels < 1 || channels > 2 {
	return nil, errors.Errorf("unsupported Opus channel count: %d", channels)
}
preSkip := int(binary.LittleEndian.Uint16(track.CodecPrivate[10:12]))
```
Validate the OpusHead signature before reading header fields.
This only checks the buffer length, then immediately trusts CodecPrivate[10:12] for preSkip. A malformed CodecPrivate with the right length can slip through and fail later with a much less obvious decode error. Rejecting anything that does not start with OpusHead makes bad uploads fail deterministically here.
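For context, the OpusHead ID header layout from RFC 7845 §5.1, which is why pre-skip sits at bytes 10:12:

```go
// OpusHead ID header (RFC 7845 §5.1), 19 bytes minimum:
//   offset  0..7   magic signature "OpusHead"
//   offset  8      version (currently 1)
//   offset  9      output channel count
//   offset 10..11  pre-skip, uint16 little-endian
//   offset 12..15  input sample rate (Hz), uint32 little-endian
//   offset 16..17  output gain (Q7.8 dB), int16 little-endian
//   offset 18      channel mapping family
```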
Suggested fix

```diff
 if len(track.CodecPrivate) < opusHeadMinLength {
 	return nil, errors.Errorf("invalid OpusHead: expected at least %d bytes, got %d", opusHeadMinLength, len(track.CodecPrivate))
 }
+if !bytes.HasPrefix(track.CodecPrivate, []byte("OpusHead")) {
+	return nil, errors.New("invalid OpusHead signature")
+}
 channels := int(track.Audio.Channels)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/ai/audio/webm.go` around lines 53 - 61, The code validates only
length of track.CodecPrivate before reading header fields, but never checks the
OpusHead signature; update the parsing in the function handling
track.CodecPrivate (look for opusHeadMinLength, track.CodecPrivate, and preSkip)
to first verify that the first 8 bytes equal the ASCII string "OpusHead" and
return an error if not, then proceed to validate channels and read preSkip from
bytes 10:12; make the error message descriptive (e.g., "invalid OpusHead
signature") so malformed buffers are rejected deterministically.
```go
audioBytes, err := io.ReadAll(req.Audio)
if err != nil {
	return nil, errors.Wrap(err, "failed to read audio")
}
if len(audioBytes) == 0 {
	return nil, errors.New("audio is required")
}

contentType := req.ContentType
if audio.IsWebMContentType(contentType) {
	wav, err := audio.WebMOpusToWAV(audioBytes)
	if err != nil {
		return nil, errors.Wrap(err, "failed to transcode webm audio for Gemini")
	}
	audioBytes = wav
	contentType = "audio/wav"
}

if len(audioBytes) > maxInlineSize {
	return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
}
```
Bound audio reads before decoding/transcoding
At Line 87, io.ReadAll(req.Audio) loads the entire stream before size enforcement (Line 105), so oversized uploads can consume excessive memory before rejection.
Suggested fix

```diff
 func (m *Model) GenerateFromAudio(ctx context.Context, req audiollm.Request) (*audiollm.Response, error) {
 @@
-	audioBytes, err := io.ReadAll(req.Audio)
+	if req.Size > 0 && req.Size > maxInlineSize {
+		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
+	}
+	audioBytes, err := io.ReadAll(io.LimitReader(req.Audio, maxInlineSize+1))
 	if err != nil {
 		return nil, errors.Wrap(err, "failed to read audio")
 	}
 	if len(audioBytes) == 0 {
 		return nil, errors.New("audio is required")
 	}
+	if len(audioBytes) > maxInlineSize {
+		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
+	}
 @@
-	if len(audioBytes) > maxInlineSize {
+	if len(audioBytes) > maxInlineSize {
 		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/ai/audiollm/gemini/gemini.go` around lines 87 - 107, The code
currently calls io.ReadAll(req.Audio) into audioBytes before enforcing size
limits, which can OOM on large uploads; replace the unbounded read with a
size-capped read (e.g., use io.LimitedReader or read into a buffer with a cap of
maxInlineSize+1) when reading req.Audio so you can immediately reject inputs
that exceed maxInlineSize, then proceed to check contentType and call
audio.WebMOpusToWAV as before but also re-check the resulting wav size against
maxInlineSize (reject if transcoded WAV exceeds the limit); update the logic
around audioBytes, req.Audio, audio.IsWebMContentType, audio.WebMOpusToWAV and
the maxInlineSize checks accordingly.
```proto
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (includes speaker labels)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro
```
Don't promise speaker labels in the config docs yet.
The public transcription response is still text-only, so calling gpt-4o-transcribe-diarize “includes speaker labels” reads like an end-to-end guarantee the current API/UI do not surface. I’d rephrase this as a diarization-capable model until segment metadata is actually exposed.
Suggested wording

```diff
-// - gpt-4o-transcribe-diarize (includes speaker labels)
+// - gpt-4o-transcribe-diarize (speaker-diarization capable)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```proto
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (speaker-diarization capable)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proto/store/instance_setting.proto` around lines 181 - 187, The docs block
listing transcription model examples currently promises speaker labels for
"gpt-4o-transcribe-diarize"; change that wording to say it is
"diarization-capable" or "supports diarization (speaker segmentation)" instead
of claiming it "includes speaker labels" so you don't imply the public
transcription response includes speaker labels; update the two relevant comments
containing "gpt-4o-transcribe-diarize" and any similar phrasing in the
OPENAI/GEMINI examples to the new wording.
| if err != nil { | ||
| return nil, status.Errorf(codes.InvalidArgument, "failed to create AI transcriber: %v", err) | ||
| return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err) |
🧩 Analysis chain
🏁 Script executed:

```sh
wc -l server/router/api/v1/ai_service.go
```

Repository: usememos/memos
Length of output: 97

🏁 Script executed:

```sh
sed -n '100,130p' server/router/api/v1/ai_service.go
```

Repository: usememos/memos
Length of output: 1017

🏁 Script executed:

```sh
sed -n '160,185p' server/router/api/v1/ai_service.go
```

Repository: usememos/memos
Length of output: 979

🏁 Script executed:

```sh
rg -n "status\.Errorf|errors\.(Wrap|Errorf|New)" server/router/api/v1/ai_service.go | head -30
```

Repository: usememos/memos
Length of output: 1636

🏁 Script executed:

```sh
rg -B3 -A3 "errors\.Errorf|errors\.New" server/router/api/v1/ai_service.go | grep -A3 -B3 "return \"\"" | head -40
```

Repository: usememos/memos
Length of output: 362

🏁 Script executed:

```sh
grep -r "status\.FromError" server/router/api/ 2>/dev/null | head -5
```

Repository: usememos/memos
Length of output: 222
Preserve gRPC error codes from helpers instead of collapsing to codes.Internal.
The helper function transcribeViaAudioLLM (lines 174, 177) returns specific, meaningful errors for incomplete responses and empty text, but line 118 rewrites all helper errors to codes.Internal. This masks user-actionable failures (FailedPrecondition) as server faults and removes the semantic distinction these checks provide.
To fix, either preserve status codes from helpers that already return gRPC errors, or have helpers return status errors directly:
Option 1: Preserve existing status codes at line 118
```diff
 if err != nil {
-	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
+	if st, ok := status.FromError(err); ok {
+		return nil, st.Err()
+	}
+	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
 }
```

Option 2: Return gRPC status from helper (lines 174–177)
```diff
 if resp.FinishReason != audiollm.FinishStop {
-	return "", errors.Errorf("transcription incomplete (finish reason: %s)", resp.FinishReason)
+	return "", status.Errorf(codes.FailedPrecondition, "transcription incomplete (finish reason: %s)", resp.FinishReason)
 }
 if strings.TrimSpace(resp.Text) == "" {
-	return "", errors.New("transcription response did not include text")
+	return "", status.Errorf(codes.FailedPrecondition, "transcription response did not include text")
 }
```

The pattern `status.FromError()` is already used in this package (connect_handler.go).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if err != nil {
	if st, ok := status.FromError(err); ok {
		return nil, st.Err()
	}
	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/router/api/v1/ai_service.go` around lines 117 - 118, The current error
handling at the call site collapses all errors into codes.Internal; instead
preserve helper gRPC statuses by checking status.FromError(err) and returning
the original status if present (or fallback to codes.Internal only if no gRPC
status exists), or alternatively make the helper transcribeViaAudioLLM return a
gRPC status error directly so callers can return it unchanged; update the call
in ai_service.go where transcribeViaAudioLLM is invoked to use
status.FromError(err) and return that status when available rather than
unconditionally wrapping with codes.Internal.
These two deps were added as // indirect when WebM transcoding first landed, but internal/ai/audio/webm.go imports them directly, so go mod tidy moves them to the direct require block. CI Static Checks runs `go mod tidy && git diff --exit-code`, which failed before this fix.
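Illustratively, the resulting go.mod movement looks like this (module paths follow from the imports named above; the versions shown are placeholders, not the PR's pinned ones):

```diff
 require (
+	github.com/at-wat/ebml-go vX.Y.Z // placeholder version
+	github.com/pion/opus vX.Y.Z      // placeholder version
 	...
 )

 require (
-	github.com/at-wat/ebml-go vX.Y.Z // indirect
-	github.com/pion/opus vX.Y.Z // indirect
 	...
 )
```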
🧹 Nitpick comments (2)
internal/ai/audiollm/gemini/gemini_test.go (2)
20-86: ⚡ Quick win
Non-Stop finish-reason paths have no test coverage.
`mapFinishReason`'s `SAFETY → FinishSafety`, `MAX_TOKENS → FinishLength`, and `default → FinishOther` branches are untested. The PR description specifically calls out improved finish-reason surfacing as a key behavior change, so these paths warrant direct coverage.

✅ Proposed table-driven test for non-Stop finish reasons
```go
func TestGenerateFromAudioFinishReasons(t *testing.T) {
	t.Parallel()
	cases := []struct {
		geminiReason string
		want         audiollm.FinishReason
	}{
		{"STOP", audiollm.FinishStop},
		{"SAFETY", audiollm.FinishSafety},
		{"MAX_TOKENS", audiollm.FinishLength},
		{"RECITATION", audiollm.FinishOther},
	}
	for _, tc := range cases {
		tc := tc
		t.Run(tc.geminiReason, func(t *testing.T) {
			t.Parallel()
			server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
				w.Header().Set("Content-Type", "application/json")
				require.NoError(t, json.NewEncoder(w).Encode(map[string]any{
					"candidates": []map[string]any{{
						"finishReason": tc.geminiReason,
						"content":      map[string]any{"parts": []map[string]string{{"text": ""}}},
					}},
				}))
			}))
			defer server.Close()
			model, err := audiollmgemini.New(ai.ProviderConfig{
				Type:     ai.ProviderGemini,
				Endpoint: server.URL + "/v1beta",
				APIKey:   "test-key",
			}, audiollm.ApplyOptions(nil))
			require.NoError(t, err)
			resp, err := model.GenerateFromAudio(context.Background(), audiollm.Request{
				Model:        "gemini-2.5-flash",
				ContentType:  "audio/mp3",
				Audio:        strings.NewReader("audio bytes"),
				Instructions: "transcribe",
			})
			require.NoError(t, err) // GenerateFromAudio does not error on non-Stop; the handler does
			require.Equal(t, tc.want, resp.FinishReason)
		})
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/ai/audiollm/gemini/gemini_test.go` around lines 20 - 86, The test suite lacks coverage for non-"STOP" finish reasons mapped by mapFinishReason; add a table-driven test that exercises GenerateFromAudio's finish-reason mapping for "SAFETY"→FinishSafety, "MAX_TOKENS"→FinishLength and an unknown value (e.g., "RECITATION")→FinishOther. Implement a new TestGenerateFromAudioFinishReasons that spins up httptest servers returning candidates with different "finishReason" strings, constructs audiollmgemini.New with the test server endpoint, calls model.GenerateFromAudio (using same request fields as TestGenerateFromAudio but varying the mock response), and asserts resp.FinishReason equals the expected audiollm.FinishReason for each case; reference mapFinishReason, GenerateFromAudio, and TestGenerateFromAudio when locating where to add this test.
20-86: ⚡ Quick win
WebM → WAV transcoding path is not exercised despite being called out in the plan.
Plan step C.1 states "We also verify that WebM input is transcoded to WAV", but both tests use `audio/mpeg` and never trigger the `IsWebMContentType → WebMOpusToWAV` branch. A minimal test using a valid WebM/Opus byte fixture would cover that path and the resulting `audio/wav` content-type assertion on the wire. A sketch follows.
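A sketch of that test, building on the adapter API from the proposed finish-reason test above. The testdata/sample.webm fixture path and the raw-body substring check are simplifying assumptions; a stricter test would decode the request JSON and compare the inline data against WebMOpusToWAV output.

```go
package gemini_test

import (
	"bytes"
	"context"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"testing"

	"github.com/stretchr/testify/require"

	"github.com/usememos/memos/internal/ai"
	"github.com/usememos/memos/internal/ai/audiollm"
	audiollmgemini "github.com/usememos/memos/internal/ai/audiollm/gemini"
)

func TestGenerateFromAudioWebMTranscode(t *testing.T) {
	t.Parallel()

	// Hypothetical fixture: a short browser-style WebM/Opus recording.
	webmFixture, err := os.ReadFile("testdata/sample.webm")
	require.NoError(t, err)

	var gotBody []byte
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		gotBody, _ = io.ReadAll(r.Body)
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write([]byte(`{"candidates":[{"finishReason":"STOP","content":{"parts":[{"text":"ok"}]}}]}`))
	}))
	defer server.Close()

	model, err := audiollmgemini.New(ai.ProviderConfig{
		Type:     ai.ProviderGemini,
		Endpoint: server.URL + "/v1beta",
		APIKey:   "test-key",
	}, audiollm.ApplyOptions(nil))
	require.NoError(t, err)

	_, err = model.GenerateFromAudio(context.Background(), audiollm.Request{
		Model:        "gemini-2.5-flash",
		ContentType:  "audio/webm", // triggers the IsWebMContentType → WebMOpusToWAV branch
		Audio:        bytes.NewReader(webmFixture),
		Instructions: "transcribe",
	})
	require.NoError(t, err)

	// Coarse wire-level assertion: the transcoded inline part must be audio/wav.
	require.Contains(t, string(gotBody), "audio/wav")
	require.NotContains(t, string(gotBody), "audio/webm")
}
```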
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/ai/audiollm/gemini/gemini_test.go` around lines 20 - 86, The test suite never exercises the WebM→WAV branch; add a new parallel test (or extend TestGenerateFromAudio) that calls audiollmgemini.GenerateFromAudio with ContentType "audio/webm" (so IsWebMContentType triggers) using a small valid WebM/Opus fixture reader and assert the outgoing request's part MIMEType is "audio/wav" and the inlineData.Data decodes to the expected WAV bytes produced by WebMOpusToWAV; reference the test function TestGenerateFromAudio, the method GenerateFromAudio, and the branch guard IsWebMContentType/WebMOpusToWAV so reviewers can locate the code, and include fixture byte slices (e.g., webmFixture and expectedWav) to keep the test deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/ai/audiollm/gemini/gemini_test.go`:
- Around line 20-86: The test suite lacks coverage for non-"STOP" finish reasons
mapped by mapFinishReason; add a table-driven test that exercises
GenerateFromAudio's finish-reason mapping for "SAFETY"→FinishSafety,
"MAX_TOKENS"→FinishLength and an unknown value (e.g., "RECITATION")→FinishOther.
Implement a new TestGenerateFromAudioFinishReasons that spins up httptest
servers returning candidates with different "finishReason" strings, constructs
audiollmgemini.New with the test server endpoint, calls model.GenerateFromAudio
(using same request fields as TestGenerateFromAudio but varying the mock
response), and asserts resp.FinishReason equals the expected
audiollm.FinishReason for each case; reference mapFinishReason,
GenerateFromAudio, and TestGenerateFromAudio when locating where to add this
test.
- Around line 20-86: The test suite never exercises the WebM→WAV branch; add a
new parallel test (or extend TestGenerateFromAudio) that calls
audiollmgemini.GenerateFromAudio with ContentType "audio/webm" (so
IsWebMContentType triggers) using a small valid WebM/Opus fixture reader and
assert the outgoing request's part MIMEType is "audio/wav" and the
inlineData.Data decodes to the expected WAV bytes produced by WebMOpusToWAV;
reference the test function TestGenerateFromAudio, the method GenerateFromAudio,
and the branch guard IsWebMContentType/WebMOpusToWAV so reviewers can locate the
code, and include fixture byte slices (e.g., webmFixture and expectedWav) to
keep the test deterministic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 56ea4848-fb70-462d-bae3-f55ed3b7cecf
⛔ Files ignored due to path filters (2)
- `go.sum` is excluded by `!**/*.sum`
- `proto/gen/store/instance_setting.pb.go` is excluded by `!**/*.pb.go`, `!**/gen/**`
📒 Files selected for processing (23)
- docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
- docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
- go.mod
- internal/ai/audio/webm.go
- internal/ai/audio/webm_test.go
- internal/ai/audiollm/audiollm.go
- internal/ai/audiollm/gemini/gemini.go
- internal/ai/audiollm/gemini/gemini_test.go
- internal/ai/audiollm/options.go
- internal/ai/client.go
- internal/ai/errors.go
- internal/ai/gemini.go
- internal/ai/openai.go
- internal/ai/stt/openai/openai.go
- internal/ai/stt/openai/openai_test.go
- internal/ai/stt/options.go
- internal/ai/stt/stt.go
- internal/ai/transcription.go
- proto/store/instance_setting.proto
- server/router/api/v1/ai_service.go
- server/router/api/v1/test/ai_service_test.go
- web/src/components/Settings/AISection.tsx
- web/src/locales/en.json
💤 Files with no reviewable changes (5)
- internal/ai/transcription.go
- web/src/components/Settings/AISection.tsx
- internal/ai/openai.go
- internal/ai/client.go
- internal/ai/gemini.go
✅ Files skipped from review due to trivial changes (9)
- internal/ai/errors.go
- internal/ai/stt/options.go
- internal/ai/audiollm/audiollm.go
- go.mod
- server/router/api/v1/test/ai_service_test.go
- internal/ai/stt/openai/openai_test.go
- internal/ai/audio/webm.go
- web/src/locales/en.json
- proto/store/instance_setting.proto
🚧 Files skipped from review as they are similar to previous changes (6)
- internal/ai/audio/webm_test.go
- internal/ai/audiollm/gemini/gemini.go
- internal/ai/stt/openai/openai.go
- internal/ai/stt/stt.go
- internal/ai/audiollm/options.go
- server/router/api/v1/ai_service.go
Summary
Refactors `internal/ai/` from a single `Transcriber` interface (which dressed Gemini's multimodal LLM up as STT) into two honest interfaces, aligned with mainstream OSS conventions (Vercel AI SDK, LiteLLM, Go AI ecosystem):

- `internal/ai/stt/` — deterministic speech-to-text (OpenAI Whisper + gpt-4o-transcribe family, plus any OpenAI-compatible STT endpoint via the `Endpoint` field — Groq Whisper, Azure Whisper, faster-whisper self-host)
- `internal/ai/audiollm/` — multimodal audio LLMs (Gemini generateContent; reserved for future gpt-4o-audio support)

The `Transcribe` gRPC handler now dispatches to the right package based on `provider.Type`. Transcription instructions are extracted from the Gemini package into the handler, so `audiollm/gemini` is a generic audio-LLM adapter — opening the door to future "audio → structured note" pipelines without further refactoring.

User-visible improvements

- A non-`Stop` finish reason from Gemini (safety filter, max-tokens, recitation, prohibited-content, blocklist, SPII, etc.) surfaces as `transcription incomplete (finish reason: safety)` instead of the previous generic `did not include text`.
- Browser-recorded WebM works transparently: a pure-Go decoder (`ebml-go` + `pion/opus`) transcodes to WAV server-side; the misleading "Gemini does not accept WebM" UI warning is gone.
- The model picker shows current options: OpenAI lists `whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize`; Gemini lists `gemini-2.5-flash, gemini-2.5-pro`. Prompt help text explains the Whisper-soft-hint vs Gemini-literal-instruction difference.

Schema and contract

- No proto field changes; two comment-only updates on `TranscriptionConfig.model` and `.prompt`.
- No public API changes; `TranscribeRequest`/`TranscribeResponse` shapes unchanged.
- No DB migration; `InstanceAISetting` storage unchanged.
- `Duration` dropped from the internal STT response — it's a property of the audio file, not the transcription provider; it doesn't belong on the response shape. The public `TranscribeResponse` only ever returned `Text`, so no API impact.

Background

Spec and research findings live in:

- docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
- docs/superpowers/plans/2026-05-02-stt-audiollm-split.md

Verified against six OSS reference implementations: Vercel AI SDK, LiteLLM, cloudwego/eino + tmc/langchaingo + sashabaranov/go-openai (SDK layer); Open WebUI, LobeChat, Dify (application layer). All six split STT and multimodal-audio into separate paths; none add STT-specific fields onto a generic provider entity.
Test Plan
- `go build ./...` clean
- `go test ./...` clean (excluding Docker-dependent `store/test`)
- `TestTranscribe` subtests pass (including the Gemini handler test, with mock updated to send realistic `finishReason: STOP`)
- `cd web && pnpm build` clean (no TS errors)
- Gemini failures surface `finish reason: ...` instead of the generic message

🤖 Generated with Claude Code