
refactor(ai): split STT and Audio-LLM into separate interfaces #5928

Merged
boojack merged 2 commits into main from stt-audiollm-split on May 2, 2026
Conversation

@boojack
Member

@boojack boojack commented May 2, 2026

Summary

Refactors internal/ai/ from a single Transcriber interface (which dressed Gemini's multimodal LLM up as STT) into two honest interfaces, aligned with mainstream OSS conventions (Vercel AI SDK, LiteLLM, Go AI ecosystem):

  • internal/ai/stt/ — deterministic speech-to-text (OpenAI Whisper + gpt-4o-transcribe family, plus any OpenAI-compatible STT endpoint via the Endpoint field — Groq Whisper, Azure Whisper, faster-whisper self-host)
  • internal/ai/audiollm/ — multimodal audio LLMs (Gemini generateContent; reserved for future gpt-4o-audio support)
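As a rough sketch of the split (stt.Transcriber, audiollm.Model, and FinishReason are named in this PR; the field names and exact method signatures below are illustrative guesses, not the actual code):

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// Shared request shape for illustration only; in the real split the
// stt and audiollm packages each define their own Request type.
type Request struct {
	Audio       io.Reader
	ContentType string
	Model       string
	Prompt      string
}

// stt side: a deterministic transcriber that turns audio into text.
type Response struct {
	Text string
}

type Transcriber interface {
	Transcribe(ctx context.Context, req Request) (*Response, error)
}

// audiollm side: a generic multimodal model that generates content
// from audio plus instructions and reports a finish reason the
// caller can inspect.
type FinishReason string

const (
	FinishStop   FinishReason = "stop"
	FinishSafety FinishReason = "safety"
)

type LLMResponse struct {
	Text         string
	FinishReason FinishReason
}

type Model interface {
	GenerateFromAudio(ctx context.Context, req Request) (*LLMResponse, error)
}

func main() {
	fmt.Println(FinishStop, FinishSafety)
}
```

The key design point is that only the audiollm interface carries a finish reason; deterministic STT either returns text or fails.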

The Transcribe gRPC handler now dispatches to the right package based on provider.Type. Transcription instructions are extracted from the Gemini package into the handler, so audiollm/gemini is a generic audio-LLM adapter — opening the door to future "audio → structured note" pipelines without further refactoring.
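The caller-side dispatch can be sketched like this (the provider-type values and helper names follow the PR description; the function body is a simplified stand-in for the real handler, not its actual code):

```go
package main

import (
	"errors"
	"fmt"
)

type ProviderType string

const (
	ProviderOpenAI ProviderType = "OPENAI"
	ProviderGemini ProviderType = "GEMINI"
)

// dispatch mirrors the handler-side routing described in the PR:
// OpenAI goes through the stt package, Gemini through audiollm.
// It returns the name of the helper that would be invoked.
func dispatch(t ProviderType) (string, error) {
	switch t {
	case ProviderOpenAI:
		return "transcribeViaSTT", nil
	case ProviderGemini:
		return "transcribeViaAudioLLM", nil
	default:
		return "", errors.New("unsupported provider type")
	}
}

func main() {
	route, _ := dispatch(ProviderGemini)
	fmt.Println(route) // transcribeViaAudioLLM
}
```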

User-visible improvements

  • Gemini failure modes are now informative. A non-Stop finish reason from Gemini (safety filter, max-tokens, recitation, prohibited-content, blocklist, SPII, etc.) surfaces as transcription incomplete (finish reason: safety) instead of the previous generic did not include text.
  • Browser-recorded WebM works with Gemini transparently. Pure-Go decoder (ebml-go + pion/opus) transcodes to WAV server-side; the misleading "Gemini does not accept WebM" UI warning is gone.
  • Model picker UI shows current options. OpenAI placeholder lists whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize; Gemini lists gemini-2.5-flash, gemini-2.5-pro. Prompt help text explains the Whisper-soft-hint vs Gemini-literal-instruction difference.
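To illustrate the WebM handling above, a content-type check along these lines could gate the server-side transcode path (a plausible sketch only, not the actual IsWebMContentType implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// isWebMContentType reports whether a MIME type looks like WebM.
// Browser recorders commonly report "audio/webm" or
// "audio/webm;codecs=opus", so parameters after ";" are ignored.
func isWebMContentType(ct string) bool {
	ct = strings.ToLower(strings.TrimSpace(ct))
	base, _, _ := strings.Cut(ct, ";")
	base = strings.TrimSpace(base)
	return base == "audio/webm" || base == "video/webm"
}

func main() {
	fmt.Println(isWebMContentType("audio/webm;codecs=opus"))
	fmt.Println(isWebMContentType("audio/wav"))
}
```

When the check matches, the Gemini path would transcode to WAV before building the inline request; other content types pass through untouched.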

Schema and contract

  • No proto field changes — only two comment improvements on TranscriptionConfig.model and .prompt.
  • No public API changes: TranscribeRequest/TranscribeResponse shapes unchanged.
  • No DB migration: InstanceAISetting storage unchanged.
  • Duration dropped from internal STT response — it's a property of the audio file, not the transcription provider; doesn't belong on the response shape. Public TranscribeResponse only ever returned Text, so no API impact.

Background

Spec and research findings live in:

  • docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
  • docs/superpowers/plans/2026-05-02-stt-audiollm-split.md

Verified against six OSS reference implementations: Vercel AI SDK, LiteLLM, cloudwego/eino + tmc/langchaingo + sashabaranov/go-openai (SDK layer); Open WebUI, LobeChat, Dify (application layer). All six split STT and multimodal-audio into separate paths; none add STT-specific fields onto a generic provider entity.

Test Plan

  • go build ./... clean
  • go test ./... clean (excluding Docker-dependent store/test)
  • All 7 TestTranscribe subtests pass (including the Gemini handler test, with mock updated to send realistic finishReason: STOP)
  • cd web && pnpm build clean (no TS errors)
  • Manual smoke test: configure an OpenAI provider, upload short audio via voice-input feature, transcript appears
  • Manual smoke test: configure a Gemini provider, upload browser-recorded WebM via voice-input feature, transcript appears (validates WebM transcoding path)
  • Manual smoke test: deliberately bad/silent audio with Gemini produces a clearer error than before (finish reason: ... instead of generic message)

🤖 Generated with Claude Code

@boojack boojack requested a review from a team as a code owner May 2, 2026 16:05
@coderabbitai
Contributor

coderabbitai Bot commented May 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fec8757a-330f-4b66-a6e7-56e1ef7d7321

📥 Commits

Reviewing files that changed from the base of the PR and between e25b04a and 6a1c71d.

📒 Files selected for processing (1)
  • go.mod
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

📝 Walkthrough

Splits internal/ai into stt and audiollm packages; adds WebM/Opus→WAV decoding; implements OpenAI STT and Gemini AudioLLM adapters (with options and tests); updates Transcribe to dispatch by provider.Type; removes legacy combined transcription code; updates proto comments and frontend i18n.

Changes

STT / AudioLLM Refactor

  • Documentation & Design (docs/superpowers/plans/2026-05-02-stt-audiollm-split.md, docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md): Added staged implementation plan and design spec describing the new stt/audiollm package split, factory/dispatch strategy, sentinel errors, proto comment updates, and frontend i18n changes.
  • Dependencies (go.mod): Added dependencies for EBML and Opus decoding: github.com/at-wat/ebml-go and github.com/pion/opus.
  • Audio Infrastructure (internal/ai/audio/webm.go, internal/ai/audio/webm_test.go): Added WebMOpusToWAV (WebM/Opus → 16-bit PCM WAV) and IsWebMContentType, with unit tests and validation/error handling for truncated/invalid inputs.
  • Core Interfaces / Types (internal/ai/stt/stt.go, internal/ai/audiollm/audiollm.go, internal/ai/errors.go): Introduced stt.Transcriber (Request/Response/Segment), audiollm.Model (Request/Response and FinishReason constants), and sentinel errors ErrSTTNotSupported / ErrAudioLLMNotSupported.
  • Options / Config Helpers (internal/ai/stt/options.go, internal/ai/audiollm/options.go): Added package-scoped Options, functional option types, WithHTTPClient, and ApplyOptions with default HTTP timeout behavior.
  • Provider Implementations (internal/ai/stt/openai/openai.go, internal/ai/stt/openai/openai_test.go, internal/ai/audiollm/gemini/gemini.go, internal/ai/audiollm/gemini/gemini_test.go): Implemented the OpenAI STT transcriber (endpoint normalization, API key validation, SDK usage, audio metadata handling) and the Gemini audio-LLM model (audio read, WebM→WAV fallback, size limits, content-type normalization, generation call, finish-reason mapping); updated and added provider tests.
  • Handler Integration & Dispatch (server/router/api/v1/ai_service.go, server/router/api/v1/test/ai_service_test.go): Refactored Transcribe to switch on provider.Type and route to transcribeViaSTT (OpenAI) or transcribeViaAudioLLM (Gemini); added buildTranscriptionInstructions; enforced FinishStop for audiollm; updated the test mock finishReason.
  • Removal / Cleanup (internal/ai/client.go, internal/ai/transcription.go, internal/ai/openai.go, internal/ai/gemini.go): Deleted the legacy monolithic transcription API, the provider-agnostic transcriber factory, and the previous combined OpenAI/Gemini transcriber implementations.
  • Proto Comments & Frontend (proto/store/instance_setting.proto, web/src/components/Settings/AISection.tsx, web/src/locales/en.json): Updated proto field comments for TranscriptionConfig.model and prompt (comment-only); changed the settings UI to show a generic API-key warning when a provider has no key; expanded transcription i18n keys and model placeholder/help strings; removed the Gemini-specific WebM warning.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 36.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check ✅ Passed: The title accurately describes the main refactoring effort: splitting the monolithic transcription interface into two specialized packages (STT and Audio-LLM), which is the core objective of this PR.
  • Description check ✅ Passed: The description comprehensively relates to the changeset, covering the refactoring goals, implementation details, user-visible improvements, schema/contract implications, testing strategy, and design documentation.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.


Refactors internal/ai/ from a single Transcriber interface (which dressed
Gemini's multimodal LLM up as STT) into two honest interfaces, aligned
with mainstream OSS conventions (Vercel AI SDK, LiteLLM, Go AI ecosystem):

- internal/ai/stt/ — deterministic speech-to-text (OpenAI Whisper +
  gpt-4o-transcribe family, plus any OpenAI-compatible STT endpoint via
  the Endpoint field — Groq Whisper, Azure Whisper, faster-whisper)
- internal/ai/audiollm/ — multimodal audio LLMs (Gemini generateContent;
  reserved for future gpt-4o-audio support)

The Transcribe gRPC handler now dispatches to the right package based on
provider.Type. Transcription instructions are extracted from the Gemini
package into the handler, so audiollm/gemini is a generic audio-LLM
adapter that the handler steers — opening the door to future
"audio → structured note" pipelines without further refactoring.

User-visible improvements:

- Gemini failure modes are now informative. A non-Stop finish reason
  from Gemini (safety filter, max-tokens, recitation, prohibited-content,
  blocklist, SPII, image-safety variants) surfaces as
  "transcription incomplete (finish reason: safety)" instead of the
  previous generic "did not include text".
- Browser-recorded WebM works with Gemini transparently. Pure-Go decoder
  (at-wat/ebml-go + pion/opus) transcodes to WAV server-side; the
  misleading "Gemini does not accept WebM" UI warning is gone.
- Model picker UI shows current options. OpenAI placeholder lists
  whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe,
  gpt-4o-transcribe-diarize. Gemini lists gemini-2.5-flash, 2.5-pro.
  Prompt help text explains the Whisper-soft-hint vs Gemini-literal-
  instruction difference.

Schema and contract:

- No proto field changes. Two comment-only updates on
  TranscriptionConfig.model and .prompt to reflect current model
  families and the cross-provider prompt semantics.
- No public API changes. TranscribeRequest/TranscribeResponse shapes
  unchanged.
- No DB migration. InstanceAISetting storage unchanged.
- Duration dropped from internal STT response shape — it's a property
  of the audio file, not the transcription provider. The public
  TranscribeResponse only ever returned Text, so no API impact.

Notable internal changes:

- No factory layer between handler and provider implementations.
  A factory in stt/ would have imported stt/openai which imports stt
  for types — Go import cycle. The handler already switches on
  provider.Type, so the factory was redundant indirection. Matches the
  cloudwego/eino and tmc/langchaingo convention (per-provider
  constructors, caller-side dispatch).
- New sentinel errors: ErrSTTNotSupported, ErrAudioLLMNotSupported.
  Available for callers that want errors.Is checks; the handler uses
  the inline switch instead.
- audiollm.FinishReason maps eight genai safety-related reasons
  (Safety, Recitation, ProhibitedContent, SPII, Blocklist,
  ImageSafety, ImageProhibitedContent, ImageRecitation) to FinishSafety
  so the handler error surfaces a clear bucket regardless of which
  content-policy reason Gemini reports.
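The safety-bucket mapping described above can be sketched as follows (the raw reason strings approximate the genai SDK constants and may not match them exactly; FinishOther is an illustrative catch-all):

```go
package main

import "fmt"

type FinishReason string

const (
	FinishStop   FinishReason = "stop"
	FinishSafety FinishReason = "safety"
	FinishOther  FinishReason = "other"
)

// mapFinishReason collapses the eight content-policy variants listed
// above into a single "safety" bucket so the handler can surface one
// clear error regardless of which policy reason Gemini reports.
func mapFinishReason(raw string) FinishReason {
	switch raw {
	case "STOP":
		return FinishStop
	case "SAFETY", "RECITATION", "PROHIBITED_CONTENT", "SPII",
		"BLOCKLIST", "IMAGE_SAFETY", "IMAGE_PROHIBITED_CONTENT",
		"IMAGE_RECITATION":
		return FinishSafety
	default:
		return FinishOther
	}
}

func main() {
	fmt.Println(mapFinishReason("SPII"))
	fmt.Println(mapFinishReason("STOP"))
}
```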

Background:

Spec and research findings live in:
- docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
- docs/superpowers/plans/2026-05-02-stt-audiollm-split.md

Verified against six OSS reference implementations: Vercel AI SDK,
LiteLLM, cloudwego/eino + tmc/langchaingo + sashabaranov/go-openai
(SDK layer); Open WebUI, LobeChat, Dify (application layer). All six
split STT and multimodal-audio into separate paths; none add
STT-specific fields onto a generic provider entity.
@boojack boojack force-pushed the stt-audiollm-split branch from f5ec03f to e25b04a Compare May 2, 2026 16:09
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
internal/ai/audio/webm_test.go (1)

32-48: 🏗️ Heavy lift

Add a success-path fixture for the transcoder.

Right now the new codec path is only covered through failures. A small valid WebM/Opus sample that asserts RIFF/WAVE output and a sane decoded payload would give this feature real regression coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/ai/audio/webm_test.go` around lines 32 - 48, Add a positive test
case to TestWebMOpusToWAV that supplies a small valid WebM/Opus fixture and
asserts the successful WAV output: call WebMOpusToWAV with the valid sample
bytes, require no error, then assert the returned []byte begins with the "RIFF"
header and contains "WAVE" (and optionally a sane decoded payload like a
non-zero length or presence of a "fmt " chunk). Place the fixture and assertions
in webm_test.go alongside the existing negative cases and reference the
WebMOpusToWAV function so the test covers the success path.
server/router/api/v1/test/ai_service_test.go (1)

140-195: ⚡ Quick win

Add a Gemini regression case for the new handler-side guardrails.

This fixture now covers the STOP happy path, but it still doesn't pin the new behavior introduced here: rejecting non-STOP finish reasons, rejecting blank transcript text, and building transcription instructions in the handler. A small negative Gemini subtest would make this refactor much safer.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/superpowers/plans/2026-05-02-stt-audiollm-split.md`:
- Around line 1306-1314: The markdown code fences showing the file list (e.g.,
the block containing "ai.go audio audiollm errors.go models.go resolver.go stt")
are unlabeled and triggering MD040; edit each of the three offending fenced
blocks (including the ones at the other mentioned locations) and add a language
label (use "text") immediately after the opening triple backticks so the fences
read ```text, leaving the block contents unchanged.
- Around line 47-50: The plan's "Untouched" manifest incorrectly lists webm.go
and webm_test.go as untouched; update the staged plan so the audio
implementation files (webm.go and webm_test.go in the internal ai audio area)
are marked as touched/part of this PR (or remove them from the "Untouched" list)
so the manifest matches the actual changes and won't confuse replaying or
reconciling the commit breakdown.

In `@docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md`:
- Around line 71-95: The markdown code fences showing the project file-tree (the
block listing internal/ai/, server/router/api/v1/ai_service.go, proto/store/,
web/src/components/Settings/AISection.tsx, etc.) are unlabeled and triggering
MD040; edit those fenced blocks and add a language tag of "text" (e.g., ```text)
for each occurrence (the two unlabeled blocks around the file-tree sections) so
linters stop flagging them.

In `@internal/ai/audio/webm.go`:
- Around line 53-61: The code validates only length of track.CodecPrivate before
reading header fields, but never checks the OpusHead signature; update the
parsing in the function handling track.CodecPrivate (look for opusHeadMinLength,
track.CodecPrivate, and preSkip) to first verify that the first 8 bytes equal
the ASCII string "OpusHead" and return an error if not, then proceed to validate
channels and read preSkip from bytes 10:12; make the error message descriptive
(e.g., "invalid OpusHead signature") so malformed buffers are rejected
deterministically.

In `@internal/ai/audiollm/gemini/gemini.go`:
- Around line 87-107: The code currently calls io.ReadAll(req.Audio) into
audioBytes before enforcing size limits, which can OOM on large uploads; replace
the unbounded read with a size-capped read (e.g., use io.LimitedReader or read
into a buffer with a cap of maxInlineSize+1) when reading req.Audio so you can
immediately reject inputs that exceed maxInlineSize, then proceed to check
contentType and call audio.WebMOpusToWAV as before but also re-check the
resulting wav size against maxInlineSize (reject if transcoded WAV exceeds the
limit); update the logic around audioBytes, req.Audio, audio.IsWebMContentType,
audio.WebMOpusToWAV and the maxInlineSize checks accordingly.

In `@proto/store/instance_setting.proto`:
- Around line 181-187: The docs block listing transcription model examples
currently promises speaker labels for "gpt-4o-transcribe-diarize"; change that
wording to say it is "diarization-capable" or "supports diarization (speaker
segmentation)" instead of claiming it "includes speaker labels" so you don't
imply the public transcription response includes speaker labels; update the two
relevant comments containing "gpt-4o-transcribe-diarize" and any similar
phrasing in the OPENAI/GEMINI examples to the new wording.

In `@server/router/api/v1/ai_service.go`:
- Around line 117-118: The current error handling at the call site collapses all
errors into codes.Internal; instead preserve helper gRPC statuses by checking
status.FromError(err) and returning the original status if present (or fallback
to codes.Internal only if no gRPC status exists), or alternatively make the
helper transcribeViaAudioLLM return a gRPC status error directly so callers can
return it unchanged; update the call in ai_service.go where
transcribeViaAudioLLM is invoked to use status.FromError(err) and return that
status when available rather than unconditionally wrapping with codes.Internal.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7857b82-8703-45ad-ae30-484e925687ce

📥 Commits

Reviewing files that changed from the base of the PR and between 238f27d and f5ec03f.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • proto/gen/store/instance_setting.pb.go is excluded by !**/*.pb.go, !**/gen/**
📒 Files selected for processing (23)
  • docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
  • docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
  • go.mod
  • internal/ai/audio/webm.go
  • internal/ai/audio/webm_test.go
  • internal/ai/audiollm/audiollm.go
  • internal/ai/audiollm/gemini/gemini.go
  • internal/ai/audiollm/gemini/gemini_test.go
  • internal/ai/audiollm/options.go
  • internal/ai/client.go
  • internal/ai/errors.go
  • internal/ai/gemini.go
  • internal/ai/openai.go
  • internal/ai/stt/openai/openai.go
  • internal/ai/stt/openai/openai_test.go
  • internal/ai/stt/options.go
  • internal/ai/stt/stt.go
  • internal/ai/transcription.go
  • proto/store/instance_setting.proto
  • server/router/api/v1/ai_service.go
  • server/router/api/v1/test/ai_service_test.go
  • web/src/components/Settings/AISection.tsx
  • web/src/locales/en.json
💤 Files with no reviewable changes (5)
  • web/src/components/Settings/AISection.tsx
  • internal/ai/client.go
  • internal/ai/transcription.go
  • internal/ai/gemini.go
  • internal/ai/openai.go

Comment thread docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
Comment on lines +1306 to +1314
```
ai.go
audio
audiollm
errors.go
models.go
resolver.go
stt
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Label these expected-output fences to satisfy markdownlint.

markdownlint is already flagging these blocks with MD040. Adding text to the fences keeps the plan lint-clean without changing any content.

Also applies to: 1435-1439, 1580-1588

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 1306-1306: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +71 to +95
```
internal/ai/
ai.go # ProviderType, ProviderConfig, errors
client.go # NewTranscriber factory, transcriberOptions, normalizeEndpoint, requireAPIKey
transcription.go # Transcriber interface, TranscribeRequest, TranscribeResponse
openai.go # openAITranscriber → /audio/transcriptions
openai_test.go
gemini.go # geminiTranscriber → generateContent (multimodal)
gemini_test.go
models.go # DefaultOpenAITranscriptionModel, DefaultGeminiTranscriptionModel
resolver.go # FindProvider
errors.go # ErrProviderNotFound, ErrCapabilityUnsupported
audio/
webm.go # IsWebMContentType, WebMOpusToWAV (used by Gemini path)
webm_test.go

server/router/api/v1/
ai_service.go # Transcribe handler (lines 42–123)

proto/store/
instance_setting.proto # InstanceAISetting, AIProviderConfig, AIProviderType, TranscriptionConfig

web/src/components/Settings/
AISection.tsx # Provider list UI, TranscriptionForm
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to the file-tree fences.

These two unlabeled blocks are already tripping MD040; text is enough here.

Also applies to: 103-128

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 71-71: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment thread internal/ai/audio/webm.go
Comment on lines +53 to +61
if len(track.CodecPrivate) < opusHeadMinLength {
return nil, errors.Errorf("invalid OpusHead: expected at least %d bytes, got %d", opusHeadMinLength, len(track.CodecPrivate))
}

channels := int(track.Audio.Channels)
if channels < 1 || channels > 2 {
return nil, errors.Errorf("unsupported Opus channel count: %d", channels)
}
preSkip := int(binary.LittleEndian.Uint16(track.CodecPrivate[10:12]))

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate the OpusHead signature before reading header fields.

This only checks the buffer length, then immediately trusts CodecPrivate[10:12] for preSkip. A malformed CodecPrivate with the right length can slip through and fail later with a much less obvious decode error. Rejecting anything that does not start with OpusHead makes bad uploads fail deterministically here.

Suggested fix
 	if len(track.CodecPrivate) < opusHeadMinLength {
 		return nil, errors.Errorf("invalid OpusHead: expected at least %d bytes, got %d", opusHeadMinLength, len(track.CodecPrivate))
 	}
+	if !bytes.HasPrefix(track.CodecPrivate, []byte("OpusHead")) {
+		return nil, errors.New("invalid OpusHead signature")
+	}
 
 	channels := int(track.Audio.Channels)

Comment on lines +87 to +107
audioBytes, err := io.ReadAll(req.Audio)
if err != nil {
return nil, errors.Wrap(err, "failed to read audio")
}
if len(audioBytes) == 0 {
return nil, errors.New("audio is required")
}

contentType := req.ContentType
if audio.IsWebMContentType(contentType) {
wav, err := audio.WebMOpusToWAV(audioBytes)
if err != nil {
return nil, errors.Wrap(err, "failed to transcode webm audio for Gemini")
}
audioBytes = wav
contentType = "audio/wav"
}

if len(audioBytes) > maxInlineSize {
return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound audio reads before decoding/transcoding

At Line 87, io.ReadAll(req.Audio) loads the entire stream before size enforcement (Line 105), so oversized uploads can consume excessive memory before rejection.

Suggested fix
 func (m *Model) GenerateFromAudio(ctx context.Context, req audiollm.Request) (*audiollm.Response, error) {
@@
-	audioBytes, err := io.ReadAll(req.Audio)
+	if req.Size > 0 && req.Size > maxInlineSize {
+		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
+	}
+	audioBytes, err := io.ReadAll(io.LimitReader(req.Audio, maxInlineSize+1))
 	if err != nil {
 		return nil, errors.Wrap(err, "failed to read audio")
 	}
 	if len(audioBytes) == 0 {
 		return nil, errors.New("audio is required")
 	}
+	if len(audioBytes) > maxInlineSize {
+		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
+	}
@@
-	if len(audioBytes) > maxInlineSize {
+	if len(audioBytes) > maxInlineSize {
 		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
 	}

Comment on lines +181 to +187
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (includes speaker labels)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't promise speaker labels in the config docs yet.

The public transcription response is still text-only, so calling gpt-4o-transcribe-diarize “includes speaker labels” reads like an end-to-end guarantee the current API/UI do not surface. I’d rephrase this as a diarization-capable model until segment metadata is actually exposed.

Suggested wording
-  //   - gpt-4o-transcribe-diarize (includes speaker labels)
+  //   - gpt-4o-transcribe-diarize (speaker-diarization capable)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (includes speaker labels)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (speaker-diarization capable)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro

Comment on lines 117 to +118
 if err != nil {
-	return nil, status.Errorf(codes.InvalidArgument, "failed to create AI transcriber: %v", err)
+	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: usememos/memos):

wc -l server/router/api/v1/ai_service.go
sed -n '100,130p' server/router/api/v1/ai_service.go
sed -n '160,185p' server/router/api/v1/ai_service.go
rg -n "status\.Errorf|errors\.(Wrap|Errorf|New)" server/router/api/v1/ai_service.go | head -30
rg -B3 -A3 "errors\.Errorf|errors\.New" server/router/api/v1/ai_service.go | grep -A3 -B3 "return \"\"" | head -40
grep -r "status\.FromError" server/router/api/ 2>/dev/null | head -5

Preserve gRPC error codes from helpers instead of collapsing to codes.Internal.

The helper function transcribeViaAudioLLM (lines 174, 177) returns specific, meaningful errors for incomplete responses and empty text, but line 118 collapses every helper error to codes.Internal. This masks user-actionable failures (FailedPrecondition) as server faults and discards the semantic distinction these checks provide.

To fix, either preserve status codes from helpers that already return gRPC errors, or have helpers return status errors directly:

Option 1: Preserve existing status codes at line 118
 if err != nil {
-	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
+	if st, ok := status.FromError(err); ok {
+		return nil, st.Err()
+	}
+	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
 }
Option 2: Return gRPC status from helper (lines 174–177)
 if resp.FinishReason != audiollm.FinishStop {
-	return "", errors.Errorf("transcription incomplete (finish reason: %s)", resp.FinishReason)
+	return "", status.Errorf(codes.FailedPrecondition, "transcription incomplete (finish reason: %s)", resp.FinishReason)
 }
 if strings.TrimSpace(resp.Text) == "" {
-	return "", errors.New("transcription response did not include text")
+	return "", status.Errorf(codes.FailedPrecondition, "transcription response did not include text")
 }

The pattern status.FromError() is already used in this package (connect_handler.go).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 if err != nil {
-	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
+	if st, ok := status.FromError(err); ok {
+		return nil, st.Err()
+	}
+	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/router/api/v1/ai_service.go` around lines 117-118: the current error
handling at the call site collapses all errors into codes.Internal. Instead,
preserve helper gRPC statuses by checking status.FromError(err) and returning
the original status when present (falling back to codes.Internal only when no
gRPC status exists), or alternatively have the helper transcribeViaAudioLLM
return a gRPC status error directly so callers can return it unchanged. Update
the call site in ai_service.go where transcribeViaAudioLLM is invoked to use
status.FromError(err) and return that status when available rather than
unconditionally wrapping with codes.Internal.

These two deps were added as // indirect when WebM transcoding first
landed, but internal/ai/audio/webm.go imports them directly, so
go mod tidy moves them to the direct require block. CI Static Checks
runs `go mod tidy && git diff --exit-code`, which failed before this
fix.
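Concretely, the tidy move in go.mod looks roughly like this. The module paths are inferred from the PR description's mention of ebml-go and pion/opus, and the version strings are placeholders, not the repository's actual pins:

```
 require (
+	github.com/at-wat/ebml-go vX.Y.Z
+	github.com/pion/opus vX.Y.Z
 	...
 )

 require (
-	github.com/at-wat/ebml-go vX.Y.Z // indirect
-	github.com/pion/opus vX.Y.Z // indirect
 	...
 )
```

`go mod tidy` classifies a module as a direct requirement as soon as any package in the main module imports it, which is why the CI diff check failed once webm.go started importing these packages directly.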
@coderabbitai Bot left a comment
🧹 Nitpick comments (2)
internal/ai/audiollm/gemini/gemini_test.go (2)

20-86: ⚡ Quick win

Non-Stop finish-reason paths have no test coverage.

mapFinishReason's SAFETY → FinishSafety, MAX_TOKENS → FinishLength, and default → FinishOther branches are untested. The PR description specifically calls out improved finish-reason surfacing as a key behavior change, so these paths warrant direct coverage.

✅ Proposed table-driven test for non-Stop finish reasons
func TestGenerateFromAudioFinishReasons(t *testing.T) {
	t.Parallel()

	cases := []struct {
		geminiReason string
		want         audiollm.FinishReason
	}{
		{"STOP", audiollm.FinishStop},
		{"SAFETY", audiollm.FinishSafety},
		{"MAX_TOKENS", audiollm.FinishLength},
		{"RECITATION", audiollm.FinishOther},
	}

	for _, tc := range cases {
		tc := tc
		t.Run(tc.geminiReason, func(t *testing.T) {
			t.Parallel()

			server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
				w.Header().Set("Content-Type", "application/json")
				require.NoError(t, json.NewEncoder(w).Encode(map[string]any{
					"candidates": []map[string]any{{
						"finishReason": tc.geminiReason,
						"content":      map[string]any{"parts": []map[string]string{{"text": ""}}},
					}},
				}))
			}))
			defer server.Close()

			model, err := audiollmgemini.New(ai.ProviderConfig{
				Type:     ai.ProviderGemini,
				Endpoint: server.URL + "/v1beta",
				APIKey:   "test-key",
			}, audiollm.ApplyOptions(nil))
			require.NoError(t, err)

			resp, err := model.GenerateFromAudio(context.Background(), audiollm.Request{
				Model:        "gemini-2.5-flash",
				ContentType:  "audio/mp3",
				Audio:        strings.NewReader("audio bytes"),
				Instructions: "transcribe",
			})
			require.NoError(t, err) // GenerateFromAudio does not error on non-Stop; the handler does
			require.Equal(t, tc.want, resp.FinishReason)
		})
	}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/ai/audiollm/gemini/gemini_test.go` around lines 20-86: the test
suite lacks coverage for non-"STOP" finish reasons mapped by mapFinishReason.
Add a table-driven test that exercises GenerateFromAudio's finish-reason mapping
for "SAFETY" → FinishSafety, "MAX_TOKENS" → FinishLength, and an unknown value
(e.g., "RECITATION") → FinishOther. Implement a new
TestGenerateFromAudioFinishReasons that spins up httptest servers returning
candidates with different "finishReason" strings, constructs audiollmgemini.New
with the test-server endpoint, calls model.GenerateFromAudio (using the same
request fields as TestGenerateFromAudio but varying the mock response), and
asserts resp.FinishReason equals the expected audiollm.FinishReason for each
case; reference mapFinishReason, GenerateFromAudio, and TestGenerateFromAudio
when locating where to add this test.

20-86: ⚡ Quick win

WebM → WAV transcoding path is not exercised despite being called out in the plan.

Plan step C.1 states "We also verify that WebM input is transcoded to WAV", but both tests use audio/mpeg and never trigger the IsWebMContentType → WebMOpusToWAV branch. A minimal test using a valid WebM/Opus byte fixture would cover that path and the resulting audio/wav content-type assertion on the wire.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/ai/audiollm/gemini/gemini_test.go` around lines 20-86: the test
suite never exercises the WebM → WAV branch. Add a new parallel test (or extend
TestGenerateFromAudio) that calls GenerateFromAudio with ContentType
"audio/webm" (so IsWebMContentType triggers), using a small valid WebM/Opus
fixture reader, and assert that the outgoing request's part MIMEType is
"audio/wav" and that inlineData.Data decodes to the expected WAV bytes produced
by WebMOpusToWAV. Reference TestGenerateFromAudio, GenerateFromAudio, and the
branch guard IsWebMContentType/WebMOpusToWAV so reviewers can locate the code,
and include fixture byte slices (e.g., webmFixture and expectedWav) to keep the
test deterministic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 56ea4848-fb70-462d-bae3-f55ed3b7cecf

📥 Commits

Reviewing files that changed from the base of the PR and between f5ec03f and e25b04a.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • proto/gen/store/instance_setting.pb.go is excluded by !**/*.pb.go, !**/gen/**
📒 Files selected for processing (23)
  • docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
  • docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
  • go.mod
  • internal/ai/audio/webm.go
  • internal/ai/audio/webm_test.go
  • internal/ai/audiollm/audiollm.go
  • internal/ai/audiollm/gemini/gemini.go
  • internal/ai/audiollm/gemini/gemini_test.go
  • internal/ai/audiollm/options.go
  • internal/ai/client.go
  • internal/ai/errors.go
  • internal/ai/gemini.go
  • internal/ai/openai.go
  • internal/ai/stt/openai/openai.go
  • internal/ai/stt/openai/openai_test.go
  • internal/ai/stt/options.go
  • internal/ai/stt/stt.go
  • internal/ai/transcription.go
  • proto/store/instance_setting.proto
  • server/router/api/v1/ai_service.go
  • server/router/api/v1/test/ai_service_test.go
  • web/src/components/Settings/AISection.tsx
  • web/src/locales/en.json
💤 Files with no reviewable changes (5)
  • internal/ai/transcription.go
  • web/src/components/Settings/AISection.tsx
  • internal/ai/openai.go
  • internal/ai/client.go
  • internal/ai/gemini.go
✅ Files skipped from review due to trivial changes (9)
  • internal/ai/errors.go
  • internal/ai/stt/options.go
  • internal/ai/audiollm/audiollm.go
  • go.mod
  • server/router/api/v1/test/ai_service_test.go
  • internal/ai/stt/openai/openai_test.go
  • internal/ai/audio/webm.go
  • web/src/locales/en.json
  • proto/store/instance_setting.proto
🚧 Files skipped from review as they are similar to previous changes (6)
  • internal/ai/audio/webm_test.go
  • internal/ai/audiollm/gemini/gemini.go
  • internal/ai/stt/openai/openai.go
  • internal/ai/stt/stt.go
  • internal/ai/audiollm/options.go
  • server/router/api/v1/ai_service.go

@boojack boojack merged commit 5ccba98 into main May 2, 2026
9 checks passed
@boojack boojack deleted the stt-audiollm-split branch May 2, 2026 16:22