
refactor(ai): split STT and Audio-LLM into separate interfaces #5928

Merged
boojack merged 2 commits into main from stt-audiollm-split on May 2, 2026
Conversation

@boojack
Member

@boojack boojack commented May 2, 2026

Summary

Refactors internal/ai/ from a single Transcriber interface (which dressed Gemini's multimodal LLM up as STT) into two honest interfaces, aligned with mainstream OSS conventions (Vercel AI SDK, LiteLLM, Go AI ecosystem):

  • internal/ai/stt/ — deterministic speech-to-text (OpenAI Whisper + gpt-4o-transcribe family, plus any OpenAI-compatible STT endpoint via the Endpoint field — Groq Whisper, Azure Whisper, faster-whisper self-host)
  • internal/ai/audiollm/ — multimodal audio LLMs (Gemini generateContent; reserved for future gpt-4o-audio support)
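As a rough sketch of the split (stt.Transcriber, audiollm.Model, and FinishReason are named in this PR; the field names and exact method signatures below are illustrative guesses, not the actual code):

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// Shared request shape for illustration only; in the real split the
// stt and audiollm packages each define their own Request type.
type Request struct {
	Audio       io.Reader
	ContentType string
	Model       string
	Prompt      string
}

// stt side: a deterministic transcriber that turns audio into text.
type Response struct {
	Text string
}

type Transcriber interface {
	Transcribe(ctx context.Context, req Request) (*Response, error)
}

// audiollm side: a generic multimodal model that generates content
// from audio plus instructions and reports a finish reason the
// caller can inspect.
type FinishReason string

const (
	FinishStop   FinishReason = "stop"
	FinishSafety FinishReason = "safety"
)

type LLMResponse struct {
	Text         string
	FinishReason FinishReason
}

type Model interface {
	GenerateFromAudio(ctx context.Context, req Request) (*LLMResponse, error)
}

func main() {
	fmt.Println(FinishStop, FinishSafety)
}
```

The key design point is that only the audiollm interface carries a finish reason; deterministic STT either returns text or fails.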

The Transcribe gRPC handler now dispatches to the right package based on provider.Type. Transcription instructions are extracted from the Gemini package into the handler, so audiollm/gemini is a generic audio-LLM adapter — opening the door to future "audio → structured note" pipelines without further refactoring.
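The caller-side dispatch can be sketched like this (the provider-type values and helper names follow the PR description; the function body is a simplified stand-in for the real handler, not its actual code):

```go
package main

import (
	"errors"
	"fmt"
)

type ProviderType string

const (
	ProviderOpenAI ProviderType = "OPENAI"
	ProviderGemini ProviderType = "GEMINI"
)

// dispatch mirrors the handler-side routing described in the PR:
// OpenAI goes through the stt package, Gemini through audiollm.
// It returns the name of the helper that would be invoked.
func dispatch(t ProviderType) (string, error) {
	switch t {
	case ProviderOpenAI:
		return "transcribeViaSTT", nil
	case ProviderGemini:
		return "transcribeViaAudioLLM", nil
	default:
		return "", errors.New("unsupported provider type")
	}
}

func main() {
	route, _ := dispatch(ProviderGemini)
	fmt.Println(route) // transcribeViaAudioLLM
}
```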

User-visible improvements

  • Gemini failure modes are now informative. A non-Stop finish reason from Gemini (safety filter, max-tokens, recitation, prohibited-content, blocklist, SPII, etc.) surfaces as transcription incomplete (finish reason: safety) instead of the previous generic did not include text.
  • Browser-recorded WebM works with Gemini transparently. Pure-Go decoder (ebml-go + pion/opus) transcodes to WAV server-side; the misleading "Gemini does not accept WebM" UI warning is gone.
  • Model picker UI shows current options. OpenAI placeholder lists whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize; Gemini lists gemini-2.5-flash, gemini-2.5-pro. Prompt help text explains the Whisper-soft-hint vs Gemini-literal-instruction difference.
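To illustrate the WebM handling above, a content-type check along these lines could gate the server-side transcode path (a plausible sketch only, not the actual IsWebMContentType implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// isWebMContentType reports whether a MIME type looks like WebM.
// Browser recorders commonly report "audio/webm" or
// "audio/webm;codecs=opus", so parameters after ";" are ignored.
func isWebMContentType(ct string) bool {
	ct = strings.ToLower(strings.TrimSpace(ct))
	base, _, _ := strings.Cut(ct, ";")
	base = strings.TrimSpace(base)
	return base == "audio/webm" || base == "video/webm"
}

func main() {
	fmt.Println(isWebMContentType("audio/webm;codecs=opus"))
	fmt.Println(isWebMContentType("audio/wav"))
}
```

When the check matches, the Gemini path would transcode to WAV before building the inline request; other content types pass through untouched.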

Schema and contract

  • No proto field changes — only two comment improvements on TranscriptionConfig.model and .prompt.
  • No public API changes: TranscribeRequest/TranscribeResponse shapes unchanged.
  • No DB migration: InstanceAISetting storage unchanged.
  • Duration dropped from internal STT response — it's a property of the audio file, not the transcription provider; doesn't belong on the response shape. Public TranscribeResponse only ever returned Text, so no API impact.

Background

Spec and research findings live in:

  • docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
  • docs/superpowers/plans/2026-05-02-stt-audiollm-split.md

Verified against six OSS reference implementations: Vercel AI SDK, LiteLLM, cloudwego/eino + tmc/langchaingo + sashabaranov/go-openai (SDK layer); Open WebUI, LobeChat, Dify (application layer). All six split STT and multimodal-audio into separate paths; none add STT-specific fields onto a generic provider entity.

Test Plan

  • go build ./... clean
  • go test ./... clean (excluding Docker-dependent store/test)
  • All 7 TestTranscribe subtests pass (including the Gemini handler test, with mock updated to send realistic finishReason: STOP)
  • cd web && pnpm build clean (no TS errors)
  • Manual smoke test: configure an OpenAI provider, upload short audio via voice-input feature, transcript appears
  • Manual smoke test: configure a Gemini provider, upload browser-recorded WebM via voice-input feature, transcript appears (validates WebM transcoding path)
  • Manual smoke test: deliberately bad/silent audio with Gemini produces a clearer error than before (finish reason: ... instead of generic message)

🤖 Generated with Claude Code

@boojack boojack requested a review from a team as a code owner May 2, 2026 16:05
@coderabbitai
Contributor

coderabbitai Bot commented May 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fec8757a-330f-4b66-a6e7-56e1ef7d7321

📥 Commits

Reviewing files that changed from the base of the PR and between e25b04a and 6a1c71d.

📒 Files selected for processing (1)
  • go.mod
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

📝 Walkthrough

Splits internal/ai into stt and audiollm packages; adds WebM/Opus→WAV decoding; implements OpenAI STT and Gemini AudioLLM adapters (with options and tests); updates Transcribe to dispatch by provider.Type; removes legacy combined transcription code; updates proto comments and frontend i18n.

Changes

STT / AudioLLM Refactor

  • Documentation & Design (docs/superpowers/plans/2026-05-02-stt-audiollm-split.md, docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md): Added staged implementation plan and design spec describing the new stt/audiollm package split, factory/dispatch strategy, sentinel errors, proto comment updates, and frontend i18n changes.
  • Dependencies (go.mod): Added dependencies for EBML and Opus decoding: github.com/at-wat/ebml-go and github.com/pion/opus.
  • Audio Infrastructure (internal/ai/audio/webm.go, internal/ai/audio/webm_test.go): Added WebMOpusToWAV (WebM/Opus → 16-bit PCM WAV) and IsWebMContentType, with unit tests and validation/error handling for truncated/invalid inputs.
  • Core Interfaces / Types (internal/ai/stt/stt.go, internal/ai/audiollm/audiollm.go, internal/ai/errors.go): Introduced stt.Transcriber (Request/Response/Segment), audiollm.Model (Request/Response and FinishReason constants), and sentinel errors ErrSTTNotSupported / ErrAudioLLMNotSupported.
  • Options / Config Helpers (internal/ai/stt/options.go, internal/ai/audiollm/options.go): Added package-scoped Options, functional option types, WithHTTPClient, and ApplyOptions with default HTTP timeout behavior.
  • Provider Implementations (internal/ai/stt/openai/openai.go, internal/ai/stt/openai/openai_test.go, internal/ai/audiollm/gemini/gemini.go, internal/ai/audiollm/gemini/gemini_test.go): Implemented the OpenAI STT transcriber (endpoint normalization, API key validation, SDK usage, audio metadata handling) and the Gemini audio-LLM model (audio read, WebM→WAV fallback, size limits, content-type normalization, generation call, finish-reason mapping); updated and added provider tests.
  • Handler Integration & Dispatch (server/router/api/v1/ai_service.go, server/router/api/v1/test/ai_service_test.go): Refactored Transcribe to switch on provider.Type and route to transcribeViaSTT (OpenAI) or transcribeViaAudioLLM (Gemini); added buildTranscriptionInstructions; enforced FinishStop for audiollm; updated the test mock finishReason.
  • Removal / Cleanup (internal/ai/client.go, internal/ai/transcription.go, internal/ai/openai.go, internal/ai/gemini.go): Deleted the legacy monolithic transcription API, the provider-agnostic transcriber factory, and the previous combined OpenAI/Gemini transcriber implementations.
  • Proto Comments & Frontend (proto/store/instance_setting.proto, web/src/components/Settings/AISection.tsx, web/src/locales/en.json): Updated proto field comments for TranscriptionConfig.model and prompt (comment-only); changed the settings UI to show a generic API-key warning when a provider has no key; expanded transcription i18n keys and model placeholder/help strings; removed the Gemini-specific WebM warning.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 36.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check ✅ Passed: The title accurately describes the main refactoring effort: splitting the monolithic transcription interface into two specialized packages (STT and Audio-LLM), which is the core objective of this PR.
  • Description check ✅ Passed: The description comprehensively relates to the changeset, covering the refactoring goals, implementation details, user-visible improvements, schema/contract implications, testing strategy, and design documentation.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.


Refactors internal/ai/ from a single Transcriber interface (which dressed
Gemini's multimodal LLM up as STT) into two honest interfaces, aligned
with mainstream OSS conventions (Vercel AI SDK, LiteLLM, Go AI ecosystem):

- internal/ai/stt/ — deterministic speech-to-text (OpenAI Whisper +
  gpt-4o-transcribe family, plus any OpenAI-compatible STT endpoint via
  the Endpoint field — Groq Whisper, Azure Whisper, faster-whisper)
- internal/ai/audiollm/ — multimodal audio LLMs (Gemini generateContent;
  reserved for future gpt-4o-audio support)

The Transcribe gRPC handler now dispatches to the right package based on
provider.Type. Transcription instructions are extracted from the Gemini
package into the handler, so audiollm/gemini is a generic audio-LLM
adapter that the handler steers — opening the door to future
"audio → structured note" pipelines without further refactoring.

User-visible improvements:

- Gemini failure modes are now informative. A non-Stop finish reason
  from Gemini (safety filter, max-tokens, recitation, prohibited-content,
  blocklist, SPII, image-safety variants) surfaces as
  "transcription incomplete (finish reason: safety)" instead of the
  previous generic "did not include text".
- Browser-recorded WebM works with Gemini transparently. Pure-Go decoder
  (at-wat/ebml-go + pion/opus) transcodes to WAV server-side; the
  misleading "Gemini does not accept WebM" UI warning is gone.
- Model picker UI shows current options. OpenAI placeholder lists
  whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe,
  gpt-4o-transcribe-diarize. Gemini lists gemini-2.5-flash, 2.5-pro.
  Prompt help text explains the Whisper-soft-hint vs Gemini-literal-
  instruction difference.

Schema and contract:

- No proto field changes. Two comment-only updates on
  TranscriptionConfig.model and .prompt to reflect current model
  families and the cross-provider prompt semantics.
- No public API changes. TranscribeRequest/TranscribeResponse shapes
  unchanged.
- No DB migration. InstanceAISetting storage unchanged.
- Duration dropped from internal STT response shape — it's a property
  of the audio file, not the transcription provider. The public
  TranscribeResponse only ever returned Text, so no API impact.

Notable internal changes:

- No factory layer between handler and provider implementations.
  A factory in stt/ would have imported stt/openai which imports stt
  for types — Go import cycle. The handler already switches on
  provider.Type, so the factory was redundant indirection. Matches the
  cloudwego/eino and tmc/langchaingo convention (per-provider
  constructors, caller-side dispatch).
- New sentinel errors: ErrSTTNotSupported, ErrAudioLLMNotSupported.
  Available for callers that want errors.Is checks; the handler uses
  the inline switch instead.
- audiollm.FinishReason maps eight genai safety-related reasons
  (Safety, Recitation, ProhibitedContent, SPII, Blocklist,
  ImageSafety, ImageProhibitedContent, ImageRecitation) to FinishSafety
  so the handler error surfaces a clear bucket regardless of which
  content-policy reason Gemini reports.
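The safety-bucket mapping described above can be sketched as follows (the raw reason strings approximate the genai SDK constants and may not match them exactly; FinishOther is an illustrative catch-all):

```go
package main

import "fmt"

type FinishReason string

const (
	FinishStop   FinishReason = "stop"
	FinishSafety FinishReason = "safety"
	FinishOther  FinishReason = "other"
)

// mapFinishReason collapses the eight content-policy variants listed
// above into a single "safety" bucket so the handler can surface one
// clear error regardless of which policy reason Gemini reports.
func mapFinishReason(raw string) FinishReason {
	switch raw {
	case "STOP":
		return FinishStop
	case "SAFETY", "RECITATION", "PROHIBITED_CONTENT", "SPII",
		"BLOCKLIST", "IMAGE_SAFETY", "IMAGE_PROHIBITED_CONTENT",
		"IMAGE_RECITATION":
		return FinishSafety
	default:
		return FinishOther
	}
}

func main() {
	fmt.Println(mapFinishReason("SPII"))
	fmt.Println(mapFinishReason("STOP"))
}
```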

Background:

Spec and research findings live in:
- docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
- docs/superpowers/plans/2026-05-02-stt-audiollm-split.md

Verified against six OSS reference implementations: Vercel AI SDK,
LiteLLM, cloudwego/eino + tmc/langchaingo + sashabaranov/go-openai
(SDK layer); Open WebUI, LobeChat, Dify (application layer). All six
split STT and multimodal-audio into separate paths; none add
STT-specific fields onto a generic provider entity.
@boojack boojack force-pushed the stt-audiollm-split branch from f5ec03f to e25b04a Compare May 2, 2026 16:09
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
internal/ai/audio/webm_test.go (1)

32-48: 🏗️ Heavy lift

Add a success-path fixture for the transcoder.

Right now the new codec path is only covered through failures. A small valid WebM/Opus sample that asserts RIFF/WAVE output and a sane decoded payload would give this feature real regression coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/ai/audio/webm_test.go` around lines 32 - 48, Add a positive test
case to TestWebMOpusToWAV that supplies a small valid WebM/Opus fixture and
asserts the successful WAV output: call WebMOpusToWAV with the valid sample
bytes, require no error, then assert the returned []byte begins with the "RIFF"
header and contains "WAVE" (and optionally a sane decoded payload like a
non-zero length or presence of a "fmt " chunk). Place the fixture and assertions
in webm_test.go alongside the existing negative cases and reference the
WebMOpusToWAV function so the test covers the success path.
server/router/api/v1/test/ai_service_test.go (1)

140-195: ⚡ Quick win

Add a Gemini regression case for the new handler-side guardrails.

This fixture now covers the STOP happy path, but it still doesn't pin the new behavior introduced here: rejecting non-STOP finish reasons, rejecting blank transcript text, and building transcription instructions in the handler. A small negative Gemini subtest would make this refactor much safer.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/superpowers/plans/2026-05-02-stt-audiollm-split.md`:
- Around line 1306-1314: The markdown code fences showing the file list (e.g.,
the block containing "ai.go audio audiollm errors.go models.go resolver.go stt")
are unlabeled and triggering MD040; edit each of the three offending fenced
blocks (including the ones at the other mentioned locations) and add a language
label (use "text") immediately after the opening triple backticks so the fences
read ```text, leaving the block contents unchanged.
- Around line 47-50: The plan's "Untouched" manifest incorrectly lists webm.go
and webm_test.go as untouched; update the staged plan so the audio
implementation files (webm.go and webm_test.go in the internal ai audio area)
are marked as touched/part of this PR (or remove them from the "Untouched" list)
so the manifest matches the actual changes and won't confuse replaying or
reconciling the commit breakdown.

In `@docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md`:
- Around line 71-95: The markdown code fences showing the project file-tree (the
block listing internal/ai/, server/router/api/v1/ai_service.go, proto/store/,
web/src/components/Settings/AISection.tsx, etc.) are unlabeled and triggering
MD040; edit those fenced blocks and add a language tag of "text" (e.g., ```text)
for each occurrence (the two unlabeled blocks around the file-tree sections) so
linters stop flagging them.

In `@internal/ai/audio/webm.go`:
- Around line 53-61: The code validates only length of track.CodecPrivate before
reading header fields, but never checks the OpusHead signature; update the
parsing in the function handling track.CodecPrivate (look for opusHeadMinLength,
track.CodecPrivate, and preSkip) to first verify that the first 8 bytes equal
the ASCII string "OpusHead" and return an error if not, then proceed to validate
channels and read preSkip from bytes 10:12; make the error message descriptive
(e.g., "invalid OpusHead signature") so malformed buffers are rejected
deterministically.

In `@internal/ai/audiollm/gemini/gemini.go`:
- Around line 87-107: The code currently calls io.ReadAll(req.Audio) into
audioBytes before enforcing size limits, which can OOM on large uploads; replace
the unbounded read with a size-capped read (e.g., use io.LimitedReader or read
into a buffer with a cap of maxInlineSize+1) when reading req.Audio so you can
immediately reject inputs that exceed maxInlineSize, then proceed to check
contentType and call audio.WebMOpusToWAV as before but also re-check the
resulting wav size against maxInlineSize (reject if transcoded WAV exceeds the
limit); update the logic around audioBytes, req.Audio, audio.IsWebMContentType,
audio.WebMOpusToWAV and the maxInlineSize checks accordingly.

In `@proto/store/instance_setting.proto`:
- Around line 181-187: The docs block listing transcription model examples
currently promises speaker labels for "gpt-4o-transcribe-diarize"; change that
wording to say it is "diarization-capable" or "supports diarization (speaker
segmentation)" instead of claiming it "includes speaker labels" so you don't
imply the public transcription response includes speaker labels; update the two
relevant comments containing "gpt-4o-transcribe-diarize" and any similar
phrasing in the OPENAI/GEMINI examples to the new wording.

In `@server/router/api/v1/ai_service.go`:
- Around line 117-118: The current error handling at the call site collapses all
errors into codes.Internal; instead preserve helper gRPC statuses by checking
status.FromError(err) and returning the original status if present (or fallback
to codes.Internal only if no gRPC status exists), or alternatively make the
helper transcribeViaAudioLLM return a gRPC status error directly so callers can
return it unchanged; update the call in ai_service.go where
transcribeViaAudioLLM is invoked to use status.FromError(err) and return that
status when available rather than unconditionally wrapping with codes.Internal.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7857b82-8703-45ad-ae30-484e925687ce

📥 Commits

Reviewing files that changed from the base of the PR and between 238f27d and f5ec03f.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • proto/gen/store/instance_setting.pb.go is excluded by !**/*.pb.go, !**/gen/**
📒 Files selected for processing (23)
  • docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
  • docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
  • go.mod
  • internal/ai/audio/webm.go
  • internal/ai/audio/webm_test.go
  • internal/ai/audiollm/audiollm.go
  • internal/ai/audiollm/gemini/gemini.go
  • internal/ai/audiollm/gemini/gemini_test.go
  • internal/ai/audiollm/options.go
  • internal/ai/client.go
  • internal/ai/errors.go
  • internal/ai/gemini.go
  • internal/ai/openai.go
  • internal/ai/stt/openai/openai.go
  • internal/ai/stt/openai/openai_test.go
  • internal/ai/stt/options.go
  • internal/ai/stt/stt.go
  • internal/ai/transcription.go
  • proto/store/instance_setting.proto
  • server/router/api/v1/ai_service.go
  • server/router/api/v1/test/ai_service_test.go
  • web/src/components/Settings/AISection.tsx
  • web/src/locales/en.json
💤 Files with no reviewable changes (5)
  • web/src/components/Settings/AISection.tsx
  • internal/ai/client.go
  • internal/ai/transcription.go
  • internal/ai/gemini.go
  • internal/ai/openai.go

Comment thread docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
Comment on lines +1306 to +1314
```
ai.go
audio
audiollm
errors.go
models.go
resolver.go
stt
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Label these expected-output fences to satisfy markdownlint.

markdownlint is already flagging these blocks with MD040. Adding text to the fences keeps the plan lint-clean without changing any content.

Also applies to: 1435-1439, 1580-1588

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 1306-1306: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +71 to +95
```
internal/ai/
ai.go # ProviderType, ProviderConfig, errors
client.go # NewTranscriber factory, transcriberOptions, normalizeEndpoint, requireAPIKey
transcription.go # Transcriber interface, TranscribeRequest, TranscribeResponse
openai.go # openAITranscriber → /audio/transcriptions
openai_test.go
gemini.go # geminiTranscriber → generateContent (multimodal)
gemini_test.go
models.go # DefaultOpenAITranscriptionModel, DefaultGeminiTranscriptionModel
resolver.go # FindProvider
errors.go # ErrProviderNotFound, ErrCapabilityUnsupported
audio/
webm.go # IsWebMContentType, WebMOpusToWAV (used by Gemini path)
webm_test.go

server/router/api/v1/
ai_service.go # Transcribe handler (lines 42–123)

proto/store/
instance_setting.proto # InstanceAISetting, AIProviderConfig, AIProviderType, TranscriptionConfig

web/src/components/Settings/
AISection.tsx # Provider list UI, TranscriptionForm
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to the file-tree fences.

These two unlabeled blocks are already tripping MD040; text is enough here.

Also applies to: 103-128

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 71-71: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment thread internal/ai/audio/webm.go
Comment on lines +53 to +61
if len(track.CodecPrivate) < opusHeadMinLength {
return nil, errors.Errorf("invalid OpusHead: expected at least %d bytes, got %d", opusHeadMinLength, len(track.CodecPrivate))
}

channels := int(track.Audio.Channels)
if channels < 1 || channels > 2 {
return nil, errors.Errorf("unsupported Opus channel count: %d", channels)
}
preSkip := int(binary.LittleEndian.Uint16(track.CodecPrivate[10:12]))

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate the OpusHead signature before reading header fields.

This only checks the buffer length, then immediately trusts CodecPrivate[10:12] for preSkip. A malformed CodecPrivate with the right length can slip through and fail later with a much less obvious decode error. Rejecting anything that does not start with OpusHead makes bad uploads fail deterministically here.

Suggested fix
 	if len(track.CodecPrivate) < opusHeadMinLength {
 		return nil, errors.Errorf("invalid OpusHead: expected at least %d bytes, got %d", opusHeadMinLength, len(track.CodecPrivate))
 	}
+	if !bytes.HasPrefix(track.CodecPrivate, []byte("OpusHead")) {
+		return nil, errors.New("invalid OpusHead signature")
+	}
 
 	channels := int(track.Audio.Channels)

Comment on lines +87 to +107
audioBytes, err := io.ReadAll(req.Audio)
if err != nil {
return nil, errors.Wrap(err, "failed to read audio")
}
if len(audioBytes) == 0 {
return nil, errors.New("audio is required")
}

contentType := req.ContentType
if audio.IsWebMContentType(contentType) {
wav, err := audio.WebMOpusToWAV(audioBytes)
if err != nil {
return nil, errors.Wrap(err, "failed to transcode webm audio for Gemini")
}
audioBytes = wav
contentType = "audio/wav"
}

if len(audioBytes) > maxInlineSize {
return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound audio reads before decoding/transcoding

At Line 87, io.ReadAll(req.Audio) loads the entire stream before size enforcement (Line 105), so oversized uploads can consume excessive memory before rejection.

Suggested fix
 func (m *Model) GenerateFromAudio(ctx context.Context, req audiollm.Request) (*audiollm.Response, error) {
@@
-	audioBytes, err := io.ReadAll(req.Audio)
+	if req.Size > 0 && req.Size > maxInlineSize {
+		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
+	}
+	audioBytes, err := io.ReadAll(io.LimitReader(req.Audio, maxInlineSize+1))
 	if err != nil {
 		return nil, errors.Wrap(err, "failed to read audio")
 	}
 	if len(audioBytes) == 0 {
 		return nil, errors.New("audio is required")
 	}
+	if len(audioBytes) > maxInlineSize {
+		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
+	}
@@
-	if len(audioBytes) > maxInlineSize {
+	if len(audioBytes) > maxInlineSize {
 		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
 	}

Comment on lines +181 to +187
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (includes speaker labels)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't promise speaker labels in the config docs yet.

The public transcription response is still text-only, so calling gpt-4o-transcribe-diarize “includes speaker labels” reads like an end-to-end guarantee the current API/UI do not surface. I’d rephrase this as a diarization-capable model until segment metadata is actually exposed.

Suggested wording
-  //   - gpt-4o-transcribe-diarize (includes speaker labels)
+  //   - gpt-4o-transcribe-diarize (speaker-diarization capable)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (includes speaker labels)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro
// OPENAI examples:
// - whisper-1 (legacy, lower cost)
// - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
// - gpt-4o-transcribe-diarize (speaker-diarization capable)
// GEMINI examples:
// - gemini-2.5-flash (default, multimodal call)
// - gemini-2.5-pro

Comment on lines 117 to +118
 if err != nil {
-	return nil, status.Errorf(codes.InvalidArgument, "failed to create AI transcriber: %v", err)
+	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: usememos/memos):

wc -l server/router/api/v1/ai_service.go
sed -n '100,130p' server/router/api/v1/ai_service.go
sed -n '160,185p' server/router/api/v1/ai_service.go
rg -n "status\.Errorf|errors\.(Wrap|Errorf|New)" server/router/api/v1/ai_service.go | head -30
rg -B3 -A3 "errors\.Errorf|errors\.New" server/router/api/v1/ai_service.go | grep -A3 -B3 "return \"\"" | head -40
grep -r "status\.FromError" server/router/api/ 2>/dev/null | head -5

Preserve gRPC error codes from helpers instead of collapsing to codes.Internal.

The helper function transcribeViaAudioLLM (lines 174, 177) returns specific, meaningful errors for incomplete responses and empty text, but line 118 collapses every helper error to codes.Internal. This masks user-actionable failures (FailedPrecondition) as server faults and discards the semantic distinction these checks provide.

To fix, either preserve status codes from helpers that already return gRPC errors, or have helpers return status errors directly:

Option 1: Preserve existing status codes at line 118
 if err != nil {
-	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
+	if st, ok := status.FromError(err); ok {
+		return nil, st.Err()
+	}
+	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
 }
Option 2: Return gRPC status from helper (lines 174–177)
 if resp.FinishReason != audiollm.FinishStop {
-	return "", errors.Errorf("transcription incomplete (finish reason: %s)", resp.FinishReason)
+	return "", status.Errorf(codes.FailedPrecondition, "transcription incomplete (finish reason: %s)", resp.FinishReason)
 }
 if strings.TrimSpace(resp.Text) == "" {
-	return "", errors.New("transcription response did not include text")
+	return "", status.Errorf(codes.FailedPrecondition, "transcription response did not include text")
 }

The pattern status.FromError() is already used in this package (connect_handler.go).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 if err != nil {
-	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
+	if st, ok := status.FromError(err); ok {
+		return nil, st.Err()
+	}
+	return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/router/api/v1/ai_service.go` around lines 117-118: the current error
handling at the call site collapses all errors into codes.Internal. Instead,
preserve helper gRPC statuses by checking status.FromError(err) and returning
the original status when present (falling back to codes.Internal only when no
gRPC status exists), or alternatively have the helper transcribeViaAudioLLM
return a gRPC status error directly so callers can return it unchanged. Update
the call site in ai_service.go where transcribeViaAudioLLM is invoked to use
status.FromError(err) and return that status when available rather than
unconditionally wrapping with codes.Internal.

These two deps were added as // indirect when WebM transcoding first
landed, but internal/ai/audio/webm.go imports them directly, so
go mod tidy moves them to the direct require block. CI Static Checks
runs `go mod tidy && git diff --exit-code`, which failed before this
fix.
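Concretely, the tidy move in go.mod looks roughly like this. The module paths are inferred from the PR description's mention of ebml-go and pion/opus, and the version strings are placeholders, not the repository's actual pins:

```
 require (
+	github.com/at-wat/ebml-go vX.Y.Z
+	github.com/pion/opus vX.Y.Z
 	...
 )

 require (
-	github.com/at-wat/ebml-go vX.Y.Z // indirect
-	github.com/pion/opus vX.Y.Z // indirect
 	...
 )
```

`go mod tidy` classifies a module as a direct requirement as soon as any package in the main module imports it, which is why the CI diff check failed once webm.go started importing these packages directly.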
@coderabbitai Bot left a comment
🧹 Nitpick comments (2)
internal/ai/audiollm/gemini/gemini_test.go (2)

20-86: ⚡ Quick win

Non-Stop finish-reason paths have no test coverage.

mapFinishReason's SAFETY → FinishSafety, MAX_TOKENS → FinishLength, and default → FinishOther branches are untested. The PR description specifically calls out improved finish-reason surfacing as a key behavior change, so these paths warrant direct coverage.

✅ Proposed table-driven test for non-Stop finish reasons
func TestGenerateFromAudioFinishReasons(t *testing.T) {
	t.Parallel()

	cases := []struct {
		geminiReason string
		want         audiollm.FinishReason
	}{
		{"STOP", audiollm.FinishStop},
		{"SAFETY", audiollm.FinishSafety},
		{"MAX_TOKENS", audiollm.FinishLength},
		{"RECITATION", audiollm.FinishOther},
	}

	for _, tc := range cases {
		tc := tc
		t.Run(tc.geminiReason, func(t *testing.T) {
			t.Parallel()

			server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
				w.Header().Set("Content-Type", "application/json")
				require.NoError(t, json.NewEncoder(w).Encode(map[string]any{
					"candidates": []map[string]any{{
						"finishReason": tc.geminiReason,
						"content":      map[string]any{"parts": []map[string]string{{"text": ""}}},
					}},
				}))
			}))
			defer server.Close()

			model, err := audiollmgemini.New(ai.ProviderConfig{
				Type:     ai.ProviderGemini,
				Endpoint: server.URL + "/v1beta",
				APIKey:   "test-key",
			}, audiollm.ApplyOptions(nil))
			require.NoError(t, err)

			resp, err := model.GenerateFromAudio(context.Background(), audiollm.Request{
				Model:        "gemini-2.5-flash",
				ContentType:  "audio/mp3",
				Audio:        strings.NewReader("audio bytes"),
				Instructions: "transcribe",
			})
			require.NoError(t, err) // GenerateFromAudio does not error on non-Stop; the handler does
			require.Equal(t, tc.want, resp.FinishReason)
		})
	}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/ai/audiollm/gemini/gemini_test.go` around lines 20-86: the test
suite lacks coverage for non-"STOP" finish reasons mapped by mapFinishReason.
Add a table-driven test that exercises GenerateFromAudio's finish-reason mapping
for "SAFETY" → FinishSafety, "MAX_TOKENS" → FinishLength, and an unknown value
(e.g., "RECITATION") → FinishOther. Implement a new
TestGenerateFromAudioFinishReasons that spins up httptest servers returning
candidates with different "finishReason" strings, constructs audiollmgemini.New
with the test-server endpoint, calls model.GenerateFromAudio (using the same
request fields as TestGenerateFromAudio but varying the mock response), and
asserts resp.FinishReason equals the expected audiollm.FinishReason for each
case; reference mapFinishReason, GenerateFromAudio, and TestGenerateFromAudio
when locating where to add this test.

20-86: ⚡ Quick win

WebM → WAV transcoding path is not exercised despite being called out in the plan.

Plan step C.1 states "We also verify that WebM input is transcoded to WAV", but both tests use audio/mpeg and never trigger the IsWebMContentType → WebMOpusToWAV branch. A minimal test using a valid WebM/Opus byte fixture would cover that path and the resulting audio/wav content-type assertion on the wire.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/ai/audiollm/gemini/gemini_test.go` around lines 20-86: the test
suite never exercises the WebM → WAV branch. Add a new parallel test (or extend
TestGenerateFromAudio) that calls GenerateFromAudio with ContentType
"audio/webm" (so IsWebMContentType triggers), using a small valid WebM/Opus
fixture reader, and assert that the outgoing request's part MIMEType is
"audio/wav" and that inlineData.Data decodes to the expected WAV bytes produced
by WebMOpusToWAV. Reference TestGenerateFromAudio, GenerateFromAudio, and the
branch guard IsWebMContentType/WebMOpusToWAV so reviewers can locate the code,
and include fixture byte slices (e.g., webmFixture and expectedWav) to keep the
test deterministic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 56ea4848-fb70-462d-bae3-f55ed3b7cecf

📥 Commits

Reviewing files that changed from the base of the PR and between f5ec03f and e25b04a.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • proto/gen/store/instance_setting.pb.go is excluded by !**/*.pb.go, !**/gen/**
📒 Files selected for processing (23)
  • docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
  • docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
  • go.mod
  • internal/ai/audio/webm.go
  • internal/ai/audio/webm_test.go
  • internal/ai/audiollm/audiollm.go
  • internal/ai/audiollm/gemini/gemini.go
  • internal/ai/audiollm/gemini/gemini_test.go
  • internal/ai/audiollm/options.go
  • internal/ai/client.go
  • internal/ai/errors.go
  • internal/ai/gemini.go
  • internal/ai/openai.go
  • internal/ai/stt/openai/openai.go
  • internal/ai/stt/openai/openai_test.go
  • internal/ai/stt/options.go
  • internal/ai/stt/stt.go
  • internal/ai/transcription.go
  • proto/store/instance_setting.proto
  • server/router/api/v1/ai_service.go
  • server/router/api/v1/test/ai_service_test.go
  • web/src/components/Settings/AISection.tsx
  • web/src/locales/en.json
💤 Files with no reviewable changes (5)
  • internal/ai/transcription.go
  • web/src/components/Settings/AISection.tsx
  • internal/ai/openai.go
  • internal/ai/client.go
  • internal/ai/gemini.go
✅ Files skipped from review due to trivial changes (9)
  • internal/ai/errors.go
  • internal/ai/stt/options.go
  • internal/ai/audiollm/audiollm.go
  • go.mod
  • server/router/api/v1/test/ai_service_test.go
  • internal/ai/stt/openai/openai_test.go
  • internal/ai/audio/webm.go
  • web/src/locales/en.json
  • proto/store/instance_setting.proto
🚧 Files skipped from review as they are similar to previous changes (6)
  • internal/ai/audio/webm_test.go
  • internal/ai/audiollm/gemini/gemini.go
  • internal/ai/stt/openai/openai.go
  • internal/ai/stt/stt.go
  • internal/ai/audiollm/options.go
  • server/router/api/v1/ai_service.go

@boojack boojack merged commit 5ccba98 into main May 2, 2026
9 checks passed
@boojack boojack deleted the stt-audiollm-split branch May 2, 2026 16:22