Skip to content

feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime#401

Open
AIMO3D-ob wants to merge 1 commit intotiann:mainfrom
AIMO3D-ob:feat/pluggable-voice-backend
Open

feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime#401
AIMO3D-ob wants to merge 1 commit intotiann:mainfrom
AIMO3D-ob:feat/pluggable-voice-backend

Conversation

@AIMO3D-ob
Copy link
Copy Markdown

Summary

Add a pluggable voice backend architecture that extends the existing ElevenLabs ConvAI integration with two new voice providers:

  • Gemini 2.5 Live (gemini-live): Google's real-time audio streaming API via WebSocket, with full function calling support for messageCodingAgent and processPermissionRequest
  • Qwen Realtime (qwen-realtime): Alibaba's DashScope real-time voice API via Hub WebSocket proxy, supporting voice conversation (function calling pending model support)

Users can switch backends via the VOICE_BACKEND environment variable. The existing ElevenLabs integration remains the default and is completely unchanged.

Key Design Decisions

  • Runtime discovery: GET /voice/backend lets the frontend detect the active backend without Vite rebuild
  • Code splitting: React.lazy() ensures alternative backends are only loaded when active
  • Zero upstream breakage: All original ElevenLabs code paths untouched; new code is additive
  • Inline AudioWorklet: Uses Blob URL instead of Vite ?url import to avoid MIME type issues in production builds
  • Qwen WebSocket proxy: Hub proxies Qwen connections at /api/voice/qwen-ws because browser WebSocket API cannot set Authorization headers
  • Barge-in prevention: Auto-mutes microphone during model speech to prevent ambient noise from interrupting responses
  • PWA immediate activation: Added skipWaiting + clientsClaim to service worker for instant deployment updates

Configuration

# Gemini Live (recommended - free tier, full function calling)
VOICE_BACKEND=gemini-live
GEMINI_API_KEY=your-google-api-key

# Qwen Realtime (voice-only, function calling not yet supported by model)
VOICE_BACKEND=qwen-realtime
DASHSCOPE_API_KEY=your-dashscope-key

# ElevenLabs (default, unchanged)
VOICE_BACKEND=elevenlabs
ELEVENLABS_API_KEY=your-elevenlabs-key

Files Changed

Area Files Description
Shared shared/src/voice.ts Voice backend types, Gemini/Qwen model constants, tool-optimized system prompt
Hub Routes hub/src/web/routes/voice.ts Backend discovery + token endpoints for Gemini & Qwen
Hub Server hub/src/web/server.ts Qwen WebSocket proxy handler
Web API web/src/api/client.ts, voice.ts Client functions for new endpoints
Gemini Session web/src/realtime/GeminiLiveVoiceSession.tsx Full Gemini Live implementation (WebSocket + AudioWorklet)
Qwen Session web/src/realtime/QwenVoiceSession.tsx Qwen Realtime implementation (OpenAI-compatible protocol)
Audio Pipeline web/src/realtime/gemini/ PCM utils, AudioWorklet recorder, 24kHz player, tool adapter
Switcher web/src/realtime/VoiceBackendSession.tsx Dynamic backend selector with lazy loading
Integration web/src/components/SessionChat.tsx Uses VoiceBackendSession instead of RealtimeVoiceSession
PWA web/src/sw.ts skipWaiting + clientsClaim
Tests hub/src/web/routes/voice.test.ts, pcmUtils.test.ts, toolAdapter.test.ts 16 test cases

Test Plan

  • ElevenLabs backend still works (no code changes to existing paths)
  • Gemini Live: voice conversation works
  • Gemini Live: function calling (messageCodingAgent) triggers correctly
  • Gemini Live: barge-in prevention (no mid-speech interruption from noise)
  • Qwen Realtime: voice conversation works via Hub WebSocket proxy
  • Hub route tests pass (backend discovery, token endpoints)
  • PCM audio conversion round-trip tests pass
  • Tool adapter tests pass
  • TypeScript type-check passes for both hub and web
  • Test on mobile browsers (iOS Safari, Android Chrome)

Add a strategy-based voice backend architecture that allows switching
between ElevenLabs ConvAI, Gemini Live API, and Qwen Realtime via
the VOICE_BACKEND environment variable.

New backends:
- Gemini 2.5 Live (gemini-live): WebSocket + AudioWorklet audio pipeline,
  full function calling support for messageCodingAgent/processPermissionRequest
- Qwen Realtime (qwen-realtime): DashScope API via Hub WebSocket proxy,
  voice conversation support (function calling not yet supported by model)

Architecture:
- VoiceBackendSession dynamically selects backend via GET /voice/backend
- React.lazy() code splitting — alternative backends not bundled when unused
- Hub routes: GET /voice/backend, POST /voice/gemini-token, POST /voice/qwen-token
- Hub WebSocket proxy at /api/voice/qwen-ws for Qwen (browser can't set Auth header)
- Inline Blob URL AudioWorklet for Vite compatibility
- Auto mic mute during model speech to prevent barge-in from ambient noise
- Tool-call-optimized system prompt (Chinese, no greeting turn)
- PWA skipWaiting + clientsClaim for immediate deployment activation

Switch via environment:
  VOICE_BACKEND=gemini-live    GEMINI_API_KEY=xxx
  VOICE_BACKEND=qwen-realtime  DASHSCOPE_API_KEY=xxx
  VOICE_BACKEND=elevenlabs     ELEVENLABS_API_KEY=xxx (default, unchanged)
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Blocker] Qwen WebSocket proxy bypasses API auth and can be opened without a JWT, which lets any reachable client consume the hub's DashScope credentials through /api/voice/qwen-ws. Evidence hub/src/web/server.ts:328.
  • [Major] The fallback voice backend is now gemini-live, so existing installs that only configured ElevenLabs will be routed away from the existing token flow and fail voice startup. Evidence shared/src/voice.ts:280, hub/src/web/routes/voice.ts:121.
  • [Major] The Qwen frontend still requires the hub to return a raw DashScope key even though the browser never uses it after switching to the hub WebSocket proxy, so every authenticated web client now receives a long-lived provider secret unnecessarily. Evidence web/src/realtime/QwenVoiceSession.tsx:84, hub/src/web/routes/voice.ts:162.

Summary
Review mode: initial
Three findings. Added coverage does not exercise the new /api/voice/qwen-ws auth boundary, and the new route tests would not catch the default-backend regression because they do not assert against DEFAULT_VOICE_BACKEND.

Testing

  • Not run (automation): bun is not installed in this runner.

HAPI Bot

return socketHandler.fetch(req, server as never)
}
// Qwen Realtime WebSocket proxy
if (url.pathname === '/api/voice/qwen-ws') {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BLOCKER] This branch upgrades /api/voice/qwen-ws before the request ever reaches app.fetch(req), so it bypasses createAuthMiddleware() entirely. As written, any client that can reach the hub can open a proxied DashScope session with the server-side API key.

Suggested fix:

const token = url.searchParams.get('token')
if (!token) {
    return new Response('Missing authorization token', { status: 401 })
}

await jwtVerify(token, options.jwtSecret, { algorithms: ['HS256'] })
const upgraded = server.upgrade(req, {
    data: { _qwenProxy: true, apiKey, model }
})

export const QWEN_REALTIME_MODEL = 'qwen3-omni-flash-realtime'
export const QWEN_REALTIME_VOICE = 'Mia'

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'gemini-live'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This flips the implicit backend from ElevenLabs to Gemini Live. On a hub that only has ELEVENLABS_API_KEY configured, /api/voice/backend now resolves to gemini-live, and VoiceBackendSession will send users into /api/voice/gemini-token instead of the existing ElevenLabs flow.

Suggested fix:

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'elevenlabs'


// Get API key from hub
const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed || !tokenResp.apiKey) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] The browser does not use this provider key after the Qwen proxy change. startSession() only checks that apiKey exists, then opens /api/voice/qwen-ws, where the hub injects its own Authorization header. Returning a long-lived DashScope key here needlessly leaks it to every authenticated web client.

Suggested fix:

const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed) {
    const msg = tokenResp.error ?? 'DashScope API key not available'
    state.statusCallback?.('error', msg)
    throw new Error(msg)
}
state.wsBaseUrl = tokenResp.wsUrl || null

Copy link
Copy Markdown
Owner

@tiann tiann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution. I believe this is a good feature. Please fix the comments first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants