feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime by AIMO3D-ob · Pull Request #401 · tiann/hapi

AIMO3D-ob · 2026-04-05T13:30:37Z

Summary

Add a pluggable voice backend architecture that extends the existing ElevenLabs ConvAI integration with two new voice providers:

Gemini 2.5 Live (gemini-live): Google's real-time audio streaming API via WebSocket, with full function calling support for messageCodingAgent and processPermissionRequest
Qwen Realtime (qwen-realtime): Alibaba's DashScope real-time voice API via Hub WebSocket proxy, supporting voice conversation (function calling pending model support)

Users can switch backends via the VOICE_BACKEND environment variable. The existing ElevenLabs integration remains the default and is completely unchanged.

Key Design Decisions

Runtime discovery: GET /voice/backend lets the frontend detect the active backend without Vite rebuild
Code splitting: React.lazy() ensures alternative backends are only loaded when active
Zero upstream breakage: All original ElevenLabs code paths untouched; new code is additive
Inline AudioWorklet: Uses Blob URL instead of Vite ?url import to avoid MIME type issues in production builds
Qwen WebSocket proxy: Hub proxies Qwen connections at /api/voice/qwen-ws because browser WebSocket API cannot set Authorization headers
Barge-in prevention: Auto-mutes microphone during model speech to prevent ambient noise from interrupting responses
PWA immediate activation: Added skipWaiting + clientsClaim to service worker for instant deployment updates

Configuration

# Gemini Live (recommended - free tier, full function calling)
VOICE_BACKEND=gemini-live
GEMINI_API_KEY=your-google-api-key

# Qwen Realtime (voice-only, function calling not yet supported by model)
VOICE_BACKEND=qwen-realtime
DASHSCOPE_API_KEY=your-dashscope-key

# ElevenLabs (default, unchanged)
VOICE_BACKEND=elevenlabs
ELEVENLABS_API_KEY=your-elevenlabs-key
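
As a rough illustration of how the hub might map these variables to a backend, the sketch below resolves `VOICE_BACKEND` and checks that the matching key is present. `resolveVoiceBackend`, `isVoiceBackend`, and `REQUIRED_KEY` are assumed names for illustration only. The fallback to `elevenlabs` follows the stated intent that ElevenLabs remains the default.

```typescript
type VoiceBackendType = 'elevenlabs' | 'gemini-live' | 'qwen-realtime';

// Which environment variable each backend needs (taken from the config above).
const REQUIRED_KEY: Record<VoiceBackendType, string> = {
  'elevenlabs': 'ELEVENLABS_API_KEY',
  'gemini-live': 'GEMINI_API_KEY',
  'qwen-realtime': 'DASHSCOPE_API_KEY',
};

function isVoiceBackend(value: string): value is VoiceBackendType {
  return Object.prototype.hasOwnProperty.call(REQUIRED_KEY, value);
}

// Hypothetical resolver: unknown or missing values fall back to ElevenLabs.
function resolveVoiceBackend(
  env: Record<string, string | undefined>,
): { backend: VoiceBackendType; keyConfigured: boolean } {
  const raw = (env.VOICE_BACKEND ?? '').trim();
  const backend: VoiceBackendType = isVoiceBackend(raw) ? raw : 'elevenlabs';
  const key = env[REQUIRED_KEY[backend]];
  return { backend, keyConfigured: typeof key === 'string' && key.length > 0 };
}
```

Passing the environment as a parameter, rather than reading a global, keeps the function easy to test.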

Files Changed

Area	Files	Description
Shared	`shared/src/voice.ts`	Voice backend types, Gemini/Qwen model constants, tool-optimized system prompt
Hub Routes	`hub/src/web/routes/voice.ts`	Backend discovery + token endpoints for Gemini & Qwen
Hub Server	`hub/src/web/server.ts`	Qwen WebSocket proxy handler
Web API	`web/src/api/client.ts`, `voice.ts`	Client functions for new endpoints
Gemini Session	`web/src/realtime/GeminiLiveVoiceSession.tsx`	Full Gemini Live implementation (WebSocket + AudioWorklet)
Qwen Session	`web/src/realtime/QwenVoiceSession.tsx`	Qwen Realtime implementation (OpenAI-compatible protocol)
Audio Pipeline	`web/src/realtime/gemini/`	PCM utils, AudioWorklet recorder, 24kHz player, tool adapter
Switcher	`web/src/realtime/VoiceBackendSession.tsx`	Dynamic backend selector with lazy loading
Integration	`web/src/components/SessionChat.tsx`	Uses `VoiceBackendSession` instead of `RealtimeVoiceSession`
PWA	`web/src/sw.ts`	`skipWaiting` + `clientsClaim`
Tests	`hub/src/web/routes/voice.test.ts`, `pcmUtils.test.ts`, `toolAdapter.test.ts`	16 test cases
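
The contents of the PCM utilities are not shown on this page. As a hedged sketch of what such helpers typically do, the functions below convert between Web Audio float samples and 16-bit PCM. The names `floatTo16BitPCM` and `int16ToFloat32` are assumptions, not the PR's actual exports.

```typescript
// Convert Web Audio float samples in [-1, 1] to signed 16-bit PCM.
// Out-of-range samples are clamped instead of wrapping around.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  input.forEach((sample, i) => {
    const s = Math.max(-1, Math.min(1, sample));
    // Negative samples scale by 32768, positive ones by 32767.
    out[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  });
  return out;
}

// Convert signed 16-bit PCM back to floats for playback.
function int16ToFloat32(input: Int16Array): Float32Array {
  return Float32Array.from(input, (v) => v / 0x8000);
}
```

The asymmetric scaling maps the full float range onto the full 16-bit range without overflow.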

Test Plan

Add a strategy-based voice backend architecture that allows switching between ElevenLabs ConvAI, Gemini Live API, and Qwen Realtime via the VOICE_BACKEND environment variable.

New backends:
- Gemini 2.5 Live (gemini-live): WebSocket + AudioWorklet audio pipeline, full function calling support for messageCodingAgent/processPermissionRequest
- Qwen Realtime (qwen-realtime): DashScope API via Hub WebSocket proxy, voice conversation support (function calling not yet supported by model)

Architecture:
- VoiceBackendSession dynamically selects backend via GET /voice/backend
- React.lazy() code splitting: alternative backends not bundled when unused
- Hub routes: GET /voice/backend, POST /voice/gemini-token, POST /voice/qwen-token
- Hub WebSocket proxy at /api/voice/qwen-ws for Qwen (browser can't set Auth header)
- Inline Blob URL AudioWorklet for Vite compatibility
- Auto mic mute during model speech to prevent barge-in from ambient noise
- Tool-call-optimized system prompt (Chinese, no greeting turn)
- PWA skipWaiting + clientsClaim for immediate deployment activation

Switch via environment:
VOICE_BACKEND=gemini-live GEMINI_API_KEY=xxx
VOICE_BACKEND=qwen-realtime DASHSCOPE_API_KEY=xxx
VOICE_BACKEND=elevenlabs ELEVENLABS_API_KEY=xxx (default, unchanged)

github-actions

Findings

[Blocker] Qwen WebSocket proxy bypasses API auth and can be opened without a JWT, which lets any reachable client consume the hub's DashScope credentials through /api/voice/qwen-ws. Evidence hub/src/web/server.ts:328.
[Major] The fallback voice backend is now gemini-live, so existing installs that only configured ElevenLabs will be routed away from the existing token flow and fail voice startup. Evidence shared/src/voice.ts:280, hub/src/web/routes/voice.ts:121.
[Major] The Qwen frontend still requires the hub to return a raw DashScope key even though the browser never uses it after switching to the hub WebSocket proxy, so every authenticated web client now receives a long-lived provider secret unnecessarily. Evidence web/src/realtime/QwenVoiceSession.tsx:84, hub/src/web/routes/voice.ts:162.

Summary
Review mode: initial
Three findings. Added coverage does not exercise the new /api/voice/qwen-ws auth boundary, and the new route tests would not catch the default-backend regression because they do not assert against DEFAULT_VOICE_BACKEND.

Testing

Not run (automation): bun is not installed in this runner.

HAPI Bot

github-actions · 2026-04-05T13:35:10Z

hub/src/web/server.ts

+                return socketHandler.fetch(req, server as never)
+            }
+            // Qwen Realtime WebSocket proxy
+            if (url.pathname === '/api/voice/qwen-ws') {


[BLOCKER] This branch upgrades /api/voice/qwen-ws before the request ever reaches app.fetch(req), so it bypasses createAuthMiddleware() entirely. As written, any client that can reach the hub can open a proxied DashScope session with the server-side API key.

Suggested fix:

const token = url.searchParams.get('token')
if (!token) {
    return new Response('Missing authorization token', { status: 401 })
}
await jwtVerify(token, options.jwtSecret, { algorithms: ['HS256'] })
const upgraded = server.upgrade(req, {
    data: { _qwenProxy: true, apiKey, model }
})
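
One way to keep this check testable is to isolate it from the server. The sketch below takes the token and an injected verifier and returns a plain result object. `authorizeProxyUpgrade` and `AuthResult` are hypothetical names. In the hub, `verify` would presumably wrap the `jwtVerify` call from the suggestion above. Unlike the snippet above, a failed verification is caught here and mapped to a 401 instead of being left to throw.

```typescript
type AuthResult =
  | { ok: true }
  | { ok: false; status: number; message: string };

// Hypothetical helper: decide whether a WebSocket upgrade may proceed.
// `verify` must reject (throw) for an invalid or expired token.
async function authorizeProxyUpgrade(
  token: string | null,
  verify: (token: string) => Promise<unknown>,
): Promise<AuthResult> {
  if (!token) {
    return { ok: false, status: 401, message: 'Missing authorization token' };
  }
  try {
    await verify(token);
    return { ok: true };
  } catch {
    return { ok: false, status: 401, message: 'Invalid authorization token' };
  }
}
```

The caller would only invoke `server.upgrade(...)` when the result is `{ ok: true }`.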

github-actions · 2026-04-05T13:35:10Z

shared/src/voice.ts

+export const QWEN_REALTIME_MODEL = 'qwen3-omni-flash-realtime'
+export const QWEN_REALTIME_VOICE = 'Mia'
+
+export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'gemini-live'


[MAJOR] This flips the implicit backend from ElevenLabs to Gemini Live. On a hub that only has ELEVENLABS_API_KEY configured, /api/voice/backend now resolves to gemini-live, and VoiceBackendSession will send users into /api/voice/gemini-token instead of the existing ElevenLabs flow.

Suggested fix:

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'elevenlabs'

github-actions · 2026-04-05T13:35:10Z

web/src/realtime/QwenVoiceSession.tsx

+
+        // Get API key from hub
+        const tokenResp = await fetchQwenToken(this.api)
+        if (!tokenResp.allowed || !tokenResp.apiKey) {


[MAJOR] The browser does not use this provider key after the Qwen proxy change. startSession() only checks that apiKey exists, then opens /api/voice/qwen-ws, where the hub injects its own Authorization header. Returning a long-lived DashScope key here needlessly leaks it to every authenticated web client.

Suggested fix:

const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed) {
    const msg = tokenResp.error ?? 'DashScope API key not available'
    state.statusCallback?.('error', msg)
    throw new Error(msg)
}
state.wsBaseUrl = tokenResp.wsUrl || null
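
The hub side of the same fix could be sketched as a response builder that never echoes the provider key. The field names `allowed`, `wsUrl`, and `error` appear in the snippets above. `buildQwenTokenResponse` and the exact response shape are assumptions made for illustration.

```typescript
interface QwenTokenResponse {
  allowed: boolean;
  wsUrl?: string;
  error?: string;
}

// Hypothetical hub-side builder: the API key is only used to decide whether
// the backend is available and is never copied into the response.
function buildQwenTokenResponse(
  apiKey: string | undefined,
  wsUrl: string,
): QwenTokenResponse {
  if (!apiKey) {
    return { allowed: false, error: 'DashScope API key not available' };
  }
  return { allowed: true, wsUrl };
}
```

With this shape the browser learns only whether Qwen is usable and where the proxy lives, while the key stays on the hub.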

tiann

Thank you for your contribution. I believe this is a good feature. Please fix the comments first.

github-actions bot reviewed Apr 5, 2026

tiann requested changes Apr 6, 2026

AIMO3D-ob wants to merge 1 commit into tiann:main from AIMO3D-ob:feat/pluggable-voice-backend
