Desktop: realtime voice hub — Gemini reliability, voice-status UX, per-model prompts#8100
Desktop: realtime voice hub — Gemini reliability, voice-status UX, per-model prompts#8100vendz wants to merge 13 commits into
Conversation
…te can't strand a voice turn Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed) + tap-to-stop Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bar state Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing bar Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e on PTT-up) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…and Gemini Live Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rness Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(avoid HFP) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Pull request overview
This PR hardens the macOS realtime voice hub by improving Gemini Live turn reliability, adding a post-PTT voice-status UX (thinking/speaking/failure + tap-to-stop), and splitting the hub system prompt into per-model variants tuned for OpenAI Realtime vs Gemini Live.
Changes:
- Downscale in-turn Gemini vision frames and add a Gemini turn-completion watchdog timer to prevent stranded turns.
- Add a voice-status pill after PTT-up (thinking → speaking → no response) and user-initiated stop behavior, plus a single
voiceOwnsBarconcept to prevent hover/resize conflicts. - Split hub system prompts into per-provider prompts and plumb provider through the session/controller/test harness.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| desktop/macos/Desktop/Sources/FloatingControlBar/ScreenCaptureManager.swift | Downscales JPEG screen captures to reduce Gemini payload size and avoid server closes. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift | Introduces per-provider system prompts and updates instruction plumbing. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTestHarness.swift | Updates harness to pass provider-specific instructions. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift | Adds Gemini watchdog timer and rearms it on relevant reply events. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift | Adds voice-status phase management, stop-speaking behavior, and uses provider-specific prompts. |
| desktop/macos/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | Prevents collapsing the bar on PTT-up while a voice response phase is active. |
| desktop/macos/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift | Disables hover-resize whenever voice “owns” the bar (listening or status pill). |
| desktop/macos/Desktop/Sources/FloatingControlBar/FloatingControlBarView.swift | Adds the post-release voice response UI with tap-to-stop. |
| desktop/macos/Desktop/Sources/FloatingControlBar/FloatingControlBarState.swift | Adds VoiceResponsePhase and voiceOwnsBar to unify voice bar-ownership logic. |
| desktop/macos/Desktop/Sources/AudioCaptureService.swift | Selects built-in mic when output is Bluetooth to avoid A2DP→HFP degradation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static func systemInstruction(aboutUser: String, provider: RealtimeHubProvider) -> String { | ||
| switch provider { | ||
| case .openai: return openAIInstruction(aboutUser: aboutUser) | ||
| case .gemini: return geminiInstruction(aboutUser: aboutUser) | ||
| } | ||
| } |
| } else if Self.isDefaultOutputBluetooth(), let builtIn = Self.findBuiltInMicDeviceID() { | ||
| // Output is a Bluetooth device (e.g. AirPods). Opening the BT *microphone* forces the | ||
| // headset out of high-quality A2DP into 16 kHz HFP "call" mode — which degrades ALL | ||
| // playback (including the hub's spoken reply) and frequently makes macOS deliver only | ||
| // silence (the A2DP/HFP profile conflict). That silence trips the silent-mic watchdog, |
| geminiWatchdog = timer | ||
| timer.resume() | ||
| return timer |
There was a problem hiding this comment.
5 issues found across 10 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift">
<violation number="1" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift:181">
P2: Gemini prompt contains a hard single-tool rule that contradicts flows requiring sequential tools. This can break task updates that need `get_tasks` followed by `update_action_item`.</violation>
</file>
<file name="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift">
<violation number="1" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift:299">
P1: Gemini watchdog can prematurely end valid tool-call turns after 15s. If a tool response is slow, timer fires, marks turn finished, and subsequent reply audio/events are ignored.</violation>
</file>
<file name="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift">
<violation number="1" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift:255">
P2: `commitTurn` shows "thinking" even when no live session exists. After a mid-turn disconnect this can leave the voice status stuck indefinitely.</violation>
<violation number="2" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift:555">
P2: Failure state is keyed only on native audio, so valid text-fallback replies are mislabeled as "no response." This produces incorrect UX during fallback turns.</violation>
</file>
<file name="desktop/macos/Desktop/Sources/AudioCaptureService.swift">
<violation number="1" location="desktop/macos/Desktop/Sources/AudioCaptureService.swift:190">
P2: Auto-selecting built-in mic for Bluetooth output is not pinned, so later reconfiguration can revert capture back to default input and reintroduce the Bluetooth HFP/A2DP conflict.</violation>
</file>
Tip: instead of fixing issues one by one fix them all with cubic
Re-trigger cubic
| // Recover if the reply never starts or never completes (Gemini Live drops the | ||
| // trailing turnComplete / truncates replies intermittently). 15s covers the | ||
| // worst-case think latency before the first audio chunk. | ||
| self.armGeminiWatchdog(15) |
There was a problem hiding this comment.
P1: Gemini watchdog can prematurely end valid tool-call turns after 15s. If a tool response is slow, timer fires, marks turn finished, and subsequent reply audio/events are ignored.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift, line 299:
<comment>Gemini watchdog can prematurely end valid tool-call turns after 15s. If a tool response is slow, timer fires, marks turn finished, and subsequent reply audio/events are ignored.</comment>
<file context>
@@ -288,6 +293,10 @@ final class RealtimeHubSession: NSObject {
+ // Recover if the reply never starts or never completes (Gemini Live drops the
+ // trailing turnComplete / truncates replies intermittently). 15s covers the
+ // worst-case think latency before the first audio chunk.
+ self.armGeminiWatchdog(15)
// Gemini auto-responds at activityEnd; no explicit response request.
}
</file context>
| <your_tools> | ||
| You CAN read the user's Omi data and act on their Mac, but ONLY through these tools — you cannot see their data, screen, tasks, or memories without calling one. Before any tool, say a SHORT, SPECIFIC, VARIED heads-up out loud first (e.g. "Checking your tasks now" / "Let me pull that conversation up" — never the same robotic phrase twice). Then call the tool. Stay quiet until it returns; NEVER speak the answer before the result comes back; never skip a needed call; never read out JSON, ids, or raw fields. Speak only what the result actually says. | ||
|
|
||
| Pick ONE tool that fits, call it once, then answer. |
There was a problem hiding this comment.
P2: Gemini prompt contains a hard single-tool rule that contradicts flows requiring sequential tools. This can break task updates that need get_tasks followed by update_action_item.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift, line 181:
<comment>Gemini prompt contains a hard single-tool rule that contradicts flows requiring sequential tools. This can break task updates that need `get_tasks` followed by `update_action_item`.</comment>
<file context>
@@ -54,104 +54,178 @@ enum HubTool: String {
+<your_tools>
+You CAN read the user's Omi data and act on their Mac, but ONLY through these tools — you cannot see their data, screen, tasks, or memories without calling one. Before any tool, say a SHORT, SPECIFIC, VARIED heads-up out loud first (e.g. "Checking your tasks now" / "Let me pull that conversation up" — never the same robotic phrase twice). Then call the tool. Stay quiet until it returns; NEVER speak the answer before the result comes back; never skip a needed call; never read out JSON, ids, or raw fields. Speak only what the result actually says.
+
+Pick ONE tool that fits, call it once, then answer.
+
+PERSONAL DATA (read):
</file context>
| Pick ONE tool that fits, call it once, then answer. | |
| Use the FEWEST tools needed: call one tool when possible, and chain multiple tool calls only when required (for example, get_tasks first, then update_action_item), then answer. |
| // Resolve the status: a turn that produced audio just ends (collapse); a turn that | ||
| // finished with NO audio — e.g. the Gemini turn-completion watchdog fired on a dropped/ | ||
| // empty reply — briefly tells the user it got nothing, instead of silently collapsing. | ||
| setVoicePhase(audioReceivedThisTurn ? .none : .failed) |
There was a problem hiding this comment.
P2: Failure state is keyed only on native audio, so valid text-fallback replies are mislabeled as "no response." This produces incorrect UX during fallback turns.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift, line 555:
<comment>Failure state is keyed only on native audio, so valid text-fallback replies are mislabeled as "no response." This produces incorrect UX during fallback turns.</comment>
<file context>
@@ -539,18 +549,26 @@ final class RealtimeHubController: NSObject, RealtimeHubSessionDelegate {
+ // Resolve the status: a turn that produced audio just ends (collapse); a turn that
+ // finished with NO audio — e.g. the Gemini turn-completion watchdog fired on a dropped/
+ // empty reply — briefly tells the user it got nothing, instead of silently collapsing.
+ setVoicePhase(audioReceivedThisTurn ? .none : .failed)
exitVoiceUI()
}
</file context>
| setVoicePhase(audioReceivedThisTurn ? .none : .failed) | |
| setVoicePhase((audioReceivedThisTurn || !reply.isEmpty) ? .none : .failed) |
| responding = true | ||
| // Keep the bar alive after release: show "thinking" until the first audio arrives, so the | ||
| // commit→reply gap (model latency / reconnect / a slow turn) never looks like a dead bar. | ||
| setVoicePhase(.thinking) |
There was a problem hiding this comment.
P2: commitTurn shows "thinking" even when no live session exists. After a mid-turn disconnect this can leave the voice status stuck indefinitely.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift, line 255:
<comment>`commitTurn` shows "thinking" even when no live session exists. After a mid-turn disconnect this can leave the voice status stuck indefinitely.</comment>
<file context>
@@ -245,6 +250,9 @@ final class RealtimeHubController: NSObject, RealtimeHubSessionDelegate {
responding = true
+ // Keep the bar alive after release: show "thinking" until the first audio arrives, so the
+ // commit→reply gap (model latency / reconnect / a slow turn) never looks like a dead bar.
+ setVoicePhase(.thinking)
// (The screen frame is sent at turn START — see beginTurn — so it has time to
// upload/decode before the model answers. Nothing to attach here.)
</file context>
| setVoicePhase(.thinking) | |
| if session != nil { setVoicePhase(.thinking) } |
| // silence (the A2DP/HFP profile conflict). That silence trips the silent-mic watchdog, | ||
| // which swaps devices and rebuilds the audio engine MID-REPLY, cutting the reply off. | ||
| // So capture from the built-in mic and leave Bluetooth in A2DP. Mirrors the PTT path. | ||
| inputDeviceID = builtIn |
There was a problem hiding this comment.
P2: Auto-selecting built-in mic for Bluetooth output is not pinned, so later reconfiguration can revert capture back to default input and reintroduce the Bluetooth HFP/A2DP conflict.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/AudioCaptureService.swift, line 190:
<comment>Auto-selecting built-in mic for Bluetooth output is not pinned, so later reconfiguration can revert capture back to default input and reintroduce the Bluetooth HFP/A2DP conflict.</comment>
<file context>
@@ -180,6 +180,15 @@ class AudioCaptureService: @unchecked Sendable {
+ // silence (the A2DP/HFP profile conflict). That silence trips the silent-mic watchdog,
+ // which swaps devices and rebuilds the audio engine MID-REPLY, cutting the reply off.
+ // So capture from the built-in mic and leave Bluetooth in A2DP. Mirrors the PTT path.
+ inputDeviceID = builtIn
+ log("AudioCapture: output is Bluetooth — capturing from built-in mic id=\(builtIn) to keep A2DP")
} else {
</file context>
… no unprompted screen narration or tacked-on offers Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eady-released things 'upcoming' Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…in voice prompts Tells both models today's date (so a released title isn't treated as 'future'), and to recognize an unreferenced 'this/that' as the user's screen — OpenAI calls screenshot first, Gemini uses the frame it already gets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reliability + UX hardening for the realtime voice hub, plus per-model prompts. Builds on #8002.
turnComplete); built-in mic when output is Bluetooth (avoids the A2DP→HFP reply cutoff); downscaled in-turn vision frame.Note: remaining Gemini choppiness is server-side (preview-tier audio truncation); this makes the client resilient to it. OpenAI Realtime is unaffected.