Desktop: realtime voice hub — Gemini reliability, voice-status UX, per-model prompts #8100

Open
vendz wants to merge 13 commits into BasedHardware:main from vendz:desktop/realtime-hub-gemini-reliability

vendz commented Jun 21, 2026

Reliability + UX hardening for the realtime voice hub, plus per-model prompts. Builds on #8002.

  • Gemini reliability: turn-completion watchdog (recovers a dropped turnComplete); built-in mic when output is Bluetooth (avoids the A2DP→HFP reply cutoff); downscaled in-turn vision frame.
  • Voice UX: thinking → speaking → "no response" status pill, and tap-to-stop.
  • Prompts: separate, model-tuned system prompts for OpenAI Realtime and Gemini Live — date-aware, infer on-screen references, answer from own knowledge, concise.
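
The "downscaled in-turn vision frame" in the first bullet can be illustrated with a small size calculation. This is only a sketch: the PR text does not state the target resolution, so the `maxEdge` value of 1024 used in the usage note, and the names `PixelSize` and `downscaled`, are assumptions made for illustration.

```swift
/// Hypothetical value type for this sketch; not taken from the PR.
struct PixelSize: Equatable {
    var width: Int
    var height: Int
}

/// Scales `size` down so that its longer edge is at most `maxEdge`,
/// preserving the aspect ratio. Sizes already within the limit are
/// returned unchanged, so a frame is never upscaled.
func downscaled(_ size: PixelSize, maxEdge: Int) -> PixelSize {
    let longest = max(size.width, size.height)
    guard maxEdge > 0, longest > maxEdge else { return size }
    let scale = Double(maxEdge) / Double(longest)
    return PixelSize(
        width: max(1, Int((Double(size.width) * scale).rounded())),
        height: max(1, Int((Double(size.height) * scale).rounded()))
    )
}
```

For example, `downscaled(PixelSize(width: 2880, height: 1800), maxEdge: 1024)` yields a 1024 by 640 frame, which is a much smaller JPEG to upload during a turn.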

Note: remaining Gemini choppiness is server-side (preview-tier audio truncation); this makes the client resilient to it. OpenAI Realtime is unaffected.

vendz and others added 10 commits June 21, 2026 15:57
…te can't strand a voice turn

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed) + tap-to-stop

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bar state

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing bar

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e on PTT-up)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…and Gemini Live

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rness

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(avoid HFP)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 21, 2026 22:59

Copilot AI left a comment

Pull request overview

This PR hardens the macOS realtime voice hub by improving Gemini Live turn reliability, adding a post-PTT voice-status UX (thinking/speaking/failure + tap-to-stop), and splitting the hub system prompt into per-model variants tuned for OpenAI Realtime vs Gemini Live.

Changes:

  • Downscale in-turn Gemini vision frames and add a Gemini turn-completion watchdog timer to prevent stranded turns.
  • Add a voice-status pill after PTT-up (thinking → speaking → no response) and user-initiated stop behavior, plus a single voiceOwnsBar concept to prevent hover/resize conflicts.
  • Split hub system prompts into per-provider prompts and plumb provider through the session/controller/test harness.
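
The single `voiceOwnsBar` concept mentioned above can be sketched as a computed property over the bar state. The phase names follow the cases visible in the diff snippets (`.none`, `.thinking`, `.failed`) plus the "speaking" state from the description; the `BarState` struct and its `isListening` flag are assumptions for illustration, not the PR's actual types.

```swift
/// Phases of the post-release status pill, as described in the PR.
enum VoiceResponsePhase: Equatable {
    case none, thinking, speaking, failed
}

/// Hypothetical stand-in for the floating bar state.
struct BarState {
    var isListening = false
    var voicePhase: VoiceResponsePhase = VoiceResponsePhase.none

    /// True while voice should control the bar's size and content:
    /// either the mic is live (listening) or a status pill is showing.
    var voiceOwnsBar: Bool {
        isListening || voicePhase != VoiceResponsePhase.none
    }
}
```

Hover-resize and collapse-on-PTT-up logic can then check this one flag instead of testing listening and pill state separately.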

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| desktop/macos/Desktop/Sources/FloatingControlBar/ScreenCaptureManager.swift | Downscales JPEG screen captures to reduce Gemini payload size and avoid server closes. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift | Introduces per-provider system prompts and updates instruction plumbing. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTestHarness.swift | Updates harness to pass provider-specific instructions. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift | Adds Gemini watchdog timer and rearms it on relevant reply events. |
| desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift | Adds voice-status phase management, stop-speaking behavior, and uses provider-specific prompts. |
| desktop/macos/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | Prevents collapsing the bar on PTT-up while a voice response phase is active. |
| desktop/macos/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift | Disables hover-resize whenever voice “owns” the bar (listening or status pill). |
| desktop/macos/Desktop/Sources/FloatingControlBar/FloatingControlBarView.swift | Adds the post-release voice response UI with tap-to-stop. |
| desktop/macos/Desktop/Sources/FloatingControlBar/FloatingControlBarState.swift | Adds VoiceResponsePhase and voiceOwnsBar to unify voice bar-ownership logic. |
| desktop/macos/Desktop/Sources/AudioCaptureService.swift | Selects built-in mic when output is Bluetooth to avoid A2DP→HFP degradation. |

Comment on lines +64 to +69
static func systemInstruction(aboutUser: String, provider: RealtimeHubProvider) -> String {
    switch provider {
    case .openai: return openAIInstruction(aboutUser: aboutUser)
    case .gemini: return geminiInstruction(aboutUser: aboutUser)
    }
}
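
A self-contained sketch of how a per-provider, date-aware instruction could be assembled is shown here. The dispatch on `provider` mirrors the snippet above; everything else (the `todayLine` helper, the injected `now` and `timeZone` parameters, and the bracketed placeholder prompt text) is assumed for illustration and is not the PR's actual prompt content.

```swift
import Foundation

enum RealtimeHubProvider { case openai, gemini }

/// Hypothetical helper: formats a fixed-format date line for the prompt.
/// `now` and `timeZone` are injected so the output is deterministic in tests.
func todayLine(now: Date, timeZone: TimeZone) -> String {
    let formatter = DateFormatter()
    formatter.locale = Locale(identifier: "en_US_POSIX")
    formatter.timeZone = timeZone
    formatter.dateFormat = "yyyy-MM-dd"
    return "Today's date is \(formatter.string(from: now))."
}

/// Builds the shared preamble, then appends a provider-specific body.
func systemInstruction(aboutUser: String,
                       provider: RealtimeHubProvider,
                       now: Date = Date(),
                       timeZone: TimeZone = .current) -> String {
    let shared = todayLine(now: now, timeZone: timeZone) + "\nAbout the user: \(aboutUser)"
    switch provider {
    case .openai: return shared + "\n[OpenAI Realtime prompt body (placeholder)]"
    case .gemini: return shared + "\n[Gemini Live prompt body (placeholder)]"
    }
}
```

Passing the clock in as a parameter keeps the date line testable and lets both provider prompts share one source for "today".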
Comment on lines +183 to +187
} else if Self.isDefaultOutputBluetooth(), let builtIn = Self.findBuiltInMicDeviceID() {
// Output is a Bluetooth device (e.g. AirPods). Opening the BT *microphone* forces the
// headset out of high-quality A2DP into 16 kHz HFP "call" mode — which degrades ALL
// playback (including the hub's spoken reply) and frequently makes macOS deliver only
// silence (the A2DP/HFP profile conflict). That silence trips the silent-mic watchdog,
Comment on lines +598 to +600
geminiWatchdog = timer
timer.resume()
return timer
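
For context, a rearmable one-shot watchdog along the lines of the snippet above might look like the following. This is a minimal sketch using `DispatchSourceTimer`; the class name `TurnWatchdog`, its `arm`/`disarm` methods, and the rule that callers must stay on one serial queue are assumptions, not the PR's implementation.

```swift
import Foundation

/// Hypothetical one-shot watchdog. Call `arm` and `disarm` only from `queue`
/// (a serial queue); the timeout handler also runs on `queue`.
final class TurnWatchdog {
    private let queue: DispatchQueue
    private let onTimeout: () -> Void
    private var timer: DispatchSourceTimer?

    init(queue: DispatchQueue, onTimeout: @escaping () -> Void) {
        self.queue = queue
        self.onTimeout = onTimeout
    }

    /// Cancels any pending timer and starts a fresh one, so each relevant
    /// reply event can push the deadline further out.
    func arm(after seconds: TimeInterval) {
        disarm()
        let source = DispatchSource.makeTimerSource(queue: queue)
        source.schedule(deadline: .now() + seconds)
        source.setEventHandler { [weak self] in
            self?.disarm()      // one-shot: release the fired timer
            self?.onTimeout()
        }
        timer = source
        source.resume()
    }

    /// Stops the watchdog, for example when the turn completes normally.
    func disarm() {
        timer?.cancel()
        timer = nil
    }
}
```

A session would typically arm this when the turn is committed, re-arm it as reply chunks arrive, and disarm it when the completion event is received.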

@cubic-dev-ai cubic-dev-ai Bot left a comment

5 issues found across 10 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift">

<violation number="1" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift:181">
P2: Gemini prompt contains a hard single-tool rule that contradicts flows requiring sequential tools. This can break task updates that need `get_tasks` followed by `update_action_item`.</violation>
</file>

<file name="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift">

<violation number="1" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift:299">
P1: Gemini watchdog can prematurely end valid tool-call turns after 15s. If a tool response is slow, timer fires, marks turn finished, and subsequent reply audio/events are ignored.</violation>
</file>

<file name="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift">

<violation number="1" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift:255">
P2: `commitTurn` shows "thinking" even when no live session exists. After a mid-turn disconnect this can leave the voice status stuck indefinitely.</violation>

<violation number="2" location="desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift:555">
P2: Failure state is keyed only on native audio, so valid text-fallback replies are mislabeled as "no response." This produces incorrect UX during fallback turns.</violation>
</file>

<file name="desktop/macos/Desktop/Sources/AudioCaptureService.swift">

<violation number="1" location="desktop/macos/Desktop/Sources/AudioCaptureService.swift:190">
P2: Auto-selecting built-in mic for Bluetooth output is not pinned, so later reconfiguration can revert capture back to default input and reintroduce the Bluetooth HFP/A2DP conflict.</violation>
</file>

// Recover if the reply never starts or never completes (Gemini Live drops the
// trailing turnComplete / truncates replies intermittently). 15s covers the
// worst-case think latency before the first audio chunk.
self.armGeminiWatchdog(15)

@cubic-dev-ai cubic-dev-ai Bot Jun 21, 2026

P1: Gemini watchdog can prematurely end valid tool-call turns after 15s. If a tool response is slow, timer fires, marks turn finished, and subsequent reply audio/events are ignored.
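
One possible way to handle the concern above, offered only as a sketch and not as the PR's fix, is to keep the watchdog disarmed while a tool call is outstanding and re-arm it once the tool result has been sent back. The `TurnState` type and its members are hypothetical names.

```swift
/// Hypothetical per-turn bookkeeping used to decide whether the
/// turn-completion watchdog should currently be counting down.
struct TurnState {
    var awaitingReply = false
    var pendingToolCalls = 0

    /// The watchdog only makes sense while we wait on the model itself,
    /// not while a (possibly slow) local tool is still running.
    var watchdogShouldRun: Bool {
        awaitingReply && pendingToolCalls == 0
    }

    mutating func toolCallStarted() {
        pendingToolCalls += 1
    }

    mutating func toolCallFinished() {
        pendingToolCalls = max(0, pendingToolCalls - 1)
    }
}
```

With this shape, the session would disarm the timer when `watchdogShouldRun` becomes false and arm it again with a fresh deadline when it becomes true.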

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubSession.swift, line 299:

<comment>Gemini watchdog can prematurely end valid tool-call turns after 15s. If a tool response is slow, timer fires, marks turn finished, and subsequent reply audio/events are ignored.</comment>

<file context>
@@ -288,6 +293,10 @@ final class RealtimeHubSession: NSObject {
+        // Recover if the reply never starts or never completes (Gemini Live drops the
+        // trailing turnComplete / truncates replies intermittently). 15s covers the
+        // worst-case think latency before the first audio chunk.
+        self.armGeminiWatchdog(15)
       // Gemini auto-responds at activityEnd; no explicit response request.
       }
</file context>

<your_tools>
You CAN read the user's Omi data and act on their Mac, but ONLY through these tools — you cannot see their data, screen, tasks, or memories without calling one. Before any tool, say a SHORT, SPECIFIC, VARIED heads-up out loud first (e.g. "Checking your tasks now" / "Let me pull that conversation up" — never the same robotic phrase twice). Then call the tool. Stay quiet until it returns; NEVER speak the answer before the result comes back; never skip a needed call; never read out JSON, ids, or raw fields. Speak only what the result actually says.

Pick ONE tool that fits, call it once, then answer.

@cubic-dev-ai cubic-dev-ai Bot Jun 21, 2026

P2: Gemini prompt contains a hard single-tool rule that contradicts flows requiring sequential tools. This can break task updates that need get_tasks followed by update_action_item.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubTools.swift, line 181:

<comment>Gemini prompt contains a hard single-tool rule that contradicts flows requiring sequential tools. This can break task updates that need `get_tasks` followed by `update_action_item`.</comment>

<file context>
@@ -54,104 +54,178 @@ enum HubTool: String {
+<your_tools>
+You CAN read the user's Omi data and act on their Mac, but ONLY through these tools — you cannot see their data, screen, tasks, or memories without calling one. Before any tool, say a SHORT, SPECIFIC, VARIED heads-up out loud first (e.g. "Checking your tasks now" / "Let me pull that conversation up" — never the same robotic phrase twice). Then call the tool. Stay quiet until it returns; NEVER speak the answer before the result comes back; never skip a needed call; never read out JSON, ids, or raw fields. Speak only what the result actually says.
+
+Pick ONE tool that fits, call it once, then answer.
+
+PERSONAL DATA (read):
</file context>
Suggested change
Pick ONE tool that fits, call it once, then answer.
Use the FEWEST tools needed: call one tool when possible, and chain multiple tool calls only when required (for example, get_tasks first, then update_action_item), then answer.

// Resolve the status: a turn that produced audio just ends (collapse); a turn that
// finished with NO audio — e.g. the Gemini turn-completion watchdog fired on a dropped/
// empty reply — briefly tells the user it got nothing, instead of silently collapsing.
setVoicePhase(audioReceivedThisTurn ? .none : .failed)

@cubic-dev-ai cubic-dev-ai Bot Jun 21, 2026

P2: Failure state is keyed only on native audio, so valid text-fallback replies are mislabeled as "no response." This produces incorrect UX during fallback turns.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift, line 555:

<comment>Failure state is keyed only on native audio, so valid text-fallback replies are mislabeled as "no response." This produces incorrect UX during fallback turns.</comment>

<file context>
@@ -539,18 +549,26 @@ final class RealtimeHubController: NSObject, RealtimeHubSessionDelegate {
+    // Resolve the status: a turn that produced audio just ends (collapse); a turn that
+    // finished with NO audio — e.g. the Gemini turn-completion watchdog fired on a dropped/
+    // empty reply — briefly tells the user it got nothing, instead of silently collapsing.
+    setVoicePhase(audioReceivedThisTurn ? .none : .failed)
     exitVoiceUI()
   }
</file context>
Suggested change
setVoicePhase(audioReceivedThisTurn ? .none : .failed)
setVoicePhase((audioReceivedThisTurn || !reply.isEmpty) ? .none : .failed)
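
The suggested change can be read as a small pure function, which makes the intended behavior easy to check in isolation. This sketch assumes a `replyText` string that holds any text-fallback reply; the function name `resolvedPhase` is hypothetical, and the enum is redeclared here only to keep the example self-contained.

```swift
enum VoiceResponsePhase: Equatable {
    case none, thinking, speaking, failed
}

/// A turn counts as answered if it produced native audio OR a non-empty
/// text-fallback reply; only a turn with neither is shown as "no response".
func resolvedPhase(audioReceived: Bool, replyText: String) -> VoiceResponsePhase {
    if audioReceived || !replyText.isEmpty {
        return VoiceResponsePhase.none
    }
    return VoiceResponsePhase.failed
}
```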

responding = true
// Keep the bar alive after release: show "thinking" until the first audio arrives, so the
// commit→reply gap (model latency / reconnect / a slow turn) never looks like a dead bar.
setVoicePhase(.thinking)

@cubic-dev-ai cubic-dev-ai Bot Jun 21, 2026

P2: commitTurn shows "thinking" even when no live session exists. After a mid-turn disconnect this can leave the voice status stuck indefinitely.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/FloatingControlBar/RealtimeHubController.swift, line 255:

<comment>`commitTurn` shows "thinking" even when no live session exists. After a mid-turn disconnect this can leave the voice status stuck indefinitely.</comment>

<file context>
@@ -245,6 +250,9 @@ final class RealtimeHubController: NSObject, RealtimeHubSessionDelegate {
     responding = true
+    // Keep the bar alive after release: show "thinking" until the first audio arrives, so the
+    // commit→reply gap (model latency / reconnect / a slow turn) never looks like a dead bar.
+    setVoicePhase(.thinking)
     // (The screen frame is sent at turn START — see beginTurn — so it has time to
     // upload/decode before the model answers. Nothing to attach here.)
</file context>
Suggested change
setVoicePhase(.thinking)
if session != nil { setVoicePhase(.thinking) }

// silence (the A2DP/HFP profile conflict). That silence trips the silent-mic watchdog,
// which swaps devices and rebuilds the audio engine MID-REPLY, cutting the reply off.
// So capture from the built-in mic and leave Bluetooth in A2DP. Mirrors the PTT path.
inputDeviceID = builtIn

@cubic-dev-ai cubic-dev-ai Bot Jun 21, 2026

P2: Auto-selecting built-in mic for Bluetooth output is not pinned, so later reconfiguration can revert capture back to default input and reintroduce the Bluetooth HFP/A2DP conflict.
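
To make "pinned" concrete, here is a sketch of the selection logic with an explicit pin flag that survives later reconfiguration. It models only the decision, not the Core Audio calls; `InputSelection`, `selectInput`, and `inputAfterReconfiguration` are hypothetical names, and `DeviceID` stands in for Core Audio's `AudioDeviceID` (a `UInt32`).

```swift
/// Stand-in for Core Audio's AudioDeviceID, which is a UInt32.
typealias DeviceID = UInt32

/// Hypothetical record of which input is in use and whether it was
/// chosen deliberately (pinned) rather than inherited from the default.
struct InputSelection: Equatable {
    var deviceID: DeviceID
    var pinned: Bool
}

/// Initial choice: prefer the built-in mic when output is Bluetooth,
/// so the headset can stay in A2DP instead of dropping to HFP.
func selectInput(outputIsBluetooth: Bool,
                 builtInMic: DeviceID?,
                 defaultInput: DeviceID) -> InputSelection {
    if outputIsBluetooth, let builtIn = builtInMic {
        return InputSelection(deviceID: builtIn, pinned: true)
    }
    return InputSelection(deviceID: defaultInput, pinned: false)
}

/// On a later reconfiguration, a pinned choice is kept; only an
/// unpinned selection follows the new system default.
func inputAfterReconfiguration(current: InputSelection,
                               newDefault: DeviceID) -> InputSelection {
    current.pinned ? current : InputSelection(deviceID: newDefault, pinned: false)
}
```

A fuller version would also need to drop the pin when the output device stops being Bluetooth; that case is left out of this sketch.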

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At desktop/macos/Desktop/Sources/AudioCaptureService.swift, line 190:

<comment>Auto-selecting built-in mic for Bluetooth output is not pinned, so later reconfiguration can revert capture back to default input and reintroduce the Bluetooth HFP/A2DP conflict.</comment>

<file context>
@@ -180,6 +180,15 @@ class AudioCaptureService: @unchecked Sendable {
+            // silence (the A2DP/HFP profile conflict). That silence trips the silent-mic watchdog,
+            // which swaps devices and rebuilds the audio engine MID-REPLY, cutting the reply off.
+            // So capture from the built-in mic and leave Bluetooth in A2DP. Mirrors the PTT path.
+            inputDeviceID = builtIn
+            log("AudioCapture: output is Bluetooth — capturing from built-in mic id=\(builtIn) to keep A2DP")
         } else {
</file context>

vendz and others added 3 commits June 21, 2026 16:22
… no unprompted screen narration or tacked-on offers

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eady-released things 'upcoming'

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…in voice prompts

Tells both models today's date (so a released title isn't treated as 'future'), and to recognize an unreferenced 'this/that' as the user's screen — OpenAI calls screenshot first, Gemini uses the frame it already gets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>