129 changes: 129 additions & 0 deletions gitbooks/developing/architecture/desktop-companion.md
@@ -0,0 +1,129 @@
---
description: Desktop companion domain — Clicky-style interaction loop tying hotkey, voice, screen intelligence, LLM, TTS, and visual pointing into a single product experience.
icon: robot
---

# Desktop Companion (`src/openhuman/desktop_companion/`)

The desktop companion orchestrates a Clicky-style interaction loop: hotkey activation, microphone capture, screen context, LLM reasoning, speech synthesis, and visual pointing. It reuses existing building blocks rather than reimplementing them.

## Building blocks

| Module | What it provides | Path |
|--------|-----------------|------|
| **screen_intelligence** | Permission-gated capture sessions, `capture_now()`, `VisionSummary`, `AppContextInfo` | `src/openhuman/screen_intelligence/` |
| **voice** | Hotkey listener (push/tap), audio capture, cloud STT (Whisper), TTS (`reply_speech`) | `src/openhuman/voice/` |
| **meet_agent** | LLM orchestration pattern (STT -> LLM -> TTS), WAV packing | `src/openhuman/meet_agent/` |
| **overlay** | Floating UI surface, attention events, typewriter bubbles | `src/openhuman/overlay/` |
| **provider_surfaces** | Connected-app event queue (`ingest_event`, `list_queue`) | `src/openhuman/provider_surfaces/` |
| **accessibility** | Foreground app context (`foreground_context()`) | `src/openhuman/accessibility/` |

## Module layout

```text
src/openhuman/desktop_companion/
mod.rs — module exports (light)
types.rs — CompanionState enum, CompanionConfig, ConversationTurn, session param/result types
session.rs — singleton session lifecycle, state machine, TTL, conversation history
pipeline.rs — STT -> screen context -> LLM -> TTS -> pointing orchestration
pointing.rs — [POINT:x,y:label:screenN] tag parser, multi-monitor coordinate mapping
handoff.rs — provider-surface queue matching for connected-app actions
bus.rs — broadcast channel for CompanionStateChangedEvent
schemas.rs — RPC controllers (companion_start_session, companion_stop_session, etc.)
```

## State machine

```text
Idle -> Listening -> Thinking -> Speaking -> Pointing -> Idle
                                    |           |
                                    v           v
                                Listening   Listening (interrupt)

Any state -> Error -> Idle (reset)
```

Valid transitions are enforced by `session::is_valid_transition()`. Key paths:

- **Happy path**: Idle -> Listening -> Thinking -> Speaking -> Pointing -> Idle
- **No pointing**: Thinking -> Speaking -> Idle (no POINT tags in response)
- **Interrupt**: Speaking/Pointing -> Listening (user re-activates hotkey)
- **Cancel**: Thinking -> Idle (user cancels mid-think)
- **Error recovery**: Any -> Error -> Idle
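
The paths listed above can be written as a transition table. The following is a minimal sketch, not the contents of `session.rs`: the enum mirrors the state names in this document, and treating `Error -> Error` as invalid is an assumption, since the text only says "Any -> Error". The real `session::is_valid_transition()` may accept a different set of edges.

```rust
/// Hypothetical mirror of the companion states named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompanionState {
    Idle,
    Listening,
    Thinking,
    Speaking,
    Pointing,
    Error,
}

/// Sketch of a transition check that covers only the documented paths.
pub fn is_valid_transition(from: CompanionState, to: CompanionState) -> bool {
    use CompanionState::*;
    match (from, to) {
        // Error recovery: any non-error state may fail, and Error resets to Idle.
        // Rejecting Error -> Error is an assumption of this sketch.
        (s, Error) if s != Error => true,
        (Error, Idle) => true,
        // Happy path.
        (Idle, Listening)
        | (Listening, Thinking)
        | (Thinking, Speaking)
        | (Speaking, Pointing)
        | (Pointing, Idle) => true,
        // No pointing: the response carried no POINT tags.
        (Speaking, Idle) => true,
        // Interrupt: the user re-activates the hotkey.
        (Speaking, Listening) | (Pointing, Listening) => true,
        // Cancel: the user cancels mid-think.
        (Thinking, Idle) => true,
        _ => false,
    }
}
```

Keeping every edge in one `match` means a new state or path is a one-line change, and any pair that is not listed is rejected by default.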

## Interaction pipeline

`pipeline.rs` orchestrates a single turn:

1. **Activation** — state transitions to Listening (will be driven by Tauri shell hotkey bridge in PR 2)
2. **STT** — audio samples transcribed via `voice::cloud_transcribe` (Whisper)
3. **Screen context** — `accessibility::foreground_context()` for app name + window title
4. **LLM** — chat-completions via `BackendOAuthClient` with system prompt, screen context, and rolling conversation history (last 20 turns as context)
5. **Parse response** — extract `[POINT:x,y:label:screenN]` tags via `pointing::parse_and_map()`
6. **Handoff check** — scan response for provider keywords, match against `provider_surfaces` queue
7. **TTS** — synthesize speech via `voice::reply_speech` (ElevenLabs)
8. **Pointing** — emit pointing targets for overlay animation
9. **Return to Idle**

The pipeline supports cancellation via `CancellationToken` — the Tauri shell can cancel at any checkpoint (between STT, LLM, TTS stages).

Text input is also supported via `run_text_turn()` which skips STT.
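
To illustrate checkpoint-based cancellation, here is a minimal std-only sketch. It stands in an `AtomicBool` flag for the `CancellationToken` the pipeline uses, and `CancelFlag`, `TurnOutcome`, and `run_turn` are hypothetical names introduced for illustration, not items from `pipeline.rs`. The flag is checked before each stage, so a stage that has already started runs to completion.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for a cancellation token (assumption: the real code uses `CancellationToken`).
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum TurnOutcome {
    Completed { stages_run: usize },
    Cancelled { before_stage: &'static str },
}

/// Runs the named stages in order, checking the flag at the checkpoint before each one.
pub fn run_turn<F>(stages: &[&'static str], cancel: &CancelFlag, mut run_stage: F) -> TurnOutcome
where
    F: FnMut(&'static str),
{
    let mut stages_run = 0;
    for &name in stages {
        if cancel.is_cancelled() {
            return TurnOutcome::Cancelled { before_stage: name };
        }
        run_stage(name);
        stages_run += 1;
    }
    TurnOutcome::Completed { stages_run }
}
```

Under this sketch, a text turn is the same loop with the `"stt"` stage left out of the stage list.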

## Session lifecycle

- **One session at a time** — enforced by a process-global `Mutex<Option<CompanionSessionInner>>`
- **Consent required** — `start_session` rejects `consent=false`
- **TTL enforcement** — sessions auto-expire when `status()` detects elapsed TTL
- **Conversation history** — capped at 50 turns, oldest drained on overflow
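
The rules above can be sketched with std types only. `SessionSketch` and its methods are hypothetical and are not the contents of `session.rs`. The 50-turn cap comes from this section and the 20-turn LLM window from the pipeline description. Representing turns as plain strings and passing the clock in as a parameter are simplifications of this sketch.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Cap on stored turns (from this document).
pub const MAX_HISTORY_TURNS: usize = 50;
/// Number of recent turns sent to the LLM (from the pipeline description).
pub const LLM_CONTEXT_TURNS: usize = 20;

/// Hypothetical session state; turns are plain strings for brevity.
pub struct SessionSketch {
    started_at: Instant,
    ttl: Duration,
    history: VecDeque<String>,
}

impl SessionSketch {
    /// Rejects the session unless consent was given explicitly.
    pub fn start(consent: bool, ttl: Duration, now: Instant) -> Result<Self, String> {
        if !consent {
            return Err("consent is required".to_string());
        }
        Ok(Self { started_at: now, ttl, history: VecDeque::new() })
    }

    /// Lazy TTL check, intended to be called from a status query.
    pub fn is_expired(&self, now: Instant) -> bool {
        now.duration_since(self.started_at) >= self.ttl
    }

    /// Appends a turn and drains the oldest entries on overflow.
    pub fn push_turn(&mut self, turn: String) {
        self.history.push_back(turn);
        while self.history.len() > MAX_HISTORY_TURNS {
            self.history.pop_front();
        }
    }

    pub fn history_len(&self) -> usize {
        self.history.len()
    }

    /// Returns the most recent turns, oldest first, for the LLM request.
    pub fn llm_context(&self) -> Vec<&str> {
        let skip = self.history.len().saturating_sub(LLM_CONTEXT_TURNS);
        self.history.iter().skip(skip).map(String::as_str).collect()
    }
}
```

Because expiry is only detected when it is checked, a session past its TTL stays in memory until the next status call. That is consistent with the lazy enforcement described above.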

## RPC surface

Namespace: `companion`. All methods go through the standard controller registry.

| Method | Description |
|--------|-------------|
| `companion_start_session` | Start a session with explicit consent + optional TTL |
| `companion_stop_session` | End the active session |
| `companion_status` | Current state, session info, remaining TTL |
| `companion_config_get` | Read companion configuration |
| `companion_config_set` | Update companion configuration |

## Event bus

`CompanionStateChangedEvent` is broadcast via a `tokio::sync::broadcast` channel (same pattern as `overlay::bus`). Three `DomainEvent` variants route to the `"companion"` domain:

- `CompanionSessionStarted { session_id, ttl_secs }`
- `CompanionStateChanged { session_id, state, previous_state }`
- `CompanionSessionEnded { session_id, reason, turn_count }`

## Pointing system

LLM responses can embed `[POINT:x,y:label:screenN]` tags. `pointing.rs`:

- Parses tags via regex
- Maps screen-relative coordinates to absolute desktop coordinates using `ScreenGeometry`
- Clamps coordinates to screen bounds
- Falls back to screen 0 when the index is out of range
- Strips tags from display text
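
The behaviour in the list above can be sketched without the `regex` crate. The real module parses with a regex, so this hand-rolled scanner is a swapped-in technique. The fields of `ScreenGeometry` and `PointTarget`, and the signature of `parse_and_map`, are assumptions made for illustration. Leaving malformed tags in the text is also a choice of this sketch.

```rust
/// Hypothetical geometry: a screen's origin and size in desktop coordinates.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ScreenGeometry {
    pub origin_x: f64,
    pub origin_y: f64,
    pub width: f64,
    pub height: f64,
}

/// Hypothetical pointing target in absolute desktop coordinates.
#[derive(Debug, Clone, PartialEq)]
pub struct PointTarget {
    pub x: f64,
    pub y: f64,
    pub label: String,
    pub screen: usize,
}

const TAG_OPEN: &str = "[POINT:";

fn parse_coord(s: &str) -> Option<f64> {
    s.trim().parse::<f64>().ok().filter(|v| v.is_finite())
}

/// Parses the `x,y:label:screenN` body of a tag.
fn parse_tag_body(body: &str) -> Option<(f64, f64, String, usize)> {
    let (coords, rest) = body.split_once(':')?;
    let (label, screen) = rest.rsplit_once(':')?;
    let (x, y) = coords.split_once(',')?;
    let index = screen.strip_prefix("screen")?.parse::<usize>().ok()?;
    Some((parse_coord(x)?, parse_coord(y)?, label.to_string(), index))
}

/// Returns the text with well-formed tags stripped, plus the mapped targets.
pub fn parse_and_map(text: &str, screens: &[ScreenGeometry]) -> (String, Vec<PointTarget>) {
    let mut clean = String::new();
    let mut targets = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(TAG_OPEN) {
        let after = &rest[start + TAG_OPEN.len()..];
        let Some(end) = after.find(']') else { break };
        clean.push_str(&rest[..start]);
        match (parse_tag_body(&after[..end]), screens.is_empty()) {
            (Some((x, y, label, index)), false) => {
                // Fall back to screen 0 when the index is out of range.
                let screen = if index < screens.len() { index } else { 0 };
                let g = screens[screen];
                targets.push(PointTarget {
                    // Clamp to the screen, then shift by its desktop origin.
                    x: g.origin_x + x.clamp(0.0, g.width),
                    y: g.origin_y + y.clamp(0.0, g.height),
                    label,
                    screen,
                });
            }
            // Malformed tag or no known screens: keep the original text.
            _ => clean.push_str(&rest[start..start + TAG_OPEN.len() + end + 1]),
        }
        rest = &after[end + 1..];
    }
    clean.push_str(rest);
    (clean, targets)
}
```

Splitting the label off with `split_once` and `rsplit_once` lets a label contain colons, which a naive split on `:` would break.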

## Provider-surface handoff

`handoff.rs` scans the clean LLM response text for provider keywords (slack, discord, telegram, etc.) and matches them against items in the `provider_surfaces` queue. When matches are found, `HandoffEvent`s are included in `TurnResult` for the Tauri shell / overlay to surface.
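
A minimal sketch of the keyword match described above. `QueueItem`, the fields of `HandoffEvent`, the keyword list, and `find_handoffs` are hypothetical. The real queue items come from `provider_surfaces` and likely carry more fields, and the actual matching rules in `handoff.rs` may differ.

```rust
/// Hypothetical queue entry; the real item type lives in `provider_surfaces`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QueueItem {
    pub provider: String,
    pub summary: String,
}

/// Hypothetical shape of a handoff event surfaced to the shell or overlay.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HandoffEvent {
    pub provider: String,
    pub summary: String,
}

/// Illustrative subset of provider keywords named in this document.
const PROVIDER_KEYWORDS: &[&str] = &["slack", "discord", "telegram"];

/// Returns one event per queue item whose provider is mentioned in the response.
pub fn find_handoffs(response: &str, queue: &[QueueItem]) -> Vec<HandoffEvent> {
    let lowered = response.to_lowercase();
    let mut events = Vec::new();
    for &keyword in PROVIDER_KEYWORDS {
        if !lowered.contains(keyword) {
            continue;
        }
        for item in queue {
            if item.provider.eq_ignore_ascii_case(keyword) {
                events.push(HandoffEvent {
                    provider: item.provider.clone(),
                    summary: item.summary.clone(),
                });
            }
        }
    }
    events
}
```

Substring matching is deliberately loose here: a response that mentions "Slack" produces an event only if the queue actually holds a Slack item, so an empty queue never yields a handoff.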

## Platform scope

- **macOS**: Full support — hotkey, screen capture, pointing, TTS, overlay
- **Windows/Linux**: Partial — hotkey works (rdev), screen context stubbed, no pointing

Platform-specific code is gated with `#[cfg(target_os = "macos")]`.
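
As an illustration of this gating pattern, the sketch below pairs a macOS implementation with a stub for other targets. `pointing_supported` is a hypothetical function, not part of the module. It only shows how a `#[cfg]` pair keeps call sites free of platform checks.

```rust
/// Hypothetical capability probe: compiled only on macOS.
#[cfg(target_os = "macos")]
pub fn pointing_supported() -> bool {
    true
}

/// Stub for every other target, so callers compile unchanged.
#[cfg(not(target_os = "macos"))]
pub fn pointing_supported() -> bool {
    false
}
```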

## Testing

| File | Coverage |
|------|----------|
| `session_tests.rs` | Session CRUD, state machine transitions, TTL, consent, conversation history |
| `pipeline_tests.rs` | Turn orchestration, cancellation, input validation, system prompt |
| `pointing_tests.rs` | Tag parsing, coordinate mapping, multi-monitor, edge cases |
| `handoff.rs` (inline) | Keyword matching, empty queue, provider coverage |
| `schemas.rs` (inline) | Controller count, schema field validation |
| `tests/json_rpc_e2e.rs` | Full RPC round-trip: start -> status -> config -> stop |
9 changes: 9 additions & 0 deletions src/core/all.rs
@@ -235,6 +235,10 @@ fn build_registered_controllers() -> Vec<RegisteredController> {
    controllers.extend(crate::openhuman::meet::all_meet_registered_controllers());
    // Live meet-agent loop: STT/LLM/TTS over the open call's audio.
    controllers.extend(crate::openhuman::meet_agent::all_meet_agent_registered_controllers());
    // Desktop companion — Clicky-style interaction loop.
    controllers.extend(
        crate::openhuman::desktop_companion::all_desktop_companion_registered_controllers(),
    );
    // Structured WhatsApp Web data — agent-facing read-only controllers (list/search).
    // The write-path ingest controller is registered separately in build_internal_only_controllers.
    controllers.extend(crate::openhuman::whatsapp_data::all_whatsapp_data_registered_controllers());
@@ -330,6 +334,8 @@ fn build_declared_controller_schemas() -> Vec<ControllerSchema> {
    schemas.extend(crate::openhuman::meet::all_meet_controller_schemas());
    // Live meet-agent listening + speaking loop
    schemas.extend(crate::openhuman::meet_agent::all_meet_agent_controller_schemas());
    // Desktop companion — Clicky-style interaction loop.
    schemas.extend(crate::openhuman::desktop_companion::all_desktop_companion_controller_schemas());
    // Structured WhatsApp Web data — local SQLite store, agent-queryable
    schemas.extend(crate::openhuman::whatsapp_data::all_whatsapp_data_controller_schemas());
    schemas
@@ -438,6 +444,9 @@ pub fn namespace_description(namespace: &str) -> Option<&'static str> {
        "whatsapp_data" => Some(
            "Structured WhatsApp conversation and message store — list chats, read messages, and search across WhatsApp Web data.",
        ),
        "companion" => Some(
            "Desktop companion — Clicky-style hotkey-driven interaction loop with STT, LLM, TTS, and visual pointing.",
        ),
        _ => None,
    }
}
20 changes: 20 additions & 0 deletions src/core/event_bus/events.rs
@@ -415,6 +415,22 @@ pub enum DomainEvent {
        rebuilt_at: f64,
    },

    // ── Desktop Companion ──────────────────────────────────────────────
    /// A desktop companion session was started.
    CompanionSessionStarted { session_id: String, ttl_secs: u64 },
    /// The companion transitioned to a new state.
    CompanionStateChanged {
        session_id: String,
        state: String,
        previous_state: String,
    },
    /// A desktop companion session ended.
    CompanionSessionEnded {
        session_id: String,
        reason: String,
        turn_count: usize,
    },

    // ── System lifecycle ────────────────────────────────────────────────
    /// A system component started up.
    SystemStartup { component: String },
@@ -511,6 +527,10 @@ impl DomainEvent {

            Self::NotificationIngested { .. } | Self::NotificationTriaged { .. } => "notification",

            Self::CompanionSessionStarted { .. }
            | Self::CompanionStateChanged { .. }
            | Self::CompanionSessionEnded { .. } => "companion",

            Self::SystemStartup { .. }
            | Self::SystemShutdown { .. }
            | Self::SystemRestartRequested { .. }
24 changes: 24 additions & 0 deletions src/openhuman/about_app/catalog.rs
@@ -1103,6 +1103,30 @@ const CAPABILITIES: &[Capability] = &[
        status: CapabilityStatus::Beta,
        privacy: GITHUB_RELEASES_METADATA,
    },
    // ── Desktop Companion ────────────────────────────────────────────
    Capability {
        id: "companion.session",
        name: "Desktop Companion Session",
        domain: "desktop_companion",
        category: CapabilityCategory::ScreenIntelligence,
        description: "Start a Clicky-style companion session that ties hotkey activation, \
                      microphone capture, screen context, LLM reasoning, speech synthesis, \
                      and visual pointing into a single interaction loop.",
        how_to: "Settings > Companion, or activate via the configured hotkey.",
        status: CapabilityStatus::Beta,
        privacy: DERIVED_TO_BACKEND,
    },
    Capability {
        id: "companion.pointing",
        name: "Visual Pointing",
        domain: "desktop_companion",
        category: CapabilityCategory::ScreenIntelligence,
        description: "The companion LLM can embed [POINT:x,y:label:screenN] tags to \
                      visually point at UI elements on screen via the overlay.",
        how_to: "Automatic during companion sessions when the LLM identifies a UI target.",
        status: CapabilityStatus::Beta,
        privacy: None,
    },
];

static VALIDATED: OnceLock<()> = OnceLock::new();
81 changes: 81 additions & 0 deletions src/openhuman/desktop_companion/bus.rs
@@ -0,0 +1,81 @@
//! Broadcast bus for companion state change events.
//!
//! Follows the same pattern as `overlay::bus`: a process-global
//! `tokio::sync::broadcast` channel so any module can subscribe.
//! The Socket.IO bridge (PR 2) will forward these to the overlay
//! as `companion:state_changed` events.

use once_cell::sync::Lazy;
use tokio::sync::broadcast;

use super::types::CompanionStateChangedEvent;

const LOG_PREFIX: &str = "[desktop_companion]";

static STATE_BUS: Lazy<broadcast::Sender<CompanionStateChangedEvent>> = Lazy::new(|| {
    let (tx, _rx) = broadcast::channel(64);
    tx
});

/// Subscribe to companion state change events.
pub fn subscribe_state_changed() -> broadcast::Receiver<CompanionStateChangedEvent> {
    STATE_BUS.subscribe()
}

/// Publish a state change event.
///
/// Fire-and-forget: if nobody is subscribed the event is dropped.
pub fn publish_state_changed(event: CompanionStateChangedEvent) -> usize {
    log::debug!(
        "{LOG_PREFIX} state_changed session={} {} -> {}",
        event.session_id,
        event.previous_state,
        event.state,
    );
    match STATE_BUS.send(event) {
        Ok(n) => n,
        Err(_) => {
            log::debug!("{LOG_PREFIX} no subscribers — state change dropped");
            0
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::openhuman::desktop_companion::types::CompanionState;

    #[tokio::test]
    async fn publish_is_received_by_subscriber() {
        // STATE_BUS is process-global — other tests may publish events.
        // We filter by session_id to avoid flakiness.
        let mut rx = subscribe_state_changed();
        let delivered = publish_state_changed(CompanionStateChangedEvent {
            session_id: "bus-test-unique".into(),
            state: CompanionState::Listening,
            previous_state: CompanionState::Idle,
            message: None,
        });
        assert!(delivered >= 1);
        // Drain until we find our specific event (others may have been published concurrently).
        loop {
            let event = rx.recv().await.expect("event delivered");
            if event.session_id == "bus-test-unique" {
                assert_eq!(event.state, CompanionState::Listening);
                assert_eq!(event.previous_state, CompanionState::Idle);
                break;
            }
        }
    }

    #[test]
    fn publish_with_no_subscribers_is_safe() {
        let _ = publish_state_changed(CompanionStateChangedEvent {
            session_id: "test".into(),
            state: CompanionState::Idle,
            previous_state: CompanionState::Error,
            message: Some("recovered".into()),
        });
    }
}