diff --git a/README.md b/README.md index 5c4e6337..ed70dd87 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins | `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) | | `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) | | `assembly llm` | Prompt the LLM Gateway over a transcript, files, stdin, or a live stream | -| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval | +| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval. Defaults to voice in a terminal (speak your request, replies read back via streaming TTS in the sandbox); pass `--no-voice` for the keyboard TUI | | `assembly clip` | Cut audio/video with ffmpeg by diarized speaker, text match, LLM pick, or time range (`--video` keeps the picture for URL sources) — clip boundaries snap into nearby silence | | `assembly dub` | Re-voice an audio/video file or URL in another language: transcription, LLM translation, per-speaker TTS, ffmpeg track-swap (sandbox-only) | | `assembly caption` | Burn always-visible captions into a video: transcribe (or reuse a transcript), fetch SRT, ffmpeg burns it in — audio untouched | diff --git a/aai_cli/AGENTS.md b/aai_cli/AGENTS.md index a042ed93..d63d4511 100644 --- a/aai_cli/AGENTS.md +++ b/aai_cli/AGENTS.md @@ -153,7 +153,7 @@ heavily-reworked commands with long bodies; small commands keep the inline - **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`). 
- **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker. - **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`). The single-voice default-playback path **streams**: `synthesize`'s `on_audio(chunk, sample_rate)` callback is wired to `audio.PcmPlayer.feed`, so speech starts on the first Audio frame (it opens the device lazily, since the rate is only known at Begin) instead of after the whole text — the win for a long `--url` page. `--out` (needs the full buffer) and the multi-voice dialogue path (`synthesize_dialogue` → `_output_audio` → buffered `play_pcm`) stay buffered; `synthesize` still returns the complete PCM for the summary regardless. 
-- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. `model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Tavily web search when `TAVILY_API_KEY` is set (`web_search.py`), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed. +- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. 
`model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Tavily web search when `TAVILY_API_KEY` is set (`web_search.py`), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed. **Voice is the default front-end in an interactive TTY** (`voice.py` + `_exec._run_voice`): `VoiceSession.listen` captures one spoken turn over Streaming STT (gating the mic shut the instant a turn finalizes) and `VoiceSession.speak` reads each assistant reply back over streaming TTS. It runs the **Rich REPL** loop (not the keyboard TUI) with a voice `read_line` + a reply-speaking sink. Readback needs streaming TTS, so it's **sandbox-only** (`tts.session.is_available`); in production the mic input still works and replies stay on screen. 
A mic-less box degrades to typed input on the first `AUDIO_ERROR_TYPES` `CLIError`; `--no-voice` selects the TUI, and a non-TTY (pipe/CI) the headless loop. Both legs (STT/TTS) are injected like the cascade's, so `tests/test_code_voice.py` drives it with fakes — no mic/speaker/socket. - **`code_gen/`** — backs `--show-code` on `transcribe`/`stream`/`agent`: builds a ready-to-run Python SDK script from exactly the flags passed (no API key needed; generated code reads `ASSEMBLYAI_API_KEY`). - **`auth/`** — browser-assisted `assembly login` via AMS + **Stytch B2B OAuth discovery** (`discovery.py`, `flow.py`, `loopback.py`, `ams.py`). Not Stytch Connected Apps. - **`init/`** — scaffolds a self-contained FastAPI + HTML starter (`audio-transcription`/`live-captions`/`voice-agent` templates), optionally installs deps and opens the browser; writes the key to a git-ignored `.env`. diff --git a/aai_cli/code_agent/voice.py b/aai_cli/code_agent/voice.py new file mode 100644 index 00000000..3bff39dd --- /dev/null +++ b/aai_cli/code_agent/voice.py @@ -0,0 +1,165 @@ +"""Voice I/O for `assembly code`: speak your request, hear the reply. + +The coding agent's default interactive mode (a TTY) captures one spoken turn via +streaming STT and reads each assistant reply back via streaming TTS. Both legs are +injected so the loop is unit-tested with fakes — no microphone, speaker, or socket. + +Readback needs streaming TTS, which only the sandbox environment exposes +(`tts.session.is_available`); in production, voice *input* still works and replies +stay on screen as text. Microphone (STT) input works in every environment. 
+"""
+
+from __future__ import annotations
+
+import threading
+from collections.abc import Callable, Iterable, Iterator
+from dataclasses import dataclass
+from typing import TYPE_CHECKING, Protocol
+
+from aai_cli.core import client, config_builder
+from aai_cli.core.microphone import MicrophoneSource
+from aai_cli.tts import session as tts_session
+from aai_cli.tts.audio import PcmPlayer
+from aai_cli.tts.session import SpeakConfig
+
+if TYPE_CHECKING:
+    from assemblyai.streaming.v3 import StreamingParameters
+
+# The audio-device CLIError types listen() raises when no usable microphone is present;
+# the command degrades to typed input on these (see _exec._voice_read_line). They mirror
+# the error_type values core.microphone attaches to its mic-open failures.
+AUDIO_ERROR_TYPES = frozenset({"mic_missing", "mic_error", "audio_input_error"})
+
+# Streaming TTS synthesizes at 24 kHz, the rate the readback player is opened at.
+_TTS_SAMPLE_RATE = 24000
+
+# The streaming STT model used to transcribe a spoken turn — the same realtime default
+# `assembly stream` and `assembly agent-cascade` use.
+_SPEECH_MODEL = "u3-rt-pro"
+
+
+class Microphone(Protocol):
+    """The microphone slice the listen loop drives: an iterable of PCM at a known rate."""
+
+    sample_rate: int
+
+    def __iter__(self) -> Iterator[bytes]:
+        """Yield captured PCM16 chunks until the stream ends."""
+
+
+class StreamFn(Protocol):
+    """The streaming-STT call: ``client.stream_audio`` satisfies it structurally."""
+
+    def __call__(
+        self,
+        api_key: str,
+        source: Iterable[bytes],
+        *,
+        params: StreamingParameters,
+        on_turn: Callable[[object], None],
+    ) -> None:
+        """Stream ``source`` and forward each Turn event to ``on_turn``."""
+
+
+class SynthFn(Protocol):
+    """The streaming-TTS call: ``tts.session.synthesize`` satisfies it structurally.
+
+    The return is typed ``object`` because the readback path discards it (it plays each
+    chunk through ``on_audio`` as it arrives), which also lets a test inject a fake that
+    returns nothing meaningful.
+    """
+
+    def __call__(
+        self,
+        api_key: str,
+        config: SpeakConfig,
+        *,
+        on_audio: Callable[[bytes, int], None],
+    ) -> object:
+        """Synthesize ``config.text``, handing each PCM chunk to ``on_audio``."""
+
+
+class Player(Protocol):
+    """The readback player: a context manager that ``feed``s PCM chunks (PcmPlayer)."""
+
+    def __enter__(self) -> Player:
+        """Enter the playback context (opens the device lazily on first feed)."""
+
+    def __exit__(self, exc_type: object, *exc: object) -> object:
+        """Drain on a clean exit, abort otherwise; never suppress."""
+
+    def feed(self, pcm: bytes, sample_rate: int) -> None:
+        """Play one PCM chunk, opening the output device on the first call."""
+
+
+def _stt_params(sample_rate: int) -> StreamingParameters:
+    """StreamingParameters for capturing one spoken turn at ``sample_rate``.
+
+    ``format_turns`` is on so the finalized turn reads like a typed prompt (punctuated
+    and cased) rather than raw lowercase tokens.
+    """
+    merged = config_builder.merge_streaming_params(
+        flags={"speech_model": _SPEECH_MODEL, "format_turns": True, "sample_rate": sample_rate}
+    )
+    return config_builder.construct_streaming_params(merged)
+
+
+@dataclass
+class VoiceSession:
+    """Speak-to-it / read-it-back I/O for one coding session, with injectable legs."""
+
+    api_key: str
+    readback: bool
+    mic_factory: Callable[[], Microphone] = MicrophoneSource
+    stream_fn: StreamFn = client.stream_audio
+    synth_fn: SynthFn = tts_session.synthesize
+    player_factory: Callable[[], Player] = PcmPlayer
+
+    def listen(self) -> str | None:
+        """Capture one spoken turn and return its finalized transcript.
+
+        Returns the text of the first end-of-turn the server finalizes, or ``None`` when
+        the microphone stream ends without one (EOF — e.g. a finite source in tests). The
+        microphone is gated shut the moment a turn finalizes, so exactly one utterance is
+        captured per call; a real mic blocks until you speak (Ctrl-C to quit).
+        """
+        mic = self.mic_factory()
+        done = threading.Event()
+        captured: list[str] = []
+
+        def on_turn(event: object) -> None:
+            text = (getattr(event, "transcript", "") or "").strip()
+            if text and getattr(event, "end_of_turn", False):
+                captured.append(text)
+                done.set()
+
+        def gated() -> Iterator[bytes]:
+            for chunk in mic:
+                if done.is_set():
+                    return
+                yield chunk
+
+        self.stream_fn(self.api_key, gated(), params=_stt_params(mic.sample_rate), on_turn=on_turn)
+        return " ".join(captured).strip() or None
+
+    def speak(self, text: str) -> None:
+        """Read ``text`` back via streaming TTS, when readback is available.
+
+        A no-op when readback is off (production, where streaming TTS has no host) or the
+        text is blank — so the caller can route every assistant reply here unconditionally.
+        """
+        text = text.strip()
+        if not self.readback or not text:
+            return
+        config = SpeakConfig(text=text, sample_rate=_TTS_SAMPLE_RATE)
+        with self.player_factory() as player:
+            self.synth_fn(self.api_key, config, on_audio=player.feed)
+
+
+def build_voice_session(api_key: str) -> VoiceSession:
+    """A voice session for the active environment.
+
+    Readback is enabled only where streaming TTS is available (the sandbox); microphone
+    input is wired regardless.
+ """ + return VoiceSession(api_key=api_key, readback=tts_session.is_available()) diff --git a/aai_cli/commands/code/__init__.py b/aai_cli/commands/code/__init__.py index 79d49506..b37052d0 100644 --- a/aai_cli/commands/code/__init__.py +++ b/aai_cli/commands/code/__init__.py @@ -71,6 +71,11 @@ def code( tui: bool = typer.Option( True, "--tui/--no-tui", help="Use the full-screen TUI (off: a plain read-eval loop)" ), + voice: bool = typer.Option( + True, + "--voice/--no-voice", + help="Speak to the agent and hear replies read back (readback needs the sandbox)", + ), ) -> None: """Run a terminal coding agent backed by the AssemblyAI LLM Gateway @@ -79,6 +84,10 @@ def code( invoke the 'assembly' CLI itself — all in the working directory. It talks only to the AssemblyAI LLM Gateway. Mutating actions ask for approval unless you pass --auto. + + In an interactive terminal it defaults to voice: speak your request (mic -> + streaming STT) and the agent's replies are read back aloud (sandbox only). + Pass --no-voice for the keyboard TUI, or pipe input for the headless loop. """ opts = code_exec.CodeOptions( prompt=prompt, @@ -92,5 +101,6 @@ def code( session=session, persist=persist, tui=tui, + voice=voice, ) run_with_options(ctx, code_exec.run_code, opts, json=False) diff --git a/aai_cli/commands/code/_exec.py b/aai_cli/commands/code/_exec.py index ffc46cbc..a9120538 100644 --- a/aai_cli/commands/code/_exec.py +++ b/aai_cli/commands/code/_exec.py @@ -3,12 +3,14 @@ The command module parses argv into a frozen ``CodeOptions`` and hands it here. This assembles the gateway model; the agent's tools (the `assembly` CLI tool, the docs MCP, web search, URL fetch, ask-user); the skills + long-term-memory middleware; a persistent -SQLite checkpointer; and the compiled deepagents graph, then drives it through either the -Textual TUI (a TTY) or a plain Rich read-eval loop (headless). 
+SQLite checkpointer; and the compiled deepagents graph, then drives it through one of +three front-ends: a voice loop (the default in a TTY — speak your request, hear the +reply), the full-screen Textual TUI, or a plain Rich read-eval loop (headless). """ from __future__ import annotations +from collections.abc import Callable from dataclasses import dataclass from pathlib import Path from typing import TYPE_CHECKING @@ -21,14 +23,16 @@ from aai_cli.code_agent.ask_tool import AskBridge, build_ask_tool from aai_cli.code_agent.cli_tool import build_cli_tool, run_assembly from aai_cli.code_agent.docs_mcp import load_docs_tools +from aai_cli.code_agent.events import AssistantText, Event from aai_cli.code_agent.fetch_tool import build_fetch_tool from aai_cli.code_agent.memory import build_memory_middleware from aai_cli.code_agent.model import build_model from aai_cli.code_agent.prompt import DEFAULT_MODEL from aai_cli.code_agent.render import RichRenderer, make_approver -from aai_cli.code_agent.session import CodeSession, run_repl +from aai_cli.code_agent.session import CodeSession, EventSink, run_repl from aai_cli.code_agent.skills import build_skills_middleware from aai_cli.code_agent.store import build_checkpointer +from aai_cli.code_agent.voice import AUDIO_ERROR_TYPES, VoiceSession, build_voice_session from aai_cli.code_agent.web_search import TAVILY_API_KEY_ENV, build_web_search_tool from aai_cli.core import env, errors, stdio from aai_cli.ui import output @@ -53,6 +57,7 @@ class CodeOptions: session: str = "default" persist: bool = True tui: bool = True + voice: bool = True def _assemble_tools(api_key: str, opts: CodeOptions, bridge: AskBridge) -> list[BaseTool]: @@ -174,15 +179,85 @@ def _run_repl(agent: CompiledAgent, opts: CodeOptions, bridge: AskBridge) -> Non run_repl(session, read_line=_read_line, initial=opts.prompt) +def _announce_voice(renderer: RichRenderer, voice: VoiceSession) -> None: + """One-time voice-mode notice, naming whether replies are read 
back (sandbox) or not.""" + if voice.readback: + renderer.notice( + "Voice mode on: speak your request; replies are read back aloud. Ctrl-C to quit." + ) + else: + renderer.notice( + "Voice mode on: speak your request. Readback needs the sandbox (streaming TTS), " + "so replies show as text. Ctrl-C to quit." + ) + + +def _voice_sink(renderer: RichRenderer, voice: VoiceSession) -> EventSink: + """Render every event, and read the assistant's natural-language text back aloud.""" + + def sink(event: Event) -> None: + renderer(event) + if isinstance(event, AssistantText): + voice.speak(event.text) + + return sink + + +def _voice_read_line(voice: VoiceSession, renderer: RichRenderer) -> Callable[[], str | None]: + """A read-line that captures a spoken turn, degrading to typed input if no mic exists. + + The first time the microphone can't be opened (no device, sounddevice missing) it + prints a one-line notice and switches to ``input()`` for the rest of the session, so a + voice-default run on a mic-less box still works instead of erroring out. 
+ """ + state = {"typed": False} + + def read_line() -> str | None: + if state["typed"]: + return _read_line() + renderer.notice("Listening… (speak now)") + try: + line = voice.listen() + except errors.CLIError as exc: + if exc.error_type not in AUDIO_ERROR_TYPES: + raise + renderer.notice(f"No microphone available ({exc.message}); switching to typed input.") + state["typed"] = True + return _read_line() + if line: + renderer.notice(f"Heard: {line}") + return line + + return read_line + + +def _run_voice(agent: CompiledAgent, opts: CodeOptions, bridge: AskBridge, api_key: str) -> None: + _print_repl_banner(opts) + voice = build_voice_session(api_key) + renderer = RichRenderer() + _announce_voice(renderer, voice) + bridge.handler = _ask_repl # spoken clarifications still fall back to the keyboard + session = CodeSession( + agent=agent, + sink=_voice_sink(renderer, voice), + approver=make_approver(_confirm), + thread_id=opts.session, + auto_approve=opts.auto, + ) + run_repl(session, read_line=_voice_read_line(voice, renderer), initial=opts.prompt) + + def run_code(opts: CodeOptions, state: AppState, *, json_mode: bool) -> None: """Start an `assembly code` coding session from already-parsed flags.""" del json_mode # the coding agent has no JSON output mode; it is a live session api_key = state.resolve_api_key() bridge = AskBridge() agent = _build_agent(api_key, opts, bridge) - use_tui = opts.tui and stdio.stdout_is_tty() and stdio.stdin_is_tty() + interactive = stdio.stdout_is_tty() and stdio.stdin_is_tty() try: - if use_tui: + if opts.voice and interactive: + _run_voice(agent, opts, bridge, api_key) + elif opts.tui and interactive: _run_tui(agent, opts, bridge) else: _run_repl(agent, opts, bridge) diff --git a/tests/__snapshots__/test_snapshots_help_run.ambr b/tests/__snapshots__/test_snapshots_help_run.ambr index 9c731223..2879f6f9 100644 --- a/tests/__snapshots__/test_snapshots_help_run.ambr +++ b/tests/__snapshots__/test_snapshots_help_run.ambr @@ -388,6 +388,10 
@@ only to the AssemblyAI LLM Gateway. Mutating actions ask for approval unless you pass --auto. + In an interactive terminal it defaults to voice: speak your request (mic -> + streaming STT) and the agent's replies are read back aloud (sandbox only). + Pass --no-voice for the keyboard TUI, or pipe input for the headless loop. + ╭─ Arguments ──────────────────────────────────────────────────────────────────╮ │ prompt [PROMPT] Initial task for the agent. Omit to just open the │ │ session │ @@ -421,6 +425,10 @@ │ --tui --no-tui Use the full-screen TUI (off: a │ │ plain read-eval loop) │ │ [default: tui] │ + │ --voice --no-voice Speak to the agent and hear replies │ + │ read back (readback needs the │ + │ sandbox) │ + │ [default: voice] │ │ --help Show this message and exit. │ ╰──────────────────────────────────────────────────────────────────────────────╯ diff --git a/tests/test_code_command.py b/tests/test_code_command.py index 89d10486..4f8e79c0 100644 --- a/tests/test_code_command.py +++ b/tests/test_code_command.py @@ -17,6 +17,7 @@ from aai_cli.code_agent.ask_tool import AskBridge from aai_cli.commands.code import _exec +from aai_cli.core.errors import CLIError from aai_cli.main import app runner = CliRunner() @@ -43,19 +44,50 @@ def test_command_parses_flags_into_options(monkeypatch): assert opts.session == "s1" and opts.persist is False -def test_run_code_dispatches_to_tui_when_tty(monkeypatch): +def test_run_code_dispatches_to_voice_by_default_when_tty(monkeypatch): calls = {} monkeypatch.setattr(_exec, "_build_agent", lambda key, opts, bridge: "AGENT") - monkeypatch.setattr(_exec, "_run_tui", lambda agent, opts, bridge: calls.update(tui=agent)) + monkeypatch.setattr( + _exec, "_run_voice", lambda agent, opts, bridge, key: calls.update(voice=(agent, key)) + ) + monkeypatch.setattr(_exec, "_run_tui", lambda *a: calls.update(tui=True)) monkeypatch.setattr(_exec, "_run_repl", lambda *a: calls.update(repl=True)) 
monkeypatch.setattr("aai_cli.core.stdio.stdout_is_tty", lambda: True) monkeypatch.setattr("aai_cli.core.stdio.stdin_is_tty", lambda: True) state = SimpleNamespace(resolve_api_key=lambda: "k") _exec.run_code(_opts(), state, json_mode=False) + assert calls == {"voice": ("AGENT", "k")} + + +def test_run_code_dispatches_to_tui_when_voice_off(monkeypatch): + calls = {} + monkeypatch.setattr(_exec, "_build_agent", lambda key, opts, bridge: "AGENT") + monkeypatch.setattr(_exec, "_run_voice", lambda *a: calls.update(voice=True)) + monkeypatch.setattr(_exec, "_run_tui", lambda agent, opts, bridge: calls.update(tui=agent)) + monkeypatch.setattr(_exec, "_run_repl", lambda *a: calls.update(repl=True)) + monkeypatch.setattr("aai_cli.core.stdio.stdout_is_tty", lambda: True) + monkeypatch.setattr("aai_cli.core.stdio.stdin_is_tty", lambda: True) + state = SimpleNamespace(resolve_api_key=lambda: "k") + + _exec.run_code(_opts(voice=False), state, json_mode=False) assert calls == {"tui": "AGENT"} +def test_run_code_repl_when_voice_and_tui_off(monkeypatch): + calls = {} + monkeypatch.setattr(_exec, "_build_agent", lambda key, opts, bridge: "AGENT") + monkeypatch.setattr(_exec, "_run_voice", lambda *a: calls.update(voice=True)) + monkeypatch.setattr(_exec, "_run_tui", lambda *a: calls.update(tui=True)) + monkeypatch.setattr(_exec, "_run_repl", lambda agent, opts, bridge: calls.update(repl=agent)) + monkeypatch.setattr("aai_cli.core.stdio.stdout_is_tty", lambda: True) + monkeypatch.setattr("aai_cli.core.stdio.stdin_is_tty", lambda: True) + state = SimpleNamespace(resolve_api_key=lambda: "k") + + _exec.run_code(_opts(voice=False, tui=False), state, json_mode=False) + assert calls == {"repl": "AGENT"} + + def test_run_code_falls_back_to_repl_off_tty(monkeypatch): calls = {} monkeypatch.setattr(_exec, "_build_agent", lambda key, opts, bridge: "AGENT") @@ -81,7 +113,7 @@ def test_run_code_maps_keyboard_interrupt_to_exit_130(monkeypatch): def boom(*a): raise KeyboardInterrupt - 
monkeypatch.setattr(_exec, "_run_tui", boom) + monkeypatch.setattr(_exec, "_run_voice", boom) state = SimpleNamespace(resolve_api_key=lambda: "k") with pytest.raises(typer.Exit) as exc: @@ -187,3 +219,91 @@ def run(self, **kw): _exec._run_tui("AGENT", _opts(prompt="hi", session="s", root_dir=Path()), AskBridge()) assert seen["agent"] == "AGENT" and seen["thread_id"] == "s" assert seen["run_kw"] == {"mouse": False} + + +def test_voice_sink_renders_all_events_and_speaks_only_assistant_text(): + from aai_cli.code_agent.events import AssistantText, ToolCall + + rendered, spoken = [], [] + voice = SimpleNamespace(speak=spoken.append) + + def renderer(event): + rendered.append(event) + + sink = _exec._voice_sink(renderer, voice) + sink(AssistantText("here you go")) + sink(ToolCall(name="write_file", args={})) + + assert [type(e).__name__ for e in rendered] == ["AssistantText", "ToolCall"] + assert spoken == ["here you go"] # only the assistant's prose is read back + + +def test_announce_voice_message_depends_on_readback(): + notes = [] + renderer = SimpleNamespace(notice=notes.append) + + _exec._announce_voice(renderer, SimpleNamespace(readback=True)) + assert "read back" in notes[-1] + + _exec._announce_voice(renderer, SimpleNamespace(readback=False)) + assert "sandbox" in notes[-1] and "text" in notes[-1] + + +def test_voice_read_line_returns_spoken_line(): + notes = [] + renderer = SimpleNamespace(notice=notes.append) + voice = SimpleNamespace(listen=lambda: "add a flag") + + read_line = _exec._voice_read_line(voice, renderer) + assert read_line() == "add a flag" + assert any("Heard: add a flag" in n for n in notes) + + +def test_voice_read_line_passes_through_none_for_eof(): + renderer = SimpleNamespace(notice=lambda *a: None) + voice = SimpleNamespace(listen=lambda: None) + assert _exec._voice_read_line(voice, renderer)() is None + + +def test_voice_read_line_falls_back_to_typed_input_when_no_mic(monkeypatch): + notes = [] + renderer = 
SimpleNamespace(notice=notes.append) + calls = {"listen": 0} + + def flaky_mic(): + calls["listen"] += 1 + if calls["listen"] == 1: + raise CLIError("no device", error_type="mic_missing", exit_code=2) + return "SPOKEN AGAIN" # would leak through only if the mic were retried + + voice = SimpleNamespace(listen=flaky_mic) + monkeypatch.setattr(builtins, "input", lambda *a: "typed instead") + + read_line = _exec._voice_read_line(voice, renderer) + assert read_line() == "typed instead" # first call: mic fails -> typed input + assert read_line() == "typed instead" # stays typed; the mic is not retried + assert calls["listen"] == 1 # the latch flipped, so listen() was attempted only once + assert any("switching to typed input" in n.lower() for n in notes) + + +def test_voice_read_line_reraises_non_audio_errors(): + renderer = SimpleNamespace(notice=lambda *a: None) + + def boom(): + raise CLIError("gateway down", error_type="api_error", exit_code=1) + + voice = SimpleNamespace(listen=boom) + with pytest.raises(CLIError): + _exec._voice_read_line(voice, renderer)() + + +def test_run_voice_wires_ask_handler_and_drives_repl(monkeypatch): + class Dummy: + def invoke(self, *a, **k): + return {"messages": []} + + voice = SimpleNamespace(readback=False, listen=lambda: None, speak=lambda *a: None) + monkeypatch.setattr(_exec, "build_voice_session", lambda key: voice) + bridge = AskBridge() + _exec._run_voice(Dummy(), _opts(session="s3"), bridge, "k") + assert bridge.handler is _exec._ask_repl diff --git a/tests/test_code_voice.py b/tests/test_code_voice.py new file mode 100644 index 00000000..7fbb8017 --- /dev/null +++ b/tests/test_code_voice.py @@ -0,0 +1,117 @@ +"""Tests for the `assembly code` voice I/O (code_agent/voice.py + _exec voice helpers). 
+ +The bodies are intentionally unannotated: they drive the voice session through +lightweight fakes (a fake mic, stream_fn, synth_fn, and player) so no microphone, +speaker, or socket is ever touched — the strict type-checker skips untyped test bodies. +""" + +from __future__ import annotations + +from types import SimpleNamespace + +from aai_cli.code_agent import voice as voicemod +from aai_cli.code_agent.voice import VoiceSession, build_voice_session + + +class FakeMic: + def __init__(self, chunks, sample_rate=16000): + self._chunks = list(chunks) + self.sample_rate = sample_rate + + def __iter__(self): + return iter(self._chunks) + + +def _turn(text, *, end_of_turn): + return SimpleNamespace(transcript=text, end_of_turn=end_of_turn) + + +def test_listen_returns_final_turn_and_gates_mic_after_it(): + seen = {} + + def fake_stream(api_key, source, *, params, on_turn): + seen["key"] = api_key + seen["params"] = params + it = iter(source) + seen["before"] = next(it) # the first chunk flows before the turn finalizes + on_turn(_turn("add a verbose flag", end_of_turn=True)) + seen["after"] = list(it) # gated() must stop now, yielding nothing more + + session = VoiceSession( + api_key="k", + readback=False, + mic_factory=lambda: FakeMic([b"a", b"b", b"c"]), + stream_fn=fake_stream, + ) + assert session.listen() == "add a verbose flag" + assert seen["key"] == "k" + assert seen["before"] == b"a" + assert seen["after"] == [] # the mic was gated shut the instant the turn finalized + assert seen["params"].format_turns is True + assert seen["params"].sample_rate == 16000 + + +def test_listen_ignores_partials_and_returns_none_without_a_final_turn(): + def fake_stream(api_key, source, *, params, on_turn): + on_turn(_turn("typing in progr", end_of_turn=False)) # interim only + on_turn(_turn("", end_of_turn=True)) # finalized but empty -> not captured + on_turn(SimpleNamespace(transcript="no end_of_turn field")) # missing attr -> not final + list(source) + + session = 
VoiceSession( + api_key="k", readback=False, mic_factory=lambda: FakeMic([b"a"]), stream_fn=fake_stream + ) + # A turn is captured only when end_of_turn is truthy; a partial, an empty final, and an + # event lacking the field entirely (the getattr default is False) all leave it None. + assert session.listen() is None + + +class FakePlayer: + def __init__(self): + self.fed = [] + + def __enter__(self): + return self + + def __exit__(self, *exc): + return False + + def feed(self, pcm, sample_rate): + self.fed.append((pcm, sample_rate)) + + +def test_speak_synthesizes_and_plays_when_readback_on(): + player = FakePlayer() + captured = {} + + def fake_synth(api_key, config, *, on_audio): + captured["text"] = config.text + captured["rate"] = config.sample_rate + on_audio(b"pcm", 24000) + return SimpleNamespace(pcm=b"pcm", sample_rate=24000, audio_duration_seconds=0.0) + + session = VoiceSession( + api_key="k", readback=True, synth_fn=fake_synth, player_factory=lambda: player + ) + session.speak(" hello there ") + assert captured["text"] == "hello there" # stripped + assert captured["rate"] == 24000 + assert player.fed == [(b"pcm", 24000)] + + +def test_speak_is_a_noop_when_readback_off_or_text_blank(): + def boom(*a, **k): + raise AssertionError("synthesize must not be called") + + off = VoiceSession(api_key="k", readback=False, synth_fn=boom, player_factory=FakePlayer) + off.speak("hi") # readback off -> no synthesis + + blank = VoiceSession(api_key="k", readback=True, synth_fn=boom, player_factory=FakePlayer) + blank.speak(" ") # blank text -> no synthesis + + +def test_build_voice_session_readback_tracks_tts_availability(monkeypatch): + monkeypatch.setattr(voicemod.tts_session, "is_available", lambda: True) + assert build_voice_session("k").readback is True + monkeypatch.setattr(voicemod.tts_session, "is_available", lambda: False) + assert build_voice_session("k").readback is False
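
Note on the gating pattern in `VoiceSession.listen`: the one-utterance-per-call behavior comes from a generator that stops yielding once the turn callback sets a `threading.Event`. The sketch below isolates that idea with the standard library only. It is an illustration under assumptions, not the code in this diff: `capture_one_turn`, `demo_stream`, and the `(text, end_of_turn)` callback signature are hypothetical stand-ins for `listen`, `client.stream_audio`, and the SDK's Turn event.

```python
from __future__ import annotations

import threading
from collections.abc import Callable, Iterable, Iterator

# Hypothetical callback shape: (transcript text, end_of_turn flag).
TurnCallback = Callable[[str, bool], None]


def capture_one_turn(
    chunks: Iterable[bytes],
    stream: Callable[[Iterable[bytes], TurnCallback], None],
) -> str | None:
    """Return the first finalized, non-empty transcript, or None if input ends first."""
    done = threading.Event()
    captured: list[str] = []

    def on_turn(text: str, end_of_turn: bool) -> None:
        text = text.strip()
        if text and end_of_turn:
            captured.append(text)
            done.set()  # tells gated() to stop feeding audio

    def gated() -> Iterator[bytes]:
        for chunk in chunks:
            if done.is_set():
                return  # gate closed: no further chunks reach the stream
            yield chunk

    stream(gated(), on_turn)
    return " ".join(captured) or None


seen: list[bytes] = []  # chunks the fake stream actually received


def demo_stream(source: Iterable[bytes], on_turn: TurnCallback) -> None:
    """Fake streaming call: read one chunk, finalize a turn, then drain the source."""
    it = iter(source)
    seen.append(next(it))
    on_turn("add a verbose", False)  # interim result, ignored
    on_turn("Add a verbose flag.", True)  # finalized turn, closes the gate
    seen.extend(it)  # yields nothing more once the gate is closed


result = capture_one_turn([b"a", b"b", b"c"], demo_stream)
```

Because the gate is checked before each chunk is yielded, the consumer sees the end of the stream as soon as a turn finalizes. That is the behavior `test_listen_returns_final_turn_and_gates_mic_after_it` asserts against the real `VoiceSession`.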