AssemblyAI · alexkroman · Jun 18, 2026
diff --git a/README.md b/README.md
@@ -51,7 +51,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 | `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
 | `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
 | `assembly llm` | Prompt the LLM Gateway over a transcript, files, stdin, or a live stream |
-| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval |
+| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval. Defaults to voice in a terminal (speak your request, replies read back via streaming TTS in the sandbox); pass `--no-voice` for the keyboard TUI |
 | `assembly clip` | Cut audio/video with ffmpeg by diarized speaker, text match, LLM pick, or time range (`--video` keeps the picture for URL sources) — clip boundaries snap into nearby silence |
 | `assembly dub` | Re-voice an audio/video file or URL in another language: transcription, LLM translation, per-speaker TTS, ffmpeg track-swap (sandbox-only) |
 | `assembly caption` | Burn always-visible captions into a video: transcribe (or reuse a transcript), fetch SRT, ffmpeg burns it in — audio untouched |

diff --git a/aai_cli/AGENTS.md b/aai_cli/AGENTS.md
@@ -153,7 +153,7 @@ heavily-reworked commands with long bodies; small commands keep the inline
 - **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`).
 - **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker.
 - **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`). The single-voice default-playback path **streams**: `synthesize`'s `on_audio(chunk, sample_rate)` callback is wired to `audio.PcmPlayer.feed`, so speech starts on the first Audio frame (it opens the device lazily, since the rate is only known at Begin) instead of after the whole text — the win for a long `--url` page. `--out` (needs the full buffer) and the multi-voice dialogue path (`synthesize_dialogue` → `_output_audio` → buffered `play_pcm`) stay buffered; `synthesize` still returns the complete PCM for the summary regardless.
-- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. `model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Tavily web search when `TAVILY_API_KEY` is set (`web_search.py`), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed.
+- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. `model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Tavily web search when `TAVILY_API_KEY` is set (`web_search.py`), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed. **Voice is the default front-end in an interactive TTY** (`voice.py` + `_exec._run_voice`): `VoiceSession.listen` captures one spoken turn over Streaming STT (gating the mic shut the instant a turn finalizes) and `VoiceSession.speak` reads each assistant reply back over streaming TTS. It runs the **Rich REPL** loop (not the keyboard TUI) with a voice `read_line` + a reply-speaking sink. Readback needs streaming TTS, so it's **sandbox-only** (`tts.session.is_available`); in production the mic input still works and replies stay on screen. A mic-less box degrades to typed input on the first `AUDIO_ERROR_TYPES` `CLIError`; `--no-voice` selects the TUI, and a non-TTY (pipe/CI) the headless loop. Both legs (STT/TTS) are injected like the cascade's, so `tests/test_code_voice.py` drives it with fakes — no mic/speaker/socket.
 - **`code_gen/`** — backs `--show-code` on `transcribe`/`stream`/`agent`: builds a ready-to-run Python SDK script from exactly the flags passed (no API key needed; generated code reads `ASSEMBLYAI_API_KEY`).
 - **`auth/`** — browser-assisted `assembly login` via AMS + **Stytch B2B OAuth discovery** (`discovery.py`, `flow.py`, `loopback.py`, `ams.py`). Not Stytch Connected Apps.
 - **`init/`** — scaffolds a self-contained FastAPI + HTML starter (`audio-transcription`/`live-captions`/`voice-agent` templates), optionally installs deps and opens the browser; writes the key to a git-ignored `.env`.

diff --git a/aai_cli/code_agent/voice.py b/aai_cli/code_agent/voice.py
@@ -0,0 +1,165 @@
+"""Voice I/O for `assembly code`: speak your request, hear the reply.
+
+The coding agent's default interactive mode (a TTY) captures one spoken turn via
+streaming STT and reads each assistant reply back via streaming TTS. Both legs are
+injected so the loop is unit-tested with fakes — no microphone, speaker, or socket.
+
+Readback needs streaming TTS, which only the sandbox environment exposes
+(`tts.session.is_available`). Microphone (STT) input works in every environment, so
+in production you can still speak your request; replies stay on screen as text.
+"""
+
+from __future__ import annotations
+
+import threading
+from collections.abc import Callable, Iterable, Iterator
+from dataclasses import dataclass
+from typing import TYPE_CHECKING, Protocol
+
+from aai_cli.core import client, config_builder
+from aai_cli.core.microphone import MicrophoneSource
+from aai_cli.tts import session as tts_session
+from aai_cli.tts.audio import PcmPlayer
+from aai_cli.tts.session import SpeakConfig
+
+if TYPE_CHECKING:
+    from assemblyai.streaming.v3 import StreamingParameters
+
+# The audio-device CLIError types listen() raises when no usable microphone is present;
+# the command degrades to typed input on these (see _exec._voice_read_line). They mirror
+# the error_type values core.microphone attaches to its mic-open failures.
+AUDIO_ERROR_TYPES = frozenset({"mic_missing", "mic_error", "audio_input_error"})
+
+# Streaming TTS synthesizes at 24 kHz, the rate the readback player is opened at.
+_TTS_SAMPLE_RATE = 24000
+
+# The streaming STT model used to transcribe a spoken turn — the same realtime default
+# `assembly stream` and `assembly agent-cascade` use.
+_SPEECH_MODEL = "u3-rt-pro"
+
+
+class Microphone(Protocol):
+    """The microphone slice the listen loop drives: an iterable of PCM at a known rate."""
+
+    sample_rate: int
+
+    def __iter__(self) -> Iterator[bytes]:
+        """Yield captured PCM16 chunks until the stream ends."""
+
+
+class StreamFn(Protocol):
+    """The streaming-STT call: ``client.stream_audio`` satisfies it structurally."""
+
+    def __call__(
+        self,
+        api_key: str,
+        source: Iterable[bytes],
+        *,
+        params: StreamingParameters,
+        on_turn: Callable[[object], None],
+    ) -> None:
+        """Stream ``source`` and forward each Turn event to ``on_turn``."""
+
+
+class SynthFn(Protocol):
+    """The streaming-TTS call: ``tts.session.synthesize`` satisfies it structurally.
+
+    The return is typed ``object`` because the readback path discards it (it plays each
+    chunk through ``on_audio`` as it arrives), which also lets a test inject a fake that
+    returns nothing meaningful.
+    """
+
+    def __call__(
+        self,
+        api_key: str,
+        config: SpeakConfig,
+        *,
+        on_audio: Callable[[bytes, int], None],
+    ) -> object:
+        """Synthesize ``config.text``, handing each PCM chunk to ``on_audio``."""
+
+
+class Player(Protocol):
+    """The readback player: a context manager that ``feed``s PCM chunks (PcmPlayer)."""
+
+    def __enter__(self) -> Player:
+        """Enter the playback context (opens the device lazily on first feed)."""
+
+    def __exit__(self, exc_type: object, *exc: object) -> object:
+        """Drain on a clean exit, abort otherwise; never suppress."""
+
+    def feed(self, pcm: bytes, sample_rate: int) -> None:
+        """Play one PCM chunk, opening the output device on the first call."""
+
+
+def _stt_params(sample_rate: int) -> StreamingParameters:
+    """StreamingParameters for capturing one spoken turn at ``sample_rate``.
+
+    ``format_turns`` is on so the finalized turn reads like a typed prompt (punctuated
+    and cased) rather than raw lowercase tokens.
+    """
+    merged = config_builder.merge_streaming_params(
+        flags={"speech_model": _SPEECH_MODEL, "format_turns": True, "sample_rate": sample_rate}
+    )
+    return config_builder.construct_streaming_params(merged)
+
+
+@dataclass
+class VoiceSession:
+    """Speak-to-it / read-it-back I/O for one coding session, with injectable legs."""
+
+    api_key: str
+    readback: bool
+    mic_factory: Callable[[], Microphone] = MicrophoneSource
+    stream_fn: StreamFn = client.stream_audio
+    synth_fn: SynthFn = tts_session.synthesize
+    player_factory: Callable[[], Player] = PcmPlayer
+
+    def listen(self) -> str | None:
+        """Capture one spoken turn and return its finalized transcript.
+
+        Returns the text of the first end-of-turn the server finalizes, or ``None`` when
+        the microphone stream ends without one (EOF — e.g. a finite source in tests). The
+        microphone is gated shut the moment a turn finalizes, so exactly one utterance is
+        captured per call; a real mic blocks until you speak (Ctrl-C to quit).
+        """
+        mic = self.mic_factory()
+        done = threading.Event()
+        captured: list[str] = []
+
+        def on_turn(event: object) -> None:
+            text = (getattr(event, "transcript", "") or "").strip()
+            if text and getattr(event, "end_of_turn", False):
+                captured.append(text)
+                done.set()
+
+        def gated() -> Iterator[bytes]:
+            for chunk in mic:
+                if done.is_set():
+                    return
+                yield chunk
+
+        self.stream_fn(self.api_key, gated(), params=_stt_params(mic.sample_rate), on_turn=on_turn)
+        return " ".join(captured).strip() or None
+
+    def speak(self, text: str) -> None:
+        """Read ``text`` back via streaming TTS, when readback is available.
+
+        A no-op when readback is off (production, where streaming TTS has no host) or the
+        text is blank — so the caller can route every assistant reply here unconditionally.
+        """
+        text = text.strip()
+        if not self.readback or not text:
+            return
+        config = SpeakConfig(text=text, sample_rate=_TTS_SAMPLE_RATE)
+        with self.player_factory() as player:
+            self.synth_fn(self.api_key, config, on_audio=player.feed)
+
+
+def build_voice_session(api_key: str) -> VoiceSession:
+    """A voice session for the active environment.
+
+    Readback is enabled only where streaming TTS is available (the sandbox); microphone
+    input is wired regardless.
+    """
+    return VoiceSession(api_key=api_key, readback=tts_session.is_available())
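The gating pattern in `VoiceSession.listen` can be exercised without audio hardware. The sketch below is a standalone approximation and not the project's test suite: `FakeTurn`, `FakeMic`, `fake_stream`, and `listen_once` are hypothetical names, and `fake_stream` only stands in for `client.stream_audio` by finalizing a turn after the second chunk. It assumes nothing about the real SDK beyond the `transcript` and `end_of_turn` attributes the diff reads.

```python
from __future__ import annotations

import threading
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass


@dataclass
class FakeTurn:
    """Hypothetical stand-in for a streaming Turn event."""

    transcript: str
    end_of_turn: bool


class FakeMic:
    """Hypothetical finite microphone: yields fixed chunks and counts reads."""

    sample_rate = 16000

    def __init__(self, chunks: list[bytes]) -> None:
        self._chunks = chunks
        self.reads = 0

    def __iter__(self) -> Iterator[bytes]:
        for chunk in self._chunks:
            self.reads += 1
            yield chunk


def fake_stream(source: Iterable[bytes], on_turn: Callable[[object], None]) -> list[bytes]:
    """Consume ``source`` like an STT client; finalize a turn after chunk 2."""
    sent: list[bytes] = []
    for index, chunk in enumerate(source, start=1):
        sent.append(chunk)
        if index == 1:
            on_turn(FakeTurn("list the", end_of_turn=False))  # partial: ignored
        if index == 2:
            on_turn(FakeTurn("List the files.", end_of_turn=True))
    return sent


def listen_once(mic: FakeMic) -> tuple[str | None, list[bytes]]:
    """Same shape as the diff's listen(): gate the mic once a turn finalizes."""
    done = threading.Event()
    captured: list[str] = []

    def on_turn(event: object) -> None:
        text = (getattr(event, "transcript", "") or "").strip()
        if text and getattr(event, "end_of_turn", False):
            captured.append(text)
            done.set()

    def gated() -> Iterator[bytes]:
        for chunk in mic:
            if done.is_set():
                return  # stop feeding audio after the finalized turn
            yield chunk

    sent = fake_stream(gated(), on_turn)
    return (" ".join(captured).strip() or None), sent


mic = FakeMic([b"a", b"b", b"c", b"d"])
text, sent = listen_once(mic)
eof_text, eof_sent = listen_once(FakeMic([b"a"]))
```

In this sketch the gate is checked only when the next chunk arrives, so one extra chunk is read from the microphone and dropped before the generator returns. A source that ends before any turn finalizes yields `None`, matching the EOF case the docstring describes.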
diff --git a/aai_cli/commands/code/__init__.py b/aai_cli/commands/code/__init__.py
@@ -71,6 +71,11 @@ def code(
     tui: bool = typer.Option(
         True, "--tui/--no-tui", help="Use the full-screen TUI (off: a plain read-eval loop)"
     ),
+    voice: bool = typer.Option(
+        True,
+        "--voice/--no-voice",
+        help="Speak to the agent and hear replies read back (readback needs the sandbox)",
+    ),
 ) -> None:
     """Run a terminal coding agent backed by the AssemblyAI LLM Gateway
 
@@ -79,6 +84,10 @@ def code(
     invoke the 'assembly' CLI itself — all in the working directory. It talks
     only to the AssemblyAI LLM Gateway. Mutating actions ask for approval unless
     you pass --auto.
+
+    In an interactive terminal it defaults to voice: speak your request (mic ->
+    streaming STT) and the agent's replies are read back aloud (sandbox only).
+    Pass --no-voice for the keyboard TUI, or pipe input for the headless loop.
     """
     opts = code_exec.CodeOptions(
         prompt=prompt,
@@ -92,5 +101,6 @@ def code(
         session=session,
         persist=persist,
         tui=tui,
+        voice=voice,
     )
     run_with_options(ctx, code_exec.run_code, opts, json=False)
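The front-end choice described in the docstring (voice in an interactive terminal, `--no-voice` for the keyboard TUI, piped input for the headless loop) can be sketched as a small pure function. This is an illustration under assumptions, not the code in `_exec`: `FrontEnd` and `pick_front_end` are hypothetical names, and both the precedence of `--voice` over `--tui` and the `--no-voice --no-tui` result (the plain read-eval loop) are guesses based on the option help text.

```python
from enum import Enum


class FrontEnd(Enum):
    """Hypothetical labels for the four loops the diff mentions."""

    VOICE = "voice"        # Rich REPL with spoken input and readback
    TUI = "tui"            # full-screen keyboard TUI
    REPL = "repl"          # plain read-eval loop
    HEADLESS = "headless"  # non-TTY (pipe/CI)


def pick_front_end(*, is_tty: bool, voice: bool, tui: bool) -> FrontEnd:
    """Choose a front-end from the terminal state and the two boolean flags."""
    if not is_tty:
        # Piped or CI input never opens the microphone or a TUI.
        return FrontEnd.HEADLESS
    if voice:
        # Assumption: voice wins over --tui, since it runs the Rich REPL loop.
        return FrontEnd.VOICE
    # Assumption: with voice off, --tui/--no-tui behaves as it did before.
    return FrontEnd.TUI if tui else FrontEnd.REPL
```

Keeping the decision in a function of three booleans would make each combination checkable without a terminal; whether the real command factors it this way is not shown in the diff.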