Merged
2 changes: 1 addition & 1 deletion README.md
@@ -51,7 +51,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
| `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
| `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
| `assembly llm` | Prompt the LLM Gateway over a transcript, files, stdin, or a live stream |
| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval |
| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval. Defaults to voice in a terminal (speak your request, replies read back via streaming TTS in the sandbox); pass `--no-voice` for the keyboard TUI |
| `assembly clip` | Cut audio/video with ffmpeg by diarized speaker, text match, LLM pick, or time range (`--video` keeps the picture for URL sources) — clip boundaries snap into nearby silence |
| `assembly dub` | Re-voice an audio/video file or URL in another language: transcription, LLM translation, per-speaker TTS, ffmpeg track-swap (sandbox-only) |
| `assembly caption` | Burn always-visible captions into a video: transcribe (or reuse a transcript), fetch SRT, ffmpeg burns it in — audio untouched |
2 changes: 1 addition & 1 deletion aai_cli/AGENTS.md
@@ -153,7 +153,7 @@ heavily-reworked commands with long bodies; small commands keep the inline
- **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`).
- **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker.
- **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`). The single-voice default-playback path **streams**: `synthesize`'s `on_audio(chunk, sample_rate)` callback is wired to `audio.PcmPlayer.feed`, so speech starts on the first Audio frame (it opens the device lazily, since the rate is only known at Begin) instead of after the whole text — the win for a long `--url` page. `--out` (needs the full buffer) and the multi-voice dialogue path (`synthesize_dialogue` → `_output_audio` → buffered `play_pcm`) stay buffered; `synthesize` still returns the complete PCM for the summary regardless.
- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. `model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Tavily web search when `TAVILY_API_KEY` is set (`web_search.py`), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed.
- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. `model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Tavily web search when `TAVILY_API_KEY` is set (`web_search.py`), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed. **Voice is the default front-end in an interactive TTY** (`voice.py` + `_exec._run_voice`): `VoiceSession.listen` captures one spoken turn over Streaming STT (gating the mic shut the instant a turn finalizes) and `VoiceSession.speak` reads each assistant reply back over streaming TTS. It runs the **Rich REPL** loop (not the keyboard TUI) with a voice `read_line` + a reply-speaking sink. 
Readback needs streaming TTS, so it's **sandbox-only** (`tts.session.is_available`); in production the mic input still works and replies stay on screen. A mic-less box degrades to typed input on the first `AUDIO_ERROR_TYPES` `CLIError`; `--no-voice` selects the TUI, and a non-TTY (pipe/CI) the headless loop. Both legs (STT/TTS) are injected like the cascade's, so `tests/test_code_voice.py` drives it with fakes — no mic/speaker/socket.
- **`code_gen/`** — backs `--show-code` on `transcribe`/`stream`/`agent`: builds a ready-to-run Python SDK script from exactly the flags passed (no API key needed; generated code reads `ASSEMBLYAI_API_KEY`).
- **`auth/`** — browser-assisted `assembly login` via AMS + **Stytch B2B OAuth discovery** (`discovery.py`, `flow.py`, `loopback.py`, `ams.py`). Not Stytch Connected Apps.
- **`init/`** — scaffolds a self-contained FastAPI + HTML starter (`audio-transcription`/`live-captions`/`voice-agent` templates), optionally installs deps and opens the browser; writes the key to a git-ignored `.env`.
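
The front-end choice described in the `code_agent/` entry above (voice by default in a TTY, `--no-voice` for the TUI, a non-TTY for the headless loop, and a fall back to typed input on the first audio-device `CLIError`) can be sketched as below. This is a minimal illustration, not the code in `_exec.py`: `pick_frontend`, `make_read_line`, and the `CLIError` stand-in are hypothetical names, the `--tui/--no-tui` flag is ignored, and staying on typed input for the rest of the session after one failure is an assumption.

```python
from __future__ import annotations

from collections.abc import Callable

# Copied from voice.py in this diff: the error types that mean "no usable microphone".
AUDIO_ERROR_TYPES = frozenset({"mic_missing", "mic_error", "audio_input_error"})


class CLIError(Exception):
    """Hypothetical stand-in for the CLI's error type; only error_type matters here."""

    def __init__(self, message: str, error_type: str) -> None:
        super().__init__(message)
        self.error_type = error_type


def pick_frontend(is_tty: bool, voice: bool) -> str:
    """Choose a front-end name from the two inputs the text describes."""
    if not is_tty:
        return "headless"  # pipe/CI: no interactive front-end at all
    return "voice" if voice else "tui"


def make_read_line(listen: Callable[[], str | None], typed: Callable[[], str]) -> Callable[[], str | None]:
    """Wrap a voice `listen` so the first audio-device error switches to typed input."""
    voice_ok = True

    def read_line() -> str | None:
        nonlocal voice_ok
        if voice_ok:
            try:
                return listen()
            except CLIError as err:
                if err.error_type not in AUDIO_ERROR_TYPES:
                    raise  # unrelated failures are not swallowed
                voice_ok = False  # assumption: stay on typed input from here on
        return typed()

    return read_line
```

With a `listen` that always raises `CLIError(..., "mic_missing")`, the first `read_line()` call falls through to `typed()` and later calls never touch the microphone again.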
165 changes: 165 additions & 0 deletions aai_cli/code_agent/voice.py
@@ -0,0 +1,165 @@
"""Voice I/O for `assembly code`: speak your request, hear the reply.

The coding agent's default interactive mode (a TTY) captures one spoken turn via
streaming STT and reads each assistant reply back via streaming TTS. Both legs are
injected so the loop is unit-tested with fakes — no microphone, speaker, or socket.

Readback needs streaming TTS, which only the sandbox environment exposes
(`tts.session.is_available`); in production, voice *input* still works and replies
stay on screen as text. Microphone (STT) input works in every environment.
"""

from __future__ import annotations

import threading
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass
from typing import TYPE_CHECKING, Protocol

from aai_cli.core import client, config_builder
from aai_cli.core.microphone import MicrophoneSource
from aai_cli.tts import session as tts_session
from aai_cli.tts.audio import PcmPlayer
from aai_cli.tts.session import SpeakConfig

if TYPE_CHECKING:
    from assemblyai.streaming.v3 import StreamingParameters

# The audio-device CLIError types listen() raises when no usable microphone is present;
# the command degrades to typed input on these (see _exec._voice_read_line). They mirror
# the error_type values core.microphone attaches to its mic-open failures.
AUDIO_ERROR_TYPES = frozenset({"mic_missing", "mic_error", "audio_input_error"})

# Streaming TTS synthesizes at 24 kHz, the rate the readback player is opened at.
_TTS_SAMPLE_RATE = 24000

# The streaming STT model used to transcribe a spoken turn — the same realtime default
# `assembly stream` and `assembly agent-cascade` use.
_SPEECH_MODEL = "u3-rt-pro"


class Microphone(Protocol):
    """The microphone slice the listen loop drives: an iterable of PCM at a known rate."""

    sample_rate: int

    def __iter__(self) -> Iterator[bytes]:
        """Yield captured PCM16 chunks until the stream ends."""


class StreamFn(Protocol):
    """The streaming-STT call: ``client.stream_audio`` satisfies it structurally."""

    def __call__(
        self,
        api_key: str,
        source: Iterable[bytes],
        *,
        params: StreamingParameters,
        on_turn: Callable[[object], None],
    ) -> None:
        """Stream ``source`` and forward each Turn event to ``on_turn``."""


class SynthFn(Protocol):
    """The streaming-TTS call: ``tts.session.synthesize`` satisfies it structurally.

    The return is typed ``object`` because the readback path discards it (it plays each
    chunk through ``on_audio`` as it arrives), which also lets a test inject a fake that
    returns nothing meaningful.
    """

    def __call__(
        self,
        api_key: str,
        config: SpeakConfig,
        *,
        on_audio: Callable[[bytes, int], None],
    ) -> object:
        """Synthesize ``config.text``, handing each PCM chunk to ``on_audio``."""


class Player(Protocol):
    """The readback player: a context manager that ``feed``s PCM chunks (PcmPlayer)."""

    def __enter__(self) -> Player:
        """Enter the playback context (opens the device lazily on first feed)."""

    def __exit__(self, exc_type: object, *exc: object) -> object:
        """Drain on a clean exit, abort otherwise; never suppress."""

    def feed(self, pcm: bytes, sample_rate: int) -> None:
        """Play one PCM chunk, opening the output device on the first call."""


def _stt_params(sample_rate: int) -> StreamingParameters:
    """StreamingParameters for capturing one spoken turn at ``sample_rate``.

    ``format_turns`` is on so the finalized turn reads like a typed prompt (punctuated
    and cased) rather than raw lowercase tokens.
    """
    merged = config_builder.merge_streaming_params(
        flags={"speech_model": _SPEECH_MODEL, "format_turns": True, "sample_rate": sample_rate}
    )
    return config_builder.construct_streaming_params(merged)


@dataclass
class VoiceSession:
    """Speak-to-it / read-it-back I/O for one coding session, with injectable legs."""

    api_key: str
    readback: bool
    mic_factory: Callable[[], Microphone] = MicrophoneSource
    stream_fn: StreamFn = client.stream_audio
    synth_fn: SynthFn = tts_session.synthesize
    player_factory: Callable[[], Player] = PcmPlayer

    def listen(self) -> str | None:
        """Capture one spoken turn and return its finalized transcript.

        Returns the text of the first end-of-turn the server finalizes, or ``None`` when
        the microphone stream ends without one (EOF — e.g. a finite source in tests). The
        microphone is gated shut the moment a turn finalizes, so exactly one utterance is
        captured per call; a real mic blocks until you speak (Ctrl-C to quit).
        """
        mic = self.mic_factory()
        done = threading.Event()
        captured: list[str] = []

        def on_turn(event: object) -> None:
            text = (getattr(event, "transcript", "") or "").strip()
            if text and getattr(event, "end_of_turn", False):
                captured.append(text)
                done.set()

        def gated() -> Iterator[bytes]:
            for chunk in mic:
                if done.is_set():
                    return
                yield chunk

        self.stream_fn(self.api_key, gated(), params=_stt_params(mic.sample_rate), on_turn=on_turn)
        return " ".join(captured).strip() or None

    def speak(self, text: str) -> None:
        """Read ``text`` back via streaming TTS, when readback is available.

        A no-op when readback is off (production, where streaming TTS has no host) or the
        text is blank — so the caller can route every assistant reply here unconditionally.
        """
        text = text.strip()
        if not self.readback or not text:
            return
        config = SpeakConfig(text=text, sample_rate=_TTS_SAMPLE_RATE)
        with self.player_factory() as player:
            self.synth_fn(self.api_key, config, on_audio=player.feed)


def build_voice_session(api_key: str) -> VoiceSession:
    """A voice session for the active environment.

    Readback is enabled only where streaming TTS is available (the sandbox); microphone
    input is wired regardless.
    """
    return VoiceSession(api_key=api_key, readback=tts_session.is_available())
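
The gating trick in `listen` (stop feeding microphone chunks to the STT stream as soon as a turn finalizes) can be exercised without any audio stack. The sketch below is a self-contained illustration under stated assumptions: `FakeTurn`, `fake_stream`, and `listen_once` are hypothetical names that mirror the shape of `on_turn` and `gated` above, the stream function takes a simplified two-argument signature, and the `consumed` counter exists only so the gate's effect is observable.

```python
from __future__ import annotations

import threading
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass


@dataclass
class FakeTurn:
    """Hypothetical stand-in for a streaming Turn event."""

    transcript: str
    end_of_turn: bool


def fake_stream(source: Iterable[bytes], on_turn: Callable[[object], None]) -> None:
    """Hypothetical STT leg: emits a partial after chunk 1 and a final turn after chunk 2."""
    for count, _chunk in enumerate(source, start=1):
        if count == 1:
            on_turn(FakeTurn("add a", end_of_turn=False))
        elif count == 2:
            on_turn(FakeTurn("Add a test.", end_of_turn=True))


def listen_once(
    mic: Iterable[bytes],
    stream_fn: Callable[[Iterable[bytes], Callable[[object], None]], None],
) -> tuple[str | None, int]:
    """Return (finalized transcript or None, number of chunks forwarded to the stream)."""
    done = threading.Event()
    captured: list[str] = []
    consumed = 0

    def on_turn(event: object) -> None:
        text = (getattr(event, "transcript", "") or "").strip()
        # Only a non-empty, finalized turn closes the gate; partials are ignored.
        if text and getattr(event, "end_of_turn", False):
            captured.append(text)
            done.set()

    def gated() -> Iterator[bytes]:
        nonlocal consumed
        for chunk in mic:
            if done.is_set():
                return  # gate shut: the stream sees end-of-input
            consumed += 1
            yield chunk

    stream_fn(gated(), on_turn)
    return (" ".join(captured).strip() or None), consumed
```

Feeding four chunks through `listen_once(..., fake_stream)` forwards only the first two, because the finalized turn closes the gate before the third is yielded. A source that ends before any turn finalizes yields `None`, matching the EOF case in the docstring above.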
10 changes: 10 additions & 0 deletions aai_cli/commands/code/__init__.py
@@ -71,6 +71,11 @@ def code(
    tui: bool = typer.Option(
        True, "--tui/--no-tui", help="Use the full-screen TUI (off: a plain read-eval loop)"
    ),
    voice: bool = typer.Option(
        True,
        "--voice/--no-voice",
        help="Speak to the agent and hear replies read back (readback needs the sandbox)",
    ),
) -> None:
    """Run a terminal coding agent backed by the AssemblyAI LLM Gateway

@@ -79,6 +84,10 @@ def code(
    invoke the 'assembly' CLI itself — all in the working directory. It talks
    only to the AssemblyAI LLM Gateway. Mutating actions ask for approval unless
    you pass --auto.

    In an interactive terminal it defaults to voice: speak your request (mic ->
    streaming STT) and the agent's replies are read back aloud (sandbox only).
    Pass --no-voice for the keyboard TUI, or pipe input for the headless loop.
    """
    opts = code_exec.CodeOptions(
        prompt=prompt,
@@ -92,5 +101,6 @@ def code(
        session=session,
        persist=persist,
        tui=tui,
        voice=voice,
    )
    run_with_options(ctx, code_exec.run_code, opts, json=False)
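
For readers unfamiliar with paired on/off flags, the default-on `--voice/--no-voice` behaviour can be reproduced with the standard library alone. This sketch deliberately swaps Typer for `argparse.BooleanOptionalAction` (Python 3.9+) so it runs with no third-party install; `build_parser` and the `code-sketch` program name are hypothetical and not part of the CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """A stand-in parser with one default-on boolean flag pair."""
    parser = argparse.ArgumentParser(prog="code-sketch")
    parser.add_argument(
        "--voice",
        action=argparse.BooleanOptionalAction,  # generates --voice and --no-voice
        default=True,
        help="Speak to the agent and hear replies read back",
    )
    return parser
```

Parsing an empty argument list leaves `voice` at `True`, and passing `--no-voice` sets it to `False`, which is the same shape of value the diff forwards into `CodeOptions(voice=voice)`.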