Merged
3 changes: 2 additions & 1 deletion .importlinter
@@ -13,7 +13,7 @@ type = layers
; assembles the command layer — main, command_registry, help_panels, options —
; stays at the package root, above `commands`, and is intentionally unlisted
; (it legitimately imports the command modules to discover/register them).
-; Feature slices (agent, tts, streaming, code_gen, init, auth, onboard) are
+; Feature slices (agent, tts, streaming, code_agent, code_gen, init, auth, onboard) are
; likewise unlisted vertical slices governed by contract 2.
layers =
commands
@@ -34,6 +34,7 @@ source_modules =
aai_cli.agent
aai_cli.agent_cascade
aai_cli.auth
+aai_cli.code_agent
aai_cli.code_gen
aai_cli.init
aai_cli.onboard
1 change: 1 addition & 0 deletions README.md
@@ -51,6 +51,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
| `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
| `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
| `assembly llm` | Prompt the LLM Gateway over a transcript, files, stdin, or a live stream |
+| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval |
| `assembly clip` | Cut audio/video with ffmpeg by diarized speaker, text match, LLM pick, or time range (`--video` keeps the picture for URL sources) — clip boundaries snap into nearby silence |
| `assembly dub` | Re-voice an audio/video file or URL in another language: transcription, LLM translation, per-speaker TTS, ffmpeg track-swap (sandbox-only) |
| `assembly caption` | Burn always-visible captions into a video: transcribe (or reuse a transcript), fetch SRT, ffmpeg burns it in — audio untouched |
5 changes: 3 additions & 2 deletions aai_cli/AGENTS.md
@@ -44,8 +44,8 @@ contract:
`help_panels`, `options`. They assemble/define the command layer (and
`command_registry` imports the command modules to discover them), so they live
*above* `commands` and stay at the root.
-- **Feature slices** — `agent/`, `tts/`, `streaming/`, `code_gen/`, `init/`,
-  `auth/`, `onboard/`. These are cohesive vertical slices that internally mix
+- **Feature slices** — `agent/`, `tts/`, `streaming/`, `code_agent/`, `code_gen/`,
+  `init/`, `auth/`, `onboard/`. These are cohesive vertical slices that internally mix
protocol + rendering, so they aren't a single horizontal layer; contract 2
forbids them from importing `commands`.

@@ -153,6 +153,7 @@ heavily-reworked commands with long bodies; small commands keep the inline
- **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`).
- **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker.
- **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`). The single-voice default-playback path **streams**: `synthesize`'s `on_audio(chunk, sample_rate)` callback is wired to `audio.PcmPlayer.feed`, so speech starts on the first Audio frame (it opens the device lazily, since the rate is only known at Begin) instead of after the whole text — the win for a long `--url` page. `--out` (needs the full buffer) and the multi-voice dialogue path (`synthesize_dialogue` → `_output_audio` → buffered `play_pcm`) stay buffered; `synthesize` still returns the complete PCM for the summary regardless.
+- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. `model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Tavily web search when `TAVILY_API_KEY` is set (`web_search.py`), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed.
- **`code_gen/`** — backs `--show-code` on `transcribe`/`stream`/`agent`: builds a ready-to-run Python SDK script from exactly the flags passed (no API key needed; generated code reads `ASSEMBLYAI_API_KEY`).
- **`auth/`** — browser-assisted `assembly login` via AMS + **Stytch B2B OAuth discovery** (`discovery.py`, `flow.py`, `loopback.py`, `ams.py`). Not Stytch Connected Apps.
- **`init/`** — scaffolds a self-contained FastAPI + HTML starter (`audio-transcription`/`live-captions`/`voice-agent` templates), optionally installs deps and opens the browser; writes the key to a git-ignored `.env`.
16 changes: 16 additions & 0 deletions aai_cli/code_agent/__init__.py
@@ -0,0 +1,16 @@
"""`assembly code` — a terminal coding agent built on the deepagents SDK.

A bespoke port of langchain-ai/deepagents' `code` agent, wired so it **only**
talks to the AssemblyAI LLM Gateway (an OpenAI-compatible endpoint reached via
`langchain_openai.ChatOpenAI`; see `model.py`). The agent gets deepagents'
built-in filesystem + shell tools — rooted at the working directory through a
`LocalShellBackend` — plus a custom `assembly` tool that invokes this very CLI,
so it can transcribe/stream/run-LLM as part of a coding task (`cli_tool.py`).

The pieces are split so the orchestration (`session.py`) is unit-tested against
a fake chat model driving the *real* deepagents graph, with no network: `agent.py`
builds the graph, `render.py` draws the conversation, and the Typer command in
`aai_cli/commands/code/` wires the gateway model + real CLI runner in.
"""

from __future__ import annotations
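Reviewer sketch of the injection seam this docstring describes, assuming only what the PR shows: a tool factory takes a runner callable, so a test can hand it a fake. `make_assembly_tool` and `fake_runner` are hypothetical names for illustration; the real factory is `build_cli_tool`, which also wraps the function as a LangChain tool.

```python
from __future__ import annotations

from collections.abc import Callable

# Same shape as the CliRunner alias in cli_tool.py: argument list in, formatted text out.
CliRunner = Callable[[list[str]], str]


def make_assembly_tool(runner: CliRunner) -> Callable[[list[str]], str]:
    """Hypothetical stand-in for build_cli_tool, without the LangChain wrapper."""

    def assembly(arguments: list[str]) -> str:
        # The tool body only delegates, so whatever runner is injected decides
        # whether a real subprocess is spawned.
        return runner(arguments)

    return assembly


# A fake runner records what it was asked to run instead of spawning anything.
calls: list[list[str]] = []


def fake_runner(args: list[str]) -> str:
    calls.append(args)
    return "exit code: 0"


tool = make_assembly_tool(fake_runner)
result = tool(["transcribe", "audio.mp3", "--json"])
print(result)
print(calls)
```

Because the runner is the only thing that touches the outside world, swapping it is enough to keep a test offline.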
86 changes: 86 additions & 0 deletions aai_cli/code_agent/agent.py
@@ -0,0 +1,86 @@
"""Assemble the deepagents graph for `assembly code`.

Wires the gateway model to deepagents' built-in coding toolset (filesystem + shell,
rooted at the working directory via a `LocalShellBackend`), plus the custom `assembly`
CLI tool and any MCP/docs tools, the installed-skills middleware, and human-in-the-loop
approval on the mutating tools. The compiled graph is driven turn-by-turn from
`session.py`; a checkpointer (an `InMemorySaver` by default, or the SQLite saver the
command passes in) gives both conversation memory and the interrupt/resume the
approval flow needs.
"""

from __future__ import annotations

from collections.abc import Mapping, Sequence
from pathlib import Path
from typing import TYPE_CHECKING, Protocol

from aai_cli.code_agent.cli_tool import CLI_TOOL_NAME
from aai_cli.code_agent.fetch_tool import FETCH_TOOL_NAME
from aai_cli.code_agent.prompt import build_system_prompt

if TYPE_CHECKING:
    from langchain.agents.middleware import AgentMiddleware
    from langchain_core.language_models.chat_models import BaseChatModel
    from langchain_core.tools import BaseTool
    from langgraph.checkpoint.base import BaseCheckpointSaver

# The tools whose effects reach outside the model — file writes, edits, arbitrary
# shell, the AssemblyAI CLI (which can spend account credits), and URL fetches (which
# can reach internal/SSRF targets). Each is gated behind human approval unless the
# session opts into --auto.
MUTATING_TOOLS = ("write_file", "edit_file", "execute", CLI_TOOL_NAME, FETCH_TOOL_NAME)


class CompiledAgent(Protocol):
    """The slice of the compiled langgraph graph the session drives.

    A structural type so we needn't name langgraph's deeply-generic
    ``CompiledStateGraph`` (and don't drag its type params through our code).
    """

    def invoke(
        self, input: object, config: Mapping[str, object] | None = None
    ) -> dict[str, object]:
        """Run one step of the graph, returning the updated state (incl. messages)."""


def _interrupt_config(*, auto_approve: bool) -> dict[str, bool] | None:
    """The ``interrupt_on`` map: gate every mutating tool on approval, or ``None`` under --auto."""
    if auto_approve:
        return None
    return dict.fromkeys(MUTATING_TOOLS, True)


def build_agent(
    *,
    model: BaseChatModel,
    root_dir: Path,
    tools: Sequence[BaseTool] = (),
    middlewares: Sequence[AgentMiddleware] = (),
    checkpointer: BaseCheckpointSaver | None = None,
    auto_approve: bool = False,
) -> CompiledAgent:
    """Compile the coding agent over ``root_dir`` with ``tools`` and ``middlewares``.

    ``model`` is the only network seam — tests pass a fake chat model so the real
    deepagents graph (filesystem + shell tools, approval, checkpointing) runs offline.
    ``checkpointer`` defaults to an in-memory saver (one ephemeral session); the command
    passes a SQLite saver for persistent, resumable sessions.
    """
    from deepagents import create_deep_agent
    from deepagents.backends import LocalShellBackend
    from langgraph.checkpoint.memory import InMemorySaver

    # virtual_mode=True maps the model's "/"-rooted paths under root_dir and blocks
    # traversal escapes, so file ops and shell stay inside the working directory.
    backend = LocalShellBackend(root_dir=str(root_dir), virtual_mode=True)

    return create_deep_agent(
        model=model,
        backend=backend,
        system_prompt=build_system_prompt(str(root_dir)),
        tools=list(tools),
        middleware=list(middlewares),
        interrupt_on=_interrupt_config(auto_approve=auto_approve),
        checkpointer=checkpointer if checkpointer is not None else InMemorySaver(),
    )
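Reviewer sketch of what the approval map looks like in each mode, written as a standalone copy of `_interrupt_config` so it runs without deepagents. The tool names are partly assumed: `"assembly"` is `CLI_TOOL_NAME` from `cli_tool.py`, while `"fetch_url"` is a guess at `FETCH_TOOL_NAME` based on the tool name used in AGENTS.md.

```python
from __future__ import annotations

# Assumed values; only "assembly" is confirmed by this diff (cli_tool.py).
MUTATING_TOOLS = ("write_file", "edit_file", "execute", "assembly", "fetch_url")


def interrupt_config(*, auto_approve: bool) -> dict[str, bool] | None:
    """Standalone copy of _interrupt_config for illustration."""
    if auto_approve:
        # No map at all: nothing pauses for a human.
        return None
    # Every mutating tool maps to True, meaning "pause and ask before running".
    return dict.fromkeys(MUTATING_TOOLS, True)


gated = interrupt_config(auto_approve=False)
ungated = interrupt_config(auto_approve=True)
print(gated)
print(ungated)
```

Tools that are not in the tuple never appear in the map, so they would run without a prompt in either mode.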
51 changes: 51 additions & 0 deletions aai_cli/code_agent/ask_tool.py
@@ -0,0 +1,51 @@
"""An `ask_user` tool so the agent can ask the user a question mid-task.

deepagents-code ships an AskUser middleware; base deepagents does not, so we add a
small tool. The actual prompting is injected through an :class:`AskBridge`: the Rich
REPL reads a line, the Textual TUI pops an input modal, and tests script the answer —
the tool itself just calls the bridge, so it stays framework-agnostic. It is *not*
approval-gated (it is itself the user interaction).
"""

from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from langchain_core.tools import BaseTool

ASK_TOOL_NAME = "ask_user"


def _unanswered(_question: str) -> str:
    """Default handler before a front-end registers one: no human is attached."""
    return "No user is available to answer; proceed with your best judgment."


@dataclass
class AskBridge:
    """A late-bound seam for asking the user a question.

    The agent (and its tools) are built before the front-end exists, so the tool
    captures this bridge and the REPL/TUI sets :attr:`handler` once it's running.
    """

    handler: Callable[[str], str] = field(default=_unanswered)

    def ask(self, question: str) -> str:
        return self.handler(question)


def build_ask_tool(bridge: AskBridge) -> BaseTool:
    """Wrap an :class:`AskBridge` as the ``ask_user`` tool."""
    from langchain_core.tools import tool

    @tool(ASK_TOOL_NAME)
    def ask_user(question: str) -> str:
        """Ask the user a clarifying question and return their answer. Use when you
        genuinely need information only the user has before continuing."""
        return bridge.ask(question)

    return ask_user
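Reviewer sketch of the late binding the `AskBridge` docstring describes: the tool closure is created first, and the handler is swapped in afterwards. The bridge is copied from the diff; the plain `ask_user` function is a stand-in for the LangChain-wrapped tool, and the scripted answer is made up for the example.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field


def _unanswered(_question: str) -> str:
    return "No user is available to answer; proceed with your best judgment."


@dataclass
class AskBridge:
    handler: Callable[[str], str] = field(default=_unanswered)

    def ask(self, question: str) -> str:
        return self.handler(question)


bridge = AskBridge()


def ask_user(question: str) -> str:
    # Stand-in for the tool body: it looks up bridge.handler at call time,
    # not at the time the tool was built.
    return bridge.ask(question)


before = ask_user("Which port should the server use?")

# Later, a front-end (or a test) registers its own handler on the same bridge.
bridge.handler = lambda question: "8080"
after = ask_user("Which port should the server use?")

print(before)
print(after)
```

The tool object never changes; only the bridge's `handler` attribute does, which is why the agent can be compiled before any front-end exists.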
42 changes: 42 additions & 0 deletions aai_cli/code_agent/banner.py
@@ -0,0 +1,42 @@
"""The `assembly code` startup splash — the ASSEMBLY wordmark + a short intro.

Rendered once at session start (in the TUI transcript and the headless REPL). The
wordmark is the ANSI-Shadow block font; built from a per-letter map so the rows stay
aligned without hand-editing one giant string. The accent is the AssemblyAI brand blue.
"""

from __future__ import annotations

from aai_cli.ui import theme

# The wordmark accent — the AssemblyAI brand blue (Cobalt 400), as a hex literal so it
# renders identically in Rich and Textual without our theme being loaded.
BRAND_HEX = theme.BRAND

# Intro copy, shared by both front-ends so the wording stays in one place.
READY_LINE = "Ready to code! What would you like to build?"
TIP_LINE = "Tip: approve tools as they run, or pass --auto to skip the prompts."

# Each glyph is six rows tall (ANSI-Shadow). Only the letters in "ASSEMBLY" are needed.
_LETTERS: dict[str, list[str]] = {
    "A": [" █████╗ ", "██╔══██╗", "███████║", "██╔══██║", "██║  ██║", "╚═╝  ╚═╝"],
    "S": ["███████╗", "██╔════╝", "███████╗", "╚════██║", "███████║", "╚══════╝"],
    "E": ["███████╗", "██╔════╝", "█████╗  ", "██╔══╝  ", "███████╗", "╚══════╝"],
    "M": ["███╗   ███╗", "████╗ ████║", "██╔████╔██║", "██║╚██╔╝██║", "██║ ╚═╝ ██║", "╚═╝     ╚═╝"],
    "B": ["██████╗ ", "██╔══██╗", "██████╔╝", "██╔══██╗", "██████╔╝", "╚═════╝ "],
    "L": ["██╗     ", "██║     ", "██║     ", "██║     ", "███████╗", "╚══════╝"],
    "Y": ["██╗   ██╗", "╚██╗ ██╔╝", " ╚████╔╝ ", "  ╚██╔╝  ", "   ██║   ", "   ╚═╝   "],
}
_ROWS = 6


def wordmark() -> list[str]:
    """The six plain rows of the ASSEMBLY block wordmark."""
    return [" ".join(_LETTERS[ch][row] for ch in "ASSEMBLY") for row in range(_ROWS)]


def version() -> str:
    """The CLI version string (e.g. ``v0.1.19``)."""
    from aai_cli import __version__

    return f"v{__version__}"
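Reviewer sketch of the per-letter row join used by `wordmark()`, with a tiny hypothetical three-row font (`GLYPHS`) in place of the ANSI-Shadow glyphs so the alignment property is easy to check by eye. The single-space separator matches what this diff view shows for the real function.

```python
from __future__ import annotations

# Hypothetical 3-row glyphs; each row of a glyph has the same width.
GLYPHS: dict[str, list[str]] = {
    "H": ["# #", "###", "# #"],
    "I": ["###", " # ", "###"],
}
ROWS = 3


def wordmark(text: str) -> list[str]:
    """Join row N of every glyph, for each N, so the letters sit side by side."""
    return [" ".join(GLYPHS[ch][row] for ch in text) for row in range(ROWS)]


rows = wordmark("HI")
for line in rows:
    print(line)
```

Since every glyph keeps a constant width across its rows, the joined rows all come out the same length, which is the alignment guarantee the module docstring relies on.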
87 changes: 87 additions & 0 deletions aai_cli/code_agent/cli_tool.py
@@ -0,0 +1,87 @@
"""Expose the AssemblyAI CLI to the agent as a tool.

The agent gets an ``assembly`` tool that runs *this* CLI as a subprocess
(``python -m aai_cli …``), so a coding task can transcribe a file, run an LLM
transform, list transcripts, etc. without the model hand-rolling shell quoting.

Secrets never ride argv (the project-wide rule): the resolved API key is injected
into the child's environment, never appended to the argument list, so it can't leak
into ``ps`` or the model's own transcript of the command it ran.
"""

from __future__ import annotations

import subprocess
import sys
from collections.abc import Callable
from typing import TYPE_CHECKING

from aai_cli.core import config, env

if TYPE_CHECKING:
    from langchain_core.tools import BaseTool

# The tool name the model calls and the approval flow gates on.
CLI_TOOL_NAME = "assembly"

# Cap captured output so a chatty command can't blow the model's context window.
_MAX_OUTPUT_CHARS = 20000
# Backstop so a hung command (e.g. a stuck network call) can't wedge the session.
_DEFAULT_TIMEOUT = 600

# A runner takes the CLI argument list and returns the combined, formatted output.
CliRunner = Callable[[list[str]], str]


def _truncate(text: str) -> str:
    """Clip captured output to the context-window budget, marking that we did."""
    if len(text) <= _MAX_OUTPUT_CHARS:
        return text
    return text[:_MAX_OUTPUT_CHARS] + "\n…[output truncated]"


def _format_result(proc: subprocess.CompletedProcess[str]) -> str:
    """Render a finished CLI run as text the model can read: exit code + both streams."""
    parts = [f"exit code: {proc.returncode}"]
    if proc.stdout:
        parts.append(f"stdout:\n{proc.stdout.rstrip()}")
    if proc.stderr:
        parts.append(f"stderr:\n{proc.stderr.rstrip()}")
    return _truncate("\n".join(parts))


def run_assembly(args: list[str], *, api_key: str, timeout: float = _DEFAULT_TIMEOUT) -> str:
    """Run ``assembly <args>`` as a subprocess and return its formatted output.

    Invoked as ``python -m aai_cli`` so it's the very CLI in use, independent of
    whatever ``assembly`` may (or may not) be on PATH. The key is passed through the
    environment, never argv.
    """
    proc = subprocess.run(
        [sys.executable, "-m", "aai_cli", *args],
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,
        env=env.child_env(**{config.ENV_API_KEY: api_key}),
        timeout=timeout,
        check=False,
    )
    return _format_result(proc)


def build_cli_tool(runner: CliRunner) -> BaseTool:
    """Wrap a :data:`CliRunner` as the ``assembly`` LangChain tool the agent can call.

    The runner is injected so the orchestration is tested without spawning a real
    subprocess; the command layer passes :func:`run_assembly` bound to the session's key.
    """
    from langchain_core.tools import tool

    @tool(CLI_TOOL_NAME)
    def assembly(arguments: list[str]) -> str:
        """Run the AssemblyAI CLI. Pass CLI arguments as a list of strings, e.g.
        ["transcribe", "audio.mp3", "--json"]. Returns the command's exit code and
        output. Do not include an API key — it is provided via the environment."""
        return runner(arguments)

    return assembly
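Reviewer sketch of the "secrets never ride argv" rule from the module docstring, using a throwaway child script instead of `python -m aai_cli`. The variable name `DEMO_API_KEY`, the helper `run_child`, and the `{**os.environ, ...}` merge are assumptions for illustration; the real code uses `config.ENV_API_KEY` and `env.child_env`, whose exact behavior is not shown in this diff.

```python
from __future__ import annotations

import os
import subprocess
import sys

# Hypothetical variable name for this sketch only.
ENV_KEY = "DEMO_API_KEY"

# Stand-in child program: reports its argv and only the length of the key it received.
CHILD_SCRIPT = (
    "import os, sys; "
    "print('argv:', sys.argv[1:]); "
    f"print('key length:', len(os.environ['{ENV_KEY}']))"
)


def run_child(
    args: list[str], *, api_key: str, timeout: float = 30.0
) -> subprocess.CompletedProcess[str]:
    """Run the child with the key in its environment and nothing secret in argv."""
    child_env = {**os.environ, ENV_KEY: api_key}
    return subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT, *args],
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,
        env=child_env,
        timeout=timeout,
        check=False,
    )


proc = run_child(["transcribe", "audio.mp3"], api_key="secret-value")
print(proc.stdout)
```

The child can read the key from its environment, yet the argument list that `ps` (or the model's own record of the command) would show contains only the CLI arguments.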