Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .importlinter
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ type = forbidden
; deliberate, so this short list does not drift the way the old core list did.
source_modules =
aai_cli.agent
aai_cli.agent_framework
aai_cli.agent_cascade
aai_cli.auth
aai_cli.code_gen
aai_cli.init
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
| `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
| `assembly dictate` | Push-to-talk dictation: press Enter to record, Enter again for instant text (Sync STT API, up to 120 s per utterance) |
| `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
| `assembly agent-framework` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-framework` starter (sandbox-only) |
| `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
| `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
| `assembly llm` | Prompt the LLM Gateway over a transcript, stdin, or a live stream |
| `assembly clip` | Cut audio/video with ffmpeg by diarized speaker, text match, LLM pick, or time range (`--video` keeps the picture for URL sources) — clip boundaries snap into nearby silence |
Expand Down
2 changes: 1 addition & 1 deletion REFERENCE.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,7 @@ each carrying a `"type"` field to dispatch on:
| ------- | ----------- |
| `assembly stream --json` | `begin`, `turn`, `termination` |
| `assembly agent --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
| `assembly agent-framework --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
| `assembly agent-cascade --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
| `assembly dictate --json` | `utterance` |
| `assembly llm --follow --json` | `answer` |
| `assembly transcribe <batch> --json` | `result` (one per source) |
Expand Down
2 changes: 1 addition & 1 deletion aai_cli/AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -151,7 +151,7 @@ heavily-reworked commands with long bodies; small commands keep the inline
- **`streaming/`** + `client.stream_audio` — v3 realtime API. Event callbacks run on the SDK reader thread and guard against `BrokenPipeError` (`stdio.silence_stdout()`) so a closed pipe never dumps a thread traceback.
- **`core/sync_stt.py`** + **`core/hotkey.py`** + `commands/dictate/` — `assembly dictate`: push-to-talk dictation over the **Sync STT API** (`Environment.sync_base`, one POST `/transcribe` per utterance with the required `X-AAI-Model: u3-sync-pro` header; 80 ms–120 s of PCM/WAV). `hotkey.TerminalKeys` scopes stdin into cbreak (Ctrl-C still signals) and reads single keypresses; `dictate_exec._record` polls it with a zero timeout between ~100 ms mic chunks. All three boundaries (keys, mic, HTTP) are injectable, so the suite never needs a real terminal — `tests/test_hotkey.py` drives a pty pair for the termios behavior.
- **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`).
- **`agent_framework/`** + `commands/agent_framework/` — `assembly agent-framework`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-framework` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker.
- **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker.
- **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`).
- **`code_gen/`** — backs `--show-code` on `transcribe`/`stream`/`agent`: builds a ready-to-run Python SDK script from exactly the flags passed (no API key needed; generated code reads `ASSEMBLYAI_API_KEY`).
- **`auth/`** — browser-assisted `assembly login` via AMS + **Stytch B2B OAuth discovery** (`discovery.py`, `flow.py`, `loopback.py`, `ams.py`). Not Stytch Connected Apps.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
"""The terminal *agent framework* slice: a client-orchestrated voice cascade.
"""The terminal *agent cascade* slice: a client-orchestrated voice cascade.

`assembly agent-framework` holds the same kind of live voice conversation as
`assembly agent-cascade` holds the same kind of live voice conversation as
`assembly agent`, but where `agent` talks to AssemblyAI's single Voice Agent
endpoint, this slice wires the three primitives together itself — Streaming STT
-> the LLM Gateway -> streaming TTS — exactly like the ``agent-framework``
-> the LLM Gateway -> streaming TTS — exactly like the ``agent-cascade``
``assembly init`` template does server-side. Because it uses streaming TTS it is
sandbox-only.

Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
"""Per-run configuration for the terminal voice cascade.

Defaults mirror the ``agent-framework`` ``assembly init`` template's
Defaults mirror the ``agent-cascade`` ``assembly init`` template's
``api/settings.py`` so the CLI conversation and the scaffolded app behave the
same out of the box.
"""
Expand All @@ -9,7 +9,7 @@

from dataclasses import dataclass

from aai_cli.agent_framework.voices import DEFAULT_VOICE
from aai_cli.agent_cascade.voices import DEFAULT_VOICE
from aai_cli.core import llm

DEFAULT_MODEL = llm.DEFAULT_MODEL
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,8 @@
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Protocol

from aai_cli.agent_framework.config import CascadeConfig
from aai_cli.agent_framework.text import split_sentences, trim_history
from aai_cli.agent_cascade.config import CascadeConfig
from aai_cli.agent_cascade.text import split_sentences, trim_history
from aai_cli.core import client, config_builder, llm
from aai_cli.core.errors import CLIError
from aai_cli.tts import session as tts_session
Expand Down Expand Up @@ -96,7 +96,7 @@ def _spawn_thread(target: Callable[[], None]) -> _Worker:
return thread


# The realtime model the cascade transcribes with (same as the agent-framework template).
# The realtime model the cascade transcribes with (same as the agent-cascade template).
STT_SPEECH_MODEL = "u3-rt-pro"


Expand Down Expand Up @@ -253,7 +253,7 @@ def _record_error(self, exc: CLIError) -> None:
since the worker thread can't surface an exit code itself."""
if self.error is None:
self.error = exc
output.error_console.print(f"[aai.warn]agent-framework:[/aai.warn] {exc.message}")
output.error_console.print(f"[aai.warn]agent-cascade:[/aai.warn] {exc.message}")

def shutdown(self) -> None:
"""Stop and join any in-flight reply worker (run on every exit path)."""
Expand Down
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
"""The voices `assembly agent-framework` speaks with.
"""The voices `assembly agent-cascade` speaks with.

The cascade's audio comes from streaming TTS, so its voices are the TTS catalog
(`aai_cli.tts.voices`) — not the Voice Agent voices `assembly agent` uses. This
Expand Down
2 changes: 1 addition & 1 deletion aai_cli/app/agent_shared.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
"""Run-logic shared by the two voice commands (`agent` and `agent-framework`).
"""Run-logic shared by the two voice commands (`agent` and `agent-cascade`).

Both build a live terminal conversation and resolve the persona the same way, so
the shared piece lives in the `app/` layer (the `doctor_checks`/`setup_exec`
Expand Down
2 changes: 1 addition & 1 deletion aai_cli/app/init_exec.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ def _active_env_vars() -> dict[str, str]:
"ASSEMBLYAI_STREAMING_HOST": env.streaming_host,
# Voice Agent host mirrors the streaming host's naming across environments.
"ASSEMBLYAI_AGENTS_HOST": env.streaming_host.replace("streaming", "agents", 1),
# Streaming-TTS host for the cascade (agent-framework) template. Empty in
# Streaming-TTS host for the cascade (agent-cascade) template. Empty in
# production, where streaming TTS has no host; that template then refuses to
# run and points at --sandbox.
"ASSEMBLYAI_TTS_HOST": env.streaming_tts_host,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,15 +5,15 @@
import typer

from aai_cli import command_registry, help_panels, options
from aai_cli.agent_framework import voices
from aai_cli.agent_framework.config import (
from aai_cli.agent_cascade import voices
from aai_cli.agent_cascade.config import (
DEFAULT_GREETING,
DEFAULT_MODEL,
DEFAULT_SYSTEM_PROMPT,
)
from aai_cli.agent_framework.voices import DEFAULT_VOICE
from aai_cli.agent_cascade.voices import DEFAULT_VOICE
from aai_cli.app.context import AppState, run_command, run_with_options
from aai_cli.commands.agent_framework import _exec as agent_framework_exec
from aai_cli.commands.agent_cascade import _exec as agent_cascade_exec
from aai_cli.core import choices, llm
from aai_cli.ui import output
from aai_cli.ui.help_text import examples_epilog
Expand All @@ -23,7 +23,7 @@
SPEC = command_registry.CommandModuleSpec(
panel=help_panels.TRANSCRIPTION,
order=45, # pragma: no mutate -- sparse rank; a +-1 shift is order-equivalent
commands=("agent-framework",),
commands=("agent-cascade",),
)


Expand All @@ -35,24 +35,24 @@ def _emit_voice_list(_state: AppState, json_mode: bool) -> None:


@app.command(
name="agent-framework",
name="agent-cascade",
rich_help_panel=help_panels.TRANSCRIPTION,
epilog=examples_epilog(
[
("Start a live cascade conversation", "assembly --sandbox agent-framework"),
("Start a live cascade conversation", "assembly --sandbox agent-cascade"),
(
"Pick a voice and opening line",
'assembly --sandbox agent-framework --voice michael --greeting "Hi there"',
'assembly --sandbox agent-cascade --voice michael --greeting "Hi there"',
),
(
"Give the agent a persona",
'assembly --sandbox agent-framework --system-prompt "You are a terse pirate."',
'assembly --sandbox agent-cascade --system-prompt "You are a terse pirate."',
),
("See available voices", "assembly --sandbox agent-framework --list-voices"),
("See available voices", "assembly --sandbox agent-cascade --list-voices"),
]
),
)
def agent_framework(
def agent_cascade(
ctx: typer.Context,
source: str | None = typer.Argument(
None, help="Audio file path or URL to speak to the agent. Omit to use the microphone."
Expand Down Expand Up @@ -97,9 +97,9 @@ def agent_framework(

Like 'assembly agent', but instead of AssemblyAI's Voice Agent endpoint this
wires the three primitives together itself — Streaming STT, the LLM Gateway,
and streaming TTS — exactly like the 'agent-framework' init template does
and streaming TTS — exactly like the 'agent-cascade' init template does
server-side. Because it uses streaming TTS it only runs in the sandbox: run
it as 'assembly --sandbox agent-framework' (--sandbox goes before the
it as 'assembly --sandbox agent-cascade' (--sandbox goes before the
subcommand).

Use headphones: the mic stays open while the agent speaks, so on speakers it
Expand All @@ -108,14 +108,14 @@ def agent_framework(
agent's reply.

This only runs a conversation in the terminal — it writes no code. To build
an agent-framework app, run 'assembly init agent-framework' instead.
an agent-cascade app, run 'assembly init agent-cascade' instead.
"""

if list_voices:
run_command(ctx, _emit_voice_list, json=json_out)
return

opts = agent_framework_exec.AgentFrameworkOptions(
opts = agent_cascade_exec.AgentCascadeOptions(
source=source,
sample=sample,
voice=voice,
Expand All @@ -126,4 +126,4 @@ def agent_framework(
device=device,
output_field=output_field,
)
run_with_options(ctx, agent_framework_exec.run_agent_framework, opts, json=json_out)
run_with_options(ctx, agent_cascade_exec.run_agent_cascade, opts, json=json_out)
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
"""Run logic for `assembly agent-framework`: the options/run split (see AGENTS.md).
"""Run logic for `assembly agent-cascade`: the options/run split (see AGENTS.md).

The command module parses argv into an ``AgentFrameworkOptions`` and hands it to
``run_agent_framework``, so tests drive validation and the cascade wiring by
The command module parses argv into an ``AgentCascadeOptions`` and hands it to
``run_agent_cascade``, so tests drive validation and the cascade wiring by
constructing options directly rather than round-tripping through ``CliRunner``.
"""

Expand All @@ -16,8 +16,8 @@

from aai_cli.agent.audio import SAMPLE_RATE, DuplexAudio, NullPlayer
from aai_cli.agent.render import AgentRenderer
from aai_cli.agent_framework import engine, voices
from aai_cli.agent_framework.config import CascadeConfig
from aai_cli.agent_cascade import engine, voices
from aai_cli.agent_cascade.config import CascadeConfig
from aai_cli.app.agent_shared import resolve_system_prompt as _resolve_system_prompt
from aai_cli.app.context import AppState
from aai_cli.core import choices, client
Expand All @@ -28,8 +28,8 @@


@dataclass(frozen=True)
class AgentFrameworkOptions:
"""Every `assembly agent-framework` conversation flag as plain data.
class AgentCascadeOptions:
"""Every `assembly agent-cascade` conversation flag as plain data.

``--list-voices`` is excluded: it dispatches to its own auth-free body in the
command module. ``--json`` is excluded: run_command resolves it into the
Expand Down Expand Up @@ -70,16 +70,16 @@ def _open_audio(
return duplex.mic, duplex.player, SAMPLE_RATE


def run_agent_framework(opts: AgentFrameworkOptions, state: AppState, *, json_mode: bool) -> None:
"""Execute one `assembly agent-framework` cascade from already-parsed flags."""
def run_agent_cascade(opts: AgentCascadeOptions, state: AppState, *, json_mode: bool) -> None:
"""Execute one `assembly agent-cascade` cascade from already-parsed flags."""
text_mode, json_mode = resolve_output_modes(opts.output_field, json_mode=json_mode)
if opts.voice not in voices.VOICE_NAMES:
raise UsageError(
f"Unknown voice {opts.voice!r}.",
suggestion="Run 'assembly agent-framework --list-voices' to see the options.",
suggestion="Run 'assembly agent-cascade --list-voices' to see the options.",
)
# Streaming TTS has no production host, so the whole cascade is sandbox-only.
tts_session.require_available("agent-framework")
tts_session.require_available("agent-cascade")
system_prompt_text = _resolve_system_prompt(opts.system_prompt, opts.system_prompt_file)

from_file = bool(opts.source) or opts.sample
Expand Down
6 changes: 3 additions & 3 deletions aai_cli/init/templates/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,15 +9,15 @@
"audio-transcription": "Audio Transcription",
"live-captions": "Live Captions",
"voice-agent": "Voice Agent",
"agent-framework": "Agent Framework",
"agent-cascade": "Agent Cascade",
}

# Display order for the picker and `--help`.
TEMPLATE_ORDER: tuple[str, ...] = (
"audio-transcription",
"live-captions",
"voice-agent",
"agent-framework",
"agent-cascade",
)


Expand All @@ -27,7 +27,7 @@
"audio-transcription": "Transcribe audio & video files, URLs, and YouTube — speaker labels and audio intelligence",
"live-captions": "Live real-time captions from your microphone over the Streaming API",
"voice-agent": "Full-duplex voice agent (speech in, LLM reply, speech out) via the Voice Agent API",
"agent-framework": "Cascaded voice agent you orchestrate: Streaming STT, the LLM Gateway, and sandbox TTS",
"agent-cascade": "Cascaded voice agent you orchestrate: Streaming STT, the LLM Gateway, and sandbox TTS",
}


Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Talk to a cascaded voice agent — AssemblyAI agent-framework starter
# Talk to a cascaded voice agent — AssemblyAI agent-cascade starter

Click connect and talk. Unlike the `voice-agent` template (which uses AssemblyAI's
all-in-one Voice Agent API), this app is a **cascade your own backend orchestrates**:
Expand All @@ -12,7 +12,7 @@ Streaming TTS has no production host, so the whole cascade runs against the Asse
sandbox with a sandbox key. Scaffold it that way:

```sh
assembly --sandbox init agent-framework
assembly --sandbox init agent-cascade
```

That pins the sandbox hosts in `.env`. Running against production exits with a hint.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
"""Server-side cascade orchestrator for the agent-framework template.
"""Server-side cascade orchestrator for the agent-cascade template.

The browser opens one WebSocket to FastAPI and the backend wires three AssemblyAI
primitives together — Streaming STT, the LLM Gateway, and streaming TTS — so every
Expand Down Expand Up @@ -63,7 +63,7 @@ def unavailable_reason(settings: _Settings) -> str | None:
if not settings.TTS_HOST:
return (
"Streaming TTS has no production host, so this cascade is sandbox-only. "
"Re-scaffold against the sandbox: assembly --sandbox init agent-framework."
"Re-scaffold against the sandbox: assembly --sandbox init agent-cascade."
)
return None

Expand Down
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
"""Talk to a cascaded voice agent — AssemblyAI agent-framework starter (FastAPI).
"""Talk to a cascaded voice agent — AssemblyAI agent-cascade starter (FastAPI).

The browser opens one WebSocket to this backend, which runs the cascade itself —
Streaming STT -> LLM Gateway -> streaming TTS — so your API key never reaches the
client. Streaming TTS is sandbox-only, so scaffold with `assembly --sandbox init
agent-framework` and use a sandbox key.
agent-cascade` and use a sandbox key.

WS /ws <- {"type":"input.audio","audio":<b64 pcm>} ; -> transcripts + reply.audio
"""
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
ASSEMBLYAI_API_KEY=your_assemblyai_api_key_here
# This cascade uses streaming TTS, which is sandbox-only — use a sandbox key and the
# sandbox hosts (assembly --sandbox init agent-framework fills these in for you):
# sandbox hosts (assembly --sandbox init agent-cascade fills these in for you):
# ASSEMBLYAI_STREAMING_HOST=streaming.sandbox000.assemblyai-labs.com
# ASSEMBLYAI_TTS_HOST=streaming-tts.sandbox000.assemblyai-labs.com
# ASSEMBLYAI_LLM_GATEWAY_URL=https://llm-gateway.sandbox000.assemblyai-labs.com/v1
Loading
Loading