Merged
18 changes: 12 additions & 6 deletions .github/workflows/ci.yml
@@ -132,19 +132,25 @@ jobs:
# (require_ffmpeg) before doing their work, so without it those tests fail at the
# probe rather than exercising the mocked run. PortAudio needs no install — the
# sounddevice wheel bundles it on Windows. choco ships on the runner but its download
# occasionally flakes (one matrix cell got ffmpeg, the other didn't), so retry and
# verify ffmpeg is callable here — a real miss fails this step instead of surfacing as
# confusing "ffmpeg not on PATH" test failures. The shim lands in choco's bin dir,
# already on the runner PATH, so later steps pick it up.
# from community.chocolatey.org doesn't just flake — it sometimes *hangs* for the whole
# job timeout, and a plain retry loop never gets to retry because the stuck attempt
# never returns. So bound each attempt with a hard timeout (Start-Job + Wait-Job): a
# hung download is killed and the next attempt retries, instead of wedging the cell
# until it's cancelled. The shim lands in choco's bin dir (machine-wide, already on the
# runner PATH), so the parent shell and later steps pick it up.
- name: System deps (ffmpeg)
shell: pwsh
run: |
$ErrorActionPreference = "Stop"
$env:PATH = "C:\ProgramData\chocolatey\bin;$env:PATH"
for ($i = 1; $i -le 3; $i++) {
choco install ffmpeg --no-progress -y
$job = Start-Job { choco install ffmpeg --no-progress -y }
if (Wait-Job $job -Timeout 240) { Receive-Job $job } else {
Stop-Job $job
Write-Host "choco install ffmpeg hung (attempt $i); killing and retrying…"
}
Remove-Job $job -Force
if (Get-Command ffmpeg -ErrorAction SilentlyContinue) { break }
Write-Host "ffmpeg not yet on PATH (attempt $i); retrying…"
Start-Sleep -Seconds 5
}
ffmpeg -version
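
The bounded-attempt pattern described in the comment above (kill a hung attempt after a hard timeout, then retry) can be sketched outside PowerShell as well. This is a minimal Python illustration, not the workflow's code: `run_with_bounded_retries` and its parameters are hypothetical names, and it treats a zero exit status as success, whereas the step above checks that `ffmpeg` is callable.

```python
import subprocess
import sys
from typing import Sequence


def run_with_bounded_retries(
    cmd: Sequence[str], attempts: int = 3, timeout_s: float = 240.0
) -> bool:
    """Run cmd up to `attempts` times; an attempt that exceeds timeout_s is killed.

    Returns True as soon as one attempt exits with status 0, else False.
    """
    for i in range(1, attempts + 1):
        try:
            # On timeout, subprocess.run kills the child, waits for it,
            # and then raises TimeoutExpired, so a hang cannot wedge the loop.
            result = subprocess.run(cmd, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            print(f"attempt {i} hung; killed, retrying", file=sys.stderr)
            continue
        if result.returncode == 0:
            return True
        print(f"attempt {i} exited {result.returncode}; retrying", file=sys.stderr)
    return False
```

The point of the timeout is the same as in the step above: a plain retry loop only helps when the failing attempt returns, so each attempt needs its own deadline.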
2 changes: 1 addition & 1 deletion README.md
@@ -46,7 +46,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
| :--- | :--- |
| `assembly transcribe` | Transcribe files, URLs, YouTube/podcast pages, podcast RSS feeds, directories, globs, or bucket storage (`s3://`, `gs://`, `az://`) — with speaker labels, PII redaction, summarization, SRT/VTT captions, and resumable batch runs |
| `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
| `assembly dictate` | Push-to-talk dictation: recording starts immediately, press Enter for instant text (Sync STT API, up to 120 s per utterance) |
| `assembly dictate` | Signal-driven dictation: records immediately, send SIGTERM for instant text — scriptable from hotkey tools like Hammerspoon (Sync STT API, up to 120 s per utterance) |
| `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
| `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
| `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
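
As a rough illustration of the signal-driven flow in the `assembly dictate` row, a script could launch the command, let it record, and then send SIGTERM to collect the transcript. This is a sketch under assumptions: `dictate_once` and `record_s` are hypothetical names, it assumes `assembly` is on `PATH` when called with the default command, and it is POSIX-only because `Popen.terminate()` sends SIGTERM there but not on Windows.

```python
import subprocess
import time
from typing import Sequence


def dictate_once(cmd: Sequence[str] = ("assembly", "dictate"), record_s: float = 5.0) -> str:
    """Launch cmd, wait record_s seconds, send SIGTERM, and return its stdout."""
    proc = subprocess.Popen(list(cmd), stdout=subprocess.PIPE, text=True)
    time.sleep(record_s)  # the process records while we wait
    proc.terminate()  # SIGTERM on POSIX: asks the process to finish the utterance
    out, _ = proc.communicate(timeout=30)
    if proc.returncode != 0:
        raise RuntimeError(f"command exited with status {proc.returncode}")
    return out.strip()
```

A hotkey tool would replace the fixed `record_s` wait with a key-release event; the fixed delay here only keeps the sketch self-contained.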
8 changes: 4 additions & 4 deletions aai_cli/AGENTS.md
@@ -32,9 +32,9 @@ between layers is enforced — higher may import lower, never the reverse:
`help_text`, `typer_patches`, `update_check`.
- **`core/`** — the Rich-free library layer: `client`, `config`,
`config_builder`, `keyring_store`, `environments`, `env`, `errors`, `llm`,
`telemetry`, `debuglog`, `remotefs`, `sync_stt`, `hotkey`, `ws`, `youtube`,
`telemetry`, `debuglog`, `remotefs`, `sync_stt`, `signals`, `ws`, `youtube`,
`wer`, `argscan`, `jsonshape`, `timeparse`, `microphone`, `procs`, `stdio`,
`choices`, `locking`, `config_lock`. Contract 4 also forbids `rich` here, so
`choices`. Contract 4 also forbids `rich` here, so
"no Rich below the UI layer" is structural.

Three things sit *beside* the stack, intentionally unlisted in the layers
@@ -139,7 +139,7 @@ heavily-reworked commands with long bodies; small commands keep the inline
### Cross-cutting state (resolution order matters)

- **`app/context.py`** — `AppState` (profile, env) is attached to the Typer context in the root `@app.callback()`. `run_command` is the standard command wrapper.
- **`core/config.py`** — profiles persisted in `config.toml` (via `platformdirs`); the **API key lives only in the OS keyring**, never in a dotfile. The keyring access itself is factored into **`core/keyring_store.py`** (the single importer of `keyring`, holding `KEYRING_SERVICE = "assemblyai-cli"` + `set_secret`/`get_secret`/`restore_secret`/`delete_secret`/`usable`), so the "secrets never touch the dotfile" split is structural; `config` reads/writes secrets through it and only `config.keyring_usable` re-surfaces the probe on the auth facade. Key resolution order: `--api-key` flag (validation paths only) → `ASSEMBLYAI_API_KEY` env → keyring. **Run commands deliberately expose no `--api-key` flag** so keys can't leak into `ps`/shell history. Every `config.toml` write is a read-modify-write (`_load` → mutate → `_dump`): `_dump` is a temp-file + atomic `os.replace` (a reader never sees a torn file), and the whole RMW runs under a cross-process `filelock` (`config_lock.update`/`.locked`, built on `core/locking.py`) so two concurrent `assembly` processes can't lose each other's updates. Readers stay lock-free. The lock helpers live in `config_lock.py` (not `config.py`) only to keep the latter under the file-length gate; reuse one cached `FileLock` per path so nested writers (`persist_login`) stay reentrant.
- **`core/config.py`** — profiles persisted in `config.toml` (via `platformdirs`); the **API key lives only in the OS keyring**, never in a dotfile. The keyring access itself is factored into **`core/keyring_store.py`** (the single importer of `keyring`, holding `KEYRING_SERVICE = "assemblyai-cli"` + `set_secret`/`get_secret`/`restore_secret`/`delete_secret`/`usable`), so the "secrets never touch the dotfile" split is structural; `config` reads/writes secrets through it and only `config.keyring_usable` re-surfaces the probe on the auth facade. Key resolution order: `--api-key` flag (validation paths only) → `ASSEMBLYAI_API_KEY` env → keyring. **Run commands deliberately expose no `--api-key` flag** so keys can't leak into `ps`/shell history. Every `config.toml` write is a read-modify-write (`_load` → mutate → `_dump`) via the `config._update` context manager: `_dump` is a temp-file + atomic `os.replace`, so a reader never sees a torn file. Writers and readers are otherwise unsynchronized — last write wins (there is **no** cross-process lock; an earlier `filelock`-based serialization was removed because it was a recurring Windows CI flake and the lost-update race it closed isn't worth the cost for a single-user CLI). On Windows the atomic replace has no replace-over-open guarantee, so both the lock-free read and the `os.replace` ride out the transient `PermissionError` through `config._retry_on_sharing_violation` (a no-op on POSIX).
- **`core/environments.py`** — a frozen `Environment` (api_base, streaming_host, llm_gateway_base, ams_base, stytch_*). `DEFAULT_ENV` is **`production`**; use `--sandbox` (or `--env sandbox000` / `AAI_ENV`) to target the sandbox. The active environment is a process-global set once at startup; precedence: `--env` → `AAI_ENV` → profile's stored env → default. A credential is only valid against the environment that minted it.
- **`core/client.py`** — thin wrappers over the `assemblyai` SDK (`transcribe`, `list_transcripts`, `stream_audio`, etc.). It normalizes SDK exceptions: auth failures become a single clean `auth_failure()` `CLIError`; everything else becomes `APIError`. New SDK calls should follow this try/except shape.
- **`core/errors.py`** — the `CLIError` hierarchy (each with `error_type` + `exit_code`). `ui/output.py` emits errors to **stderr**; stdout stays clean for pipelines. `--json` switches to machine-readable output; it is never auto-enabled — `output.resolve_json()` deliberately keeps human text the default even when piped or agent-run.
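
The write path described for `core/config.py` (temp file, atomic `os.replace`, and a retry that rides out a transient `PermissionError`) might look roughly like the sketch below. The names `retry_on_permission_error` and `atomic_write_text` are hypothetical stand-ins, not the module's actual helpers, and unlike the described `_retry_on_sharing_violation` this version retries on every platform rather than being a no-op on POSIX.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_permission_error(op: Callable[[], T], attempts: int = 5, delay_s: float = 0.05) -> T:
    """Call op(); on PermissionError, sleep and retry, re-raising on the last attempt."""
    for attempt in range(attempts):
        try:
            return op()
        except PermissionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay_s)
    raise ValueError("attempts must be at least 1")


def atomic_write_text(path: Path, text: str) -> None:
    """Write text to a temp file in the same directory, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace is atomic on the same filesystem, so readers see either
        # the old file or the new one, never a partially written file.
        retry_on_permission_error(lambda: os.replace(tmp_name, path))
    except BaseException:
        # Best-effort cleanup of the temp file if the write or swap failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Without a cross-process lock, two concurrent read-modify-write cycles can still overwrite each other; the sketch only guarantees that no reader observes a torn file, which matches the trade-off stated above.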
@@ -149,7 +149,7 @@ heavily-reworked commands with long bodies; small commands keep the inline
### Feature subsystems

- **`streaming/`** + `client.stream_audio` — v3 realtime API. Event callbacks run on the SDK reader thread and guard against `BrokenPipeError` (`stdio.silence_stdout()`) so a closed pipe never dumps a thread traceback.
- **`core/sync_stt.py`** + **`core/hotkey.py`** + `commands/dictate/` — `assembly dictate`: push-to-talk dictation over the **Sync STT API** (`Environment.sync_base`, one POST `/transcribe` per utterance with the required `X-AAI-Model: u3-sync-pro` header; 80 ms–120 s of PCM/WAV). `hotkey.TerminalKeys` scopes stdin into cbreak (Ctrl-C still signals) and reads single keypresses; `dictate_exec._record` polls it with a zero timeout between ~100 ms mic chunks. All three boundaries (keys, mic, HTTP) are injectable, so the suite never needs a real terminal — `tests/test_hotkey.py` drives a pty pair for the termios behavior.
- **`core/sync_stt.py`** + **`core/signals.py`** + `commands/dictate/` — `assembly dictate`: headless dictation over the **Sync STT API** (`Environment.sync_base`, one POST `/transcribe` per utterance with the required `X-AAI-Model: u3-sync-pro` header; 80 ms–120 s of PCM/WAV). It needs no terminal: recording starts immediately and `dictate_exec._record` polls `signals.stop_on_terminate` between ~100 ms mic chunks for a SIGTERM, which finishes the utterance (clean exit 0). A hotkey tool like Hammerspoon can therefore launch it as a background task and send `kill -TERM` (or call `task:terminate()`) to transcribe. SIGINT (Ctrl-C) still cancels (exit 130). All three boundaries (the stop latch, mic, HTTP) are injectable, so the suite never needs a real signal or microphone (`tests/test_dictate_exec.py` scripts the SIGTERM latch). Contrast `signals.terminate_as_interrupt` (used by `stream`/`agent`/`speak`), which routes SIGTERM into the *cancel* path instead.
- **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`).
- **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker.
- **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`).
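
The SIGTERM stop latch described in the `assembly dictate` entry above can be illustrated with a small sketch. The names `stop_latch_on_sigterm` and `record_until_stopped` are hypothetical and are not the signatures of `signals.stop_on_terminate` or `dictate_exec._record`; the sketch only shows the general shape: a handler sets a flag, and the record loop checks that flag between chunks through an injectable callable.

```python
import signal
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def stop_latch_on_sigterm() -> Iterator[threading.Event]:
    """Yield an Event that is set when SIGTERM arrives; restore the old handler on exit.

    Must be entered from the main thread, because signal.signal requires it.
    """
    stop = threading.Event()
    previous = signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    try:
        yield stop
    finally:
        signal.signal(signal.SIGTERM, previous)


def record_until_stopped(
    read_chunk: Callable[[], bytes],
    should_stop: Callable[[], bool],
    max_chunks: int,
) -> bytes:
    """Collect chunks until should_stop() is true or max_chunks is reached."""
    chunks: List[bytes] = []
    # The latch is polled once before each chunk, so a stop request is
    # noticed within roughly one chunk duration.
    while len(chunks) < max_chunks and not should_stop():
        chunks.append(read_chunk())
    return b"".join(chunks)
```

In use, the loop would run as `record_until_stopped(mic.read, stop.is_set, max_chunks=1200)` inside `with stop_latch_on_sigterm() as stop:`, where `mic` is a hypothetical source and 1200 is simply 120 s divided by 100 ms chunks. Passing `should_stop` as a plain callable is what lets a test script the stop condition without delivering a real signal.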
20 changes: 13 additions & 7 deletions aai_cli/commands/dictate/__init__.py
@@ -22,7 +22,11 @@
rich_help_panel=help_panels.TRANSCRIPTION,
epilog=examples_epilog(
[
("Dictate one utterance: recording starts, Enter transcribes it", "assembly dictate"),
("Record until SIGTERM, then print the transcript", "assembly dictate"),
(
"Stop the recording and transcribe (e.g. from a hotkey tool)",
"kill -TERM $(pgrep -f 'assembly dictate')",
),
(
"Pipe the utterance into another command",
'assembly dictate | assembly llm "write a conventional commit"',
@@ -75,13 +79,15 @@ def dictate(
help="Output mode: text (the bare transcript per utterance, pipe-friendly) or json",
),
) -> None:
"""Push-to-talk dictation: record the mic, get the transcript back
"""Signal-driven dictation: record the mic, get the transcript back

Recording starts immediately; press Enter (or Space) to stop and the
utterance is sent to the AssemblyAI Sync API — the transcript prints right
away (no polling) and dictate exits, so it flows straight to the next
command in a pipe. The recording can be up to 120 seconds long. Press
Ctrl-C to cancel without transcribing.
Recording starts immediately and runs headless — no terminal needed — so a
hotkey tool like Hammerspoon can launch it as a background task and send
SIGTERM (kill -TERM, task:terminate()) to stop. On SIGTERM the utterance is
sent to the AssemblyAI Sync API, the transcript prints right away (no
polling), and dictate exits, so it flows straight to the next command in a
pipe. The recording can be up to 120 seconds long. Ctrl-C (SIGINT) cancels
without transcribing.
"""
opts = dictate_exec.DictateOptions(
language=language,