Add AssemblyAI CLI (aai) #1
Merged
Initial implementation of the `aai` onboarding CLI for AssemblyAI.

Commands:
- login: store the API key via keyring
- transcribe: file/sample transcription with a runnable code template
- transcripts: list/get past transcripts
- samples: scaffold key-injected sample scripts
- stream: real-time transcription from a file or microphone
- agent: live two-way voice conversation with an AssemblyAI voice agent
- claude: wire Claude Code to AssemblyAI's docs MCP server + skill

Built on typer + rich; packaged with hatchling. Includes a full pytest suite (157 tests) plus ruff and pre-commit configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI:
- scripts/check.sh runs ruff (lint + format), mypy, and pytest
- .github/workflows/ci.yml runs it on every PR and push to main

Type safety:
- generic output.emit(); typed audio stream, ffmpeg stdout, stream source
- [tool.mypy] config (ignore_missing_imports); mypy is now green

Robustness:
- clean "not authenticated" errors across all commands: detect rejected keys (incl. the Voice Agent's 1008 policy-violation close) and raise NotAuthenticated instead of a raw protocol/APIError
- `aai claude install` no longer hangs: detach child stdin, add a timeout, and pass `npx -y` so an invisible prompt can't block forever

Install + docs:
- install.sh for `curl -fsSL .../install.sh | sh` (pipx/pip, no clone)
- README rewritten with the curl install as the top path
- remove DEMO.md; gitignore .claude/, .mypy_cache/, docs/

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- python -m assemblyai_cli entrypoint
- client.py auth/error branches (get/transcribe/stream -> NotAuthenticated vs APIError)
- session.py should_send_audio gate and _send_audio_loop (forward/drop/stop-on-error)
- renderer human-mode lines + broken-pipe write swallowing
- human-mode command paths (transcripts table, agent half-duplex notice, stream Ctrl-C, interactive login with/without a working browser)
- MicCapture closes a closeable stream

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Quality gates (CI runs all of these on every PR):
- coverage gate: pytest-cov with --cov-fail-under=90 (currently 96%)
- broader ruff rules: BLE, C4, SIM, RET, PTH, ARG, S, RUF (+ fixes)
- stricter mypy: disallow_untyped_defs, warn_unused_ignores, warn_return_any, no_implicit_optional (fully annotated)
- new CI jobs: `pre-commit run --all-files` (hooks can't drift) and `python -m build` + `twine check` (package always builds)

Features / UX:
- `aai samples create agent` scaffolds a runnable voice-agent script
- bundle PyAudio as a core dependency (mic/speaker work out of the box; drop the [mic] extra); CI installs portaudio19-dev for the build
- groups (`aai claude`, `aai samples`) print their subcommands instead of "Missing command" (no_args_is_help)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- extract BaseRenderer (assemblyai_cli/render.py) for the shared NDJSON + in-place-line plumbing; AgentRenderer/StreamRenderer now only map their own events. stopped() is shared, so `aai stream` Ctrl-C reuses it.
- unify the two near-identical mic classes (MicSource + MicCapture) into a single MicrophoneSource (assemblyai_cli/microphone.py) used by both `stream` and `agent`; consolidate their tests into test_microphone.py.

Behavior is unchanged (outputs preserved); ~30 fewer lines, 96% coverage.

Note on Rich: transcripts already uses rich.table; evaluated rich.Live for the live transcript line but it complicates the JSON/threaded paths and testability without a real simplification, so kept the lightweight helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Correctness:
- renderer _write re-raises BrokenPipeError so `aai stream/agent --json | head` stops cleanly again (the dedup had swallowed it); agent handles it like stream
- is_auth_failure no longer matches bare numbers (401/403/1008) anywhere in a message; those caused valid keys/real errors to be reported as "key rejected". Voice Agent 1008 and pre-upgrade HTTP 401/403 are now detected structurally (close code / status code) in session.py
- mic-open failures in the agent's daemon capture thread are now surfaced to the user (clean CLIError/exit) instead of vanishing with a hung session
- ffmpeg no longer SIGTERM'd after natural EOF, removing a spurious "exit -15" decode error on fully-streamed files
- validate_key reuses the shared is_auth_failure (catches forbidden/403)

Cleanup:
- shared status_str() for transcript status (was copy-pasted 3x)
- shared pyaudio_missing_error() (was duplicated with divergent wording)
- claude.py honors CLAUDE_CONFIG_DIR; Step TypedDict removes the type: ignore

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
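
The structural detection described in this commit can be illustrated with a minimal sketch. It is not the code in session.py: the `Failure` container, the `looks_like_rejected_key` name, and the two constant sets are assumptions made for illustration. The point it shows is that the decision reads only the status code and close code, so a number that happens to appear in an error message can never trigger a false "key rejected".

```python
from dataclasses import dataclass
from typing import Optional

# Assumed for illustration: codes treated as "the key was rejected".
AUTH_HTTP_STATUSES = frozenset({401, 403})
AUTH_WS_CLOSE_CODES = frozenset({1008})  # WebSocket policy-violation close


@dataclass(frozen=True)
class Failure:
    """Hypothetical container for what a failed call reported."""

    message: str
    http_status: Optional[int] = None
    ws_close_code: Optional[int] = None


def looks_like_rejected_key(failure: Failure) -> bool:
    """Decide from structured fields only; the message text is never scanned."""
    if failure.http_status in AUTH_HTTP_STATUSES:
        return True
    return failure.ws_close_code in AUTH_WS_CLOSE_CODES
```

With this shape, `Failure("decoded 401 frames")` is not an auth failure, while `Failure("closed", ws_close_code=1008)` is.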
Verified against the real `skills` CLI (vercel-labs/skills):
- `skills add` auto-selects PROJECT scope when run inside a project, so a bare `skills add` from a repo never reached ~/.claude/skills; hence install said "installed" while status said "not_installed". Pass --global (+ --yes) to pin user scope, matching where status looks. Skill name confirmed: "assemblyai".
- the skill is symlinked into ~/.claude/skills from the skills store, so shutil.rmtree couldn't remove it; `_remove_skill` now shells out to `npx skills remove assemblyai --global` (and verifies it's gone).

End-to-end: `aai claude install` → `status` → `remove` now agree on a real machine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`aai stream` now mirrors `aai transcribe`:
- `--sample` streams the same hosted wildfires.mp3 clip
- a positional source can be a local file OR an http(s) URL (decoded via ffmpeg, which reads URLs natively; verified it yields 16k mono PCM from the sample URL)

Shared the source-resolution logic in client.resolve_audio_source() so transcribe and stream don't duplicate the --sample / "provide a path or URL" handling. FileSource grew a URL branch (skips the local is_file/WAV fast-path, always ffmpeg).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PyAudio ships no macOS/Linux wheels, so a fresh install had to compile it from source against PortAudio headers (brew install portaudio + a compiler). sounddevice bundles PortAudio in its macOS/Windows wheels, so `pip install` now works with zero system dependencies on those platforms; Linux needs only the libportaudio2 runtime (no headers/compiler).

- microphone.py: replace SDK PyAudio MicrophoneStream with a sounddevice RawInputStream iterator (_SoundDeviceMic); rename pyaudio_missing_error -> audio_missing_error
- agent/audio.py: Player uses sounddevice RawOutputStream; simpler teardown
- pyproject: pyaudio -> sounddevice; agent.py.tmpl + README updated
- tests: cover _SoundDeviceMic, both factories, and missing-dep/device-failure branches (coverage 97.4%)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Supply-chain:
- pin GitHub Actions to commit SHAs (a moved tag can't change what runs)
- add least-privilege `permissions: contents: read` to the workflow
- new `pip-audit` CI job that fails on known dependency CVEs (deps are clean today; documented the --ignore-vuln escape hatch for unfixable transitives)
- .github/dependabot.yml to keep pip deps and the pinned Actions current
- fix stale "PyAudio" CI label left by the sounddevice migration

Tests:
- branch coverage gate (pytest --cov-branch, still >=90; currently 96%)
- add hypothesis property tests: NDJSON renderers preserve arbitrary text (quotes/newlines/unicode), and WAV chunking is byte-exact and bounded

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
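
The "byte-exact and bounded" property for chunking can be sketched without any third-party dependency. The `chunk_pcm` and `check_chunking` names below are hypothetical and are not the project's chunker or its hypothesis test; the sketch only spells out the two invariants the commit names, checked here for a single input rather than generated ones.

```python
from typing import Iterator


def chunk_pcm(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `data`, each at most `chunk_size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def check_chunking(data: bytes, chunk_size: int) -> None:
    """Assert the two invariants for one (data, chunk_size) pair."""
    chunks = list(chunk_pcm(data, chunk_size))
    # Byte-exact: concatenating the chunks reproduces the input unchanged.
    assert b"".join(chunks) == data
    # Bounded: no chunk is empty and none exceeds the requested size.
    assert all(0 < len(chunk) <= chunk_size for chunk in chunks)
```

A property-based test would call `check_chunking` with generated byte strings and positive chunk sizes instead of hand-picked ones.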
Drive the real `aai` CLI as a subprocess against the live AssemblyAI API, synthesizing speech locally with kokoro TTS. Marked `e2e`, they skip when the API key / kokoro / numpy is unavailable, so CI and keyless contributors are never blocked. A new precommit `pytest-e2e` hook runs them; the default unit run and coverage gate exclude them via `-m "not e2e"`. To make the agent drivable, `aai agent` now accepts a positional source / --sample (mirroring stream/transcribe): it streams a clip as the user's speech via a NullPlayer (headless), suppresses the greeting, runs full-duplex so nothing is muted, waits for session.ready before streaming, and exits after the agent's first reply. Live-mic behavior is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a --prompt switch to transcribe and stream that transforms the transcript through AssemblyAI's LLM Gateway, plus a standalone `aai llm` command. The gateway is OpenAI-compatible and has no assemblyai-SDK client for the synchronous endpoint, so we talk to it via the openai SDK pointed at https://llm-gateway.assemblyai.com/v1 (Bearer auth, transcript_id injection).

- transcribe --prompt: transforms the finished transcript server-side via the transcript id ({{ transcript }} injection). Human prints the transform only; --json keeps raw text + transform{model,prompt,output}. Conflicts with --srt/--vtt.
- stream --prompt: accumulates finalized turns, then runs one transform on the full transcript when the stream ends (native per-turn streaming gateway is not provisioned for general accounts, so we don't rely on it).
- aai llm: prompt the gateway directly, with --transcript-id injection, --model/--system/--max-tokens, and --list-models.

Adds openai>=1.40. Unit suite covers the gateway client, both switches, and the new command (coverage gate green); e2e tests exercise aai llm, transcribe --prompt, and stream --prompt against the live API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…plex agent audio
CLI surface & consistency
- transcribe/stream: split prompts — `--prompt` biases the speech model (u3-pro),
`--llm-gateway-prompt` runs the LLM Gateway transform. Drop transcribe `--srt`/`--vtt`.
- agent: rename `--prompt`/`--prompt-file` -> `--system-prompt`/`--system-prompt-file`
(aligns with the speech-prompt meaning); drop `--full-duplex` (always full-duplex now).
- Group `transcripts` under `aai transcripts get/list` (was top-level `get`/`list`).
- transcribe `--speaker-labels` now actually renders diarized utterances (Speaker A: …)
in human mode and adds `utterances` to --json (was silently dropped).
YouTube
- `aai transcribe/stream <youtube-url>` downloads best audio via yt-dlp to a temp file,
then transcribes/streams it (new `youtube.py`). Adds yt-dlp dependency.
LLM Gateway
- Talk to the gateway through the OpenAI SDK (OpenAI-compatible) with transcript_id
injection; switch the dependency from httpx to openai.
Rendering
- Replace the hand-rolled ANSI renderer with rich.live.Live (fixes long-line wrap
cascade); JSON/NDJSON path unchanged.
Audio robustness (macOS)
- Capture at the device's native rate and resample (mic + speaker), instead of forcing
an unsupported rate.
- agent: drive mic + speaker through ONE full-duplex sd.RawStream (DuplexAudio). macOS
AUHAL refuses two concurrent streams ("cannot do in current context", err -50), which
silently killed capture — recording now works. Live agent prompts to use headphones.
- stream: tidy ffmpeg generator teardown so a stray Ctrl-C in cleanup doesn't surface as
"Exception ignored in generator".
Tooling
- check.sh + mypy now cover tests/ (tests is a package; pytest fns exempt from
annotation rule); add markdownlint (docs/ excluded) and a .markdownlint.json.
- Repo squash-only is configured on GitHub.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`uv lock` resolves the PEP 621 deps in pyproject.toml into uv.lock (hatchling stays the build backend; uv only manages resolution/locking). Use `uv sync` to install from the lock and `uv lock` to refresh it. Also gitignore poetry.lock — this project doesn't use poetry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ts list Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "Listening…" notice fired on the streaming session's Begin event, before the mic device was lazily opened — so early speech fell into the gap. Gate it on the mic actually opening via a MicrophoneSource on_open callback, and emit it only for live mic input (file/URL/YouTube sources show no notice). For the voice agent, suppress the "start talking" prompt on file-driven runs, which have no mic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Group top-level commands as core transcription → voice/LLM → account → tooling, with version last, via an OrderedGroup that controls list_commands (Typer otherwise renders direct commands before sub-typer groups). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Typer >=0.13 vendors its own click; type list_commands with typer's vendored Context and drop the now-unneeded real click dependency (clean installs were crashing: main.py imported click but typer no longer pulls it in).
- assemblyai transcription models are pydantic v1 (__fields__/.dict); make the config-builder tests version-agnostic across v1/v2.
- Update smoke test to assert TyperGroup; drop a stale type: ignore in conftest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
assemblyai's transcription models are pydantic v1 (no model_dump); 'aai transcripts list' would have crashed at runtime. Serialize via .json() when model_dump is absent. Pin the v1 path with a test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
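
The fallback described here (use `.json()` when `model_dump` is absent) can be sketched as a small version-agnostic helper. The `to_plain_dict` name is hypothetical and this is not the project's serializer; it only assumes that a pydantic v2 model exposes `model_dump` and a v1 model exposes `.json()`.

```python
import json
from typing import Any


def to_plain_dict(model: Any) -> dict[str, Any]:
    """Serialize a pydantic-style model to JSON-safe data on either major version."""
    dump = getattr(model, "model_dump", None)
    if callable(dump):
        # pydantic v2: mode="json" converts enums, datetimes, etc. to JSON-safe values.
        return dump(mode="json")
    # pydantic v1: there is no model_dump, so round-trip through the JSON string.
    return json.loads(model.json())
```

Round-tripping through `.json()` rather than calling `.dict()` keeps the v1 output JSON-safe, at the cost of one extra parse.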
Bare tools picked up whatever was on PATH (system Python), which diverged from the locked venv and masked breakages. Run the Python gate through uv run so CI and local match pyproject + uv.lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… stream
- transcribe --show-code now reflects --llm-gateway-prompt: the generated script
runs the prompt over the transcript via the OpenAI-compatible LLM Gateway
(transcript_id extra_body + {{ transcript }} tag), replacing analysis output as
the CLI does.
- Generated agent script now uses ONE sd.RawStream (mic+speaker) with device-rate
resampling, matching the DuplexAudio fix — two separate streams crash on macOS
CoreAudio (PaMacCore -50).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l colors
- --llm-gateway-prompt is now repeatable: each prompt runs on the prior response (the first on the transcript), with per-step human/JSON output and matching --show-code generation.
- Give "you:" and "agent:" distinct theme colors so the two speakers are easy to tell apart at a glance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-show-code Each --llm-gateway-prompt now runs on the previous prompt's response (the first over the transcript), printing every step labeled. --show-code emits the equivalent chained loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
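
The chaining rule (each prompt runs on the previous response, the first on the transcript) amounts to a simple fold. The sketch below is illustrative only: `run_chain` and the injected `complete` callable are hypothetical stand-ins, with `complete(prompt, text)` taking the place of whatever call reaches the LLM Gateway.

```python
from typing import Callable


def run_chain(
    transcript: str,
    prompts: list[str],
    complete: Callable[[str, str], str],
) -> list[dict[str, str]]:
    """Run prompts in order, feeding each step's output into the next step."""
    steps: list[dict[str, str]] = []
    current = transcript  # the first prompt sees the transcript itself
    for prompt in prompts:
        output = complete(prompt, current)
        steps.append({"prompt": prompt, "output": output})
        current = output  # later prompts see only the prior response
    return steps
```

Returning every step, not just the last output, is what makes per-step labeled output possible.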
… uv run Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The generated agent script resampled via audioop, which warns on 3.12 and is removed in 3.13. The agent's audio is already 24 kHz and the device accepts it, so open one full-duplex sd.RawStream at 24 kHz with no resampling — simpler, no deprecation warning, and portable to 3.13. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
audioop (used by the PCM resampler in microphone.py / agent audio) left the stdlib in 3.13; add the audioop-lts backport under a python_version >= 3.13 marker so live stream/agent audio keeps working there. Generated --show-code samples don't use audioop at all (the agent script runs at the native 24 kHz). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cat notes.txt | aai llm 'summarize' (and aai transcribe -o text | aai llm '...') now work — stdin text is injected into the prompt. An explicit --transcript-id still takes priority. Adds a small stdio.piped_stdin_text() helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
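
A helper of this kind might look like the following sketch. The signature, the optional `stream` parameter, and the `build_prompt` function are assumptions for illustration, not the contents of the project's stdio module; the sketch shows only the core idea of reading stdin when it is not a terminal and appending that text to the prompt.

```python
import sys
from typing import Optional, TextIO


def piped_stdin_text(stream: Optional[TextIO] = None) -> Optional[str]:
    """Return text piped on stdin, or None when stdin is a terminal or empty."""
    source = sys.stdin if stream is None else stream
    if source is None or source.isatty():
        # Interactive (or missing) stdin: do not block waiting for input.
        return None
    text = source.read().strip()
    return text or None


def build_prompt(prompt: str, piped: Optional[str]) -> str:
    """Append piped text to the prompt when there is any."""
    return f"{prompt}\n\n{piped}" if piped else prompt
```

Checking `isatty()` first matters: calling `read()` on an interactive terminal would hang until the user sent EOF.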
…es|json) Print a single field raw to stdout for pipelines — no jq needed for the common case. e.g. aai transcribe x.mp3 -o text | aai llm 'summarize'. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- `aai transcribe -` buffers piped bytes to a temp file the SDK uploads (e.g. `ffmpeg -i v.mp4 -f wav - | aai transcribe -`).
- `aai stream -` streams raw PCM16 mono from stdin via a new StdinSource (e.g. `ffmpeg … -f s16le - | aai stream -`); `--sample-rate` sets the rate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Piped stream defaulted to NDJSON, so 'aai stream | aai llm' fed JSON to the
model. 'aai stream -o text' now prints only finalized turn transcripts as plain
stdout lines, with status ('Listening…') on stderr — clean for pipelines.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nto aai llm) Mirrors 'stream -o text': finalized transcript lines to stdout, status to stderr, so 'aai agent -o text | aai llm "summarize the call"' works. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pipe When the downstream consumer (a Ctrl-C'd '| aai llm', or '| head') closes the pipe, the Turn callback raised BrokenPipeError on the SDK's reader thread with no handler — dumping a thread traceback. Guard the event callbacks to swallow it and point stdout at /dev/null so the exit-time flush can't re-raise either; the main thread still stops on Ctrl-C / source EOF. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
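
The guard described in this commit can be sketched as a small wrapper around an event callback. The names `guard_broken_pipe` and `silence_stdout` and the injectable `on_broken_pipe` hook are hypothetical; the redirect itself follows the approach the Python documentation suggests for SIGPIPE, pointing the stdout descriptor at the null device so the exit-time flush has nowhere to fail.

```python
import os
import sys
from typing import Any, Callable


def silence_stdout() -> None:
    """Point the stdout file descriptor at the null device."""
    try:
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        os.close(devnull)
    except (OSError, ValueError, AttributeError):
        # stdout is missing or not backed by a real descriptor; nothing to redirect.
        pass


def guard_broken_pipe(
    callback: Callable[..., Any],
    on_broken_pipe: Callable[[], None] = silence_stdout,
) -> Callable[..., None]:
    """Wrap a callback so a closed downstream pipe cannot raise out of it."""

    def wrapper(*args: Any, **kwargs: Any) -> None:
        try:
            callback(*args, **kwargs)
        except BrokenPipeError:
            # The reader went away: stop writing quietly instead of dumping a traceback.
            on_broken_pipe()

    return wrapper
```

Only `BrokenPipeError` is swallowed; any other exception from the callback still propagates, so real bugs stay visible.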
emit_error printed to stdout, so a failure polluted piped output (e.g. 'aai transcribe x -o text > out.txt' wrote the error into out.txt). Route both human and --json error output to stderr. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tic templates samples create now renders through the same generator as --show-code (env-var auth, no secret written to disk), so scaffolds can't drift from a separate set of .tmpl files. Removes the static templates and key injection; scaffolding no longer requires auth. Generated samples are asserted to parse. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…am->llm) Document the composability features: -o/--output field selectors, '-' stdin audio, aai llm reading piped stdin, errors-to-stderr, and the stream/agent -> llm patterns (timeout -s INT, inactivity-timeout, capture-then-process). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aai stream -o text | aai llm -f "summarize as I talk" — each finalized turn triggers a fresh transform over the full transcript, refreshing in place (Rich Live) for humans or one NDJSON object per refresh when piped. Adds stdio stdin_is_piped/iter_piped_stdin_lines helpers + tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cts the key) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
--show-code now renders Pygments-highlighted Python (via rich.Syntax) on an interactive terminal, but emits plain text with no ANSI whenever output isn't a terminal (pipe/redirect/agent) — so '--show-code > script.py' and '| aai llm' stay byte-clean and runnable. rich already bundles Pygments; no new dependency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
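
The terminal check behind this behavior can be sketched with the standard library alone. The function names are hypothetical, and the ANSI bold wrapper is only a placeholder for the real Pygments highlighting via rich; what the sketch shows is the branch itself, where anything that is not a terminal receives the code byte-for-byte.

```python
import sys
from typing import Optional, TextIO


def is_interactive(stream: Optional[TextIO] = None) -> bool:
    """True only when the target stream is attached to a terminal."""
    target = sys.stdout if stream is None else stream
    isatty = getattr(target, "isatty", None)
    return bool(isatty is not None and isatty())


def render_code(code: str, stream: Optional[TextIO] = None) -> str:
    """Return styled code for terminals and the unmodified source otherwise."""
    if is_interactive(stream):
        # Placeholder styling (ANSI bold) standing in for real syntax highlighting.
        return f"\x1b[1m{code}\x1b[0m"
    # Pipe, redirect, or captured output: no escape sequences at all.
    return code
```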
Audit-driven consistency pass over how commands handle pipes and streams:
- add a top-level BrokenPipe guard (main.run) so `aai … | head` never prints an "Exception ignored" traceback from the one-shot output paths; wire __main__ and the console-script entry point to it
- extract output.validate_output_field / output.stream_output_modes and use them in stream + agent (drop the copy-pasted validation and json-mode logic)
- move transcript field selection to client.select_transcript_field and add `-o text|id|status|utterances|json` to `transcripts get` plus `-o text|json` to `aai llm` (one-shot), removing the `… | jq -r .output` workaround
- route transcribe's --json branch through output.emit and follow-mode NDJSON through output.emit_ndjson; clarify the `aai llm --json` help text
- render the `aai llm -f` live panel in the alternate screen buffer so a Ctrl-C in `aai stream -o text | aai llm -f` no longer duplicates the panel border

Docs: README Recipes use `-o text`; add a `--show-code | aai llm` example.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Expose SubRip captions through `-o srt` on `transcribe` and `transcripts get`, backed by the SDK's export_subtitles_srt(). Network/auth errors from the /srt export endpoint surface as the usual clean CLI errors. Includes a "burn captions onto a video" README recipe. Also bundles in-flight working-tree changes across agent/render/streaming/main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One-line yt-dlp + aai transcribe -o srt + ffmpeg chain for captioning a YouTube video. Notes --no-playlist for radio links and that burned SRT is static line captions, not word-by-word karaoke. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The check job installed deps with pip but check.sh drives every tool through `uv run`/`uv build`, so CI died on `uv: command not found` (long-standing since 44b050d). Install uv from PyPI in the check job. Also clear the gates that surfaced once check.sh actually ran: - client.py: wrap export_subtitles_srt() in str() (mypy no-any-return) - ruff format reflow in claude.py and transcribe.py Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two more missing CI tools surfaced once uv was fixed: - markdownlint CLI: check.sh lints Markdown with it (pinned 0.45.0, via npm on the runner's Node) — was `command not found` in the check job. - ffmpeg: the `--sample` stream tests build a FileSource for the hosted sample URL, which requires ffmpeg; without it FileSource raises before stream_audio, so those 4 tests failed with KeyError: 'params'. Added to both jobs that run pytest (check + pre-commit). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Initial implementation of the `aai` onboarding CLI for AssemblyAI. This populates the repo (previously just a README) with the full CLI package, tests, and tooling.

Commands
- `aai login`: store the API key via the system keyring
- `aai transcribe`: file/sample transcription, emits a runnable code template
- `aai transcripts`: list / get past transcripts
- `aai samples`: scaffold key-injected sample scripts
- `aai stream`: real-time transcription from a file or microphone
- `aai agent`: live two-way voice conversation with an AssemblyAI voice agent
- `aai claude`: wire Claude Code to AssemblyAI's docs MCP server + skill

Stack
- `typer` + `rich`, packaged with `hatchling`
- `[mic]` (PyAudio) for streaming/agent mic input, `[dev]` for tooling
- `ruff` + `pre-commit` configured

Tests
Full `pytest` suite: 157 tests passing locally.

🤖 Generated with Claude Code