Add sandbox-only `aai speak` streaming TTS command by alexkroman · Pull Request #61 · AssemblyAI/cli

alexkroman · 2026-06-10T17:58:10Z

What

Adds aai speak — synthesize speech from text via the sandbox streaming-TTS WebSocket, playing it through the speakers by default or writing a WAV with --out.

aai speak "Hello there, friend." --sandbox       # play through speakers
aai speak "Hello" --out /tmp/hello.wav --sandbox # write a WAV
echo "Hello" | aai speak --sandbox               # text from stdin
aai speak "Hi" --voice jane --language English --sandbox --json

Sandbox-only: TTS only exists at streaming-tts.sandbox000.assemblyai-labs.com. Running against production (the default) exits 2 with a --sandbox hint.
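
The guard described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the `Environment` dataclass shape, the `CLIError` class with an `exit_code` attribute, the `require_tts` helper name, and the error wording are all assumptions. Only the `streaming_tts_host` field, the sandbox host, and exit code 2 come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical environment record; only streaming_tts_host is from the PR."""

    name: str
    streaming_tts_host: str  # empty string is the "TTS not available" signal


class CLIError(Exception):
    """Hypothetical CLI error carrying the process exit code."""

    def __init__(self, message: str, exit_code: int) -> None:
        super().__init__(message)
        self.exit_code = exit_code


PRODUCTION = Environment("production", "")
SANDBOX = Environment("sandbox000", "streaming-tts.sandbox000.assemblyai-labs.com")


def require_tts(env: Environment) -> str:
    """Return the TTS host, or raise a usage-style error (exit code 2)."""
    if not env.streaming_tts_host:
        raise CLIError(
            f"speak is not available on {env.name}; re-run with --sandbox",
            exit_code=2,
        )
    return env.streaming_tts_host
```

Keying the guard on an empty host field, rather than on the environment name, means a future environment gains TTS support just by setting the field.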

How

aai_cli/tts/ — new subsystem mirroring agent/:
- session.py — the WebSocket protocol (Begin → Generate → Flush → Audio → Terminate) with an injectable connect factory for hermetic tests; auth/connect failures map to clean CLIErrors. Boundaries are typed via Protocol, not Any.
- audio.py — write_wav (stdlib wave) and play_pcm (sounddevice).
commands/speak.py — the Typer sub-app (sandbox guard → key → text from arg/stdin → synthesize → play or write; --json metadata).
environments.py — new streaming_tts_host field; empty on production (the sandbox-only signal), set on sandbox000.
main.py — registers speak under "Run AssemblyAI".
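
As a rough picture of the `write_wav` half of audio.py, a stdlib-only sketch might look like the one below. The signature and the 16-bit mono PCM format are assumptions made for illustration; the PR text only states that the function uses the stdlib `wave` module.

```python
import io
import wave
from typing import BinaryIO, Union


def write_wav(dest: Union[str, BinaryIO], pcm: bytes, sample_rate: int) -> None:
    """Wrap raw PCM bytes in a WAV container.

    Assumes 16-bit mono samples; the real server format may differ.
    """
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)        # mono (assumption)
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit (assumption)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)


# Usage: write 10 ms of silence at 24 kHz into an in-memory buffer.
buffer = io.BytesIO()
write_wav(buffer, b"\x00\x00" * 240, 24_000)
```

Accepting either a path or a binary file object keeps the helper easy to test without touching the filesystem.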

Design + implementation plan: docs/superpowers/specs/2026-06-10-aai-speak-design.md, docs/superpowers/plans/2026-06-10-aai-speak.md.

Tests / gate

Full ./scripts/check.sh is green: ruff, mypy, pyright (src strict + tests), vulture, deptry, import-linter, xenon, 100% patch coverage, mutation gate (no surviving mutants), no-new-escape-hatches (zero net-new Any/cast/ignores), build + twine. New unit tests cover the protocol (incl. error/warning frames, auth mapping, default factories), WAV/playback, and the command surface; help snapshots regenerated.

⚠️ Known limitation — upstream not synthesizing (server-side)

The deployed sandbox upstream currently rejects all synthesis. Verified directly against the live server (raw client, no CLI involved):

The connection succeeds — Begin echoes configuration: {voice, language, sample_rate}.
On Flush the server returns {"type":"Error","error_code":3005,"error":"Upstream error: InputParseError"} and closes with code 3005.
It fails identically for every voice/language/text — even with no Generate at all (just Flush), and for the exact documented sample_session.py request. So the upstream model throws InputParseError on the gRPC Initialization, before any text. This is a server-side gateway↔model contract mismatch, not a CLI bug — no client change can affect it.

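When this was written, the CLI was believed to reproduce the reference protocol faithfully, and end-to-end audio could not be verified. A later commit on this branch traced the InputParseError to the client itself: it sent `Flush` where the server expects `ForceFlushTextBuffer`. With that fixed, runs against the live sandbox produced valid WAVs.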

Open follow-up

Revisit the client-side --voice/--language defaults (currently Vivian/English). These were added on a since-disproven hypothesis that the bare command failed for lack of voice/language; the server actually defaults them. May revert to "no client default" (server is source of truth).

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The sandbox streaming-TTS server requires an explicit voice and language on the connection (it no longer backfills them), so an omitted voice produced an upstream InputParseError. Always send the documented defaults (Vivian/English), overridable with --voice/--language. Also types the websocket/output-stream boundaries as Protocols instead of Any to satisfy the no-new-escape-hatches gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Type the bare dict annotations, assert immutability via an opaque-name setattr helper (frozen check without a read-only static error), and verify the default output-stream factory with an identity sentinel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The TTS session client had drifted from the api_tts reference protocol, so every `aai speak` request failed with `InputParseError`:
- Send `ForceFlushTextBuffer`, not `Flush`. A plain `Flush` message is unparseable server-side and is what produced the InputParseError.
- Stop on `is_final` (the real Audio-frame flag), not `is_final_for_flush`.
- Read `sample_rate` from the `Begin` frame's `configuration` (24 kHz fallback); Audio frames don't carry it, so the old `msg["sample_rate"]` would KeyError.
- Default `--voice` to `jane` (the reference client's default) instead of `Vivian`, so a bare `aai speak` works.

Verified against the live sandbox: bare and `--sample-rate 16000` runs both produce valid WAVs. Spec doc updated to the corrected protocol.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
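
The corrected message flow can be sketched against an injectable connection, in the spirit of the "connect factory" mentioned in the PR description. This is a hedged illustration only. The frame type names, the `is_final` flag, the `configuration.sample_rate` lookup with its 24 kHz fallback, and the `Error` frame fields come from this PR; the `text` and `audio` field names, the base64 audio encoding, the client-sent `Terminate`, and the `Connection`, `TTSError`, and `synthesize` names are assumptions.

```python
import base64
import json
from typing import Callable, Protocol, Tuple

FALLBACK_SAMPLE_RATE = 24_000  # used when Begin carries no sample_rate


class Connection(Protocol):
    """Minimal WebSocket-like boundary, typed as a Protocol rather than Any."""

    def send(self, message: str) -> None: ...
    def recv(self) -> str: ...
    def close(self) -> None: ...


class TTSError(Exception):
    """Hypothetical error raised when the server sends an Error frame."""


def _read_frame(ws: Connection) -> dict:
    frame = json.loads(ws.recv())
    if frame.get("type") == "Error":
        raise TTSError(f"{frame.get('error_code')}: {frame.get('error')}")
    return frame


def synthesize(text: str, connect: Callable[[], Connection]) -> Tuple[bytes, int]:
    """Run Begin -> Generate -> ForceFlushTextBuffer -> Audio -> Terminate."""
    ws = connect()
    try:
        begin = _read_frame(ws)
        config = begin.get("configuration") or {}
        sample_rate = config.get("sample_rate", FALLBACK_SAMPLE_RATE)

        # "text" is an assumed field name for the Generate payload.
        ws.send(json.dumps({"type": "Generate", "text": text}))
        ws.send(json.dumps({"type": "ForceFlushTextBuffer"}))

        chunks = []
        while True:
            frame = _read_frame(ws)
            if frame.get("type") != "Audio":
                continue  # skip warnings and other non-audio frames
            # "audio" as base64 PCM is an assumed encoding.
            chunks.append(base64.b64decode(frame.get("audio", "")))
            if frame.get("is_final"):
                break

        ws.send(json.dumps({"type": "Terminate"}))
        return b"".join(chunks), sample_rate
    finally:
        ws.close()
```

Because `connect` is just a callable, a test can pass a fake connection that replays canned frames, which keeps the protocol tests hermetic.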

play_pcm wrote the whole clip in one blocking write and caught only Exception, so a Ctrl-C mid-playback was not delivered promptly and did not stop the device: PortAudio kept draining its buffer after the interrupt. Write in ~4 KiB chunks so the SIGINT lands between writes, and on KeyboardInterrupt abort() the stream (discarding buffered frames) for an immediate stop before re-raising. The stream is now closed via finally on every path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
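
The interrupt handling described in this commit can be sketched as below. It is a sketch under stated assumptions, not the PR's code: the `OutputStream` Protocol, the exact 4096-byte chunk size, passing the stream in as a parameter, and the explicit `stop()` on the normal path are all illustrative choices. The method names are modelled on those a sounddevice raw output stream exposes.

```python
from typing import Protocol

CHUNK_BYTES = 4096  # ~4 KiB, so an interrupt can land between writes


class OutputStream(Protocol):
    """Assumed playback boundary; names follow sounddevice's stream methods."""

    def start(self) -> None: ...
    def write(self, data: bytes) -> object: ...
    def stop(self) -> None: ...
    def abort(self) -> None: ...
    def close(self) -> None: ...


def play_pcm(pcm: bytes, stream: OutputStream) -> None:
    """Play PCM in small chunks so Ctrl-C stops playback immediately."""
    try:
        stream.start()
        for offset in range(0, len(pcm), CHUNK_BYTES):
            stream.write(pcm[offset:offset + CHUNK_BYTES])
        stream.stop()  # normal path: let buffered audio finish playing
    except KeyboardInterrupt:
        stream.abort()  # discard buffered frames instead of draining them
        raise
    finally:
        stream.close()  # runs on every path, including the re-raise
```

The `except KeyboardInterrupt` clause is needed because `KeyboardInterrupt` derives from `BaseException`, so a handler for `Exception` alone never sees it.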

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Implement parse_voice_overrides() to split --voice repeatable args into bare default voice and speaker ID→voice mappings (case-insensitive). Implement assign_voices() to resolve each segment to a voice via first-appearance rotation order, with mappings consuming no rotation slot. Add DEFAULT_VOICE_ROTATION constant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
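
One way to read this commit is sketched below. The function and constant names come from the commit message, but everything else is assumed for illustration: the `SPEAKER=voice` pair syntax, the signatures, the placeholder rotation contents, and taking a list of speaker labels instead of segment objects. How the bare default voice feeds into the rotation is not stated here, so the sketch leaves it out.

```python
from __future__ import annotations

from typing import Optional, Sequence

# Placeholder contents; the real rotation list is not shown in this PR.
DEFAULT_VOICE_ROTATION = ("jane", "Vivian")


def parse_voice_overrides(values: Sequence[str]) -> tuple[Optional[str], dict[str, str]]:
    """Split repeatable --voice values into (bare default, speaker -> voice)."""
    default: Optional[str] = None
    mappings: dict[str, str] = {}
    for value in values:
        speaker, sep, voice = value.partition("=")  # "=" syntax is assumed
        if not sep:
            default = value  # a bare value names the default voice
            continue
        if not speaker.strip() or not voice.strip():
            raise ValueError(f"malformed --voice pair: {value!r}")
        mappings[speaker.strip().upper()] = voice.strip()  # case-insensitive key
    return default, mappings


def assign_voices(
    speakers: Sequence[str],
    mappings: dict[str, str],
    rotation: Sequence[str] = DEFAULT_VOICE_ROTATION,
) -> list[str]:
    """Resolve one voice per segment, rotating in order of first appearance."""
    assigned: dict[str, str] = {}
    next_slot = 0
    voices = []
    for speaker in speakers:
        key = speaker.upper()
        if key not in assigned:
            if key in mappings:
                assigned[key] = mappings[key]  # explicit mapping uses no slot
            else:
                assigned[key] = rotation[next_slot % len(rotation)]
                next_slot += 1
        voices.append(assigned[key])
    return voices
```

Because mapped speakers skip the rotation counter, adding an explicit mapping for one speaker shifts every later unmapped speaker one slot earlier.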

…ze malformed-pair

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nt, multi duration Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…h); fix dialogue test typing Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d arg only)

The blank-argument case deliberately does NOT fall back to piped stdin (guards against 'aai speak "$MSG"' silently consuming the pipe when MSG is empty); it is pinned by test_blank_arg_does_not_fall_back_to_stdin. Reword the docstring and add an inline note so the behavior isn't misread as a bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
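
The rule this commit pins down can be illustrated with a small resolver. The `resolve_text` name, its signature, the use of `ValueError`, and the whitespace stripping are assumptions; the only behavior taken from the commit message is that a blank argument does not fall back to piped stdin.

```python
import io
from typing import Optional, TextIO


def resolve_text(arg: Optional[str], stdin: TextIO) -> str:
    """Pick the text to speak from the argument or, if absent, from stdin."""
    if arg is not None:
        text = arg.strip()
        if not text:
            # A blank argument (e.g. an empty "$MSG") is an error.
            # Deliberately do NOT read stdin here, so the pipe is left intact.
            raise ValueError("text argument is blank")
        return text
    if stdin.isatty():
        raise ValueError("no text given and stdin is a terminal")
    text = stdin.read().strip()
    if not text:
        raise ValueError("stdin was empty")
    return text


# Usage: with no argument, piped input is used.
spoken = resolve_text(None, io.StringIO("Hello\n"))
```

Distinguishing `None` (argument omitted) from an empty string (argument given but blank) is what makes the guard possible.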

alexkroman-assembly and others added 13 commits June 10, 2026 09:40

docs(speak): design for sandbox-only streaming TTS command

4f64b19

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(speak): implementation plan for the speak command

47eabe5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(speak): add streaming_tts_host to Environment

4ac6e2a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(speak): tts session availability + url building

847ef10

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(speak): tts websocket synthesize protocol

ea9a73e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(speak): tts wav writing + pcm playback

31c872a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(speak): add sandbox-only speak command

133c97f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(speak): refresh CLI help snapshots

ba37f08

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(speak): document the tts subsystem and speak command

cb68bd0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(speak): cover default audio/ws factories + help order

58981ea

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(speak): kill surviving mutants on changed lines

b419755

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

aikido-pr-checks Bot reviewed Jun 10, 2026

Comment thread aai_cli/commands/speak.py

alexkroman-assembly and others added 15 commits June 10, 2026 15:11

docs(speak): design for speaker-aware multi-voice playback

4c37cbe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(speak): implementation plan for speaker-aware multi-voice playback

b60526f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(speak): parse speaker-labeled transcript text into segments

81988a8

test(speak): cover label-only continuation; drop no-op strip

3dc1848

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(speak): make voice-rotation wrap test self-sufficient; parametri…

dcdf9f6

…ze malformed-pair

feat(speak): add silence() PCM helper for inter-turn gaps

2907b60

feat(speak): synthesize_dialogue concatenates per-voice segments

4a746ce

test(speak): pin dialogue duration, server rate, and empty-segments edge

ee023a2

feat(speak): auto-detect diarized input and voice each speaker

c4730c3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(speak): cover empty-labels path; pin voice fallback, speaker cou…

31b448b

…nt, multi duration Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(speak): kill mutation survivors (Segment frozen, multi --out pat…

b34be10

…h); fix dialogue test typing Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alexkroman merged commit db7644d into main Jun 11, 2026
11 checks passed

alexkroman deleted the add-speak-command branch June 11, 2026 00:51

alexkroman merged 28 commits into main from add-speak-command
