Add sandbox-only aai speak streaming TTS command#61

Merged
alexkroman merged 28 commits into main from add-speak-command
Jun 11, 2026
Conversation

@alexkroman

Collaborator

What

Adds aai speak — synthesize speech from text via the sandbox streaming-TTS WebSocket, playing it through the speakers by default or writing a WAV with --out.

aai speak "Hello there, friend." --sandbox       # play through speakers
aai speak "Hello" --out /tmp/hello.wav --sandbox # write a WAV
echo "Hello" | aai speak --sandbox               # text from stdin
aai speak "Hi" --voice jane --language English --sandbox --json

Sandbox-only: TTS exists only at streaming-tts.sandbox000.assemblyai-labs.com. Running against production (the default) exits with code 2 and a --sandbox hint.
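
For readers who want to picture the guard, here is a minimal sketch of how an empty host field can act as the sandbox-only signal. The `Environment` dataclass, the `ENVIRONMENTS` registry, `UsageError`, and `require_tts_host` are hypothetical names for illustration only; just the `streaming_tts_host` field, the sandbox host name, and exit code 2 come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    streaming_tts_host: str  # empty string means "no TTS in this environment"


# Hypothetical registry; only the sandbox host name is taken from the PR text.
ENVIRONMENTS = {
    "production": Environment("production", ""),
    "sandbox000": Environment(
        "sandbox000", "streaming-tts.sandbox000.assemblyai-labs.com"
    ),
}


class UsageError(Exception):
    """Stand-in for the CLI's own error type; carries the process exit code."""

    def __init__(self, message: str, exit_code: int = 2) -> None:
        super().__init__(message)
        self.exit_code = exit_code


def require_tts_host(env: Environment) -> str:
    """Return the TTS host, or raise a usage error that hints at --sandbox."""
    if not env.streaming_tts_host:
        raise UsageError(
            f"'aai speak' is sandbox-only; re-run with --sandbox "
            f"(current environment: {env.name})"
        )
    return env.streaming_tts_host
```

Keeping the signal in the environment table, rather than hard-coding an environment name in the command, means the guard would need no change if production gains a TTS host later.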

How

  • aai_cli/tts/ — new subsystem mirroring agent/:
    • session.py — the WebSocket protocol (Begin → Generate → Flush → Audio → Terminate) with an injectable connect factory for hermetic tests; auth/connect failures map to clean CLIErrors. Boundaries are typed via Protocol, not Any.
    • audio.py — write_wav (stdlib wave) and play_pcm (sounddevice).
  • commands/speak.py — the Typer sub-app (sandbox guard → key → text from arg/stdin → synthesize → play or write; --json metadata).
  • environments.py — new streaming_tts_host field; empty on production (the sandbox-only signal), set on sandbox000.
  • main.py — registers speak under "Run AssemblyAI".
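
As an illustration of the audio.py item above, a WAV writer built on the stdlib `wave` module can be as small as the following. This is a sketch, not the merged code: the signature and the 16-bit mono sample format are assumptions that the PR does not state.

```python
import wave
from pathlib import Path


def write_wav(path: Path, pcm: bytes, sample_rate: int) -> None:
    """Wrap raw PCM in a WAV container (assumes 16-bit mono samples)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)        # assumption: mono
        wav.setsampwidth(2)        # assumption: 16-bit samples (2 bytes)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)       # header sizes are patched when the file closes
```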

Design + implementation plan: docs/superpowers/specs/2026-06-10-aai-speak-design.md, docs/superpowers/plans/2026-06-10-aai-speak.md.

Tests / gate

Full ./scripts/check.sh is green: ruff, mypy, pyright (src strict + tests), vulture, deptry, import-linter, xenon, 100% patch coverage, mutation gate (no surviving mutants), no-new-escape-hatches (zero net-new Any/cast/ignores), build + twine. New unit tests cover the protocol (incl. error/warning frames, auth mapping, default factories), WAV/playback, and the command surface; help snapshots regenerated.

⚠️ Known limitation — upstream not synthesizing (server-side)

The deployed sandbox upstream currently rejects all synthesis. Verified directly against the live server (raw client, no CLI involved):

  • The connection succeeds — Begin echoes configuration: {voice, language, sample_rate}.
  • On Flush the server returns {"type":"Error","error_code":3005,"error":"Upstream error: InputParseError"} and closes with code 3005.
  • It fails identically for every voice/language/text — even with no Generate at all (just Flush), and for the exact documented sample_session.py request. So the upstream model throws InputParseError on the gRPC Initialization, before any text. This is a server-side gateway↔model contract mismatch, not a CLI bug — no client change can affect it.

The CLI faithfully reproduces the reference protocol; it should work once the sandbox upstream is fixed. End-to-end audio could not be verified while the upstream is down.

Open follow-up

  • Revisit the client-side --voice/--language defaults (currently Vivian/English). These were added on a since-disproven hypothesis that the bare command failed for lack of voice/language; the server actually defaults them. May revert to "no client default" (server is source of truth).

🤖 Generated with Claude Code

alexkroman-assembly and others added 13 commits June 10, 2026 09:40
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The sandbox streaming-TTS server requires an explicit voice and language on
the connection (it no longer backfills them), so an omitted voice produced an
upstream InputParseError. Always send the documented defaults (Vivian/English),
overridable with --voice/--language. Also types the websocket/output-stream
boundaries as Protocols instead of Any to satisfy the no-new-escape-hatches gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Type the bare dict annotations, assert immutability via an opaque-name setattr
helper (frozen check without a read-only static error), and verify the default
output-stream factory with an identity sentinel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread aai_cli/commands/speak.py
alexkroman-assembly and others added 15 commits June 10, 2026 15:11
The TTS session client had drifted from the api_tts reference protocol, so
every `aai speak` request failed with `InputParseError`:

- Send `ForceFlushTextBuffer`, not `Flush` — a plain `Flush` message is
  unparseable server-side and is what produced the InputParseError.
- Stop on `is_final`, not `is_final_for_flush` (the real Audio-frame flag).
- Read `sample_rate` from the `Begin` frame's `configuration` (24 kHz
  fallback); Audio frames don't carry it, so the old `msg["sample_rate"]`
  would KeyError.
- Default `--voice` to `jane` (the reference client's default) instead of
  `Vivian`, so a bare `aai speak` works.

Verified against the live sandbox: bare and `--sample-rate 16000` runs both
produce valid WAVs. Spec doc updated to the corrected protocol.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
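
The corrected message flow from the commit above can be sketched roughly as follows. Message type names, the `is_final` flag, the `configuration.sample_rate` lookup, and the 24 kHz fallback come from the commit message; the `text` field on `Generate`, the base64 `audio` field on `Audio` frames, the synchronous socket interface, and the `synthesize` name are assumptions made for this illustration.

```python
import base64
import json
from typing import Callable, Protocol


class WebSocketLike(Protocol):
    """Minimal socket boundary so tests can inject a fake connection."""

    def send(self, data: str) -> None: ...
    def recv(self) -> str: ...
    def close(self) -> None: ...


DEFAULT_SAMPLE_RATE = 24_000  # fallback named in the commit message


def synthesize(connect: Callable[[], WebSocketLike], text: str) -> tuple[bytes, int]:
    """Run one Begin -> Generate -> ForceFlushTextBuffer -> Audio -> Terminate exchange."""
    ws = connect()
    try:
        begin = json.loads(ws.recv())
        if begin.get("type") != "Begin":
            raise RuntimeError(f"expected Begin, got {begin.get('type')!r}")
        # Audio frames do not carry the rate, so take it from Begin.
        sample_rate = begin.get("configuration", {}).get(
            "sample_rate", DEFAULT_SAMPLE_RATE
        )

        # Assumption: the text travels in a "text" field.
        ws.send(json.dumps({"type": "Generate", "text": text}))
        ws.send(json.dumps({"type": "ForceFlushTextBuffer"}))

        pcm = bytearray()
        while True:
            msg = json.loads(ws.recv())
            if msg.get("type") == "Error":
                raise RuntimeError(msg.get("error", "unknown server error"))
            if msg.get("type") != "Audio":
                continue  # ignore frame types this sketch does not model
            # Assumption: audio arrives base64-encoded in an "audio" field.
            pcm += base64.b64decode(msg.get("audio", ""))
            if msg.get("is_final"):
                break

        ws.send(json.dumps({"type": "Terminate"}))
        return bytes(pcm), sample_rate
    finally:
        ws.close()
```

Passing `connect` in as a factory is what lets a test hand over a scripted fake socket instead of opening a real connection.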
play_pcm wrote the whole clip in one blocking write and caught only Exception,
so a Ctrl-C mid-playback neither arrived promptly nor stopped the device:
PortAudio kept draining its buffer after the interrupt.

Write in ~4 KiB chunks so the SIGINT lands between writes, and on
KeyboardInterrupt abort() the stream (discarding buffered frames) for an
immediate stop before re-raising. The stream is now closed via finally on every
path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
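
The chunked-write and abort behavior described in the commit above might look like the sketch below. It assumes an already-started stream object with sounddevice-style `write`, `stop`, `abort`, and `close` methods, passed in explicitly so the logic can be exercised without an audio device; the `OutputStream` protocol, this signature, and the explicit `stop()` on the normal path are illustrative assumptions rather than the merged code.

```python
from typing import Protocol


class OutputStream(Protocol):
    """The subset of a blocking output stream this sketch relies on."""

    def write(self, data: bytes) -> object: ...
    def stop(self) -> object: ...
    def abort(self) -> object: ...
    def close(self) -> object: ...


CHUNK_BYTES = 4096  # ~4 KiB, so an interrupt can land between writes


def play_pcm(stream: OutputStream, pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> None:
    """Play PCM in small chunks; on Ctrl-C drop buffered audio and re-raise."""
    try:
        for start in range(0, len(pcm), chunk_bytes):
            stream.write(pcm[start:start + chunk_bytes])
        stream.stop()   # normal path: let buffered audio finish playing
    except KeyboardInterrupt:
        stream.abort()  # discard buffered frames for an immediate stop
        raise
    finally:
        stream.close()  # runs on every path
```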
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implement parse_voice_overrides() to split repeatable --voice args into a
bare default voice and speaker-ID→voice mappings (case-insensitive).
Implement assign_voices() to resolve each segment to a voice in
first-appearance rotation order, with mapped speakers consuming no rotation slot.
Add the DEFAULT_VOICE_ROTATION constant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
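
To make the rotation rule in the commit above concrete, here is one way the two helpers could behave. This is a sketch under several assumptions: the `SPEAKER=VOICE` syntax, the function signatures, and the contents of `DEFAULT_VOICE_ROTATION` are all hypothetical, and how the bare default voice feeds into assignment is not stated in the commit message, so the sketch only parses it.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical contents; the real rotation list is not given in this PR.
DEFAULT_VOICE_ROTATION: Tuple[str, ...] = ("jane", "Vivian")


def parse_voice_overrides(values: Sequence[str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Split --voice values into a bare default and speaker -> voice mappings.

    Assumes mappings are written SPEAKER=VOICE; speaker IDs are lower-cased
    so later lookups are case-insensitive.
    """
    default: Optional[str] = None
    mapping: Dict[str, str] = {}
    for value in values:
        speaker, sep, voice = value.partition("=")
        if sep:
            mapping[speaker.strip().lower()] = voice.strip()
        else:
            default = value.strip()
    return default, mapping


def assign_voices(
    speakers: Sequence[str],
    mapping: Dict[str, str],
    rotation: Sequence[str] = DEFAULT_VOICE_ROTATION,
) -> List[str]:
    """Resolve one voice per segment, given each segment's speaker ID."""
    assigned: Dict[str, str] = {}
    next_slot = 0
    voices: List[str] = []
    for speaker in speakers:
        key = speaker.lower()
        if key in mapping:
            voices.append(mapping[key])  # explicit mapping: no slot consumed
            continue
        if key not in assigned:
            # First appearance of an unmapped speaker takes the next slot.
            assigned[key] = rotation[next_slot % len(rotation)]
            next_slot += 1
        voices.append(assigned[key])
    return voices
```

With these assumptions, speakers A, B, A, C with B mapped explicitly leave B outside the rotation, so C receives the second rotation voice rather than the third.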
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt, multi duration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h); fix dialogue test typing

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d arg only)

The blank-argument case deliberately does NOT fall back to piped stdin (guards
against 'aai speak "$MSG"' silently consuming the pipe when MSG is empty); it
is pinned by test_blank_arg_does_not_fall_back_to_stdin. Reword the docstring +
add an inline note so the behavior isn't misread as a bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
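
The blank-argument rule from the commit above can be illustrated with a small resolver. This is a sketch only: `resolve_text`, its parameters, and the use of `ValueError` are hypothetical stand-ins for whatever the command actually does; the one behavior taken from the commit message is that a blank argument is an error and never falls back to stdin.

```python
from typing import Optional, TextIO


def resolve_text(arg: Optional[str], stdin: TextIO, stdin_is_tty: bool) -> str:
    """Pick the text to speak from the positional argument or piped stdin."""
    if arg is not None:
        text = arg.strip()
        if not text:
            # Deliberate: a blank argument (e.g. an empty "$MSG") is an error.
            # Falling back to stdin here would silently consume the pipe.
            raise ValueError("text argument is blank")
        return text
    if stdin_is_tty:
        raise ValueError("no text given; pass an argument or pipe text on stdin")
    text = stdin.read().strip()
    if not text:
        raise ValueError("stdin was empty")
    return text
```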
@alexkroman alexkroman merged commit db7644d into main Jun 11, 2026
11 checks passed
@alexkroman alexkroman deleted the add-speak-command branch June 11, 2026 00:51