Add sandbox-only aai speak streaming TTS command#61
Merged
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The sandbox streaming-TTS server requires an explicit voice and language on the connection (it no longer backfills them), so an omitted voice produced an upstream InputParseError. Always send the documented defaults (Vivian/English), overridable with --voice/--language. Also types the websocket/output-stream boundaries as Protocols instead of Any to satisfy the no-new-escape-hatches gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Type the bare dict annotations, assert immutability via an opaque-name setattr helper (frozen check without a read-only static error), and verify the default output-stream factory with an identity sentinel. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The TTS session client had drifted from the api_tts reference protocol, so every `aai speak` request failed with `InputParseError`: - Send `ForceFlushTextBuffer`, not `Flush` — a plain `Flush` message is unparseable server-side and is what produced the InputParseError. - Stop on `is_final`, not `is_final_for_flush` (the real Audio-frame flag). - Read `sample_rate` from the `Begin` frame's `configuration` (24 kHz fallback); Audio frames don't carry it, so the old `msg["sample_rate"]` would KeyError. - Default `--voice` to `jane` (the reference client's default) instead of `Vivian`, so a bare `aai speak` works. Verified against the live sandbox: bare and `--sample-rate 16000` runs both produce valid WAVs. Spec doc updated to the corrected protocol. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
play_pcm wrote the whole clip in one blocking write and caught only Exception, so a Ctrl-C mid-playback was neither delivered promptly nor stopped the device — PortAudio kept draining its buffer after the interrupt. Write in ~4 KiB chunks so the SIGINT lands between writes, and on KeyboardInterrupt abort() the stream (discarding buffered frames) for an immediate stop before re-raising. The stream is now closed via finally on every path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implement parse_voice_overrides() to split --voice repeatable args into bare default voice and speaker ID→voice mappings (case-insensitive). Implement assign_voices() to resolve each segment to a voice via first-appearance rotation order, with mappings consuming no rotation slot. Add DEFAULT_VOICE_ROTATION constant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ze malformed-pair
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt, multi duration Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h); fix dialogue test typing Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d arg only) The blank-argument case deliberately does NOT fall back to piped stdin (guards against 'aai speak "$MSG"' silently consuming the pipe when MSG is empty); it is pinned by test_blank_arg_does_not_fall_back_to_stdin. Reword the docstring + add an inline note so the behavior isn't misread as a bug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds
aai speak— synthesize speech from text via the sandbox streaming-TTS WebSocket, playing it through the speakers by default or writing a WAV with--out.Sandbox-only: TTS only exists at
streaming-tts.sandbox000.assemblyai-labs.com. Running against production (the default) exits 2 with a--sandboxhint.How
aai_cli/tts/— new subsystem mirroringagent/:session.py— the WebSocket protocol (Begin → Generate → Flush → Audio → Terminate) with an injectableconnectfactory for hermetic tests; auth/connect failures map to cleanCLIErrors. Boundaries are typed viaProtocol, notAny.audio.py—write_wav(stdlibwave) andplay_pcm(sounddevice).commands/speak.py— the Typer sub-app (sandbox guard → key → text from arg/stdin → synthesize → play or write;--jsonmetadata).environments.py— newstreaming_tts_hostfield; empty on production (the sandbox-only signal), set onsandbox000.main.py— registersspeakunder "Run AssemblyAI".Design + implementation plan:
docs/superpowers/specs/2026-06-10-aai-speak-design.md,docs/superpowers/plans/2026-06-10-aai-speak.md.Tests / gate
Full
./scripts/check.shis green: ruff, mypy, pyright (src strict + tests), vulture, deptry, import-linter, xenon, 100% patch coverage, mutation gate (no surviving mutants), no-new-escape-hatches (zero net-newAny/cast/ignores), build + twine. New unit tests cover the protocol (incl. error/warning frames, auth mapping, default factories), WAV/playback, and the command surface; help snapshots regenerated.The deployed sandbox upstream currently rejects all synthesis. Verified directly against the live server (raw client, no CLI involved):
Beginechoesconfiguration: {voice, language, sample_rate}.Flushthe server returns{"type":"Error","error_code":3005,"error":"Upstream error: InputParseError"}and closes with code3005.Generateat all (justFlush), and for the exact documentedsample_session.pyrequest. So the upstream model throwsInputParseErroron the gRPCInitialization, before any text. This is a server-side gateway↔model contract mismatch, not a CLI bug — no client change can affect it.The CLI faithfully reproduces the reference protocol; it should work once the sandbox upstream is fixed. End-to-end audio could not be verified while the upstream is down.
Open follow-up
--voice/--languagedefaults (currentlyVivian/English). These were added on a since-disproven hypothesis that the bare command failed for lack of voice/language; the server actually defaults them. May revert to "no client default" (server is source of truth).🤖 Generated with Claude Code