
Commit 0f0950e

dictate: make the default headless and signal-driven
Recording now starts immediately and runs without a terminal, finishing on SIGTERM (clean exit 0) so a hotkey tool like Hammerspoon can launch `assembly dictate` as a background task and `kill -TERM` / `task:terminate()` to transcribe. SIGINT (Ctrl-C) still cancels (exit 130).

The press-Enter/keypress terminal mode and its `core/hotkey.py` (TerminalKeys) backend are removed; dictate now polls a new `signals.stop_on_terminate` latch between mic chunks. Contrast `signals.terminate_as_interrupt` (stream/agent/speak), which routes SIGTERM into the cancel path instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JrDRFsdAYyXwWSudM2da8g
1 parent 37e3a01 commit 0f0950e

13 files changed

Lines changed: 247 additions & 499 deletions
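
The commit message describes `signals.stop_on_terminate` as a latch that dictate polls between mic chunks, but `core/signals.py` itself is not part of the diff shown here. The following is a minimal sketch of how such a latch could be written, assuming it is a context manager that yields a zero-argument callable (which matches the `with stop_on_terminate() as stop_requested:` call site in `_exec.py`). The body is an illustration, not the repository's implementation.

```python
import signal
from collections.abc import Callable, Iterator
from contextlib import contextmanager


@contextmanager
def stop_on_terminate() -> Iterator[Callable[[], bool]]:
    """Yield a callable that returns True once SIGTERM has been received.

    Hypothetical sketch: the real aai_cli.core.signals module may differ.
    """
    stopped = False

    def _on_sigterm(signum, frame):
        # Only record the request; the caller's loop decides when to act on it.
        nonlocal stopped
        stopped = True

    # signal.signal() may only be called from the main thread.
    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        yield lambda: stopped
    finally:
        # Restore the handler that was active before the block. `previous` is
        # None when the old handler was not installed from Python, and None
        # cannot be passed back to signal.signal().
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

Because the handler only flips a flag, SIGTERM never interrupts the work in progress; the caller sees the request the next time it polls.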

README.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 | :--- | :--- |
 | `assembly transcribe` | Transcribe files, URLs, YouTube/podcast pages, podcast RSS feeds, directories, globs, or bucket storage (`s3://`, `gs://`, `az://`) — with speaker labels, PII redaction, summarization, SRT/VTT captions, and resumable batch runs |
 | `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
-| `assembly dictate` | Push-to-talk dictation: recording starts immediately, press Enter for instant text (Sync STT API, up to 120 s per utterance) |
+| `assembly dictate` | Signal-driven dictation: records immediately, send SIGTERM for instant text — scriptable from hotkey tools like Hammerspoon (Sync STT API, up to 120 s per utterance) |
 | `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
 | `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
 | `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
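
The new README row says dictate is scriptable from hotkey tools: launch it in the background, then send SIGTERM to get the text. A rough Python equivalent of that controller role is sketched below. `dictate_once` and its parameters are hypothetical names for illustration; the only things taken from the commit are the `assembly dictate` command and the SIGTERM-means-finish contract. `Popen.terminate()` sends SIGTERM on POSIX only, so this sketch assumes macOS or Linux.

```python
import subprocess
import time
from collections.abc import Sequence


def dictate_once(argv: Sequence[str], hold_seconds: float) -> tuple[int, str]:
    """Start a recorder, send SIGTERM after `hold_seconds`, return (exit code, stdout).

    Hypothetical helper: a real hotkey tool would send the signal on key
    release rather than after a fixed delay.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        time.sleep(hold_seconds)  # stand-in for "the user is still speaking"
        proc.terminate()  # POSIX: delivers SIGTERM, i.e. "I'm done dictating"
        stdout, _ = proc.communicate(timeout=30)
    except BaseException:
        # Never leave a stray recorder running if the controller itself fails.
        proc.kill()
        proc.wait()
        raise
    return proc.returncode, stdout.strip()
```

With the CLI installed, `dictate_once(["assembly", "dictate"], hold_seconds=3.0)` would be expected to return exit code 0 and the transcript, per the behavior described in this commit.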

aai_cli/AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ between layers is enforced — higher may import lower, never the reverse:
 `help_text`, `typer_patches`, `update_check`.
 - **`core/`** — the Rich-free library layer: `client`, `config`,
 `config_builder`, `keyring_store`, `environments`, `env`, `errors`, `llm`,
-`telemetry`, `debuglog`, `remotefs`, `sync_stt`, `hotkey`, `ws`, `youtube`,
+`telemetry`, `debuglog`, `remotefs`, `sync_stt`, `signals`, `ws`, `youtube`,
 `wer`, `argscan`, `jsonshape`, `timeparse`, `microphone`, `procs`, `stdio`,
 `choices`, `locking`, `config_lock`. Contract 4 also forbids `rich` here, so
 "no Rich below the UI layer" is structural.
@@ -149,7 +149,7 @@ heavily-reworked commands with long bodies; small commands keep the inline
 ### Feature subsystems

 - **`streaming/`** + `client.stream_audio` — v3 realtime API. Event callbacks run on the SDK reader thread and guard against `BrokenPipeError` (`stdio.silence_stdout()`) so a closed pipe never dumps a thread traceback.
-- **`core/sync_stt.py`** + **`core/hotkey.py`** + `commands/dictate/` — `assembly dictate`: push-to-talk dictation over the **Sync STT API** (`Environment.sync_base`, one POST `/transcribe` per utterance with the required `X-AAI-Model: u3-sync-pro` header; 80 ms–120 s of PCM/WAV). `hotkey.TerminalKeys` scopes stdin into cbreak (Ctrl-C still signals) and reads single keypresses; `dictate_exec._record` polls it with a zero timeout between ~100 ms mic chunks. All three boundaries (keys, mic, HTTP) are injectable, so the suite never needs a real terminal — `tests/test_hotkey.py` drives a pty pair for the termios behavior.
+- **`core/sync_stt.py`** + **`core/signals.py`** + `commands/dictate/` — `assembly dictate`: headless dictation over the **Sync STT API** (`Environment.sync_base`, one POST `/transcribe` per utterance with the required `X-AAI-Model: u3-sync-pro` header; 80 ms–120 s of PCM/WAV). It needs no terminal: recording starts immediately and `dictate_exec._record` polls `signals.stop_on_terminate` between ~100 ms mic chunks for a SIGTERM, which finishes the utterance (clean exit 0) — so a hotkey tool like Hammerspoon can launch it as a background task and `kill -TERM`/`task:terminate()` to transcribe. SIGINT (Ctrl-C) still cancels (exit 130). All three boundaries (the stop latch, mic, HTTP) are injectable, so the suite never needs a real signal or microphone (`tests/test_dictate_exec.py` scripts the SIGTERM latch). Contrast `signals.terminate_as_interrupt` (used by `stream`/`agent`/`speak`), which routes SIGTERM into the *cancel* path instead.
 - **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`).
 - **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker.
 - **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`).
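
The `dictate` entry above contrasts two SIGTERM policies: `stop_on_terminate` treats the signal as "finish" (exit 0), while `signals.terminate_as_interrupt` routes it into the cancel path. The second helper's source is not shown in this commit, so the sketch below is an assumption about its shape: a context manager whose handler raises `KeyboardInterrupt`, so SIGTERM lands in the same `except KeyboardInterrupt` branch as Ctrl-C. `run_cancellable` is a hypothetical wrapper added only to show the resulting exit code; 130 is the cancel code named in the commit message.

```python
import signal
from collections.abc import Callable, Iterator
from contextlib import contextmanager

CANCELLED_EXIT_CODE = 130  # cancel exit code stated in the commit message


@contextmanager
def terminate_as_interrupt() -> Iterator[None]:
    """Make SIGTERM behave like Ctrl-C inside the block (hypothetical sketch)."""

    def _raise_interrupt(signum, frame):
        # Raising here surfaces in the main thread as if SIGINT had arrived.
        raise KeyboardInterrupt

    previous = signal.signal(signal.SIGTERM, _raise_interrupt)
    try:
        yield
    finally:
        # Restore the earlier handler; None cannot be passed to signal.signal().
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)


def run_cancellable(work: Callable[[], None]) -> int:
    """Run `work`, mapping SIGTERM or Ctrl-C to the cancel exit code."""
    try:
        with terminate_as_interrupt():
            work()
    except KeyboardInterrupt:
        return CANCELLED_EXIT_CODE
    return 0
```

Under this policy a controller's `kill -TERM` discards the session, which is why dictate uses the latch instead: there, SIGTERM has to mean "transcribe what was recorded".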

aai_cli/commands/dictate/__init__.py

Lines changed: 13 additions & 7 deletions
@@ -22,7 +22,11 @@
     rich_help_panel=help_panels.TRANSCRIPTION,
     epilog=examples_epilog(
         [
-            ("Dictate one utterance: recording starts, Enter transcribes it", "assembly dictate"),
+            ("Record until SIGTERM, then print the transcript", "assembly dictate"),
+            (
+                "Stop the recording and transcribe (e.g. from a hotkey tool)",
+                "kill -TERM $(pgrep -f 'assembly dictate')",
+            ),
             (
                 "Pipe the utterance into another command",
                 'assembly dictate | assembly llm "write a conventional commit"',
@@ -75,13 +79,15 @@ def dictate(
         help="Output mode: text (the bare transcript per utterance, pipe-friendly) or json",
     ),
 ) -> None:
-    """Push-to-talk dictation: record the mic, get the transcript back
+    """Signal-driven dictation: record the mic, get the transcript back

-    Recording starts immediately; press Enter (or Space) to stop and the
-    utterance is sent to the AssemblyAI Sync API — the transcript prints right
-    away (no polling) and dictate exits, so it flows straight to the next
-    command in a pipe. The recording can be up to 120 seconds long. Press
-    Ctrl-C to cancel without transcribing.
+    Recording starts immediately and runs headless — no terminal needed — so a
+    hotkey tool like Hammerspoon can launch it as a background task and send
+    SIGTERM (kill -TERM, task:terminate()) to stop. On SIGTERM the utterance is
+    sent to the AssemblyAI Sync API, the transcript prints right away (no
+    polling), and dictate exits, so it flows straight to the next command in a
+    pipe. The recording can be up to 120 seconds long. Ctrl-C (SIGINT) cancels
+    without transcribing.
     """
     opts = dictate_exec.DictateOptions(
         language=language,

aai_cli/commands/dictate/_exec.py

Lines changed: 52 additions & 52 deletions
@@ -1,25 +1,29 @@
 """Run logic for `assembly dictate`: the options/run split (see AGENTS.md).

-Push-to-talk dictation over the Sync STT API: recording starts immediately,
-runs until a hotkey is pressed (or the duration cap), then the utterance is
-POSTed to the Sync API, the transcript is printed, and dictate exits. The
-command module (aai_cli/commands/dictate/__init__.py) only parses argv into a
-``DictateOptions``; tests drive the session by constructing options directly and
-injecting the key/mic/HTTP boundaries, with no CliRunner argv round-trip and no
-real terminal.
+Headless dictation over the Sync STT API: recording starts immediately and runs
+until SIGTERM is delivered (or the duration cap), then the utterance is POSTed to
+the Sync API, the transcript is printed, and dictate exits 0. There is no terminal
+interaction — a controller like Hammerspoon launches `assembly dictate` as a
+background task and sends SIGTERM (``task:terminate()`` / ``kill -TERM``) to mean
+"I'm done dictating", so the transcript flows straight to the next command in a
+pipe. SIGINT (Ctrl-C) cancels without transcribing (exit 130). The command module
+(aai_cli/commands/dictate/__init__.py) only parses argv into a ``DictateOptions``;
+tests drive the session by constructing options directly and injecting the
+stop-signal/mic/HTTP boundaries, with no real signals, microphone, or network.
 """

 from __future__ import annotations

+from collections.abc import Callable
 from dataclasses import dataclass

 import typer

 from aai_cli.app.context import AppState
 from aai_cli.core import choices, errors, sync_stt
 from aai_cli.core.config_builder import split_csv
-from aai_cli.core.hotkey import CTRL_C, CTRL_D, ESC, TerminalKeys
 from aai_cli.core.microphone import MicrophoneSource
+from aai_cli.core.signals import stop_on_terminate
 from aai_cli.streaming.validate import resolve_output_modes
 from aai_cli.ui import output

@@ -28,10 +32,6 @@
 TARGET_RATE = 16000
 _BYTES_PER_SECOND = TARGET_RATE * 2  # PCM16 mono

-# Enter or Space stops the (auto-started) recording; q / Esc / Ctrl-D also stop
-# it (Ctrl-C cancels — cbreak mode keeps SIGINT delivery).
-STOP_KEYS = frozenset({"\r", "\n", " ", "q", "Q", ESC, CTRL_C, CTRL_D})
-

 @dataclass(frozen=True)
 class DictateOptions:
@@ -52,7 +52,7 @@ class DictateOptions:


 def _note(message: str, *, json_mode: bool, quiet: bool) -> None:
-    """A muted stderr hint guiding the interactive session; silent under --json
+    """A muted stderr hint naming how to finish the recording; silent under --json
     (stderr must stay machine-readable) and --quiet."""
     if json_mode or quiet:
         return
@@ -68,11 +68,14 @@ def _languages(language: str | None) -> str | list[str] | None:
     return codes[0] if len(codes) == 1 else codes


-def _record(keys: TerminalKeys, mic: MicrophoneSource, *, max_seconds: float) -> bytes:
-    """Capture PCM until a hotkey is pressed again or the duration cap is hit.
+def _record(
+    stop_requested: Callable[[], bool], mic: MicrophoneSource, *, max_seconds: float
+) -> bytes:
+    """Capture PCM until SIGTERM is delivered (``stop_requested`` flips True) or the
+    duration cap is hit.

-    The key poll runs between ~100 ms mic chunks with a zero timeout, so the mic
-    read loop is never blocked waiting on the keyboard.
+    The stop poll runs between ~100 ms mic chunks, so a SIGTERM is honored within one
+    chunk without blocking the mic read loop.
     """
     pcm = bytearray()
     frames = iter(mic)
@@ -81,8 +84,7 @@ def _record(keys: TerminalKeys, mic: MicrophoneSource, *, max_seconds: float) ->
             pcm += chunk
             if len(pcm) >= int(max_seconds * _BYTES_PER_SECOND):
                 break
-            # None (no key pending) is simply not in the set.
-            if keys.read(0) in STOP_KEYS:
+            if stop_requested():
                 break
     finally:
         # MicrophoneSource yields from a generator whose cleanup releases the
@@ -122,8 +124,8 @@ def _transcribe_utterance(
 ) -> None:
     """Send one recorded utterance to the Sync API and print the transcript.

-    A recording below the API's 80 ms floor (a double-tapped hotkey) is skipped
-    with a warning rather than bounced off the server as a 400.
+    A recording below the API's 80 ms floor (an instant SIGTERM) is skipped with a
+    warning rather than bounced off the server as a 400.
     """
     if len(pcm) < sync_stt.MIN_AUDIO_MS * _BYTES_PER_SECOND // 1000:
         output.emit_warning(
@@ -144,7 +146,7 @@ def _transcribe_utterance(


 def _capture_and_transcribe(
-    keys: TerminalKeys,
+    stop_requested: Callable[[], bool],
     api_key: str,
     opts: DictateOptions,
     state: AppState,
@@ -156,10 +158,12 @@ def _capture_and_transcribe(
         target_rate=TARGET_RATE,
         device=opts.device,
         on_open=lambda: _note(
-            "● Recording — press Enter to stop.", json_mode=json_mode, quiet=state.quiet
+            "● Recording — send SIGTERM to transcribe (Ctrl-C cancels).",
+            json_mode=json_mode,
+            quiet=state.quiet,
         ),
     )
-    pcm = _record(keys, mic, max_seconds=opts.max_seconds)
+    pcm = _record(stop_requested, mic, max_seconds=opts.max_seconds)
     _transcribe_utterance(api_key, pcm, opts, state, json_mode=json_mode)


@@ -170,34 +174,30 @@ def run_dictate(opts: DictateOptions, state: AppState, *, json_mode: bool) -> No
     # dictate has no live panel, so the text_mode half is unused — plain
     # transcript text is already the non-JSON default in `_emit`.
     _, json_mode = resolve_output_modes(opts.output_field, json_mode=json_mode)
+    # Resolve credentials before recording: don't capture audio we can't transcribe.
+    api_key = state.resolve_api_key()
+    if opts.prompt and opts.language:
+        # The server ignores language_code whenever a custom prompt is set;
+        # never drop a requested flag silently (mirrors the speak warnings).
+        output.emit_warning(
+            "--language is ignored when --prompt is set; state the language inside the prompt.",
+            json_mode=json_mode,
+        )
+    if opts.once and not state.quiet:
+        # Deprecation trap, not removal: --once still parses so old scripts don't
+        # break, but recording one utterance and exiting is now the default, so the
+        # flag does nothing — say so once (mirrors `login`).
+        output.emit_warning(
+            "--once is now the default and can be omitted.",
+            json_mode=json_mode,
+        )
     try:
-        # Entering TerminalKeys validates the terminal (a usage precondition)
-        # before credentials, so a piped stdin reads as "needs a terminal" — not
-        # as a login prompt.
-        with TerminalKeys() as keys:
-            api_key = state.resolve_api_key()
-            if opts.prompt and opts.language:
-                # The server ignores language_code whenever a custom prompt is set;
-                # never drop a requested flag silently (mirrors the speak warnings).
-                output.emit_warning(
-                    "--language is ignored when --prompt is set; "
-                    "state the language inside the prompt.",
-                    json_mode=json_mode,
-                )
-            if opts.once and not state.quiet:
-                # Deprecation trap, not removal: --once still parses so old scripts
-                # don't break, but recording one utterance and exiting is now the
-                # default, so the flag does nothing — say so once (mirrors `login`).
-                output.emit_warning(
-                    "--once is now the default and can be omitted.",
-                    json_mode=json_mode,
-                )
-            # Recording auto-starts and exits after one utterance: a single
-            # keystroke stops the capture, which also closes a piped stdout so
-            # `assembly dictate | assembly llm …` unblocks the downstream command.
-            _capture_and_transcribe(keys, api_key, opts, state, json_mode=json_mode)
+        # Recording auto-starts and exits after one utterance: SIGTERM stops the
+        # capture, which also closes a piped stdout so `assembly dictate | assembly
+        # llm …` unblocks the downstream command.
+        with stop_on_terminate() as stop_requested:
+            _capture_and_transcribe(stop_requested, api_key, opts, state, json_mode=json_mode)
     except KeyboardInterrupt:
-        # Ctrl-C cancels dictation, so it exits 130 (cancel) — distinct from `q`, which
-        # ends the session normally (exit 0). The with-block above already restored the
-        # terminal on the way out.
+        # Ctrl-C / SIGINT cancels dictation, so it exits 130 (cancel) — distinct from
+        # SIGTERM, which finishes the utterance normally (exit 0).
         raise typer.Exit(code=errors.CANCELLED_EXIT_CODE) from None
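
Because `_record` now takes a plain `Callable[[], bool]` instead of a `TerminalKeys` object, its stop and cap logic can be exercised with no signal handling at all. The sketch below is a simplified standalone copy of the loop, driven by a scripted latch and a fake microphone. `record`, `scripted_latch`, `silence`, and `CHUNK_BYTES` are illustrative names, not the repository's test helpers. The 3200-byte chunk assumes exactly 100 ms of PCM16 mono at 16 kHz, and the `try`/`finally` that releases the real microphone is omitted.

```python
from collections.abc import Callable, Iterable, Iterator

TARGET_RATE = 16000
BYTES_PER_SECOND = TARGET_RATE * 2  # PCM16 mono, as in the diff
CHUNK_BYTES = BYTES_PER_SECOND // 10  # assumed: one 100 ms chunk = 3200 bytes


def record(
    stop_requested: Callable[[], bool], mic: Iterable[bytes], *, max_seconds: float
) -> bytes:
    """Collect chunks until the cap is reached or the latch reports a stop."""
    pcm = bytearray()
    for chunk in mic:
        pcm += chunk
        if len(pcm) >= int(max_seconds * BYTES_PER_SECOND):
            break  # duration cap is checked first, as in the diff
        if stop_requested():
            break  # polled once per chunk, so a stop costs at most one chunk
    return bytes(pcm)


def scripted_latch(answers: list[bool]) -> Callable[[], bool]:
    """Fake stop latch: replays `answers`, then reports True forever."""
    replay = iter(answers)
    return lambda: next(replay, True)


def silence() -> Iterator[bytes]:
    """Fake microphone: an endless stream of zeroed 100 ms chunks."""
    while True:
        yield bytes(CHUNK_BYTES)


# "SIGTERM" arrives during the third chunk: three chunks are kept.
stopped = record(scripted_latch([False, False, True]), silence(), max_seconds=120)
# The latch never fires, so the 0.5 s cap ends the recording instead.
capped = record(lambda: False, silence(), max_seconds=0.5)
```

Under these assumptions `stopped` holds 9600 bytes (three chunks) and `capped` holds 16000 bytes (0.5 s at 32000 bytes per second). The same arithmetic puts the 80 ms floor at 2560 bytes and the 120 s cap at 3,840,000 bytes.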
