
assembly code/live: voice-interrupt UX, modal dismissal, concise speech, gemini live default #252

Merged
alexkroman merged 10 commits into main from assembly-voice-ux
Jun 19, 2026

Conversation

@alexkroman

Collaborator

Follow-up UX polish on top of #251 (now merged).

  • Interrupt the readback → resume listening. Escape/Ctrl-C while the voice is speaking now stops the talking and goes back to listening (you can talk over the reply) instead of pausing to text mode. Interrupting while listening still pauses to the text prompt. Ctrl-C only arms the double-press quit when it paused to text.
  • Escape/Ctrl-C dismiss the modals. The approval modal declines the tool; the ask modal returns an empty answer.
  • Concise, speech-ready replies. The assembly code system prompt now tells the model that its prose is read aloud, so it should keep replies to a sentence or two of plain spoken language and put code in fenced blocks (which the readback skips).
  • assembly live defaults to gemini-2.5-flash-lite (low latency for spoken turns); assembly code stays gpt-5.1. Verified the gateway accepts it; --help snapshot updated.

./scripts/check.sh → All checks passed (100% patch coverage, mutation gate, build+twine).

🤖 Generated with Claude Code

alexkroman-assembly and others added 2 commits June 18, 2026 16:24
…ch, gemini live default

- Interrupting the readback (Escape/Ctrl-C while the voice is speaking) now stops the
  talking and resumes listening instead of pausing to text mode; interrupting while
  listening still pauses to the text prompt. Ctrl-C only arms the double-press quit when
  it paused to text, not when it resumed listening.
- Escape/Ctrl-C dismiss the approval modal (declining the tool) and the ask modal
  (empty answer).
- The assembly code system prompt now steers the model to concise, speech-ready prose
  (read aloud), with code kept in fenced blocks the readback skips.
- assembly live defaults to gemini-2.5-flash-lite (low latency for spoken turns);
  assembly code stays gpt-5.1. Verified the gateway accepts it; --help snapshot updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…itions

assembly live (and assembly code) bind tools whose JSON-Schema `parameters` carry
`$schema`/`additionalProperties`/`title`. OpenAI ignores these keys, but Gemini's
function_declarations rejects them with a 400 ("Unknown name \"$schema\""), so every
tool-bound turn failed: the brain graph raised a non-CLIError, the reply worker died
silently, and the live agent never responded.

_GatewayChatOpenAI now strips those keys (recursively) from each tool's parameter
schema in the outgoing request, so a tool-bound request works on every gateway-routed
model. Verified end-to-end: the brain now replies on gemini-2.5-flash-lite. This is
what makes the gemini live default usable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
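
For readers following along, the stripping described in this commit might look roughly like the sketch below. It is not the code from the PR: the function name, the key set constant, and the special handling of `properties` are assumptions made for illustration. Only the three key names come from the commit message.

```python
from typing import Any

# The three keys named in the commit message. The constant name is hypothetical.
UNSUPPORTED_KEYS = {"$schema", "additionalProperties", "title"}


def strip_schema_keys(node: Any) -> None:
    """Remove unsupported keywords from a JSON-Schema-like structure, in place."""
    if isinstance(node, dict):
        for key in UNSUPPORTED_KEYS:
            node.pop(key, None)
        for key, child in node.items():
            if key == "properties" and isinstance(child, dict):
                # Keys in this map are property names, not schema keywords.
                # Keep a field that happens to be called "title", but still
                # clean the sub-schema that describes it.
                for sub_schema in child.values():
                    strip_schema_keys(sub_schema)
            else:
                strip_schema_keys(child)
    elif isinstance(node, list):
        for item in node:
            strip_schema_keys(item)
```

The `properties` branch is a design choice worth calling out: a naive walk that deletes every `title` key would also delete a tool parameter named `title`.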
_strip_schema_keys(function.get("parameters"))


def _strip_schema_keys(node: object) -> None:

Function _strip_schema_keys recursively traverses schema nodes without depth limits or visited tracking; add a max-depth parameter or convert to an iterative traversal to avoid unbounded recursion.

Details

✨ AI Reasoning
A new recursive routine was introduced to walk JSON-Schema-shaped structures and remove keys. It unconditionally recurses into dict/list children (_strip_schema_keys calls itself for each child) with no depth counter, visited set, or maximum depth. Malicious or very deeply nested input could trigger deep recursion and stack overflow. This risk is directly introduced by the added sanitization helpers.

🔧 How do I fix it?
Add depth limiting via counter parameters that are checked and enforced, or replace with iterative approaches using explicit loops or stack data structures. For graphs, combine depth limiting with visited set tracking.
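
One way to apply this suggestion is an explicit stack with a depth bound and a visited set, sketched below. The function name, the `MAX_DEPTH` value of 64, and the choice to raise `ValueError` are assumptions for illustration, not details taken from the PR.

```python
from typing import Any

STRIP_KEYS = {"$schema", "additionalProperties", "title"}  # keys from the commit
MAX_DEPTH = 64  # hypothetical bound, not a value from the PR


def strip_schema_keys_iterative(root: Any, max_depth: int = MAX_DEPTH) -> None:
    """Strip keys in place using an explicit stack instead of recursion."""
    # Each entry: (node, depth, whether the node is a `properties` name map).
    stack = [(root, 0, False)]
    seen = set()  # ids of containers already visited, which guards against cycles
    while stack:
        node, depth, is_property_map = stack.pop()
        if not isinstance(node, (dict, list)) or id(node) in seen:
            continue
        if depth > max_depth:
            raise ValueError(f"schema nested deeper than {max_depth} levels")
        seen.add(id(node))
        if isinstance(node, list):
            stack.extend((item, depth + 1, False) for item in node)
            continue
        if not is_property_map:
            # Only strip where keys are schema keywords, not property names.
            for key in STRIP_KEYS:
                node.pop(key, None)
        for key, child in node.items():
            child_is_map = key == "properties" and not is_property_map
            stack.append((child, depth + 1, child_is_map))
```

Because the walk mutates in place, a `ValueError` raised partway through leaves the schema partially stripped; a caller that needs all-or-nothing behavior could run it on a deep copy.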


alexkroman-assembly and others added 3 commits June 18, 2026 17:04
…definitions

Expanding the earlier $schema/additionalProperties/title fix: the default MCP tools
carry more validation keywords that Gemini's function_declarations rejects
(exclusiveMinimum/Maximum, multipleOf, patternProperties, …), and each one fails a
tool-bound turn with a 400. Strip the full validation/metadata keyword set, keeping the
structural keys. Verified end-to-end: the live brain replies on gemini-2.5-flash-lite
with all 28 default MCP tools loaded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ort warnings

- Give AssistantMessage a top margin in the live TUI so the greeting is separated from
  the splash and each reply is separated from the preceding user turn (scoped to the live
  app's CSS, so `assembly code` is unaffected).
- Suppress firecrawl-py's pydantic "Field name 'json'/'schema' shadows an attribute"
  UserWarnings at the runtime import site (pytest already filters them via pyproject);
  they otherwise leak into the user's terminal whenever a FIRECRAWL_API_KEY is set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When the brain graph failed mid-turn (a gateway 4xx/5xx, a tool raising, a recursion
limit), it raised a non-CLIError, _generate_reply only caught CLIError, and the reply
worker died on a daemon thread — so the agent announced an action ("I'll search…") and
then never came back, with no clue why.

brain._run_graph now converts any graph exception into a CLIError (re-raising CLIErrors
unchanged), and the cascade shows it in the transcript ("(error: …)") and records it,
instead of swallowing it. The user sees *why* a turn produced no answer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
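
The conversion described in this commit can be sketched as follows. The `CLIError` class here is a stand-in (the real one is not shown in the PR), and `run_graph`, `generate_reply`, and the list-based transcript are hypothetical names for illustration. The two message formats are the ones quoted in the review comments on this PR.

```python
class CLIError(Exception):
    """Stand-in for the project's CLIError; the real class is not shown here."""

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message


def run_graph(invoke, payload):
    """Run a graph callable, converting unexpected failures into CLIError."""
    try:
        return invoke(payload)
    except CLIError:
        raise  # already user-facing, so pass it through unchanged
    except Exception as exc:
        raise CLIError(f"the agent couldn't complete the turn: {exc}") from exc


def generate_reply(invoke, payload, transcript: list) -> None:
    """Append either the reply or a visible error note to the transcript."""
    try:
        transcript.append(run_graph(invoke, payload))
    except CLIError as exc:
        # Surface the failure instead of letting the worker die silently.
        transcript.append(f"(error: {exc.message})")
```

Chaining with `from exc` keeps the original traceback available for debugging while the transcript only shows the short message.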
# a langgraph recursion limit. Convert it to a CLIError so the cascade records and
# *surfaces* it (the engine shows it in the transcript) instead of the reply worker
# dying silently and the user getting no answer with no clue why.
raise CLIError(

Embedding the raw exception (f"the agent couldn't complete the turn: {exc}") may expose user/tool data. Redact or sanitize exception text before including it in CLIError messages.

Details

✨ AI Reasoning
The code now catches all Exceptions from the agent graph and raises a CLIError whose message embeds the original exception's string representation. That original exception may include user-controlled data, tool outputs, or other sensitive content. The CLIError is then recorded and shown to the user/UI by other parts of the cascade, so this change increases the risk of leaking unsanitized user input or external payloads to logs/terminal.

🔧 How do I fix it?
Keep sensitive data such as emails, passwords, and tokens out of logs. When logging values tied to a user, prefer a safe identifier like a user ID over the raw input, and strip line breaks from any user-provided text you do log.
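
A minimal sketch of what such sanitizing could look like before the text reaches a CLIError message. The helper name, the regular expressions, and the 200-character limit are all illustrative assumptions; real redaction rules would need to match the project's actual secrets and payloads.

```python
import re

# Illustrative patterns only: "key=value" style secrets and email addresses.
_SECRET = re.compile(r"(?i)(api[_-]?key|token|password|secret)(\s*[=:]\s*)\S+")
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def safe_error_text(exc: BaseException, limit: int = 200) -> str:
    """Return a short, single-line, redacted description of an exception."""
    text = " ".join(str(exc).split())  # collapse line breaks and whitespace runs
    text = _SECRET.sub(r"\1\2[redacted]", text)
    text = _EMAIL.sub("[email]", text)
    if len(text) > limit:
        text = text[:limit] + "..."
    name = type(exc).__name__
    return f"{name}: {text}" if text else name
```

Including the exception type keeps the message useful for diagnosis even when most of the text has been redacted or truncated.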


# brain._run_graph). Show it in the transcript so the turn doesn't just vanish —
# the user sees *why* there was no answer instead of silence.
self._record_error(exc)
self.renderer.reply_started()

Rendering CLIError.message directly to the transcript may leak user/tool data. Sanitize or replace the message with a generic error before showing.

Details

✨ AI Reasoning
In the reply worker's except CLIError handler the code now writes the CLIError.message into the agent transcript via renderer.agent_transcript(f"(error: {exc.message})"). That message originates from exceptions converted earlier (which can contain untrusted user or external content). Displaying it verbatim to the UI increases risk of leaking sensitive or malicious content.

🔧 How do I fix it?
Keep sensitive data such as emails, passwords, and tokens out of logs. When logging values tied to a user, prefer a safe identifier like a user ID over the raw input, and strip line breaks from any user-provided text you do log.


alexkroman-assembly and others added 5 commits June 18, 2026 17:46
…laude-haiku-4-5

- A second Ctrl-C now always quits, even mid-readback: the quit-pending check moved
  ahead of stopping voice, so a spoken turn can't trap you. The first Ctrl-C (and
  Escape) still stops the readback and resumes listening; the second Ctrl-C exits.
  _stop_voice_activity returns None now (its result is no longer branched on).
- assembly live defaults to claude-haiku-4-5-20251001 (low latency for spoken turns);
  assembly code stays gpt-5.1. Config test + --help snapshot updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… can't hang startup

Each MCP server was loaded with an unbounded asyncio.run(get_tools()); a slow or hung
server (npx/uvx cold start, an unreachable host) blocked `assembly live` startup
indefinitely, and a Ctrl-C in that window triggered langchain-mcp-adapters' cancel-time
crash. Wrap the fetch in asyncio.wait_for with a 15-second timeout: a server that won't
list its tools in time is cancelled and skipped (_safe_load turns the TimeoutError into []).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
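
The timeout pattern from this commit, as a self-contained sketch. In the PR the TimeoutError is turned into [] by _safe_load; here both steps are folded into one hypothetical `load_tools` function, and the two fake servers exist only to exercise it.

```python
import asyncio


async def load_tools(get_tools, timeout: float = 15.0) -> list:
    """Fetch a server's tools, returning [] if it does not answer in time."""
    try:
        # wait_for cancels the inner coroutine when the timeout expires.
        return await asyncio.wait_for(get_tools(), timeout=timeout)
    except asyncio.TimeoutError:
        return []  # skip a slow or hung server instead of blocking startup


async def hung_server() -> list:
    """Fake server that never lists its tools in any reasonable time."""
    await asyncio.sleep(3600)
    return ["never returned"]


async def fast_server() -> list:
    """Fake server that answers immediately."""
    return ["web_search"]
```

For example, `asyncio.run(load_tools(hung_server, timeout=0.05))` gives back an empty list after about 50 ms rather than waiting on the hung coroutine.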
…pause speech

Two focused changes to the `assembly live` voice agent (still deepagents-based):

Slim the toolset to just Firecrawl web search. A low-latency spoken turn does
best with one obvious tool rather than a large menu it has to choose among — the
big toolset (URL fetch, docs MCP, and a curated 5-server default MCP set) made
the model narrate "I'll search…" without ever calling anything, and bloated
every request with tool schemas. build_live_tools now returns only the web-search
tool (when FIRECRAWL_API_KEY is set), and no MCP servers load by default
(--mcp-config stays as a strictly opt-in power-user knob; default_servers is
removed). The prompt's capability builder is trimmed to match.

Wire Escape/Ctrl-C to pause speech and return to listening. A new
CascadeSession.interrupt_reply signals the in-flight reply to stop (sets the stop
flag + flushes audio) WITHOUT joining the worker — a UI-thread join would
deadlock against the worker's call_from_thread render hops. run_cascade gains an
on_session hook so the live TUI captures the session and binds Escape (interrupt)
and Ctrl-C (interrupt while speaking, else quit); Ctrl-Q always quits as the
guaranteed escape hatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
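
The "signal without joining" idea behind interrupt_reply can be sketched with a `threading.Event`. Everything except the method name `interrupt_reply` is hypothetical here: the `ReplySession` class, the chunk list, and the sleep that stands in for audio playback are illustrative only.

```python
import threading
import time


class ReplySession:
    """Illustrative stand-in for a session that speaks a reply on a worker."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self.spoken = []
        self.flushed = False
        self.worker = None

    def start(self, chunks) -> None:
        self._stop.clear()
        self.flushed = False
        self.worker = threading.Thread(target=self._speak, args=(chunks,), daemon=True)
        self.worker.start()

    def _speak(self, chunks) -> None:
        for chunk in chunks:
            if self._stop.is_set():
                break  # the worker notices the flag and winds down on its own
            self.spoken.append(chunk)
            time.sleep(0.01)  # stands in for playing one chunk of audio

    def interrupt_reply(self) -> None:
        """Signal the reply to stop. Deliberately does not join the worker."""
        self._stop.set()
        self.flushed = True  # stands in for flushing queued audio
```

Returning immediately matters when the caller is a UI thread: if the worker is itself waiting on that thread to render something, a join from the UI thread would leave each side waiting on the other.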
A spoken turn that paused to use a tool (web search) sat silent on "thinking…",
which read as a hang. The brain now feeds an on_tool sink a short, speakable label
("Searching the web") as each tool call lands: build_completer's complete_reply
takes an optional on_tool, and the graph is streamed rather than invoked whenever
a sink is wired (not just under -v), so calls surface live.

The cascade engine passes the renderer's tool_call as that sink, so every
front-end shows it: the live TUI drops a dim inline "Searching the web…" note,
the line renderer prints it (stderr in piped text mode), and --json emits a new
additive tool.use event. The Renderer protocol gains tool_call.

Also extracts the shared cascade test fakes into tests/_cascade_fakes.py so the
engine/command/TUI suites share one set of doubles and stay under the 500-line gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
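
A rough sketch of the on_tool sink idea. The event dictionaries, the `TOOL_LABELS` mapping, and the fallback label are assumptions for illustration; the real complete_reply streams a graph rather than iterating a plain list.

```python
from typing import Callable, Iterable, Optional

# Hypothetical mapping from tool name to a short, speakable label.
TOOL_LABELS = {"web_search": "Searching the web"}


def complete_reply(
    events: Iterable[dict],
    on_tool: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume streamed events, reporting tool calls to the sink as they land."""
    parts = []
    for event in events:
        if event["type"] == "tool_call":
            if on_tool is not None:
                label = TOOL_LABELS.get(event["name"], f"Using {event['name']}")
                on_tool(label)
        elif event["type"] == "text":
            parts.append(event["text"])
    return "".join(parts)
```

Passing `print` as the sink would mimic a line renderer, while a TUI could pass a method that appends a dim inline note.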
A Ctrl-C during the voice TUI's setup — opening the mic, building the deepagents
graph, loading --mcp-config servers — lands before Textual captures the keyboard,
so it surfaced as a raw KeyboardInterrupt (and, mid asyncio.run/threading
teardown, a noisy traceback). The line-renderer path already mapped this to a
clean exit 130; the TUI dispatch did not. Extract a _launch_tui helper that wraps
_run_live_tui and maps a setup-time KeyboardInterrupt to typer.Exit(130), matching
the assembly code TUI. (In-session Ctrl-C is already a Textual binding, so it never
reaches the graph as an exception.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
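
The mapping described in this commit, sketched with the standard library only. The PR raises typer.Exit(130); this sketch substitutes `SystemExit(130)` so it runs without typer installed, and `launch_tui` / `run_tui` are illustrative names rather than the project's helpers.

```python
def launch_tui(run_tui) -> None:
    """Run the TUI entry point, turning a setup-time Ctrl-C into a clean exit."""
    try:
        run_tui()
    except KeyboardInterrupt:
        # 130 = 128 + 2 (SIGINT), the conventional shell status for Ctrl-C.
        # `from None` suppresses the chained traceback so nothing noisy prints.
        raise SystemExit(130) from None
```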
@alexkroman alexkroman added this pull request to the merge queue Jun 19, 2026
Merged via the queue into main with commit 8bdf6b7 Jun 19, 2026
20 checks passed
@alexkroman alexkroman deleted the assembly-voice-ux branch June 19, 2026 02:13

2 participants