AssemblyAI · alexkroman · Jun 19, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -25,7 +25,7 @@ uv run assembly --help            # run the CLI from the locked environment
 
 Dev tooling is a PEP 735 `[dependency-groups]` group with `default-groups = ["dev"]`, not a `[project]` extra — `uv sync --extra dev` errors.
 
-`scripts/check.sh` is the authoritative gate; keep this list in sync with it. It runs, in order: `uv lock --check` → `ruff check` → `ruff format --check` → `mypy` → `pyright` (src strict) → `pyright` (tests) → `vulture` (dead code) → `deptry` (dependency hygiene) → `lint-imports` (import-linter architecture contracts) → max-file-length (500 lines) → `xenon` (cyclomatic complexity: function max B, module avg A, project avg A) → `swiftlint` + swift compile (macOS only, skipped elsewhere) → `markdownlint` → `codespell` (spell-check code/comments/docs via `uvx`; config in `[tool.codespell]`) → `prettier` (init template JS/CSS) → `shellcheck` → `actionlint` + `zizmor` (workflow lint/audit) → `gitleaks` (secret scan) → generated `--show-code` compile gate → init template contract gate → unused snapshot/fixture gate (`scripts/unused_fixtures_gate.py`: orphaned `.ambr`/API fixtures, since xdist disables syrupy's own unused detection) → docs consistency gate (`scripts/docs_consistency_gate.py`: REFERENCE.md/README.md env vars, exit codes, and `assembly …` command refs stay in sync with the code) → docstring coverage gate (`scripts/docstring_coverage_gate.py`: public-API docstring ratchet, an `interrogate` stand-in that handles PEP 695 generics) → `brew audit --strict` (the shipped `Formula/assembly.rb`; self-skips without Homebrew) → `pytest` (90% branch coverage) → `diff-cover` (100% patch coverage vs `origin/main`) → **mutation gate** (diff-scoped: mutates each changed line and reruns the tests that cover it — a surviving mutant fails the gate, so changed lines need assertions that would *fail* if the line broke, not just coverage; suppress a genuinely unassertable line with `# pragma: no mutate`) → a "no new escape hatches" gate (`# type: ignore` / `# noqa` / `pragma: no cover` / `Any` / `cast(` / test skip/xfail/sleep, all **count-gated against the merge-base** so moving an existing hatch in a refactor doesn't false-positive but a net-new one fails) → `uv build` + `twine check --strict`. The `vulture`/`deptry`/`lint-imports`/`xenon`, patch-coverage, and mutation stages catch the failures that `ruff`+`mypy` alone won't — don't claim the gate is green until the script prints `All checks passed.` **CodeQL is intentionally NOT in this gate** — it's the slowest check (~minutes) and is enforced separately by the `codeql.yml` workflow (which also covers CI; `check.sh` self-skipped it on the hosted runner anyway), so dropping it keeps the local gate fast with no loss of CI coverage. `scripts/codeql_gate.py` still exists to reproduce a code-scanning alert locally (`uv run python scripts/codeql_gate.py`).
+`scripts/check.sh` is the authoritative gate; keep this list in sync with it. It runs, in order: `uv lock --check` → `ruff check` → `ruff format --check` → `mypy` → `pyright` (src strict) → `pyright` (tests) → `vulture` (dead code) → `deptry` (dependency hygiene) → `lint-imports` (import-linter architecture contracts) → max-file-length (500 lines) → `xenon` (cyclomatic complexity: function max B, module avg A, project avg A) → `swiftlint` + swift compile (macOS only, skipped elsewhere) → `markdownlint` → `codespell` (spell-check code/comments/docs via `uvx`; config in `[tool.codespell]`) → `prettier` (init template JS/CSS) → `shellcheck` → `actionlint` + `zizmor` (workflow lint/audit) → `gitleaks` (secret scan) → generated `--show-code` compile gate → init template contract gate → unused snapshot/fixture gate (`scripts/unused_fixtures_gate.py`: orphaned `.ambr`/API fixtures, since xdist disables syrupy's own unused detection) → docs consistency gate (`scripts/docs_consistency_gate.py`: REFERENCE.md/README.md env vars, exit codes, and `assembly …` command refs stay in sync with the code) → docstring coverage gate (`scripts/docstring_coverage_gate.py`: public-API docstring ratchet, an `interrogate` stand-in that handles PEP 695 generics) → `brew audit --strict` (the shipped `Formula/assembly.rb`; self-skips without Homebrew) → `pytest` (90% branch coverage) → Textual TUI coverage (≥90% on the `textual`-importing modules — a per-surface floor so a fragile TUI module can't rot under the project-wide average; the module set is derived from the `textual` import and reuses the pytest `.coverage`, no re-run) → `diff-cover` (100% patch coverage vs `origin/main`) → **mutation gate** (diff-scoped: mutates each changed line and reruns the tests that cover it — a surviving mutant fails the gate, so changed lines need assertions that would *fail* if the line broke, not just coverage; suppress a genuinely unassertable line with `# pragma: no mutate`) → a "no new escape hatches" gate (`# type: ignore` / `# noqa` / `pragma: no cover` / `Any` / `cast(` / test skip/xfail/sleep, all **count-gated against the merge-base** so moving an existing hatch in a refactor doesn't false-positive but a net-new one fails) → `uv build` + `twine check --strict`. The `vulture`/`deptry`/`lint-imports`/`xenon`, patch-coverage, and mutation stages catch the failures that `ruff`+`mypy` alone won't — don't claim the gate is green until the script prints `All checks passed.` **CodeQL is intentionally NOT in this gate** — it's the slowest check (~minutes) and is enforced separately by the `codeql.yml` workflow (which also covers CI; `check.sh` self-skipped it on the hosted runner anyway), so dropping it keeps the local gate fast with no loss of CI coverage. `scripts/codeql_gate.py` still exists to reproduce a code-scanning alert locally (`uv run python scripts/codeql_gate.py`).
 
 **Commits are gated.** On success `check.sh` records a working-tree signature (`scripts/gate_marker.py record` → `.git/aai-gate-pass`), and a PreToolUse hook (`.claude/hooks/require-gate-before-commit.sh`) blocks `git commit` unless that signature still matches — so run the full gate to completion *before* committing (a single-file `pytest` does not satisfy it), and re-run it after any further edit. Iterate with the fast targeted commands above, gate once at the end. For a deliberate work-in-progress commit, prefix `AAI_ALLOW_COMMIT=1 git commit …`.
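
On the Textual TUI coverage stage added to the gate list above: a rough arithmetic sketch of why a project-wide average can stay green while one module rots. The statement counts are invented for illustration and ignore branch coverage; they are not measurements from this repo.

```python
# Hypothetical (statements, covered) counts per module; not measured from this repo.
modules = {
    "aai_cli/core.py": (900, 855),           # 95% covered
    "aai_cli/code_agent/tui.py": (100, 60),  # 60% covered
}


def percent(pairs):
    """Aggregate coverage percentage over (statements, covered) pairs."""
    total = sum(statements for statements, _ in pairs)
    covered = sum(hit for _, hit in pairs)
    return 100 * covered / total


project = percent(modules.values())
tui_only = percent([modules["aai_cli/code_agent/tui.py"]])

print(f"project-wide: {project:.1f}%")  # passes a 90% average gate
print(f"TUI surface:  {tui_only:.1f}%")  # fails a 90% per-surface floor
```

With these made-up numbers the large, well-covered module carries the average, which is the situation the per-surface floor is meant to catch.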
 

diff --git a/aai_cli/code_agent/modals.py b/aai_cli/code_agent/modals.py
@@ -65,8 +65,10 @@ class ApprovalScreen(ModalScreen[str]):
 
     DEFAULT_CSS = """
     ApprovalScreen { align: center bottom; background: transparent; }
+    /* width: 100% (not 1fr) so the box honors its 1-col side margins — a docked 1fr container
+       ignores horizontal margin and overflows the screen, clipping the right border off-edge. */
     ApprovalScreen #approvalbox {
-        dock: bottom; width: 1fr; height: auto;
+        dock: bottom; width: 100%; height: auto;
         border: round #f59e0b; background: #000000; padding: 0 1; margin: 0 1 1 1;
     }
     ApprovalScreen #approvalbox Label { height: auto; }
@@ -163,7 +165,7 @@ class AskScreen(ModalScreen[str]):
     DEFAULT_CSS = """
     AskScreen { align: center bottom; background: transparent; }
     AskScreen #askbox {
-        dock: bottom; width: 1fr; height: auto;
+        dock: bottom; width: 100%; height: auto;
         border: round #3a3f55; background: #000000; padding: 0 1; margin: 0 1 1 1;
     }
     """

diff --git a/aai_cli/code_agent/tui.py b/aai_cli/code_agent/tui.py
@@ -75,7 +75,9 @@ class CodeAgentApp(_VoiceLegs):
     /* The transcript is a scroll container of mounted message widgets (not a RichLog), so the
        reply streams in place and tool output can expand/collapse. */
     #log {{ height: 1fr; border: none; background: #000000; padding: 1 2; }}
-    #promptbar {{ dock: bottom; height: 3; background: #000000; border: round #3a3f55; margin: 1 1; }}
+    /* width: 100% (not the 1fr default) so the bordered box fits inside its 1-col side margins;
+       a docked 1fr container ignores horizontal margin and overflows, clipping the right border. */
+    #promptbar {{ dock: bottom; height: 3; width: 100%; background: #000000; border: round #3a3f55; margin: 1 1; }}
     #promptmark {{ width: 3; color: {banner.BRAND_HEX}; content-align: center middle; }}
     #prompt {{ border: none; background: #000000; padding: 0; }}
     /* Shown in place of the prompt while voice capture is on (Ctrl-V brings the prompt back). */
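
A toy model of the overflow that the `width: 1fr` → `width: 100%` change in `modals.py` and `tui.py` addresses. It only encodes the behavior the CSS comments describe (a docked `1fr` box keeps the full screen width while still being offset by its left margin); it is not Textual's layout engine, and `Box` and `docked_box` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: int      # column of the left edge
    width: int  # outer width, border included

    @property
    def right(self) -> int:
        """Column just past the right edge."""
        return self.x + self.width


def docked_box(screen_width: int, margin_left: int, margin_right: int,
               honors_margin: bool) -> Box:
    """Place a docked box; `honors_margin` models `100%`, its absence models `1fr`."""
    if honors_margin:
        width = screen_width - margin_left - margin_right
    else:
        # Behavior described in the CSS comment: margin shifts the box but does not shrink it.
        width = screen_width
    return Box(x=margin_left, width=width)


SCREEN = 100
overflowing = docked_box(SCREEN, 1, 1, honors_margin=False)  # models width: 1fr
fitting = docked_box(SCREEN, 1, 1, honors_margin=True)       # models width: 100%

print(overflowing.right, fitting.right)
```

Under this model the `1fr` box ends one column past the screen edge, which is the kind of condition a pilot assertion such as `box.region.right <= 100` would pin.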

diff --git a/pyproject.toml b/pyproject.toml
@@ -117,6 +117,12 @@ dev = [
     # failure instead of a wedged session (not in addopts — opt-in per run).
     "pytest-timeout>=2.3.1",
     "time-machine>=3.1.0",
+    # Visual-regression snapshots for the Textual TUIs (`assembly code` / `live`): the
+    # `snap_compare` fixture renders an app to SVG and diffs it against a committed golden,
+    # catching CSS/layout/docking regressions the behavioral pilot tests can't see. Stores
+    # SVGs under tests/__snapshots__/<module>/ (regenerate with --snapshot-update like the
+    # .ambr goldens). See tests/AGENTS.md "Textual visual snapshots".
+    "pytest-textual-snapshot>=1.0.0",
     "hypothesis>=6.155.1",
     "ruff>=0.15.15",
     "mypy>=2.1.0",

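The golden-file workflow that the new `pytest-textual-snapshot` dependency brings can be pictured with a stdlib-only sketch. This is not the plugin's implementation or API: `check_golden`, the file name, and the one-line "render" are hypothetical stand-ins for rendering an app to SVG and comparing it with a committed golden.

```python
import difflib
import tempfile
from pathlib import Path


def check_golden(rendered: str, golden: Path, update: bool = False) -> list[str]:
    """Return unified-diff lines against the golden; an empty list means a match."""
    if update:
        # Stand-in for an update run: bless the current render as the new golden.
        golden.parent.mkdir(parents=True, exist_ok=True)
        golden.write_text(rendered, encoding="utf-8")
        return []
    expected = golden.read_text(encoding="utf-8") if golden.exists() else ""
    return list(difflib.unified_diff(
        expected.splitlines(), rendered.splitlines(),
        fromfile="golden", tofile="rendered", lineterm="",
    ))


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "__snapshots__" / "demo" / "prompt_bar.raw"
    check_golden("| > prompt |", golden, update=True)  # record the baseline
    same = check_golden("| > prompt |", golden)        # unchanged render
    clipped = check_golden("| > prompt ", golden)      # right border lost

print(same)
print("\n".join(clipped))
```

The update path writes whatever was rendered, so a wrong frame can be blessed just as easily as a right one; the comparison only protects frames that were correct when recorded.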
diff --git a/scripts/check.sh b/scripts/check.sh
@@ -238,6 +238,21 @@ echo "==> pytest (with branch-coverage gate)"
 # splitting it across workers is safe.
 uv run pytest -q --strict-config --strict-markers -n auto -m "not e2e and not install" --cov=aai_cli --cov-branch --cov-context=test --cov-report=term-missing --cov-report=xml --cov-fail-under=90
 
+echo "==> Textual TUI coverage (>=90% on the textual-importing modules)"
+# The project-wide 90% gate above is an average, so a TUI module can rot while the rest
+# of the suite carries it. The Textual TUIs (`assembly code` / `live`) are the most
+# layout-fragile, regression-prone surface in the repo (see tests/AGENTS.md), so hold
+# them to their own >=90% floor. The module set is *derived* — every aai_cli file that
+# imports `textual` — so a new TUI module is picked up automatically with no list to
+# hand-maintain. Reuses the .coverage data the pytest step just wrote (no re-run), and
+# counts branches because that data was collected with --cov-branch.
+tui_modules="$(git grep -lP '^\s*(from|import) textual' -- 'aai_cli/**/*.py' | paste -sd, - || true)"
+if [[ -z "$tui_modules" ]]; then
+  echo "   no textual-importing modules found (is the derive pattern stale?)"
+  exit 1
+fi
+uv run coverage report --include="$tui_modules" --fail-under=90
+
 echo "==> diff-cover (patch coverage: every changed line must be tested)"
 # The 90% gate above is project-wide, so new code can ride on the existing suite and
 # stay untested. diff-cover requires 100% coverage of the lines changed versus the
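
The module derivation in the new Textual coverage step can be approximated in Python, which may help when debugging an unexpectedly empty or oversized module set. This is a sketch under assumptions: it walks the filesystem rather than asking git for tracked files, and `textual_modules` is a hypothetical helper, not something in the repo.

```python
import re
import tempfile
from pathlib import Path

# Same intent as the shell pattern '^\s*(from|import) textual'.
TEXTUAL_IMPORT = re.compile(r"^\s*(from|import) textual", re.MULTILINE)


def textual_modules(package_root: Path) -> list[str]:
    """Return repo-relative paths of .py files under package_root that import textual."""
    found = []
    for path in sorted(package_root.rglob("*.py")):
        if TEXTUAL_IMPORT.search(path.read_text(encoding="utf-8")):
            found.append(path.relative_to(package_root.parent).as_posix())
    return found


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "aai_cli"
    (pkg / "code_agent").mkdir(parents=True)
    # Made-up file contents, just enough to exercise the pattern.
    (pkg / "code_agent" / "tui.py").write_text(
        "from textual.app import App\n", encoding="utf-8")
    (pkg / "code_agent" / "modals.py").write_text(
        "    import textual\n", encoding="utf-8")
    (pkg / "config.py").write_text(
        "import os  # textual appears only in a comment\n", encoding="utf-8")
    modules = textual_modules(pkg)

# Comma-joined, the shape the script passes to `coverage report --include=...`.
include_arg = ",".join(modules)
print(include_arg)
```

An empty result here corresponds to the branch in the script that exits 1 instead of letting the floor pass on an empty module set.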

diff --git a/tests/AGENTS.md b/tests/AGENTS.md
@@ -20,6 +20,25 @@ CLI output is pinned by **syrupy snapshot tests** (`tests/__snapshots__/*.ambr`)
 
 The `--help` goldens are split per command group (`tests/test_snapshots_help_<group>.py`) so concurrent branches touching different commands regenerate *different* `.ambr` files. The partition (`HELP_GROUPS` in `tests/_snapshot_surface.py`) is **derived from each command module's `SPEC.panel`** (see `aai_cli/command_registry.py`), so a new command lands in the right group automatically; `tests/test_snapshots_help_groups.py` guards that the derived partition matches the live Typer tree. The root `assembly --help` screen — which every new command changes — has its own golden (`tests/test_snapshots_help_root.py`), so that churn stays confined to one trivially-regenerable `.ambr` file.
 
+## Textual visual snapshots (the `code` / `live` TUIs)
+
+The two Textual apps — `CodeAgentApp` (`assembly code`) and `LiveAgentApp` (`assembly live`) — are **the most layout-fragile surface in the repo**: a one-line CSS edit (a dock, a width, a margin, a transparent background) silently shifts the whole painted frame, and the pilot tests (`test_code_tui.py` / `test_live_tui.py`) only ever assert one widget, region, or flag at a time — they can't see "the modal's right border is now clipped off-screen". So they're backed by **visual-regression snapshots** (`tests/test_tui_snapshots.py`, on top of the `pytest-textual-snapshot` `snap_compare` fixture): each test renders an app (or a pushed modal) to an SVG and diffs it against a committed golden under `tests/__snapshots__/test_tui_snapshots/*.raw`. (This is how the `width: 1fr` → `width: 100%` overflow bug in `#promptbar`/`#approvalbox`/`#askbox` was found — a docked `1fr` container ignores horizontal margin and overflows, and the pilot region asserts never checked the right edge.)
+
+The two layers are complementary, so add to whichever fits: a **behavioral** assertion (a key press changes state, a modal returns a value, a region stays docked) goes in the pilot tests; a **visual** change (chrome, colors, spacing, a new transcript widget) earns a `snap_compare` golden. When a visual fix lands, pin the precise invariant in a pilot test too (e.g. `box.region.right <= 100`) so a mutant is killed deterministically, not only by the SVG diff.
+
+Regenerate after an intentional UI change with `uv run pytest tests/test_tui_snapshots.py --snapshot-update` and **eyeball every changed SVG before committing** — a blessed-but-wrong baseline is worse than no snapshot. (No SVG viewer in a headless session? Reconstruct the text by grouping each `<text>` element's content by its `y` coordinate; that's enough to read the frame and spot a clipped border.)
+
+A Textual app renders non-deterministically unless four things are frozen — all handled by `tests/_tui_snapshot.py` (read its module docstring before adding a test):
+
+- **`banner.version()`** in the splash is the hatch-vcs git-tag string (`v0.1.devN+g<sha>`), different on every commit — `pin_banner_version` freezes it.
+- **The voice bar's meter** advances on a 0.3s `set_interval`; the frame at screenshot time depends on wall-clock scheduling — `freeze_animation` pins it to one frame and stops the timer (and the spinner's).
+- **`LiveAgentApp` starts the blocking cascade on a worker thread on mount**, which `exit()`s the app before the screenshot — `build_live_app` returns a subclass whose `_start` is a no-op, and the test drives the transcript methods directly.
+- **The code status line** renders the cwd, git branch, and `~`-abbreviated home (all machine/platform-specific) — `stable_workdir` pins `Path.home` and builds a fixed `~/demo` cwd with a fake `.git/HEAD`.
+
+The `.raw` SVGs live in a `tests/__snapshots__/test_tui_snapshots/` **subdirectory**, so `scripts/unused_fixtures_gate.py` (which globs only top-level `*.ambr`) doesn't police them — delete a renamed test's stale `.raw` by hand.
+
+On top of the project-wide 90% gate, `check.sh` enforces a **per-surface ≥90% coverage floor on the Textual modules** (every `aai_cli` file that imports `textual` — derived, not hand-listed — reusing the pytest `.coverage`), so a fragile TUI module can't rot while the rest of the suite carries the average. Keep these modules well-covered by the pilot tests; a new TUI module is held to the floor automatically.
+
 ## Hermeticity (enforced three ways)
 
 The suite is hermetic by construction (`tests/conftest.py` + `pyproject.toml` `[tool.pytest.ini_options]`): **pytest-randomly** shuffles order, an autouse `pin_timezone` fixture pins `TZ` to a fixed non-UTC zone (UTC-normalized rendering must be unaffected; use **time-machine** to freeze `now`), and **pytest-socket** (`--disable-socket`) blocks real network so an unmocked SDK/HTTP call fails loudly instead of hitting the API. A test that only binds a loopback server opts back in with the tight `@pytest.mark.allow_hosts(["127.0.0.1"])` (still blocks external hosts). The `e2e`/`install` marker suites legitimately reach the real network in-process (PyPI reachability probes, real-API runs), so a `pytest_collection_modifyitems` hook in `conftest.py` auto-grants them full sockets — adding a network marker is all that's needed, no per-test `enable_socket`.
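
For the headless-session tip in the Textual visual snapshots section (group each `<text>` element by its `y` coordinate), a minimal stdlib sketch. It assumes each `<text>` element carries numeric `x` and `y` attributes in the SVG namespace; the sample SVG is made up, and `svg_to_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_to_rows(svg: str) -> list[str]:
    """Rebuild the frame's text rows: group <text> nodes by y, order each row by x."""
    root = ET.fromstring(svg)
    rows = defaultdict(list)
    for node in root.iter(f"{SVG_NS}text"):
        x = float(node.get("x", "0"))
        y = float(node.get("y", "0"))
        rows[y].append((x, "".join(node.itertext())))
    return [
        # Non-breaking spaces are turned into plain spaces for readability.
        "".join(text for _, text in sorted(cells)).replace("\xa0", " ")
        for _, cells in sorted(rows.items())
    ]


# Made-up sample; elements are deliberately out of order within the first row.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <text x="12.2" y="20">world</text>
  <text x="0" y="20">hello&#160;</text>
  <text x="0" y="44.4">second row</text>
</svg>"""

rows = svg_to_rows(SAMPLE)
for row in rows:
    print(row)
```

This drops the horizontal gaps between text runs, so it is only good for reading the frame's content and noticing a missing border character, not for checking exact column positions.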