cjus · May 18, 2026
diff --git a/.env.example b/.env.example
@@ -143,3 +143,70 @@ SOLRAC_WEB_TOKEN=
 # Synthetic chat id all web traffic shares. Negative integer; default −1000
 # avoids collision with real Telegram chat ids.
 # SOLRAC_WEB_CHAT_ID=-1000
+
+# ── Voice (ElevenLabs STT + TTS) ────────────────────────────────────────────
+# Off by default. When VOICE_ENABLED=true:
+#   - Web UI gets a mic button (record → STT pre-fills the composer) and a
+#     per-message speak button (TTS plays in the browser).
+#   - Telegram voice notes (msg.voice) are transcribed and fed through the
+#     normal text path.
+#   - `/voice on` (per chat, sticky) injects a word-limit hint into the LLM
+#     prompt AND attaches a Telegram voice note to each reply.
+#
+# REQUIRED when VOICE_ENABLED=true (boot fails loud otherwise):
+#   ELEVENLABS_API_KEY — from https://elevenlabs.io → Profile + API Keys
+#                        (starts `sk_…`). Restrict the key to Text-to-Speech
+#                        + Speech-to-Text endpoints only.
+#   ELEVENLABS_VOICE_ID — 20-char id from VoiceLab → voice detail page.
+#
+# Both ELEVENLABS_* and VOICE_* keys are scrubbed from the Claude SDK
+# subprocess env (agent.ts::sanitizedSubprocessEnv) so a compromised model
+# can't exfiltrate them via an auto-allowed Bash command.
+#
+# Privacy: audio + transcripts + TTS-bound replies hit ElevenLabs SaaS.
+# SOUL.md / SOLRAC.md never leave the host, but the speech does.
+#
+# VOICE_ENABLED=true
+# ELEVENLABS_API_KEY=sk_REPLACE_ME
+# ELEVENLABS_VOICE_ID=REPLACE_ME
+
+# TTS model. eleven_flash_v2_5 is $0.05/1k chars, low latency. Other options:
+# eleven_turbo_v2_5 (similar price/latency), eleven_multilingual_v2 ($0.10/1k,
+# better quality). Pin this to match your plan and preference.
+# ELEVENLABS_TTS_MODEL=eleven_flash_v2_5
+
+# STT model. scribe_v2 went GA March 2026 and replaces v1.
+# ELEVENLABS_STT_MODEL=scribe_v2
+
+# Hard wall — TTS requests over this length are refused with HTTP 413 + a
+# user-visible chat hint. The voice-mode prompt nudge (see below) defends
+# against this softly; the wall is the last line of defense.
+# VOICE_TTS_MAX_CHARS=3000
+
+# Soft target — when `/voice on` is set for a chat, this many words is the
+# budget injected into the LLM prompt ("respond in under N words"). The
+# model may use up to 3× when the user explicitly asks for more.
+# Clamped to [30, 200] at boot (out-of-range values warn, then clamp).
+# VOICE_REPLY_WORDS_HINT=60
+
+# STT upload limits. Web `/api/stt` rejects oversized bodies before paying
+# Scribe; Telegram voice-note download is bounded by the same byte cap.
+# VOICE_STT_MAX_BYTES=2097152          # 2 MiB
+# VOICE_STT_MAX_SECONDS=60             # client MediaRecorder stops at this
+
+# Independent voice cost cap (Anthropic burn is separate). Sliding 60-min
+# windows over voice_events.cost_usd_estimate.
+# VOICE_HOURLY_COST_CAP_USD=0.25       # per-chat
+# VOICE_GLOBAL_HOURLY_COST_CAP_USD=1.00  # host-wide
+
+# Output formats. Web uses MP3 (plays everywhere). Telegram uses Ogg/Opus
+# (sendVoice). §17 probe (May 2026) confirmed ElevenLabs returns
+# Ogg-containerized Opus for opus_48000_64; if a future change flips that
+# to raw Opus, set this to mp3_44100_64 and Telegram path uses sendAudio.
+# ELEVENLABS_TTS_OUTPUT_FORMAT_WEB=mp3_44100_64
+# ELEVENLABS_TTS_OUTPUT_FORMAT_TG=opus_48000_64
+
+# Pricing constants used for cap math + voice_events.cost_usd_estimate.
+# Pin to your ElevenLabs plan if it differs from the published defaults.
+# ELEVENLABS_TTS_PRICE_USD_PER_1K_CHARS=0.05
+# ELEVENLABS_STT_PRICE_USD_PER_HOUR=0.22
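
The two pricing constants and the `VOICE_REPLY_WORDS_HINT` clamp described in the `.env.example` hunk above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names (`estimateTtsCostUsd`, `estimateSttCostUsd`, `clampWordsHint`) are hypothetical and not taken from the PR, and rounding the hint to an integer is an assumption.

```typescript
// Hypothetical helpers; names and shapes are illustrative, not taken from voice.ts.
const TTS_PRICE_USD_PER_1K_CHARS = 0.05; // default of ELEVENLABS_TTS_PRICE_USD_PER_1K_CHARS
const STT_PRICE_USD_PER_HOUR = 0.22; // default of ELEVENLABS_STT_PRICE_USD_PER_HOUR

// TTS estimate: chars / 1000 x price per 1k characters.
function estimateTtsCostUsd(
  chars: number,
  pricePer1k: number = TTS_PRICE_USD_PER_1K_CHARS,
): number {
  return (chars / 1000) * pricePer1k;
}

// STT estimate: audio seconds / 3600 x price per hour.
function estimateSttCostUsd(
  durationSecs: number,
  pricePerHour: number = STT_PRICE_USD_PER_HOUR,
): number {
  return (durationSecs / 3600) * pricePerHour;
}

// Parse VOICE_REPLY_WORDS_HINT, falling back to 60 and clamping to [30, 200].
// Out-of-range values log a warning before being clamped.
function clampWordsHint(raw: string | undefined, fallback: number = 60): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) return fallback;
  const rounded = Math.round(parsed); // assumption: the hint is a whole number of words
  const clamped = Math.min(200, Math.max(30, rounded));
  if (clamped !== rounded) {
    console.warn(`VOICE_REPLY_WORDS_HINT=${raw} is out of range; clamped to ${clamped}`);
  }
  return clamped;
}
```

Under the default prices, a TTS request at the 3000-character wall is estimated at about $0.15, and a 60-second voice note at roughly $0.0037.
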
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # Solrac
 
-> A self-hosted, hackable personal Agent: free local LLM (Ollama or LMStudio) or remote LLM (OpenRouter) by default, with explicit escalation to Anthropic's Claude Sonnet/Opus via the Claude Agent SDK. Reach it from Telegram or a browser; own every audit row, permission rule, and budget cap.
+> A self-hosted personal Agent you can configure, hack, and **converse with**. Reach it by text from Telegram or a browser, or by voice (ElevenLabs STT + TTS) on either transport. Free local LLM (Ollama / LMStudio) or remote (OpenRouter) by default; escalate to Anthropic's Claude Sonnet (`@`) or Opus (`!`) only when you mean it. Own every audit row, permission rule, and budget cap.
 
 <image src="./docs/solrac.png" width="300px" />
 
@@ -16,11 +16,12 @@ It's deliberately smaller and narrower than other personal-assistant projects:
 Both are broader and better-resourced. **Solrac's distinct value:**
 
 - **BYO-model engine slot.** No-prefix messages route to whichever model source you wire — free on-host (Ollama / LMStudio) or pay-per-token remote (OpenRouter). `@` (Sonnet) and `!` (Opus) are paid Claude escalations only on operator intent.
-- **Cost enforcement, not just visibility.** Sliding per-chat and global hourly USD caps that *deny* turns when hit — they sum every `cost_usd` row (Claude or OpenRouter), so remote-mode burn is gated by the same ceilings without extra configuration. Plus a daily cost-report DM.
-- **Audit-before-acting.** Every update (allowed, denied, queue-full) writes a row to one append-only SQLite table, tagged with the engine that served it (`local:ollama:...`, `remote:openrouter:...`, `claude:primary:...`).
+- **Cost enforcement, not just visibility.** Sliding per-chat and global hourly USD caps that *deny* turns when hit — they sum every `cost_usd` row (Claude or OpenRouter), so remote-mode burn is gated by the same ceilings without extra configuration. Plus a daily cost-report DM. Voice spend (ElevenLabs STT + TTS, when enabled) rides a **second** independent cost-cap axis with its own per-chat + global ceilings.
+- **Voice on every transport.** Telegram voice notes get transcribed; the web UI has a mic button and per-message speak buttons. `/voice on` turns on terse audio replies. ~120 lines of `fetch` against ElevenLabs — no SDK, no realtime WebSocket. Off by default.
+- **Audit-before-acting.** Every update (allowed, denied, queue-full) writes a row to one append-only SQLite table, tagged with the engine that served it (`local:ollama:...`, `remote:openrouter:...`, `claude:primary:...`). Voice gets a parallel `voice_events` log — every STT/TTS attempt (allowed, capped, denied, errored) is recorded.
 - **Single-process minimalism.** No HTTP framework, no Telegram framework runtime, no queue server, no Docker, no sub-agents. A few thousand lines of TypeScript you can read in an afternoon and fork.
 
-If you need multi-tenancy, voice wake, mobile companions, or 25 chat platforms, use OpenClaw or Hermes. If you want a small, cost-capped, fully audited foundation you can bend to your shape, Solrac fits.
+If you need multi-tenancy, always-listening voice wake, mobile companions, or 25 chat platforms, use OpenClaw or Hermes. If you want a small, cost-capped, fully audited foundation — with optional speech-to-text and text-to-speech on Telegram and the browser — that you can bend to your shape, Solrac fits.
 
 ## Quick start
 

diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -1216,6 +1216,151 @@ The existing `policy.ts::createConfirmationBroker` is transport-agnostic — `re
 
 The transport adds `web.ts`, `web-client.ts`, `web-sanitize.ts`, and `markdown.ts`. No HTTP framework, no WebSocket framework, no extra runtime dependencies beyond `marked` (used on both transports). The "no HTTP framework" anti-goal is honored — `Bun.serve` `routes` and `fetch` only, same shape as `server.ts`.
 
+## Voice transport (optional)
+
+Off by default. Enabled via `VOICE_ENABLED=true` + ElevenLabs credentials. Adds two flows — **speech-in** (operator → text prompt) and **speech-out** (assistant reply → audio) — to both Telegram and the web UI. Implementation lives in two modules:
+
+| Module | Role |
+|---|---|
+| `src/elevenlabs.ts` | Typed `fetch` wrapper for ElevenLabs HTTP. STT (`POST /v1/speech-to-text`, multipart) and TTS-stream (`POST /v1/text-to-speech/{voice_id}/stream`, chunked body). ~165 lines, no SDK. |
+| `src/voice.ts` | Orchestration — gate, cost-cap, audit-write, transport delivery. Exports `handleWebStt`, `handleWebTts`, `handleTelegramVoiceStt`, `maybeReplyWithVoice`, `stripMarkdownForSpeech`, `buildVoiceModePrompt`. |
+
+### The two flows
+
+```
+SPEECH-IN (STT)
+   Telegram voice note               Web mic button
+   ───────────────────                ────────────────
+   poll loop → msg.voice              MediaRecorder → /api/stt (multipart)
+       │                                  │
+       │ gateUpdate (allowlist)            │ session-cookie auth
+       │ voice cost cap check              │ voice cost cap check
+       ▼                                  ▼
+   getFile + download bytes           parse multipart, validate size
+       │                                  │
+       └──────► voice.handleXxxStt ◄──────┘
+                    │
+                    ▼
+            ElevenLabs Scribe
+                    │
+                    ▼
+           voice_events row written
+                    │
+       Telegram: synthesize text Update    Web: return { ok, text } →
+       → queue.enqueue (normal turn)       browser pre-fills composer
+
+SPEECH-OUT (TTS)
+   Telegram: post-turn hook           Web: speak button on assistant msg
+   ──────────────────────              ──────────────────────────────────
+   agent/engine done + audit closed   user click → POST /api/tts
+       │                                  │ session-cookie auth
+       │ /voice on? (sessions table)       │ voice cost cap check
+       │ voice cost cap check              │ length wall
+       ▼                                  ▼
+   stripMarkdownForSpeech(final)      stripMarkdownForSpeech(markdown)
+       │                                  │
+       └──────► ElevenLabs TTS-stream ◄────┘
+                    │
+                    ▼
+           voice_events row written
+                    │
+       Telegram: buffer → sendVoice       Web: proxy-stream → <audio>
+       (Ogg/Opus) or sendAudio (MP3)      blob URL → autoplay
+```
+
+### `voice_events` table — separate from `audit`
+
+One turn can produce **multiple** voice events (for example, one STT input plus one TTS output). The `audit` table's two-writes-per-turn shape doesn't fit, so voice gets its own append-only log:
+
+```sql
+CREATE TABLE voice_events (
+  id INTEGER PRIMARY KEY AUTOINCREMENT,
+  chat_id INTEGER NOT NULL,
+  ts_ms INTEGER NOT NULL,
+  kind TEXT NOT NULL CHECK (kind IN ('stt','tts')),
+  source TEXT NOT NULL CHECK (source IN ('web','telegram')),
+  model TEXT NOT NULL,
+  voice_id TEXT,
+  audit_id INTEGER,            -- informational link to audit.id (not FK)
+  duration_ms INTEGER,         -- STT only
+  chars INTEGER,               -- TTS only
+  cost_usd_estimate REAL NOT NULL DEFAULT 0,
+  status TEXT NOT NULL CHECK (status IN ('ok','denied_cap','denied_gate','error')),
+  error_message TEXT
+);
+CREATE INDEX idx_voice_events_chat_ts ON voice_events (chat_id, ts_ms);
+CREATE INDEX idx_voice_events_ts ON voice_events (ts_ms);
+```
+
+`audit_id` is informational only (no FK), so a `denied_gate` STT attempt, which never reaches `audit` because the sender was not allowlisted, still gets a row. Cost-cap queries sum `cost_usd_estimate` over a sliding 60-min window, filtered to `status='ok'` so that denied and errored attempts never count toward the cap.
+
+### Independent voice cost cap
+
+Anthropic burn (`audit.cost_usd`) and ElevenLabs burn (`voice_events.cost_usd_estimate`) are **separate axes**. Each has its own per-chat + global sliding-60-min ceiling:
+
+| Axis | Per-chat env var | Global env var | Default |
+|---|---|---|---|
+| Anthropic | `HOURLY_COST_CAP_USD` | `GLOBAL_HOURLY_COST_CAP_USD` | $1.00 / $4.00 (4×) |
+| Voice | `VOICE_HOURLY_COST_CAP_USD` | `VOICE_GLOBAL_HOURLY_COST_CAP_USD` | $0.25 / $1.00 |
+
+Order of checks inside `voice.ts` mirrors the Anthropic cap shape: **global first, then per-chat** (a host-wide hit shouldn't be masked by a per-chat pass). For STT the gate fires before either cap; for TTS the length wall fires after both caps. Cost is **estimated** at write time (ElevenLabs doesn't return per-call billing on the wire) using the configured price constants:
+
+- **STT:** `audio_duration_secs / 3600 × ELEVENLABS_STT_PRICE_USD_PER_HOUR`
+- **TTS:** `chars / 1000 × ELEVENLABS_TTS_PRICE_USD_PER_1K_CHARS`
+
+Pin the prices to your ElevenLabs plan if the published defaults don't match.
+
+### Voice mode (`sessions.voice_replies` + `/voice on|off`)
+
+Per-chat sticky toggle backing both Telegram TTS attach AND the word-limit prompt nudge. Added to `sessions` as a `0/1` column (idempotent ALTER). The `/voice` command's parser accepts `on`, `off`, `1`, `0`, `true`, `false`, or no-arg (renders current state).
+
+When `voice_replies=1` for a chat, two things happen on every turn:
+
+1. **`voice.ts::buildVoiceModePrompt`** is called at both SOLRAC.md injection sites and adds a `<voice-mode>` block telling the model to keep the reply under `VOICE_REPLY_WORDS_HINT` words (default 60). The block sits **after** SOLRAC.md and **before** the cross-engine OOB block, so operator overlays can override the word limit if needed. The model may use up to 3× the limit when the user explicitly asks for more.
+2. **`maybeReplyWithVoice`** runs as the post-turn hook (after the audit row closes, only on `!isError`). It strips the markdown, checks the cost cap + length wall, calls ElevenLabs TTS, buffers the audio, and sends via `sendVoice` (Ogg/Opus) or `sendAudio` (MP3 fallback). Web turns don't invoke this — the per-message speak button does it on user demand instead.
+
+The post-turn hook is wired as an optional `attachVoiceReply` callback on `AgentRunDeps` and `EngineRunDeps`. Telegram-bound deps carry the callback; web-bound deps don't. The same `VoiceDeps` instance backs both Telegram STT/TTS and web STT/TTS, so the sliding 60-min cap is shared across transports and an operator can't double the budget by talking on web and Telegram simultaneously.
+
+### Footer strip
+
+The `*✅ ...*` line that the agent and engine append to every successful reply (turn count, cost, model) is UI chrome, not content. `voice.ts::stripMarkdownForSpeech` regex-strips that pattern before tokenizing, so TTS never reads "✅ remote:openrouter:z-ai/glm-5.1 · 1 tools · 6.6s · $0.0048" aloud.
+
+The strip also handles standard markdown → speech transforms via `marked.lexer`:
+- Code fences → `[code block omitted]`
+- Tables → `[table omitted]`
+- Lists → comma-joined items
+- Links → text (URL dropped)
+- Headers / bold / italic → unwrap, keep text
+
+### Env scrub additions
+
+`ELEVENLABS_*` and `VOICE_*` are added to `agent.ts::sanitizedSubprocessEnv`'s scrub list. `ELEVENLABS_API_KEY` is a billed credential that the spawned `claude` SDK subprocess has no business reading; `VOICE_*` (cost caps, model ids) shouldn't leak via an auto-allowed `Bash(echo $VOICE_...)`.
+
+### What about the web sanitizer?
+
+`web-sanitize.ts` deliberately excludes `<audio>` from its allowlist. We do NOT widen it. The `<audio>` element on the web UI is injected via `document.createElement` by `app.js`, AFTER sanitization runs on the reply body — the trust boundary doesn't move. The sanitizer is for marked-rendered LLM content; audio playback is UI chrome.
+
+### Dependency direction
+
+```
+elevenlabs.ts  →  log + config
+voice.ts       →  elevenlabs + db + log + config + policy + telegram + marked
+agent.ts       →  + voice (post-turn hook + buildVoiceModePrompt)
+engine.ts      →  + voice (post-turn hook + buildVoiceModePrompt)
+web.ts         →  + voice (handleWebStt, handleWebTts)
+main.ts        →  + voice (handleTelegramVoiceStt dispatcher, maybeReplyWithVoice)
+commands.ts    →  unchanged structurally — /voice command dispatches via db.setVoiceRepliesFlag
+```
+
+No new runtime dependency outside `marked` (already shipped). `fetch` and `FormData` are global in Bun.
+
+### Anti-goal preservation
+
+- `marked` is still the only non-SDK runtime dep — ElevenLabs is raw `fetch`, no SDK.
+- No HTTP framework added (the two new routes ride the existing `Bun.serve` instance).
+- No Telegram framework runtime added (`sendVoice`/`sendAudio` use multipart `fetch`).
+- One-PR-per-feature was reversed deliberately for the voice change (phases 1–5 landed together) — per PLAN.md §16, called out as an explicit re-evaluation.
+
 ## Anti-goals
 
 Decisions deliberately not made. Don't relitigate without strong justification.
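
The check order described under "Independent voice cost cap" in the hunk above (global first, then per-chat, summing only `status='ok'` rows over a sliding 60-minute window) can be sketched over in-memory rows as follows. This is a minimal sketch under stated assumptions: `checkVoiceCap`, `spentInWindow`, and the `CapDecision` type are hypothetical names, the real code queries SQLite rather than an array, and treating "spent is at or above the cap" as a denial is an assumption about the boundary behavior.

```typescript
// Hypothetical types and names; the real voice.ts and db layer may differ.
type VoiceStatus = "ok" | "denied_cap" | "denied_gate" | "error";

interface VoiceEventRow {
  chat_id: number;
  ts_ms: number;
  cost_usd_estimate: number;
  status: VoiceStatus;
}

type CapDecision =
  | { allowed: true }
  | { allowed: false; scope: "global" | "chat"; spentUsd: number };

const WINDOW_MS = 60 * 60 * 1000; // sliding 60-minute window

// Sum estimated cost of successful events in the window, optionally for one chat.
function spentInWindow(rows: VoiceEventRow[], nowMs: number, chatId?: number): number {
  const since = nowMs - WINDOW_MS;
  let total = 0;
  for (const row of rows) {
    if (row.status !== "ok" || row.ts_ms < since) continue; // denials and errors never count
    if (chatId !== undefined && row.chat_id !== chatId) continue;
    total += row.cost_usd_estimate;
  }
  return total;
}

// Global ceiling is checked first so a host-wide hit is not masked by a per-chat pass.
function checkVoiceCap(
  rows: VoiceEventRow[],
  chatId: number,
  nowMs: number,
  perChatCapUsd: number = 0.25, // VOICE_HOURLY_COST_CAP_USD default
  globalCapUsd: number = 1.0, // VOICE_GLOBAL_HOURLY_COST_CAP_USD default
): CapDecision {
  const globalSpent = spentInWindow(rows, nowMs);
  if (globalSpent >= globalCapUsd) {
    return { allowed: false, scope: "global", spentUsd: globalSpent };
  }
  const chatSpent = spentInWindow(rows, nowMs, chatId);
  if (chatSpent >= perChatCapUsd) {
    return { allowed: false, scope: "chat", spentUsd: chatSpent };
  }
  return { allowed: true };
}
```

Because both ceilings read the same rows, a chat that is well under its own cap is still denied once the host-wide total is reached, which matches the ordering rationale given in the architecture notes.
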