diff --git a/.env.example b/.env.example
index 6780d0f..50e226d 100644
--- a/.env.example
+++ b/.env.example
@@ -143,3 +143,70 @@ SOLRAC_WEB_TOKEN=
 # Synthetic chat id all web traffic shares. Negative integer; default −1000
 # avoids collision with real Telegram chat ids.
 # SOLRAC_WEB_CHAT_ID=-1000
+
+# ── Voice (ElevenLabs STT + TTS) ────────────────────────────────────────────
+# Off by default. When VOICE_ENABLED=true:
+#   - Web UI gets a mic button (record → STT pre-fills the composer) and a
+#     per-message speak button (TTS plays in the browser).
+#   - Telegram voice notes (msg.voice) are transcribed and fed through the
+#     normal text path.
+#   - `/voice on` (per chat, sticky) injects a word-limit prompt to the LLM
+#     AND attaches a Telegram voice note to each reply.
+#
+# REQUIRED when VOICE_ENABLED=true (boot fails loud otherwise):
+#   ELEVENLABS_API_KEY — from https://elevenlabs.io → Profile + API Keys
+#                        (starts `sk_…`). Restrict the key to Text-to-Speech
+#                        + Speech-to-Text endpoints only.
+#   ELEVENLABS_VOICE_ID — 20-char id from VoiceLab → voice detail page.
+#
+# Both ELEVENLABS_* and VOICE_* keys are scrubbed from the Claude SDK
+# subprocess env (agent.ts::sanitizedSubprocessEnv) so a compromised model
+# can't exfiltrate them via an auto-allowed Bash command.
+#
+# Privacy: audio + transcripts + TTS-bound replies hit ElevenLabs SaaS.
+# SOUL.md / SOLRAC.md never leave the host, but the speech does.
+#
+# VOICE_ENABLED=true
+# ELEVENLABS_API_KEY=sk_REPLACE_ME
+# ELEVENLABS_VOICE_ID=REPLACE_ME
+
+# TTS model. eleven_flash_v2_5 is $0.05/1k chars, low latency. Other options:
+# eleven_turbo_v2_5 (similar price/latency), eleven_multilingual_v2 ($0.10/1k,
+# better quality). Operator pin to match plan / preference.
+# ELEVENLABS_TTS_MODEL=eleven_flash_v2_5
+
+# STT model. scribe_v2 went GA March 2026 and replaces v1.
+# ELEVENLABS_STT_MODEL=scribe_v2
+
+# Hard wall — TTS requests over this length are refused with HTTP 413 + a
+# user-visible chat hint. The voice-mode prompt nudge (see below) defends
+# against this softly; the wall is the last line of defense.
+# VOICE_TTS_MAX_CHARS=3000
+
+# Soft target — when `/voice on` is set for a chat, this many words is the
+# budget injected into the LLM prompt ("respond in under N words"). The
+# model may use up to 3× when the user explicitly asks for more.
+# Clamps to [30, 200] at boot (out-of-range warns + clamps).
+# VOICE_REPLY_WORDS_HINT=60
+
+# STT upload limits. Web `/api/stt` rejects oversized bodies before paying
+# Scribe; Telegram voice-note download is bounded by the same byte cap.
+# VOICE_STT_MAX_BYTES=2097152 # 2 MiB
+# VOICE_STT_MAX_SECONDS=60 # client MediaRecorder stops at this
+
+# Independent voice cost cap (Anthropic burn is separate). Sliding 60-min
+# windows over voice_events.cost_usd_estimate.
+# VOICE_HOURLY_COST_CAP_USD=0.25 # per-chat
+# VOICE_GLOBAL_HOURLY_COST_CAP_USD=1.00 # host-wide
+
+# Output formats. Web uses MP3 (plays everywhere). Telegram uses Ogg/Opus
+# (sendVoice). §17 probe (May 2026) confirmed ElevenLabs returns
+# Ogg-containerized Opus for opus_48000_64; if a future change flips that
+# to raw Opus, set this to mp3_44100_64 and Telegram path uses sendAudio.
+# ELEVENLABS_TTS_OUTPUT_FORMAT_WEB=mp3_44100_64
+# ELEVENLABS_TTS_OUTPUT_FORMAT_TG=opus_48000_64
+
+# Pricing constants used for cap math + voice_events.cost_usd_estimate.
+# Pin to your ElevenLabs plan if it differs from the published defaults.
+# ELEVENLABS_TTS_PRICE_USD_PER_1K_CHARS=0.05
+# ELEVENLABS_STT_PRICE_USD_PER_HOUR=0.22
diff --git a/README.md b/README.md
index 24d8db6..ce2bf8b 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Solrac
-> A self-hosted, hackable personal Agent: free local LLM (Ollama or LMStudio) or remote LLM (OpenRouter) by default, with explicit escalation to Anthropic's Claude Sonnet/Opus via the Claude Agent SDK. Reach it from Telegram or a browser; own every audit row, permission rule, and budget cap.
+> A self-hosted personal Agent you can configure, hack, and **converse with**. Reach it by text from Telegram or a browser, or by voice (ElevenLabs STT + TTS) on either transport. Free local LLM (Ollama / LMStudio) or remote (OpenRouter) by default; escalate to Anthropic's Claude Sonnet (`@`) or Opus (`!`) only when you mean it. Own every audit row, permission rule, and budget cap.
@@ -16,11 +16,12 @@ It's deliberately smaller and narrower than other personal-assistant projects:
 Both are broader and better-resourced. **Solrac's distinct value:**
 - **BYO-model engine slot.** No-prefix messages route to whichever model source you wire — free on-host (Ollama / LMStudio) or pay-per-token remote (OpenRouter). `@` (Sonnet) and `!` (Opus) are paid Claude escalations only on operator intent.
-- **Cost enforcement, not just visibility.** Sliding per-chat and global hourly USD caps that *deny* turns when hit — they sum every `cost_usd` row (Claude or OpenRouter), so remote-mode burn is gated by the same ceilings without extra configuration. Plus a daily cost-report DM.
-- **Audit-before-acting.** Every update (allowed, denied, queue-full) writes a row to one append-only SQLite table, tagged with the engine that served it (`local:ollama:...`, `remote:openrouter:...`, `claude:primary:...`).
+- **Cost enforcement, not just visibility.** Sliding per-chat and global hourly USD caps that *deny* turns when hit — they sum every `cost_usd` row (Claude or OpenRouter), so remote-mode burn is gated by the same ceilings without extra configuration. Plus a daily cost-report DM. Voice spend (ElevenLabs STT + TTS, when enabled) rides a **second** independent cost-cap axis with its own per-chat + global ceilings.
+- **Voice on every transport.** Telegram voice notes get transcribed; the web UI has a mic button and per-message speak buttons. `/voice on` turns on terse audio replies. ~120 lines of `fetch` against ElevenLabs — no SDK, no realtime WebSocket. Off by default.
+- **Audit-before-acting.** Every update (allowed, denied, queue-full) writes a row to one append-only SQLite table, tagged with the engine that served it (`local:ollama:...`, `remote:openrouter:...`, `claude:primary:...`). Voice gets a parallel `voice_events` log — every STT/TTS attempt (allowed, capped, denied, errored) is recorded.
 - **Single-process minimalism.** No HTTP framework, no Telegram framework runtime, no queue server, no Docker, no sub-agents. A few thousand lines of TypeScript you can read in an afternoon and fork.
-If you need multi-tenancy, voice wake, mobile companions, or 25 chat platforms, use OpenClaw or Hermes. If you want a small, cost-capped, fully audited foundation you can bend to your shape, Solrac fits.
+If you need multi-tenancy, always-listening voice wake, mobile companions, or 25 chat platforms, use OpenClaw or Hermes. If you want a small, cost-capped, fully audited foundation — with optional speech-to-text and text-to-speech on Telegram and the browser — that you can bend to your shape, Solrac fits.
 ## Quick start
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index 774b493..acaf97a 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -1216,6 +1216,151 @@ The existing `policy.ts::createConfirmationBroker` is transport-agnostic — `re
 The transport adds `web.ts`, `web-client.ts`, `web-sanitize.ts`, and `markdown.ts`. No HTTP framework, no WebSocket framework, no extra runtime dependencies beyond `marked` (used on both transports). The "no HTTP framework" anti-goal is honored — `Bun.serve` `routes` and `fetch` only, same shape as `server.ts`.
+## Voice transport (optional)
+
+Off by default. Enabled via `VOICE_ENABLED=true` + ElevenLabs credentials. Adds two flows — **speech-in** (operator → text prompt) and **speech-out** (assistant reply → audio) — to both Telegram and the web UI. Implementation lives in two modules:
+
+| Module | Role |
+|---|---|
+| `src/elevenlabs.ts` | Typed `fetch` wrapper for ElevenLabs HTTP. STT (`POST /v1/speech-to-text`, multipart) and TTS-stream (`POST /v1/text-to-speech/{voice_id}/stream`, chunked body). ~165 lines, no SDK. |
+| `src/voice.ts` | Orchestration — gate, cost-cap, audit-write, transport delivery. Exports `handleWebStt`, `handleWebTts`, `handleTelegramVoiceStt`, `maybeReplyWithVoice`, `stripMarkdownForSpeech`, `buildVoiceModePrompt`. |
+
+### The two flows
+
+```
+SPEECH-IN (STT)
+  Telegram voice note                 Web mic button
+  ───────────────────                 ────────────────
+  poll loop → msg.voice               MediaRecorder → /api/stt (multipart)
+        │                                  │
+        │  gateUpdate (allowlist)          │  session-cookie auth
+        │  voice cost cap check            │  voice cost cap check
+        ▼                                  ▼
+  getFile + download bytes            parse multipart, validate size
+        │                                  │
+        └──────► voice.handleXxxStt ◄──────┘
+                          │
+                          ▼
+                  ElevenLabs Scribe
+                          │
+                          ▼
+              voice_events row written
+                          │
+  Telegram: synthesize text Update    Web: return { ok, text } →
+    → queue.enqueue (normal turn)       browser pre-fills composer
+
+SPEECH-OUT (TTS)
+  Telegram: post-turn hook            Web: speak button on assistant msg
+  ──────────────────────              ──────────────────────────────────
+  agent/engine done + audit closed    user click → POST /api/tts
+        │                                   │  session-cookie auth
+        │  /voice on? (sessions table)      │  voice cost cap check
+        │  voice cost cap check             │  length wall
+        ▼                                   ▼
+  stripMarkdownForSpeech(final)       stripMarkdownForSpeech(markdown)
+        │                                   │
+        └──────► ElevenLabs TTS-stream ◄────┘
+                           │
+                           ▼
+               voice_events row written
+                           │
+  Telegram: buffer → sendVoice        Web: proxy-stream →