voice support: ElevenLabs STT + TTS on Telegram and web #30
Merged
…, features, readme
Why
Solrac's pitch was "configure, hack, converse with", but the conversation was text-only. This PR adds optional voice on both transports without changing the agent loop, queue, audit shape, or cost-cap posture. Voice is off by default; enabling it requires `VOICE_ENABLED=true` plus an `ELEVENLABS_API_KEY` + `ELEVENLABS_VOICE_ID`.

Two design constraints shaped the work:
- `fetch` wrapper (~175 lines, `src/elevenlabs.ts`): same minimalism as `src/telegram.ts`. One additional library on the public surface would have been disproportionate.
- … `voice_events.cost_usd_estimate`, sliding 60-min per-chat + global ceilings). A voice runaway never starves a Claude turn and vice versa.

What's in scope
- `/voice on` (per-chat sticky, persisted in `sessions.voice_replies`) makes Telegram replies attach an audio voice note and tells the model to keep replies terse via a `<voice-mode>` system block. Web UI gets a per-message 🔊 button with a cached blob; replays don't re-bill.
- `voice_events` table (parallel to `audit`, not nested) records every attempt (`ok` / `denied_cap` / `denied_gate` / `error`) with kind, source, model, cost estimate, duration / chars. Denied-gate STTs still get a row even though they never reach `audit`.
- `ELEVENLABS_*` and `VOICE_*` env vars are scrubbed from the Claude SDK subprocess (`agent.ts::sanitizedSubprocessEnv`) so a compromised model can't exfiltrate the billed credential.
- … `voice_events` table reference + 6 debugging queries), GLOSSARY (ElevenLabs, STT, TTS, voice cost cap, voice mode, voice_events), and RUNBOOK (voice cost runaway + ElevenLabs error recovery).

Anti-goal status
Does not reverse any anti-goals.
- fetch-only posture. `sendVoice` / `sendAudio` added to the existing typed client.

Impact
- `src/elevenlabs.ts`, `src/voice.ts`, `src/voice.test.ts`, `src/db.ts`, `src/config.ts`, `src/agent.ts`, `src/engine.ts`, `src/telegram.ts`, `src/main.ts`, `src/commands.ts`, `src/web.ts`, `public/index.html`, `public/app.js`, `public/style.css`, plus docs.
- … `main` see no behavior change after upgrade.

Test plan
- `npm run typecheck`: clean
- `bun test`: 828/828 pass
- `/voice on` → Telegram reply attaches voice note
- … `denied_cap` rows when hit
- … `denied_gate` rows
- … `error` row with verbatim upstream message
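
Reviewer note: the sliding 60-minute per-chat + global ceiling mentioned under Why can be pictured with the minimal sketch below. This is not the code in `src/voice.ts`. The `VoiceEvent` shape, the `checkVoiceCap` function, the cap values, and the choice to count only billed rows are all assumptions made for illustration. Only the `ok` / `denied_cap` statuses and the idea of summing `cost_usd_estimate` over a window come from this PR.

```typescript
// Hypothetical row shape: a simplified stand-in for billed voice_events rows.
interface VoiceEvent {
  chatId: string;
  createdAtMs: number;
  costUsdEstimate: number;
}

// Hypothetical config: the ceiling values used below are made up.
interface VoiceCaps {
  perChatUsd: number;
  globalUsd: number;
  windowMs: number;
}

type CapDecision = "ok" | "denied_cap";

// Sums estimated spend inside the sliding window and rejects the next call
// if it would push either the per-chat or the global total over its ceiling.
// Assumes `events` contains only rows that were actually billed.
function checkVoiceCap(
  events: readonly VoiceEvent[],
  chatId: string,
  nextCostUsd: number,
  nowMs: number,
  caps: VoiceCaps,
): CapDecision {
  const windowStart = nowMs - caps.windowMs;
  let chatSpend = 0;
  let globalSpend = 0;
  for (const e of events) {
    if (e.createdAtMs <= windowStart) continue; // older than the window
    globalSpend += e.costUsdEstimate;
    if (e.chatId === chatId) chatSpend += e.costUsdEstimate;
  }
  if (chatSpend + nextCostUsd > caps.perChatUsd) return "denied_cap";
  if (globalSpend + nextCostUsd > caps.globalUsd) return "denied_cap";
  return "ok";
}

const HOUR_MS = 60 * 60 * 1000;
const sampleCaps: VoiceCaps = { perChatUsd: 0.5, globalUsd: 2.0, windowMs: HOUR_MS };
const nowMs = 10 * HOUR_MS;
const sampleEvents: VoiceEvent[] = [
  { chatId: "a", createdAtMs: nowMs - 2 * HOUR_MS, costUsdEstimate: 0.4 }, // expired
  { chatId: "a", createdAtMs: nowMs - 10 * 60 * 1000, costUsdEstimate: 0.4 },
  { chatId: "b", createdAtMs: nowMs - 5 * 60 * 1000, costUsdEstimate: 0.25 },
];

console.log(checkVoiceCap(sampleEvents, "a", 0.05, nowMs, sampleCaps)); // ok
console.log(checkVoiceCap(sampleEvents, "a", 0.2, nowMs, sampleCaps)); // denied_cap
```

Because the check reads only voice spend, a voice runaway in this sketch cannot consume budget tracked elsewhere, which is the separation the description calls out.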