Afterwords — Local Voice-Cloning TTS Server

Listen to the voice demos →   Open In Colab

Clone any voice from a 15-second YouTube clip and run it locally on your Mac. Use it as a standalone TTS API, or pair it with Claude Code to hear every response spoken aloud. 110+ voices included across 6 verified backends (Qwen3 0.6B/1.7B, Chatterbox, VoxCPM 1.5, Voxtral, SoproTTS) plus 13 scaffolded backends (OpenVoice v2, F5-TTS, CosyVoice2, GPT-SoVITS, XTTS v2, IndexTTS-2, NeuTTS Air, Spark-TTS, Dia2, YourTTS, SV2TTS, MockingBird, FireRedTTS-2) that load correctly but have known installation issues on Apple Silicon — see the Backend Status table for details.

No cloud API. No subscription. No data leaves your machine. The voice comes from a 15-second audio sample — yours, a friend's, or anyone on YouTube.

Quick Start

git clone https://github.com/adrianwedd/afterwords.git
cd afterwords
bash setup.sh

The setup script checks prerequisites, creates a venv, walks you through cloning a voice from YouTube, and starts the server. If Claude Code is detected (or you choose to install it), the script also wires up a Stop hook so Claude speaks every response.

For a server-only install with no Claude Code integration:

bash setup.sh --server-only

With Claude Code

Claude Code has /voice — hold Space to dictate prompts. But it's input only. Claude can hear you; you can't hear Claude. This project adds the missing half: text-to-speech output. Together, /voice input + TTS output = full voice conversations with Claude Code.

If Claude Code isn't installed, setup will offer to install it (requires Node.js; setup installs that too if needed via Homebrew).

With Codex CLI

Codex CLI (@openai/codex) doesn't expose a Stop-hook event the way Claude Code does, so the integration uses a watcher instead: tail -F on the active session JSONL under ~/.codex/sessions, extract final assistant answers, queue them for synthesis. Inside an interactive Codex session (where $CODEX_THREAD_ID is set):

afterwords codex-hook start    # daemon follows this session, speaks final answers
afterwords codex-hook status   # check
afterwords codex-hook stop

The watcher needs ripgrep (brew install ripgrep) to locate the session file. Setup auto-detects Codex and prints these commands when it finishes; you don't have to memorize them.

For watcher debugging, run afterwords codex-hook status; it reports stale pid files and shows the tail of /tmp/codex-tts-watch.log when the watcher is not running. afterwords codex-hook start --diagnose prints the thread id, session file it would watch, hook path, and sample event detection without starting the daemon. The most common startup failures are $CODEX_THREAD_ID not being exported, or Codex not having created the first session event yet, so no ~/.codex/sessions/.../rollout-*.jsonl file matches the thread id.

Trade-offs vs Claude Code: this depends on Codex's local session file format and on $CODEX_THREAD_ID being exported, both undocumented contracts that may shift between Codex versions. For non-interactive Codex (codex exec), prefer wrapping with --output-last-message <FILE> and feeding the file to /synthesize directly — cleaner and version-stable.
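The `codex exec` wrapper flow can be sketched in a few lines. This is a hedged Python sketch, not shipped code: it assumes `codex` and `afplay` are on PATH, uses the `GET /synthesize` endpoint shown in this README, and the helper names (`synthesize_url`, `speak_codex_answer`) and temp paths are illustrative:

```python
import subprocess
import urllib.parse
import urllib.request

SERVER = "http://localhost:7860"

def synthesize_url(text: str, voice: str = "galadriel") -> str:
    """Build a /synthesize URL; query values are percent-encoded."""
    query = urllib.parse.urlencode({"text": text, "voice": voice})
    return f"{SERVER}/synthesize?{query}"

def speak_codex_answer(prompt: str, voice: str = "galadriel") -> None:
    """Run non-interactive Codex, capture the final answer, speak it."""
    # --output-last-message writes only the final assistant message to a file.
    subprocess.run(
        ["codex", "exec", prompt, "--output-last-message", "/tmp/codex-last.txt"],
        check=True,
    )
    text = open("/tmp/codex-last.txt").read().strip()
    urllib.request.urlretrieve(synthesize_url(text, voice), "/tmp/codex-answer.wav")
    subprocess.run(["afplay", "/tmp/codex-answer.wav"], check=True)
```

Because this path never touches the session JSONL or `$CODEX_THREAD_ID`, it survives Codex version bumps that would break the watcher.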

With Gemini CLI

Gemini CLI ships hook support, including a gemini hooks migrate --from-claude subcommand. Tempting — but in our testing it has a silent-write bug: when run from $HOME it reports success but leaves ~/.gemini/settings.json unchanged (it writes via setValue("Workspace", ...) which is read-only when cwd == home). Even when the migrate succeeds elsewhere, the resulting config wouldn't work for TTS because the payload schema differs: Claude sends last_assistant_message, Gemini sends prompt_response.

So we ship a small adapter instead. setup.sh installs ~/.claude/hooks/gemini-tts-hook.sh (it normalises prompt_response → the existing Claude tts-hook + worker chain) and prints the JSON snippet to add to ~/.gemini/settings.json:

{
  "hooks": {
    "AfterAgent": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash ~/.claude/hooks/gemini-tts-hook.sh",
            "timeout": 120000
          }
        ]
      }
    ]
  }
}

Two things to know:

  • Gemini fires AfterAgent (analogue of Claude's Stop); there's no clean SubagentStop analogue, so per-agent voice mapping via .afterwords is Claude-only for now.
  • The adapter reuses Claude's TTS queue (/tmp/claude-tts-queue.txt) and worker, so a single Afterwords worker drains both Claude and Gemini sessions. No coordination needed.
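The payload normalisation the adapter performs is small. The real adapter is a shell script; this Python sketch (with the illustrative name `gemini_to_claude_payload`) just shows the field mapping:

```python
import json

def gemini_to_claude_payload(raw: str) -> str:
    """Normalise a Gemini AfterAgent payload into the shape the Claude
    Stop-hook chain expects: prompt_response -> last_assistant_message."""
    event = json.loads(raw)
    return json.dumps({"last_assistant_message": event.get("prompt_response", "")})
```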

Test: gemini -p "say hi" should speak the response via Afterwords using your default voice.

Without Claude Code

The TTS server is a plain HTTP API. Use it from any tool, script, or application:

# Synthesize speech
curl "http://localhost:7860/synthesize?text=Hello+world&voice=galadriel" -o hello.wav
afplay hello.wav

# List available voices
curl http://localhost:7860/health | jq .voices

Integrate with Cursor, Windsurf, shell scripts, web apps — anything that can make an HTTP request.

For CLIs without native completion hooks (e.g. Hermes), use the bundled wrapper to speak any text via the server in one call:

bash scripts/hermes-tts.sh "Hello from Hermes" picard   # voice arg is optional

The wrapper checks the server, encodes the text, fetches the WAV, trims the leading 100 ms of model artifact, and plays via afplay. Same flow Claude Code's hook uses, just invoked manually.
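The 100 ms trim is easy to reproduce with the stdlib if you want it in your own integration. A sketch assuming 16-bit PCM WAV input (what the server returns); `trim_leading` is an illustrative helper, not the wrapper's actual code:

```python
import io
import wave

def trim_leading(wav_bytes: bytes, seconds: float = 0.1) -> bytes:
    """Drop the first `seconds` of audio (the model's startup artifact)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        total = src.getnframes()
        skip = min(int(seconds * src.getframerate()), total)
        src.setpos(skip)                       # seek past the artifact
        frames = src.readframes(total - skip)  # keep the rest
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)   # nframes is patched on close
        dst.writeframes(frames)
    return out.getvalue()
```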

Adding More Voices

bash clone-voice.sh
# or non-interactive:
bash clone-voice.sh "https://youtube.com/watch?v=..." galadriel 30

The script downloads the audio, extracts a 15-second segment, denoises it, transcribes with Whisper, and saves a voice profile. Each voice is just a 700 KB WAV file — adding voices costs zero extra memory.
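A voice profile pairs the trimmed WAV with its transcript. The layout below is a sketch only: `reference_text`, `backend`, and `family` are fields this README mentions elsewhere, but the exact schema is whatever clone-voice.sh writes to voices/*.json:

```python
import json

# Hypothetical profile layout for illustration -- check voices/*.json
# for the authoritative schema.
profile = {
    "name": "galadriel",
    "reference_wav": "voices/galadriel-ref.wav",  # the 15 s trimmed clip
    "reference_text": "transcript produced by Whisper",
    "backend": "qwen3-1.7b",    # pins synthesis to one backend
    "family": "galadriel",      # enables cross-backend lang routing
}
print(json.dumps(profile, indent=2))
```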

Auditing voice profiles

If you hand-edit a voices/*.json reference_text after cloning (e.g. to correct Whisper mishearings), you can drift the transcript away from what the trimmed audio actually says — which degrades cloning fidelity. The audit tool re-transcribes every reference WAV and flags drift:

afterwords audit               # report only
afterwords audit --fix         # overwrite reference_text with fresh Whisper output for flagged voices
afterwords audit --voice picard

Flags raised: phantom canonical text (transcript materially longer than what's heard), mid-word truncation, mid-clip silence gaps ≥1.5s, and impossible char/sec ratios. Exits non-zero when any voice is flagged, so it's safe to wire into CI.
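The char/sec check is simple to reason about: spoken English rarely strays far from a narrow characters-per-second band, so a transcript far outside it cannot match the audio. A sketch of the idea; the 8-25 char/s thresholds here are illustrative assumptions, not the audit tool's actual values:

```python
def chars_per_second(transcript: str, duration_s: float) -> float:
    return len(transcript) / duration_s

def flag_ratio(transcript: str, duration_s: float,
               lo: float = 8.0, hi: float = 25.0) -> bool:
    """True when the transcript length is implausible for the clip duration,
    e.g. phantom canonical text (too long) or a gutted transcript (too short)."""
    rate = chars_per_second(transcript, duration_s)
    return not (lo <= rate <= hi)
```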

Switching Voices

Per-project — drop a .afterwords file in any repo:

echo "snape" > .afterwords     # this project uses Snape
echo "galadriel" > .afterwords # this one uses Galadriel

Per-agent — map agent names to voices (one per line):

# .afterwords
default: data
clara-oswald: clara-oswald
donna-noble: donna-noble
k9: k9
Explore: spock

When Claude Code spawns a subagent, the hook reads its agent_type and looks up the voice from the mapping. If no match is found, it falls back to default:, then to the server's default voice. Built-in subagent types (Explore, Plan, general-purpose) are silently skipped.

The hook reads this before each synthesis. No server restart needed.

Global default — edit DEFAULT_VOICE in server.py and restart:

afterwords restart

Per-request:

curl "http://localhost:7860/synthesize?text=Hello&voice=samantha" -o hello.wav

Newly cloned voices are auto-discovered on server restart, or can be picked up without one:

afterwords reload   # rescans voices/, adds new profiles, no synthesis interruption

reload is add-only and atomic — if any new profile fails validation, the whole reload aborts and the previous voice set stays intact.
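The add-only, all-or-nothing contract can be sketched as a validate-then-merge step. A minimal sketch of the behaviour, not server.py's implementation; `atomic_add_only_reload` is an illustrative name:

```python
def atomic_add_only_reload(current: dict, discovered: dict, validate) -> dict:
    """Validate every *new* profile first; if any fails, return the current
    set untouched. Existing entries are never replaced (add-only)."""
    new = {name: p for name, p in discovered.items() if name not in current}
    failed = [name for name, p in new.items() if not validate(p)]
    if failed:
        return current        # abort: previous voice set stays intact
    merged = dict(current)
    merged.update(new)        # commit all new profiles at once
    return merged
```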

Languages

The backends advertise different language support. Ask /health to see what each one offers:

curl -s localhost:7860/health | jq '.loaded_backends | to_entries[] | {backend: .key, langs: .value.supported_langs}'

Pass lang= on a synthesis request when you want a non-English language:

curl "http://localhost:7860/synthesize?text=Ni+hao&voice=galadriel&lang=zh" -o hello-zh.wav

If the voice's backend doesn't support the requested language, and the voice belongs to a family (e.g. picard, picard-voxcpm-15 etc. all have family: picard in their JSON), the server auto-routes to a same-family voice on a backend that does support it. If no family member supports the language, you get a clean 400 with the list of supported languages.
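The routing decision reduces to a family-scoped search. A hedged sketch of that logic (illustrative data shapes; `route_for_lang` is not a server.py function):

```python
def route_for_lang(voice: str, lang: str, voices: dict, backends: dict) -> str:
    """Return a voice that can speak `lang`: the requested voice if its
    backend supports it, else any same-family voice whose backend does.
    Raises ValueError (the server returns 400) when nothing qualifies."""
    def supports(name: str) -> bool:
        return lang in backends[voices[name]["backend"]]

    if supports(voice):
        return voice
    family = voices[voice].get("family")
    for name, meta in voices.items():
        if meta.get("family") == family and supports(name):
            return name
    raise ValueError(f"no voice in family {family!r} supports lang {lang!r}")
```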

Claude Code Skill

A Claude Code skill is included in skill/ that enables natural-language TTS commands:

"say 'the spice must flow' in a dramatic voice"
"what voices are available?"
"set this project to use river-song"

The skill handles voice selection, server health checks, synthesis, and playback. Install it by pointing Claude Code at the skill/SKILL.md file or adding it as a plugin.

Architecture

┌─────────────────────────────────────────────────────────────┐
│  Your Mac (Apple Silicon, 16 GB+)                           │
│                                                             │
│  ┌─────────────────────────┐                                │
│  │  Multi-Backend TTS      │  ← MLX + optional OpenVoice    │
│  │  localhost:7860         │  ← 110+ voice profiles         │
│  │  /synthesize?text=...   │  ← ~20s per sentence (Qwen3)   │
│  └─────────┬───────────────┘                                │
│            │                                                │
│  ┌─────────┴───────────────┐  ┌──────────────────────────┐  │
│  │  Claude Code Stop Hook  │  │  Claude Code /voice      │  │
│  │  ~/.claude/hooks/       │  │  (hold Space to dictate) │  │
│  │  tts-hook.sh            │  │  Speech → Text input     │  │
│  │  Text → Speech output   │  │  (built-in)              │  │
│  └─────────────────────────┘  └──────────────────────────┘  │
│                                                             │
│  Together: full voice conversation with Claude Code         │
└─────────────────────────────────────────────────────────────┘

/voice handles input: you speak, Claude hears text. This project handles output: Claude responds, you hear speech.

How It Works

Voice Cloning (Zero-Shot)

No training or fine-tuning. The default MLX-based backends extract speaker embeddings from 15-second reference clips and generate new speech in that voice: Qwen3-TTS 0.6B & 1.7B (Alibaba, multilingual), Chatterbox fp16 (Resemble AI, multilingual), VoxCPM 1.5 (ModelBest, en/zh), and Voxtral 4B (preset voices). SoproTTS is a verified optional CPU-friendly backend for lightweight English zero-shot cloning.

The following optional backends have working code but known installation issues on Apple Silicon (dep-resolution errors, missing build deps, or source repos not bundled). Each is tracked in the issue tracker.

  • OpenVoice v2: PyTorch/MeloTTS backend for zero-shot multilingual cloning in en/es/fr/zh/ja/ko.
  • F5-TTS: PyTorch backend using the flow-matching DiT F5TTS_v1_Base model for en/zh; its default pretrained weights are CC-BY-NC 4.0 and not for commercial use.
  • CosyVoice2-0.5B: Apache-2.0 PyTorch backend for multilingual zero-shot cloning.
  • GPT-SoVITS: MIT-licensed PyTorch backend for few-shot cloning in en/zh/ja/ko/yue.
  • XTTS v2: Coqui TTS backend for 17-language zero-shot cloning; its CPML weights are non-commercial only.
  • IndexTTS-2: PyTorch backend for expressive en/zh zero-shot cloning with emotion controls.
  • NeuTTS Air: Apache-2.0 CPU-first backend for English zero-shot cloning via Neuphonic's neutts package.
  • Spark-TTS: LLM+BiCodec PyTorch backend for en/zh zero-shot cloning; its published 0.5B weights are CC-BY-NC-SA 4.0 and non-commercial.
  • Dia2: Apache-2.0 PyTorch backend for English dialogue-oriented voice conditioning with [S1]/[S2] speaker tags.
  • YourTTS: open-source Coqui VITS backend for lightweight en/fr/pt-BR zero-shot cloning at 16 kHz.
  • FireRedTTS-2: Apache-2.0 PyTorch backend for long conversational and podcast-style multilingual zero-shot cloning.
  • SV2TTS: open-source PyTorch backend using the classic Real-Time Voice Cloning encoder + Tacotron + WaveRNN pipeline for English.
  • MockingBird: open-source Chinese-focused SV2TTS-derived backend.

Backend Status

Verified backends clone voices end-to-end on Apple Silicon (tested). Scaffolded backends have working code but known installation issues — see linked issues for status.

| Backend | Status | License | Languages | Sample rate | Reference text |
|---|---|---|---|---|---|
| qwen3-0.6b, qwen3-1.7b | ✅ verified | model-dependent | en/zh/ja/ko/es/fr/de/it/pt/ru | 24 kHz | required |
| chatterbox | ✅ verified | model-dependent | en/es/fr/de/it/pt/zh/ja/ko | 24 kHz | optional |
| voxcpm-1.5 | ✅ verified | model-dependent | en/zh | 44.1 kHz | optional |
| voxtral | ✅ verified | model-dependent | preset voices | 24 kHz | ignored |
| soprotts | ✅ verified | Apache-2.0 | en | 24 kHz | optional |
| openvoice-v2 | 🔧 scaffolded | MIT | en/es/fr/zh/ja/ko | 22.05 kHz | optional |
| f5-tts | 🔧 scaffolded | CC-BY-NC default weights | en/zh | 24 kHz | required |
| cosyvoice2 | 🔧 scaffolded | Apache-2.0 | en/zh/ja/ko/de/es/fr/it/ru | 24 kHz | required |
| gpt-sovits | 🔧 scaffolded | MIT | en/zh/ja/ko/yue | 32 kHz | required |
| xtts-v2 | 🔧 scaffolded | CPML, non-commercial only | en/es/fr/de/it/pt/pl/tr/ru/nl/cs/ar/zh/hu/ko/ja/hi | 24 kHz | optional |
| indextts-2 | 🔧 scaffolded | LicenseRef-Bilibili-IndexTTS | en/zh | 22.05 kHz | optional |
| neutts-air | 🔧 scaffolded | Apache-2.0 | en | 24 kHz | optional |
| spark-tts | 🔧 scaffolded | Apache-2.0 code; CC-BY-NC-SA 4.0 weights | en/zh | 24 kHz | optional |
| dia2 | 🔧 scaffolded | Apache-2.0 | en | 44 kHz | optional |
| yourtts | 🔧 scaffolded | Open source | en/fr/pt-BR | 16 kHz | optional |
| firered-tts-2 | 🔧 scaffolded | Apache-2.0 | en/zh/ja/ko/fr/de/ru | 24 kHz | optional |
| sv2tts | 🔧 scaffolded | Open source | en | 22.05 kHz | optional |
| mockingbird | 🔧 scaffolded | Open source | zh/en | 22.05 kHz | optional |

Voice profiles pin to a specific backend via the backend JSON field. The shipped flagship voices (picard, galadriel, attenborough) have per-backend variants so you can compare clone fidelity across models — Qwen3 sizes are the most reliable cloners in the current stack; see the demo site for audible comparison.

The Server

FastAPI + Uvicorn serving WAV audio over HTTP. Backends load once at startup; each voice is a reference WAV + transcript string. All synthesis is serialised via _synth_lock (MLX Metal is single-GPU regardless of backend). VOICES dict mutation is guarded by a separate _model_lock. Lock-acquisition order is always _synth_lock, then _model_lock, to avoid deadlock.

GET  /health
       → {"status":"ok", "ready":true, "voices":[...],
          "loaded_backends": {"qwen3-0.6b": {"loaded":true, "voice_count":..., "supported_langs":[...]}, ...}}

       Current backend ids:
       qwen3-0.6b, qwen3-1.7b, chatterbox, voxcpm-1.5, voxtral, openvoice-v2, f5-tts, cosyvoice2,
       gpt-sovits, xtts-v2, indextts-2, neutts-air, spark-tts, dia2, yourtts, firered-tts-2,
       sv2tts, mockingbird, soprotts

GET  /synthesize?text=Hello&voice=galadriel&lang=en
       → audio/wav (16-bit PCM)
       → X-Backend, X-Synthesis-Time, X-Duration headers
       → 400 if voice unknown OR lang unsupported (returns supported_langs)
       → 503 if warming up

POST /synthesize          (--allow-clone only)
       Body: {"text":..., "voice":..., "emotion":..., "lang":"en"}
       → audio/wav, same status codes as GET

POST /clone               (--allow-clone only)
       multipart: audio file, session_id, emotion, transcript?, backend?
       → JSON {voice, backend, emotion, quality, sequence, ...}

POST /reload              (--allow-clone only)
       → JSON {status, reloaded:[names], errors:[]} on success (200)
       → JSON {status:"failed", errors:[...]}        on abort   (500)
       Add-only, atomic — if any voice fails to prepare, no changes committed.

DELETE /session/{id}      (--allow-clone only)
       → removes all voices for that session, cleans up temp files

The Hook

Claude Code's Stop hook fires after every response. The hook extracts the response text, strips markdown, queues it for synthesis, and plays the result through your speakers. A background worker with mkdir-based locking (macOS has no flock) prevents overlapping audio.
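The mkdir trick works because directory creation is atomic: exactly one process can succeed. The hook does this in shell; the same primitive in Python, with a hypothetical lock path, looks like:

```python
import os
import shutil
import tempfile

# Hypothetical lock path for this sketch; the real hook keeps its own.
LOCK_DIR = os.path.join(tempfile.gettempdir(), "afterwords-demo.lock")

def try_acquire(lock_dir: str = LOCK_DIR) -> bool:
    """mkdir is atomic, so it doubles as a mutex on systems without flock."""
    try:
        os.mkdir(lock_dir)
        return True
    except FileExistsError:
        return False  # another worker holds the lock

def release(lock_dir: str = LOCK_DIR) -> None:
    shutil.rmtree(lock_dir, ignore_errors=True)
```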

The Queue

Fast Claude conversations generate responses faster than TTS can synthesise. The worker processes a queue (max 10 entries), trimming oldest entries when it overflows. Each response is archived as an MP3 plus a sidecar TXT file in ~/.claude/tts-archive/ (requires lame), so archived speech can be audited later with python scripts/audit-archive.py.
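The trim-oldest-on-overflow behaviour is exactly what a bounded deque gives you. A minimal sketch (the worker itself is a shell script operating on a text file):

```python
from collections import deque

MAX_QUEUE = 10  # the worker's cap described above

# deque with maxlen silently drops the oldest entry on overflow,
# mirroring how the worker trims its text queue.
tts_queue = deque(maxlen=MAX_QUEUE)
for i in range(15):
    tts_queue.append(f"response {i}")
# only the 10 newest responses remain
```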

Privacy note: the sidecar .txt files contain the exact text spoken by Claude — including code snippets, file paths, and anything else that appeared in a response. They persist indefinitely on local disk and are never uploaded anywhere. Clean periodically with rm ~/.claude/tts-archive/*.txt (or both *.txt and *.mp3 to wipe the whole archive). If you'd rather not archive at all, remove the lame ... && printf ... block from ~/.claude/hooks/tts-worker.sh.

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4), 16 GB+ RAM (32 GB recommended)
  • Python 3.11+
  • ~2 GB disk (model weights + venv)
  • Claude Code (optional — for automatic TTS on responses; setup offers to install it)

File Map

afterwords/
├── setup.sh              ← one-command setup (detects/installs Claude Code)
├── afterwords.sh         ← CLI for server management (symlinked to PATH)
├── clone-voice.sh        ← add more voices from YouTube
├── server.py             ← multi-voice TTS server
├── strip_markdown.py     ← text cleaner for TTS (also used by hooks)
├── tests/                ← pytest suite (82+ tests, no GPU needed)
├── backends/             ← Backend Protocol + concrete backends + registry CLI
├── scripts/              ← reclone-flagship.py, gen-comparison-audio.sh, audit-voice-transcripts.py, audit-archive.py
├── docs/                 ← demo site (deployed to GitHub Pages)
├── requirements.txt      ← runtime deps
├── requirements-dev.txt  ← test deps (pytest>=9.0.3, httpx)
├── skill/                ← Claude Code skill for natural-language TTS
│   ├── SKILL.md          ← skill instructions
│   └── scripts/speak.sh  ← synthesize + play helper
├── voices/
│   ├── galadriel-ref.wav ← 15s reference (Cate Blanchett, LOTR)
│   ├── samantha-ref.wav  ← (Scarlett Johansson, Her)
│   ├── amy-pond-ref.wav  ← (Karen Gillan, Doctor Who)
│   └── ...               ← 110+ voices total; 3 flagships have per-backend variants
└── README.md

~/.claude/                    ← only with Claude Code integration
├── settings.json         ← Stop hook registered here
└── hooks/
    ├── tts-hook.sh       ← queue response for TTS
    ├── tts-worker.sh     ← process queue, play audio
    └── strip-markdown.py ← clean text for TTS

~/Library/LaunchAgents/
└── com.afterwords.tts-server.plist  ← auto-start on login

Included Voices

| Voice | Source | Character |
|---|---|---|
| attenborough | David Attenborough, BBC Earth | Warm, measured, wry narration |
| galadriel | Cate Blanchett, LOTR | Ethereal, ancient, otherworldly |
| han-solo | Harrison Ford, Star Wars | Sardonic, roguish confidence |
| samantha | Scarlett Johansson, Her | Warm, introspective AI |
| aurora | AURORA, Shower Thoughts | Dreamy, Norwegian, whimsical |
| audrey | Audrey Hepburn, 1961 | Elegant, transatlantic |
| marla | Helena Bonham Carter, Fight Club | Sardonic, darkly poetic |
| avasarala | Shohreh Aghdashloo, The Expanse | Gravelly, commanding |
| vesper | Eva Green, Casino Royale | French-accented, seductive |
| claudia | Claudia Black, Dragon Age | Australian, husky |
| eartha | Eartha Kitt, interview | Passionate purr |
| tilda | Tilda Swinton, interview | Crisp, dry wit |
| snape | Alan Rickman, Harry Potter | Velvet menace, slow burn |
| loki | Tom Hiddleston, Avengers | Theatrical, commanding |
| spock | Leonard Nimoy, Star Trek | Measured, logical deadpan |
| bardem | Javier Bardem, Vicky Cristina Barcelona | Warm, seductive Spanish |
| depp | Johnny Depp, interview | Languid, charming |
| data | Brent Spiner, Star Trek TNG | Precise, android curiosity |
| picard | Patrick Stewart, Star Trek | Authoritative, measured |
| ronan | Ronan Keating, interview | Soft Irish, reflective |

Doctor Who Companion Voices

| Voice | Actor | Character |
|---|---|---|
| the-doctor | Tom Baker, Day of the Doctor | Warm, enigmatic Curator |
| amy-pond | Karen Gillan, Angels Take Manhattan | Fierce, emotional farewell |
| bill-potts | Pearl Mackie, Twice Upon a Time | Warm, defiant |
| clara-oswald | Jenna Coleman, The Name of the Doctor | Quick, clever |
| donna-noble | Catherine Tate, Turn Left | Bold, heartfelt |
| k9 | John Leeson, Doctor Who | Robotic, clipped |
| leela | Louise Jameson, Big Finish | Direct, warrior's clarity |
| martha-jones | Freema Agyeman, Last of the Time Lords | Confident, commanding |
| nyssa-of-traken | Sarah Sutton, Terminus | Gentle, precise |
| river-song | Alex Kingston, Husbands of River Song | Theatrical, knowing |
| romana | Lalla Ward, Big Finish | Regal, intellectual |
| rose-tyler | Billie Piper, Parting of the Ways | Ethereal, powerful |
| sarah-jane-smith | Elisabeth Sladen, School Reunion | Warm, investigative |
| tegan-jovanka | Janet Fielding, Resurrection of the Daleks | Blunt, emotional |
| yasmin-khan | Mandip Gill, Power of the Doctor | Quiet, heartfelt |

The full gallery includes 110+ voices spanning British comedy (Blackadder, Alan Partridge, Basil Fawlty, Malcolm Tucker, Father Ted, Geraldine, Patsy & Edina, Bernard Black…), American drama (Frasier, Columbo, Saul Goodman, Harvey Specter…), science communicators (Carl Sagan, Feynman, Brian Cox, Neil deGrasse Tyson…), and more. Run afterwords voices --demo to browse and hear samples.

Troubleshooting

| Symptom | Fix |
|---|---|
| No voice after Claude responds | afterwords status — if dead: afterwords start |
| "warming up" 503 | Wait ~30s after restart for model load + warmup |
| Voice sounds wrong/garbled | Re-clone with a better reference clip; verify transcript accuracy |
| 40+ seconds per request | Restart the server (model may be reloading per-request) |
| /voice not working | Enable with /voice command in Claude Code; requires Claude.ai account |
| Hook not firing | Open /hooks in Claude Code to verify; or restart session |
| New voice not available | Run afterwords reload, or restart the server |
| Port 7860 already in use | Another instance is running, or another app uses the port |
| Model download fails | Check network; retry python server.py manually |
| MP3 archives missing | Install lame via brew install lame |

Testing

pip install -r requirements-dev.txt
pytest

Tests cover the server API (endpoint validation, error handling, voice resolution, hot-reload atomicity, lang routing across backend families), backend protocol conformance, the strip-markdown text transform, and lifecycle helpers (_cleanup_current_voices, _sweep_orphaned_temp_files). 82+ tests pass without loading any real model — a FakeBackend fixture stands in. Real-model integration tests are opt-in via pytest -m integration.

Run a single test:

pytest tests/test_strip_markdown.py::test_inline_code_keeps_content
pytest tests/test_server.py -k reload         # all reload tests
pytest tests/test_server.py -k routing        # all family-routing tests

Managing the Server

afterwords start       # start the TTS server
afterwords stop        # stop the TTS server
afterwords restart     # restart after config changes
afterwords status      # show health, PID, loaded voices
afterwords logs        # tail the server log
afterwords voices      # list available voices
afterwords reload      # pick up new voices without restarting (no synth interruption)
afterwords clone       # clone a new voice from YouTube
afterwords uninstall   # remove service and optionally hooks

The afterwords command is added to your PATH during setup. It wraps launchd service management, health checks, and voice operations into a single interface.

Uninstalling

afterwords uninstall

This removes the launchd service and offers to remove Claude Code hooks. Voice profiles and server code remain in the repo directory. Setup is safe to re-run if anything breaks.

Performance

On 32 GB M3 Max (four backends preloaded):

  • Startup: ~2 min (backend load + warmup)
  • Model load: ~5s (cached) / ~5 min (first run, downloading ~3 GB)
  • Per request: ~15s fixed overhead + ~0.5x real-time (~20s typical)
  • Peak memory: ~10 GB (all four backends)
  • Adding voices: zero extra memory (each is just a 700 KB WAV)

Credits

Related
