# AudioFormation

Production audio pipeline: Voice, SFX, Music, Mix, Export. A companion to VideoFormation (same architecture, different domain).

## Design Principles
| Principle | Implementation |
|---|---|
| Single Source of Truth | project.json governs everything |
| Validation Gates | Hard gates before generation, mixing, export |
| Automation First | CLI drives pipeline; dashboard is optional |
| Engine Agnostic | Swap TTS/music engines without touching project files |
| Hardware Aware | Auto-detects GPU, suggests optimal engine |
| Bilingual First | Arabic + English as primary languages |
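Because `project.json` is the single source of truth, every pipeline stage reads its configuration from that one file. The actual schema is defined by the codebase, not this README; the fragment below is purely illustrative, and every field name in it is an assumption:

```json
{
  "title": "MY_NOVEL",
  "languages": ["ar", "en"],
  "tts": { "engine": "edge", "fallback": "gtts" },
  "mix": { "target_lufs": -16.0, "ducking": true },
  "export": { "format": "m4b" }
}
```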
## Requirements

- Python 3.11+
- `ffmpeg` on PATH
- Optional: NVIDIA GPU with 4GB+ VRAM for XTTS voice cloning
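The "Hardware Aware" principle means the pipeline checks for a usable GPU before suggesting XTTS. The actual detection logic isn't shown in this README; a minimal sketch of the idea, probing for `nvidia-smi` (and ignoring the 4GB VRAM check for brevity), might look like:

```python
import shutil
import subprocess

def detect_nvidia_gpu() -> bool:
    """Best-effort check: nvidia-smi is on PATH and exits cleanly.
    (The real detection presumably also verifies VRAM capacity.)"""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
    except OSError:
        return False

def suggest_engine() -> str:
    """Suggest local XTTS when a GPU is present, otherwise the cloud default."""
    return "xtts" if detect_nvidia_gpu() else "edge"
```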
## Installation

```bash
# Install with core dependencies (includes edge-tts, audio processing, etc.)
pip install -e ".[cloud]"

# Or for development
pip install -e ".[dev,server]"
```

### Optional Dependencies

AudioFormation supports multiple TTS engines. Install the optional dependencies based on your needs:
```bash
# Cloud TTS engines (recommended for most users)
pip install -e ".[cloud]"
# Includes: gTTS, ElevenLabs, httpx, python-dotenv
# Note: edge-tts is always installed (main dependency)

# Local voice cloning (requires GPU)
pip install -e ".[xtts]"
# Includes: coqui-tts (XTTS v2)

# Voice activity detection for ducking
pip install -e ".[vad]"
# Includes: silero-vad

# Web dashboard
pip install -e ".[server]"
# Includes: fastapi, uvicorn, aiofiles, python-multipart

# M4B audiobook export
pip install -e ".[m4b]"
# Includes: mutagen

# MIDI composition
pip install -e ".[midi]"
# Includes: midiutil

# Full installation (all features)
pip install -e ".[cloud,xtts,vad,server,m4b,midi,dev]"
```

### API Keys

Some engines require API keys:
ElevenLabs (optional):

```bash
export ELEVENLABS_API_KEY="your_api_key_here"
```

## Quick Start

```bash
# Create a project
audioformation new "MY_NOVEL"

# Start the Dashboard
audioformation serve
# Open http://localhost:4001 in your browser
```

Or use the CLI:
```bash
# Ingest text
audioformation ingest MY_NOVEL --source ./chapters/

# Generate (edge-tts with gTTS fallback)
audioformation generate MY_NOVEL --engine edge

# Mix (Voice + Music + Ducking)
audioformation mix MY_NOVEL

# Export
audioformation export MY_NOVEL --format m4b
```

## TTS Engines

AudioFormation supports multiple TTS engines with different capabilities:
| Engine | Type | Languages | Cost | Dependencies |
|---|---|---|---|---|
| Edge-TTS | Cloud | Arabic, English, many others | Free | edge-tts |
| gTTS | Cloud | 100+ languages | Free | gTTS |
| ElevenLabs | Cloud | English (premium voices) | Paid | elevenlabs + API key |
| XTTS v2 | Local | Multilingual | Free | coqui-tts + GPU |
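Engine choice degrades gracefully: if the requested engine's package isn't installed, the pipeline can fall back down the chain (the feature list mentions an edge-tts to gTTS fallback). A sketch of that idea — the function names here are illustrative, not AudioFormation's actual API:

```python
from importlib.util import find_spec

# Import names for each engine's backing package (coqui-tts imports as "TTS").
ENGINE_MODULES = {
    "edge": "edge_tts",
    "gtts": "gtts",
    "elevenlabs": "elevenlabs",
    "xtts": "TTS",
}

def available_engines() -> list[str]:
    """Engines whose backing package is importable in this environment."""
    return [name for name, module in ENGINE_MODULES.items()
            if find_spec(module) is not None]

def pick_engine(preferred: str = "edge") -> str:
    """Use the preferred engine if installed, else the first available one."""
    engines = available_engines()
    if preferred in engines:
        return preferred
    if engines:
        return engines[0]
    raise RuntimeError('No TTS engine installed; try: pip install -e ".[cloud]"')
```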
```bash
# Use Edge-TTS (default, best for Arabic)
audioformation generate PROJECT --engine edge

# Use gTTS (fallback, more languages)
audioformation generate PROJECT --engine gtts

# Use ElevenLabs (premium quality)
audioformation generate PROJECT --engine elevenlabs

# Use XTTS v2 (local voice cloning)
audioformation generate PROJECT --engine xtts
```

## Feature Status

| Feature | Status |
|---|---|
| Edge TTS (free, Arabic + English) | ✅ BUILT |
| SSML direction mapping | ✅ BUILT |
| Text chunking (breath-group) | ✅ BUILT |
| Per-chapter QC scanning | ✅ BUILT |
| QC Scan API endpoint | ✅ BUILT |
| LUFS normalization | ✅ BUILT |
| MP3 export with manifest | ✅ BUILT |
| Arabic diacritics detection | ✅ BUILT |
| Engine fallback chain (edge-tts → gTTS) | ✅ BUILT |
| XTTS v2 engine adapter | ✅ BUILT |
| ElevenLabs engine adapter | ✅ BUILT |
| Multi-speaker dialogue | ✅ BUILT |
| Ambient pad generation | ✅ BUILT |
| VAD-based ducking | ✅ BUILT |
| M4B audiobook export | ✅ BUILT |
| Web dashboard | ✅ BUILT |
| Run All Pipeline | ✅ BUILT |
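The LUFS normalization feature maps naturally onto ffmpeg's `loudnorm` filter, since ffmpeg is already a required dependency. A minimal one-pass sketch — AudioFormation's actual processor may use different targets or two-pass measurement, and the -16 LUFS default is an assumption:

```python
import subprocess

def loudnorm_cmd(src: str, dst: str, target_lufs: float = -16.0) -> list[str]:
    """Build a one-pass ffmpeg loudnorm command for the given LUFS target."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
        dst,
    ]

def normalize(src: str, dst: str, target_lufs: float = -16.0) -> None:
    """Run the normalization; requires ffmpeg on PATH."""
    subprocess.run(loudnorm_cmd(src, dst, target_lufs), check=True)
```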
## Architecture

AudioFormation follows a modular pipeline architecture with five core domains:

```
audioformation CLI → Pipeline State Machine
├── TTS Engines (edge, gtts, xtts, elevenlabs)   ✅ IMPLEMENTED
├── Audio Processor (normalize, trim, stitch)    ✅ IMPLEMENTED
├── Ambient Composer (pad generation)            ✅ IMPLEMENTED
├── Mixer (multi-track, VAD ducking)             ✅ IMPLEMENTED
├── QC Scanner (per-chunk quality)               ✅ IMPLEMENTED
└── Exporter (MP3/M4B + manifest)                ✅ IMPLEMENTED
```
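The mixer's VAD-based ducking attenuates the music bed wherever the voice-activity detector flags speech. How AudioFormation smooths the gain isn't documented here; a toy per-sample sketch with one-pole smoothing to avoid audible clicks:

```python
def duck_music(music: list[float], voice_active: list[bool],
               duck_gain: float = 0.3, ramp: int = 32) -> list[float]:
    """Scale music samples toward duck_gain while the VAD is active,
    easing gain changes over ~ramp samples to avoid clicks."""
    out, gain = [], 1.0
    for sample, active in zip(music, voice_active):
        target = duck_gain if active else 1.0
        gain += (target - gain) / ramp   # simple one-pole smoothing
        out.append(sample * gain)
    return out
```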
## Roadmap

- ✅ Phase 1 Complete: Core TTS pipeline, QC, audio processing
- ✅ Phase 2 Complete: Cloud TTS adapters, voice cloning, multi-speaker, CLI tools
- ✅ Phase 3 Complete: Mixer with ducking, M4B export, web interface (Editor/Mix)
- ✅ Phase 4 Complete: Dashboard v2.0
- ⏳ Phase 5 Future: Real music & algorithmic composition, advanced features
## Dashboard

The dashboard (`audioformation serve`) provides a visual interface for:

- Project Management: Create and list projects.
- Editor: Configure generation settings, edit chapter metadata, trigger generation per chapter.
- Mix & Review: Visualize audio waveforms (wavesurfer.js), play back generated/mixed audio, trigger the mixing pipeline.
- Run All Pipeline: Single-click execution of the entire audiobook workflow (validate → generate → QC scan → process → compose → mix → export).
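"Run All" together with the validation-gate principle can be sketched as an ordered stage list where any failing gate halts the run. The stage names follow the workflow described above; the real state machine surely tracks richer per-stage state:

```python
from typing import Callable

STAGES = ["validate", "generate", "qc_scan", "process", "compose", "mix", "export"]

def run_pipeline(handlers: dict[str, Callable[[], bool]]) -> str:
    """Run stages in order; a handler returning False is a hard-gate failure.
    Missing handlers are treated as no-op successes."""
    for stage in STAGES:
        handler = handlers.get(stage, lambda: True)
        if not handler():
            return f"failed:{stage}"
    return "complete"
```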
## Testing

```bash
pip install -e ".[dev]"
pytest -v
# or
pytest --cov=src --cov-report=term-missing
```

## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

MIT

