Speech-to-text script built on the speaches.ai API. Record audio from your microphone, compress it with ffmpeg, and get instant transcriptions from a local speaches.ai instance.
- Audio Recording: Record from any audio input device using sounddevice
- GUI Interface: Easy-to-use graphical interface with real-time feedback
- Voice Activity Detection (VAD): Optional silence filtering using Silero VAD with ONNX Runtime
- Single Instance: Prevents multiple copies of the app from running simultaneously
- Smart Compression: Automatic silence removal and Opus encoding with ffmpeg
- Fast Transcription: Uses speaches.ai (OpenAI-compatible API) with faster-whisper
- Windows Compatible: Works on Windows 11 without admin rights
- Easy Management: Uses the modern uv package manager
- Python 3.9 or higher
- uv package manager
- ffmpeg (installation instructions below)
- speaches.ai running locally (Docker)
- PyTorch and ONNX Runtime (installed automatically with VAD dependencies)
# Windows (PowerShell)
irm https://astral.sh/uv/install.ps1 | iex
cd c:\dev\speakpy
uv venv
.venv\Scripts\activate
uv pip install -e .
Option A: System Installation
- Download ffmpeg from https://www.gyan.dev/ffmpeg/builds/
- Choose "ffmpeg-release-essentials.zip"
- Extract the archive
- Add the bin folder to your system PATH
Option B: Portable (No Admin Required)
- Download ffmpeg from the link above
- Extract the archive
- Create a ffmpeg folder in the project directory
- Copy the bin folder into it
- Your structure should be:
c:\dev\speakpy\ffmpeg\bin\ffmpeg.exe
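A lookup along these lines (a hypothetical helper, not the project's actual code) lets the app prefer the portable copy under ffmpeg/bin before falling back to the system PATH:

```python
import shutil
from pathlib import Path
from typing import Optional

def resolve_ffmpeg(project_dir: Path) -> Optional[str]:
    """Return a usable ffmpeg executable path, or None if not found.

    Prefers a portable copy under <project>/ffmpeg/bin, then falls
    back to whatever is on the system PATH.
    """
    for name in ("ffmpeg.exe", "ffmpeg"):  # check the Windows binary first
        portable = project_dir / "ffmpeg" / "bin" / name
        if portable.is_file():
            return str(portable)
    return shutil.which("ffmpeg")  # None when ffmpeg is not on PATH
```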
Make sure your speaches.ai Docker container is running:
docker run -d -p 8000:8000 ghcr.io/speaches-ai/speaches:latest
# Start with visible window
python speakpy_gui.py
# Start minimized to system tray
python speakpy_gui.py --tray
The GUI provides:
- Simple Interface: Click "Start Recording" button to begin, "Stop Recording" to finish
- Live Activity Log: See real-time feedback about recording and processing status
- Transcription Display: View transcription results in a dedicated text area
- Copy to Clipboard: One-click button to copy transcription text
- Auto-Paste: Automatically paste transcribed text into other applications (no admin rights required)
- System Tray Integration: Minimize to tray, control from system tray icon
- Global Hotkey: Press Ctrl+Shift+; to toggle recording from anywhere
- Status Indicators: Visual feedback showing current application state (Ready/Recording/Processing)
GUI Controls:
- Click Start Recording to begin capturing audio
- Speak clearly into your microphone
- Click Stop Recording when finished
- Wait for processing and transcription to complete
- Use Copy to Clipboard to copy the transcription text
- Enable Auto copy to clipboard checkbox to automatically paste text into focused applications
- Use Clear to reset the transcription area
- Model Selection: Edit the model field to change the transcription model (takes effect on next recording)
- Enable VAD Filtering: Checkbox to toggle Voice Activity Detection (silence filtering) for the next recording
- VAD Threshold: Slider to adjust detection sensitivity (0.0-1.0) on-the-fly
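To illustrate what the VAD threshold slider controls, here is a toy energy-based stand-in (not the actual Silero VAD model, which outputs a learned speech probability per frame): frames whose score falls below the threshold are dropped.

```python
import math

def frame_probability(samples):
    """Toy speech score in [0, 1] from normalized float samples in [-1, 1].

    Silero VAD produces a learned probability per ~30 ms frame; this
    RMS-energy score is only a simplified illustration of thresholding.
    """
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return min(rms, 1.0)

def filter_frames(frames, threshold=0.5):
    """Keep only frames whose score meets or exceeds the threshold."""
    return [f for f in frames if frame_probability(f) >= threshold]
```

Raising the threshold discards more borderline audio; lowering it keeps quieter speech at the cost of letting more background noise through.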
Window Management:
- Close (X) Button: Exits the application completely
- Minimize (-) Button: Hides window to system tray (keeps running in background)
- System Tray Icon: Right-click for menu options:
- Show Window: Restore the main window
- Start Recording: Toggle recording from tray
- Exit: Close the application
- Global Hotkey: Press Ctrl+Shift+; anywhere to toggle recording (even when minimized)
Auto-Paste Feature: When the "Auto copy to clipboard" checkbox is enabled, transcribed text will automatically:
- Copy to clipboard
- Simulate Ctrl+V keypress after 150ms
- Paste into whichever application has keyboard focus (e.g., Notepad, browser, Word)
This works without admin rights using standard keyboard input simulation.
You can customize the GUI startup behavior with these flags:
usage: speakpy_gui.py [-h] [--tray] [--api-url API_URL] [--model MODEL]
[--vad] [--vad-threshold VAD_THRESHOLD] [--keep-files]
Arguments:
--tray Start minimized to system tray
--api-url API_URL Speaches.ai API URL (default: http://localhost:8000)
--model MODEL Transcription model
--vad Enable Voice Activity Detection by default
--vad-threshold VAD_THRESHOLD VAD sensitivity threshold (default: 0.5)
--keep-files Keep temporary audio files
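The flags above map naturally onto an argparse parser; this sketch follows the listed defaults (the model flag has no documented default, so it is left unset here):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the CLI flags listed above."""
    p = argparse.ArgumentParser(prog="speakpy_gui.py")
    p.add_argument("--tray", action="store_true",
                   help="Start minimized to system tray")
    p.add_argument("--api-url", default="http://localhost:8000",
                   help="Speaches.ai API URL")
    p.add_argument("--model", default=None,
                   help="Transcription model")
    p.add_argument("--vad", action="store_true",
                   help="Enable Voice Activity Detection by default")
    p.add_argument("--vad-threshold", type=float, default=0.5,
                   help="VAD sensitivity threshold")
    p.add_argument("--keep-files", action="store_true",
                   help="Keep temporary audio files")
    return p
```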
- Recording: Captures audio from your microphone using the sounddevice library
- VAD (Optional): Detects and filters voice activity in real-time using Silero VAD ONNX model (secure, no arbitrary code execution)
- Compression: Processes audio with ffmpeg:
- Removes silence at the beginning
- Converts to 16kHz mono
- Encodes with Opus codec at 32kbps for minimal file size
- Transcription: Sends compressed audio to speaches.ai API
- Results: Displays the transcription in your console
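The compression step above can be sketched as a single ffmpeg invocation; the silence-removal filter values here are illustrative assumptions, not necessarily the project's actual settings:

```python
import subprocess

def build_ffmpeg_command(src: str, dst: str) -> list:
    """ffmpeg arguments for the compression step described above."""
    return [
        "ffmpeg", "-y", "-i", src,
        # Trim leading silence (threshold/duration values are illustrative)
        "-af", "silenceremove=start_periods=1:start_threshold=-40dB",
        "-ac", "1",                         # mono
        "-ar", "16000",                     # 16 kHz sample rate
        "-c:a", "libopus", "-b:a", "32k",   # Opus at 32 kbps
        dst,
    ]

def compress(src: str, dst: str) -> None:
    """Run ffmpeg, raising CalledProcessError on failure."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

At 32 kbps Opus with 16 kHz mono input, a minute of speech compresses to roughly a quarter megabyte, which keeps API uploads fast.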
- Make sure ffmpeg is installed and in your PATH
- Or place ffmpeg in the ffmpeg/bin/ directory within the project
- Run ffmpeg -version to verify the installation
- Check if the Docker container is running: docker ps
- Verify port 8000 is accessible: curl http://localhost:8000/docs
- Make sure you're using the correct API URL
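The same connectivity check can be scripted from Python against the /docs endpoint used above (a small standalone sketch, not part of the project's code):

```python
import urllib.error
import urllib.request

def speaches_reachable(base_url: str = "http://localhost:8000",
                       timeout: float = 2.0) -> bool:
    """Return True if the speaches.ai /docs page responds with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: server not reachable
        return False
```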
- Check if your microphone is connected and enabled
- Check Windows sound settings
- Ensure good microphone quality and minimal background noise
- Try specifying the language: --language en
- Record for longer (speak more before stopping the recording) for better context
- Check if the correct audio device is selected
speakpy/
├── speakpy_gui.py           # Main GUI application
├── pyproject.toml           # Project configuration
├── README.md                # This file
├── src/
│   ├── __init__.py
│   ├── audio_recorder.py    # Audio recording with sounddevice
│   ├── audio_compressor.py  # FFmpeg compression
│   ├── api_client.py        # Speaches.ai API client
│   ├── vad_processor.py     # Voice Activity Detection (Silero VAD)
│   ├── gui.py               # GUI components (tkinter)
│   └── utils.py             # Helper functions
└── ffmpeg/                  # Optional: portable ffmpeg
    └── bin/
        └── ffmpeg.exe
- Switch from PyTorch Hub to ONNX Runtime (done): Migrated VAD to use the ONNX model via the official silero-vad package for improved security (no arbitrary code execution from torch.hub)
- Dynamic API Handler: Add a field to the transcription API endpoint to allow dynamic switching of API handlers at runtime.
- Streaming Transcription: Implement real-time streaming transcription to provide live text feedback while recording.
- speaches.ai - OpenAI-compatible STT/TTS server
- faster-whisper - Fast transcription engine
- sounddevice - Audio I/O library
- Compression technique inspired by Epicenter
This project is free to use and modify.
