Labels: enhancement (New feature or request)
Summary
Add voice mode support to CodeMie CLI, allowing users to interact with AI coding agents using voice input instead of (or in addition to) text input.
Motivation
Voice input provides a hands-free, natural interaction mode that can improve developer productivity — especially during code reviews, brainstorming, and when navigating complex codebases. The project already has a sound effects system (--sounds flag) for audio output on hook events, but lacks any voice/microphone input capability.
Requirements
Core Functionality
- Record audio from the user's microphone
- Transcribe speech to text using a speech-to-text provider
- Send transcribed text as a prompt to the active AI coding agent
- Support toggling voice mode on/off during a session (e.g., `/voice` command)
Audio Recording
- Detect available audio recording tools (SoX `rec`, `arecord`, etc.)
- Provide clear installation instructions if no recorder is found (e.g., `brew install sox`)
- Support configurable recording parameters (sample rate, format, silence detection)
- Implement voice activity detection (VAD) to auto-stop recording on silence
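Recorder detection could mirror the existing player-detection pattern in `sounds-installer.ts`. The sketch below is a minimal illustration of that idea; the names (`RecorderSpec`, `detectRecorder`) and the exact recording arguments are assumptions, not actual codebase identifiers. The SoX `silence` effect arguments shown implement the auto-stop-on-silence behavior described above.

```typescript
// Hypothetical sketch: detect the first available audio recorder, in
// preference order. Names and argument choices are illustrative.
interface RecorderSpec {
  command: string;
  buildArgs: (outPath: string) => string[];
}

// Both candidates record 16 kHz mono WAV; the SoX `rec` invocation stops
// after ~2 s of trailing silence via the `silence` effect.
const RECORDERS: RecorderSpec[] = [
  {
    command: "rec", // SoX (macOS/Linux): brew install sox / apt install sox
    buildArgs: (out) => [
      "-q", "-r", "16000", "-c", "1", out,
      "silence", "1", "0.1", "1%", "1", "2.0", "1%",
    ],
  },
  {
    command: "arecord", // ALSA (Linux)
    buildArgs: (out) => ["-q", "-f", "S16_LE", "-r", "16000", "-c", "1", out],
  },
];

// isInstalled is injected so the caller decides how to probe PATH
// (e.g. by shelling out to `which`), which also keeps this testable.
function detectRecorder(
  isInstalled: (cmd: string) => boolean,
): RecorderSpec | null {
  return RECORDERS.find((r) => isInstalled(r.command)) ?? null;
}
```

When `detectRecorder` returns `null`, the CLI would print the per-platform install hint (e.g., `brew install sox`) rather than failing silently.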
Speech-to-Text Integration
- Support multiple STT providers:
  - OpenAI Whisper API (already available as dependency)
  - Local Whisper model (for offline/privacy-sensitive use)
  - Configurable via profile settings
- Handle transcription errors gracefully
- Show transcribed text to the user for confirmation before sending
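The transcribe-then-confirm flow could be kept provider-agnostic by injecting the STT call. Below is a sketch under that assumption; `transcribeWithConfirm` is a hypothetical name. The commented-out OpenAI call is the SDK's standard Whisper transcription usage (`openai.audio.transcriptions.create`), shown only as one possible `Transcriber`.

```typescript
// Hypothetical flow: transcribe a recording, handle errors gracefully,
// and let the user confirm the text before it is sent as a prompt.
type Transcriber = (audioPath: string) => Promise<string>;

// An OpenAI-backed Transcriber could look like:
//   const result = await openai.audio.transcriptions.create({
//     file: fs.createReadStream(audioPath),
//     model: "whisper-1",
//   });
//   return result.text;

async function transcribeWithConfirm(
  audioPath: string,
  transcribe: Transcriber,
  confirm: (text: string) => Promise<boolean>, // shows text to the user
): Promise<string | null> {
  let text: string;
  try {
    text = (await transcribe(audioPath)).trim();
  } catch (err) {
    console.error(`Transcription failed: ${err}`); // degrade gracefully
    return null;
  }
  if (text === "") return null; // nothing recognizable was captured
  return (await confirm(text)) ? text : null; // null = user rejected
}
```

Returning `null` on failure or rejection lets the session simply drop the turn instead of sending a garbled prompt to the agent.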
CLI Integration
- Add `--voice` flag to agent session commands
- Add `/voice` slash command to toggle voice mode within a session
- Integrate with existing hook system (`UserPromptSubmit` hook)
- Support keyboard shortcut to start/stop recording (e.g., hold-to-talk or push-to-talk)
Configuration
- Add voice settings to profile configuration:
  - `voice.enabled` — enable/disable voice mode
  - `voice.provider` — STT provider (openai, local-whisper)
  - `voice.language` — preferred language for transcription
  - `voice.silenceTimeout` — seconds of silence before auto-stop
  - `voice.confirmBeforeSend` — show transcription before sending
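The settings above could resolve against defaults so profiles only need to specify overrides. A sketch with illustrative default values (the defaults themselves are assumptions, not decided in this issue):

```typescript
// Hypothetical shape of the voice settings block; keys mirror the
// voice.* names listed above, defaults are illustrative.
interface VoiceSettings {
  enabled: boolean;
  provider: "openai" | "local-whisper";
  language: string;       // e.g. "en"
  silenceTimeout: number; // seconds of silence before auto-stop
  confirmBeforeSend: boolean;
}

const DEFAULT_VOICE_SETTINGS: VoiceSettings = {
  enabled: false,
  provider: "openai",
  language: "en",
  silenceTimeout: 2,
  confirmBeforeSend: true,
};

// Merge a profile's partial voice config over the defaults.
function resolveVoiceSettings(
  overrides: Partial<VoiceSettings>,
): VoiceSettings {
  return { ...DEFAULT_VOICE_SETTINGS, ...overrides };
}
```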
Technical Considerations
- Existing audio infra: The project has audio player detection in `src/agents/plugins/claude/sounds-installer.ts` — recording detection should follow the same pattern
- OpenAI SDK: Already included as a dependency with `openai/resources/audio` (transcription, translation, speech APIs)
- Architecture: Should follow the plugin-based 5-layer architecture (CLI → Registry → Plugin → Core → Utils)
- Cross-platform: Must work on macOS (SoX `rec`), Linux (`arecord`, SoX), and Windows WSL
Out of Scope (for initial version)
- Text-to-speech responses (agent speaking back)
- Real-time streaming transcription
- Multi-language auto-detection
- Wake word activation
Acceptance Criteria
- User can start a voice session with `codemie chat --voice` or toggle with `/voice`
- Audio is recorded, transcribed, and sent as a prompt to the agent
- Clear error messages when audio tools are missing
- Voice settings are configurable per profile
- Works on macOS and Linux