
feat: Enable voice mode support for CodeMie CLI #214

@amdmax

Description


Summary

Add voice mode support to CodeMie CLI, allowing users to interact with AI coding agents using voice input instead of (or in addition to) text input.

Motivation

Voice input provides a hands-free, natural interaction mode that can improve developer productivity — especially during code reviews, brainstorming, and when navigating complex codebases. The project already has a sound effects system (--sounds flag) for audio output on hook events, but lacks any voice/microphone input capability.

Requirements

Core Functionality

  • Record audio from the user's microphone
  • Transcribe speech to text using a speech-to-text provider
  • Send transcribed text as a prompt to the active AI coding agent
  • Support toggling voice mode on/off during a session (e.g., /voice command)
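A rough sketch of the end-to-end turn these requirements describe, with every stage injected as a stub so only the shape is fixed. All interface and function names here are illustrative, not a proposed API.

```typescript
// Sketch of one voice "turn": record -> transcribe -> confirm -> send.
// Each stage is injected so recorder/STT/UI choices stay pluggable.
interface VoiceIO {
  record(): Promise<string>;              // returns path to the captured audio file
  transcribe(path: string): Promise<string>;
  confirm(text: string): Promise<boolean>; // show transcription, ask user to approve
  sendPrompt(text: string): Promise<void>;
}

// Returns true if a prompt was actually sent to the agent.
export async function voiceTurn(io: VoiceIO): Promise<boolean> {
  const audioPath = await io.record();
  const text = await io.transcribe(audioPath);
  if (!(await io.confirm(text))) return false; // user rejected the transcription
  await io.sendPrompt(text);
  return true;
}
```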

Audio Recording

  • Detect available audio recording tools (SoX rec, arecord, etc.)
  • Provide clear installation instructions if no recorder is found (e.g., brew install sox)
  • Support configurable recording parameters (sample rate, format, silence detection)
  • Implement voice activity detection (VAD) to auto-stop recording on silence
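A minimal sketch of recorder detection, mirroring the player-detection approach in `sounds-installer.ts`. The preference order (SoX `rec`, then ALSA `arecord`) and the install hints are assumptions; the PATH lookup is injectable so it can be unit-tested without touching the shell.

```typescript
// Sketch: find an available CLI audio recorder, or explain how to install one.
import { execSync } from "node:child_process";

type Recorder = "rec" | "arecord" | null;

// Checks PATH via the POSIX `command -v` builtin.
function isOnPath(tool: string): boolean {
  try {
    execSync(`command -v ${tool}`, { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// Prefer SoX's `rec` (macOS/Linux), fall back to ALSA's `arecord` (Linux).
export function detectRecorder(lookup: (tool: string) => boolean = isOnPath): Recorder {
  for (const tool of ["rec", "arecord"] as const) {
    if (lookup(tool)) return tool;
  }
  return null;
}

// Platform-specific hint shown when no recorder is found.
export function installHint(): string {
  return process.platform === "darwin"
    ? "No audio recorder found. Install SoX: brew install sox"
    : "No audio recorder found. Install SoX (apt install sox) or ALSA utils (apt install alsa-utils)";
}
```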

Speech-to-Text Integration

  • Support multiple STT providers:
    • OpenAI Whisper API (already available as a dependency)
    • Local Whisper model (for offline/privacy-sensitive use)
    • Configurable via profile settings
  • Handle transcription errors gracefully
  • Show transcribed text to the user for confirmation before sending
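One way to sketch the "handle errors gracefully" requirement: wrap the STT call so failures surface as a friendly message instead of a crash. The provider call is injected so either the OpenAI API (the Node SDK exposes `client.audio.transcriptions.create`) or a local Whisper runner can be plugged in; the result shape and wording below are illustrative only.

```typescript
// Sketch: STT wrapper that never throws - it returns either the transcription
// or a user-facing error message, so the session loop stays alive.
type TranscribeFn = (audioPath: string) => Promise<string>;

export interface SttResult {
  ok: boolean;
  text: string; // transcription on success, error message on failure
}

export async function transcribeSafely(
  audioPath: string,
  transcribe: TranscribeFn,
): Promise<SttResult> {
  try {
    const text = (await transcribe(audioPath)).trim();
    if (text === "") {
      return { ok: false, text: "No speech detected - try recording again." };
    }
    return { ok: true, text };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, text: `Transcription failed: ${reason}` };
  }
}
```

On success the caller can then show `result.text` for confirmation before sending, per the last bullet above.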

CLI Integration

  • Add --voice flag to agent session commands
  • Add /voice slash command to toggle voice mode within a session
  • Integrate with existing hook system (UserPromptSubmit hook)
  • Support keyboard shortcut to start/stop recording (e.g., hold-to-talk or push-to-talk)
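The `/voice` toggle could be a small handler in whatever dispatcher already routes slash commands, assuming the session loop passes raw input lines to it. Names and messages below are placeholders:

```typescript
// Sketch: session-level toggle for a /voice slash command.
export interface SessionState {
  voiceEnabled: boolean;
}

// Returns a status message if the command was handled, or null so other
// slash-command handlers get a chance to run.
export function handleSlashVoice(input: string, state: SessionState): string | null {
  if (input.trim() !== "/voice") return null;
  state.voiceEnabled = !state.voiceEnabled;
  return state.voiceEnabled
    ? "Voice mode on - recording starts with your next prompt."
    : "Voice mode off - back to text input.";
}
```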

Configuration

  • Add voice settings to profile configuration:
    • voice.enabled — enable/disable voice mode
    • voice.provider — STT provider (openai, local-whisper)
    • voice.language — preferred language for transcription
    • voice.silenceTimeout — seconds of silence before auto-stop
    • voice.confirmBeforeSend — show transcription before sending
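The settings above could map to a typed shape with defaults merged under a partial per-profile override. The default values chosen here are assumptions to be settled during implementation:

```typescript
// Sketch: typed voice settings matching the keys listed above.
export interface VoiceSettings {
  enabled: boolean;
  provider: "openai" | "local-whisper";
  language: string;
  silenceTimeout: number; // seconds of silence before auto-stop
  confirmBeforeSend: boolean;
}

// Assumed defaults - conservative: off by default, confirm before sending.
export const VOICE_DEFAULTS: VoiceSettings = {
  enabled: false,
  provider: "openai",
  language: "en",
  silenceTimeout: 2,
  confirmBeforeSend: true,
};

// Profile values win over defaults; missing keys fall back.
export function resolveVoiceSettings(profile: Partial<VoiceSettings> = {}): VoiceSettings {
  return { ...VOICE_DEFAULTS, ...profile };
}
```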

Technical Considerations

  • Existing audio infrastructure: The project already detects audio players in src/agents/plugins/claude/sounds-installer.ts; recorder detection should follow the same pattern
  • OpenAI SDK: Already included as a dependency with openai/resources/audio (transcription, translation, speech APIs)
  • Architecture: Should follow the plugin-based 5-layer architecture (CLI → Registry → Plugin → Core → Utils)
  • Cross-platform: Must work on macOS (SoX/rec), Linux (arecord, SoX), and Windows WSL
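The cross-platform point could come down to building a per-tool invocation: SoX's `rec` supports silence-based auto-stop via its `silence` effect, while `arecord` has no VAD and needs a duration cap instead. The exact thresholds below (16 kHz mono, 1% volume floor) are assumptions to validate against each tool:

```typescript
// Sketch: argv for the recorder invocation, per detected tool.
export function recordCommand(
  tool: "rec" | "arecord",
  outFile: string,
  silenceTimeout: number, // seconds of silence before auto-stop (rec only)
): string[] {
  if (tool === "rec") {
    // SoX: 16 kHz mono, then the `silence` effect - trim leading silence,
    // stop after `silenceTimeout` seconds below 1% volume.
    return [
      "rec", "-r", "16000", "-c", "1", outFile,
      "silence", "1", "0.1", "1%", "1", String(silenceTimeout), "1%",
    ];
  }
  // arecord: no VAD available, so cap the recording at 30 s instead.
  return ["arecord", "-f", "S16_LE", "-r", "16000", "-c", "1", "-d", "30", outFile];
}
```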

Out of Scope (for initial version)

  • Text-to-speech responses (agent speaking back)
  • Real-time streaming transcription
  • Multi-language auto-detection
  • Wake word activation

Acceptance Criteria

  1. User can start a voice session with codemie chat --voice or toggle with /voice
  2. Audio is recorded, transcribed, and sent as a prompt to the agent
  3. Clear error messages when audio tools are missing
  4. Voice settings are configurable per profile
  5. Works on macOS and Linux

Labels

enhancement (New feature or request)