Welcome to Aivy, an open-source voice assistant that channels the charm and intelligence of Tony Stark’s iconic AI companion! Built with Python, Aivy combines real-time speech-to-text (STT), text-to-speech (TTS), and a local large language model (LLM) to deliver seamless, witty conversations. Whether you’re asking about the weather or craving a clever quip, Aivy is here to assist with style! 😎
- 🎙️ Real-Time Speech Recognition: Transcribes speech after 2 seconds of talking followed by 1.5 seconds of silence, using `faster_whisper`.
- 🗣️ Human-Like Speech Synthesis: Generates natural-sounding audio with the `mimi` TTS model (CSM-1B).
- 🧠 Witty Conversations: Powered by LLaMA-3.2-1B via LM Studio, delivering Iron Man-style responses.
- 🔄 Overlap Prevention: Blocks new speech during responses for clear, uninterrupted interactions.
- 🌐 Cross-Platform: Runs on Linux, macOS, and Windows with GPU or CPU support.
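The stages above are chained into a single listen → think → speak loop. A minimal sketch of that flow (the function names here are illustrative stand-ins, not Aivy's actual API):

```python
def run_pipeline(transcribe, ask_llm, speak, audio):
    """Chain the three stages: STT -> LLM -> TTS.

    Each stage is passed in as a callable so the sketch stays
    independent of any particular model or library.
    """
    text = transcribe(audio)   # faster_whisper would go here
    reply = ask_llm(text)      # LLaMA-3.2-1B via LM Studio
    speak(reply)               # mimi (CSM-1B) TTS
    return text, reply

# Toy stand-ins just to show the data flow:
heard, said = run_pipeline(
    transcribe=lambda a: "hey aivy",
    ask_llm=lambda t: f"You said: {t}",
    speak=lambda r: None,
    audio=b"...",
)
```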
Follow these steps to bring Aivy to life on your system. Whether you’re on Linux, macOS, or Windows, Aivy is designed to be easy to set up and fun to use!
Before you begin, ensure you have the following:
- Python 3.8+: Check with `python --version`.
- CUDA (Optional): For GPU acceleration (requires an NVIDIA GPU and the CUDA toolkit).
- FFmpeg: For audio processing.
  - Ubuntu: `sudo apt install ffmpeg`
  - macOS: `brew install ffmpeg`
  - Windows: Download from FFmpeg and add it to your PATH.
- LM Studio: To serve LLaMA-3.2-1B locally (download).
- Models:
  - LLaMA-3.2-1B: Available via Hugging Face or LM Studio (requires a Hugging Face account).
  - CSM-1B: `mimi` weights from Hugging Face.
- Microphone and Speakers: Any standard USB, Bluetooth, or built-in mic/speaker.
- Git: To clone the repository (check with `git --version`).
1. Clone the Repository:

   ```bash
   git clone https://github.com/yourusername/aivy.git
   cd aivy
   ```

2. Set Up a Virtual Environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   `requirements.txt` (included in the repo):

   ```text
   faster-whisper
   openai
   sounddevice
   numpy
   torch
   torchaudio
   transformers
   huggingface_hub
   safetensors
   psutil
   ```
4. Download Models:
   - LLaMA-3.2-1B:
     1. Download via LM Studio or Hugging Face (e.g., `meta-llama/Llama-3.2-1B`).
     2. Set the model path environment variable:

        ```bash
        export LLAMA_MODEL_PATH=/path/to/Llama-3.2-1B
        ```

   - CSM-1B:
     1. Download the `mimi` weights from Hugging Face (e.g., `moshi/mimi`).
     2. Set the model path:

        ```bash
        export CSM_MODEL_PATH=/path/to/csm-1b
        ```
5. Configure Audio Devices (Optional):
   - Set environment variables for your microphone and speaker:

     ```bash
     export INPUT_DEVICE=0   # Microphone index
     export OUTPUT_DEVICE=0  # Speaker index
     ```

   - Run `python -c "import sounddevice as sd; print(sd.query_devices())"` to list available devices.
   - If not set, Aivy will prompt you to select devices interactively.
6. Set Environment (Linux Only): To ensure compatibility with audio libraries:

   ```bash
   export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
   ```
7. Start LM Studio:
   - Open LM Studio and load the LLaMA-3.2-1B model.
   - Start the server at `http://localhost:1234/v1` (default settings).
8. Launch Aivy:

   ```bash
   python aivy.py
   ```

   - Select your microphone and speaker when prompted (or use `INPUT_DEVICE`/`OUTPUT_DEVICE`).
   - Speak: "Hey Aivy, what's the weather?" and enjoy a witty response!
9. Test the Interaction:
   - Say: "Hey Aivy, tell me a joke."
   - Expected response: "Why did the AI go to therapy? It had an identity crisis, sir!"
   - Aivy waits for 2 seconds of speech followed by 1.5 seconds of silence before transcribing.
A typical session log looks like this:

```text
JARVIS initializing...
Initialization complete in 2.35 seconds
Entering main loop...
Audio stream started...
Buffer size: 32000 samples, amplitude: 0.0450
Detected speech, updating last_speech_time
Waiting for silence (1500/1500ms)
Transcribing audio buffer...
User: Hey Aivy, what's the weather like today?
Sending to LLM...
JARVIS: Sunny with a chance of brilliance, sir! (LLM took 0.20 seconds)
Generating speech for: Sunny with a chance...
Audio generation completed in 8.00 seconds
```
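The "2 seconds of speech, then 1.5 seconds of silence" gating visible in the log can be expressed as a small pure function. This is a simplified sketch of the idea, not Aivy's actual implementation; the amplitude threshold is an illustrative value:

```python
SPEECH_MS = 2000      # minimum span of detected speech before transcription is armed
SILENCE_MS = 1500     # trailing silence that triggers transcription
AMP_THRESHOLD = 0.02  # samples louder than this count as speech (illustrative)

def should_transcribe(events, now_ms):
    """Decide whether to transcribe the audio buffer.

    events: list of (timestamp_ms, amplitude) samples from the stream.
    Returns True once speech has spanned at least SPEECH_MS and the
    last SILENCE_MS contain no sample above the threshold.
    """
    speech = [t for t, amp in events if amp > AMP_THRESHOLD]
    if not speech:
        return False                      # nothing heard yet
    if speech[-1] - speech[0] < SPEECH_MS:
        return False                      # not enough accumulated speech
    return now_ms - speech[-1] >= SILENCE_MS  # trailing silence reached?
```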
- Audio Device Detection: If your mic or speaker isn't detected, list devices with `python -c "import sounddevice as sd; print(sd.query_devices())"` and set `INPUT_DEVICE`/`OUTPUT_DEVICE` accordingly.
- Model Paths: Ensure `LLAMA_MODEL_PATH` and `CSM_MODEL_PATH` are set correctly to avoid loading errors.
- Latency on CPU: A GPU is recommended for real-time performance; CPU inference may introduce noticeable delays.
- LM Studio Connection: If Aivy fails to connect to `http://localhost:1234/v1`, ensure LM Studio is running and the server is active.
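The `INPUT_DEVICE`/`OUTPUT_DEVICE` handling described above can be sketched as follows (a simplified stand-in for illustration, not Aivy's exact startup code):

```python
import os

def pick_device(env_var):
    """Return a device index from the environment, or None so the
    caller falls back to interactive selection."""
    value = os.environ.get(env_var)
    if value is None:
        return None  # Aivy would prompt the user interactively
    try:
        return int(value)  # e.g. INPUT_DEVICE=0
    except ValueError:
        raise SystemExit(f"{env_var} must be an integer device index, got {value!r}")

mic_index = pick_device("INPUT_DEVICE")
speaker_index = pick_device("OUTPUT_DEVICE")
```

The resulting indices are the same ones `sounddevice` reports via `sd.query_devices()`.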
File an issue on GitHub if you encounter problems, and include logs for faster resolution!
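To verify the LM Studio endpoint independently of Aivy, you can send a chat-completions request by hand. A minimal standard-library sketch (the model name and prompt are illustrative; use whatever name LM Studio shows for your loaded model):

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:1234/v1"):
    """Build an OpenAI-style chat completions request for LM Studio."""
    payload = {
        "model": "llama-3.2-1b",  # illustrative; match the name in LM Studio
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Hey Aivy, what's the weather?")
# urllib.request.urlopen(req) would return the JSON response,
# provided the LM Studio server is actually running.
```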
Aivy is just the beginning! Here’s a roadmap to make it even more powerful:
- 🗣️ Interruption Handling: Reintroduce real-time interruption, allowing users to stop playback by speaking (currently disabled for stability).
- 🎛️ GUI for Device Selection: Add a simple interface using `tkinter` or `PyQt` to choose audio devices.
- ⚡ STT Optimization: Use smaller `faster_whisper` models (e.g., `base`) for faster transcription.
- 🎤 Wake-Word Detection: Implement "Hey Aivy" activation using `webrtcvad` or `snowboy`.
- 🌐 Multilingual Support: Extend STT and TTS to non-English languages with `whisper-large` and multilingual TTS models.
- 🔍 Web Search and Function Calling: Integrate real-time web search and function calling.
- 🔇 Noise Cancellation: Integrate `rnnoise` for robust performance in noisy environments.
- ☁️ Cloud LLM Integration: Support cloud-based LLMs (e.g., via xAI’s API) for users without powerful hardware.
- 📱 Mobile App: Develop a mobile interface for Aivy using Kivy or Flutter.
- 🔌 Plugin System: Enable custom commands (e.g., control smart home devices) via a plugin architecture.
We’re thrilled to welcome contributions to Aivy! 🙌 Whether you’re fixing bugs, adding features, or improving docs, your help will make Aivy the ultimate voice assistant.
- Fork the Repo: Click the "Fork" button on GitHub.
- Create a Branch: `git checkout -b my-feature`.
- Make Changes: Implement your feature or bug fix.
- Commit: `git commit -m "Add cool feature"`.
- Push: `git push origin my-feature`.
- Open a Pull Request: Describe your changes and submit!
- Debug and stabilize interruption handling (see Issues).
- Add support for alternative TTS models (e.g., VITS, Tacotron).
- Improve documentation with video tutorials or setup guides.
- Optimize GPU memory usage for larger models.
Check the Issues tab for open tasks, or share your ideas in Discussions.
Aivy is licensed under the MIT License. See LICENSE for details.
Aivy wouldn’t be possible without these amazing projects:
- faster_whisper: Lightning-fast speech recognition.
- mimi (CSM-1B): High-quality text-to-speech synthesis.
- LLaMA-3.2-1B: Powering Aivy’s clever responses.
- LM Studio: Seamless local LLM serving.
- sounddevice: Reliable audio input and output.
Special thanks to the open-source community for inspiring and supporting Aivy’s development! 🌟
Have questions or ideas? Reach out via GitHub Discussions or open an issue. Let’s make Aivy legendary together!
⭐ Star this repo if you love Aivy! Join us in building the ultimate voice assistant! 🚀🤖