Local voice-driven agent for Reachy Mini. Runs entirely on-device — no cloud APIs. Uses faster-whisper for STT, llama.cpp for the LLM, and Piper for TTS.
```
Microphone
    |
    v
[silero-vad]  (CPU, always listening)
    | speech detected
    v
[faster-whisper small.en]  (GPU, ~1.5 GB VRAM)
    | transcribed text
    v
[Agent Client] --> [llama.cpp / Qwen 4B]  (GPU, ~3 GB VRAM)
    |                        |
    |                   tool calls
    |                        v
    |                 [Robot Tools] --> [Reachy Mini SDK]
    |                        |
    |                   tool results
    |<-----------------------'
    | assistant text
    v
[Piper TTS]  (CPU)
    | audio
    v
Speaker
```
All inference stays local on a single machine. Target hardware: older i7 + NVIDIA GTX 1070 (8 GB VRAM).
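One conversational turn, in rough pseudocode built from the module names in the project layout below (the actual function signatures in the repo may differ):

```python
# Illustrative sketch of one turn of the loop in orchestrator.py;
# function names here are assumptions, not the repo's actual signatures.
def run_turn(audio_io, stt, agent, tts, playback) -> None:
    audio = audio_io.wait_for_speech()   # silero-vad gates the microphone
    text = stt.transcribe(audio)         # faster-whisper (GPU)
    reply = agent.chat(text)             # llama.cpp; may execute robot tools
    wav = tts.synthesize(reply)          # Piper (CPU)
    playback.play(wav)                   # out the speaker
```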
- NVIDIA GPU with 6+ GB VRAM (GTX 1070 or better)
- Microphone
- Speaker (or the Reachy Mini's built-in speaker)
- Reachy Mini (Lite or Wireless) — or run in simulator mode
Install these on the host machine (Arch Linux):
```bash
# NVIDIA drivers + CUDA
sudo pacman -S nvidia nvidia-utils cudnn

# CUDA toolkit — see "GPU Compatibility" section below for version choice
sudo pacman -S cuda

# Audio server + libraries
sudo pacman -S pipewire pipewire-pulse pipewire-alsa wireplumber portaudio
systemctl --user enable --now pipewire pipewire-pulse wireplumber

# GStreamer (for Reachy camera)
sudo pacman -S gstreamer gst-plugins-base gst-plugins-good

# llama.cpp (build from source)
# See: https://github.com/ggml-org/llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
CUDACXX=/opt/cuda/bin/nvcc cmake -B build -DGGML_CUDA=ON -DCUDAToolkit_ROOT=/opt/cuda
cmake --build build --config Release -j$(nproc)
sudo cp build/bin/llama-server /usr/local/bin/

# uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Add udev rules so the Reachy Mini's USB devices are accessible without root:

```bash
echo 'SUBSYSTEM=="usb", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="55d3", MODE="0666", GROUP="uucp"
SUBSYSTEM=="usb", ATTRS{idVendor}=="38fb", ATTRS{idProduct}=="1001", MODE="0666", GROUP="uucp"' \
  | sudo tee /etc/udev/rules.d/99-reachy-mini.rules
sudo udevadm control --reload-rules && sudo udevadm trigger
sudo usermod -aG uucp $USER
# Log out and back in for the group change to take effect
```

Download the models into `models/`:

```bash
mkdir -p models/gguf models/piper
# LLM — Qwen 3.5 4B Q4_K_M (recommended for GTX 1070)
curl -L -o models/gguf/qwen3.5-4b-q4_k_m.gguf \
  "https://huggingface.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/Qwen3.5-4B-Q4_K_M.gguf"

# Vision projector (enables image understanding)
curl -L -o models/gguf/mmproj-BF16.gguf \
  "https://huggingface.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/mmproj-BF16.gguf"

# TTS voice
wget -O models/piper/en_GB-northern_english_male-medium.onnx \
  "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_GB/northern_english_male/medium/en_GB-northern_english_male-medium.onnx?download=true"
wget -O models/piper/en_GB-northern_english_male-medium.onnx.json \
  "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_GB/northern_english_male/medium/en_GB-northern_english_male-medium.onnx.json?download=true"

# STT model downloads automatically on first run (faster-whisper handles this)
```

Set up the configuration and install the Python dependencies:

```bash
cp .env.example .env
# Edit .env — update LLAMA_MODEL_PATH in scripts/run_llama_server.sh
# if your GGUF filename differs from the default
uv sync
```

You need three terminals (or use a process manager):
```bash
# Terminal 1: Reachy Mini daemon (or simulator)
reachy-mini-daemon         # real robot
reachy-mini-daemon --sim   # simulator

# Terminal 2: LLM server
# Edit LLAMA_MODEL_PATH if needed
LLAMA_MODEL_PATH=models/gguf/your-model.gguf ./scripts/run_llama_server.sh

# Terminal 3: Voice agent
./scripts/run_app.sh
```

The agent will start listening. Speak, and it will transcribe, reason, execute tools, and respond via TTS.
```
app/
  __init__.py
  main.py             # Entry point — initializes services and starts loop
  config.py           # Settings from .env
  orchestrator.py     # Main conversation loop
  audio_io.py         # VAD-gated microphone capture
  stt_service.py      # faster-whisper transcription
  tts_service.py      # Piper speech synthesis
  agent_client.py     # LLM client with tool calling
  robot_tools.py      # Tool schemas and handlers for Reachy
  robot_state.py      # Reachy Mini connection management
  playback.py         # Audio output
scripts/
  run_llama_server.sh
  run_app.sh
models/               # gitignored — download models here
  gguf/
  piper/
logs/                 # gitignored — runtime logs and snapshots
```
The agent can call these tools during conversation:
| Tool | Description |
|---|---|
| `look_left` | Turn head left |
| `look_right` | Turn head right |
| `look_center` | Return head to center |
| `nod` | Nod yes |
| `shake_head` | Shake no |
| `take_snapshot` | Capture camera image |
| `get_robot_status` | Check connection status |
| `get_time` | Get current time |
To add a new tool, add its OpenAI function schema to `TOOLS` in `robot_tools.py` and a handler in `make_handlers()`. Handlers must return a JSON-serializable dict and must never raise exceptions.
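For example, a hypothetical `wave_hand` tool (the schema follows the standard OpenAI function-calling format; the handler body is illustrative and does not call the real SDK):

```python
# Hypothetical tool: add the schema to TOOLS and the handler to make_handlers().
WAVE_HAND_SCHEMA = {
    "type": "function",
    "function": {
        "name": "wave_hand",
        "description": "Wave at the person the robot is facing.",
        "parameters": {
            "type": "object",
            "properties": {
                "times": {"type": "integer", "description": "Number of waves."},
            },
            "required": [],
        },
    },
}

def wave_hand(times: int = 1) -> dict:
    """Handler: always returns a JSON-serializable dict, never raises."""
    try:
        # ... drive the head/antennas via the Reachy Mini SDK here ...
        return {"ok": True, "times": times}
    except Exception as exc:          # errors become data, not crashes
        return {"ok": False, "error": str(exc)}
```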
CUDA 13 dropped support for Pascal GPUs (GTX 1070 and older, compute capability < 7.5). If you have a Pascal card, you need CUDA 12.x:
```bash
# For Pascal GPUs (GTX 1070, etc.) — install CUDA 12.9 from the AUR
yay -S cuda-12.9   # replaces the cuda package
```

The ctranslate2 library (used by faster-whisper) also needs a matching CUDA version. If you see `libcublas.so.12 not found` errors, your CUDA toolkit version doesn't match. Use `WHISPER_DEVICE=cpu` with `WHISPER_COMPUTE_TYPE=int8` as a reliable fallback — it's fast enough for `small.en`.
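For reference, the device selection in `stt_service.py` boils down to something like this (a sketch; everything beyond the two env vars is illustrative):

```python
# Sketch: pick the Whisper device from the same env vars used in .env.
import os
from faster_whisper import WhisperModel

device = os.getenv("WHISPER_DEVICE", "cuda")             # "cuda" or "cpu"
compute_type = os.getenv("WHISPER_COMPUTE_TYPE", "int8")

model = WhisperModel("small.en", device=device, compute_type=compute_type)
segments, _info = model.transcribe("utterance.wav", language="en")
print(" ".join(seg.text.strip() for seg in segments))
```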
| Component | VRAM |
|---|---|
| faster-whisper small.en (int8, CUDA) | ~1 GB |
| Qwen 3.5 4B Q4_K_M via llama.cpp | ~3 GB |
| CUDA overhead | ~0.5 GB |
| Total | ~4.5 GB |
To free more VRAM, set WHISPER_DEVICE=cpu and WHISPER_COMPUTE_TYPE=int8 in .env to run STT on CPU instead (~0 VRAM, still fast).
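To see what is actually resident on the GPU while the stack is running, query the driver directly:

```bash
# Live VRAM usage, handy when tuning LLAMA_CTX or the Whisper device
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```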
The `--jinja` flag in `run_llama_server.sh` is required for tool calling to work. Without it, the server silently ignores the `tools` parameter.
Context is set to 4096 by default (`LLAMA_CTX`). This is conservative but keeps memory low. Increase it if the model needs more context for complex conversations.
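For reference, the launch the script wraps looks roughly like this (a sketch; flag values here are assumptions, and the real invocation lives in `scripts/run_llama_server.sh`):

```bash
# Sketch only; check scripts/run_llama_server.sh for the real invocation
llama-server \
  -m "${LLAMA_MODEL_PATH:-models/gguf/qwen3.5-4b-q4_k_m.gguf}" \
  --mmproj models/gguf/mmproj-BF16.gguf \
  -c "${LLAMA_CTX:-4096}" \
  -ngl 99 \
  --jinja \
  --host 127.0.0.1 --port 8080
```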
Qwen 3.5 enables "thinking" by default — the model spends tokens on internal reasoning before responding. This is wasteful for a voice agent (adds latency, consumes the token budget). The agent client disables it via chat_template_kwargs: {"enable_thinking": false}, passed through the OpenAI client's extra_body parameter. If you swap to a non-thinking model, this parameter is harmlessly ignored.
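In code, that looks roughly like this (the llama-server port and model name are assumptions):

```python
# Sketch: disabling Qwen's thinking mode through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local",  # llama-server serves whatever model it was started with
    messages=[{"role": "user", "content": "Say hello."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```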
The current agent layer (agent_client.py) is a thin OpenAI-compatible client that handles tool-calling loops directly against the llama.cpp server. It uses the same tool schema format as Hermes, so swapping in Hermes later requires minimal changes — point Hermes at the same llama.cpp endpoint and register the same tool handlers.
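A stripped-down version of that loop (the tool schemas, handlers, and endpoint below are placeholders; see `robot_tools.py` for the real definitions):

```python
# Minimal tool-calling loop against llama-server's OpenAI-compatible API.
# TOOLS / handlers are stand-ins for robot_tools.py's definitions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")
TOOLS = []         # e.g. the schemas exported by robot_tools.py
handlers = {}      # e.g. make_handlers() -> {"look_left": fn, ...}

messages = [{"role": "user", "content": "Look left, then tell me the time."}]
while True:
    resp = client.chat.completions.create(
        model="local", messages=messages, tools=TOOLS,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:
        break                             # plain text -> hand off to TTS
    messages.append(msg)                  # keep the tool-call turn in context
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments or "{}")
        result = handlers[call.function.name](**args)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})

print(msg.content)
```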
- GTX 1070 + Arch Linux — CUDA 12 requirement, PipeWire, permissions, VRAM budget
- Core loop — STT + LLM + TTS working end-to-end locally
- Robot tools — agent calls Reachy movement tools from conversation
- Session memory — conversation context persists across turns (Hermes integration)
- Camera — `take_snapshot` tool, optional vision pipeline
- Attention model — wake word or LLM-based filtering for "am I being spoken to?"
- Self-improvement — guarded tool/skill editing with human approval