Speaking is just easier.
Speak freely, type instantly on Wayland (X11 compatible) — 100% local voice dictation for Linux with 25+ languages, 5 translation backends, speaker diarization, and real-time visual feedback. Text appears right where your cursor is.
What is dictee? • System requirements • Quick start • Features • Installation • Configuration • Usage • Post-processing • Limitations • Roadmap • Wiki
dictee is a complete voice dictation system for Linux. Press a shortcut, speak, and the text is typed directly into the active application — any application, any window, any text field.
Transcription is performed 100% locally by default: no audio ever leaves your machine unless you explicitly choose a cloud translation backend.
- 100% local processing by default — no audio leaves the machine unless you explicitly enable a cloud translation backend. Frozen ONNX models, no training on your data.
- 4 ASR backends to choose from — Parakeet-TDT and Canary run as native Rust binaries (ONNX Runtime, low GPU latency), while faster-whisper (99 languages) and Vosk (lightweight CPU) run in Python. Transparent switching via Unix socket depending on language, latency, or hardware. → 4 ASR backends
- 5 translation backends to choose from — from fully local (Canary, LibreTranslate, Ollama) to cloud (Google, Bing), with an explicit privacy table for each option. → Translation backends
- No duration limit on audio files — the chunked pipeline shipped in v1.3 (`dictee-transcribe`) diarizes a 54-min keynote in 122 s on an 8 GB GPU, where direct mel loading caps at 10-15 min. Ideal for meeting minutes and long interviews.
- Native Linux integration — KDE Plasma 6 plasmoid + PyQt6 system tray (compatible with GNOME, XFCE, Sway via AppIndicator fallback). No other desktop dictation app offers this on Linux.
| Backend | Min RAM | CPU mode | GPU | Disk |
|---|---|---|---|---|
| Parakeet-TDT (default) | 4 GB | Yes — ~0.8 s per utterance (recent CPU) | NVIDIA 4 GB+ VRAM (~5× faster) | 3 GB |
| Canary-1B v2 | 6 GB | No — encoder too heavy | NVIDIA 6 GB+ VRAM required | 6 GB |
| faster-whisper | 4 GB | Yes — turbo or small | NVIDIA 4 GB+ VRAM (large-v3) | 3 GB |
| Vosk | 2 GB | Yes — by design | — | 50 MB |
Distributions tested: Ubuntu 22.04 / 24.04 · Debian 12 · Fedora 40 / 44 · openSUSE Tumbleweed · Arch Linux · KDE Neon.
Desktop environments: KDE Plasma 6 (full integration via native plasmoid widget) · GNOME, Xfce, Cinnamon (system tray only — GNOME requires the AppIndicator extension).
Three steps to go from zero to dictation in under two minutes:
1. Install

```bash
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash
```

2. Configure
The first-run wizard walks you through backend selection, model download and keyboard shortcut binding. Re-run it anytime with `dictee --setup`.
3. Speak
Press your shortcut (default F9), speak, release. The transcription appears at your cursor.
For detailed install paths (manual .deb/.rpm, GPU prerequisites, AUR, from source), see Installation below or the wiki's Installation and GPU-Setup pages.
| Backend | Languages | Model size | Warm latency | Notes |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 | 25 | ~2.5 GB | ~0.8s CPU · ~0.16s GPU | Default, native punctuation |
| Canary-1B v2 | 25 | ~5 GB | ~0.7s GPU | Built-in translation (25 ↔ EN, 48 pairs) |
| faster-whisper | 99 | ~500 MB–3 GB | ~0.3s | Wide language coverage |
| Vosk | 20+ | ~50 MB | ~1.5s | Lightweight, strict offline |
Each backend runs as a systemd user service with the same Unix socket protocol — switching is transparent. → ASR-Backends wiki
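Because every backend sits behind the same socket, a caller never has to know which engine is active. The sketch below illustrates the idea with a toy daemon; the newline-delimited request/response protocol and the socket path are assumptions for illustration, not dictee's documented wire format.

```python
# Sketch of a client for a Unix-socket transcription service. The wire
# protocol here (newline-delimited request/response) is an illustration,
# not dictee's documented protocol; a toy daemon stands in for the backend.
import os
import socket
import tempfile
import threading

SOCK = os.path.join(tempfile.mkdtemp(), "asr.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK)   # bind before any client connects
server.listen(1)

def toy_daemon():
    # Stand-in for parakeet/canary/whisper/vosk: returns a canned result.
    conn, _ = server.accept()
    conn.recv(4096)                 # request, e.g. a path to a WAV file
    conn.sendall(b"hello world\n")  # "transcription"
    conn.close()

threading.Thread(target=toy_daemon, daemon=True).start()

def transcribe(wav_path: str) -> str:
    # The client only knows the socket path, so swapping the daemon
    # behind it (switching ASR backends) is invisible to callers.
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(SOCK)
    client.sendall(wav_path.encode() + b"\n")
    reply = client.recv(4096).decode().strip()
    client.close()
    return reply
```

Restarting a different daemon on the same socket path is all a backend switch needs; clients are untouched.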
dictee uses Parakeet-TDT 0.6B v3 by default. On the Open ASR Leaderboard, it outperforms Whisper-large-v3 on multilingual transcription while being significantly smaller and faster:
| Model | Size | English WER | FLEURS multilingual (avg) | Relative speed |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 (dictee default) | 600M | ~6.5 % | 12.0 % | ~10× Whisper-large-v3 |
| Whisper-large-v3 | 1.55B | 7.4 % | 12.6 % | baseline |
| Canary-1B v2 (also bundled) | 1B | 7.2 % | – | ~5× Whisper-large-v3 |
| Whisper-large-v3-turbo | 809M | ~7.8 % | – | ~3-4× |
| Vosk (CPU fallback) | 50 MB | ~12-18 % | – | – |
Parakeet-TDT v3 wins particularly on French, Greek, Estonian and Maltese. For maximum language coverage (99 languages), switch to faster-whisper; for built-in translation, switch to Canary-1B.
Sources: NVIDIA Parakeet-TDT v3 · Open ASR Leaderboard 2025.
| Backend | Privacy | Speed | Quality | Languages |
|---|---|---|---|---|
| Canary-1B | 🔒 Local | Built-in | Excellent | 4 |
| LibreTranslate | 🔒 Local | 0.1–0.3s | Good | 30+ |
| Ollama | 🔒 Local | 2–3s | Excellent | Any (LLM) |
| Google Translate | 🌐 Cloud | 0.2–0.7s | Excellent | 130+ |
| Bing Translator | 🌐 Cloud | 1.7–2.2s | Very good | 100+ |
→ Translation wiki · Ollama-Setup
A 12-step configurable pipeline transforms raw ASR output before it hits your cursor:
- Regex rules + dictionary — 7 languages, ASR variants, voice commands → Rules-and-Dictionary
- LLM correction — optional fluency polish via local Ollama (first / last / hybrid position) → LLM-Correction
- Numbers & dates — cardinal, ordinal, versions, decimals, French times → Numbers-Dates-Continuation
- Continuation buffer — continue a sentence across dictations with last-word memory
- Short-text keepcaps — per-language exceptions for acronyms and names (new in v1.3)
Answer "who spoke when?" in multi-speaker recordings via NVIDIA's Sortformer model. Up to 4 speakers, ideal for meeting notes and interviews. Triggered via Meeting mode or dictee --meeting. → Diarization wiki
Push-to-talk is dictee's main flow, but the bundled dictee-transcribe window also handles offline transcription of any audio or video file you already have. Multi-tab interface, audio player synchronised with the timeline, per-tab translation and LLM analysis, export to PDF / SRT / JSON / Markdown.
- Any input format (mp3, mp4, wav, opus, flac, mkv…) — auto-converted via ffmpeg
- Multi-tab — keep the original transcription side-by-side with translations and LLM analyses (summary, chapters, ASR cleanup…)
- Speaker diarization built-in — toggle on, get up to 4 speakers labelled and renamable
- LLM analysis — 14 providers configurable side by side (Ollama, OpenAI, Claude, Gemini, Mistral, DeepSeek, Groq, Cerebras, OpenRouter…)
- Per-tab translation — Canary / LibreTranslate / Ollama / Google / Bing
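The input-format flexibility above comes down to one normalisation step before ASR. As a sketch, here is how such a conversion command might be built; the 16 kHz mono WAV target is the format most ASR models expect, but the exact ffmpeg options dictee uses are an assumption.

```python
# Sketch of the auto-conversion step: build an ffmpeg command that
# normalises any input (mp3, mkv, opus, ...) to 16 kHz mono WAV.
# The exact options dictee uses are an assumption, not its actual command.
def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    return [
        "ffmpeg", "-y",     # overwrite output without asking
        "-i", src,          # any container/codec ffmpeg can read
        "-ac", "1",         # downmix to mono
        "-ar", "16000",     # resample to 16 kHz
        "-f", "wav", dst,
    ]
```

Run it with `subprocess.run(ffmpeg_cmd(src, dst), check=True)` so a failed conversion surfaces as an exception rather than a silent empty file.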
- KDE Plasma 6 widget — native QML plasmoid, 5 animation styles, live state → Plasmoid-Widget
- System tray icon — PyQt6, works on GNOME/XFCE/Sway (AppIndicator fallback) → Tray-Icon
- animation-speech (external) — fullscreen overlay on `wlr-layer-shell` compositors
All three share state via a filesystem watcher — any change is reflected instantly across interfaces (multi-user safe with UID suffix).
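A minimal version of that shared-state mechanism can be sketched as a per-UID JSON file replaced atomically, so a watcher never observes a half-written update. The file name and fields below are illustrative, not dictee's actual layout.

```python
# Sketch of multi-user-safe shared state: one JSON file per UID,
# replaced atomically. File name and fields are illustrative,
# not dictee's actual state-file layout.
import json
import os
import tempfile

STATE = os.path.join(tempfile.gettempdir(), f"dictee-state-{os.getuid()}.json")

def write_state(**fields) -> None:
    tmp = STATE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(fields, f)
    os.replace(tmp, STATE)   # atomic on POSIX: watchers see old or new, never half

def read_state() -> dict:
    with open(STATE) as f:
        return json.load(f)
```

The UID suffix keeps two users on the same machine from clobbering each other's state, which is what makes the design multi-user safe.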
animation-speech is a standalone project that provides a fullscreen visual animation during recording, with cancellation via the Escape key. It works on any Wayland compositor supporting wlr-layer-shell (KDE Plasma, Sway, Hyprland…).
```bash
sudo dpkg -i animation-speech_1.2.0_all.deb
```

Download: animation-speech releases
Note: animation-speech is not compatible with GNOME (no `wlr-layer-shell` support). GNOME users can rely on `dictee-tray` for visual feedback. Contributions for a GNOME Shell extension are welcome — see the plasmoid source for reference architecture.
Auto-detects distro and GPU, adds the NVIDIA CUDA repo if needed, installs the right package:
```bash
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash
```

Supported: Ubuntu, Debian, Fedora, openSUSE, Arch Linux. Other distros fall back to the tarball path.
Options (after `--`):

```bash
# Force CPU (skip GPU detection)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --cpu

# Force GPU (CUDA)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --gpu

# Pin a specific version
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --version 1.3.3

# Non-interactive
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --non-interactive
```

Download from Releases.
Ubuntu / Debian (CPU):

```bash
sudo apt install ./dictee-cpu_1.3.3_amd64.deb
```

Ubuntu / Debian (GPU): requires the NVIDIA CUDA APT repo — see GPU-Setup for the one-time setup, then:

```bash
sudo apt install ./dictee-cuda_1.3.3_amd64.deb
```

Fedora / openSUSE (CPU):

```bash
sudo dnf install ./dictee-cpu-1.3.3-1.x86_64.rpm
```

Fedora / openSUSE (GPU): add the CUDA repo first (see GPU-Setup), then `dictee-cuda-1.3.3-1.x86_64.rpm`.
Arch Linux (AUR): PKGBUILD in the repo root (x86_64 + aarch64). Clone + `makepkg -si`.
aarch64 / Jetson: no pre-built package — build from source. CUDA limited to NVIDIA Jetson boards.
Other distros (tarball):

```bash
tar xzf dictee-1.3.3_amd64.tar.gz
cd dictee-1.3.3
sudo ./install.sh
```

The tarball ships binaries only — system dependencies must be installed beforehand via your distro's package manager. Names vary by distro; pick the equivalents in yours:

- `python3` (≥ 3.10), `python3-pip`, `python3-venv`
- `python3-evdev`, `python3-pyqt6` (+ qtmultimedia + qtsvg), `python3-numpy`
- `pulseaudio-utils`, `pipewire` (or `alsa-utils`), `libnotify(-bin)`, `sox`
- `wl-clipboard` (Wayland), `xclip` (X11)
- `translate-shell`, `curl`
Example for Debian-derived distros (Mint, MX, Pop!_OS…):

```bash
sudo apt install python3 python3-pip python3-venv python3-evdev \
  python3-pyqt6 python3-pyqt6.qtmultimedia python3-pyqt6.qtsvg python3-numpy \
  pulseaudio-utils pipewire libnotify-bin sox wl-clipboard xclip \
  translate-shell curl
```

From source: `cargo build --release --features sortformer` then `sudo ./install.sh`. See Developer-Guide for full Cargo features and package build scripts.
First launch triggers a setup wizard (backend, model, shortcuts).
Reconfigure anytime from the application menu, tray icon, Plasma widget, or by running:
```bash
dictee --setup
```

```bash
# Show current backends
dictee-switch-backend status

# Switch ASR (parakeet · canary · whisper · vosk)
dictee-switch-backend asr canary

# Switch translation (canary · libretranslate · ollama · google · bing)
dictee-switch-backend translate ollama
```

The tray and plasmoid include backend sub-menus — no terminal required.
For detailed configuration (all ASR backends, translation matrix, plasmoid settings, keyboard shortcuts on tiling WMs), see the wiki:
- ASR-Backends · Translation
- Plasmoid-Widget · Tray-Icon
- Keyboard-Shortcuts (KDE/GNOME/Sway/i3/Hyprland)
```bash
# Simple dictation — transcribe and type
dictee

# Dictate + translate (default: system language → English)
dictee --translate
dictee --translate --ollama        # 100% local via Ollama

# Change target language
DICTEE_LANG_TARGET=es dictee --translate   # → Spanish

# Meeting mode (diarization, up to 4 speakers)
dictee --meeting

# Cancel ongoing dictation
dictee --cancel

# Test post-processing rules live
dictee-test-rules                  # interactive
dictee-test-rules --loop           # continuous loop
dictee-test-rules --wav file.wav   # from audio file
```

→ Full command reference: CLI-Reference wiki
dictee runs a configurable 12-step pipeline after transcription and before paste:
- ASR variants normalization
- Dictionary substitution
- Numbers & dates conversion
- Continuation buffer merge
- Regex rules (pre-LLM)
- LLM correction (optional, first position)
- Regex rules (post-LLM)
- Short-text exceptions (keepcaps)
- Extended match mode
- Final capitalization
- Translation (optional)
- Paste / inject
Configure via `dictee --setup` → Post-processing tab, or test rules live with `dictee-test-rules`.
→ Deep dives: Post-Processing-Overview · Rules-and-Dictionary · LLM-Correction · Numbers-Dates-Continuation
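The ordered steps above boil down to a chain of text transforms applied in sequence. Here is a minimal sketch of that structure; the three stand-in steps and their rules are illustrative, not dictee's actual 12 stages.

```python
# Minimal sketch of an ordered post-processing pipeline. The steps and
# rules are illustrative stand-ins for dictee's 12 configurable stages.
def asr_variants(text: str) -> str:
    # normalise a common ASR spelling variant
    return text.replace("okay", "OK")

def dictionary(text: str) -> str:
    # user dictionary substitution
    return text.replace("wayland", "Wayland")

def capitalize(text: str) -> str:
    # final capitalisation (safe on empty strings)
    return text[:1].upper() + text[1:]

PIPELINE = [asr_variants, dictionary, capitalize]

def postprocess(text: str) -> str:
    for step in PIPELINE:   # order matters, exactly as in the list above
        text = step(text)
    return text
```

Enabling or disabling a stage is then just including or excluding it from the list, which is what makes the pipeline configurable.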
- Long-file diarization: the chunked pipeline shipped in v1.3 (used by `dictee-transcribe`) lifts the VRAM cap (54-min keynote diarized in 122 s on 8 GB). In continuous live dictation (push-to-talk held without releasing), a single utterance longer than 10-15 min on an 8 GB GPU may still OOM — rare in practice; split the file or switch to the CPU backend. → Diarization wiki
- AMD / Intel GPUs are not currently supported — dictee falls back to CPU.
- No real-time streaming — Parakeet-TDT and Canary require the full utterance; only Nemotron (EN-only, via Rust binary) streams natively.
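The chunked approach that lifts the long-file cap can be sketched as splitting the recording into overlapping fixed windows, each small enough to fit in VRAM; overlap keeps utterances from being cut at a hard boundary. The window and overlap values below are assumptions, not dictee's actual parameters.

```python
# Sketch of chunked processing: split a long recording into overlapping
# windows so each fits in VRAM. Window/overlap values are assumptions,
# not dictee's actual parameters.
def chunk_spans(total_s: float, window_s: float = 600.0,
                overlap_s: float = 10.0) -> list[tuple[float, float]]:
    spans, start = [], 0.0
    step = window_s - overlap_s   # advance less than a full window
    while start < total_s:
        spans.append((start, min(start + window_s, total_s)))
        start += step
    return spans
```

A 54-min (3240 s) file yields six roughly 10-min windows; the per-window results are then stitched back together, reconciling speaker labels across the overlap regions.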
For bug reports and workarounds, see Troubleshooting.
v1.3.3 (current) — Cross-distro packaging consistency. Arch `.install` hooks now add the input/docker groups at install time, as the `.deb` postinst and `.rpm` %post already did; without this, PTT and LT silently broke on fresh Arch installs. `python-evdev` promoted to a hard dependency on Arch (it was an optdepends, falling back to a broken raw mode). The plasmoid wraps `docker inspect` in `sg docker` so the LT indicator stays accurate when plasmashell's group set hasn't refreshed yet. The udev rule now ships with mode 0660 directly. The postprocess venv (`text2num`) is created in the tarball install too. The `dictee` script wraps `dotool` in `sg input` when invoked from a stale-group parent shell (plasmoid Dictate button typing into the focused window). Closes #5 and #6.
v1.3.2 — CUDA → CPU runtime fallback: the CUDA package now probes /proc/driver/nvidia/gpus/ at startup and falls back to CPU automatically on hosts without a usable NVIDIA driver (virtio VMs, headless containers, machines with the driver uninstalled), instead of crashing in a restart loop. Setup wizard's "ASR service" check is now strict (active state + open socket), so the final page can no longer report "Everything is ready" while the daemon is dead. New DICTEE_FORCE_CPU=1 env override.
v1.3.0 — Short-text keepcaps exceptions (7 languages), extended match mode, LibreTranslate purge models, continuation + translate fixes, version-number dictation, multi-user safe (UID suffix on state files), plasmoid cross-process toggles (LLM / Short / Meeting), 682 postprocess tests + 148 pipeline tests, theme-aware banner.
v1.4+ (planned)
- Chunked diarization — process files > 15 min via `transcribe-diarize-batch` (prototype validated: 54 min in 122 s)
- Hotword boosting — bias ASR decoding toward custom names (shallow fusion on TDT logits, Parakeet only)
- Whisper translate — multi-target translation via `task="translate"` (EN-only, offline)
- Moonshine CPU backend
- CLI speech-to-text — pipe audio, get text
- VAD — hands-free dictation without push-to-talk
- Streaming transcription with live text display
- Built-in overlay — replace external `animation-speech`
- AppImage / Flatpak packaging
- COSMIC / GNOME Shell applets (contributions welcome!)
→ Full history: Changelog wiki
The transcription engine builds on parakeet-rs by Enes Altun — Rust library for NVIDIA Parakeet inference via ONNX Runtime. The Rust Canary implementation was originally ported from onnx-asr by Ivan Stupakov and is now fully self-contained. Parakeet and Canary ONNX models are provided by NVIDIA (downloaded separately from HuggingFace, not redistributed by this project).
Keyboard input simulation uses dotool by geb (GPL-3.0).
This project is distributed under the GPL-3.0-or-later license (see LICENSE).
The original parakeet-rs code by Enes Altun is under the MIT license (see LICENSE-MIT). dotool is bundled under GPL-3.0.









