voissistant

Voice-driven notes for Apple Silicon — speak, review, commit to markdown.

┌─────────── Voiss Notes ────────────┐
│                                     │
│  ┌─ Speech ──────┐ ┌─ Notes ──────┐ │
│  │               │ │              │ │
│  │ You speak     │ │ Rendered     │ │
│  │ here. VAD     │ │ markdown     │ │
│  │ detects turn  │ │ appears here │ │
│  │ boundaries.   │ │ after you    │ │
│  │               │ │ commit.      │ │
│  │ Tab/click to  │ │              │ │
│  │ focus, then   │ │ Tab/click to │ │
│  │ e to edit.    │ │ focus, then  │ │
│  │ (in-memory)   │ │ e to edit.   │ │
│  └───────────────┘ └──────────────┘ │
│                                     │
│  ␣ Rec  e Edit  ⏎ Raw  r Rewrite  │
│  ⎋ Discard  q Quit                 │
└─────────────────────────────────────┘

Press Space to record. Speak naturally — VAD detects when you pause. Press Enter to commit accumulated speech raw (with vocab corrections). Press r to send through an LLM for cleanup before committing. Clean markdown is rendered in the right panel and saved to disk.

Both panels support modal editing. Panels have three visual states: blue (default), orange (selected via Tab/click — no cursor, no interaction), and red with an "EDIT" label (editing — cursor visible, full typing). Press e on a selected panel to enter edit mode. In edit mode, all keys type normally (Space, Enter, q, etc. insert text instead of triggering actions). Press Ctrl+S to save your edit, or Escape to cancel and revert. Left panel edits update the in-memory accumulator (for correcting speech before commit). Right panel edits write back to the notes file.

All processing runs locally on your Mac's GPU via MLX. No cloud required (unless you choose a cloud LLM for post-processing).

uv run voiss notes --rewrite-model ollama/llama3.2

Use as a Library

voissistant (imported as voiss) can be installed as a Python dependency for ASR inference on Apple Silicon, without pulling in any CLI/TUI/audio-capture dependencies.

Install (core only — model loading + file transcription)

pip install "voissistant @ git+https://github.com/jaju/dictate.sh.git"

Install with all features (CLI + notes TUI)

pip install "voissistant[all] @ git+https://github.com/jaju/dictate.sh.git"

Quick example

from voiss import load_asr_model, transcribe_file, TranscribeOptions, build_logit_bias

engine = load_asr_model()
result = transcribe_file(engine, "meeting.wav")
print(result.text)

# With domain vocabulary biasing
bias = build_logit_bias(["Kubernetes", "kubectl"], engine.tokenizer, scale=5.0)
result = transcribe_file(engine, "meeting.wav", TranscribeOptions(
    context="Kubernetes, kubectl, etcd, pod",
    logit_bias=bias,
))

Architecture

The codebase is split into three layers:

voiss.core — Pure inference: model loading, transcription, config. No UI or audio-capture deps. Safe to import in library mode.
voiss.audio — Ring buffer (numpy-only) and VAD (needs webrtcvad). Shared between CLI modes.
voiss.apps — CLI, TUI, pipeline orchestration. Depends on sounddevice, rich, textual, litellm.

The public API lives in voiss.api and is re-exported via voiss.__init__ for convenience.

Features

Full-screen notes TUI — two-panel Textual interface with push-to-record workflow and modal editing
Dual commit modes — Enter for raw commit (fast), r for LLM-rewritten commit
Rendered markdown — right panel displays notes as rendered markdown (switches to TextArea for editing)
Local, streaming ASR on Apple Silicon (MLX, Qwen3-ASR)
Voice activity detection (VAD) for automatic turn boundaries
LLM post-processing — each r commit cleaned up via any litellm-compatible model
ASR context biasing — supply domain vocabulary to improve transcription accuracy
Configurable system prompts for domain-specific output (SOAP notes, meeting minutes, etc.)
Also works as a live transcription pipe (uv run voiss)
Fully offline after models are downloaded (with local LLM backends)

Requirements

macOS on Apple Silicon (MLX)
Python >= 3.12
uv installed
Microphone permission granted to your terminal
For notes mode: an LLM backend (e.g., Ollama running locally)

Quick Start

Notes mode (the main event):

uv run voiss notes --rewrite-model ollama/llama3.2

Live transcription (simpler, no TUI):

uv run voiss

With domain vocabulary for better transcription accuracy:

uv run voiss notes --rewrite-model ollama/llama3.2 \
    --context "Kubernetes, kubectl, etcd, CoreDNS, Istio"

With a custom system prompt for post-processing:

uv run voiss notes --rewrite-model ollama/llama3.2 \
    --system-prompt "You are a medical scribe. Format as SOAP notes."

Or load the system prompt from a file:

uv run voiss notes --rewrite-model ollama/llama3.2 \
    --system-prompt-file ~/prompts/meeting-notes.txt

Save notes to a specific file:

uv run voiss notes --rewrite-model ollama/llama3.2 \
    --notes-file ./meeting-2026-02-04.md

CLI Reference

Shared Options (all modes)

Option	Default	Description
`--model`	`mlx-community/Qwen3-ASR-0.6B-8bit`	ASR model
`--language`	`English`	Transcription language
`--context`	—	Domain vocabulary for ASR context biasing
`--context-file`	—	File containing domain vocabulary
`--context-bias`	`5.0`	Additive logit bias scale for context terms
`--transcribe-interval`	`0.5`	Seconds between ASR updates
`--vad-frame-ms`	`30`	VAD frame size (10/20/30)
`--vad-mode`	`3`	VAD aggressiveness (0-3)
`--vad-silence-ms`	`500`	Silence to finalize a turn (ms)
`--min-words`	`3`	Minimum words to finalize a turn
`--max-buffer`	`30`	Maximum audio buffer in seconds
`--energy-threshold`	`300.0`	RMS energy gate for noise rejection
`--device`	—	Audio input device index
`--list-devices`	—	List audio input devices
`--config-file`	`~/.config/voiss/config.json`	JSON config file for context, corrections, bias, and LLM settings
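
As a rough illustration of what an RMS energy gate like `--energy-threshold` does, the sketch below computes the root-mean-square amplitude of one frame and compares it to the threshold. It assumes 16-bit PCM samples given as plain integers; `frame_rms` and `passes_energy_gate` are hypothetical names, not voiss functions, and the real implementation may differ.

```python
import math

def frame_rms(samples):
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def passes_energy_gate(samples, threshold=300.0):
    """True when the frame is loud enough to be considered speech (assumed behaviour)."""
    return frame_rms(samples) >= threshold

# Two synthetic frames: low-level room noise and normal speech amplitude.
quiet = [20, -15, 10, -25] * 120
loud = [900, -850, 1000, -950] * 120
print(passes_energy_gate(quiet), passes_energy_gate(loud))
```

Raising the threshold rejects more background noise at the cost of dropping quiet speech, which is the trade-off described under Troubleshooting.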

Transcription Mode (`voiss`)

Option	Description
`--analyze`	Enable LLM intent analysis on turn completion
`--llm-model`	LLM for intent analysis (default: `mlx-community/Qwen3-0.6B-4bit`)
`--no-ui`	Disable the Rich live UI

Notes Mode (`voiss notes`)

Option	Description
`--rewrite-model`	LLM for post-processing (e.g., `ollama/llama3.2`). Required unless `litellm_postprocess.model` is set in the config file, which is used as the fallback.
`--system-prompt`	System prompt to guide post-processing style
`--system-prompt-file`	Path to file containing the system prompt
`--notes-file`	Output file path (default: auto-named in notes directory)

Notes are saved to $VOISS_NOTES_DIR (default: ~/.local/share/voiss/notes/) as timestamped markdown files. Use --notes-file to write to a specific path instead.
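
The lookup order described above can be sketched as follows. This is an illustration only: `resolve_notes_path` is a hypothetical helper, and the timestamp format used for the file name is an assumption, since the README does not specify it.

```python
import os
from datetime import datetime
from pathlib import Path

DEFAULT_NOTES_DIR = "~/.local/share/voiss/notes"

def resolve_notes_path(notes_file=None, env=None, now=None):
    """Pick the output file: explicit --notes-file first, else a timestamped file."""
    env = os.environ if env is None else env
    if notes_file:
        return Path(notes_file).expanduser()
    notes_dir = Path(env.get("VOISS_NOTES_DIR") or DEFAULT_NOTES_DIR).expanduser()
    # Assumed file name format; the real naming scheme may differ.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d-%H%M%S")
    return notes_dir / f"{stamp}.md"

print(resolve_notes_path(env={"VOISS_NOTES_DIR": "/tmp/notes"}))
```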

Notes Mode Key Bindings

Key	Action
`Space`	Start / stop recording
`Enter`	Commit accumulated speech raw (with vocab corrections, no LLM)
`r`	Commit with LLM post-processing
`e`	Enter edit mode on focused panel
`E`	Open focused panel in `$EDITOR` (neovim, vim, etc.)
`Escape`	Cancel edit (in edit mode) / Discard accumulated text (normal mode)
`Ctrl+S`	Save edit and exit edit mode
`q`	Quit with confirmation (saves uncommitted text raw to file)

Normal mode: All keys above trigger their actions. Tab or click to focus a panel (accent border appears). The left panel header shows "Paused" or "● Listening" so you always know whether audio is being captured. The right panel displays rendered markdown; it switches to a TextArea editor when you press e.

Edit mode: Press e on a focused panel to start editing. Space, Enter, q, and e all type normally into the editor. Press Ctrl+S to save or Escape to cancel and revert changes. Left panel edits update the in-memory accumulator; right panel edits write to the notes file.
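
The modal behaviour can be pictured as a small state machine. The sketch below is not the Textual implementation; `Panel` and `PanelState` are hypothetical names, and returning to the selected state after saving or cancelling is an assumption.

```python
from enum import Enum

class PanelState(Enum):
    DEFAULT = "default"    # blue
    SELECTED = "selected"  # orange: focused, no cursor
    EDITING = "editing"    # red, "EDIT" label

class Panel:
    def __init__(self, text=""):
        self.state = PanelState.DEFAULT
        self.text = text    # committed content
        self._draft = text  # content being edited

    def focus(self):
        """Tab or click selects the panel."""
        if self.state is PanelState.DEFAULT:
            self.state = PanelState.SELECTED

    def press(self, key):
        if self.state is PanelState.EDITING:
            if key == "ctrl+s":    # save the draft
                self.text = self._draft
                self.state = PanelState.SELECTED
            elif key == "escape":  # cancel and revert
                self._draft = self.text
                self.state = PanelState.SELECTED
            else:                  # every other key types normally
                self._draft += key
        elif self.state is PanelState.SELECTED and key == "e":
            self._draft = self.text
            self.state = PanelState.EDITING

panel = Panel("hello")
panel.focus()
panel.press("e")
for key in (" ", "q"):  # would be actions in normal mode, plain text here
    panel.press(key)
panel.press("ctrl+s")
print(repr(panel.text))
```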

Press Space to record, speak naturally, press Space to stop. Repeat to accumulate multiple turns. Press Enter for a quick raw commit, or r when you want the LLM to clean up your speech into polished markdown. Press Escape to discard and start over.

Domain Vocabulary and Context Biasing

Qwen3-ASR has a native context biasing capability trained during supervised fine-tuning (SFT) — the model was explicitly taught to attend to vocabulary hints in its system prompt during decoding. This is the primary mechanism for improving transcription accuracy on domain-specific terms, and it works well out of the box.

voiss exposes this native mechanism and adds two further layers for cases where it isn't enough, giving three layers in total:

Prompt context biasing (native SFT) — domain terms are injected into the Qwen3-ASR system prompt. The model was trained to prefer these terms during decoding. This is the recommended starting point and usually sufficient.
Logit biasing (mechanical supplement) — optionally, --context-bias applies an additive logit bias to the subword tokens of context terms during decoding. This directly increases token probability regardless of acoustic evidence. Useful as a nudge for stubborn misrecognitions, but too high a value can cause hallucination. Use sparingly.
Post-ASR corrections — the audio.corrections section of config.json applies regex find-and-replace after transcription. This is a last resort for deterministic ASR failure patterns that neither prompt nor logit biasing can fix.
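
To make the second layer concrete, here is a toy sketch of additive logit biasing over a four-token vocabulary. It uses plain Python lists rather than MLX arrays, and `apply_logit_bias` is a hypothetical name, not the voiss function.

```python
def apply_logit_bias(logits, biased_token_ids, scale=5.0):
    """Return a copy of the logits with `scale` added to each biased token id."""
    biased = list(logits)
    for token_id in biased_token_ids:
        biased[token_id] += scale
    return biased

def argmax(values):
    """Index of the largest value (greedy decoding picks this token)."""
    return max(range(len(values)), key=values.__getitem__)

# Index 2 stands in for a subword token of a context term.
logits = [1.0, 4.0, 2.5, 0.5]
before = argmax(logits)
after = argmax(apply_logit_bias(logits, {2}, scale=5.0))
print(before, after)
```

In this toy case the bias flips the chosen token from index 1 to index 2 even though the unbiased score favoured index 1. That is why a large scale can produce biased terms that were never spoken.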

Via CLI

# Inline domain vocabulary — feeds native SFT context biasing
uv run voiss --context "Kubernetes, kubectl, etcd, CoreDNS"

# With supplementary logit biasing (default scale: 5.0)
uv run voiss --context "Kubernetes, kubectl, etcd" --context-bias 4.0

# From a file (one term per line, or freeform text)
uv run voiss --context-file ~/vocab/medical-terms.txt

Default Prompt File

Place a prompt.md file in ~/.config/voiss/ to set a default system prompt for the post-processing LLM without needing CLI flags or JSON escaping:

cat > ~/.config/voiss/prompt.md << 'EOF'
You are a medical scribe. Format the transcript as SOAP notes:

## Subjective
## Objective
## Assessment
## Plan
EOF

For backward compatibility, rewrite_prompt.md is also recognized if prompt.md does not exist.

The full priority chain for the system prompt (first match wins):

--system-prompt / --system-prompt-file (CLI flags)
prompt_file field in litellm_postprocess config section (a file path wins over the inline prompt when both are present)
prompt field in litellm_postprocess config section
~/.config/voiss/prompt.md (or rewrite_prompt.md fallback)
Built-in default prompt
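
The chain above can be sketched as a single resolver function. This is an illustration under stated assumptions: `resolve_system_prompt` and `BUILTIN_PROMPT` are hypothetical names, the placeholder default text is not voiss's real default, and the relative order of `--system-prompt` versus `--system-prompt-file` is a guess because the README does not state it.

```python
from pathlib import Path

BUILTIN_PROMPT = "Clean up the transcript."  # placeholder, not the real default

def resolve_system_prompt(cli_prompt=None, cli_prompt_file=None,
                          config=None, config_dir="~/.config/voiss"):
    """Return the first system prompt found, following the documented order."""
    config_dir = Path(config_dir).expanduser()
    post = (config or {}).get("litellm_postprocess", {})

    # 1. CLI flags (inline checked before file: an assumption)
    if cli_prompt:
        return cli_prompt
    if cli_prompt_file:
        return Path(cli_prompt_file).expanduser().read_text()

    # 2. prompt_file in config; relative paths resolve against the config dir
    if post.get("prompt_file"):
        path = Path(post["prompt_file"]).expanduser()
        if not path.is_absolute():
            path = config_dir / path
        return path.read_text()

    # 3. inline prompt in config
    if post.get("prompt"):
        return post["prompt"]

    # 4. prompt.md, then the backward-compatible rewrite_prompt.md
    for name in ("prompt.md", "rewrite_prompt.md"):
        candidate = config_dir / name
        if candidate.is_file():
            return candidate.read_text()

    # 5. built-in default
    return BUILTIN_PROMPT

config = {"litellm_postprocess": {"prompt": "Format as markdown."}}
print(resolve_system_prompt(cli_prompt="Format as SOAP notes.", config=config))
```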

Via Configuration File

voiss loads configuration from ~/.config/voiss/config.json (override with --config-file or the VOISS_CONFIG_DIR env var). All sections are optional. A sample config for clinical notes is provided in examples/config.json.

{
  "audio": {
    "asr": {
      "context": [
        "acetaminophen", "ibuprofen", "metformin", "lisinopril",
        "hypertension", "tachycardia", "echocardiogram",
        "SOAP", "CBC", "MRI", "EKG"
      ],
      "logit_bias": {
        "terms": ["acetaminophen", "echocardiogram", "metformin"],
        "scale": 5.0
      }
    },
    "corrections": {
      "echo cardiogram": "echocardiogram",
      "die a beat ease": "diabetes"
    }
  },
  "litellm_postprocess": {
    "model": "ollama/llama3.2",
    "prompt": "Clean up the transcript. Format as markdown.",
    "max_tokens": 2048,
    "flags": { "think": false }
  }
}

Section	Purpose
`audio.asr.context`	Terms injected into the ASR system prompt for native SFT context biasing. This is the primary accuracy lever.
`audio.asr.logit_bias.terms`	Subset of terms whose subword tokens get additive logit bias during decoding. Reserve for the hardest words.
`audio.asr.logit_bias.scale`	Logit bias strength (default 5.0). Overridden by `--context-bias` on the CLI.
`audio.corrections`	Post-ASR regex find-and-replace (case-insensitive). Catches deterministic misrecognitions. Applied during both raw and LLM commits.
`litellm_postprocess.model`	Default LLM model for post-processing. Overridden by `--rewrite-model`.
`litellm_postprocess.prompt`	Inline system prompt. Overridden by `prompt_file` and by `--system-prompt` / `--system-prompt-file`; takes precedence over `prompt.md`.
`litellm_postprocess.prompt_file`	Path to a prompt file (relative to config dir or absolute). Takes precedence over inline `prompt`.
`litellm_postprocess.max_tokens`	Max tokens for the LLM response (default 2048).
`litellm_postprocess.flags`	Extra keyword arguments passed to `litellm.completion()` (e.g., `{"think": false}`).
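
A minimal sketch of how case-insensitive regex corrections such as `audio.corrections` can be applied to a transcript. `apply_corrections` is a hypothetical name; whether voiss adds word boundaries or orders the patterns differently is not stated in this README.

```python
import re

def apply_corrections(text, corrections):
    """Apply each pattern -> replacement pair in order, ignoring case."""
    for pattern, replacement in corrections.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

corrections = {
    "echo cardiogram": "echocardiogram",
    "die a beat ease": "diabetes",
}
fixed = apply_corrections(
    "Ordered an Echo Cardiogram for die a beat ease follow-up.", corrections
)
print(fixed)
```

Because the keys are treated as regular expressions, characters such as `.` or `(` in a key need escaping if they are meant literally.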

Config context terms and CLI --context terms are merged (config first, CLI appended).
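
The merge order can be sketched as below. `merge_context` is a hypothetical helper; it assumes the CLI value is a comma-separated string and does not deduplicate, since the README does not say whether voiss does.

```python
def merge_context(config_terms, cli_context):
    """Config terms first, then terms parsed from the CLI --context string."""
    cli_terms = [t.strip() for t in (cli_context or "").split(",") if t.strip()]
    return list(config_terms or []) + cli_terms

merged = merge_context(["SOAP", "CBC"], "Kubernetes, kubectl")
print(merged)
```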

Context biasing and --config-file work in both transcription and notes modes.

Recommended Models

ASR (MLX Qwen3-ASR)

Model	Notes
`mlx-community/Qwen3-ASR-0.6B-4bit`	Fastest, lowest quality
`mlx-community/Qwen3-ASR-0.6B-8bit`	Good balance (default)
`mlx-community/Qwen3-ASR-0.6B-bf16`	Higher quality, more RAM
`mlx-community/Qwen3-ASR-1.7B-8bit`	Higher quality, slower

LLM for Notes Post-Processing (via litellm)

Model	Backend	Notes
`ollama/llama3.2`	Ollama	Good local default
`ollama/mistral`	Ollama	Strong instruction following
`openai/gpt-4o-mini`	OpenAI	Cloud, fast and cheap

Any model supported by litellm works.

UI and Piping

The Rich live UI renders on stderr; stdout carries clean transcript lines.
When stdout is not a TTY, the UI is automatically suppressed.
Notes mode uses a full-screen Textual TUI (separate from the Rich live UI).
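
The stderr/stdout split follows a common CLI pattern, sketched here with the standard library only. `should_show_live_ui` and `emit_turn` are hypothetical names, not voiss functions.

```python
import sys

def should_show_live_ui(no_ui=False, stdout=None):
    """Show the live UI only when it was not disabled and stdout is a terminal."""
    stdout = sys.stdout if stdout is None else stdout
    return not no_ui and stdout.isatty()

def emit_turn(text, status, show_ui, out=None, err=None):
    """Write the status line to stderr (UI) and the transcript to stdout (data)."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    if show_ui:
        print(status, file=err)
    print(text, file=out, flush=True)

emit_turn("schedule the review for Friday", "turn complete", should_show_live_ui())
```

Keeping UI output off stdout is what lets a downstream tool such as grep see only transcript lines.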

# Pipe raw transcripts to another tool
uv run voiss | grep "important"

# Watch notes being written in real-time
uv run voiss notes --rewrite-model ollama/llama3.2 &
tail -f ~/.local/share/voiss/notes/*.md

Troubleshooting

Too many short turns: increase --vad-silence-ms or lower --vad-mode
Background noise triggering VAD: increase --energy-threshold (default 300, try 500-800)
Quiet speech being dropped: lower --energy-threshold (try 100-200)
No audio: check mic permissions or try --list-devices + --device
Buffer fills up too fast: increase --max-buffer (default 30s) to allow longer speech before committing
Laggy output: reduce --transcribe-interval
Notes post-processing failures: check that your LLM backend is running (e.g., ollama serve). Raw transcripts are saved on failure.
Domain words misrecognized: use --context or --context-file with your vocabulary. If prompt biasing alone isn't enough, raise the logit bias (--context-bias) to increase token probabilities directly. Try values between 3.0 and 8.0. Values above 10 may cause hallucination of biased terms.
Bias causing wrong words: lower --context-bias (try 2.0-3.0) or remove overly broad terms from config

Logging

Set LOG_LEVEL=DEBUG for verbose logs.
Hugging Face and httpx request logs are suppressed by default.
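
A setup of this kind can be sketched with the standard `logging` module. This is not voiss's code: `configure_logging` is a hypothetical name, the `INFO` default is an assumption, and `httpx` and `huggingface_hub` are assumed to be the logger names being quieted.

```python
import logging
import os

def configure_logging(env=None):
    """Set the root level from LOG_LEVEL and quiet chatty third-party loggers."""
    env = os.environ if env is None else env
    level = getattr(logging, env.get("LOG_LEVEL", "INFO").upper(), None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back on unknown level names
    logging.basicConfig(level=level, force=True)
    # Keep request-level noise out of the output even at DEBUG.
    for noisy in ("httpx", "huggingface_hub"):
        logging.getLogger(noisy).setLevel(logging.WARNING)
    return level

configure_logging({"LOG_LEVEL": "DEBUG"})
logging.getLogger("demo").debug("verbose logging enabled")
```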

Acknowledgements

This project is built upon dictate.sh by Marc Puig, which provided the original implementation of real-time speech transcription using MLX on Apple Silicon. The foundational architecture — audio capture pipeline, VAD-based turn detection, streaming ASR with Qwen3, and the Rich terminal UI — originates from that work.

License

MIT. See LICENSE.
