Real-time visual speech recognition — lip-read from your webcam and type what you silently mouth. Fully local, no internet required.
Built for HACKHIVE 2k26.
SilentSpeak combines a state-of-the-art Visual Speech Recognition (VSR) model with an LLM correction layer to let you communicate silently using only your lips. A cinematic landing page showcases the product experience.
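The correction layer is the glue between the two models: the VSR model emits a noisy raw transcript, and a local LLM cleans it up. As an illustration of the idea only (the function name and prompt wording are assumptions, not the project's actual code), a minimal sketch using the `ollama` Python client and the `qwen3:4b` model pulled during setup:

```python
import ollama

def correct_transcript(raw_vsr_output: str) -> str:
    """Ask a local LLM to clean up noisy VSR output (hypothetical prompt)."""
    response = ollama.chat(
        model="qwen3:4b",
        messages=[{
            "role": "user",
            "content": (
                "Fix likely lip-reading errors in this transcript, changing "
                "as little as possible. Reply with the corrected text only:\n"
                + raw_vsr_output
            ),
        }],
    )
    return response["message"]["content"].strip()

# correct_transcript("I WANT TO BYE SOME FLOWERS")
# might return: "I want to buy some flowers"
```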
```
silent-speech/
├── frontend/       # Next.js landing page (Three.js + GSAP)
└── slient-speech/  # Python VSR engine (PyTorch + ESPnet + Mediapipe)
```
A cinematic, dark-glass landing page built with Next.js 16, Three.js, and GSAP.
| Technology | Version |
|---|---|
| Next.js | 16.2.4 |
| React | 19.2.4 |
| TypeScript | 5 |
| Tailwind CSS | 4 |
| Three.js | 0.184.0 |
| GSAP | 3.15.0 |
```bash
cd frontend
npm install
npm run dev
```

Open http://localhost:3000.
The frontend is configured for one-click deployment on Vercel. Set the root directory to frontend/ in your Vercel project settings.
Real-time lip-reading pipeline running entirely on your machine.
```bash
cd slient-speech

# Download model weights
./setup.sh

# Pull LLM correction model
ollama pull qwen3:4b

uv run \
  --extra-index-url https://download.pytorch.org/whl/cu121 \
  --with-requirements requirements.txt \
  --python 3.11 \
  main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe
```

| Action | Key |
|---|---|
| Start / stop recording | Alt (Windows/Linux) · Option (Mac) |
| Quit | q (with camera window focused) |
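For reference, a push-to-talk toggle like this is straightforward to build with `pynput` (a sketch of the pattern only, not the project's actual implementation; the listener and flag names are assumptions):

```python
from pynput import keyboard

recording = False  # toggled by the hotkey, polled by the capture loop

def on_press(key):
    global recording
    # pynput maps both Alt (Windows/Linux) and Option (macOS) to Key.alt
    if key in (keyboard.Key.alt, keyboard.Key.alt_l, keyboard.Key.alt_r):
        recording = not recording
        print("recording" if recording else "stopped")

listener = keyboard.Listener(on_press=on_press)
listener.start()  # runs in a background thread alongside the camera loop
```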
Four targeted upgrades derived from integrating auto-AVSR (Meta's state-of-the-art AVSR framework) with the existing pipeline.
Status: Implemented
The default beam size is 40, which gives the best accuracy but adds ~300–500 ms of latency per utterance. For real-time use, beam size 8 strikes a good speed/quality balance, and the LLM correction layer compensates for the small accuracy drop.
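The beam width lives in the decoding section of the config. A hypothetical excerpt of what the fast preset changes (section and key names are assumed here; check the shipped `.ini` files for the exact schema):

```ini
[decode]
; fast preset: beam_size=8 (the default config ships with 40)
beam_size=8
```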
A fast preset is available:
```bash
uv run ... main.py config_filename=./configs/LRS3_V_WER19.1_fast.ini detector=mediapipe
```

Or override on the fly:

```bash
uv run ... main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe beam_size=8
```

Status: Config ready — model download required
Auto-AVSR's conformer model (vsr_trlrs2lrs3vox2avsp_base.pth) was trained on 3,291 hours of LRS2 + LRS3 + VoxCeleb2 + AVSpeech, giving 20.3% WER. The slient-speech pipeline supports loading it directly.
Download and register the model:
```bash
cd slient-speech
./setup_autoavsr.sh
```

Then run with the conformer config:

```bash
uv run ... main.py config_filename=./configs/LRS3_V_WER20.3_conformer.ini detector=mediapipe
```

Status: Implemented
When the user is whispering rather than fully silent, fusing a microphone channel alongside lip video dramatically reduces WER. The pipeline now captures microphone audio during every recording and muxes it into the temp video file.
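As a sketch of that capture-and-mux step (the library choices, sample rate, and file names here are assumptions, not necessarily what the pipeline uses):

```python
import subprocess
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000  # mono 16 kHz is typical for speech models

def record_audio(duration_s: float, wav_path: str = "mic.wav") -> str:
    """Capture microphone audio for the length of the video recording."""
    audio = sd.rec(int(duration_s * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()  # block until the buffer is full
    sf.write(wav_path, audio, SAMPLE_RATE)
    return wav_path

def mux(video_path: str, wav_path: str, out_path: str = "temp_av.mp4") -> str:
    """Mux the audio track into the silent temp video without re-encoding video."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-i", wav_path,
         "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
        check=True,
    )
    return out_path
```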
Enable AV mode by using an audiovisual model config:
```bash
uv run ... main.py config_filename=./configs/LRS3_AV.ini detector=mediapipe
```

The audiovisual modality is already supported by the AVSR and AVSRDataLoader classes — no architecture changes needed.
Status: Implemented via MediaPipe detector
The existing MediaPipe pipeline already mirrors auto-AVSR's preprocessing: face detection → landmark alignment → 88×88 grayscale mouth ROI crop. Short-range detection (for typical webcam distances of 50–80 cm) is tried first with fallback to full-range, saving ~20 ms per frame.
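The short-range-first strategy looks roughly like this with the MediaPipe solutions API (a sketch; the project's actual detector wrapper may be structured differently):

```python
import cv2
import mediapipe as mp

mp_fd = mp.solutions.face_detection
# model_selection=0: short-range (~2 m), 1: full-range (~5 m)
short_range = mp_fd.FaceDetection(model_selection=0, min_detection_confidence=0.5)
full_range = mp_fd.FaceDetection(model_selection=1, min_detection_confidence=0.5)

def detect_face(frame_bgr):
    """Try the cheaper short-range model first; fall back to full-range."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    results = short_range.process(rgb)
    if not results.detections:
        results = full_range.process(rgb)
    return results.detections  # None if no face was found by either model
```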
Sanskaar — HACKHIVE 2k26