SpeechBridge

An AI-powered assistive communication system that improves the clarity of impaired speech in real time — built for people with stuttering, dysarthria, and speech motor impairments.

Traditional speech-to-text systems perform poorly on distorted or irregular speech patterns. SpeechBridge is a multi-layer pipeline that goes beyond a simple ASR wrapper to handle the unique challenges of impaired speech.


Problem Statement

People with speech disorders often struggle to communicate clearly. Conditions like stuttering, dysarthria, and speech motor impairments produce speech patterns that break standard ASR systems — leading to misrecognition, frustration, and reduced independence.

SpeechBridge addresses this by combining local ASR, phoneme-level correction, and adaptive personal memory into a single assistive pipeline.


System Architecture

Audio Input
    ↓
ASR — Whisper (local inference, beam search, confidence scores)
    ↓
Stutter Detection & Removal
    ↓
Dysarthria Correction (Phoneme + Confidence Based)
    ↓
Personal Adaptive Memory
    ↓
Final Cleaned Output

Core Components

1. ASR Layer — Speech to Text

  • Uses OpenAI Whisper running fully locally (no API calls)
  • Tuned with beam search, best_of sampling, and temperature control
  • Audio is preprocessed via ffmpeg: mono conversion, 16kHz resampling, error-tolerant decoding
  • Per-word confidence scores are extracted from Whisper's output (see the sketch below)
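
A minimal sketch of this layer using the openai-whisper package. The ffmpeg flags and decoding values shown here are illustrative defaults consistent with the description above, not settings confirmed from the repo:

import subprocess
import whisper  # pip install openai-whisper

def preprocess(src: str, dst: str = "clean.wav") -> str:
    # Mono, 16 kHz WAV; -err_detect ignore_err tolerates damaged input.
    subprocess.run(
        ["ffmpeg", "-y", "-err_detect", "ignore_err", "-i", src,
         "-ac", "1", "-ar", "16000", dst],
        check=True,
    )
    return dst

model = whisper.load_model("base")  # downloaded once, then fully local

def transcribe_with_confidence(audio_path: str) -> list[dict]:
    result = model.transcribe(
        preprocess(audio_path),
        beam_size=5,           # beam search decoding
        best_of=5,             # used only by higher-temperature fallback passes
        temperature=0.0,       # deterministic first pass
        word_timestamps=True,  # per-word timing and probability
    )
    # Flatten segments into word tokens; "probability" is the confidence score.
    return [
        {"word": w["word"].strip(), "start": w["start"],
         "end": w["end"], "confidence": w["probability"]}
        for seg in result["segments"]
        for w in seg["words"]
    ]

With the temperature pinned at 0.0, Whisper decodes with beam search; best_of only takes effect if higher-temperature fallback sampling is enabled.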

2. Stutter Removal

Detects and removes (see the sketch after this list):

  • Rapid word repetitions (timestamp-based gap detection)
  • Elongated characters (e.g. "sooo" → "so")
  • Short-gap duplications
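
A minimal sketch of this pass over the word tokens produced by the ASR layer. The 0.3 s repetition gap is an illustrative threshold, not a value taken from the source:

import re

def remove_stutters(tokens, max_gap=0.3):
    # tokens: time-ordered dicts with "word", "start", "end", "confidence".
    cleaned = []
    for tok in tokens:
        # Collapse runs of three or more repeated letters: "sooo" -> "so".
        word = re.sub(r"(.)\1{2,}", r"\1", tok["word"])
        prev = cleaned[-1] if cleaned else None
        # Drop rapid repetitions: the same word restarted within a short gap.
        if (prev is not None
                and word.lower() == prev["word"].lower()
                and tok["start"] - prev["end"] <= max_gap):
            prev["end"] = tok["end"]  # absorb the duplicate's timing
            continue
        cleaned.append({**tok, "word": word})
    return cleaned

Working from timestamps rather than raw text lets deliberate repetitions ("very, very good") survive when they are separated by a normal pause.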

3. Dysarthria Correction Engine

For each low-confidence word (confidence < 0.6), the engine runs these steps (sketched below):

  1. Look up phoneme sequence via the CMU Pronouncing Dictionary
  2. Compute Levenshtein distance between phoneme strings
  3. Find the closest match in the top 50,000 most frequent English words (wordfreq)
  4. Replace only if edit distance ≤ 2 (conservative — avoids over-correction)
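
A minimal sketch of steps 1–4 using the pronouncing and wordfreq packages named in the tech stack. Two liberties for brevity: the edit distance is a small DP over phoneme lists (the project lists python-Levenshtein), and the candidate vocabulary is trimmed to 5,000 words so the precomputation stays fast:

import pronouncing               # CMU Pronouncing Dictionary lookups
from wordfreq import top_n_list  # frequency-ranked English vocabulary

def phones(word):
    # First CMU pronunciation as a list of ARPABET phonemes, or None.
    entries = pronouncing.phones_for_word(word.lower())
    return entries[0].split() if entries else None

def edit_distance(a, b):
    # Standard Levenshtein DP over two phoneme sequences.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

# Precompute phonemes for candidates (5k here; the project uses the top 50k).
CANDIDATES = [(w, phones(w)) for w in top_n_list("en", 5000)]
CANDIDATES = [(w, p) for w, p in CANDIDATES if p]

def correct_word(word, threshold=2):
    source = phones(word)
    if source is None:
        return word  # not in the CMU dict: leave untouched
    best, best_dist = word, threshold + 1
    for cand, cand_phones in CANDIDATES:
        d = edit_distance(source, cand_phones)
        if d < best_dist:
            best, best_dist = cand, d
    # Replace only on a close phonetic match; otherwise keep the original.
    return best if best_dist <= threshold else word

For instance, "wader" (W EY1 D ER0) sits two substitutions from "water" (W AO1 T ER0), so correct_word("wader") returns "water", provided "wader" itself is not in the candidate list, where it would match itself at distance 0.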

4. Personal Adaptive Memory

  • Corrections are stored per user session
  • Once a word is corrected (e.g. wader → water), future occurrences are fixed instantly
  • Makes the system more accurate over time for each individual speaker (sketched below)
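
A minimal sketch of such a session memory. The source does not specify the storage mechanism, so a plain in-memory dict stands in here:

class AdaptiveMemory:
    """Per-session store of accepted corrections, e.g. {"wader": "water"}."""

    def __init__(self):
        self._corrections = {}

    def remember(self, heard: str, corrected: str):
        # Record a correction once it has been applied or confirmed.
        if heard.lower() != corrected.lower():
            self._corrections[heard.lower()] = corrected

    def apply(self, word: str) -> str:
        # A known correction short-circuits the phoneme search entirely.
        return self._corrections.get(word.lower(), word)

memory = AdaptiveMemory()
memory.remember("wader", "water")
print(memory.apply("wader"))  # "water", instantly, on every later occurrence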

Observed Challenge

When testing with dysarthric speech:

Stage               Output
Input speech        "I want some water"
Whisper raw output  "I one big bit of water"
After correction    "I want some water"

Key insight: Base ASR misrecognition under dysarthric speech cannot be fixed by an LLM alone — acoustic decoding must be improved first.


Improvements Implemented

Improvement                  Detail
Beam search decoding         Improves transcription accuracy on distorted speech
Audio preprocessing          Mono, 16kHz, loudness normalization, error-tolerant ffmpeg flags
Confidence-aware correction  Only corrects words Whisper is uncertain about
Phoneme distance matching    Uses CMU dict + Levenshtein for phonetically-informed replacement
Adaptive user memory         Personalises corrections over time

Optional LLM Layer

An LLM is not used to blindly fix ASR errors. It is used selectively for:

  • Grammar smoothing
  • Semantic clarification
  • Emergency intent detection

Example:

Stage            Text
ASR output       "I need doctor breathing problem"
LLM enhancement  "I need a doctor. I am having trouble breathing."
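
A sketch of this selective gating; call_llm is a hypothetical stand-in for whichever model the deployment wires in, and the emergency keyword list is purely illustrative:

EMERGENCY_KEYWORDS = {"doctor", "help", "ambulance", "breathing", "pain"}

def enhance(text: str, call_llm) -> dict:
    # call_llm: hypothetical callable (prompt: str) -> str supplied by the app.
    urgent = bool(EMERGENCY_KEYWORDS & set(text.lower().split()))
    polished = call_llm(
        "Rewrite this transcript into clear, grammatical English "
        "without adding new information:\n" + text
    )
    return {"text": polished, "emergency": urgent}

# enhance("I need doctor breathing problem", call_llm=some_model) could yield
# the table's example text, with "emergency" flagged True.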

Tech Stack

Layer             Technology
Frontend          React, Vite, Tailwind CSS
Backend           FastAPI, Python
ASR               OpenAI Whisper (base model, local inference)
Phoneme matching  pronouncing (CMU dict), python-Levenshtein
Vocabulary        wordfreq (top 50k English words)
Audio processing  ffmpeg (via MediaRecorder API → WebM/Opus → WAV)

Project Structure

├── backend/
│   ├── main.py          # FastAPI server, Whisper transcription, stutter cleaning
│   ├── dysarthria.py    # Dysarthria correction — phoneme matching + adaptive memory
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.jsx              # Recording logic, MIME detection, API call
│   │   ├── components/
│   │   │   ├── RecordButton.jsx
│   │   │   ├── WaveformVisualizer.jsx
│   │   │   ├── OutputPanels.jsx
│   │   │   ├── StatsRow.jsx
│   │   │   └── ActionButtons.jsx
│   │   └── pages/
│   │       └── HistoryPage.jsx
│   └── package.json
└── README.md

Setup & Running

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • ffmpeg on PATH

Backend

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

The Whisper base model (~139MB) downloads automatically on first run.

Frontend

cd frontend
npm install
npm run dev

Open http://localhost:5173.


API

POST /transcribe

Accepts a multipart/form-data audio file. Returns:

{
  "raw_text": "I I w-want to to go",
  "after_stutter": "I want to go",
  "final_text": "I want to go",
  "corrections_applied": [{ "from": "wader", "to": "water" }],
  "tokens": [{ "word": "I", "start": 0.0, "end": 0.3, "confidence": 0.98 }]
}
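
For example, from Python (the multipart field name "file" is an assumption; check the FastAPI handler in main.py for the actual parameter name):

import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("sample.wav", f, "audio/wav")},  # field name assumed
    )
resp.raise_for_status()
print(resp.json()["final_text"])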

Human Impact

SpeechBridge enables:

  • Clear communication for speech-impaired individuals
  • Reduced frustration in social and professional interactions
  • Assistive technology for caregivers and clinicians
  • A foundation for rehabilitation progress tracking

Planned additions: speech clarity scoring over time, error-type classification, WebSocket-based live streaming transcription, emergency intent detection, and speech therapy care.
