VoiceVault

Voice-First RAG Knowledge Agent

Speak to your documents. Get cited answers back.




Overview

VoiceVault is a production-grade, voice-first Retrieval-Augmented Generation (RAG) system built entirely from scratch. It enables users to record or type questions and receive answers grounded in their own private document collections — with inline citations pointing back to the exact source, page, and paragraph.

The project was built in 6 phases over several weeks. It ships with a full test suite (328 tests), concrete security controls (bcrypt password hashing, parameterized SQL, SHA-256-hashed queries in the audit log, SSRF prevention), and a Docker deployment to Hugging Face Spaces.

What makes this different from typical RAG demos:

  • Hybrid retrieval — BM25 keyword search + semantic vector search, fused with Reciprocal Rank Fusion (RRF) + cross-encoder reranking. Most tutorials use only one retrieval method.
  • Voice-native pipeline — Groq Whisper API for ~300ms cloud transcription with local Whisper fallback; Web Speech API for TTS output.
  • Faithfulness guard — Detects when the LLM cannot answer from retrieved context and returns a grounded refusal instead of hallucinating.
  • Multi-KB support — Multiple independent knowledge bases, each optionally password-protected.
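
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. It is illustrative only and not taken from `hybrid_retriever.py`: the function name `rrf_fuse`, the chunk IDs, and the tie-breaking rule are assumptions. The constant k=60 matches the value shown in the architecture diagram.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over every list it appears in,
    with rank starting at 1. Hypothetical helper, not the project's API.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Made-up result lists from the two retrievers.
bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c9", "c3"]
fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the property that makes RRF useful before the more expensive cross-encoder pass.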

Screenshots

Ask VoiceVault — Voice Query Interface

Record your question via microphone or type it. The mic button pulses when recording.

[Screenshot: Ask VoiceVault main voice query interface with dark glassmorphism UI]

Knowledge Base Management

Create named knowledge bases, upload documents (PDF, DOCX, HTML, MD, TXT), and manage them.

[Screenshot: Knowledge Bases panel, empty state with New Knowledge Base button]

Analytics Dashboard

Real-time query statistics: total queries, average latency, citation counts, and daily breakdowns.

[Screenshot: Analytics dashboard showing query statistics]

Full App in Action

A populated knowledge base (358 chunks from 1 document) and a live conversation with the RAG pipeline.

[Screenshot: Full VoiceVault app with a knowledge base and active conversation]

Architecture

INGESTION PATH (one-time per document set)
──────────────────────────────────────────────────────
  User uploads PDF / HTML / DOCX / MD / TXT
      │
      ▼
  DocumentParser         →  text + metadata per page
      │                     (PyMuPDF, BS4, python-docx)
      ▼
  SemanticChunker        →  sentence-aware chunks
      │                     (spaCy sentences + cosine boundary)
      ▼
  IndexBuilder           →  ChromaDB (vector) + BM25 (keyword)
                             + SQLite (metadata)

QUERY PATH (real-time, per question)
──────────────────────────────────────────────────────
  Browser mic → WAV → POST /api/transcribe
      │
      ▼
  GroqTranscriber        →  Groq Whisper API (~300ms)
      │                     [fallback: local Whisper CPU]
      ▼
  QueryPreprocessor      →  filler removal, intent classification
      │                     (factual / summary / compare)
      ▼
  HybridRetriever        →  BM25 top-20 + Vector top-20
      │                     → RRF merge (k=60)
      │                     → CrossEncoder rerank (ms-marco-MiniLM-L12-v2)
      │                     → diversity filter (max 2 chunks/page)
      ▼
  ContextBuilder         →  formatted context with [Source:N] markers
      ▼
  LangChain LCEL         →  Groq Llama-3.1-70B (primary)
      │                     [fallback: Gemini 1.5 Flash]
      ▼
  FaithfulnessGuard      →  refusal detection, confidence scoring
      │
  CitationInjector       →  resolve [Source:N] → filename + page
      ▼
  JSON response          →  answer + citations + confidence + tts_text
      │
      ▼
  SPA Frontend           →  chat display + Web Speech API TTS
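
The diversity filter in the query path (max 2 chunks/page) can be sketched as a single pass over the reranked list. This is an assumption-laden illustration rather than the project's code: the function name `diversity_filter`, the dictionary keys `source` and `page`, and the choice to stop at `top_k=5` (mirroring the `FINAL_TOP_K` default) are all hypothetical.

```python
from collections import Counter


def diversity_filter(ranked_chunks, max_per_page=2, top_k=5):
    """Keep the best-ranked chunks, allowing at most max_per_page per page.

    ranked_chunks must already be sorted best-first (e.g. by reranker score).
    """
    kept = []
    per_page = Counter()
    for chunk in ranked_chunks:
        page_key = (chunk["source"], chunk["page"])
        if per_page[page_key] >= max_per_page:
            continue  # this page already contributed its quota
        per_page[page_key] += 1
        kept.append(chunk)
        if len(kept) == top_k:
            break
    return kept


# Made-up reranked results: three chunks from page 3, one from page 4.
ranked = [
    {"id": "a", "source": "report.pdf", "page": 3},
    {"id": "b", "source": "report.pdf", "page": 3},
    {"id": "c", "source": "report.pdf", "page": 3},
    {"id": "d", "source": "report.pdf", "page": 4},
]
selected = diversity_filter(ranked)
print([chunk["id"] for chunk in selected])
```

Capping per-page contributions keeps one dense page from crowding out other relevant passages in the context handed to the LLM.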

Features

| Feature | Detail |
| --- | --- |
| Voice Input | Browser microphone → WAV conversion → Groq Whisper API (~300ms) |
| Hybrid Retrieval | BM25 + semantic vector search, RRF fusion, cross-encoder reranking |
| Multi-KB | Create multiple independent knowledge bases per session |
| KB Access Control | Optional bcrypt password protection (work factor 12) per KB |
| Document Formats | PDF, DOCX, HTML, Markdown, TXT (OCR fallback for scanned PDFs) |
| Source Citations | Every answer traceable to source file + page number |
| Faithfulness Guard | Detects hallucinations; returns grounded refusal when context is insufficient |
| Conversation Memory | Rolling 5-turn conversation window passed to the LLM |
| LLM Fallback | Groq Llama-3.1-70B → Gemini 1.5 Flash automatic fallback |
| TTS Output | Web Speech API reads answer aloud with citation markers stripped |
| Analytics | SQLite audit log: query counts, latency, citation rates (7-day window) |
| Privacy | Raw queries never stored; SHA-256 hash only in audit log |
| 328 Tests | Integration + unit tests across all 6 phases |
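
The citation and TTS features both revolve around the `[Source:N]` markers in the generated answer. The sketch below shows one plausible way to resolve markers to a filename and page and to strip them before speech output. It is not the code in `citation_injector.py` or `web_speech.py`: the function names, the shape of the `sources` mapping, and the sample answer are assumptions made for illustration.

```python
import re

SOURCE_MARKER = re.compile(r"\[Source:(\d+)\]")


def resolve_citations(answer, sources):
    """Return cited sources in order of first appearance in the answer.

    sources maps marker number -> (filename, page); unknown markers are skipped.
    """
    citations = []
    seen = set()
    for match in SOURCE_MARKER.finditer(answer):
        n = int(match.group(1))
        if n in seen or n not in sources:
            continue
        seen.add(n)
        filename, page = sources[n]
        citations.append({"marker": n, "filename": filename, "page": page})
    return citations


def to_tts_text(answer):
    """Remove markers and tidy spacing so they are not read aloud."""
    text = SOURCE_MARKER.sub("", answer)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s{2,}", " ", text).strip()


# Made-up answer and source table.
answer = "Revenue grew 12% [Source:1]. Costs were flat [Source:2] [Source:1]."
sources = {1: ("report.pdf", 4), 2: ("notes.md", 1)}
citations = resolve_citations(answer, sources)
tts_text = to_tts_text(answer)
print(citations)
print(tts_text)
```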

Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| API | FastAPI + uvicorn | REST backend with async endpoints |
| Frontend | HTML5 / CSS3 / Vanilla JS | Premium dark SPA (no framework) |
| ASR | Groq Whisper API | Cloud transcription (~300ms) |
| ASR Fallback | OpenAI Whisper Large-v3 | Local CPU transcription |
| Embeddings | sentence-transformers all-MiniLM-L6-v2 | Dense vector representations |
| Reranking | cross-encoder/ms-marco-MiniLM-L12-v2 | Semantic relevance scoring |
| Vector Store | ChromaDB | In-process vector database |
| Keyword Search | rank-bm25 (BM25Okapi) | Lexical keyword matching |
| Chunking | spaCy en_core_web_sm | Sentence boundary detection |
| LLM (primary) | Groq Llama-3.1-70B | Fast inference via Groq cloud |
| LLM (fallback) | Gemini 1.5 Flash | Google generative AI fallback |
| Orchestration | LangChain LCEL | LLM pipeline composition |
| Metadata | SQLite | KB registry, doc index, audit log |
| Security | bcrypt (work factor 12) | KB password hashing |
| Config | Pydantic-settings | Centralized, type-safe config |
| Deployment | Docker on Hugging Face Spaces | Container-based cloud hosting |

Project Structure

Project-VoiceVault/
├── server.py                      # FastAPI entry point (run this)
├── app.py                         # Gradio entry point (legacy / tests)
├── config.py                      # Centralized Pydantic-settings config
├── requirements.txt               # All dependencies
├── Dockerfile                     # HF Spaces Docker deployment
├── .env.example                   # Environment variable template
│
├── api/                           # FastAPI REST API
│   ├── __init__.py
│   └── routes.py                  # All /api/* endpoints
│
├── static/                        # SPA frontend assets
│   ├── index.html                 # Single-page application shell
│   ├── style.css                  # Dark glassmorphism design system
│   └── app.js                     # Full SPA logic (recording, chat, KB CRUD)
│
├── voicevault/                    # Core package
│   ├── models.py                  # Pydantic data models
│   ├── asr/
│   │   ├── groq_transcriber.py    # Groq Whisper cloud ASR (~300ms)
│   │   ├── whisper_transcriber.py # Local Whisper CPU/GPU fallback
│   │   └── query_preprocessor.py  # Filler removal, intent classification
│   ├── ingestion/
│   │   ├── document_parser.py     # PDF/HTML/DOCX/MD/TXT → structured text
│   │   ├── semantic_chunker.py    # Sentence-aware chunking with topic boundaries
│   │   └── index_builder.py       # ChromaDB + BM25 + SQLite orchestration
│   ├── retrieval/
│   │   ├── hybrid_retriever.py    # BM25 + vector + RRF + cross-encoder
│   │   ├── bm25_retriever.py      # BM25Okapi keyword search
│   │   ├── vector_retriever.py    # ChromaDB semantic search
│   │   └── context_builder.py     # Context formatting + citation markers
│   ├── generation/
│   │   ├── answer_chain.py        # LangChain LCEL + Groq + Gemini fallback
│   │   ├── faithfulness_guard.py  # Hallucination detection + refusal
│   │   └── citation_injector.py   # [Source:N] → filename + page resolution
│   ├── kb/
│   │   └── kb_manager.py          # KB lifecycle, bcrypt auth, validation
│   ├── storage/
│   │   ├── sqlite_store.py        # Schema, CRUD, audit log queries
│   │   └── chroma_store.py        # ChromaDB wrapper
│   └── tts/
│       └── web_speech.py          # TTS text preparation
│
├── ui/                            # Gradio UI components (legacy / app.py)
│   ├── tabs/
│   │   ├── ask_tab.py
│   │   ├── kb_tab.py
│   │   ├── analytics_tab.py
│   │   └── settings_tab.py
│   └── components/
│       ├── citation_panel.py
│       └── audio_controls.py
│
├── tests/                         # Full test suite — 328 tests
│   ├── conftest.py
│   ├── test_api_routes.py         # Integration tests (FastAPI + real methods)
│   ├── test_phase0.py             # Foundation tests
│   ├── test_phase1.py             # Ingestion tests
│   ├── test_phase2.py             # Retrieval tests
│   ├── test_phase3.py             # ASR tests
│   ├── test_phase4.py             # Generation tests
│   └── test_phase5.py             # UI / access control tests
│
├── DOCS/                          # Detailed phase documentation
│   ├── phase0_foundation.md
│   ├── phase1_ingestion.md
│   ├── phase2_retrieval.md
│   ├── phase3_asr.md
│   ├── phase4_generation.md
│   ├── phase5_ui_access.md
│   └── phase6_deployment.md
│
└── Screenshots/
    ├── 1.png                      # Ask tab — voice query interface
    ├── 2.png                      # Knowledge Bases panel
    ├── 3.png                      # Analytics dashboard
    └── 4.png                      # Full app with KB and live conversation

Quick Start

Prerequisites

1. Clone and install

git clone https://github.com/ninjacode911/Project-VoiceVault.git
cd Project-VoiceVault
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install torch --index-url https://download.pytorch.org/whl/cpu   # CPU-only (saves ~1.8GB)
pip install -r requirements.txt
python -m spacy download en_core_web_sm

2. Configure secrets

cp .env.example .env
# Edit .env and add:
# GROQ_API_KEY=gsk_...
# GEMINI_API_KEY=...   (optional)

3. Run

python server.py
# Open http://localhost:7860

4. Use it

  1. Navigate to Knowledge Bases → click + New Knowledge Base
  2. Name it (lowercase, hyphens only, e.g. my-docs) and upload your PDFs/documents
  3. Go back to Ask VoiceVault → select your KB → record or type a question → click Ask

Running Tests

pytest tests/ -v
# Expected: 328 passed

The integration tests in tests/test_api_routes.py use a real KBManager backed by a temp SQLite DB and exercise the actual FastAPI routes and method signatures — not mocked pipelines. This is intentional: it catches runtime AttributeError bugs that pure-mock unit tests miss.


Deployment to Hugging Face Spaces

The project ships with a Dockerfile configured for HF Spaces. The Docker image:

  • Uses Python 3.11-slim base
  • Installs CPU-only PyTorch (~650MB vs 2.5GB GPU wheels)
  • Pre-downloads all-MiniLM-L6-v2 and cross-encoder/ms-marco-MiniLM-L12-v2 at build time (no cold-start model downloads)
  • Downloads en_core_web_sm spaCy model at build time
  • Binds to 0.0.0.0:7860 (HF Spaces default port)

To deploy your own copy:

  1. Create a Hugging Face Space with Docker SDK
  2. Push this repository to the Space's git remote
  3. Add GROQ_API_KEY (and optionally GEMINI_API_KEY) as Space secrets

See DOCS/phase6_deployment.md for the full deployment walkthrough.


Configuration

All configuration is environment-driven via .env. See .env.example for the full reference.

Key variables:

| Variable | Default | Description |
| --- | --- | --- |
| GROQ_API_KEY | (none) | Required. Groq API key for Whisper + Llama |
| GEMINI_API_KEY | (none) | Optional Gemini fallback key |
| HOST | 0.0.0.0 | Server bind address |
| PORT | 7860 | Server port |
| FINAL_TOP_K | 5 | Number of chunks passed to LLM |
| MAX_ANSWER_TOKENS | 500 | LLM max output tokens |
| CHUNK_SIZE_MAX | 600 | Max tokens per document chunk |
| BCRYPT_ROUNDS | 12 | bcrypt work factor for KB passwords |
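
As a rough picture of how these variables map to typed settings, here is a standard-library sketch. The project itself uses Pydantic-settings in `config.py`; the version below deliberately avoids that dependency, so the `Settings` dataclass and `load_settings` helper are hypothetical stand-ins that only mirror the variable names and defaults listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: str                     # required, no default
    gemini_api_key: Optional[str] = None  # optional fallback key
    host: str = "0.0.0.0"
    port: int = 7860
    final_top_k: int = 5
    max_answer_tokens: int = 500
    chunk_size_max: int = 600
    bcrypt_rounds: int = 12


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY", "").strip()
    if not api_key:
        raise ValueError("GROQ_API_KEY is required")
    return Settings(
        groq_api_key=api_key,
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),
        final_top_k=int(env.get("FINAL_TOP_K", "5")),
        max_answer_tokens=int(env.get("MAX_ANSWER_TOKENS", "500")),
        chunk_size_max=int(env.get("CHUNK_SIZE_MAX", "600")),
        bcrypt_rounds=int(env.get("BCRYPT_ROUNDS", "12")),
    )


# Example with a placeholder key and one overridden value.
settings = load_settings({"GROQ_API_KEY": "gsk_example", "PORT": "8000"})
print(settings.port, settings.final_top_k)
```

Failing fast on a missing `GROQ_API_KEY` surfaces a misconfigured `.env` at startup rather than on the first transcription request.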

Security

| Control | Implementation |
| --- | --- |
| No raw queries stored | Audit log stores SHA-256 hash only |
| KB access control | bcrypt-hashed passwords (work factor 12) |
| SQL injection prevention | 100% parameterized queries; no f-string SQL |
| Path traversal prevention | KB names validated as slugs (^[a-z0-9][a-z0-9\-]*[a-z0-9]$) |
| SSRF prevention | URL ingestion via trafilatura with no internal-network access |
| Upload whitelist | Only .pdf, .html, .docx, .md, .txt accepted |
| File size limit | 50MB max per upload |
| GPU isolation | CUDA_VISIBLE_DEVICES=-1 prevents CUDA crashes on incompatible hardware |
| No secrets in git | .env gitignored; HF secrets via Space settings API |
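
Three of these controls (slug validation, SHA-256 query hashing, and parameterized SQL) are easy to show together with the standard library. The sketch below is an illustration under assumptions, not the contents of `kb_manager.py` or `sqlite_store.py`: the helper names and the `audit_log` table layout are hypothetical, while the regular expression is the one quoted above.

```python
import hashlib
import re
import sqlite3

# Slug pattern quoted in the Security table.
KB_SLUG = re.compile(r"^[a-z0-9][a-z0-9\-]*[a-z0-9]$")


def is_valid_kb_name(name):
    """Reject anything that is not a lowercase slug (blocks '../' and spaces)."""
    return KB_SLUG.fullmatch(name) is not None


def hash_query(query):
    """Return the SHA-256 hex digest stored in place of the raw query."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def log_query(conn, kb_name, query, latency_ms):
    """Insert an audit row using placeholders, never string formatting."""
    conn.execute(
        "INSERT INTO audit_log (kb_name, query_hash, latency_ms) VALUES (?, ?, ?)",
        (kb_name, hash_query(query), latency_ms),
    )


# Hypothetical in-memory schema for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (kb_name TEXT, query_hash TEXT, latency_ms REAL)")
log_query(conn, "my-docs", "what is the refund policy?", 412.0)
row = conn.execute("SELECT kb_name, query_hash FROM audit_log").fetchone()
print(row)
```

Note that the quoted pattern requires at least two characters, so a single-letter KB name would be rejected.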

Phase Documentation

Each phase has a detailed write-up covering design decisions, key code sections, and test results:

| Phase | Topic | Tests |
| --- | --- | --- |
| Phase 0 | Project Foundation (config, models, schema, scaffold) | 58 ✅ |
| Phase 1 | Document Ingestion (parser, chunker, indexer) | 46 ✅ |
| Phase 2 | Hybrid Retrieval (BM25 + vector + RRF + reranker) | 33 ✅ |
| Phase 3 | ASR & Voice Input (Whisper, query preprocessor) | 47 ✅ |
| Phase 4 | Generation & Citations (LangChain, faithfulness guard) | 72 ✅ |
| Phase 5 | Full UI, TTS & Access Control | 55 ✅ |
| Phase 6 | FastAPI Server, SPA Frontend & HF Deployment | 17 ✅ |

Total: 328 tests — all passing.


License

Source Available — All Rights Reserved. See LICENSE for full terms.

The source code is publicly visible for viewing and educational purposes. Any use in personal, commercial, or academic projects requires explicit written permission from the author.

To request permission: navnitamrutharaj1234@gmail.com

Author: Navnit Amrutharaj
