Voice-First RAG Knowledge Agent
Speak to your documents. Get cited answers back.
VoiceVault is a production-grade, voice-first Retrieval-Augmented Generation (RAG) system built entirely from scratch. It lets users record or type questions and receive answers grounded in their own private document collections, with inline citations pointing back to the exact source file and page.
The project was built in seven phases (Phase 0 through Phase 6) over several weeks, with a full test suite (328 tests), security hardening (bcrypt password hashing, parameterized SQL, SHA-256-hashed audit logs, SSRF prevention), and deployment to Hugging Face Spaces via Docker.
What makes this different from typical RAG demos:
- Hybrid retrieval — BM25 keyword search + semantic vector search, fused with Reciprocal Rank Fusion (RRF) + cross-encoder reranking. Most tutorials use only one retrieval method.
- Voice-native pipeline — Groq Whisper API for ~300ms cloud transcription with local Whisper fallback; Web Speech API for TTS output.
- Faithfulness guard — Detects when the LLM cannot answer from retrieved context and returns a grounded refusal instead of hallucinating.
- Multi-KB support — Multiple independent knowledge bases, each optionally password-protected.
Record your question via microphone or type it. The mic button pulses when recording.
Create named knowledge bases, upload documents (PDF, DOCX, HTML, MD, TXT), and manage them.
Real-time query statistics: total queries, average latency, citation counts, and daily breakdowns.
A populated knowledge base (358 chunks from 1 document) and a live conversation with the RAG pipeline.
```
INGESTION PATH (one-time per document set)
──────────────────────────────────────────────────────
User uploads PDF / HTML / DOCX / MD / TXT
        │
        ▼
DocumentParser     → text + metadata per page
        │            (PyMuPDF, BS4, python-docx)
        ▼
SemanticChunker    → sentence-aware chunks
        │            (spaCy sentences + cosine boundary)
        ▼
IndexBuilder       → ChromaDB (vector) + BM25 (keyword)
                     + SQLite (metadata)
```
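The "cosine boundary" step in the ingestion path can be pictured as: embed each sentence, keep appending sentences to the current chunk, and start a new chunk when the similarity between neighbouring sentences drops or the chunk would grow too large. This is a minimal sketch, assuming sentences are already split (VoiceVault uses spaCy for that) and that an `embed` callable is injected so it runs without a model; the threshold of 0.5 and the word-based cap are hypothetical stand-ins for whatever the project's `SemanticChunker` actually uses (its `CHUNK_SIZE_MAX` counts tokens, not words).

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,  # hypothetical: start a new chunk below this similarity
    max_words: int = 600,    # hypothetical: word cap standing in for a token limit
) -> list[str]:
    """Merge consecutive sentences into chunks, cutting at topic shifts or the size cap."""
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    prev_vec = None
    for sentence in sentences:
        vec = embed(sentence)
        n_words = len(sentence.split())
        topic_shift = prev_vec is not None and cosine(prev_vec, vec) < threshold
        over_cap = current_words + n_words > max_words
        if current and (topic_shift or over_cap):
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy 2-D "embeddings" so the sketch runs without loading all-MiniLM-L6-v2.
toy = {"Cats purr.": [1.0, 0.0], "Cats nap a lot.": [0.9, 0.1], "GPUs run CUDA.": [0.0, 1.0]}
print(semantic_chunks(list(toy), toy.__getitem__))
# ['Cats purr. Cats nap a lot.', 'GPUs run CUDA.']
```

The two cat sentences point in nearly the same direction and stay together; the GPU sentence is almost orthogonal to its predecessor, so it opens a new chunk.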
```
QUERY PATH (real-time, per question)
──────────────────────────────────────────────────────
Browser mic → WAV → POST /api/transcribe
        │
        ▼
GroqTranscriber    → Groq Whisper API (~300ms)
        │            [fallback: local Whisper CPU]
        ▼
QueryPreprocessor  → filler removal, intent classification
        │            (factual / summary / compare)
        ▼
HybridRetriever    → BM25 top-20 + Vector top-20
        │            → RRF merge (k=60)
        │            → CrossEncoder rerank (ms-marco-MiniLM-L-12-v2)
        │            → diversity filter (max 2 chunks/page)
        ▼
ContextBuilder     → formatted context with [Source:N] markers
        │
        ▼
LangChain LCEL     → Groq Llama-3.1-70B (primary)
        │            [fallback: Gemini 1.5 Flash]
        ▼
FaithfulnessGuard  → refusal detection, confidence scoring
        │
        ▼
CitationInjector   → resolve [Source:N] → filename + page
        │
        ▼
JSON response      → answer + citations + confidence + tts_text
        │
        ▼
SPA Frontend       → chat display + Web Speech API TTS
```
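The diversity filter, the `[Source:N]` markers added by ContextBuilder, and the marker resolution done by CitationInjector fit together as sketched below. This is a minimal illustration, not the project's code: the `Chunk` dataclass and its field names are hypothetical, the per-page cap of 2 comes from the diagram, and `top_k=5` mirrors the `FINAL_TOP_K` default.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical chunk record; the project's real models live in voicevault/models.py.
    text: str
    filename: str
    page: int
    score: float


def diversity_filter(chunks: list[Chunk], max_per_page: int = 2, top_k: int = 5) -> list[Chunk]:
    """Keep the highest-scoring chunks, allowing at most max_per_page per (file, page)."""
    kept: list[Chunk] = []
    per_page: dict[tuple[str, int], int] = {}
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = (chunk.filename, chunk.page)
        if per_page.get(key, 0) < max_per_page:
            kept.append(chunk)
            per_page[key] = per_page.get(key, 0) + 1
        if len(kept) == top_k:
            break
    return kept


def build_context(chunks: list[Chunk]) -> str:
    """Prefix each chunk with a 1-based [Source:N] marker the LLM is asked to cite."""
    return "\n\n".join(f"[Source:{i}] {c.text}" for i, c in enumerate(chunks, start=1))


SOURCE_MARKER = re.compile(r"\[Source:(\d+)\]")


def inject_citations(answer: str, chunks: list[Chunk]) -> tuple[str, list[dict]]:
    """Replace [Source:N] with '(filename, p. page)'; drop markers with no matching chunk."""
    cited: list[dict] = []

    def resolve(match: re.Match) -> str:
        n = int(match.group(1))
        if not 1 <= n <= len(chunks):
            return ""  # the LLM cited a source that was never in the context
        chunk = chunks[n - 1]
        ref = {"source": n, "filename": chunk.filename, "page": chunk.page}
        if ref not in cited:
            cited.append(ref)
        return f"({chunk.filename}, p. {chunk.page})"

    return SOURCE_MARKER.sub(resolve, answer), cited


hits = [Chunk("A", "guide.pdf", 3, 0.9), Chunk("B", "guide.pdf", 3, 0.8),
        Chunk("C", "guide.pdf", 3, 0.7), Chunk("D", "faq.md", 1, 0.6)]
top = diversity_filter(hits)  # C is dropped: page 3 of guide.pdf already has two chunks
print(build_context(top))
print(inject_citations("Reset it via settings [Source:3].", top))
```

Numbering the chunks only after filtering keeps the markers the LLM sees aligned with the list the injector resolves against.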
| Feature | Detail |
|---|---|
| Voice Input | Browser microphone → WAV conversion → Groq Whisper API (~300ms) |
| Hybrid Retrieval | BM25 + semantic vector search, RRF fusion, cross-encoder reranking |
| Multi-KB | Create multiple independent knowledge bases per session |
| KB Access Control | Optional bcrypt password protection (work factor 12) per KB |
| Document Formats | PDF, DOCX, HTML, Markdown, TXT (OCR fallback for scanned PDFs) |
| Source Citations | Every answer traceable to source file + page number |
| Faithfulness Guard | Detects answers the retrieved context cannot support; returns a grounded refusal when context is insufficient |
| Conversation Memory | Rolling 5-turn conversation window passed to the LLM |
| LLM Fallback | Groq Llama-3.1-70B → Gemini 1.5 Flash automatic fallback |
| TTS Output | Web Speech API reads answer aloud with citation markers stripped |
| Analytics | SQLite audit log: query counts, latency, citation rates (7-day window) |
| Privacy | Raw queries never stored — SHA-256 hash only in audit log |
| 328 Tests | Integration + unit tests across all seven phases (Phase 0 to Phase 6) |
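The Faithfulness Guard row describes returning a grounded refusal instead of a weakly supported answer. One plausible heuristic is sketched below, assuming the guard sees the raw answer (still containing `[Source:N]` markers, since it runs before CitationInjector) plus the cross-encoder scores of the context chunks. The refusal phrases, the 0.3 confidence floor, and the sigmoid mapping (which assumes the reranker returns raw logits) are all hypothetical choices, not the project's actual rules.

```python
import math
import re

# Hypothetical phrases the answer prompt might tell the LLM to use when it cannot answer.
REFUSAL_PATTERN = re.compile(
    r"\b(i (?:do not|don't) know|cannot answer|"
    r"not (?:mentioned|covered) in the (?:provided )?(?:context|documents))\b",
    re.IGNORECASE,
)
GROUNDED_REFUSAL = "I couldn't find an answer to that in the selected knowledge base."


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def guard(answer: str, rerank_scores: list[float], min_confidence: float = 0.3) -> dict:
    """Pass the answer through, or swap in a grounded refusal when support looks weak."""
    # Confidence comes from the best-matching chunk (assumed to be a raw logit).
    confidence = sigmoid(max(rerank_scores)) if rerank_scores else 0.0
    refused = REFUSAL_PATTERN.search(answer) is not None  # the LLM admitted it cannot answer
    uncited = "[Source:" not in answer  # nothing in the answer points back to the context
    if refused or uncited or confidence < min_confidence:
        return {"answer": GROUNDED_REFUSAL, "confidence": confidence, "refused": True}
    return {"answer": answer, "confidence": confidence, "refused": False}


print(guard("Uploads are capped at 50MB [Source:1].", [2.0]))
print(guard("I don't know based on these documents.", [2.0]))
```

Requiring at least one citation marker is a cheap proxy for grounding: an answer that never points at the context is treated as unsupported even if it sounds confident.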
| Layer | Technology | Purpose |
|---|---|---|
| API | FastAPI + uvicorn | REST backend with async endpoints |
| Frontend | HTML5 / CSS3 / Vanilla JS | Premium dark SPA (no framework) |
| ASR | Groq Whisper API | Cloud transcription (~300ms) |
| ASR Fallback | OpenAI Whisper Large-v3 | Local CPU transcription |
| Embeddings | sentence-transformers all-MiniLM-L6-v2 | Dense vector representations |
| Reranking | cross-encoder/ms-marco-MiniLM-L-12-v2 | Semantic relevance scoring |
| Vector Store | ChromaDB | In-process vector database |
| Keyword Search | rank-bm25 (BM25Okapi) | Lexical keyword matching |
| Chunking | spaCy en_core_web_sm | Sentence boundary detection |
| LLM (primary) | Groq Llama-3.1-70B | Fast inference via Groq cloud |
| LLM (fallback) | Gemini 1.5 Flash | Google generative AI fallback |
| Orchestration | LangChain LCEL | LLM pipeline composition |
| Metadata | SQLite | KB registry, doc index, audit log |
| Security | bcrypt (work factor 12) | KB password hashing |
| Config | Pydantic-settings | Centralized, type-safe config |
| Deployment | Docker on Hugging Face Spaces | Container-based cloud hosting |
```
Project-VoiceVault/
├── server.py              # FastAPI entry point (run this)
├── app.py                 # Gradio entry point (legacy / tests)
├── config.py              # Centralized Pydantic-settings config
├── requirements.txt       # All dependencies
├── Dockerfile             # HF Spaces Docker deployment
├── .env.example           # Environment variable template
│
├── api/                   # FastAPI REST API
│   ├── __init__.py
│   └── routes.py          # All /api/* endpoints
│
├── static/                # SPA frontend assets
│   ├── index.html         # Single-page application shell
│   ├── style.css          # Dark glassmorphism design system
│   └── app.js             # Full SPA logic (recording, chat, KB CRUD)
│
├── voicevault/            # Core package
│   ├── models.py          # Pydantic data models
│   ├── asr/
│   │   ├── groq_transcriber.py     # Groq Whisper cloud ASR (~300ms)
│   │   ├── whisper_transcriber.py  # Local Whisper CPU/GPU fallback
│   │   └── query_preprocessor.py   # Filler removal, intent classification
│   ├── ingestion/
│   │   ├── document_parser.py      # PDF/HTML/DOCX/MD/TXT → structured text
│   │   ├── semantic_chunker.py     # Sentence-aware chunking with topic boundaries
│   │   └── index_builder.py        # ChromaDB + BM25 + SQLite orchestration
│   ├── retrieval/
│   │   ├── hybrid_retriever.py     # BM25 + vector + RRF + cross-encoder
│   │   ├── bm25_retriever.py       # BM25Okapi keyword search
│   │   ├── vector_retriever.py     # ChromaDB semantic search
│   │   └── context_builder.py      # Context formatting + citation markers
│   ├── generation/
│   │   ├── answer_chain.py         # LangChain LCEL + Groq + Gemini fallback
│   │   ├── faithfulness_guard.py   # Hallucination detection + refusal
│   │   └── citation_injector.py    # [Source:N] → filename + page resolution
│   ├── kb/
│   │   └── kb_manager.py           # KB lifecycle, bcrypt auth, validation
│   ├── storage/
│   │   ├── sqlite_store.py         # Schema, CRUD, audit log queries
│   │   └── chroma_store.py         # ChromaDB wrapper
│   └── tts/
│       └── web_speech.py           # TTS text preparation
│
├── ui/                    # Gradio UI components (legacy / app.py)
│   ├── tabs/
│   │   ├── ask_tab.py
│   │   ├── kb_tab.py
│   │   ├── analytics_tab.py
│   │   └── settings_tab.py
│   └── components/
│       ├── citation_panel.py
│       └── audio_controls.py
│
├── tests/                 # Full test suite (328 tests)
│   ├── conftest.py
│   ├── test_api_routes.py # Integration tests (FastAPI + real methods)
│   ├── test_phase0.py     # Foundation tests
│   ├── test_phase1.py     # Ingestion tests
│   ├── test_phase2.py     # Retrieval tests
│   ├── test_phase3.py     # ASR tests
│   ├── test_phase4.py     # Generation tests
│   └── test_phase5.py     # UI / access control tests
│
├── DOCS/                  # Detailed phase documentation
│   ├── phase0_foundation.md
│   ├── phase1_ingestion.md
│   ├── phase2_retrieval.md
│   ├── phase3_asr.md
│   ├── phase4_generation.md
│   ├── phase5_ui_access.md
│   └── phase6_deployment.md
│
└── Screenshots/
    ├── 1.png              # Ask tab: voice query interface
    ├── 2.png              # Knowledge Bases panel
    ├── 3.png              # Analytics dashboard
    └── 4.png              # Full app with KB and live conversation
```
- Python 3.11+
- A Groq API key (free at console.groq.com)
- Optionally a Gemini API key (free at aistudio.google.com)
```bash
git clone https://github.com/ninjacode911/Project-VoiceVault.git
cd Project-VoiceVault
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install torch --index-url https://download.pytorch.org/whl/cpu  # CPU-only (saves ~1.8GB)
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

```bash
cp .env.example .env
# Edit .env and add:
# GROQ_API_KEY=gsk_...
# GEMINI_API_KEY=...   (optional)
```

```bash
python server.py
# Open http://localhost:7860
```

- Navigate to Knowledge Bases → click + New Knowledge Base
- Name it using lowercase letters, digits, and hyphens, starting and ending with a letter or digit (e.g. `my-docs`), then upload your PDFs/documents
- Go back to Ask VoiceVault → select your KB → record or type a question → click Ask

```bash
pytest tests/ -v
# Expected: 328 passed
```

The integration tests in `tests/test_api_routes.py` use a real `KBManager` backed by a temporary SQLite database and exercise the actual FastAPI routes and method signatures rather than mocked pipelines. This is intentional: it catches runtime `AttributeError` bugs that pure-mock unit tests miss.
The project ships with a Dockerfile configured for HF Spaces. The Docker image:
- Uses a Python 3.11-slim base image
- Installs CPU-only PyTorch (~650MB vs 2.5GB for the GPU wheels)
- Pre-downloads `all-MiniLM-L6-v2` and `cross-encoder/ms-marco-MiniLM-L-12-v2` at build time (no model downloads on cold start)
- Downloads the `en_core_web_sm` spaCy model at build time
- Binds to `0.0.0.0:7860` (the HF Spaces default port)
To deploy your own copy:
- Create a Hugging Face Space with Docker SDK
- Push this repository to the Space's git remote
- Add `GROQ_API_KEY` (and optionally `GEMINI_API_KEY`) as Space secrets
See DOCS/phase6_deployment.md for the full deployment walkthrough.
All configuration is environment-driven via .env. See .env.example for the full reference.
Key variables:
| Variable | Default | Description |
|---|---|---|
| `GROQ_API_KEY` | (none) | Required. Groq API key for Whisper + Llama |
| `GEMINI_API_KEY` | (none) | Optional Gemini fallback key |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `7860` | Server port |
| `FINAL_TOP_K` | `5` | Number of chunks passed to the LLM |
| `MAX_ANSWER_TOKENS` | `500` | LLM max output tokens |
| `CHUNK_SIZE_MAX` | `600` | Max tokens per document chunk |
| `BCRYPT_ROUNDS` | `12` | bcrypt work factor for KB passwords |
| Control | Implementation |
|---|---|
| No raw queries stored | Audit log stores SHA-256 hash only |
| KB access control | bcrypt-hashed passwords (work factor 12) |
| SQL injection prevention | 100% parameterized queries — no f-string SQL |
| Path traversal prevention | KB names validated as slugs (^[a-z0-9][a-z0-9\-]*[a-z0-9]$) |
| SSRF prevention | URL ingestion via trafilatura with no internal-network access |
| Upload whitelist | Only .pdf, .html, .docx, .md, .txt accepted |
| File size limit | 50MB max per upload |
| GPU isolation | CUDA_VISIBLE_DEVICES=-1 prevents CUDA crashes on incompatible hardware |
| No secrets in git | .env gitignored; HF secrets via Space settings API |
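Several controls in this table fit in a few lines of standard-library Python. The sketch below is illustrative only: the slug pattern, extension whitelist, 50MB limit, and SHA-256 hashing come from the table, while the function names, the `audit_log` table and its columns, and reading 50MB as 50 MiB are hypothetical choices rather than the project's schema.

```python
import hashlib
import re
import sqlite3
from pathlib import Path

KB_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9\-]*[a-z0-9]$")  # slug rule from the table
ALLOWED_SUFFIXES = {".pdf", ".html", ".docx", ".md", ".txt"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50MB, read here as 50 MiB (assumption)


def validate_kb_name(name: str) -> str:
    # fullmatch, because `$` alone also matches just before a trailing newline.
    if not KB_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid KB name: {name!r}")
    return name


def check_upload(filename: str, size_bytes: int) -> None:
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 50MB upload limit")


def hash_query(query: str) -> str:
    """Only this digest is stored; the raw query text never reaches the database."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def log_query(conn: sqlite3.Connection, kb_name: str, query: str, latency_ms: float) -> None:
    # Parameterized insert: sqlite3 binds the values; nothing is formatted into the SQL.
    conn.execute(
        "INSERT INTO audit_log (kb_name, query_hash, latency_ms) VALUES (?, ?, ?)",
        (validate_kb_name(kb_name), hash_query(query), latency_ms),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (kb_name TEXT, query_hash TEXT, latency_ms REAL)")
check_upload("handbook.pdf", 2_000_000)
log_query(conn, "my-docs", "What is the refund policy?", 412.0)
print(conn.execute("SELECT kb_name, query_hash FROM audit_log").fetchone())
```

Note that the slug pattern requires at least two characters and rejects names such as `../etc`, `My-Docs`, or `-docs`, which is what closes off path traversal through KB names.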
Each phase has a detailed write-up covering design decisions, key code sections, and test results:
| Phase | Topic | Tests |
|---|---|---|
| Phase 0 | Project Foundation (config, models, schema, scaffold) | 58 ✅ |
| Phase 1 | Document Ingestion (parser, chunker, indexer) | 46 ✅ |
| Phase 2 | Hybrid Retrieval (BM25 + vector + RRF + reranker) | 33 ✅ |
| Phase 3 | ASR & Voice Input (Whisper, query preprocessor) | 47 ✅ |
| Phase 4 | Generation & Citations (LangChain, faithfulness guard) | 72 ✅ |
| Phase 5 | Full UI, TTS & Access Control | 55 ✅ |
| Phase 6 | FastAPI Server, SPA Frontend & HF Deployment | 17 ✅ |
Total: 328 tests — all passing.
Source Available — All Rights Reserved. See LICENSE for full terms.
The source code is publicly visible for viewing and educational purposes. Any use in personal, commercial, or academic projects requires explicit written permission from the author.
To request permission: navnitamrutharaj1234@gmail.com
Author: Navnit Amrutharaj