This project is an end-to-end question answering system that allows users to upload books or documents (PDF, DOCX, TXT, HTML), ask natural language questions, and receive accurate answers with cited supporting context from the uploaded materials.
Built with Gradio and Hugging Face, the solution integrates document parsing, text chunking, embeddings, vector search (FAISS), and a lightweight LLM (TinyLlama) to deliver scalable QA directly from unstructured files.
The workflow demonstrates machine learning best practices by combining automated pipelines, CI/CD for reliability, and easy deployment via Hugging Face Spaces.
Hugging Face Spaces Gradio App link: https://huggingface.co/spaces/samithcs/RAG_Book_QA_System
Docker link: https://hub.docker.com/repository/docker/samithc/rag-book-qa
- Authors
- Table of Contents
- Problem Statement
- Tech Stack
- Supported Document Types
- Quick glance at the results
- Lessons learned and recommendations
- Limitations and what can be improved
- Work Flows
- Run Locally
- Contribution
- License
Modern documents—books, manuals, research papers, and technical guides—contain vast amounts of valuable information, but searching for specific answers or contextual knowledge within them is slow, tedious, and often inefficient using traditional keyword search or manual review.
There is a growing need for intelligent tools that can extract precise answers to natural language questions directly from user-provided materials, enabling fast, context-aware insights.
This project addresses the challenge of question answering across diverse document formats by:
- Automatically parsing unstructured files (PDF, DOCX, TXT, HTML)
- Segmenting and embedding textual content for semantic search
- Leveraging scalable retrieval and LLMs to generate accurate, context-cited answers
The solution empowers students, researchers, professionals, and organizations to interactively query their documents and receive instant, explainable responses—making knowledge truly accessible.
- Python
- Gradio
- Hugging Face Transformers
- FAISS
- SentenceTransformers
- Docker
- GitHub Actions
- Hugging Face Spaces
- FastAPI
The system works with on-demand uploads. Supported formats: PDF, DOCX, TXT, HTML.
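A minimal sketch of how an upload could be checked against the supported formats before it reaches the parser. The function name `validate_upload` and the 20 MB size cap are illustrative assumptions, not taken from the project code.

```python
from pathlib import Path

# Formats listed as supported above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".html"}

# Hypothetical size cap; the real limit depends on the deployment.
MAX_UPLOAD_BYTES = 20 * 1024 * 1024


def validate_upload(path: str, max_bytes: int = MAX_UPLOAD_BYTES) -> str:
    """Return the lowercase extension, or raise ValueError if the file is unusable."""
    file_path = Path(path)
    extension = file_path.suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    if not file_path.is_file():
        raise ValueError(f"File not found: {path}")
    size = file_path.stat().st_size
    if size == 0:
        raise ValueError("File is empty")
    if size > max_bytes:
        raise ValueError(f"File is too large: {size} bytes")
    return extension
```

The returned extension can then be used to pick a format-specific parser. Checking only the extension does not detect corrupt or mislabeled files, so a parser should still handle its own errors.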
Gradio App in Hugging Face Space
- Implemented a scalable, modular QA pipeline combining document ingestion, semantic chunking, embeddings, and retrieval-augmented LLMs.
- Integration with Hugging Face Spaces enables hassle-free cloud sharing, but required careful management of memory and file size limitations.
- Encountered file format edge cases (PDF parsing, encoding issues) and addressed them with robust pre-processing.
- Model choice (SentenceTransformers for embeddings, TinyLlama for generation) substantially impacts retrieval quality and inference speed. Benchmark both accuracy and performance for your use case.
- Deployment on Spaces requires keeping dependency lists lean and minimizing build times.
- Pre-validate and clean user-uploaded files to avoid pipeline crashes on corrupt or non-standard documents.
- Choose chunk sizes and embedding strategies that balance retrieval relevance with model context window limits.
- Monitor resource usage and error logs actively, especially when running in shared/public cloud environments.
- Consider progressive enhancements: add support for more file types, GPU inference, admin monitoring, or user feedback collection as next steps.
- Document all setup and deployment steps clearly in README to facilitate rapid local and cloud onboarding.
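To make the chunk-size recommendation concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text` and the default values (500 characters, 50 overlap) are assumptions chosen for illustration; the project's actual chunker and settings may differ.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Larger chunks carry more context per retrieved hit but leave room for fewer hits in the prompt, since the retrieved chunks, the question, and the generated answer all have to fit in the model's context window.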
- File Size & Format: Large files may exceed processing/memory capacity on Hugging Face Spaces. Only PDF, DOCX, TXT, and HTML are supported; some complex or scanned (image-based) PDFs may not parse correctly.
- Model Performance: TinyLlama and similar small LLMs may struggle with advanced reasoning or highly ambiguous queries, especially for very technical or niche topics.
- Chunking Strategy: Fixed-size text chunking can split meaningful context, sometimes impacting answer quality if relevant information is spread across multiple chunks.
- Deployment Constraints: Free or public cloud deployments (e.g., Spaces) can be slow for bigger documents or under heavy load due to resource limits and cold starts.
- No User Authentication: Currently, the app is fully public, with no per-user session management, history, or upload privacy guarantees.
- Dynamic Chunking: Implement smarter, context-aware chunking using sentence boundaries or topic detection.
- More Powerful Models: Allow optional use of larger LLMs (on GPU) or external APIs for users needing higher accuracy or complex reasoning.
- Expanded Format Support: Add support for additional document types (e.g., PPTX, XLSX), images with OCR, and multi-document querying.
- Scalability: Enable horizontal scaling, persistent storage of embeddings, and batch/background processing for larger workloads.
- User & Admin Features: Add user authentication, query analytics, admin dashboard, error reporting, and feedback mechanisms for continuous improvement.
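As one possible starting point for the dynamic chunking idea, the sketch below packs whole sentences into chunks instead of cutting at fixed offsets. The names `split_sentences` and `chunk_by_sentences` are hypothetical, and the regex-based splitter is a deliberately naive assumption (it will mis-split abbreviations such as "e.g."); a real implementation would likely use a proper sentence tokenizer.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_by_sentences(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars where possible."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole as its own chunk in this sketch, so the limit is a target and not a hard guarantee.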
- Document Upload
- Parsing
- Chunking
- Embedding
- Vector Storage
- Query Input
- Semantic Retrieval
- Answer Generation
- Result Display
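The embedding, storage, retrieval, and prompt-building steps above can be sketched with the standard library alone. This is a toy stand-in under stated assumptions: word counts replace the SentenceTransformers embeddings, a plain list replaces the FAISS index, and the prompt string is what would be passed to the LLM. All function names here are hypothetical and do not come from the project code.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Vector storage: keep each chunk next to its vector (stand-in for a vector index)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[tuple[float, str]]:
    """Semantic retrieval: score every chunk against the question, best first."""
    query_vector = embed(question)
    scored = [(cosine(query_vector, vector), chunk) for chunk, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question: str, retrieved: list[tuple[float, str]]) -> str:
    """Answer generation input: numbered context so the answer can cite its sources."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, (_, chunk) in enumerate(retrieved))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


chunks = [
    "The whale is white.",
    "Ahab commands the ship.",
    "The sea was calm that morning.",
]
index = build_index(chunks)
hits = retrieve(index, "Who commands the ship?", top_k=2)
prompt = build_prompt("Who commands the ship?", hits)
```

Numbering the retrieved chunks in the prompt is one simple way to support the cited answers described earlier, because the displayed result can map each number back to its source passage.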
Initialize git
git init
Clone the project
git clone https://github.com/samithcsachi/rag-book-qa.git
Open Anaconda Prompt, change to the project directory, and open VS Code by typing code .
cd E:/rag-book-qa
Create a virtual environment
python -m venv venv
.\venv\Scripts\activate
Install the requirements
pip install -r requirements.txt
Run the FastAPI server
uvicorn app:app --host 127.0.0.1 --port 8000 --reload
Run the Gradio app
python app.py
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change or contribute.
MIT License
Copyright (c) 2025 Samith Chimminiyan
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Learn more about MIT license
If you have any questions, suggestions, or collaborations in data science, feel free to reach out:
- 📧 Email: samith.sachi@gmail.com
- 🔗 LinkedIn: www.linkedin.com/in/samithchimminiyan
- 🌐 Website: www.samithc.github.io

