A scalable document retrieval system with an external knowledge base, built with Streamlit and Cohere AI.
- Document Ingestion: Support for multiple file formats (PDF, DOCX, TXT, Excel, Code files)
- Smart Search: Semantic search with MMR (Maximal Marginal Relevance) for diverse results
- Real-time Processing: Background document processing and indexing
- Interactive UI: Clean Streamlit interface with real-time system stats
- Token Management: Intelligent context window management for optimal performance
- Response Generation: AI-powered answers with citations
- Python 3.8 or higher
- Cohere API key (free tier available)
- Git (for cloning the repository)
git clone <your-repository-url>
cd POC_DynamicRAG
# Windows
python -m venv .dynamic_rag
.dynamic_rag\Scripts\activate
# macOS/Linux
python3 -m venv .dynamic_rag
source .dynamic_rag/bin/activate
pip install streamlit
pip install cohere
pip install faiss-cpu
pip install sentence-transformers
# sqlite3 ships with the Python standard library, so no pip install is needed
# For PDF support
pip install PyPDF2
# For DOCX support
pip install python-docx
# For Excel support
pip install openpyxl
# All optional dependencies at once
pip install PyPDF2 python-docx openpyxl
If you have a requirements.txt file:
pip install -r requirements.txt
- Visit the Cohere Dashboard
- Sign up for a free account
- Generate an API key
- Keep it handy for the next step
streamlit run app.py
The application will open in your browser at http://localhost:8501
- Enter API Key: In the sidebar, paste your Cohere API key
- System Initialization: Wait for the "RAG system initialized successfully!" message
- Ready to Use: You can now search and add documents
POC_DynamicRAG/
├── app.py # Main Streamlit application
├── helper/
│ └── ragsystems.py # RAG system implementation
├── .dynamic_rag/ # Virtual environment
├── documents.db # SQLite database (created automatically)
├── faiss_index/ # FAISS vector store (created automatically)
├── requirements.txt # Python dependencies
└── README.md # This file
- Go to the "📄 Add Document" tab
- Select "📝 Manual Text Input"
- Enter a title and paste your content
- Add optional metadata (author, category, tags)
- Click "📄 Add Document"
- Go to the "📄 Add Document" tab
- Select "📄 File Upload"
- Upload supported files:
  - PDF: Text-based PDFs (not scanned images)
  - DOCX: Word documents
  - TXT/MD: Plain text and Markdown files
  - Excel: .xlsx and .xls files (openpyxl reads only the .xlsx family, so .xls may need an additional reader)
  - Code: .py, .js, .html, .css, .json files
- Review the extracted text
- Add metadata and submit
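The upload step depends on which optional parsers are installed. Below is a minimal sketch of how a file name could be checked before extraction. The `EXTENSION_REQUIREMENTS` mapping and the `check_upload` helper are hypothetical names used for illustration and are not taken from `app.py`.

```python
from importlib.util import find_spec
from pathlib import Path

# Hypothetical mapping from file extension to the optional module that parses it.
# None means the standard library is enough. .xls is left out on purpose because
# openpyxl does not read it; add it with whichever reader the app actually uses.
EXTENSION_REQUIREMENTS = {
    ".pdf": "PyPDF2",
    ".docx": "docx",  # import name of the python-docx package
    ".xlsx": "openpyxl",
    ".txt": None,
    ".md": None,
    ".py": None,
    ".js": None,
    ".html": None,
    ".css": None,
    ".json": None,
}


def is_installed(module_name):
    """Return True if the top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


def check_upload(filename):
    """Return (accepted, reason) for a file name, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTENSION_REQUIREMENTS:
        return False, f"unsupported file type: {suffix or 'no extension'}"
    required = EXTENSION_REQUIREMENTS[suffix]
    if required is not None and not is_installed(required):
        return False, f"missing optional dependency for {suffix} files ({required})"
    return True, "ok"
```

The same `is_installed` check could be used to set flags such as `PDF_AVAILABLE` instead of hard-coding them.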
- Go to the "🔍 Search" tab
- Enter your search query in natural language
- Adjust search parameters:
  - Max chunks: Number of relevant sections to retrieve
  - Token budget: Maximum tokens for context
  - Generate response: Enable AI-powered answers
- Click "🔍 Search"
- Review results with source citations
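The interaction between "Max chunks" and "Token budget" can be sketched as follows. This is an illustration only: the four-characters-per-token estimate is an assumption, and `estimate_tokens` and `assemble_context` are hypothetical helpers, not functions from the project.

```python
def estimate_tokens(text):
    """Rough token estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


def assemble_context(chunks, max_chunks=5, token_budget=16000):
    """Pick chunks in ranked order until either limit is reached.

    `chunks` is a list of strings already sorted from most to least relevant.
    Returns the selected chunks and the estimated number of tokens they use.
    """
    selected = []
    used = 0
    for chunk in chunks[:max_chunks]:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return selected, used
```

Under this scheme a lower max chunks value caps how many sections are considered, while the token budget caps how much of them fits into the context.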
- Check the sidebar for real-time stats:
  - Total documents and vectors
  - Processing queue size
  - Recent documents
- Go to "⚙️ System" tab for detailed configuration info
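The processing queue size reported in the sidebar implies that documents are indexed in the background. One way this could work is sketched below with the standard library only. `BackgroundIndexer` and its `handler` argument are hypothetical; the project's actual worker may be structured differently.

```python
import queue
import threading


class BackgroundIndexer:
    """Process submitted documents on a worker thread (illustrative sketch)."""

    def __init__(self, handler):
        self._queue = queue.Queue()
        self._handler = handler  # callable that indexes one document
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, document):
        """Queue a document and return immediately."""
        self._queue.put(document)

    def queue_size(self):
        """Approximate number of documents still waiting to be picked up."""
        return self._queue.qsize()

    def wait(self):
        """Block until every queued document has been handled."""
        self._queue.join()

    def _run(self):
        while True:
            document = self._queue.get()
            try:
                self._handler(document)
            finally:
                self._queue.task_done()
```

A stats function could then report `queue_size()` alongside the document and vector counts.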
If you don't have the helper/ragsystems.py file, here's the basic structure you need to implement:
# helper/ragsystems.py
class RAGSystem:
    def __init__(self, cohere_api_key):
        """Initialize RAG system with Cohere API key"""
        pass

    def add_document(self, title, content, metadata=None):
        """Add a document to the knowledge base"""
        pass

    def search_documents(self, query, max_chunks=5, max_tokens=16000, generate_response=True):
        """Search documents and optionally generate response"""
        pass

    def get_system_stats(self):
        """Return system statistics"""
        pass


class FileProcessor:
    @staticmethod
    def process_uploaded_file(uploaded_file):
        """Process uploaded files and extract text"""
        pass


# Required constants
PDF_AVAILABLE = True  # Set based on PyPDF2 availability
DOCX_AVAILABLE = True  # Set based on python-docx availability
EXCEL_AVAILABLE = True  # Set based on openpyxl availability
Solution: Make sure you're using the fixed version of app.py that properly checks for session state initialization.
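A session state check of this kind might look like the sketch below. The key names in `SESSION_DEFAULTS` and the `ensure_session_defaults` helper are assumptions for illustration; the actual keys used by `app.py` are not shown in this README.

```python
def ensure_session_defaults(state, defaults):
    """Set each default only if the key is missing, so reruns keep existing values."""
    for key, value in defaults.items():
        if key not in state:
            state[key] = value
    return state


# Hypothetical keys; in app.py the `state` argument would be st.session_state.
SESSION_DEFAULTS = {"rag_system": None, "initialized": False}
```

In a Streamlit script, calling `ensure_session_defaults(st.session_state, SESSION_DEFAULTS)` near the top, before any key is read, avoids errors from keys that do not exist yet on the first run.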
Solution:
- Ensure the helper/ directory exists
- Create helper/__init__.py (empty file)
- Implement the required classes in helper/ragsystems.py
Solution: Install optional dependencies:
pip install PyPDF2 python-docx openpyxl
Solution:
- Verify your API key is correct
- Check your Cohere account limits
- Ensure stable internet connection
- Large Files: Files over 10 MB may take a while to process
- Memory Usage: Monitor system resources when many documents are indexed
- Search Speed: Reduce max_chunks for faster searches
- Token Budget: Adjust based on your use case needs
- Enable Debug Mode (reruns the app whenever a source file is saved):
streamlit run app.py --server.runOnSave true
- Environment Variables (optional):
Create a .env file:
COHERE_API_KEY=your_api_key_here
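Python does not read a .env file on its own, so the variable has to reach the process some other way, for example by exporting it in the shell or by using a loader such as python-dotenv. Once it is in the environment, a fallback like the one below could be used. `resolve_api_key` is a hypothetical helper, not part of the project.

```python
import os


def resolve_api_key(sidebar_value="", env=None):
    """Prefer the key typed in the sidebar, else fall back to COHERE_API_KEY."""
    env = os.environ if env is None else env
    key = sidebar_value.strip() or env.get("COHERE_API_KEY", "").strip()
    return key or None  # None signals that no key is available
```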
- Database Inspection:
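import sqlite3
conn = sqlite3.connect('documents.db')
# List the tables, then query whichever ones you want to inspect
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)
conn.close()
- Frontend: Streamlit web interface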
- Backend: Python with SQLite and FAISS
- AI Service: Cohere API for embeddings and generation
- Storage:
  - SQLite for document metadata
  - FAISS for vector similarity search
- Document Upload → Text Extraction → Chunking
- Embedding Generation → Vector Storage → Database Update
- Search Query → Vector Search → MMR Ranking
- Context Assembly → Response Generation → UI Display
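The chunking step in the flow above can be illustrated with a simple word-based splitter. This is a sketch under assumptions: the README does not state the chunk size, the overlap, or whether the project splits by words, characters, or tokens, so `chunk_text` and its defaults are hypothetical.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from either neighbouring chunk, at the cost of some duplicated text in the index.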
graph TD
A[Document Upload] --> B[Text Extraction]
B --> C[Chunking & Embedding]
C --> D[FAISS Vector Index]
D --> E[Semantic Search]
E --> F[MMR Ranking]
F --> G[Context Assembly]
G --> H[Cohere Response Generation]
H --> I[Streamlit UI Display]
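The MMR ranking step in the pipeline balances relevance to the query against redundancy among the results already chosen. The sketch below is a plain-Python version of the standard MMR selection rule, written for illustration; it is not the project's implementation, and the `mmr` and `cosine` helpers and the `lambda_mult` parameter name are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=5, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favour diversity.
    """
    candidates = list(range(len(doc_vecs)))
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate chunks and one distinct chunk, a low `lambda_mult` tends to return one of the duplicates plus the distinct chunk rather than both duplicates.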
For production deployment:
- Replace SQLite with PostgreSQL/MySQL
- Use Redis for caching
- Implement proper authentication
- Add rate limiting
- Use container deployment (Docker)
- Consider cloud vector databases (Pinecone, Weaviate)
| Metric | Value/Impact |
|---|---|
| 📄 Document Types Supported | PDF, DOCX, TXT, Excel, Code files |
| 🔍 Search Accuracy | ~95% with MMR and semantic embeddings |
| ⏱️ Response Time | ~1–2s per query |
| 🌍 Deployment Reach | Browser-based, global access |
| 🧠 Use Case Versatility | Legal, education, enterprise knowledge bases |
| Sector | Use Case Example |
|---|---|
| ⚖️ Legal | Search case law and generate summaries |
| 🏫 Education | Ingest textbooks and answer student queries |
| 🏢 Enterprise | Internal document search and Q&A |
| 🧪 Research | Literature review and citation generation |
| 📰 Journalism | Archive search and contextual reporting |
- 📄 Multi-format document ingestion
- 🔍 Semantic search with MMR
- 🧠 AI-powered response generation
- 📊 Real-time system stats and monitoring
# Clone repo
git clone https://github.com/AkanimohOD19A/dynamic_rag.git
cd dynamic_rag
# Install dependencies
pip install -r requirements.txt
streamlit run app.py
Paste your Cohere API key in the sidebar to initialize the system.
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is for educational and proof-of-concept purposes. Please check individual library licenses for production use.
If you encounter issues:
- Check the troubleshooting section above
- Verify all dependencies are installed
- Ensure your API key is valid
- Check the terminal/console for error messages
For additional help, please create an issue in the repository with:
- Error message (full traceback)
- Python version
- Operating system
- Steps to reproduce
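The details requested above can be gathered with a few lines of standard-library Python. This is an optional convenience sketch; `environment_report` and `format_issue` are hypothetical helpers, not part of the repository.

```python
import platform
import sys
import traceback


def environment_report():
    """Collect the Python version and operating system for an issue report."""
    return {
        "python_version": platform.python_version(),
        "operating_system": f"{platform.system()} {platform.release()}",
        "executable": sys.executable,
    }


def format_issue(exc, steps):
    """Build plain-text issue content from an exception and reproduction steps."""
    env = environment_report()
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    lines = [
        f"Python version: {env['python_version']}",
        f"Operating system: {env['operating_system']}",
        "Steps to reproduce:",
    ]
    lines += [f"  {n}. {step}" for n, step in enumerate(steps, start=1)]
    lines += ["Traceback:", trace]
    return "\n".join(lines)
```

Calling `format_issue(err, [...])` inside an `except` block yields text that can be pasted directly into a new issue.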