CSVQuery-RAG is built on a modular architecture consisting of three main components:
- Backend RAG System (`backend/rag_system.py`)
- System Prompts Module (`backend/system_prompts.py`)
- Frontend Gradio Interface (`frontend/app.py`)
- Language: Python 3.8+
- RAG Implementation: LangChain
- Vector Store: Chroma
- LLM: OpenAI GPT-4 Turbo
- UI Framework: Gradio
- Embeddings: OpenAI Embeddings
The core class that handles the Retrieval-Augmented Generation functionality.
`def __init__(self, csv_path: str)`
- Purpose: Initializes the RAG system with a CSV file
- Parameters:
  - `csv_path`: Path to the CSV file to be processed
- Process:
  - Validates CSV file existence
  - Initializes OpenAI embeddings with error handling
  - Calls `initialize_system()`
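The validation step in `__init__` can be sketched as follows. `CSVRAGSystem` is an assumed class name for illustration only, since the actual class in `backend/rag_system.py` is not named here:

```python
import os

class CSVRAGSystem:
    """Illustrative skeleton; the real class also wires up embeddings and the chain."""

    def __init__(self, csv_path: str):
        # Validate CSV file existence before any expensive setup
        if not os.path.isfile(csv_path):
            raise FileNotFoundError(f"CSV file not found: {csv_path}")
        self.csv_path = csv_path
        # The real implementation would now create OpenAIEmbeddings
        # (with error handling) and call self.initialize_system().
```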
`def initialize_system(self)`
- Purpose: Sets up the RAG pipeline
- Components:
  - CSV Data Processing
  - Text Chunking
  - Vector Store Creation
  - Conversation Chain Setup
- Key Features:
  - Uses RecursiveCharacterTextSplitter (chunk_size=300, overlap=50) with CSV-optimized separators
  - Implements ConversationalRetrievalChain with MMR retrieval for diversity
  - Configures ChatOpenAI (GPT-4 Turbo) with temperature=0 and a 60s timeout
  - Uses Chroma vector store with enhanced HNSW parameters
`def is_valid_query(self, question: str) -> bool`
- Purpose: Validates if a query is appropriate for the dataset
- Parameters:
  - `question`: The user's input query
- Returns: Boolean indicating query validity
- Validation Checks:
  - Non-empty string
  - Not out-of-scope topics (using keyword filtering)
  - Relevant to the CSV data context
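A minimal sketch of the keyword-filtering layer of this check; the keyword list here is a hypothetical placeholder, as the real list lives in the backend:

```python
# Illustrative out-of-scope keywords; the project's actual list is not shown here.
OUT_OF_SCOPE_KEYWORDS = {"weather", "news", "joke", "recipe"}

def is_valid_query(question: str) -> bool:
    """Cheap first-pass validation before the query reaches the LLM."""
    if not question or not question.strip():
        return False  # reject empty input
    words = set(question.lower().split())
    if words & OUT_OF_SCOPE_KEYWORDS:
        return False  # reject obviously off-topic questions
    return True
```

Relevance to the CSV context is enforced a second time by the system prompts, so this filter only needs to catch obvious cases cheaply.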
`def reset_memory(self)`
- Purpose: Clears the conversation memory
- Usage: Called when starting a new conversation
- Process: Clears the `ConversationBufferMemory`
`def query(self, input_data: Union[str, dict]) -> str`
- Purpose: Processes user queries and returns responses
- Parameters:
  - `input_data`: Either a string question or a dict with question and chat history
- Returns: Generated response string
- Features:
  - Supports memory reset with a special command
  - Adds column context to queries
  - Handles timeouts and errors gracefully
  - Validates input before processing
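The flexible input handling can be sketched with two small helpers. The `/reset` command string is an assumption for illustration, since the actual trigger is not documented here:

```python
RESET_COMMAND = "/reset"  # assumed trigger; the project's actual command may differ

def normalize_query(input_data):
    """Extract (question, chat_history) from query()'s Union[str, dict] input."""
    if isinstance(input_data, str):
        return input_data, []
    if isinstance(input_data, dict):
        return input_data.get("question", ""), input_data.get("chat_history", [])
    raise TypeError(f"Unsupported input type: {type(input_data).__name__}")

def should_reset(question: str) -> bool:
    """Detect the special memory-reset command before running the chain."""
    return question.strip().lower() == RESET_COMMAND
```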
`def process_query(message: str, history: list) -> tuple`
- Purpose: Handles user input from the Gradio interface
- Parameters:
  - `message`: Current user message
  - `history`: Chat history
- Returns: Updated history and empty message
- Features:
  - Converts chat history to LangChain format
  - Maintains conversation context
  - Handles empty messages
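The history conversion can be illustrated with plain dicts standing in for LangChain's `HumanMessage`/`AIMessage` objects, to keep the sketch dependency-free:

```python
def to_langchain_history(history):
    """Convert Gradio's [(user, bot), ...] pairs into role-tagged messages.
    The real code builds LangChain message objects; dicts are used here
    only so the sketch runs without LangChain installed."""
    messages = []
    for user_msg, bot_msg in history:
        if user_msg:
            messages.append({"role": "human", "content": user_msg})
        if bot_msg:
            messages.append({"role": "ai", "content": bot_msg})
    return messages
```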
`def clear_chat() -> list`
- Purpose: Resets the chat interface and backend memory
- Returns: Empty list for chat history
- Process: Calls the backend's `reset_memory()`
`MAIN_SYSTEM_PROMPT`, `RETRIEVAL_CONTEXT_PROMPT`
- Purpose: Define system prompts for controlling LLM behavior
- Usage: Ensures the model only answers questions within the CSV data context
`def get_combined_prompt(context: str = "", columns_info: str = "") -> str`
- Purpose: Combines system prompts with contextual information
- Parameters:
  - `context`: Additional context retrieved from the vector store
  - `columns_info`: Information about the columns in the CSV
- Returns: Complete system prompt with context
- Features:
  - Incorporates dataset column information
  - Strictly enforces query boundaries
  - Ensures responses are based solely on CSV data
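A dependency-free sketch of the prompt assembly; the prompt strings below are placeholders, not the project's actual prompts:

```python
# Placeholder prompt texts; the real prompts in backend/system_prompts.py are longer.
MAIN_SYSTEM_PROMPT = "Answer only from the provided CSV data."
RETRIEVAL_CONTEXT_PROMPT = "Use the retrieved context below when it is present."

def get_combined_prompt(context: str = "", columns_info: str = "") -> str:
    """Assemble the full system prompt from the base prompts plus context."""
    parts = [MAIN_SYSTEM_PROMPT, RETRIEVAL_CONTEXT_PROMPT]
    if columns_info:
        parts.append(f"Dataset columns: {columns_info}")
    if context:
        parts.append(f"Retrieved context:\n{context}")
    return "\n\n".join(parts)
```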
`with gr.Blocks(theme=gr.themes.Soft()) as demo:`
- Components:
  - Markdown header
  - Chatbot interface (600px height)
  - Query input textbox
  - Submit and Clear buttons
  - Example queries section
- Features:
  - Responsive layout
  - Example queries for user guidance
  - Clear chat functionality
  - Modern Soft theme
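A sketch of how this layout might be declared with Gradio's Blocks API; labels and example strings below are placeholders, not the project's actual text:

```python
import gradio as gr

with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown("# CSVQuery-RAG")                  # Markdown header
    chatbot = gr.Chatbot(height=600)               # chat interface, 600px tall
    msg = gr.Textbox(label="Ask about the data")   # query input textbox
    with gr.Row():
        submit = gr.Button("Submit")
        clear = gr.Button("Clear")
    gr.Examples(["Which columns does the dataset contain?"], inputs=msg)
```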
- Data Loading
  - CSV file reading using pandas
  - Column information extraction
  - Data validation and error handling
- Text Processing
  - Document creation from CSV rows with enhanced metadata
  - Text chunking (300 chars, 50 overlap) with CSV-specific separators
  - Rich metadata tracking (chunk ID, source, columns)
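The chunking parameters can be illustrated with a simplified sliding-window splitter. The real system uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers to break on CSV-specific separators rather than at fixed offsets:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50):
    """Simplified fixed-window splitter illustrating chunk_size/overlap.
    Consecutive chunks share `overlap` characters of context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```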
- Vector Store
  - Document embedding using OpenAI with metadata enrichment
  - Chroma vector store with optimized HNSW (construction_ef: 200, search_ef: 100)
  - MMR retrieval (k=5, fetch_k=20, lambda_mult=0.7)
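These retrieval settings map directly onto the keyword arguments that LangChain's `as_retriever()` accepts; a sketch of how they would be wired up:

```python
# Retrieval configuration matching the values listed above.
MMR_SEARCH_KWARGS = {
    "k": 5,              # documents returned to the chain
    "fetch_k": 20,       # candidates fetched before MMR re-ranking
    "lambda_mult": 0.7,  # 1.0 = pure relevance, 0.0 = pure diversity
}

# In the real system, roughly (vector_store being the Chroma instance):
# retriever = vector_store.as_retriever(search_type="mmr",
#                                       search_kwargs=MMR_SEARCH_KWARGS)
```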
- Query Processing
  - Query validation with keyword filtering
  - Context addition with column information
  - Response generation with timeout handling
  - Memory management
  - System prompt enforcement of query boundaries
- Python 3.8+
- OpenAI API access
- Required Python packages (see `requirements.txt`):
  - langchain
  - langchain_openai
  - langchain_community
  - gradio
  - pandas
  - chromadb
  - python-dotenv
- `.env` file with `OPENAI_API_KEY`
- CSV data file in `backend/db/`
- Sufficient system memory for vector operations
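The API key check can be sketched as below. The injectable `env` parameter is an illustration-only convenience; the real code reads `os.environ` after python-dotenv has loaded `.env`:

```python
import os

def load_api_key(env=None) -> str:
    """Fail fast if OPENAI_API_KEY is not configured.
    `env` defaults to os.environ; it is injectable only so this
    sketch can be exercised without touching the real environment."""
    env = os.environ if env is None else env
    key = str(env.get("OPENAI_API_KEY", "")).strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is missing; add it to your .env file")
    return key
```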
- Vector Store
  - Optimized chunk size (300) and overlap (50)
  - MMR search for balanced diversity and relevance
  - Enhanced HNSW parameters for improved accuracy
- Query Processing
  - 60-second timeout for complex queries
  - Memory management for conversations
  - Efficient chat history handling
- UI Performance
  - Responsive chat interface
  - Efficient message handling
  - Example queries for better UX
- API Key Management
  - Environment variable validation
  - Explicit error handling
  - Secure key loading
- Input Validation
  - Query sanitization
  - Out-of-scope detection via system prompts
  - Two-layer validation (keyword filtering and system prompt enforcement)
  - Error boundary implementation
- Data Privacy
  - Local vector store
  - No external data storage
  - Secure conversation handling
- System Initialization
  - CSV file validation
  - API key verification
  - Comprehensive logging
- Query Processing
  - Invalid query detection
  - Timeout management (30s)
  - Graceful error handling
- UI Error Handling
  - Empty input handling
  - Clear chat functionality
  - Error message display
- Local Deployment (`python run.py`)
  - Runs on 0.0.0.0:7860
  - Accessible via web browser
  - Development mode support
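A minimal sketch of what `run.py` might look like; the `frontend.app` import path and the `demo` name are assumptions based on the file layout described above:

```python
# run.py sketch (assumed entry point): serves the Gradio app on all
# interfaces at port 7860, matching the deployment notes above.
from frontend.app import demo  # assumed import path

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
```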
- Production Considerations
  - Environment configuration
  - Comprehensive logging
  - Error handling
  - Memory management
  - Security measures
- Code Updates
  - Version control
  - Dependency management
  - API compatibility
- Data Updates
  - CSV file updates
  - Vector store rebuilding
  - System reinitialization
- Performance Monitoring
  - Query response times
  - Memory usage
  - Error rates
- Potential Improvements
  - Multiple file support
  - Advanced query capabilities
  - Enhanced error handling
  - UI customization options
  - Refined system prompts for even better query boundary enforcement
- Scalability Options
  - Distributed processing
  - Caching mechanisms
  - Load balancing
- Feature Additions
  - Data visualization
  - Export capabilities
  - Advanced analytics