Information Retrieval System with Retrieval-Augmented Generation (RAG)

A Complete End-to-End Information Retrieval and Generation System for the Technology and Software Domain

Executive Summary

This project implements an information retrieval system that combines several retrieval strategies with Retrieval-Augmented Generation (RAG) to produce citation-backed answers to user queries. A multi-criterion check (result quantity, relevance quality, and semantic overlap) flags insufficient local search results and, when it fires, augments them with web results. The retrieved documents are then passed to a Large Language Model (LLM) through the RAG pipeline, which generates a natural-language answer and extracts its citations.

Key Features:

🔍 Multi-Method Local Search: LSI (Latent Semantic Indexing), TF-IDF, and vector similarity-based retrieval
🌐 Intelligent Web Augmentation: Automatic web search fallback with DuckDuckGo integration
🤖 RAG-Based Answer Generation: Leverages Ollama LLMs for citation-backed responses
📊 Automatic Crawling: Six specialized web crawlers for continuous document acquisition
✅ Comprehensive Evaluation: IR metrics (MAP, MRR, NDCG, Precision, Recall)
🎯 Insufficiency Detection: Quantity, quality, and semantic overlap criteria
📚 Full Citation Tracking: Automatic extraction and formatting of citations
⭐ Recommendation Module: Content-based recommendations and automatic personalization from the latest user searches

Project Structure

Information-Retrieval-System-Project/
├── src/                              # Source code
│   ├── main_orchestator.py           # Main orchestration layer (core API)
│   ├── main.py                       # Command-line interface
│   ├── rag_cli.py                    # RAG-specific CLI wrapper
│   │
│   ├── sri/                          # Core SRI (Sistema de Recuperación de Información)
│   │   ├── pipeline.py               # Main SRI pipeline orchestrator
│   │   ├── __init__.py
│   │   ├── crawler/                  # Web document acquisition
│   │   │   ├── base.py               # Base crawler class
│   │   │   ├── caller.py             # Crawler execution manager
│   │   │   ├── items.py              # Data models for crawler items
│   │   │   ├── pipeline.py           # Scrapy pipeline
│   │   │   ├── settings.py           # Crawler configuration
│   │   │   ├── spiders/              # 6 specialized web crawlers
│   │   │   │   ├── devto.py
│   │   │   │   ├── hackernews.py
│   │   │   │   ├── realpython.py
│   │   │   │   ├── lobsters.py
│   │   │   │   ├── thenewstack.py
│   │   │   │   └── theverge.py
│   │   │   └── __main__.py
│   │   │
│   │   └── web_search/               # Web search augmentation
│   │       ├── checker.py            # Insufficiency detection logic
│   │       ├── searcher.py           # DuckDuckGo web search
│   │       ├── indexer.py            # Web result processing
│   │       └── pipeline.py           # Web search pipeline
│   │
│   ├── rag/                          # Retrieval-Augmented Generation
│   │   ├── rag_module.py             # Main RAG orchestrator
│   │   ├── llm_provider.py           # LLM provider abstraction (Ollama)
│   │   ├── config.py                 # RAG configuration
│   │   ├── prompt_templates.py       # Prompt engineering templates
│   │   ├── citations.py              # Citation extraction and formatting
│   │   ├── output_parser.py          # RAG response parsing
│   │   └── __init__.py
│   │
│   ├── retrieval/                    # Document retrieval methods
│   │   ├── lsi_model.py              # Latent Semantic Indexing
│   │   ├── vector_store.py           # Vector storage (ChromaDB/custom backend)
│   │   └── __pycache__/
│   │
│   ├── indexing/                     # Text indexing
│   │   ├── indexer.py                # Inverted index and TF-IDF
│   │   └── __pycache__/
│   │
│   ├── ranking/                      # Result ranking
│   │   ├── ranking.py                # Multi-signal ranking engine
│   │   └── __pycache__/
│   │
│   ├── evaluation/                   # System evaluation
│   │   ├── evaluation.py             # IR metrics computation
│   │   └── __pycache__/
│   │
│   ├── recommendation/               # Optional recommendation module
│   │   ├── recommender.py            # Content-based recommendation engine
│   │   ├── user_history.py           # Search history persistence
│   │   └── __init__.py
│   │
│   ├── app/                          # Frontend application
│   └── acquisition/                  # Placeholder for future data acquisition
│
├── data/                             # Data storage
│   ├── documents.json                # Consolidated document collection
│   ├── user_history.json             # Local search history for automatic recommendations
│   ├── evaluation/                   # Evaluation results
│   ├── index/                        # Vector store and indices
│   │   └── chroma.sqlite3            # ChromaDB persistence
│   ├── processed/                    # Processed documents
│   └── qrels/                        # Relevance judgments
│
├── tests/                            # Test suite
│   ├── sri/
│   │   ├── test_system.py            # System integration tests
│   │   ├── crawler/                  # Crawler tests
│   │   ├── indexer/                  # Indexing tests
│   │   ├── ranking/                  # Ranking tests
│   │   ├── retrieval/                # Retrieval tests
│   │   ├── vectordb/                 # Vector store tests
│   │   ├── rag/                      # RAG tests
│   │   └── web_search/               # Web search tests
│
├── docs/                             # Documentation
│   ├── ARCHITECTURE_ANALYSIS.md      # Detailed architecture analysis
│   ├── FRONTEND_IMPLEMENTATION_PLAN.md  # Gradio interface specification
│   ├── RECOMMENDATION_MODULE.md      # Recommendation module documentation
│   ├── PRE_RAG_STATUS.md             # Pre-RAG implementation status
│   └── RAG_IMPLEMENTATION_PLAN.md    # RAG implementation details
│
├── pyproject.toml                    # Project metadata and dependencies
├── Makefile                          # Build and development commands
├── docker-compose.yml                # Docker orchestration
├── Dockerfile                        # Container specification
└── README.md                         # This file

Key Directory Functions

| Directory | Purpose | Key Files |
| --- | --- | --- |
| `src/` | All source code | Python modules and CLI entry points |
| `src/sri/` | Core retrieval system | Pipeline, crawlers, web search |
| `src/rag/` | Answer generation | LLM integration, prompts, citations |
| `src/retrieval/` | Document retrieval | LSI, vector storage, embeddings |
| `data/` | Persistent data storage | Documents, indices, evaluation results |
| `tests/` | Test suite | Unit and integration tests |
| `docs/` | Project documentation | Architecture, implementation plans |

Architecture and Components

1. System Overview

The system operates in the following pipeline:

User Query
    ↓
┌─────────────────────────────────────────────┐
│ MainOrchestrator (main_orchestator.py)      │
│ ┌──────────────────────────────────────┐   │
│ │ 1. Database Health Check            │   │
│ │    • Check indexed document count   │   │
│ │    • Verify vector store availability│   │
│ └──────────────────────────────────────┘   │
│           ↓                                  │
│ ┌──────────────────────────────────────┐   │
│ │ 2. Local Search (SRIPipeline)        │   │
│ │    • LSI (semantic similarity)       │   │
│ │    • TF-IDF (term frequency)         │   │
│ │    • Vector similarity (embeddings)  │   │
│ └──────────────────────────────────────┘   │
│           ↓                                  │
│ ┌──────────────────────────────────────┐   │
│ │ 3. Insufficiency Detection           │   │
│ │    • Quantity: Few results?          │   │
│ │    • Quality: Low relevance scores?  │   │
│ │    • Semantic: Keyword overlap?      │   │
│ └──────────────────────────────────────┘   │
│           ↓                                  │
│ ┌──────────────────────────────────────┐   │
│ │ 4. Conditional Web Search (Optional) │   │
│ │    • DuckDuckGo augmentation         │   │
│ │    • Persist results for future use  │   │
│ └──────────────────────────────────────┘   │
│           ↓                                  │
│ ┌──────────────────────────────────────┐   │
│ │ 5. Document Consolidation            │   │
│ │    • Merge local + web results       │   │
│ │    • Deduplication                   │   │
│ └──────────────────────────────────────┘   │
│           ↓                                  │
│ ┌──────────────────────────────────────┐   │
│ │ 6. RAG Generation (RAGModule)        │   │
│ │    • Prompt template application     │   │
│ │    • Ollama LLM inference            │   │
│ │    • Citation extraction             │   │
│ └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘
    ↓
RAGResponse (answer + citations)
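
Step 5 of the pipeline (merge plus deduplication) can be illustrated with a minimal sketch. The function name `consolidate`, the choice of the URL as the deduplication key, and the rule that local results win over web copies are all assumptions made for illustration; the project's own consolidation logic may differ.

```python
from typing import Any


def consolidate(
    local_docs: list[dict[str, Any]],
    web_docs: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Merge local and web results, keeping the first occurrence of each document."""
    seen: set[str] = set()
    merged: list[dict[str, Any]] = []
    # Local results come first, so they win when a web result has the same key.
    for doc in [*local_docs, *web_docs]:
        # Hypothetical identity rule: normalized URL, falling back to the id.
        key = (doc.get("url") or doc.get("id") or "").strip().lower().rstrip("/")
        if key:
            if key in seen:
                continue
            seen.add(key)
        # Documents without any usable key are kept as-is.
        merged.append(doc)
    return merged


local = [{"id": "doc1", "url": "https://example.com/a", "title": "A"}]
web = [
    {"id": "web1", "url": "https://example.com/a/", "title": "A (web copy)"},
    {"id": "web2", "url": "https://example.com/b", "title": "B"},
]
merged = consolidate(local, web)
print([d["id"] for d in merged])
```

Here the web copy of the first article is dropped because its URL differs from the local one only by a trailing slash.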

2. Core Components

MainOrchestrator (`src/main_orchestator.py`)

Central API for the entire system. Orchestrates all operations and exposes a clean interface independent of the front-end technology (CLI, GUI, REST API, etc.).

Key Methods:

| Method | Purpose | Parameters | Returns |
| --- | --- | --- | --- |
| `query()` | Complete end-to-end query pipeline | `question, max_local_results, enable_web_search, auto_reload` | `RAGResponse` |
| `retrieve_documents()` | Document retrieval without RAG | Same as `query()` | `Dict[str, Any]` with documents |
| `augment_response()` | RAG generation from documents | `question, documents` | `RAGResponse` |
| `clear_all_indices()` | Clear database | None | `Dict[str, Any]` status |
| `load_documents_from_crawlers()` | Execute crawlers and index | `max_articles, force_recrawl` | `Dict[str, Any]` statistics |
| `check_database_health()` | Database readiness check | None | `Dict[str, Any]` health metrics |
| `detect_insufficiency_for_query()` | Insufficiency detection | `query, results` | `Dict[str, Any]` assessment |
| `evaluate_test()` | System evaluation | `test_spec` (optional) | `Dict[str, Any]` IR metrics |
| `get_status()` | System diagnostics | None | `Dict[str, Any]` complete status |

Internal Architecture:

Database Management: Manages VectorStore, ChromaDB, LSI models, and consolidated documents
Search Pipeline: Orchestrates local search via SRIPipeline
Insufficiency Detection: Multi-criterion analysis (quantity, quality, semantic)
Web Augmentation: Conditional DuckDuckGo integration
RAG Integration: Passes documents to RAGModule for generation
Crawlers: Manages six specialized web crawlers for document acquisition

SRIPipeline (`src/sri/pipeline.py`)

Unified interface for local document search.

Methods:

index(documents, save=True) — Index documents using LSI, TF-IDF, and vector storage
search(query, top_k=10) — Retrieve relevant documents using multi-method scoring
evaluate(test_set) — Compute IR evaluation metrics

Integrated Components:

InvertedIndex (src/indexing/indexer.py) — Vocabulary, posting lists, TF-IDF
LSIModel (src/retrieval/lsi_model.py) — Latent semantic analysis
VectorStore (src/retrieval/vector_store.py) — ChromaDB or custom backend
RankingEngine (src/ranking/ranking.py) — Multi-signal result ranking
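
To make the TF-IDF part of the search concrete, here is a dependency-free sketch of TF-IDF weighting with cosine-similarity ranking. It is not the project's `InvertedIndex`; the helper names (`tokenize`, `build_tfidf`, `search`) are hypothetical, and LSI and embedding-based scoring are left out. The smoothed IDF formula is the one scikit-learn applies by default.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf(docs: dict[str, str]) -> tuple[dict[str, float], dict[str, dict[str, float]]]:
    """Return (idf, vectors): IDF per term and an L2-normalized TF-IDF vector per document."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors: dict[str, dict[str, float]] = {}
    for doc_id, toks in tokens.items():
        weights = {t: tf * idf[t] for t, tf in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in weights.items()}
    return idf, vectors


def search(query: str, idf, vectors, top_k: int = 10) -> list[tuple[str, float]]:
    """Rank documents by cosine similarity between query and document vectors."""
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    scores = {
        doc_id: sum((w / norm) * vec.get(t, 0.0) for t, w in q.items())
        for doc_id, vec in vectors.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(doc_id, s) for doc_id, s in ranked if s > 0][:top_k]


docs = {
    "d1": "Latent semantic indexing reduces the term-document matrix",
    "d2": "ChromaDB stores embeddings for vector similarity search",
    "d3": "Scrapy spiders crawl technology news sites",
}
idf, vectors = build_tfidf(docs)
results = search("vector similarity", idf, vectors)
print(results)
```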

RAGModule (`src/rag/rag_module.py`)

Orchestrates the Retrieval-Augmented Generation pipeline.

Methods:

generate(query, documents, temperature, max_tokens) — Generate answer with citations

Components:

LLMProvider (src/rag/llm_provider.py) — Ollama interface
PromptTemplateFactory (src/rag/prompt_templates.py) — Template selection (basic, domain_specific, chain_of_thought)
CitationExtractor (src/rag/citations.py) — Automatic citation extraction
OutputParser (src/rag/output_parser.py) — Response structuring
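
As an illustration of what citation extraction involves, the sketch below maps bracketed markers in a generated answer back to the documents that were passed to the model. The `[n]` marker format, the `Citation` fields, and the function name are assumptions for this example and are not taken from `src/rag/citations.py`.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    index: int  # 1-based position of the cited document in the prompt
    title: str
    url: str


def extract_citations(answer: str, documents: list[dict], max_cites: int = 10) -> list[Citation]:
    """Collect unique, valid [n] markers in order of first appearance."""
    citations: list[Citation] = []
    seen: set[int] = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        idx = int(match.group(1))
        # Skip repeats and markers that point outside the document list.
        if idx in seen or not 1 <= idx <= len(documents):
            continue
        seen.add(idx)
        doc = documents[idx - 1]
        citations.append(Citation(idx, doc.get("title", ""), doc.get("url", "")))
        if len(citations) == max_cites:
            break
    return citations


documents = [
    {"title": "TF-IDF explained", "url": "https://example.com/tfidf"},
    {"title": "LSI in practice", "url": "https://example.com/lsi"},
]
answer = (
    "LSI projects documents into a latent space [2]. "
    "TF-IDF weights rare terms higher [1][2]. See also [7]."
)
citations = extract_citations(answer, documents)
print([(c.index, c.title) for c in citations])
```

The out-of-range marker `[7]` is ignored rather than raising, which keeps a hallucinated reference from breaking the response.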

Web Search and Insufficiency Detection (`src/sri/web_search/`)

Automatic augmentation of local results with web search.

SufficiencyChecker (checker.py) — Multi-criterion insufficiency detection
WebSearcher (searcher.py) — DuckDuckGo integration
WebIndexer (indexer.py) — Web result processing and formatting
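
The three criteria can be sketched as follows. This is an illustrative stand-in, not the `SufficiencyChecker` implementation: the function name, the `score`/`title`/`content` result fields, and the `min_overlap` threshold are assumptions, while the defaults of 5 results and 0.3 average score mirror the values listed in the configuration reference.

```python
import re
from typing import Any


def assess_sufficiency(
    query: str,
    results: list[dict[str, Any]],
    min_results: int = 5,
    min_avg_score: float = 0.3,
    min_overlap: float = 0.2,  # hypothetical threshold
) -> dict[str, Any]:
    """Flag local results as insufficient by quantity, quality, or keyword overlap."""
    reasons: list[str] = []

    # Quantity: too few results.
    if len(results) < min_results:
        reasons.append("quantity")

    # Quality: low average relevance score.
    scores = [r.get("score", 0.0) for r in results]
    avg_score = sum(scores) / len(scores) if scores else 0.0
    if avg_score < min_avg_score:
        reasons.append("quality")

    # Semantic: share of query terms that appear anywhere in the results.
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    text = " ".join(f"{r.get('title', '')} {r.get('content', '')}" for r in results)
    result_terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    overlap = len(query_terms & result_terms) / len(query_terms) if query_terms else 0.0
    if overlap < min_overlap:
        reasons.append("semantic")

    return {
        "is_insufficient": bool(reasons),
        "reasons": reasons,
        "avg_score": avg_score,
        "keyword_overlap": overlap,
    }


assessment = assess_sufficiency(
    "rust async runtime",
    [{"title": "Python async programming", "content": "asyncio event loop", "score": 0.42}],
)
print(assessment)
```

In this example only the quantity criterion fires: one result is below the minimum of five, but its score and keyword overlap pass their thresholds.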

Web Crawlers (`src/sri/crawler/`)

Six specialized crawlers for continuous document acquisition:

DevTo — Technology articles and tutorials
HackerNews — Tech news and discussions
RealPython — Python programming resources
Lobsters — Software engineering news
TheNewStack — Cloud-native and DevOps content
TheVerge — Tech product reviews and news

Orchestrated by:

CrawlerCaller (caller.py) — Execution manager
CrawlerSettings (settings.py) — Configuration

System Requirements

Python Environment

Python: 3.10 or higher
Package Manager: pip or uv (recommended)

External Dependencies

Runtime Requirements

Ollama: For LLM inference (required for RAG)
- Default URL: http://localhost:11434
- Default Model: llama3.2:latest (customizable)
- Installation: https://ollama.ai

Core Libraries

See pyproject.toml for complete dependency list.

Critical Dependencies:

scikit-learn>=1.6 — LSI, TF-IDF, machine learning
chromadb>=0.5 — Vector database with embeddings
sentence-transformers>=2.7 — Text embeddings
langchain>=0.2 — LLM integration framework
duckduckgo-search>=6.0 — Web search fallback
scrapy>=2.11 — Web crawling framework
nltk>=3.8 — Natural language processing
gradio>=4.0 — Web interface framework

System Resources

RAM: Minimum 4GB (8GB recommended)
Storage: 2GB+ for document indexing and vector store
Network: Required for crawler and web search operations

Installation and Setup

1. Clone Repository

git clone https://github.com/D4R102004/Information-Retrieval-System-Project.git
cd Information-Retrieval-System-Project

2. Create Virtual Environment

# Using Python venv
python3.10 -m venv .venv

# Activate environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

3. Install Dependencies

# Using pip
pip install -e ".[dev]"

# OR using uv (faster, recommended)
uv pip install -e ".[dev]"

4. Install Ollama

# Visit https://ollama.ai and download the installer for your OS

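# Start the Ollama service (runs in the foreground; skip if it is already running)
ollama serve

# In another terminal: download the default model, then verify that it is listed
ollama pull llama3.2:latest
ollama list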

5. Initialize Pre-commit Hooks (Optional)

pre-commit install
pre-commit install --hook-type commit-msg

6. Verify Installation

python -c "from main_orchestator import MainOrchestrator; print('[OK] MainOrchestrator imported successfully')"

CLI Usage Guide

Command Syntax

The CLI is accessed via src/main.py with the following structure:

python src/main.py [MODE] [OPTIONS]

Available Modes

1. Single Query Execution

Execute a single query and display results:

python src/main.py --query "How does machine learning work?"

Output includes:

Formatted answer with proper markdown rendering
Citation list with sources and snippets
Metadata (local documents, web documents, generation time)

With custom parameters:

# Use more local results
python src/main.py --query "Python async programming" --max-local 10

# Disable web search augmentation
python src/main.py --query "Your question" --no-web-search

# Increase verbosity for debugging
python src/main.py --query "Your question" -v

# Write logs to file
python src/main.py --query "Your question" --log-file system.log

2. Interactive Mode

Enter a loop where you can ask multiple queries sequentially:

python src/main.py --interactive

Interactive commands:

ask <query>      - Submit a query
status           - Display database health
load             - Load documents from crawlers
clear            - Clear database (with confirmation)
help             - Show help message
exit             - Exit application

Example session:

ask What is LSI?
ask How does ChromaDB work?
status
load
exit

3. Database Status

Display current database health and statistics:

python src/main.py --status

Output:

========================================================
DATABASE STATUS
========================================================
Status:                healthy / degraded / empty
Indexed Documents:     1250 (Min required: 500)
File Documents:        1250 (data/documents.json)
Raw Documents:         0 (data/raw/)
ChromaDB Available:    Yes
VectorStore Count:     1250
LLM Connection:        ✓ Connected (latency: 45ms)
Last Update:           2025-05-31 14:23:15 UTC
========================================================

4. Load Data from Crawlers

Execute the full crawl → consolidate → index pipeline:

# Standard load (respects existing crawls)
python src/main.py --load-data

# Force recrawl (ignore cache)
python src/main.py --load-data --force

# Load with custom limits
python src/main.py --load-data --max-articles 500

Output:

Crawler execution status for each of 6 spiders
Consolidated document count
Indexing progress
Final statistics

5. Clear Database

Remove all indices, models, and documents:

# With confirmation prompt
python src/main.py --clear-db

# Force clear without confirmation
python src/main.py --clear-db --force

Warning: This operation cannot be undone.

Global Options

--verbose, -v              Enable debug logging
--log-file PATH            Write logs to file
--force                    Skip confirmations
--max-articles N           Max articles per crawler (default: 100)
--max-local N              Max local search results (default: 5)
--no-web-search            Disable web search fallback

Complete Examples

Research query with full output:

python src/main.py --query "Distributed systems consensus algorithms" \
                   --max-local 10 \
                   --verbose \
                   --log-file research.log

Bulk load, then evaluate:

python src/main.py --load-data --max-articles 1000 --verbose
python src/main.py --evaluate

Interactive session with logging:

python src/main.py --interactive --log-file session.log --verbose

Backend Integration (MainOrchestrator)

Direct Python API Usage

The MainOrchestrator can be used directly in Python code without the CLI:

from main_orchestator import MainOrchestrator

# Initialize
orchestrator = MainOrchestrator()

# Single query (complete pipeline)
response = orchestrator.query(
    question="What is Retrieval-Augmented Generation?",
    max_local_results=5,
    enable_web_search=True,
    auto_reload=True
)

print(f"Answer: {response.answer}")
print(f"Citations: {len(response.citations)}")
print(f"Metadata: {response.metadata}")

Advanced Usage Examples

1. Document Retrieval Without RAG

# Get documents without generating an answer
result = orchestrator.retrieve_documents(
    question="machine learning frameworks",
    max_local_results=10,
    enable_web_search=True,
    auto_reload=False
)

documents = result['documents']
metadata = result['metadata']

print(f"Local docs: {metadata['local_documents']}")
print(f"Web docs: {metadata['web_documents']}")
print(f"Insufficiency: {metadata['insufficiency_reasons']}")

2. RAG Generation from Custom Documents

# Generate answer from pre-selected documents
documents = [
    {
        "id": "doc1",
        "title": "Deep Learning Fundamentals",
        "content": "Neural networks are...",
        "url": "https://example.com/article1"
    },
    # ... more documents
]

response = orchestrator.augment_response(
    question="Explain backpropagation",
    documents=documents
)

print(response.answer)
for citation in response.citations:
    print(f"  - {citation.title}")

3. Multi-Step Query with Custom Parameters

# Step 1: Retrieve and check insufficiency
retrieval = orchestrator.retrieve_documents(
    question="query",
    max_local_results=3,
    enable_web_search=False  # Don't use web search in this step
)

# Step 2: Check insufficiency
insufficiency = orchestrator.detect_insufficiency_for_query(
    query="query",
    results=retrieval['documents']
)

if insufficiency['is_insufficient']:
    print(f"Reasons: {insufficiency['reasons']}")
    # Optionally refine query or fetch more documents

# Step 3: Generate answer
response = orchestrator.augment_response(
    question="query",
    documents=retrieval['documents']
)

4. System Diagnostics

# Complete system status
status = orchestrator.get_status()
print(f"Database status: {status['database']['status']}")
print(f"Indexed docs: {status['database']['indexed_documents']}")
print(f"Raw docs: {status['crawlers']['raw_documents']}")

# Database health
health = orchestrator.check_database_health()
print(f"Ready to search: {health['can_search']}")

5. Database Operations

# Clear all data
result = orchestrator.clear_all_indices()
if result['success']:
    print("Database cleared")

# Load data from crawlers
load_result = orchestrator.load_documents_from_crawlers(
    max_articles=1000,
    force_recrawl=False
)
print(f"Indexed: {load_result['indexed_documents']} documents")

Configuration Reference

RAG Configuration (`src/rag/config.py`)

from rag.config import config

# View current configuration
print(config.ollama_model)      # Current model: "llama3.2:latest"
print(config.temperature)       # Generation randomness: 0.7
print(config.max_tokens)        # Max response length: 1024

# Modify configuration (at runtime)
config.temperature = 0.5
config.max_tokens = 2048

Key Parameters:

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| `ollama_model` | str | `llama3.2:latest` | - | Model identifier for inference |
| `ollama_base_url` | str | `http://localhost:11434` | - | Ollama service URL |
| `rag_template` | str | `domain_specific` | basic, domain_specific, chain_of_thought | Prompt template strategy |
| `temperature` | float | `0.7` | 0.0–1.0 | Response randomness (0=deterministic, 1=random) |
| `max_tokens` | int | `1024` | 100–4096 | Maximum response length in tokens |
| `max_cites` | int | `10` | 1–20 | Maximum citations to extract |
| `top_k_retrieval` | int | `5` | 1–20 | Documents to pass to RAG |
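
Because the configuration can be changed at runtime, it may help to check new values against the ranges in the table before using them. The sketch below shows one way to do that; the `RAGSettings` class and its `validate()` method are hypothetical and are not part of `src/rag/config.py`.

```python
from dataclasses import dataclass

ALLOWED_TEMPLATES = {"basic", "domain_specific", "chain_of_thought"}


@dataclass
class RAGSettings:
    """Hypothetical mirror of the documented parameters and their defaults."""

    ollama_model: str = "llama3.2:latest"
    ollama_base_url: str = "http://localhost:11434"
    rag_template: str = "domain_specific"
    temperature: float = 0.7
    max_tokens: int = 1024
    max_cites: int = 10
    top_k_retrieval: int = 5

    def validate(self) -> list[str]:
        """Return a list of human-readable range violations (empty if valid)."""
        errors: list[str] = []
        if self.rag_template not in ALLOWED_TEMPLATES:
            errors.append(f"rag_template must be one of {sorted(ALLOWED_TEMPLATES)}")
        if not 0.0 <= self.temperature <= 1.0:
            errors.append("temperature must be in [0.0, 1.0]")
        if not 100 <= self.max_tokens <= 4096:
            errors.append("max_tokens must be in [100, 4096]")
        if not 1 <= self.max_cites <= 20:
            errors.append("max_cites must be in [1, 20]")
        if not 1 <= self.top_k_retrieval <= 20:
            errors.append("top_k_retrieval must be in [1, 20]")
        return errors


default_errors = RAGSettings().validate()
bad_errors = RAGSettings(temperature=1.5, max_tokens=50).validate()
print(default_errors)
print(bad_errors)
```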

Crawler Configuration (`src/sri/crawler/settings.py`)

from sri.crawler.settings import CrawlerSettings

settings = CrawlerSettings()
print(settings.MIN_RESULTS_FOR_QUERY)      # Minimum results threshold: 5
print(settings.MIN_AVG_SCORE_THRESHOLD)    # Minimum avg relevance: 0.3

Key Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `MIN_RESULTS_FOR_QUERY` | int | 5 | Minimum local results before web search |
| `MIN_AVG_SCORE_THRESHOLD` | float | 0.3 | Minimum average relevance score |
| `MIN_ARTICLES_PER_SPIDER` | int | 50 | Minimum articles per crawler |

API Examples

REST API (Future Integration)

The MainOrchestrator API is designed for REST API wrapping:

# Pseudocode for FastAPI wrapper
from fastapi import FastAPI
from main_orchestator import MainOrchestrator

app = FastAPI()
orchestrator = MainOrchestrator()

@app.post("/api/query")
async def query_endpoint(question: str, max_local: int = 5):
    response = orchestrator.query(
        question=question,
        max_local_results=max_local
    )
    return response.dict()

@app.get("/api/status")
async def status_endpoint():
    return orchestrator.get_status()

Python SDK Usage

# As a Python library
import json

from main_orchestator import MainOrchestrator

class SearchClient:
    def __init__(self):
        self.orchestrator = MainOrchestrator()
    
    def search(self, question: str):
        return self.orchestrator.query(question)
    
    def evaluate(self, test_file: str):
        with open(test_file) as f:
            test_spec = json.load(f)
        return self.orchestrator.evaluate_test(test_spec)

client = SearchClient()
response = client.search("Your question")

Evaluation System

Running Evaluations

Via CLI:

python src/main.py --evaluate

Via Python API:

# Run evaluation with default test set (data/test_queries.json)
results = orchestrator.evaluate_test()

# Run evaluation with custom test set
custom_tests = {
    "test_queries": [
        {
            "query_id": "q1",
            "query": "machine learning",
            "relevant": ["doc1", "doc2"],
            "grades": {"doc1": 3, "doc2": 2}
        },
        # ... more queries
    ]
}
results = orchestrator.evaluate_test(custom_tests)

Evaluation Metrics

The system computes standard Information Retrieval metrics:

Per-Query Metrics:

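AP (Average Precision): Mean of the precision values at each rank where a relevant document appears, averaged over all relevant documents; approximates the area under the precision-recall curve
RR (Reciprocal Rank): 1 / rank of the first relevant document (0 if none is retrieved)
P@k (Precision at k): Fraction of the top-k results that are relevant
R@k (Recall at k): Fraction of all relevant documents that appear in the top-k
NDCG@k (Normalized DCG): Ranking quality with graded relevance, normalized by the ideal ordering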
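
The per-query metrics can be written out in a few lines. The functions below are a reference sketch, not the code in `src/evaluation/evaluation.py`; the NDCG variant shown uses linear gain (grade divided by log2 of rank + 1), which is an assumption, since an exponential-gain variant is also common.

```python
import math


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return sum(d in relevant for d in ranked[:k]) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return sum(d in relevant for d in ranked[:k]) / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked: list[str], grades: dict[str, int], k: int) -> float:
    dcg = sum(grades.get(d, 0) / math.log2(i + 1) for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


ranked = ["doc2", "docX", "doc1", "docY"]  # system output, best first
relevant = {"doc1", "doc2"}
grades = {"doc1": 3, "doc2": 2}

ap = average_precision(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)
p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
ndcg3 = ndcg_at_k(ranked, grades, 3)
print(round(ap, 3), rr, round(p3, 3), r3, round(ndcg3, 3))
```

MAP and MRR are then the arithmetic means of AP and RR over all test queries.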

Aggregate Metrics:

MAP (Mean Average Precision)
MRR (Mean Reciprocal Rank)
Mean P@1, @3, @5, @10
Mean R@1, @3, @5, @10
Mean NDCG@1, @3, @5, @10

Evaluation Output:

{
    "status": "success",
    "timestamp": "2025-05-31T14:30:00Z",
    "execution_time_seconds": 45.3,
    "aggregate": {
        "num_queries": 10,
        "MAP": 0.652,
        "MRR": 0.847,
        "mean_P@1": 0.7,
        "mean_R@5": 0.432,
        "mean_NDCG@5": 0.715
    },
    "per_query": [
        {
            "query_id": "q1",
            "num_relevant": 5,
            "ap": 0.85,
            "rr": 1.0,
            "p@5": 0.8,
            "r@5": 0.4,
            "ndcg@5": 0.895
        },
        // ... more queries
    ]
}

Test File Format

data/test_queries.json:

{
    "test_queries": [
        {
            "query_id": "q1",
            "query": "search query text",
            "relevant": ["doc_id1", "doc_id2", "doc_id3"],
            "grades": {
                "doc_id1": 3,
                "doc_id2": 2,
                "doc_id3": 1
            }
        },
        // ... more test queries
    ]
}
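
A test file in this format can be sanity-checked before an evaluation run. The validator below is a hypothetical helper, not part of the project; the rule that every graded document must also appear in `relevant` is an assumption about how the two fields relate.

```python
import json


def validate_test_spec(spec: dict) -> list[str]:
    """Return a list of problems found in a test specification (empty if none)."""
    queries = spec.get("test_queries")
    if not isinstance(queries, list) or not queries:
        return ["'test_queries' must be a non-empty list"]

    problems: list[str] = []
    seen_ids: set[str] = set()
    for position, entry in enumerate(queries):
        qid = entry.get("query_id", f"#{position}")
        if qid in seen_ids:
            problems.append(f"{qid}: duplicate query_id")
        seen_ids.add(qid)
        if not entry.get("query"):
            problems.append(f"{qid}: missing query text")
        relevant = set(entry.get("relevant", []))
        if not relevant:
            problems.append(f"{qid}: no relevant documents listed")
        # Assumed rule: grades may only refer to documents listed as relevant.
        extra = set(entry.get("grades", {})) - relevant
        if extra:
            problems.append(f"{qid}: graded but not listed as relevant: {sorted(extra)}")
    return problems


raw = """
{
    "test_queries": [
        {"query_id": "q1", "query": "machine learning",
         "relevant": ["doc1", "doc2"], "grades": {"doc1": 3, "doc2": 2}},
        {"query_id": "q2", "query": "vector databases",
         "relevant": ["doc3"], "grades": {"doc3": 2, "doc9": 1}}
    ]
}
"""
problems = validate_test_spec(json.loads(raw))
print(problems)
```

Note that a real file must be strict JSON: `json.load` rejects `//` comments such as the placeholder lines in the format listing.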

Troubleshooting

Common Issues

1. Database Empty Error

Problem: "Database is empty. A search is impossible to perform"

Solutions:

# Load data from crawlers
python src/main.py --load-data

# Or via Python API
orchestrator.load_documents_from_crawlers()

Check minimum documents:

python src/main.py --status

Ensure that Indexed Documents ≥ 500 (the threshold is configurable via self.settings["min_documents"]).

2. Ollama Connection Failed

Problem: "Failed to connect to Ollama service"

Solutions:

# 1. Start Ollama service
ollama serve

# 2. Verify installation
ollama list

# 3. Pull required model
ollama pull llama3.2:latest

# 4. Check URL configuration
# Default: http://localhost:11434
# Modify in src/rag/config.py if different

3. Low RAG Quality / Weak Citations

Problem: Generated answers lack relevant citations

Solutions:

Increase local search results:

python src/main.py --query "Your question" --max-local 10

Verify document relevance:
```
python src/main.py --status
```

Adjust RAG parameters:

from rag.config import config
config.temperature = 0.3  # More deterministic
config.max_tokens = 2048  # Longer responses

Use different prompt template:

rag = RAGModule(llm, template_type="chain_of_thought")

4. Crawler Failures

Problem: "Crawler execution failed"

Solutions:

# Check network connectivity
ping duckduckgo.com

# Force recrawl
python src/main.py --load-data --force

# Re-run with debug logging to see which spider fails and why
python src/main.py --load-data --verbose

5. Memory Issues During Indexing

Problem: Out of memory when loading large datasets

Solutions:

Reduce documents per crawler:

python src/main.py --load-data --max-articles 100

Use ChromaDB (more memory-efficient):
- Install: pip install chromadb
- System will auto-detect and use it

Increase system RAM or use pagination in custom code

Development and Testing

Running Tests

# Run all tests
make test

# Run specific test module
pytest tests/sri/test_system.py -v

# Run with coverage
pytest --cov=src tests/

# Run specific test function
pytest tests/sri/retrieval/test_lsi_model.py::test_lsi_fitting -v

Code Quality

# Format code
make format

# Lint code
make lint

# Full check
make check

Development Commands

# View all available commands
make help

# Install development dependencies
make install

# Clean cache files
make clean

Team and Acknowledgments

Project Members:

Darío Francisco Alfonso (@D4R102004)
Juan Carlos Carmenate (@Juank404)
Sebastian González Alfonso (@sebagonz106)

Advisors:

University of Havana, Faculty of Mathematics and Computer Science

Contributions: We acknowledge the open-source community for the excellent libraries that power this system: scikit-learn, Ollama, ChromaDB, Scrapy, and many others.

Bibliography

Core References

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Retrieval-Augmented Generation

Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.

Vector Databases and Embeddings

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Information Retrieval Evaluation

Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search (2nd ed.). Addison-Wesley.

Web Crawling and Data Acquisition

Scrapy Documentation. https://docs.scrapy.org/

License

Academic Project
Universidad de La Habana, 2025–2026

This project is developed as an academic exercise in Information Retrieval systems and Natural Language Processing. All code is provided as-is for educational purposes.

Quick Start

Get started in 5 minutes:

# 1. Install
pip install -e ".[dev]"

# 2. Start Ollama in a separate terminal (skip if the service is already running)
ollama serve
#    Then pull the default model
ollama pull llama3.2:latest

# 3. Load data
python src/main.py --load-data

# 4. Ask a question
python src/main.py --query "What is machine learning?"

# 5. Try interactive mode
python src/main.py --interactive

For detailed architecture documentation, see:

docs/INDEX.md — Documentation structure guide
docs/ARCHITECTURE_ANALYSIS.md — Technical deep-dive

Recommendation System

The project now includes a content-based recommendation module under src/recommendation/. It recommends documents using a hybrid score composed of:

TF-IDF content similarity over title, tags, source, and article content.
Optional user interests or current query text.
Optional liked/seed document IDs.
Freshness and source-prior signals.

Main backend methods exposed through MainOrchestrator:

orchestrator.recommend_documents(
    query="serverless websocket apps",
    interests="cloud computing, javascript, APIs",
    liked_doc_ids=["010a7286-edfa-4143-9e46-462829787546"],
    top_k=10,
)

orchestrator.recommend_similar_documents(
    document_id="010a7286-edfa-4143-9e46-462829787546",
    top_k=10,
)

The Gradio UI also includes a new Recommendation tab where the user can generate profile-based recommendations or find documents similar to a selected document ID.

Automatic recommendations from recent searches

The recommendation module also persists user search behavior in data/user_history.json. Every successful search executed from the Search tab is recorded with:

the search query,
the timestamp,
the document IDs retrieved by that search.

The method recommend_from_history() builds an automatic user profile from the latest 5 searches and recommends new documents from the local corpus:

orchestrator.recommend_from_history(
    user_id="default",
    top_k=10,
    history_limit=5,
)
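
The profile-building step can be pictured with a short sketch. The record field names (`query`, `timestamp`, `doc_ids`) and the idea of excluding already-retrieved documents are assumptions for illustration; the stored format of data/user_history.json may differ.

```python
def build_profile(history: list[dict], history_limit: int = 5) -> tuple[str, set[str]]:
    """Build a text profile from the most recent searches.

    Returns the concatenated recent queries and the set of document IDs the
    user has already seen, which a recommender could exclude from its output.
    """
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    recent = sorted(history, key=lambda entry: entry["timestamp"])[-history_limit:]
    profile_text = " ".join(entry["query"] for entry in recent)
    seen = {doc_id for entry in recent for doc_id in entry["doc_ids"]}
    return profile_text, seen


history = [
    {"query": f"q{i}", "timestamp": f"2025-05-{i:02d}T10:00:00Z", "doc_ids": [f"d{i}"]}
    for i in range(1, 7)
]
profile_text, seen = build_profile(history)
print(profile_text)
print(sorted(seen))
```

The resulting profile text can then be scored against the corpus in the same way as an explicit query or interest string.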

In the Gradio UI, each completed Search tab query automatically refreshes the Automatic recommendations section inside the Recommendation tab. The user can also manually regenerate those recommendations or clear the stored search history from that same tab.

For the complete technical documentation of this module, see docs/RECOMMENDATION_MODULE.md.
