A Complete End-to-End Information Retrieval and Generation System for the Technology and Software Domain
This project implements an information retrieval system that combines multiple retrieval strategies with Retrieval-Augmented Generation (RAG) to produce citation-backed answers to user queries. The system checks local search results against quantity, quality, and semantic-overlap criteria and, when they fall short, augments them with web results. The retrieved documents are then passed to a Large Language Model (LLM) through the RAG pipeline, which generates a natural language answer with extracted citations.
Key Features:
- 🔍 Multi-Method Local Search: LSI (Latent Semantic Indexing), TF-IDF, and vector similarity-based retrieval
- 🌐 Intelligent Web Augmentation: Automatic web search fallback with DuckDuckGo integration
- 🤖 RAG-Based Answer Generation: Leverages Ollama LLMs for citation-backed responses
- 📊 Automatic Crawling: Six specialized web crawlers for continuous document acquisition
- ✅ Comprehensive Evaluation: IR metrics (MAP, MRR, NDCG, Precision, Recall)
- 🎯 Insufficiency Detection: Quantity, quality, and semantic overlap criteria
- 📚 Full Citation Tracking: Automatic extraction and formatting of citations
- ⭐ Recommendation Module: Content-based recommendations and automatic personalization from the latest user searches
- Project Structure
- Architecture and Components
- System Requirements
- Installation and Setup
- CLI Usage Guide
- Backend Integration (MainOrchestrator)
- Configuration Reference
- API Examples
- Evaluation System
- Recommendation System
- Troubleshooting
- Team and Acknowledgments
- Bibliography
- License
Information-Retrieval-System-Project/
├── src/ # Source code
│ ├── main_orchestator.py # Main orchestration layer (core API)
│ ├── main.py # Command-line interface
│ ├── rag_cli.py # RAG-specific CLI wrapper
│ │
│ ├── sri/ # Core SRI (Sistema de Recuperación de Información)
│ │ ├── pipeline.py # Main SRI pipeline orchestrator
│ │ ├── __init__.py
│ │ ├── crawler/ # Web document acquisition
│ │ │ ├── base.py # Base crawler class
│ │ │ ├── caller.py # Crawler execution manager
│ │ │ ├── items.py # Data models for crawler items
│ │ │ ├── pipeline.py # Scrapy pipeline
│ │ │ ├── settings.py # Crawler configuration
│ │ │ ├── spiders/ # 6 specialized web crawlers
│ │ │ │ ├── devto.py
│ │ │ │ ├── hackernews.py
│ │ │ │ ├── realpython.py
│ │ │ │ ├── lobsters.py
│ │ │ │ ├── thenewstack.py
│ │ │ │ └── theverge.py
│ │ │ └── __main__.py
│ │ │
│ │ └── web_search/ # Web search augmentation
│ │ ├── checker.py # Insufficiency detection logic
│ │ ├── searcher.py # DuckDuckGo web search
│ │ ├── indexer.py # Web result processing
│ │ └── pipeline.py # Web search pipeline
│ │
│ ├── rag/ # Retrieval-Augmented Generation
│ │ ├── rag_module.py # Main RAG orchestrator
│ │ ├── llm_provider.py # LLM provider abstraction (Ollama)
│ │ ├── config.py # RAG configuration
│ │ ├── prompt_templates.py # Prompt engineering templates
│ │ ├── citations.py # Citation extraction and formatting
│ │ ├── output_parser.py # RAG response parsing
│ │ └── __init__.py
│ │
│ ├── retrieval/ # Document retrieval methods
│ │ ├── lsi_model.py # Latent Semantic Indexing
│ │ ├── vector_store.py # Vector storage (ChromaDB/custom backend)
│ │ └── __pycache__/
│ │
│ ├── indexing/ # Text indexing
│ │ ├── indexer.py # Inverted index and TF-IDF
│ │ └── __pycache__/
│ │
│ ├── ranking/ # Result ranking
│ │ ├── ranking.py # Multi-signal ranking engine
│ │ └── __pycache__/
│ │
│ ├── evaluation/ # System evaluation
│ │ ├── evaluation.py # IR metrics computation
│ │ └── __pycache__/
│ │
│ ├── recommendation/ # Optional recommendation module
│ │ ├── recommender.py # Content-based recommendation engine
│ │ ├── user_history.py # Search history persistence
│ │ └── __init__.py
│ │
│ ├── app/ # Frontend application
│ └── acquisition/ # Placeholder for future data acquisition
│
├── data/ # Data storage
│ ├── documents.json # Consolidated document collection
│ ├── user_history.json # Local search history for automatic recommendations
│ ├── evaluation/ # Evaluation results
│ ├── index/ # Vector store and indices
│ │ └── chroma.sqlite3 # ChromaDB persistence
│ ├── processed/ # Processed documents
│ └── qrels/ # Relevance judgments
│
├── tests/ # Test suite
│ ├── sri/
│ │ ├── test_system.py # System integration tests
│ │ ├── crawler/ # Crawler tests
│ │ ├── indexer/ # Indexing tests
│ │ ├── ranking/ # Ranking tests
│ │ ├── retrieval/ # Retrieval tests
│ │ ├── vectordb/ # Vector store tests
│ │ ├── rag/ # RAG tests
│ │ └── web_search/ # Web search tests
│
├── docs/ # Documentation
│ ├── ARCHITECTURE_ANALYSIS.md # Detailed architecture analysis
│ ├── FRONTEND_IMPLEMENTATION_PLAN.md # Gradio interface specification
│ ├── RECOMMENDATION_MODULE.md # Recommendation module documentation
│ ├── PRE_RAG_STATUS.md # Pre-RAG implementation status
│ └── RAG_IMPLEMENTATION_PLAN.md # RAG implementation details
│
├── pyproject.toml # Project metadata and dependencies
├── Makefile # Build and development commands
├── docker-compose.yml # Docker orchestration
├── Dockerfile # Container specification
└── README.md # This file
| Directory | Purpose | Key Files |
|---|---|---|
| `src/` | All source code | Python modules and CLI entry points |
| `src/sri/` | Core retrieval system | Pipeline, crawlers, web search |
| `src/rag/` | Answer generation | LLM integration, prompts, citations |
| `src/retrieval/` | Document retrieval | LSI, vector storage, embeddings |
| `data/` | Persistent data storage | Documents, indices, evaluation results |
| `tests/` | Test suite | Unit and integration tests |
| `docs/` | Project documentation | Architecture, implementation plans |
The system operates in the following pipeline:
User Query
↓
┌─────────────────────────────────────────────┐
│ MainOrchestrator (main_orchestator.py) │
│ ┌──────────────────────────────────────┐ │
│ │ 1. Database Health Check │ │
│ │ • Check indexed document count │ │
│ │ • Verify vector store availability│ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ 2. Local Search (SRIPipeline) │ │
│ │ • LSI (semantic similarity) │ │
│ │ • TF-IDF (term frequency) │ │
│ │ • Vector similarity (embeddings) │ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ 3. Insufficiency Detection │ │
│ │ • Quantity: Few results? │ │
│ │ • Quality: Low relevance scores? │ │
│ │ • Semantic: Keyword overlap? │ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ 4. Conditional Web Search (Optional) │ │
│ │ • DuckDuckGo augmentation │ │
│ │ • Persist results for future use │ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ 5. Document Consolidation │ │
│ │ • Merge local + web results │ │
│ │ • Deduplication │ │
│ └──────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────┐ │
│ │ 6. RAG Generation (RAGModule) │ │
│ │ • Prompt template application │ │
│ │ • Ollama LLM inference │ │
│ │ • Citation extraction │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
↓
RAGResponse (answer + citations)
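Step 5 of the diagram (merge local and web results, then deduplicate) could be sketched roughly as follows. This is an illustration only: the function name `consolidate`, the assumption that each document is a dict with a `url` key, and the choice of the URL as the deduplication key are all hypothetical and may not match the project's implementation.

```python
from typing import Any, Dict, List


def consolidate(local_docs: List[Dict[str, Any]],
                web_docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge local and web results, keeping the first document seen per URL.

    Hypothetical helper: using the URL as the dedup key is an assumption.
    """
    seen = set()
    merged = []
    # Local results are listed first, so they win over web duplicates.
    for doc in list(local_docs) + list(web_docs):
        # Crude normalization: trim whitespace and a trailing slash.
        key = doc.get("url", "").strip().rstrip("/")
        if not key:
            # No URL to compare against: keep the document as-is.
            merged.append(doc)
            continue
        if key in seen:
            continue
        seen.add(key)
        merged.append(doc)
    return merged
```

Listing local results first means a locally indexed copy is preferred over a freshly fetched web copy of the same page.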
MainOrchestrator is the central API for the entire system. It orchestrates all operations and exposes an interface that is independent of the front-end technology (CLI, GUI, REST API, etc.).
Key Methods:
| Method | Purpose | Parameters | Returns |
|---|---|---|---|
| `query()` | Complete end-to-end query pipeline | question, max_local_results, enable_web_search, auto_reload | RAGResponse |
| `retrieve_documents()` | Document retrieval without RAG | Same as query() | Dict[str, Any] with documents |
| `augment_response()` | RAG generation from documents | question, documents | RAGResponse |
| `clear_all_indices()` | Clear database | None | Dict[str, Any] status |
| `load_documents_from_crawlers()` | Execute crawlers and index | max_articles, force_recrawl | Dict[str, Any] statistics |
| `check_database_health()` | Database readiness check | None | Dict[str, Any] health metrics |
| `detect_insufficiency_for_query()` | Insufficiency detection | query, results | Dict[str, Any] assessment |
| `evaluate_test()` | System evaluation | test_spec (optional) | Dict[str, Any] IR metrics |
| `get_status()` | System diagnostics | None | Dict[str, Any] complete status |
Internal Architecture:
- Database Management: Manages VectorStore, ChromaDB, LSI models, and consolidated documents
- Search Pipeline: Orchestrates local search via SRIPipeline
- Insufficiency Detection: Multi-criterion analysis (quantity, quality, semantic)
- Web Augmentation: Conditional DuckDuckGo integration
- RAG Integration: Passes documents to RAGModule for generation
- Crawlers: Manages six specialized web crawlers for document acquisition
SRIPipeline provides a unified interface for local document search.
Methods:
- `index(documents, save=True)`: Index documents using LSI, TF-IDF, and vector storage
- `search(query, top_k=10)`: Retrieve relevant documents using multi-method scoring
- `evaluate(test_set)`: Compute IR evaluation metrics
Integrated Components:
- InvertedIndex (`src/indexing/indexer.py`): Vocabulary, posting lists, TF-IDF
- LSIModel (`src/retrieval/lsi_model.py`): Latent semantic analysis
- VectorStore (`src/retrieval/vector_store.py`): ChromaDB or custom backend
- RankingEngine (`src/ranking/ranking.py`): Multi-signal result ranking
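To make the InvertedIndex role more concrete, here is a minimal pure-Python sketch of posting lists with TF-IDF scoring. The class name `TinyInvertedIndex`, the whitespace tokenizer, and the `tf * log(N / df)` weighting are assumptions chosen for illustration; the actual code in `src/indexing/indexer.py` may be organized differently.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class TinyInvertedIndex:
    """Illustrative inverted index; not the project's InvertedIndex class."""

    def __init__(self) -> None:
        # Posting lists: term -> {doc_id: term frequency}
        self.postings: Dict[str, Dict[str, int]] = defaultdict(dict)
        self.num_docs = 0

    def add(self, doc_id: str, text: str) -> None:
        """Tokenize on whitespace and record term frequencies."""
        self.num_docs += 1
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf

    def search(self, query: str, top_k: int = 10) -> List[Tuple[str, float]]:
        """Rank documents by the sum of tf * idf over the query terms."""
        scores: Dict[str, float] = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(self.num_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_k]
```

In the full pipeline, a lexical score of this kind would be one signal among several, alongside the LSI and vector-similarity scores.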
RAGModule orchestrates the Retrieval-Augmented Generation pipeline.
Methods:
- `generate(query, documents, temperature, max_tokens)`: Generate an answer with citations
Components:
- LLMProvider (`src/rag/llm_provider.py`): Ollama interface
- PromptTemplateFactory (`src/rag/prompt_templates.py`): Template selection (basic, domain_specific, chain_of_thought)
- CitationExtractor (`src/rag/citations.py`): Automatic citation extraction
- OutputParser (`src/rag/output_parser.py`): Response structuring
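One plausible way to extract citations is sketched below. It assumes the prompt asks the model to cite sources with bracketed numbers such as `[1]`, where the number is the 1-based position of the document in the list passed to the model. That marker format, the function name `extract_citations`, and the returned dict shape are assumptions, not the documented behavior of `src/rag/citations.py`.

```python
import re
from typing import Any, Dict, List


def extract_citations(answer: str, documents: List[Dict[str, Any]],
                      max_cites: int = 10) -> List[Dict[str, Any]]:
    """Map [n] markers in the answer to the n-th document (1-based).

    Assumption: the model was prompted to cite sources as [1], [2], ...
    """
    cited: List[Dict[str, Any]] = []
    seen = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        # Skip repeated markers and markers outside the document list.
        if number in seen or not 1 <= number <= len(documents):
            continue
        seen.add(number)
        doc = documents[number - 1]
        cited.append({
            "number": number,
            "title": doc.get("title", ""),
            "url": doc.get("url", ""),
        })
        if len(cited) >= max_cites:
            break
    return cited
```

Citations come back in order of first appearance in the answer, and markers that point at no document are dropped so a hallucinated `[7]` does not produce a broken reference.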
The web search module (`src/sri/web_search/`) automatically augments local results with web search results.
- SufficiencyChecker (`checker.py`): Multi-criterion insufficiency detection
- WebSearcher (`searcher.py`): DuckDuckGo integration
- WebIndexer (`indexer.py`): Web result processing and formatting
Six specialized crawlers for continuous document acquisition:
- DevTo — Technology articles and tutorials
- HackerNews — Tech news and discussions
- RealPython — Python programming resources
- Lobsters — Software engineering news
- TheNewStack — Cloud-native and DevOps content
- TheVerge — Tech product reviews and news
Orchestrated by:
- CrawlerCaller (`caller.py`): Execution manager
- CrawlerSettings (`settings.py`): Configuration
- Python: 3.10 or higher
- Package Manager: pip or uv (recommended)
- Ollama: For LLM inference (required for RAG)
  - Default URL: `http://localhost:11434`
  - Default Model: `llama3.2:latest` (customizable)
  - Installation: https://ollama.ai
See pyproject.toml for complete dependency list.
Critical Dependencies:
- `scikit-learn>=1.6`: LSI, TF-IDF, machine learning
- `chromadb>=0.5`: Vector database with embeddings
- `sentence-transformers>=2.7`: Text embeddings
- `langchain>=0.2`: LLM integration framework
- `duckduckgo-search>=6.0`: Web search fallback
- `scrapy>=2.11`: Web crawling framework
- `nltk>=3.8`: Natural language processing
- `gradio>=4.0`: Web interface framework
- RAM: Minimum 4GB (8GB recommended)
- Storage: 2GB+ for document indexing and vector store
- Network: Required for crawler and web search operations
git clone https://github.com/D4R102004/Information-Retrieval-System-Project.git
cd Information-Retrieval-System-Project
# Using Python venv
python3.10 -m venv .venv
# Activate environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
# Using pip
pip install -e ".[dev]"
# OR using uv (faster, recommended)
uv pip install -e ".[dev]"
# Visit https://ollama.ai and download the installer for your OS
# Start the Ollama service (it runs in the foreground, so use a separate terminal)
ollama serve
# Download the default model
ollama pull llama3.2:latest
# Verify that the model is installed
ollama list
pre-commit install
pre-commit install --hook-type commit-msg
python -c "from main_orchestator import MainOrchestrator; print('[OK] MainOrchestrator imported successfully')"
The CLI is accessed via src/main.py with the following structure:
python src/main.py [MODE] [OPTIONS]
Execute a single query and display results:
python src/main.py --query "How does machine learning work?"
Output includes:
- Formatted answer with proper markdown rendering
- Citation list with sources and snippets
- Metadata (local documents, web documents, generation time)
With custom parameters:
# Use more local results
python src/main.py --query "Python async programming" --max-local 10
# Disable web search augmentation
python src/main.py --query "Your question" --no-web-search
# Increase verbosity for debugging
python src/main.py --query "Your question" -v
# Write logs to file
python src/main.py --query "Your question" --log-file system.log
Enter a loop where you can ask multiple queries sequentially:
python src/main.py --interactive
Interactive commands:
ask <query> - Submit a query
status - Display database health
load - Load documents from crawlers
clear - Clear database (with confirmation)
help - Show help message
exit - Exit application
Example session:
ask What is LSI?
ask How does ChromaDB work?
status
load
exit
Display current database health and statistics:
python src/main.py --status
Output:
========================================================
DATABASE STATUS
========================================================
Status: healthy / degraded / empty
Indexed Documents: 1250 (Min required: 500)
File Documents: 1250 (data/documents.json)
Raw Documents: 0 (data/raw/)
ChromaDB Available: Yes
VectorStore Count: 1250
LLM Connection: ✓ Connected (latency: 45ms)
Last Update: 2025-05-31 14:23:15 UTC
========================================================
Execute the full crawl → consolidate → index pipeline:
# Standard load (respects existing crawls)
python src/main.py --load-data
# Force recrawl (ignore cache)
python src/main.py --load-data --force
# Load with custom limits
python src/main.py --load-data --max-articles 500
Output:
- Crawler execution status for each of 6 spiders
- Consolidated document count
- Indexing progress
- Final statistics
Remove all indices, models, and documents:
# With confirmation prompt
python src/main.py --clear-db
# Force clear without confirmation
python src/main.py --clear-db --force
Warning: This operation cannot be undone.
--verbose, -v Enable debug logging
--log-file PATH Write logs to file
--force Skip confirmations
--max-articles N Max articles per crawler (default: 100)
--max-local N Max local search results (default: 5)
--no-web-search Disable web search fallback
Research query with full output:
python src/main.py --query "Distributed systems consensus algorithms" \
  --max-local 10 \
  --verbose \
  --log-file research.log
Bulk load with debug logging:
python src/main.py --load-data --max-articles 1000 --verbose
Interactive session with logging:
python src/main.py --interactive --log-file session.log --verbose
The MainOrchestrator can be used directly in Python code without the CLI:
from main_orchestator import MainOrchestrator
# Initialize
orchestrator = MainOrchestrator()
# Single query (complete pipeline)
response = orchestrator.query(
    question="What is Retrieval-Augmented Generation?",
    max_local_results=5,
    enable_web_search=True,
    auto_reload=True
)
print(f"Answer: {response.answer}")
print(f"Citations: {len(response.citations)}")
print(f"Metadata: {response.metadata}")
# Get documents without generating an answer
result = orchestrator.retrieve_documents(
    question="machine learning frameworks",
    max_local_results=10,
    enable_web_search=True,
    auto_reload=False
)
documents = result['documents']
metadata = result['metadata']
print(f"Local docs: {metadata['local_documents']}")
print(f"Web docs: {metadata['web_documents']}")
print(f"Insufficiency: {metadata['insufficiency_reasons']}")
# Generate answer from pre-selected documents
documents = [
    {
        "id": "doc1",
        "title": "Deep Learning Fundamentals",
        "content": "Neural networks are...",
        "url": "https://example.com/article1"
    },
    # ... more documents
]
response = orchestrator.augment_response(
    question="Explain backpropagation",
    documents=documents
)
print(response.answer)
for citation in response.citations:
    print(f" - {citation.title}")
# Step 1: Retrieve and check insufficiency
retrieval = orchestrator.retrieve_documents(
    question="query",
    max_local_results=3,
    enable_web_search=False  # Don't use web search in this step
)
# Step 2: Check insufficiency
insufficiency = orchestrator.detect_insufficiency_for_query(
    query="query",
    results=retrieval['documents']
)
if insufficiency['is_insufficient']:
    print(f"Reasons: {insufficiency['reasons']}")
    # Optionally refine query or fetch more documents
# Step 3: Generate answer
response = orchestrator.augment_response(
    question="query",
    documents=retrieval['documents']
)
# Complete system status
status = orchestrator.get_status()
print(f"Database status: {status['database']['status']}")
print(f"Indexed docs: {status['database']['indexed_documents']}")
print(f"Raw docs: {status['crawlers']['raw_documents']}")
# Database health
health = orchestrator.check_database_health()
print(f"Ready to search: {health['can_search']}")
# Clear all data
result = orchestrator.clear_all_indices()
if result['success']:
    print("Database cleared")
# Load data from crawlers
load_result = orchestrator.load_documents_from_crawlers(
    max_articles=1000,
    force_recrawl=False
)
print(f"Indexed: {load_result['indexed_documents']} documents")
from rag.config import config
# View current configuration
print(config.ollama_model) # Current model: "llama3.2:latest"
print(config.temperature) # Generation randomness: 0.7
print(config.max_tokens) # Max response length: 1024
# Modify configuration (at runtime)
config.temperature = 0.5
config.max_tokens = 2048
Key Parameters:
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `ollama_model` | str | `llama3.2:latest` | - | Model identifier for inference |
| `ollama_base_url` | str | `http://localhost:11434` | - | Ollama service URL |
| `rag_template` | str | `domain_specific` | basic, domain_specific, chain_of_thought | Prompt template strategy |
| `temperature` | float | `0.7` | 0.0–1.0 | Response randomness (0=deterministic, 1=random) |
| `max_tokens` | int | `1024` | 100–4096 | Maximum response length in tokens |
| `max_cites` | int | `10` | 1–20 | Maximum citations to extract |
| `top_k_retrieval` | int | `5` | 1–20 | Documents to pass to RAG |
from sri.crawler.settings import CrawlerSettings
settings = CrawlerSettings()
print(settings.MIN_RESULTS_FOR_QUERY) # Minimum results threshold: 5
print(settings.MIN_AVG_SCORE_THRESHOLD) # Minimum avg relevance: 0.3
Key Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `MIN_RESULTS_FOR_QUERY` | int | 5 | Minimum local results before web search |
| `MIN_AVG_SCORE_THRESHOLD` | float | 0.3 | Minimum average relevance score |
| `MIN_ARTICLES_PER_SPIDER` | int | 50 | Minimum articles per crawler |
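The sketch below shows how the first two thresholds could feed a multi-criterion insufficiency check covering quantity, quality, and semantic overlap. It is not the project's `SufficiencyChecker`: the function name `assess`, the `score`/`title`/`content` result fields, and the `MIN_TERM_OVERLAP` threshold are assumptions made for illustration. Only the two `MIN_*` defaults and the `is_insufficient`/`reasons` keys come from this README.

```python
from typing import Any, Dict, List

MIN_RESULTS_FOR_QUERY = 5        # default listed in the settings table
MIN_AVG_SCORE_THRESHOLD = 0.3    # default listed in the settings table
MIN_TERM_OVERLAP = 0.5           # hypothetical threshold, not from the project


def assess(query: str, results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return which of the three criteria (if any) the results fail."""
    reasons: List[str] = []

    # Quantity: too few local results.
    if len(results) < MIN_RESULTS_FOR_QUERY:
        reasons.append("quantity")

    # Quality: average relevance score below the threshold.
    scores = [r.get("score", 0.0) for r in results]
    avg_score = sum(scores) / len(scores) if scores else 0.0
    if avg_score < MIN_AVG_SCORE_THRESHOLD:
        reasons.append("quality")

    # Semantic overlap: share of query terms found in the result text.
    # Substring matching is a deliberately crude stand-in here.
    terms = set(query.lower().split())
    text = " ".join(
        f"{r.get('title', '')} {r.get('content', '')}" for r in results
    ).lower()
    covered = {t for t in terms if t in text}
    overlap = len(covered) / len(terms) if terms else 0.0
    if overlap < MIN_TERM_OVERLAP:
        reasons.append("semantic")

    return {"is_insufficient": bool(reasons), "reasons": reasons}
```

Returning the list of failed criteria, rather than a single boolean, lets the caller report why web search was triggered.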
The MainOrchestrator API is designed for REST API wrapping:
# Pseudocode for FastAPI wrapper
from fastapi import FastAPI
from main_orchestator import MainOrchestrator
app = FastAPI()
orchestrator = MainOrchestrator()
@app.post("/api/query")
async def query_endpoint(question: str, max_local: int = 5):
    response = orchestrator.query(
        question=question,
        max_local_results=max_local
    )
    return response.dict()
@app.get("/api/status")
async def status_endpoint():
    return orchestrator.get_status()
# As a Python library
import json
from main_orchestator import MainOrchestrator
class SearchClient:
    def __init__(self):
        self.orchestrator = MainOrchestrator()
    def search(self, question: str):
        return self.orchestrator.query(question)
    def evaluate(self, test_file: str):
        # Load the test specification from a JSON file
        with open(test_file) as f:
            test_spec = json.load(f)
        return self.orchestrator.evaluate_test(test_spec)
client = SearchClient()
response = client.search("Your question")
Via CLI:
python src/main.py --evaluate
Via Python API:
# Run evaluation with default test set (data/test_queries.json)
results = orchestrator.evaluate_test()
# Run evaluation with custom test set
custom_tests = {
    "test_queries": [
        {
            "query_id": "q1",
            "query": "machine learning",
            "relevant": ["doc1", "doc2"],
            "grades": {"doc1": 3, "doc2": 2}
        },
        # ... more queries
    ]
}
results = orchestrator.evaluate_test(custom_tests)
The system computes standard Information Retrieval metrics:
Per-Query Metrics:
- AP (Average Precision) — Area under precision-recall curve
- RR (Reciprocal Rank): Inverse of the rank of the first relevant document
- P@k (Precision at k) — Proportion of relevant in top-k
- R@k (Recall at k) — Proportion of relevant retrieved in top-k
- NDCG@k (Normalized DCG) — Ranking quality with graded relevance
Aggregate Metrics:
- MAP (Mean Average Precision)
- MRR (Mean Reciprocal Rank)
- Mean P@1, @3, @5, @10
- Mean R@1, @3, @5, @10
- Mean NDCG@1, @3, @5, @10
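For reference, the per-query metrics listed above can be computed as in the following sketch. These are the textbook definitions, not a copy of `src/evaluation/evaluation.py`; in particular, NDCG here uses linear gain (the grade itself) and a `log2(rank + 1)` discount, which is an assumption about the variant the project uses.

```python
import math
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], grades: Dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([grades.get(d, 0) for d in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all test queries.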
Evaluation Output:
{
"status": "success",
"timestamp": "2025-05-31T14:30:00Z",
"execution_time_seconds": 45.3,
"aggregate": {
"num_queries": 10,
"MAP": 0.652,
"MRR": 0.847,
"mean_P@1": 0.7,
"mean_R@5": 0.432,
"mean_NDCG@5": 0.715
},
"per_query": [
{
"query_id": "q1",
"num_relevant": 5,
"ap": 0.85,
"rr": 1.0,
"p@5": 0.8,
"r@5": 0.4,
"ndcg@5": 0.895
},
// ... more queries
]
}
data/test_queries.json:
{
"test_queries": [
{
"query_id": "q1",
"query": "search query text",
"relevant": ["doc_id1", "doc_id2", "doc_id3"],
"grades": {
"doc_id1": 3,
"doc_id2": 2,
"doc_id3": 1
}
},
// ... more test queries
]
}
Problem: "Database is empty. A search is impossible to perform"
Solutions:
# Load data from crawlers
python src/main.py --load-data
# Or via Python API
orchestrator.load_documents_from_crawlers()
Check minimum documents:
python src/main.py --status
Ensure Indexed Documents ≥ 500 (configurable via self.settings["min_documents"]).
Problem: "Failed to connect to Ollama service"
Solutions:
# 1. Start Ollama service
ollama serve
# 2. Verify installation
ollama list
# 3. Pull required model
ollama pull llama3.2:latest
# 4. Check URL configuration
# Default: http://localhost:11434
# Modify in src/rag/config.py if different
Problem: Generated answers lack relevant citations
Solutions:
- Increase local search results:
  python src/main.py --query "Your question" --max-local 10
- Verify document relevance:
  python src/main.py --status
- Adjust RAG parameters:
  from rag.config import config
  config.temperature = 0.3 # More deterministic
  config.max_tokens = 2048 # Longer responses
- Use a different prompt template:
  rag = RAGModule(llm, template_type="chain_of_thought")
Problem: "Crawler execution failed"
Solutions:
# Check network connectivity
ping duckduckgo.com
# Force recrawl
python src/main.py --load-data --force
# Re-run with debug logging to see which crawler fails
python src/main.py --load-data --verbose
Problem: Out of memory when loading large datasets
Solutions:
- Reduce documents per crawler:
  python src/main.py --load-data --max-articles 100
- Use ChromaDB (more memory-efficient):
  - Install: pip install chromadb
  - System will auto-detect and use it
- Increase system RAM or use pagination in custom code
# Run all tests
make test
# Run specific test module
pytest tests/sri/test_system.py -v
# Run with coverage
pytest --cov=src tests/
# Run specific test function
pytest tests/sri/retrieval/test_lsi_model.py::test_lsi_fitting -v
# Format code
make format
# Lint code
make lint
# Full check
make check
# View all available commands
make help
# Install development dependencies
make install
# Clean cache files
make clean
Project Members:
- Darío Francisco Alfonso (@D4R102004)
- Juan Carlos Carmenate (@Juank404)
- Sebastian González Alfonso (@sebagonz106)
Advisors:
- University of Havana, Faculty of Mathematics and Computer Science
Contributions: We acknowledge the open-source community for the excellent libraries that power this system: scikit-learn, Ollama, ChromaDB, Scrapy, and many others.
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
- Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search (2nd ed.). Addison-Wesley.
- Scrapy Documentation. https://docs.scrapy.org/
Academic Project
Universidad de La Habana, 2025–2026
This project is developed as an academic exercise in Information Retrieval systems and Natural Language Processing. All code is provided as-is for educational purposes.
Get started in 5 minutes:
# 1. Install
pip install -e ".[dev]"
# 2. Start Ollama in a separate terminal, then pull the model
ollama serve
ollama pull llama3.2:latest
# 3. Load data
python src/main.py --load-data
# 4. Ask a question
python src/main.py --query "What is machine learning?"
# 5. Try interactive mode
python src/main.py --interactive
For detailed architecture documentation, see:
- docs/INDEX.md: Documentation structure guide
- docs/ARCHITECTURE_ANALYSIS.md: Technical deep-dive
The project now includes a content-based recommendation module under src/recommendation/.
It recommends documents using a hybrid score composed of:
- TF-IDF content similarity over title, tags, source, and article content.
- Optional user interests or current query text.
- Optional liked/seed document IDs.
- Freshness and source-prior signals.
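A hybrid score of this kind could be combined as in the sketch below. Everything here is illustrative: the weights, the 30-day half-life, the default prior of 0.5, the `published` field, and the plain bag-of-words cosine (used in place of real TF-IDF weighting to keep the example dependency-free) are assumptions, not the values or method used in `src/recommendation/recommender.py`.

```python
import math
from collections import Counter
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def freshness(published: datetime, now: datetime,
              half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a new document, 0.5 after one half-life."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def hybrid_score(profile_text: str, doc: Dict[str, Any], now: datetime,
                 source_prior: Dict[str, float],
                 weights: Tuple[float, float, float] = (0.7, 0.2, 0.1)) -> float:
    """Weighted sum of content similarity, freshness, and source prior."""
    w_content, w_fresh, w_source = weights
    # Content is built from title, tags, source, and article body.
    text = " ".join([
        doc.get("title", ""),
        " ".join(doc.get("tags", [])),
        doc.get("source", ""),
        doc.get("content", ""),
    ])
    return (w_content * cosine(profile_text, text)
            + w_fresh * freshness(doc["published"], now)
            + w_source * source_prior.get(doc.get("source", ""), 0.5))
```

Here `profile_text` would be the concatenation of the query, the stated interests, and text from liked documents, so all three optional inputs end up in the same similarity computation.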
Main backend methods exposed through MainOrchestrator:
orchestrator.recommend_documents(
    query="serverless websocket apps",
    interests="cloud computing, javascript, APIs",
    liked_doc_ids=["010a7286-edfa-4143-9e46-462829787546"],
    top_k=10,
)
orchestrator.recommend_similar_documents(
    document_id="010a7286-edfa-4143-9e46-462829787546",
    top_k=10,
)
The Gradio UI also includes a new Recommendation tab where the user can generate profile-based recommendations or find documents similar to a selected document ID.
The recommendation module also persists user search behavior in data/user_history.json.
Every successful search executed from the Search tab is recorded with:
- the search query,
- the timestamp,
- the document IDs retrieved by that search.
The method recommend_from_history() builds an automatic user profile from the latest 5 searches
and recommends new documents from the local corpus:
orchestrator.recommend_from_history(
    user_id="default",
    top_k=10,
    history_limit=5,
)
In the Gradio UI, each completed Search tab query automatically refreshes the Automatic recommendations section inside the Recommendation tab. The user can also manually regenerate those recommendations or clear the stored search history from that same tab.
For the complete technical documentation of this module, see docs/RECOMMENDATION_MODULE.md.