πŸ” Text Retrieval & Search Engine

A collection of command-line programs implementing core information retrieval concepts, from web crawling and inverted indexes to PageRank, NLP pipelines, and ML-based sentiment analysis.

A01 · Web Crawling · A02 · IR & NLP · A03 · ML Classification & Clustering · Setup


🧠 What This Project Covers

This project spans the full pipeline of a modern search and information retrieval system, covering skills directly applicable to NLP, document understanding, and enterprise AI systems:

| Concept | Where |
|---------|-------|
| Web crawling & HTML parsing | A01 |
| Hash-based content storage | A01 |
| Zipf's Law & corpus linguistics | A02 |
| Tokenization, stemming, stopword removal | A02 |
| Inverted index construction | A02 |
| Elias Gamma & Delta compression coding | A02 |
| PageRank on a large-scale graph (2.3M edges) | A02 |
| Noisy Channel Model & spell correction | A02 |
| Sentiment classification (Naive Bayes, SVM, Decision Tree, KNN) | A03 |
| Text clustering (KMeans, DBSCAN, Hierarchical) | A03 |

πŸ“ A01 Β· Web Crawling

Three progressively complex web crawlers that download, store, and parse web content.

webcrawler1.py — General-Purpose Crawler

Recursively crawls from an initial URL up to a configurable maximum depth.

  • Downloads page content and stores it as H.txt where H is the MD5 hash of the URL (via hashlib)
  • Extracts all hyperlinks from downloaded content
  • Appends structured log entries <H, URL, DateTime, HTTP_Status> to crawler1.log
$ python3 webcrawler1.py --maxdepth 3 --verbose TRUE https://example.com
| Flag | Description | Default |
|------|-------------|---------|
| `--maxdepth` | Max crawl depth from the initial URL | — |
| `--rewrite` | Re-download even if `H.txt` exists | `FALSE` |
| `--verbose` | Print `<URL, depth>` during the crawl | `FALSE` |
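The hash-based storage scheme can be sketched as follows. This is a minimal illustration of the idea rather than the crawler's exact code, and `url_to_filename` is a hypothetical helper name:

```python
import hashlib

def url_to_filename(url: str) -> str:
    # MD5 of the URL gives a fixed-length, filesystem-safe name H.txt
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    return h + ".txt"

name = url_to_filename("https://example.com")
```

Because the same URL always maps to the same file name, the crawler can detect an already-downloaded page and skip it unless `--rewrite` is set.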

webcrawler2.py — Researcher Profile Crawler

Crawls an academic researcher's profile page and extracts structured publication data as JSON (H.json). Handles pagination via dynamic POST requests to retrieve the full publication list beyond initial page limits.

$ python3 webcrawler2.py https://researcher-profile-url.com

webcrawler3.py — Main Content Extractor

Identifies the main content region of a webpage using a scoring function over HTML/token density.

Method: Tags are mapped to 1 and words to 0, giving a binary sequence $b_0, \dots, b_{N-1}$. The optimal window $(i^*, j^*)$ maximizes:

$$f(i,j) = \sum_{n=0}^{i-1} b_n + \sum_{n=i}^{j}(1 - b_n) + \sum_{n=j+1}^{N-1} b_n$$

i.e., the number of tags kept outside the window plus the number of words kept inside it.

  • Extracts the content between i* and j* into H.txt
  • Generates a 2D/3D heatmap of all f(i,j) values to visualize the content boundary
$ python3 webcrawler3.py https://example.com
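The maximization can be sketched with prefix sums. This brute-force O(N²) version is only an illustration of the objective under the tag = 1 / word = 0 encoding; `best_window` is a hypothetical name, and the repo's implementation may differ:

```python
def best_window(b):
    """b[n] = 1 for a tag token, 0 for a word token.
    Maximize f(i, j) = tags before i + words in [i..j] + tags after j."""
    N = len(b)
    prefix = [0] * (N + 1)                    # prefix[k] = sum of b[0..k-1]
    for n, v in enumerate(b):
        prefix[n + 1] = prefix[n] + v
    total_tags = prefix[N]
    best = (float("-inf"), 0, 0)
    for i in range(N):
        for j in range(i, N):
            tags_in_window = prefix[j + 1] - prefix[i]
            words_inside = (j - i + 1) - tags_in_window
            tags_outside = total_tags - tags_in_window
            f = words_inside + tags_outside
            if f > best[0]:
                best = (f, i, j)
    return best                               # (f*, i*, j*)
```

For `b = [1, 1, 0, 0, 0, 1, 1]` the optimal window is the run of words in the middle, since widening it only trades a word for a tag.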

πŸ“ A02 Β· Information Retrieval & NLP

Core NLP preprocessing pipeline, compression coding, large-scale PageRank, and probabilistic spell correction.

wikipedia_processing.py — NLP Pipeline

Full text processing pipeline on a Wikipedia corpus with modular CLI flags.

$ python3 wikipedia_processing.py [--zipf] [--tokenize] [--stemming] [--stopword] [--invertedindex]
| Flag | What It Does |
|------|--------------|
| `--zipf` | Plots the top 50 word frequencies; validates Zipf's Law on the corpus |
| `--tokenize` | Tokenizes raw text and writes the token list to a file |
| `--stemming` | Applies Porter stemming to reduce words to their root forms |
| `--stopword` | Removes stopwords (using the NLTK stopword list) from the token stream |
| `--invertedindex` | Builds an inverted index from the stemmed, cleaned tokens; outputs to file and stdout |
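The `--zipf` check boils down to verifying that frequency × rank stays roughly constant across ranks. A toy sketch of the computation (real corpus text is needed for the law to actually hold; the repeated sentence here only demonstrates the counting):

```python
from collections import Counter

tokens = ("the quick brown fox jumps over the lazy dog the fox " * 50).split()
ranked = Counter(tokens).most_common()        # [(word, freq), ...] sorted by frequency
for rank, (word, freq) in enumerate(ranked[:5], start=1):
    # Under Zipf's Law, rank * freq is approximately constant
    print(rank, word, freq, rank * freq)
```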

What's an inverted index? It maps each unique term → the list of documents it appears in. It's the core data structure behind every search engine.
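A minimal sketch of that mapping, assuming whitespace tokenization (the repo's index is built from stemmed, stopword-filtered tokens instead; `build_inverted_index` is a hypothetical name):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns {term: sorted list of doc_ids}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat ran"}
index = build_inverted_index(docs)
```

Answering a query then reduces to intersecting posting lists, e.g. documents containing both "cat" and "sat".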


elias_coding.py — Integer Compression (Elias Gamma & Delta)

Implements two lossless integer compression algorithms used in compressed inverted indexes for efficient storage of posting lists.

$ python3 elias_coding.py --alg elias_gamma --encode 1 2 3 4 5
$ python3 elias_coding.py --alg elias_delta --decode 10110

Elias Gamma: $n = 1 + \lfloor\log_2(x)\rfloor$; the code is unary($n$) followed by the binary offset of $x$
Elias Delta: similar, but encodes the length $n$ itself with a Gamma code, which compresses better for larger numbers

Both include full decode + verification logic.
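A minimal sketch of the Gamma codec (bit order and stream framing in the repo's implementation may differ):

```python
def gamma_encode(x: int) -> str:
    # x >= 1; emit N zeros, then the (N+1)-bit binary of x, where N = floor(log2 x)
    binary = bin(x)[2:]                  # leading bit is always 1
    return "0" * (len(binary) - 1) + binary

def gamma_decode(code: str) -> int:
    n = code.index("1")                  # number of leading zeros = N
    return int(code[n:2 * n + 1], 2)     # next N+1 bits are x in binary
```

For example, 5 is `101` in binary, so its Gamma code is `00101`: two zeros announce the length, then the binary value follows.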


page_rank.py — PageRank on the Stanford Web Graph

Applies the PageRank algorithm to the web-Stanford.txt dataset containing 2,312,502 directed edges.

$ python3 page_rank.py --maxiteration 100 --lambda 0.85 --thr 0.0001 --nodes 1 2 3 4 5

Performance optimizations made:

  • Precomputes reciprocal of outbound edge count once before iteration loop (avoids redundant division per step)
  • Builds adjacency structure in a single O(E) pass
  • Selection sort on final scores for top-N output
| Parameter | Description |
|-----------|-------------|
| `--maxiteration` | Max number of power-iteration steps |
| `--lambda` | Damping factor (typically 0.85) |
| `--thr` | Convergence threshold |
| `--nodes` | Space-separated list of node IDs to rank |
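The power iteration itself can be sketched as follows. This is a simplified version on a toy edge list (dangling-node mass is ignored for brevity, and the repo's code may handle it differently); it does keep the repo's trick of precomputing reciprocals of outdegrees once:

```python
def pagerank(edges, damping=0.85, max_iter=100, thr=1e-4):
    # edges: list of (src, dst) directed pairs
    nodes = {n for e in edges for n in e}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    # Precompute 1/outdegree once, before the iteration loop
    inv_out = {n: 1.0 / len(ls) for n, ls in out_links.items() if ls}
    N = len(nodes)
    rank = {n: 1.0 / N for n in nodes}
    for _ in range(max_iter):
        new = {n: (1.0 - damping) / N for n in nodes}
        for src, dsts in out_links.items():
            if not dsts:
                continue
            share = damping * rank[src] * inv_out[src]
            for dst in dsts:
                new[dst] += share
        converged = sum(abs(new[n] - rank[n]) for n in nodes) < thr
        rank = new
        if converged:
            break
    return rank
```

Here `damping`, `max_iter`, and `thr` play the roles of `--lambda`, `--maxiteration`, and `--thr`.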

noisy_channel.py — Probabilistic Spell Correction

Implements the Noisy Channel Model for spelling correction: $\hat{w} = \arg\max_w P(w) \cdot P(\text{typo}|w)$

$ python3 noisy_channel.py --correct hte recieve wrold
$ python3 noisy_channel.py --proba the receive world

  • Candidate generation via edit distance-1 operations: delete, transpose, replace, insert
  • Channel model computes transformation cost using character-level edit distance
  • Language model uses unigram probabilities from the training corpus (via Counter)
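The four candidate-generation operations can be sketched in the classic split-based style (a generic edit-distance-1 generator; the repo's own generator may differ in details):

```python
import string

def edits1(word):
    """All strings one edit away: delete, transpose, replace, insert."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

Correction then scores each candidate $w$ by $P(w) \cdot P(\text{typo}|w)$ and keeps the argmax.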


πŸ“ A03 Β· ML Classification & Clustering

Supervised sentiment classification and unsupervised news clustering using scikit-learn.

training_sentiment.py — Sentiment Classifier

Trains and evaluates text classifiers on three real-world review datasets (IMDb, Amazon, Yelp).

$ python3 training_sentiment.py --imdb --svm
$ python3 training_sentiment.py --amazon --naive
$ python3 training_sentiment.py --yelp --knn 5

Classifiers supported:

| Flag | Algorithm |
|------|-----------|
| `--naive` | Multinomial Naive Bayes |
| `--svm` | Support Vector Machine |
| `--decisiontree` | Decision Tree |
| `--knn <n>` | K-Nearest Neighbours (specify k) |

Pipeline:

  1. Load dataset → pandas DataFrame
  2. Vectorize text (TF or TF-IDF)
  3. Train/test split + cross-validation
  4. Evaluate: accuracy, precision, recall, F1
  5. Save confusion matrix to disk
  6. Serialize trained classifier + vectorizer (pickle)
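The core of that pipeline can be sketched end-to-end with scikit-learn. A tiny hand-made corpus stands in for the review datasets here, so the numbers are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["great movie", "terrible film", "loved the acting", "awful plot",
         "fantastic story", "boring and bad", "wonderful direction", "worst film ever"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]                      # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)             # step 2: vectorize
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.25, random_state=0)         # step 3: split
clf = MultinomialNB().fit(X_tr, y_tr)                  # train (--naive case)
acc = accuracy_score(y_te, clf.predict(X_te))          # step 4: evaluate
```

Swapping `MultinomialNB` for an SVM, decision tree, or KNN estimator covers the other flags; pickling `clf` and the vectorizer corresponds to step 6.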

cluster_news.py — News Article Clustering

Clusters the 20 Newsgroups dataset (18,000+ documents, 20 categories) using four unsupervised algorithms.

$ python3 cluster_news.py --ncluster 20 --kmeans
$ python3 cluster_news.py --ncluster 5 --dbscan
| Flag | Algorithm |
|------|-----------|
| `--kmeans` | K-Means clustering |
| `--whc` | Ward hierarchical clustering |
| `--ac` | Agglomerative clustering |
| `--dbscan` | DBSCAN (density-based) |

Preprocessing: Headers, footers, and quotes are stripped via the built-in remove option of sklearn.datasets.fetch_20newsgroups. TF-IDF vectorization → dimensionality reduction → cluster assignment.
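That pipeline can be sketched on a toy corpus standing in for 20 Newsgroups (component and cluster counts here are illustrative, not the repo's settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["space shuttle launch into orbit",
        "nasa rocket launch to orbit",
        "hockey team wins the game",
        "team scores a goal to win the game"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)          # vectorize
X_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)   # reduce
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
```

TruncatedSVD is used rather than PCA because it works directly on the sparse TF-IDF matrix.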


🛠 Tech Stack

| Library | Used For |
|---------|----------|
| `requests` / `BeautifulSoup` | Web crawling, HTML parsing |
| `hashlib` | URL-based content addressing |
| `re` | Regex-based HTML tag identification |
| `nltk` | Tokenization, stemming, stopword removal |
| `scikit-learn` | Vectorization, classifiers, clustering, metrics |
| `pandas` | Dataset loading and preprocessing |
| `matplotlib` | Zipf's Law plots, f(i,j) heatmaps, confusion matrices |
| `pickle` | Classifier serialization |
| `argparse` | CLI argument handling across all programs |

⚙️ Setup

# Clone the repo
git clone https://github.com/MobinaToorani/TextRetrieval.git
cd TextRetrieval

# Install pipenv
pip install pipenv

# Create environment with Python 3.10
pipenv --python 3.10

# Activate environment
pipenv shell

# Install dependencies
pipenv install -r requirements.txt

Run an example

cd A01
python3 webcrawler1.py --maxdepth 2 --verbose TRUE https://example.com

cd ../A02
python3 wikipedia_processing.py --zipf --invertedindex

cd ../A03
python3 training_sentiment.py --imdb --svm

📂 Repository Structure

TextRetrieval/
│
├── A01/
│   ├── webcrawler1.py       # General recursive web crawler
│   ├── webcrawler2.py       # Researcher profile + pagination crawler
│   └── webcrawler3.py       # HTML content boundary extractor
│
├── A02/
│   ├── wikipedia_processing.py   # NLP pipeline: tokenize, stem, stopwords, index
│   ├── elias_coding.py           # Elias Gamma/Delta compression coding
│   ├── page_rank.py              # PageRank on Stanford web graph
│   └── noisy_channel.py          # Probabilistic spell correction
│
├── A03/
│   ├── training_sentiment.py     # Sentiment classification (SVM, NB, DT, KNN)
│   └── cluster_news.py           # News clustering (KMeans, DBSCAN, Hierarchical)
│
├── requirements.txt
├── Pipfile
└── README.md

Mobina Tooranisama, Aidan Traboulay, Nausher Rao · Text Retrieval & Search Engine · Wilfrid Laurier University

About

A compilation of all my assignments from my text retrieval & search engine course, written in Python and run in a virtual environment
