A collection of command-line programs implementing core information retrieval concepts, from web crawling and inverted indexes to PageRank, NLP pipelines, and ML-based sentiment analysis.
A01 · Web Crawling · A02 · IR & NLP · A03 · ML Classification & Clustering · Setup
This project spans the full pipeline of a modern search and information retrieval system; the skills involved apply directly to NLP, document understanding, and enterprise AI systems:
| Concept | Where |
|---|---|
| Web crawling & HTML parsing | A01 |
| Hash-based content storage | A01 |
| Zipf's Law & corpus linguistics | A02 |
| Tokenization, stemming, stopword removal | A02 |
| Inverted index construction | A02 |
| Elias Gamma & Delta compression coding | A02 |
| PageRank on large-scale graph (2.3M edges) | A02 |
| Noisy Channel Model & spell correction | A02 |
| Sentiment classification (Naive Bayes, SVM, Decision Tree, KNN) | A03 |
| Text clustering (KMeans, DBSCAN, Hierarchical) | A03 |
Three progressively complex web crawlers that download, store, and parse web content.
Recursively crawls from an initial URL up to a configurable maximum depth.
- Downloads page content and stores it as `H.txt`, where `H` is the MD5 hash of the URL (via `hashlib`)
- Extracts all hyperlinks from the downloaded content
- Appends structured log entries `<H, URL, DateTime, HTTP_Status>` to `crawler1.log`
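The hash-based naming scheme can be sketched as follows (the function name `page_filename` is illustrative, not taken from the script):

```python
import hashlib

def page_filename(url: str) -> str:
    """Name a downloaded page H.txt, where H is the MD5 hash of its URL."""
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    return f"{h}.txt"

# The same URL always maps to the same file, so re-crawls are idempotent.
print(page_filename("https://example.com"))
```

Content addressing by URL hash makes the `--rewrite` check a simple `os.path.exists` test on the derived filename.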
```
$ python3 webcrawler1.py --maxdepth 3 --verbose TRUE https://example.com
```

| Flag | Description | Default |
|---|---|---|
| `--maxdepth` | Max crawl depth from the initial URL | (none) |
| `--rewrite` | Re-download even if `H.txt` already exists | `FALSE` |
| `--verbose` | Print `<URL, depth>` during the crawl | `FALSE` |
Crawls an academic researcher's profile page and extracts structured publication data as JSON (`H.json`). Handles pagination via dynamic POST requests to retrieve the full publication list beyond initial page limits.
```
$ python3 webcrawler2.py https://researcher-profile-url.com
```

Identifies the main content region of a webpage using a scoring function over HTML/token density.
Method: Tags are replaced with 1, words with 0, giving a binary sequence `b_0 … b_{N-1}`. The optimal window `(i*, j*)` maximizes the number of tags outside the window plus the number of words inside it:

f(i, j) = (tags before position i) + (words in [i, j]) + (tags after position j)

- Extracts the content between `i*` and `j*` into `H.txt`
- Generates a 2D/3D heatmap of all `f(i, j)` values to visualize the content boundary
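A brute-force sketch of the window search, assuming the tag/word sequence is already reduced to a 0/1 list (prefix sums keep each `f(i, j)` evaluation O(1); the real script also renders the heatmap):

```python
def best_content_window(b):
    """Given b[n] = 1 for tag tokens and 0 for word tokens, find (i*, j*)
    maximizing f(i, j) = tags before i + words in [i, j] + tags after j."""
    n = len(b)
    prefix = [0] * (n + 1)                       # prefix[k] = tags in b[:k]
    for k, t in enumerate(b):
        prefix[k + 1] = prefix[k] + t
    best, best_ij = -1, (0, 0)
    for i in range(n):
        for j in range(i, n):
            tags_before = prefix[i]
            words_inside = (j - i + 1) - (prefix[j + 1] - prefix[i])
            tags_after = prefix[n] - prefix[j + 1]
            f = tags_before + words_inside + tags_after
            if f > best:
                best, best_ij = f, (i, j)
    return best_ij

# Tags at both ends, a run of words in the middle:
print(best_content_window([1, 1, 0, 0, 0, 1, 1]))  # -> (2, 4)
```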
```
$ python3 webcrawler3.py https://example.com
```

Core NLP preprocessing pipeline, compression coding, large-scale PageRank, and probabilistic spell correction.
Full text processing pipeline on a Wikipedia corpus with modular CLI flags.
```
$ python3 wikipedia_processing.py [--zipf] [--tokenize] [--stemming] [--stopword] [--invertedindex]
```

| Flag | What It Does |
|---|---|
| `--zipf` | Plots the top 50 word frequencies; validates Zipf's Law on the corpus |
| `--tokenize` | Tokenizes raw text and outputs the token list to file |
| `--stemming` | Applies Porter stemming to reduce words to their root forms |
| `--stopword` | Removes stopwords (using the NLTK stopword list) from the token stream |
| `--invertedindex` | Builds an inverted index from the stemmed, cleaned tokens; outputs to file and stdout |
What's an inverted index? It maps each unique term → the list of documents it appears in. It's the core data structure behind every search engine.
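A minimal term → posting-list construction, using document IDs only (a simplification of what `--invertedindex` produces from the stemmed, cleaned tokens):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "search engines index text", 2: "engines rank text"}
index = build_inverted_index(docs)
print(index["engines"])  # -> [1, 2]
print(index["rank"])     # -> [2]
```

Sorted posting lists are what make the compression coding in the next section worthwhile: gaps between consecutive IDs are small integers.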
Implements two lossless integer compression algorithms used in compressed inverted indexes for efficient storage of posting lists.
```
$ python3 elias_coding.py --alg elias_gamma --encode 1 2 3 4 5
$ python3 elias_coding.py --alg elias_delta --decode 10110
```

Elias Gamma: encodes a positive integer `N` as `⌊log₂N⌋` zeros followed by the binary representation of `N`.

Elias Delta: similar, but first gamma-codes the length of `N`'s binary representation, then appends `N`'s bits without the leading 1.

Both include full decode + verification logic.
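The gamma code in isolation can be sketched like this (a standalone illustration, not the repo's `elias_coding.py`, which also handles delta coding and multi-number streams):

```python
def gamma_encode(n: int) -> str:
    """Elias gamma: floor(log2 n) zeros, then n in binary (n >= 1)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str) -> int:
    """Inverse: count leading zeros z, then read the next z + 1 bits."""
    z = len(bits) - len(bits.lstrip("0"))
    return int(bits[z:2 * z + 1], 2)

print(gamma_encode(5))        # -> 00101
print(gamma_decode("00101"))  # -> 5
```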
Applies the PageRank algorithm to the `web-Stanford.txt` dataset containing 2,312,502 directed edges.
```
$ python3 page_rank.py --maxiteration 100 --lambda 0.85 --thr 0.0001 --nodes 1 2 3 4 5
```

Performance optimizations made:

- Precomputes the reciprocal of each node's outbound edge count once before the iteration loop (avoids redundant division per step)
- Builds the adjacency structure in a single O(E) pass
- Selection sort on the final scores for top-N output
| Parameter | Description |
|---|---|
| `--maxiteration` | Max number of power-iteration steps |
| `--lambda` | Damping factor (typically 0.85) |
| `--thr` | Convergence threshold |
| `--nodes` | Space-separated list of node IDs to rank |
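The power iteration with the optimizations above can be sketched on a toy graph (names are illustrative, and dangling-node mass is assumed to be spread uniformly; the real script streams `web-Stanford.txt`):

```python
def pagerank(edges, lam=0.85, thr=1e-4, max_iter=100):
    """Power iteration with damping factor lam on (src, dst) edge pairs."""
    nodes = sorted({n for e in edges for n in e})
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:                       # single O(E) pass
        out[src].append(dst)
    # reciprocal of out-degree, precomputed once before the loop
    inv_deg = {v: 1.0 / len(ts) for v, ts in out.items() if ts}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        nxt = {v: (1 - lam) / n for v in nodes}
        dangling = sum(r for v, r in rank.items() if not out[v])
        for v in nodes:                          # dangling mass, uniform
            nxt[v] += lam * dangling / n
        for v, targets in out.items():
            share = lam * rank[v] * inv_deg[v]
            for t in targets:
                nxt[t] += share
        if sum(abs(nxt[v] - rank[v]) for v in nodes) < thr:
            rank = nxt
            break
        rank = nxt
    return rank

ranks = pagerank([(1, 2), (2, 1), (2, 3), (3, 1)])
```

Each iteration preserves total rank mass, so the scores always sum to 1.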
Implements the Noisy Channel Model for spelling correction:
```
$ python noisy_channel.py --correct hte recieve wrold
$ python noisy_channel.py --proba the receive world
```

- Candidate generation via edit-distance-1 operations: delete, transpose, replace, insert
- The channel model computes transformation cost using character-level edit distance
- The language model uses unigram probabilities from the training corpus (via `Counter`)
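Candidate generation at edit distance 1 is commonly implemented as below (a sketch in the style of Norvig's spell corrector; not necessarily the repo's exact code):

```python
def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print("the" in edits1("hte"))  # -> True
```

Each candidate is then scored by channel probability times unigram language-model probability, and the highest-scoring one wins.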
Supervised sentiment classification and unsupervised news clustering using scikit-learn.
Trains and evaluates text classifiers on three real-world review datasets (IMDb, Amazon, Yelp).
```
$ python3 training_sentiment.py --imdb --svm
$ python3 training_sentiment.py --amazon --naive
$ python3 training_sentiment.py --yelp --knn 5
```

Classifiers supported:
| Flag | Algorithm |
|---|---|
| `--naive` | Multinomial Naive Bayes |
| `--svm` | Support Vector Machine |
| `--decisiontree` | Decision Tree |
| `--knn <n>` | K-Nearest Neighbours (specify k) |
Pipeline:

- Load dataset → `pandas` DataFrame
- Vectorize text (TF or TF-IDF)
- Train/test split + cross-validation
- Evaluate: accuracy, precision, recall, F1
- Save confusion matrix to disk
- Serialize the trained classifier + vectorizer (`pickle`)
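The same pipeline in miniature, on a toy stand-in corpus (scikit-learn assumed installed; the real script loads the IMDb/Amazon/Yelp files and adds cross-validation, metrics, and pickling):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; labels: 1 = positive, 0 = negative
texts = ["great movie", "terrible plot", "loved it", "awful acting"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["great movie"]))  # -> [1]
```

`make_pipeline` keeps the vectorizer and classifier as one object, so a single `pickle.dump(clf, f)` serializes both together.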
Clusters the 20 Newsgroups dataset (18,000+ documents, 20 categories) using four unsupervised algorithms.
```
$ python3 cluster_news.py --ncluster 20 --kmeans
$ python3 cluster_news.py --ncluster 5 --dbscan
```

| Flag | Algorithm |
|---|---|
| `--kmeans` | K-Means clustering |
| `--whc` | Ward Hierarchical Clustering |
| `--ac` | Agglomerative Clustering |
| `--dbscan` | DBSCAN (density-based) |
Preprocessing: headers, footers, and quotes are stripped via the `remove` option of `sklearn.datasets.fetch_20newsgroups`. TF-IDF vectorization → dimensionality reduction → cluster assignment.
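The clustering stage can be sketched on a synthetic four-document corpus rather than 20 Newsgroups (scikit-learn assumed installed, K-Means shown as a representative of the four algorithms):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["nasa launches rocket", "rocket reaches orbit",
        "team wins hockey game", "hockey playoffs begin"]

X = TfidfVectorizer().fit_transform(docs)        # sparse TF-IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the space docs share one cluster, the hockey docs the other
```

`fit_predict` accepts the sparse TF-IDF matrix directly; swapping in `AgglomerativeClustering` or `DBSCAN` changes only the estimator line.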
| Library | Used For |
|---|---|
| `requests` / `BeautifulSoup` | Web crawling, HTML parsing |
| `hashlib` | URL-based content addressing |
| `re` | Regex-based HTML tag identification |
| `nltk` | Tokenization, stemming, stopword removal |
| `scikit-learn` | Vectorization, classifiers, clustering, metrics |
| `pandas` | Dataset loading and preprocessing |
| `matplotlib` | Zipf's Law plots, `f(i, j)` heatmaps, confusion matrices |
| `pickle` | Classifier serialization |
| `argparse` | CLI argument handling across all programs |
```
# Clone the repo
git clone https://github.com/MobinaToorani/TextRetrieval.git
cd TextRetrieval

# Install pipenv
pip install pipenv

# Create environment with Python 3.10
pipenv --python 3.10

# Activate environment
pipenv shell

# Install dependencies
pipenv install -r requirements.txt
```

```
cd A01
python3 webcrawler1.py --maxdepth 2 --verbose TRUE https://example.com

cd ../A02
python3 wikipedia_processing.py --zipf --invertedindex

cd ../A03
python3 training_sentiment.py --imdb --svm
```

```
TextRetrieval/
│
├── A01/
│   ├── webcrawler1.py            # General recursive web crawler
│   ├── webcrawler2.py            # Researcher profile + pagination crawler
│   └── webcrawler3.py            # HTML content boundary extractor
│
├── A02/
│   ├── wikipedia_processing.py   # NLP pipeline: tokenize, stem, stopwords, index
│   ├── elias_coding.py           # Elias Gamma/Delta compression coding
│   ├── page_rank.py              # PageRank on Stanford web graph
│   └── noisy_channel.py          # Probabilistic spell correction
│
├── A03/
│   ├── training_sentiment.py     # Sentiment classification (SVM, NB, DT, KNN)
│   └── cluster_news.py           # News clustering (KMeans, DBSCAN, Hierarchical)
│
├── requirements.txt
├── Pipfile
└── README.md
```
Mobina Tooranisama, Aidan Traboulay, Nausher Rao · Text Retrieval & Search Engine · Wilfrid Laurier University