A collection of command-line programs implementing core information retrieval concepts, from web crawling and inverted indexes to PageRank, NLP pipelines, and ML-based sentiment analysis.
A01 · Web Crawling · A02 · IR & NLP · A03 · ML Classification & Clustering · Setup
This project spans the full pipeline of a modern search and information retrieval system; the skills involved apply directly to NLP, document understanding, and enterprise AI systems:
| Concept | Where |
|---|---|
| Web crawling & HTML parsing | A01 |
| Hash-based content storage | A01 |
| Zipf's Law & corpus linguistics | A02 |
| Tokenization, stemming, stopword removal | A02 |
| Inverted index construction | A02 |
| Elias Gamma & Delta compression coding | A02 |
| PageRank on large-scale graph (2.3M edges) | A02 |
| Noisy Channel Model & spell correction | A02 |
| Sentiment classification (Naive Bayes, SVM, Decision Tree, KNN) | A03 |
| Text clustering (KMeans, DBSCAN, Hierarchical) | A03 |
Three progressively complex web crawlers that download, store, and parse web content.
Recursively crawls from an initial URL up to a configurable maximum depth.
- Downloads page content and stores it as `H.txt`, where `H` is the MD5 hash of the URL (via `hashlib`)
- Extracts all hyperlinks from the downloaded content
- Appends structured log entries `<H, URL, DateTime, HTTP_Status>` to `crawler1.log`
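The hash-based naming scheme can be sketched as follows (the function name `page_filename` is illustrative, not taken from the script):

```python
import hashlib

def page_filename(url: str) -> str:
    """Name a downloaded page H.txt, where H is the MD5 hash of its URL."""
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    return f"{h}.txt"

# The same URL always maps to the same file, so re-crawls are idempotent.
print(page_filename("https://example.com"))
```

Content addressing by URL hash makes the `--rewrite` check a simple `os.path.exists` test on the derived filename.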
```
$ python3 webcrawler1.py --maxdepth 3 --verbose TRUE https://example.com
```

| Flag | Description | Default |
|---|---|---|
| `--maxdepth` | Max crawl depth from the initial URL | (none) |
| `--rewrite` | Re-download even if `H.txt` already exists | `FALSE` |
| `--verbose` | Print `<URL, depth>` during the crawl | `FALSE` |
Crawls an academic researcher's profile page and extracts structured publication data as JSON (`H.json`). Handles pagination via dynamic POST requests to retrieve the full publication list beyond initial page limits.
```
$ python3 webcrawler2.py https://researcher-profile-url.com
```

Identifies the main content region of a webpage using a scoring function over HTML/token density.
Method: Tags are replaced with 1, words with 0, giving a binary sequence `b_0 … b_{N-1}`. The optimal window `(i*, j*)` maximizes the number of tags outside the window plus the number of words inside it:

f(i, j) = (tags before position i) + (words in [i, j]) + (tags after position j)

- Extracts the content between `i*` and `j*` into `H.txt`
- Generates a 2D/3D heatmap of all `f(i, j)` values to visualize the content boundary
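A brute-force sketch of the window search, assuming the tag/word sequence is already reduced to a 0/1 list (prefix sums keep each `f(i, j)` evaluation O(1); the real script also renders the heatmap):

```python
def best_content_window(b):
    """Given b[n] = 1 for tag tokens and 0 for word tokens, find (i*, j*)
    maximizing f(i, j) = tags before i + words in [i, j] + tags after j."""
    n = len(b)
    prefix = [0] * (n + 1)                       # prefix[k] = tags in b[:k]
    for k, t in enumerate(b):
        prefix[k + 1] = prefix[k] + t
    best, best_ij = -1, (0, 0)
    for i in range(n):
        for j in range(i, n):
            tags_before = prefix[i]
            words_inside = (j - i + 1) - (prefix[j + 1] - prefix[i])
            tags_after = prefix[n] - prefix[j + 1]
            f = tags_before + words_inside + tags_after
            if f > best:
                best, best_ij = f, (i, j)
    return best_ij

# Tags at both ends, a run of words in the middle:
print(best_content_window([1, 1, 0, 0, 0, 1, 1]))  # -> (2, 4)
```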
```
$ python3 webcrawler3.py https://example.com
```

Core NLP preprocessing pipeline, compression coding, large-scale PageRank, and probabilistic spell correction.
Full text processing pipeline on a Wikipedia corpus with modular CLI flags.
```
$ python3 wikipedia_processing.py [--zipf] [--tokenize] [--stemming] [--stopword] [--invertedindex]
```

| Flag | What It Does |
|---|---|
| `--zipf` | Plots the top 50 word frequencies; validates Zipf's Law on the corpus |
| `--tokenize` | Tokenizes raw text and outputs the token list to file |
| `--stemming` | Applies Porter stemming to reduce words to their root forms |
| `--stopword` | Removes stopwords (using the NLTK stopword list) from the token stream |
| `--invertedindex` | Builds an inverted index from the stemmed, cleaned tokens; outputs to file and stdout |
What's an inverted index? It maps each unique term → the list of documents it appears in. It's the core data structure behind every search engine.
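A minimal term → posting-list construction, using document IDs only (a simplification of what `--invertedindex` produces from the stemmed, cleaned tokens):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "search engines index text", 2: "engines rank text"}
index = build_inverted_index(docs)
print(index["engines"])  # -> [1, 2]
print(index["rank"])     # -> [2]
```

Sorted posting lists are what make the compression coding in the next section worthwhile: gaps between consecutive IDs are small integers.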
Implements two lossless integer compression algorithms used in compressed inverted indexes for efficient storage of posting lists.
```
$ python3 elias_coding.py --alg elias_gamma --encode 1 2 3 4 5
$ python3 elias_coding.py --alg elias_delta --decode 10110
```

Elias Gamma: encodes a positive integer `N` as `⌊log₂N⌋` zeros followed by the binary representation of `N`.

Elias Delta: similar, but first gamma-codes the length of `N`'s binary representation, then appends `N`'s bits without the leading 1.

Both include full decode + verification logic.
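The gamma code in isolation can be sketched like this (a standalone illustration, not the repo's `elias_coding.py`, which also handles delta coding and multi-number streams):

```python
def gamma_encode(n: int) -> str:
    """Elias gamma: floor(log2 n) zeros, then n in binary (n >= 1)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str) -> int:
    """Inverse: count leading zeros z, then read the next z + 1 bits."""
    z = len(bits) - len(bits.lstrip("0"))
    return int(bits[z:2 * z + 1], 2)

print(gamma_encode(5))        # -> 00101
print(gamma_decode("00101"))  # -> 5
```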
Applies the PageRank algorithm to the `web-Stanford.txt` dataset containing 2,312,502 directed edges.
```
$ python3 page_rank.py --maxiteration 100 --lambda 0.85 --thr 0.0001 --nodes 1 2 3 4 5
```

Performance optimizations made:

- Precomputes the reciprocal of each node's outbound edge count once before the iteration loop (avoids redundant division per step)
- Builds the adjacency structure in a single O(E) pass
- Selection sort on the final scores for top-N output
| Parameter | Description |
|---|---|
| `--maxiteration` | Max number of power-iteration steps |
| `--lambda` | Damping factor (typically 0.85) |
| `--thr` | Convergence threshold |
| `--nodes` | Space-separated list of node IDs to rank |
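The power iteration with the optimizations above can be sketched on a toy graph (names are illustrative, and dangling-node mass is assumed to be spread uniformly; the real script streams `web-Stanford.txt`):

```python
def pagerank(edges, lam=0.85, thr=1e-4, max_iter=100):
    """Power iteration with damping factor lam on (src, dst) edge pairs."""
    nodes = sorted({n for e in edges for n in e})
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:                       # single O(E) pass
        out[src].append(dst)
    # reciprocal of out-degree, precomputed once before the loop
    inv_deg = {v: 1.0 / len(ts) for v, ts in out.items() if ts}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        nxt = {v: (1 - lam) / n for v in nodes}
        dangling = sum(r for v, r in rank.items() if not out[v])
        for v in nodes:                          # dangling mass, uniform
            nxt[v] += lam * dangling / n
        for v, targets in out.items():
            share = lam * rank[v] * inv_deg[v]
            for t in targets:
                nxt[t] += share
        if sum(abs(nxt[v] - rank[v]) for v in nodes) < thr:
            rank = nxt
            break
        rank = nxt
    return rank

ranks = pagerank([(1, 2), (2, 1), (2, 3), (3, 1)])
```

Each iteration preserves total rank mass, so the scores always sum to 1.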
Implements the Noisy Channel Model for spelling correction:
```
$ python noisy_channel.py --correct hte recieve wrold
$ python noisy_channel.py --proba the receive world
```

- Candidate generation via edit-distance-1 operations: delete, transpose, replace, insert
- The channel model computes transformation cost using character-level edit distance
- The language model uses unigram probabilities from the training corpus (via `Counter`)
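Candidate generation at edit distance 1 is commonly implemented as below (a sketch in the style of Norvig's spell corrector; not necessarily the repo's exact code):

```python
def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print("the" in edits1("hte"))  # -> True
```

Each candidate is then scored by channel probability times unigram language-model probability, and the highest-scoring one wins.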
Supervised sentiment classification and unsupervised news clustering using scikit-learn.
Trains and evaluates text classifiers on three real-world review datasets (IMDb, Amazon, Yelp).
```
$ python3 training_sentiment.py --imdb --svm
$ python3 training_sentiment.py --amazon --naive
$ python3 training_sentiment.py --yelp --knn 5
```

Classifiers supported:
| Flag | Algorithm |
|---|---|
| `--naive` | Multinomial Naive Bayes |
| `--svm` | Support Vector Machine |
| `--decisiontree` | Decision Tree |
| `--knn <n>` | K-Nearest Neighbours (specify k) |
Pipeline:

- Load dataset → `pandas` DataFrame
- Vectorize text (TF or TF-IDF)
- Train/test split + cross-validation
- Evaluate: accuracy, precision, recall, F1
- Save confusion matrix to disk
- Serialize the trained classifier + vectorizer (`pickle`)
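The same pipeline in miniature, on a toy stand-in corpus (scikit-learn assumed installed; the real script loads the IMDb/Amazon/Yelp files and adds cross-validation, metrics, and pickling):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; labels: 1 = positive, 0 = negative
texts = ["great movie", "terrible plot", "loved it", "awful acting"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["great movie"]))  # -> [1]
```

`make_pipeline` keeps the vectorizer and classifier as one object, so a single `pickle.dump(clf, f)` serializes both together.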
Clusters the 20 Newsgroups dataset (18,000+ documents, 20 categories) using four unsupervised algorithms.
```
$ python3 cluster_news.py --ncluster 20 --kmeans
$ python3 cluster_news.py --ncluster 5 --dbscan
```

| Flag | Algorithm |
|---|---|
| `--kmeans` | K-Means clustering |
| `--whc` | Ward Hierarchical Clustering |
| `--ac` | Agglomerative Clustering |
| `--dbscan` | DBSCAN (density-based) |
Preprocessing: headers, footers, and quotes are stripped via the `remove` option of `sklearn.datasets.fetch_20newsgroups`. TF-IDF vectorization → dimensionality reduction → cluster assignment.
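The clustering stage can be sketched on a synthetic four-document corpus rather than 20 Newsgroups (scikit-learn assumed installed, K-Means shown as a representative of the four algorithms):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["nasa launches rocket", "rocket reaches orbit",
        "team wins hockey game", "hockey playoffs begin"]

X = TfidfVectorizer().fit_transform(docs)        # sparse TF-IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the space docs share one cluster, the hockey docs the other
```

`fit_predict` accepts the sparse TF-IDF matrix directly; swapping in `AgglomerativeClustering` or `DBSCAN` changes only the estimator line.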
| Library | Used For |
|---|---|
| `requests` / `BeautifulSoup` | Web crawling, HTML parsing |
| `hashlib` | URL-based content addressing |
| `re` | Regex-based HTML tag identification |
| `nltk` | Tokenization, stemming, stopword removal |
| `scikit-learn` | Vectorization, classifiers, clustering, metrics |
| `pandas` | Dataset loading and preprocessing |
| `matplotlib` | Zipf's Law plots, `f(i, j)` heatmaps, confusion matrices |
| `pickle` | Classifier serialization |
| `argparse` | CLI argument handling across all programs |
```
# Clone the repo
git clone https://github.com/MobinaToorani/TextRetrieval.git
cd TextRetrieval

# Install pipenv
pip install pipenv

# Create environment with Python 3.10
pipenv --python 3.10

# Activate environment
pipenv shell

# Install dependencies
pipenv install -r requirements.txt
```

```
cd A01
python3 webcrawler1.py --maxdepth 2 --verbose TRUE https://example.com

cd ../A02
python3 wikipedia_processing.py --zipf --invertedindex

cd ../A03
python3 training_sentiment.py --imdb --svm
```

```
TextRetrieval/
│
├── A01/
│   ├── webcrawler1.py            # General recursive web crawler
│   ├── webcrawler2.py            # Researcher profile + pagination crawler
│   └── webcrawler3.py            # HTML content boundary extractor
│
├── A02/
│   ├── wikipedia_processing.py   # NLP pipeline: tokenize, stem, stopwords, index
│   ├── elias_coding.py           # Elias Gamma/Delta compression coding
│   ├── page_rank.py              # PageRank on Stanford web graph
│   └── noisy_channel.py          # Probabilistic spell correction
│
├── A03/
│   ├── training_sentiment.py     # Sentiment classification (SVM, NB, DT, KNN)
│   └── cluster_news.py           # News clustering (KMeans, DBSCAN, Hierarchical)
│
├── requirements.txt
├── Pipfile
└── README.md
```
Mobina Tooranisama, Aidan Traboulay, Nausher Rao · Text Retrieval & Search Engine · Wilfrid Laurier University