A comprehensive data engineering and analysis project on 50+ million Reddit comments
View Live Demo
| Name | ID |
|---|---|
| Hrithik Gaikwad | hg3916 |
| Jie Zhang | jz7563 |
| Siddharth Bhople | sb8336 |
This project implements a complete data pipeline for analyzing the Reddit Comments May 2015 dataset (~50 million comments, ~30GB). The work spans three phases:
| Phase | Focus | Technology |
|---|---|---|
| Phase 1 | Relational Data Model | PostgreSQL, SQL DDL |
| Phase 2 | Document-Oriented Model | MongoDB, PyMongo |
| Phase 3 | Data Mining & Cleaning | Apriori Algorithm, Association Rules |
- Source: Kaggle - Reddit Comments May 2015
- Size: ~50 million comments, ~30GB compressed
- Format: SQLite database (`database.sqlite`)
- Time Period: May 2015
- Fields: author, subreddit, body, score, ups, downs, gilded, controversiality, and more
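The field list can be checked directly with Python's built-in `sqlite3` module. The sketch below uses a tiny in-memory stand-in for `database.sqlite` (the comment table in the Kaggle file is, to our knowledge, named `May2015`); the same queries work against the full 30GB file.

```python
import sqlite3

# In-memory stand-in for data/reddit-comments-may-2015/database.sqlite;
# the real file's comment table is assumed to be named May2015.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE May2015 (
        id TEXT PRIMARY KEY, author TEXT, subreddit TEXT,
        body TEXT, score INTEGER, ups INTEGER, downs INTEGER,
        gilded INTEGER, controversiality INTEGER
    )
""")
conn.executemany(
    "INSERT INTO May2015 VALUES (?,?,?,?,?,?,?,?,?)",
    [("c1", "alice", "AskReddit", "hello", 5, 5, 0, 0, 0),
     ("c2", "bob", "funny", "lol", -2, 1, 3, 0, 1)],
)

# List the columns -- mirrors the "Fields" bullet above
cols = [row[1] for row in conn.execute("PRAGMA table_info(May2015)")]
print(cols)

# Count rows without loading everything into memory
(total,) = conn.execute("SELECT COUNT(*) FROM May2015").fetchone()
print(total)
```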
```
Reddit-Comments-May-2015-Data-Analysis/
│
├── data/                              # Dataset directory (not included in repo)
│   └── reddit-comments-may-2015/
│       └── database.sqlite            # ~30GB SQLite file
│
├── phase1_relational/                 # Phase 1: Relational Model
│   ├── code/
│   │   └── load_data.py               # SQLite → PostgreSQL loader
│   ├── ddl/
│   │   ├── create_ddl_queries.sql     # Schema creation DDL
│   │   └── test_queries.sql           # Validation queries
│   ├── diagrams/
│   │   └── relational_schema.jpg      # ER diagram
│   ├── docs/
│   │   ├── Data Description.docx
│   │   └── Data_Description_Group_7.pdf
│   └── README_phase1.md
│
├── phase2_document_model/             # Phase 2: Document Model
│   ├── code/
│   │   ├── load_to_mongo.py           # SQLite → MongoDB loader
│   │   ├── phase2_queries.py          # Benchmark queries
│   │   ├── sample_queries.js          # MongoDB validation queries
│   │   └── discover_functional_dependencies.py
│   ├── diagrams/
│   │   └── doc_model_visual.png       # Document model diagram
│   └── docs/
│       ├── document_model_report.md
│       ├── functional_dependencies_report.md
│       └── README_Queries.md
│
├── phase3_data_mining/                # Phase 3: Data Mining
│   ├── code/
│   │   ├── association_rule_mining.py # Apriori + Association Rules
│   │   ├── data_cleaner.py            # Data cleaning pipeline
│   │   └── mongo_to_relational.py     # Data migration utility
│   ├── docs/
│   │   └── association_rule_mining_README.md
│   └── sql/
│       ├── dirty_data_finder.sql      # Identify dirty data
│       └── verify_cleaning.sql        # Validate cleaned data
│
├── submission/                        # Final submission package
│   └── CSCI620_Term_Project_Group_7.zip
│
├── website/                           # Project showcase website
│   ├── index.html                     # Main HTML page
│   ├── styles.css                     # Styling
│   ├── script.js                      # Interactive features
│   └── assets/images/                 # Diagrams and images
│
├── kaggle.json                        # Kaggle API credentials
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
Live Website: reddit-comments-may-2015-data-analy.vercel.app
- Python 3.8+
- PostgreSQL 14+
- MongoDB 6.0+
- ~50GB disk space for the dataset
```
git clone https://github.com/SiD-array/Reddit-Comments-May-2015-Data-Analysis.git
cd Reddit-Comments-May-2015-Data-Analysis
```

Install the Python dependencies:

```
pip install -r requirements.txt
```

Place your `kaggle.json` in the project root:

```json
{
  "username": "your_kaggle_username",
  "key": "your_kaggle_api_key"
}
```

The dataset will be auto-downloaded when you run the loaders, or manually:

```
kaggle datasets download -d kaggle/reddit-comments-may-2015
```

Phase 1 implements a normalized relational schema with 6 tables following 3NF principles.
| Table | Description | Primary Key |
|---|---|---|
| `Users` | Reddit authors and flair info | `author` |
| `Subreddit` | Subreddit metadata | `subreddit_id` |
| `Post` | Post-level data | `link_id` |
| `Post_Link` | Post references | `link_id` |
| `Comment` | All comments with scores | `id` |
| `Moderation` | Moderation actions | `mod_action_id` |
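To illustrate the 3NF split, the hypothetical helper below shows how one denormalized comment row fans out into `Users`, `Subreddit`, and `Comment` tuples. Column names here are illustrative; the authoritative schema lives in `phase1_relational/ddl/create_ddl_queries.sql`.

```python
def normalize(raw: dict) -> dict:
    """Split one denormalized comment row into 3NF table tuples.

    Sketch only -- the real schema is defined in
    phase1_relational/ddl/create_ddl_queries.sql.
    """
    return {
        "Users": {"author": raw["author"],
                  "author_flair_text": raw.get("author_flair_text")},
        "Subreddit": {"subreddit_id": raw["subreddit_id"],
                      "subreddit": raw["subreddit"]},
        "Comment": {"id": raw["id"],
                    "author": raw["author"],              # FK -> Users
                    "subreddit_id": raw["subreddit_id"],  # FK -> Subreddit
                    "body": raw["body"],
                    "score": raw["score"]},
    }

row = {"id": "c1", "author": "alice", "author_flair_text": None,
       "subreddit": "AskReddit", "subreddit_id": "t5_2qh1i",
       "body": "hello", "score": 5}
tables = normalize(row)
print(tables["Comment"]["author"])  # → alice
```

Author and subreddit attributes are stored once in their own tables, so each comment row carries only the foreign keys.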
```
# Create database
psql -U postgres -c "CREATE DATABASE redditdb;"

# Run DDL
psql -U postgres -d redditdb -f phase1_relational/ddl/create_ddl_queries.sql
```

```
python phase1_relational/code/load_data.py \
  --input data/reddit-comments-may-2015/database.sqlite \
  --host localhost \
  --port 5432 \
  --user postgres \
  --password yourpassword \
  --dbname redditdb

# Test with sample data
python phase1_relational/code/load_data.py \
  --input data/reddit-comments-may-2015/database.sqlite \
  --password yourpassword \
  --dbname redditdb \
  --sample 10000
```

```
psql -U postgres -d redditdb -f phase1_relational/ddl/test_queries.sql
```

Phase 2 re-models the dataset into a hybrid document architecture optimized for MongoDB.
| Collection | Purpose | Key Design |
|---|---|---|
| `users` | Author profiles | `_id` = author name |
| `subreddits` | Subreddit metadata | `_id` = subreddit_id |
| `posts` | Posts with embedded comments | Hybrid: embeds top N comments |
| `comments` | All comments (analytics) | Full comment storage |
| `moderation` | Moderation signals | Composite `_id` |
- Hybrid Embedding: Top N comments embedded in posts for fast reads
- Bulk Operations: Efficient `bulk_write()` for high throughput
- Streaming Ingestion: Chunked SQLite reads (50K rows/chunk)
- Idempotent Upserts: Safe for re-runs
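The hybrid embedding can be sketched in plain Python: every comment stays in the `comments` collection, but only the top N (by score, capped by `--embed-cap`) are copied into the parent post document. Function and field names below are illustrative, not the loader's actual internals.

```python
def build_post_doc(post: dict, comments: list, embed_cap: int = 200) -> dict:
    """Embed the highest-scoring comments in the post document,
    mirroring the hybrid design above (names are illustrative)."""
    top = sorted(comments, key=lambda c: c["score"], reverse=True)[:embed_cap]
    return {
        "_id": post["link_id"],           # natural key -> idempotent upserts
        "subreddit_id": post["subreddit_id"],
        "num_comments": len(comments),    # full count kept on the post
        "top_comments": [
            {"_id": c["id"], "author": c["author"], "score": c["score"]}
            for c in top
        ],
    }

post = {"link_id": "t3_abc", "subreddit_id": "t5_2qh1i"}
comments = [{"id": f"c{i}", "author": "alice", "score": i} for i in range(5)]
doc = build_post_doc(post, comments, embed_cap=3)
print([c["score"] for c in doc["top_comments"]])  # → [4, 3, 2]
```

Because `_id` is the natural post key, upserting these documents with `bulk_write()` can be repeated safely, which is what makes re-runs idempotent.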
```
python phase2_document_model/code/load_to_mongo.py \
  --input data/reddit-comments-may-2015/database.sqlite \
  --mongo_uri "mongodb://localhost:27017/" \
  --dbname reddit_may2015 \
  --chunksize 50000 \
  --embed-cap 200 \
  --reset
```

```
# MongoDB Shell
mongosh phase2_document_model/code/sample_queries.js

# Python benchmark queries
python phase2_document_model/code/phase2_queries.py \
  --host localhost \
  --user postgres \
  --password yourpassword \
  --dbname redditdb
```

```
python phase2_document_model/code/discover_functional_dependencies.py \
  --input data/reddit-comments-may-2015/database.sqlite \
  --sample 1000000
```

Phase 3 focuses on data quality and pattern discovery using association rule mining.
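Functional-dependency checks like those in the Phase 2 discovery script feed directly into data-quality work: a dependency X → Y holds on a sample exactly when no X-value maps to two distinct Y-values. A minimal brute-force sketch (the real script's interface may differ):

```python
def fd_holds(rows, lhs, rhs):
    """Return True iff the functional dependency lhs -> rhs holds
    on the given rows. Brute-force sketch of the kind of check
    discover_functional_dependencies.py might perform."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # same lhs value, two rhs values -> violated
    return True

rows = [
    {"subreddit_id": "t5_1", "subreddit": "AskReddit", "score": 5},
    {"subreddit_id": "t5_1", "subreddit": "AskReddit", "score": 2},
    {"subreddit_id": "t5_2", "subreddit": "funny",     "score": 2},
]
print(fd_holds(rows, ["subreddit_id"], ["subreddit"]))  # → True
print(fd_holds(rows, ["score"], ["subreddit"]))         # → False
```

A sample-based check like this can only refute a dependency, never prove it for the full 50M rows, which is why `--sample` controls the trade-off between runtime and confidence.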
The cleaning pipeline handles:
| Issue | Action |
|---|---|
| Missing authors | Drop row |
| Empty bodies | Drop row |
| `[deleted]`/`[removed]` content | Drop row |
| Invalid timestamps | Drop row |
| Inconsistent scores | Fix: `score = ups - downs` |
| Symbol-only flair | Set to NULL |
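The table above maps onto a simple per-row filter/repair function. The sketch below is illustrative; the actual pipeline in `data_cleaner.py` applies these rules in SQL batches.

```python
def clean_row(row: dict):
    """Apply the cleaning rules above to one comment row.
    Returns the (possibly repaired) row, or None to drop it.
    Sketch only -- data_cleaner.py works in batched SQL."""
    if not row.get("author"):
        return None                       # missing author -> drop
    body = (row.get("body") or "").strip()
    if body in ("", "[deleted]", "[removed]"):
        return None                       # empty/deleted body -> drop
    if row.get("created_utc", 0) <= 0:
        return None                       # invalid timestamp -> drop
    if row["score"] != row["ups"] - row["downs"]:
        row["score"] = row["ups"] - row["downs"]  # fix inconsistent score
    flair = row.get("author_flair_text")
    if flair is not None and not any(ch.isalnum() for ch in flair):
        row["author_flair_text"] = None   # symbol-only flair -> NULL
    return row

raw = {"author": "alice", "body": "hi", "created_utc": 1430438400,
       "score": 9, "ups": 5, "downs": 0, "author_flair_text": "***"}
cleaned = clean_row(raw)
print(cleaned["score"], cleaned["author_flair_text"])  # → 5 None
```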
```
python phase3_data_mining/code/data_cleaner.py \
  --host localhost \
  --user postgres \
  --password yourpassword \
  --dbname redditdb \
  --batch-size 50000
```

Association rule mining discovers patterns in Reddit comments using the Apriori algorithm.
Each comment becomes a transaction with:
- Subreddit: `subreddit:AskReddit`
- Score Category: `very_high_score`, `high_score`, `medium_score`, `low_score`, `negative_score`
- Status Flags: `gilded`, `controversial`, `edited`, `distinguished`, `archived`
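The encoding above might look like the following (the score thresholds here are illustrative assumptions; the real bucket boundaries live in `association_rule_mining.py`):

```python
def to_transaction(c: dict) -> set:
    """Encode one comment as an Apriori transaction (itemset).
    Score bucket thresholds are illustrative assumptions."""
    items = {f"subreddit:{c['subreddit']}"}
    s = c["score"]
    if s >= 100:
        items.add("very_high_score")
    elif s >= 10:
        items.add("high_score")
    elif s >= 2:
        items.add("medium_score")
    elif s >= 0:
        items.add("low_score")
    else:
        items.add("negative_score")
    # Status flags become items only when set
    for flag in ("gilded", "controversial", "edited",
                 "distinguished", "archived"):
        if c.get(flag):
            items.add(flag)
    return items

t = to_transaction({"subreddit": "AskReddit", "score": 150,
                    "gilded": 1, "controversial": 0})
print(sorted(t))  # → ['gilded', 'subreddit:AskReddit', 'very_high_score']
```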
| Parameter | Default | Description |
|---|---|---|
| `min_support` | 0.01 (1%) | Minimum frequency threshold |
| `min_confidence` | 0.5 (50%) | Minimum rule confidence |
```
python phase3_data_mining/code/association_rule_mining.py \
  --host localhost \
  --user postgres \
  --password yourpassword \
  --dbname redditdb \
  --min-support 0.03 \
  --min-confidence 0.5 \
  --sample 100000
```

| Metric | Description |
|---|---|
| Support | Frequency of itemset in all transactions |
| Confidence | P(consequent \| antecedent) |
| Lift | Strength of association (>1 = positive) |
| Conviction | Ratio of expected to observed rule failures (>1 = stronger) |
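All four metrics are derived from itemset supports; a worked sketch for a toy rule {gilded} → {high_score} over four hand-made transactions:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, lift, and conviction for the rule
    antecedent -> consequent over a list of transactions (sets)."""
    n = len(transactions)
    supp_a  = sum(antecedent <= t for t in transactions) / n
    supp_c  = sum(consequent <= t for t in transactions) / n
    supp_ac = sum((antecedent | consequent) <= t for t in transactions) / n
    conf = supp_ac / supp_a                 # P(consequent | antecedent)
    lift = conf / supp_c                    # >1 = positive association
    # Conviction: expected failure rate under independence / observed rate
    conv = (1 - supp_c) / (1 - conf) if conf < 1 else float("inf")
    return supp_ac, conf, lift, conv

tx = [{"gilded", "high_score"},
      {"gilded", "high_score"},
      {"gilded", "low_score"},
      {"low_score"}]
supp, conf, lift, conv = rule_metrics(tx, {"gilded"}, {"high_score"})
print(supp, round(conf, 3), round(lift, 3), conv)  # → 0.5 0.667 1.333 1.5
```

The project's miner uses `mlxtend` for the real computation; this sketch just makes the definitions in the table concrete.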
| Query | Before Index | After Index | Speedup |
|---|---|---|---|
| AskReddit latest 50 posts | 2.34s | 0.12s | 19.5x |
| Top 20 subreddits by avg score | 5.67s | 0.89s | 6.4x |
| Top 20 authors by post count | 3.21s | 0.45s | 7.1x |
| Gilded but not archived posts | 1.89s | 0.08s | 23.6x |
| Posts by authors containing 'cat' | 4.56s | 0.67s | 6.8x |
| Avg comments per post (top 10) | 8.23s | 1.34s | 6.1x |
```
psycopg2-binary>=2.9.0   # PostgreSQL adapter
pandas>=1.3.0            # Data manipulation
requests>=2.25.0         # HTTP library
tqdm>=4.60.0             # Progress bars
kaggle                   # Dataset download
mlxtend>=0.22.0          # Association rule mining
pymongo                  # MongoDB driver
```
| Document | Location |
|---|---|
| Phase 1 README | phase1_relational/README_phase1.md |
| Document Model Report | phase2_document_model/docs/document_model_report.md |
| Functional Dependencies | phase2_document_model/docs/functional_dependencies_report.md |
| Query Documentation | phase2_document_model/docs/README_Queries.md |
| Association Rule Mining | phase3_data_mining/docs/association_rule_mining_README.md |
| Issue | Solution |
|---|---|
| Connection refused | Ensure PostgreSQL/MongoDB is running |
| Missing kaggle.json | Download API token from Kaggle account |
| Permission denied | Check database user privileges |
| Disk space full | Dataset requires ~50GB during processing |
| Memory error | Use the `--sample` flag to test on a subset |
| Download timeout | Ensure stable internet (~20GB download) |
This project was developed for educational purposes as part of the CSCI-620 (Big Data) coursework.
- Kaggle for hosting the Reddit Comments dataset
- Reddit for the original data
- RIT CSCI-620 course staff for guidance
Made with ❤️ by Group 7