Predicting movie success using OMDb API metadata, The Numbers box-office scraping, and machine learning pipelines.
This project builds an end-to-end machine learning system that predicts worldwide box-office performance using:
- 🎥 OMDb API → Movie metadata
- 💰 The Numbers → Budget & worldwide gross scraping
- 🧹 Data Cleaning + Feature engineering
- 🤖 ML Models → Random Forest, Gradient Boosting, Linear Regression
- 📊 Visualizations → Scatter plots, histograms, feature importances
- 🧪 Synthetic fallback dataset for testing
- 🚀 Google Colab support
movie-success-project/
│
├── data_raw/
├── data_clean/
│
├── src/
│ ├── data_collection.py
│ ├── boxoffice_scraper.py
│ ├── cleaning.py
│ ├── modeling.py
│ └── utils.py
│
├── notebooks/
│ ├── movie_pipeline.ipynb
│ └── EDA.ipynb
│
├── visuals/
│ ├── scatter_budget_vs_worldwide.png
│ ├── hist_worldwide_gross.png
│ └── rf_feature_importances.png
│
├── models/
│ └── best_model.pkl
│
├── report/
│ └── final_report.md
│
└── README.md
Data collection:
- OMDb API metadata
- Budget + box‑office scraping
- Fuzzy title matching
- Synthetic dataset generator
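Fuzzy title matching is needed because the same film is often spelled differently in OMDb and on The Numbers. The sketch below shows one way to do it with the standard library's `difflib`; it is an illustration under stated assumptions, not necessarily what `src/boxoffice_scraper.py` does. The function names `normalize_title` and `match_title` and the `0.85` cutoff are hypothetical.

```python
import difflib
import re


def normalize_title(title):
    """Lowercase, drop a trailing '(YYYY)' year tag and punctuation, collapse spaces."""
    title = re.sub(r"\(\d{4}\)\s*$", "", title.lower())
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return " ".join(title.split())


def match_title(query, candidates, cutoff=0.85):
    """Return the candidate closest to the query, or None if nothing is close enough.

    The cutoff is an assumed similarity threshold (0 to 1); tune it on real data.
    """
    # Map normalized titles back to their original spelling.
    by_norm = {normalize_title(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize_title(query), list(by_norm), n=1, cutoff=cutoff
    )
    return by_norm[hits[0]] if hits else None


if __name__ == "__main__":
    scraped = ["Spider-Man: No Way Home", "The Dark Knight", "Avatar (2009)"]
    print(match_title("Spider-Man No Way Home", scraped))
    print(match_title("Inception", scraped))
```

A high cutoff favors precision: an unmatched title is dropped rather than joined to the wrong budget row, which would silently corrupt the training target.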
Cleaning and feature engineering:
- Numeric normalization
- Missing value handling
- Genre & cast parsing
- Release decade extraction
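The cleaning steps above can be sketched with a few small helpers. This is a minimal illustration, assuming money values arrive as strings like `$1,234,567`, list fields are comma-separated, and missing fields are marked `N/A`. The helper names `parse_money`, `split_list`, and `release_decade` are hypothetical and may not match `src/cleaning.py`.

```python
import re


def parse_money(value):
    """Convert strings such as '$1,234,567' to a float; return None if missing."""
    if value is None:
        return None
    digits = re.sub(r"[^0-9.]", "", str(value))
    try:
        return float(digits)
    except ValueError:
        # Covers empty strings, 'N/A', and malformed numbers.
        return None


def split_list(value):
    """Split a comma-separated field (genres, cast) into a list of stripped names."""
    if not value or value == "N/A":
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def release_decade(year):
    """Map a year such as 1994 (or '1994') to the start of its decade, 1990."""
    match = re.search(r"\d{4}", str(year))
    return (int(match.group()) // 10) * 10 if match else None


if __name__ == "__main__":
    print(parse_money("$237,000,000"))
    print(split_list("Action, Adventure, Sci-Fi"))
    print(release_decade("2009"))
```

Returning `None` or an empty list for missing values keeps the missing-value policy (drop, impute, or flag) as a separate, explicit step.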
Models:
- Random Forest
- Gradient Boosting
- Linear Regression
Metrics:
- RMSE
- MAE
- R²
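For reference, the three metrics can be computed directly from their definitions. The sketch below uses only the standard library so the arithmetic is visible; the function name `regression_metrics` is hypothetical, and the project itself may rely on a library implementation instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R² is undefined when every true value is identical.
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


if __name__ == "__main__":
    print(regression_metrics([100, 200, 300], [110, 190, 300]))
```

RMSE and MAE are in the same units as the target (dollars of worldwide gross), with RMSE penalizing large misses more heavily, while R² is unitless and compares the model against always predicting the mean.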
git clone https://github.com/alexander-ayer/COS482-Project
pip install -r requirements.txt
export OMDB_API_KEY="cdd03cf8"
python src/data_collection.py --generate-synthetic 2000
python src/data_collection.py --omdb-key $OMDB_API_KEY --titles-file titles.txt
python src/boxoffice_scraper.py
python src/cleaning.py
python src/modeling.py
Visualizations:
- Budget vs Gross
- Gross distribution
- Feature importance
Future improvements:
- TMDB API integration
- NLP analysis of plots
- Deep learning models
Vasu Patel & Alex Ayer
Computer Science | Data Science