🎬 Predicting Movie Success using Machine Learning

Predicting movie success using OMDb API metadata, The Numbers box-office scraping, and machine learning pipelines.
Explore the features »
View Notebook · Report Bug · Request Feature

🌟 Overview

This project builds an end-to-end machine learning pipeline for predicting worldwide box-office gross, using:

🎥 OMDb API → Movie metadata
💰 The Numbers → Budget & worldwide gross scraping
🧹 Data cleaning + feature engineering
🤖 ML Models → Random Forest, Gradient Boosting, Linear Regression
📊 Visualizations → Scatter plots, histograms, feature importances
🧪 Synthetic fallback dataset for testing
🚀 Google Colab support

🗂 Project Structure

movie-success-project/
│
├── data_raw/               
├── data_clean/             
│
├── src/
│   ├── data_collection.py 
│   ├── boxoffice_scraper.py
│   ├── cleaning.py         
│   ├── modeling.py         
│   └── utils.py
│
├── notebooks/
│   ├── movie_pipeline.ipynb
│   └── EDA.ipynb
│
├── visuals/
│   ├── scatter_budget_vs_worldwide.png
│   ├── hist_worldwide_gross.png
│   └── rf_feature_importances.png
│
├── models/
│   └── best_model.pkl 
│
├── report/
│   └── final_report.md
│
└── README.md

🚀 Features

🔍 Data Collection

OMDb API metadata
Budget + box‑office scraping
Fuzzy title matching
Synthetic dataset generator
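
Fuzzy title matching is needed because OMDb and The Numbers rarely spell a title identically. The sketch below shows one way to do it with the standard library's `difflib`; the helper names (`normalize_title`, `best_title_match`) and the 0.85 cutoff are illustrative assumptions, not necessarily what `data_collection.py` does.

```python
import difflib
import re


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(title.split())


def best_title_match(query, candidates, cutoff=0.85):
    """Return the candidate closest to `query`, or None if nothing clears `cutoff`."""
    # Map normalized form -> original spelling so the original can be returned.
    normalized = {normalize_title(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize_title(query), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Hypothetical titles as they might appear in the scraped budget table.
scraped_titles = ["Star Wars: Episode IV - A New Hope", "The Dark Knight", "Inception"]
print(best_title_match("The Dark Knight.", scraped_titles))
print(best_title_match("Totally Unknown Film", scraped_titles))
```

A stricter cutoff reduces false joins between sequels with near-identical names, at the cost of dropping more rows.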

🧹 Data Cleaning

Numeric normalization
Missing value handling
Genre & cast parsing
Release decade extraction
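
The cleaning steps above could look roughly like the following. This is a minimal sketch with hypothetical helper names; it assumes OMDb-style string fields (for example `"$160,000,000"`, `"148 min"`, `"Action, Adventure, Sci-Fi"`, and `"N/A"` for missing values) and whole-dollar amounts.

```python
import re


def parse_money(value):
    """Turn '$1,234,567' into an int; return None for 'N/A' or blank input.

    Assumes whole-dollar amounts (a decimal point would be dropped).
    """
    if value is None:
        return None
    digits = re.sub(r"[^\d]", "", str(value))
    return int(digits) if digits else None


def parse_runtime(value):
    """Turn '148 min' into 148; return None when no number is present."""
    match = re.search(r"\d+", str(value or ""))
    return int(match.group()) if match else None


def parse_list(value):
    """Split a comma-separated field such as genres or cast into a clean list."""
    if not value or value == "N/A":
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def release_decade(year):
    """Map a release year to the first year of its decade (1999 -> 1990)."""
    return (int(year) // 10) * 10


print(parse_money("$160,000,000"), parse_runtime("148 min"))
print(parse_list("Action, Adventure, Sci-Fi"), release_decade(1999))
```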

🤖 ML Modeling

Random Forest
Gradient Boosting
Linear Regression
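
As a rough illustration of comparing the three model families, the sketch below fits each one with scikit-learn on made-up data. The two features (log budget and rating), the synthetic target, and the hyperparameters are assumptions chosen for the example, not the project's actual feature set or settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: gross grows with budget and rating, plus multiplicative noise.
rng = np.random.default_rng(0)
n = 500
budget = rng.uniform(1e6, 2e8, size=n)
rating = rng.uniform(3.0, 9.0, size=n)
X = np.column_stack([np.log10(budget), rating])
y = np.log10(budget * rating * rng.lognormal(0.0, 0.5, size=n) / 3.0)  # log10 gross

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Linear Regression": LinearRegression(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns R² on the held-out data.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R² = {scores[name]:.3f}")
```

Training on the log of gross is a common choice for heavily skewed revenue data, but it is a design decision to validate against the real dataset.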

Metrics:

RMSE
MAE
R²
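
For reference, the three metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not code taken from `modeling.py`; the sample numbers are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: average miss in the target's own units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented worldwide grosses in millions of dollars.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(f"RMSE: {rmse(actual, predicted):.3f}")  # RMSE: 10.000
print(f"MAE:  {mae(actual, predicted):.3f}")   # MAE:  10.000
print(f"R²:   {r2(actual, predicted):.3f}")    # R²:   0.992
```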

🛠 Installation

git clone https://github.com/alexander-ayer/COS482-Project
cd COS482-Project
pip install -r requirements.txt

🔑 OMDb API Setup

export OMDB_API_KEY="your_omdb_api_key"
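
Inside Python, the exported key can be read from the environment and turned into an OMDb request URL. The sketch below is an assumption about how this might be wired up (the function names are hypothetical); it uses OMDb's `apikey`, `t`, `type`, and `y` query parameters and only builds the URL, without making a network call.

```python
import os
from urllib.parse import urlencode

OMDB_ENDPOINT = "https://www.omdbapi.com/"


def get_api_key(env=os.environ):
    """Read OMDB_API_KEY from the environment, failing loudly if it is missing."""
    key = env.get("OMDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OMDB_API_KEY is not set; export it before running.")
    return key


def build_omdb_url(title, api_key, year=None):
    """Build a title-lookup URL; `year` narrows the search when titles collide."""
    params = {"apikey": api_key, "t": title, "type": "movie"}
    if year is not None:
        params["y"] = str(year)
    return f"{OMDB_ENDPOINT}?{urlencode(params)}"


# "demo" is a placeholder, not a working key.
print(build_omdb_url("The Dark Knight", "demo", year=2008))
```

Keeping the key in an environment variable rather than in source files avoids committing it to the repository.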

📥 Usage

Synthetic data

python src/data_collection.py --generate-synthetic 2000
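
To give a sense of what a synthetic fallback dataset might contain, here is a minimal generator sketch. The column names, distributions, and the link between rating and gross are all assumptions made for illustration; the real `--generate-synthetic` output may differ.

```python
import random


def generate_synthetic_movies(n, seed=42):
    """Return `n` fake movie rows with plausible-looking budget and gross values."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    genres = ["Action", "Comedy", "Drama", "Horror", "Sci-Fi"]
    rows = []
    for i in range(n):
        budget = round(rng.lognormvariate(17.0, 1.0))  # median near $24M
        rating = round(rng.uniform(3.0, 9.0), 1)
        # Better-rated films get a higher expected return on budget.
        multiplier = rng.lognormvariate(0.5, 0.8) * (rating / 6.0)
        rows.append(
            {
                "title": f"Synthetic Movie {i + 1}",
                "year": rng.randint(1980, 2023),
                "genre": rng.choice(genres),
                "imdb_rating": rating,
                "budget": budget,
                "worldwide_gross": round(budget * multiplier),
            }
        )
    return rows


for row in generate_synthetic_movies(3):
    print(row)
```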

OMDb fetch

python src/data_collection.py --omdb-key $OMDB_API_KEY --titles-file titles.txt

Scraping budgets

python src/boxoffice_scraper.py

Cleaning

python src/cleaning.py

Training

python src/modeling.py

📊 Visualizations

Budget vs Gross
Gross distribution
Feature importance

🧠 Future Work

TMDB API integration
NLP analysis of plots
Deep learning models

👨‍💻 Author

Vasu Patel & Alex Ayer
Computer Science | Data Science
