🎬 Predicting Movie Success using Machine Learning

Predict worldwide box-office performance from OMDb API metadata, budget and gross figures scraped from The Numbers, and machine learning pipelines.


🌟 Overview

This project builds an end-to-end machine learning pipeline for predicting worldwide box-office performance using:

  • 🎥 OMDb API → Movie metadata
  • 💰 The Numbers → Budget & worldwide gross scraping
  • 🧹 Data cleaning + feature engineering
  • 🤖 ML Models → Random Forest, Gradient Boosting, Linear Regression
  • 📊 Visualizations → Scatter plots, histograms, feature importances
  • 🧪 Synthetic fallback dataset for testing
  • 🚀 Google Colab support

🗂 Project Structure

movie-success-project/
│
├── data_raw/               
├── data_clean/             
│
├── src/
│   ├── data_collection.py 
│   ├── boxoffice_scraper.py
│   ├── cleaning.py         
│   ├── modeling.py         
│   └── utils.py
│
├── notebooks/
│   ├── movie_pipeline.ipynb
│   └── EDA.ipynb
│
├── visuals/
│   ├── scatter_budget_vs_worldwide.png
│   ├── hist_worldwide_gross.png
│   └── rf_feature_importances.png
│
├── models/
│   └── best_model.pkl 
│
├── report/
│   └── final_report.md
│
└── README.md

🚀 Features

🔍 Data Collection

  • OMDb API metadata
  • Budget + box‑office scraping
  • Fuzzy title matching
  • Synthetic dataset generator
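Fuzzy title matching is what lets OMDb records be joined to The Numbers rows when the two sources spell a title differently. The sketch below shows one way to do it with the standard-library `difflib`. The function names `normalize_title` and `match_title` and the 0.8 cutoff are assumptions for illustration; the actual logic in `src/` may use a different library or threshold.

```python
import difflib
import re


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(title.split())


def match_title(query, candidates, cutoff=0.8):
    """Return the candidate most similar to `query`, or None if none reach `cutoff`.

    Hypothetical helper: the similarity score is difflib's ratio on normalized titles.
    """
    # Map each normalized form back to the first original spelling seen.
    normalized = {}
    for candidate in candidates:
        normalized.setdefault(normalize_title(candidate), candidate)

    hits = difflib.get_close_matches(
        normalize_title(query), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


if __name__ == "__main__":
    scraped = ["Spider Man Homecoming", "The Dark Knight", "Iron Man"]
    print(match_title("Spider-Man: Homecoming", scraped))
    print(match_title("The Dark Knigt", scraped))
```

A higher cutoff reduces false joins (for example, sequels with near-identical titles) at the cost of more unmatched rows, so it is worth checking a sample of matches by hand.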

🧹 Data Cleaning

  • Numeric normalization
  • Missing value handling
  • Genre & cast parsing
  • Release decade extraction
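The cleaning steps above can be sketched with a few small helpers. This is a minimal illustration, not the contents of `src/cleaning.py`: the function names are hypothetical, and it assumes money values are whole-dollar strings such as `$1,234,567`, list fields are comma-separated strings, and missing values appear as `N/A`.

```python
import re


def parse_money(value):
    """Convert a string such as '$1,234,567' to an int; return None if it has no digits.

    Assumes whole-dollar amounts (a decimal point would be dropped, not interpreted).
    """
    if value is None:
        return None
    digits = re.sub(r"[^0-9]", "", str(value))
    return int(digits) if digits else None


def parse_list(value):
    """Split a comma-separated field ('Action, Sci-Fi') into a list of clean names."""
    if not value or value == "N/A":
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def release_decade(year):
    """Map a year such as 1994 or '1994' to the start of its decade (1990)."""
    match = re.search(r"\d{4}", str(year))
    return (int(match.group()) // 10) * 10 if match else None


if __name__ == "__main__":
    print(parse_money("$1,234,567"))
    print(parse_list("Action, Sci-Fi"))
    print(release_decade("1994"))
```

Returning `None` or an empty list for missing values keeps the decision about imputation or row removal in one later step instead of scattering it across the parsers.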

🤖 ML Modeling

  • Random Forest
  • Gradient Boosting
  • Linear Regression

Metrics:

  • RMSE
  • MAE

🛠 Installation

git clone https://github.com/alexander-ayer/COS482-Project
cd COS482-Project
pip install -r requirements.txt

🔑 OMDb API Setup

export OMDB_API_KEY="your_omdb_api_key"
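Once the variable is exported, a script can read the key from the environment rather than hard-coding it. The sketch below builds an OMDb title-lookup URL using the API's `apikey` and `t` query parameters. The helper name `build_omdb_url` is an assumption for illustration; `src/data_collection.py` may construct its requests differently.

```python
import os
from urllib.parse import urlencode

OMDB_URL = "https://www.omdbapi.com/"


def build_omdb_url(title, api_key=None):
    """Return the OMDb lookup URL for `title` (hypothetical helper).

    Falls back to the OMDB_API_KEY environment variable when no key is passed.
    """
    key = api_key or os.environ.get("OMDB_API_KEY")
    if not key:
        raise RuntimeError("Set OMDB_API_KEY before fetching metadata")
    # urlencode handles spaces and punctuation in the title.
    return OMDB_URL + "?" + urlencode({"apikey": key, "t": title})


if __name__ == "__main__":
    print(build_omdb_url("The Matrix", api_key="demo"))
```

Keeping the key in the environment also keeps it out of version control; a key that has been committed or published should be treated as exposed and regenerated.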

📥 Usage

Synthetic data

python src/data_collection.py --generate-synthetic 2000
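To give a sense of what a synthetic fallback dataset might contain, here is a minimal generator sketch. The column names, genre list, and log-normal distributions are all assumptions chosen for illustration; they are not taken from `src/data_collection.py`, whose schema may differ.

```python
import random

# Hypothetical genre vocabulary for the synthetic rows.
GENRES = ["Action", "Comedy", "Drama", "Horror", "Sci-Fi"]


def generate_synthetic(n, seed=0):
    """Return `n` fake movie records with a budget-dependent worldwide gross."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    rows = []
    for i in range(n):
        # Log-normal budget with a median near e^17, roughly $24M.
        budget = round(rng.lognormvariate(17, 1))
        # Gross is the budget times a noisy, always-positive multiplier.
        multiplier = rng.lognormvariate(0.5, 0.8)
        rows.append({
            "title": f"Synthetic Movie {i}",
            "year": rng.randint(1980, 2023),
            "genre": rng.choice(GENRES),
            "budget": budget,
            "worldwide_gross": round(budget * multiplier),
        })
    return rows


if __name__ == "__main__":
    data = generate_synthetic(2000)
    print(len(data), data[0])
```

Tying gross to budget gives the models a real signal to learn, which makes the synthetic set useful for checking that the pipeline runs end to end without API access.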

OMDb fetch

python src/data_collection.py --omdb-key "$OMDB_API_KEY" --titles-file titles.txt

Scraping budgets

python src/boxoffice_scraper.py

Cleaning

python src/cleaning.py

Training

python src/modeling.py

📊 Visualizations

  • Budget vs Gross
  • Gross distribution
  • Feature importance

🧠 Future Work

  • TMDB API integration
  • NLP analysis of plots
  • Deep learning models

👨‍💻 Authors

Vasu Patel & Alex Ayer
Computer Science | Data Science

About

This repository contains the source code for the final project in COS482.
