This project is a web-based movie recommendation application developed as a final project for the Fundamentals of Data Science course. It addresses the "choice overload" and "cold-start" problems common on streaming platforms by providing content-based suggestions that do not require any user history.
The system utilizes a hybrid approach:
- Weighted Content-Based Filtering: Uses TF-IDF Vectorization and Cosine Similarity to find semantically similar movies, with higher weights assigned to Directors and Cast to capture "auteur" style and star power.
- Bayesian Quality Scoring: Implements the IMDb weighted rating formula to ensure that recommended movies are statistically high-quality, balancing raw ratings with vote counts.
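To make the quality score concrete, here is a minimal sketch of the IMDb weighted rating formula, WR = (v / (v + m)) * R + (m / (v + m)) * C, where R is a movie's average rating, v its vote count, C the mean rating across the catalog, and m a minimum-vote threshold. The sample movies and the value of `m` are assumptions made for illustration; the threshold used in prepare_model.py may differ.

```python
def weighted_rating(R, v, C, m):
    """IMDb-style weighted rating: shrinks R toward the catalog mean C
    when the vote count v is small relative to the threshold m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical catalog: (title, average rating R, vote count v)
movies = [
    ("Niche Gem", 9.0, 10),
    ("Crowd Favorite", 8.0, 5000),
    ("Average Hit", 6.5, 2000),
]

C = sum(r for _, r, _ in movies) / len(movies)  # mean rating across the catalog
m = 1000  # assumed vote threshold, chosen only for this example

scores = {title: weighted_rating(r, v, C, m) for title, r, v in movies}
ranked = sorted(scores, key=scores.get, reverse=True)

for title in ranked:
    print(f"{title}: {scores[title]:.2f}")
```

With only 10 votes, "Niche Gem" is pulled almost all the way to the catalog mean, so the widely rated "Crowd Favorite" ranks above it despite a lower raw rating.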
- Content-Based Recommender: Suggests movies based on a "Weighted Soup" of metadata (Director x3, Cast x2, Keywords x1, Genres x1).
- Browse by Star: Search for movies featuring specific Actors or Directors. Includes a "Cold Start" fix that suggests popular stars if the search is empty.
- Surprise Me!: A discovery feature that randomly selects a movie from the 500 top-rated films.
- Smart Catalog: A full, paginated library of over 4,800 movies, filterable by Genre and sorted by Bayesian Quality Score.
- Interactive Metadata: All Directors, Cast members, and Genres are clickable, allowing seamless navigation to related content.
- Modern UI: A responsive, dark-themed interface built with Tailwind CSS.
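The "Weighted Soup" idea can be sketched as follows: each metadata field is repeated according to its weight (Director x3, Cast x2, Keywords x1, Genres x1), so that TF-IDF term frequencies reflect the weighting before cosine similarity is computed. This is an illustrative sketch, not the project's actual code; the helper names `tokens`, `make_soup`, and `recommend`, the dictionary layout, and the three-movie sample are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tokens(names):
    # Collapse spaces so "Christopher Nolan" becomes a single token
    return [n.replace(" ", "").lower() for n in names]

def make_soup(movie):
    # Repeat each field according to its weight: director x3, cast x2, rest x1
    parts = (
        tokens([movie["director"]]) * 3
        + tokens(movie["cast"]) * 2
        + tokens(movie["keywords"])
        + tokens(movie["genres"])
    )
    return " ".join(parts)

# Toy sample for illustration only
movies = [
    {"title": "Inception", "director": "Christopher Nolan",
     "cast": ["Leonardo DiCaprio"], "keywords": ["dream"], "genres": ["Science Fiction"]},
    {"title": "Interstellar", "director": "Christopher Nolan",
     "cast": ["Matthew McConaughey"], "keywords": ["space"], "genres": ["Science Fiction"]},
    {"title": "Titanic", "director": "James Cameron",
     "cast": ["Leonardo DiCaprio"], "keywords": ["ship"], "genres": ["Romance"]},
]

soups = [make_soup(m) for m in movies]
tfidf = TfidfVectorizer().fit_transform(soups)  # one TF-IDF row per movie
sim = cosine_similarity(tfidf)                  # pairwise similarity matrix

def recommend(title, top_n=2):
    # Rank all other movies by similarity to the given title
    idx = next(i for i, m in enumerate(movies) if m["title"] == title)
    order = sim[idx].argsort()[::-1]
    return [movies[i]["title"] for i in order if i != idx][:top_n]

print(recommend("Inception"))
```

Because the director token appears three times and the cast token only twice, a shared director contributes more to the similarity score than a shared lead actor in this sample.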
- Backend: Python, Flask
- Data Manipulation: Pandas, NumPy
- Machine Learning: Scikit-learn (TF-IDF, Cosine Similarity)
- Frontend: HTML5, Tailwind CSS (via CDN)
Before you begin, ensure you have the following installed on your system:
- Python 3.x (the project was developed with Python 3.14, but any 3.x version should work)
- The pip package manager
Follow these steps to set up and run the application locally.
- Clone the Repository
git clone https://github.com/MichaelFirstAC/MovieCatalog.git
cd MovieCatalog
- Install Dependencies: Install the required Python libraries using pip:
pip install flask pandas scikit-learn
- Prepare the Data and Model (One-Time Setup)
The application requires the raw CSV files (tmdb_5000_movies.csv and tmdb_5000_credits.csv) to be present in the root directory.
- Note: If these files are zipped (archive.zip), extract them into the root folder first.
Run the prepare_model.py script. This script will:
- Clean and parse the JSON datasets.
- Calculate the Bayesian Quality Score for every movie.
- Build the TF-IDF and Cosine Similarity matrices.
- Save the processed models (movies.pkl and cosine_sim.pkl).
python prepare_model.py
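As a rough illustration of the "clean and parse" step, the sketch below assumes that columns such as genres, keywords, cast, and crew are stored in the CSV files as JSON-encoded lists of objects with a "name" field, and that directors are crew entries whose "job" is "Director". The helper names `names` and `director` and the sample cells are hypothetical; the actual logic in prepare_model.py may differ.

```python
import json

def names(raw, limit=None):
    """Parse a JSON-encoded list of objects and return their "name" values."""
    items = json.loads(raw) if raw else []
    out = [item["name"] for item in items]
    return out[:limit] if limit else out

def director(raw_crew):
    """Return the first crew member whose job is "Director", or None."""
    for member in json.loads(raw_crew):
        if member.get("job") == "Director":
            return member["name"]
    return None

# Hypothetical cell values in the assumed format
genres_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
crew_cell = '[{"job": "Editor", "name": "A. Editor"}, {"job": "Director", "name": "B. Director"}]'

print(names(genres_cell))
print(names(genres_cell, limit=1))
print(director(crew_cell))
```

With pandas, helpers like these could be applied column-wise (for example via `Series.apply`) before building the metadata strings.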
- Run the Web Application: Once the model files are generated, start the Flask server:
python app.py
You should see output indicating the server is running, typically on http://127.0.0.1:5000/.
- Access the Application: Open your web browser and navigate to the address shown in the server output (typically http://127.0.0.1:5000/).
- app.py: The main Flask application containing routing logic and the recommendation engine.
- prepare_model.py: The data pipeline script for cleaning, feature engineering, and model training.
- templates/index.html: The unified frontend template handling all views (Home, Catalog, Browse, etc.).
- static/: Contains CSS assets and team images.
- movies.pkl & cosine_sim.pkl: Serialized model files generated by the preparation script.
- Note: All other files in the repository are documentation and are not required to run the program.