A content-based movie recommendation system that suggests similar movies based on your selection. Built with Python and Streamlit, featuring a modern Netflix-inspired UI and an optimized TF-IDF algorithm.
- Content-Based Filtering - Analyzes movie metadata (genres, cast, crew, keywords, plot)
- TF-IDF Algorithm - Optimized vectorization with bigram support for better recommendations
- Modern UI - Netflix-inspired dark theme with smooth animations
- Fast Performance - Cached model loading and API responses
- Secure - API keys stored in secrets, not in code
Live App: cinematch-movie-recommend.streamlit.app
Select any movie from the database of 4,800+ films and get instant recommendations!
- Python 3.9+
- TMDB API Key (Get one free)
# Clone the repository
git clone https://github.com/SiD-array/movie-recommender.git
cd movie-recommender
# Create virtual environment (recommended)
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
# Install dependencies
pip install -r requirements.txt
# Configure API key
copy .streamlit\secrets.toml.example .streamlit\secrets.toml # Windows
# cp .streamlit/secrets.toml.example .streamlit/secrets.toml # macOS/Linux
# Edit secrets.toml and add your TMDB API key
# Run the app
streamlit run app.py
Open http://localhost:8501 in your browser.
movie-recommender/
├── .streamlit/
│   ├── config.toml              # Streamlit theme & server config
│   └── secrets.toml.example     # API key template
├── models/
│   ├── movies_dict.pkl          # Processed movie data (4806 movies)
│   └── similarity.pkl           # TF-IDF similarity matrix (Git LFS)
├── app.py                       # Main Streamlit application
├── build_improved_model.py      # Script to rebuild/improve the model
├── download_models.py           # Script to download model files
├── requirements.txt             # Python dependencies
└── README.md
The system uses Content-Based Filtering with TF-IDF Vectorization:
Movie Features → Text Preprocessing → TF-IDF Vectorization → Cosine Similarity → Recommendations
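The pipeline above can be sketched end to end with scikit-learn. This is a minimal illustration, not the code in `app.py`: the four movie titles and their text are made up, and the `recommend` helper is a hypothetical name introduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for each movie's combined metadata text (made up for illustration).
movies = {
    "Space Quest": "science fiction space alien astronaut",
    "Star Voyage": "science fiction space astronaut mission",
    "Love in Paris": "romance paris love comedy",
    "Laugh Riot": "comedy standup friends love",
}
titles = list(movies)

# TF-IDF vectorization: one row per movie, one column per term.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(list(movies.values()))

# Pairwise cosine similarity between all movies.
similarity = cosine_similarity(tfidf_matrix)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    ranked = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in ranked if i != idx][:top_n]


print(recommend("Space Quest"))
```

Precomputing the full similarity matrix once, as here, is what makes lookups at request time a simple row sort; it is presumably also why the project ships a `similarity.pkl` file.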
Unlike simple word counting, TF-IDF weighs words by their importance:
| Word Type | Example | Weight |
|---|---|---|
| Rare (discriminative) | "christophernolan", "pixar" | High |
| Common (generic) | "action", "movie", "story" | Low |
Formula:
weight = log(1 + term_frequency) × log(total_documents / documents_containing_term)
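The formula can be evaluated directly to see why rare terms outweigh common ones. The sketch below implements the formula exactly as written above; the document counts (8 and 1200) are invented for illustration. Note that scikit-learn's `TfidfVectorizer` uses a slightly different, smoothed variant, so its numbers will not match these exactly.

```python
import math


def tfidf_weight(term_frequency, total_documents, documents_containing_term):
    """Weight of a term in one document, following the README formula."""
    tf = math.log(1 + term_frequency)  # sublinear term frequency
    idf = math.log(total_documents / documents_containing_term)  # rarity of the term
    return tf * idf


# Hypothetical counts: a director token found in 8 films vs. a genre token in 1200.
rare = tfidf_weight(1, 4806, 8)
common = tfidf_weight(1, 4806, 1200)

print(f"rare term weight:   {rare:.3f}")
print(f"common term weight: {common:.3f}")
```

A term that appears in every document gets a weight of zero, and repeating a term ten times raises its weight by far less than a factor of ten.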
- Genres - Action, Comedy, Drama, etc.
- Keywords - Plot-specific tags
- Cast - Top actors
- Crew - Director
- Overview - Plot summary
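One plausible way to merge these five features into a single text field per movie is shown below. This is a sketch under assumptions: the record layout, the `collapse_name` and `build_tags` helpers, and the choice of three cast members are all hypothetical, and the sample movie is made up. Collapsing names into one token is inferred from the "christophernolan" example above.

```python
def collapse_name(name):
    """Join a person's name into one token, e.g. 'Jane Doe' -> 'janedoe'."""
    return name.replace(" ", "").lower()


def build_tags(movie, top_cast=3):
    """Combine genres, keywords, cast, director, and overview into one string."""
    tokens = []
    tokens += [genre.lower() for genre in movie["genres"]]
    tokens += [keyword.lower() for keyword in movie["keywords"]]
    tokens += [collapse_name(actor) for actor in movie["cast"][:top_cast]]
    tokens.append(collapse_name(movie["director"]))
    tokens.append(movie["overview"].lower())
    return " ".join(tokens)


# Made-up record for illustration only.
movie = {
    "genres": ["Action", "Science Fiction"],
    "keywords": ["dream", "heist"],
    "cast": ["Alex Rivers", "Sam Lee", "Pat Kim", "Chris Vale"],
    "director": "Jane Doe",
    "overview": "A thief enters dreams to steal secrets.",
}

tags = build_tags(movie)
print(tags)
```

Collapsing names keeps "Jane Doe" from matching every other "Jane" in the dataset, while multi-word genres stay split so that bigrams can pick them up.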
| Feature | Benefit |
|---|---|
| Bigrams | Captures phrases like "science fiction" as single features |
| Sublinear TF | Diminishing returns for repeated words |
| Document Frequency Limits | Filters out typos and overly common words |
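These three options map onto parameters of scikit-learn's `TfidfVectorizer`. The sketch below uses a tiny made-up corpus, and the threshold values are illustrative assumptions, not the settings used in `build_improved_model.py`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus for illustration.
docs = [
    "science fiction space adventure",
    "science fiction time travel",
    "romantic comedy paris",
    "space adventure comedy",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    sublinear_tf=True,   # dampen the effect of repeated terms
    min_df=2,            # drop terms found in fewer than 2 documents
    max_df=0.8,          # drop terms found in more than 80% of documents
)
matrix = vectorizer.fit_transform(docs)

# Terms that survived the document frequency limits.
print(sorted(vectorizer.vocabulary_))
```

With these settings the bigram "science fiction" is kept as its own feature because it occurs in two documents, while one-off terms such as "paris" are filtered out.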
| Component | Technology |
|---|---|
| Frontend | Streamlit |
| ML/NLP | scikit-learn (TF-IDF, Cosine Similarity) |
| Data | Pandas, NumPy |
| API | TMDB API |
| Dataset | TMDB 5000 Movie Dataset |
To rebuild or customize the recommendation model:
python build_improved_model.py
This script allows you to adjust:
- max_features - Vocabulary size
- ngram_range - Unigrams, bigrams, etc.
- min_df / max_df - Document frequency thresholds
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE for details.
- TMDB for the movie database and API
- Streamlit for the web framework
- scikit-learn for ML tools
Siddharth Bhople - sid.work0403@gmail.com