---
title: Image Caption Generator
emoji: 🖼️
colorFrom: blue
colorTo: green
sdk: streamlit
app_file: app.py
pinned: false
---
An advanced image captioning application that automatically generates highly accurate, descriptive captions for uploaded images using Deep Learning.
This project features a dual-architecture approach:
- Salesforce BLIP (Bootstrapping Language-Image Pre-training): A state-of-the-art transformer model integrated via Hugging Face for maximum accuracy and robust production-level inference.
- Custom CNN-LSTM Encoder-Decoder: A PyTorch encoder-decoder (ResNet-50 encoder, LSTM decoder) built and trained from scratch on the Flickr8k dataset.
It provides a clean, interactive Streamlit web interface for generating captions instantly.
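For a concrete picture of the custom architecture, here is a minimal PyTorch sketch of a ResNet-50 + LSTM encoder-decoder. Class names, layer sizes, and the frozen backbone are illustrative assumptions; the actual definitions live in `model.py`:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """ResNet-50 backbone with its classifier head swapped for an embedding layer."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop fc head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():                   # assumption: backbone kept frozen
            features = self.backbone(images)    # (B, 2048, 1, 1)
        return self.fc(features.flatten(1))     # (B, embed_size)

class DecoderLSTM(nn.Module):
    """LSTM that conditions on the image embedding as its first input step."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image embedding to the embedded caption tokens.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(embeddings)
        return self.fc(hiddens)                 # (B, T+1, vocab_size) logits
```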
Key features:

- High-Accuracy Inference: Uses the robust Salesforce BLIP model for state-of-the-art zero-shot image captioning.
- Custom Model Training Pipeline: Complete end-to-end pipeline for training a CNN-LSTM model on custom datasets.
- Interactive UI: Built with Streamlit for a fast, responsive, and user-friendly experience.
- Hugging Face Spaces Ready: Pre-configured metadata for seamless deployment.
Tech stack:

- Deep Learning Framework: PyTorch
- Models: Salesforce BLIP, ResNet-50, LSTM
- Libraries: Transformers (Hugging Face), Torchvision, NLTK, PIL
- Frontend: Streamlit
Project structure:

```text
├── app.py              # Streamlit web application (uses BLIP for the frontend)
├── blip_inference.py   # BLIP model inference script
├── model.py            # Custom CNN-LSTM architecture (ResNet-50 + LSTM)
├── train.py            # Training script for the custom model
├── dataset.py          # Custom PyTorch Dataset loader for Flickr8k
├── build_vocab.py      # Vocabulary builder for custom training
├── inference.py        # Command-line inference for the custom CNN-LSTM model
├── requirements.txt    # Python dependencies
└── README.md           # Project documentation
```
To set up the project:

- Clone the repository:

  ```bash
  git clone <repository_url>
  cd image-caption-generator
  ```

- Create a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The easiest way to use the application is through the interactive Streamlit web interface, which automatically downloads the BLIP Large model on the very first run:

```bash
streamlit run app.py
```

Simply upload an image through the UI to generate an accurate caption!
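Under the hood, the BLIP path reduces to a few Hugging Face Transformers calls. Here is a minimal sketch; the checkpoint name `Salesforce/blip-image-captioning-large` is an assumption based on the "BLIP Large" mention above, and `example.jpg` is a placeholder, so `blip_inference.py` may differ in detail:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint ("BLIP Large"); adjust to match blip_inference.py.
CHECKPOINT = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(CHECKPOINT)
model = BlipForConditionalGeneration.from_pretrained(CHECKPOINT)

image = Image.open("example.jpg").convert("RGB")      # placeholder image path
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```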
If you wish to train the custom encoder-decoder model from scratch (see the training-loop sketch after this list):

- Download the Flickr8k dataset (images and captions).
- Configure the dataset paths appropriately (see `dataset.py` and `train.py`).
- Run the training script:

  ```bash
  python train.py
  ```

- This process generates a vocabulary dictionary (`vocab.pkl`) and saves the model weights (`caption_model.pth`).
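At its core, such a training script is a standard teacher-forced cross-entropy loop. The sketch below reuses the assumed `EncoderCNN`/`DecoderLSTM` interfaces from the earlier sketch and substitutes a synthetic batch for the real Flickr8k `DataLoader`, so names and hyperparameters will not match `train.py` exactly:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# EncoderCNN / DecoderLSTM are the assumed classes sketched earlier; sizes are
# illustrative, and vocab_size would normally come from vocab.pkl.
encoder = EncoderCNN(embed_size=256).to(device)
decoder = DecoderLSTM(embed_size=256, hidden_size=512, vocab_size=5000).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)       # assume ID 0 is <pad>
params = list(decoder.parameters()) + list(encoder.fc.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)

# Synthetic stand-in for a DataLoader over (image, caption) batches.
loader = [(torch.randn(4, 3, 224, 224), torch.randint(1, 5000, (4, 20)))]

for epoch in range(5):
    for images, captions in loader:
        images, captions = images.to(device), captions.to(device)
        features = encoder(images)                    # (B, embed_size)
        outputs = decoder(features, captions[:, :-1]) # teacher forcing
        loss = criterion(outputs.reshape(-1, outputs.size(-1)),
                         captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```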
Once the custom CNN-LSTM model is fully trained, you can use the command-line inference script to generate captions. Generation with a model like this is typically greedy decoding: the image embedding seeds the LSTM, and the most probable token is fed back in until an end token appears, as sketched below.
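A hedged sketch of that loop, reusing the assumed decoder interface from above; `vocab.itos` (an ID-to-word mapping) and the `<end>` token are illustrative assumptions, and `inference.py` may differ:

```python
import torch

@torch.no_grad()
def generate_caption(encoder, decoder, image_tensor, vocab, max_len=20):
    """Greedy decoding with the assumed EncoderCNN/DecoderLSTM interfaces."""
    features = encoder(image_tensor.unsqueeze(0))       # (1, embed_size)
    inputs = features.unsqueeze(1)                      # (1, 1, embed_size)
    states, words = None, []
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)  # one LSTM step
        logits = decoder.fc(hiddens.squeeze(1))         # (1, vocab_size)
        token = logits.argmax(dim=-1)                   # most probable token
        word = vocab.itos[token.item()]                 # hypothetical ID->word map
        if word == "<end>":
            break
        words.append(word)
        inputs = decoder.embed(token).unsqueeze(1)      # feed prediction back in
    return " ".join(words)
```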
Run it as:

```bash
python inference.py --image path/to/image.jpg
```

- macOS SSL Verification: If you encounter SSL certificate verification errors on macOS when downloading the pretrained ResNet weights, the code includes a built-in patch (`ssl._create_unverified_context` in `model.py`) that resolves this automatically.
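For reference, that patch is typically a one-liner applied before torchvision fetches the weights; this is a generic sketch of the technique rather than a verbatim excerpt of `model.py`:

```python
import ssl

# macOS workaround: serve urllib an unverified HTTPS context so the
# torchvision weight download does not fail certificate verification.
# Note: this disables SSL certificate checks process-wide.
ssl._create_default_https_context = ssl._create_unverified_context
```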
This project is open-source and available under the MIT License.