A complete comparative study of classical machine learning, deep learning, and pretrained vision-language models on the Flickr8k image-caption dataset.
This project compares four experimental components:
- Classical ML retrieval baseline using hand-crafted image features and TF-IDF/SVD caption features.
- CNN-LSTM captioning model using a ResNet50 image encoder and LSTM decoder.
- CLIP retrieval model using openai/clip-vit-base-patch32 for zero-shot image-text retrieval.
- BLIP captioning model using Salesforce/blip-image-captioning-base for zero-shot caption generation.
The completed pipeline evaluates both image-text retrieval and image caption generation.
project-root/
    dataset/
        captions.txt
        Images/
            *.jpg
This repository does not include the raw Flickr8k image files or captions. The dataset should be downloaded or requested separately and placed under dataset/ as shown above. This avoids redistributing image files whose copyright belongs to the original Flickr image owners.
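Because the dataset has to be placed manually, a quick layout check before running any script can catch a misplaced folder early. The following is a minimal sketch, not part of the project code: the function name `check_dataset_layout` is hypothetical, and it only verifies that the paths shown above exist and counts the `.jpg` files.

```python
from pathlib import Path


def check_dataset_layout(root):
    """Return (problems, image_count) for the expected dataset/ layout."""
    dataset = Path(root) / "dataset"
    captions = dataset / "captions.txt"
    images_dir = dataset / "Images"

    problems = []
    if not captions.is_file():
        problems.append(f"missing captions file: {captions}")
    if not images_dir.is_dir():
        problems.append(f"missing image directory: {images_dir}")

    # Count .jpg files only when the image directory exists.
    image_count = len(list(images_dir.glob("*.jpg"))) if images_dir.is_dir() else 0
    if images_dir.is_dir() and image_count == 0:
        problems.append(f"no .jpg files found in {images_dir}")
    return problems, image_count


if __name__ == "__main__":
    problems, image_count = check_dataset_layout(".")
    for problem in problems:
        print("WARNING:", problem)
    print("images found:", image_count)
```

Run from the project root, an image count that differs from the 8,091 unique images reported in the dataset summary would suggest an incomplete download.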
Official dataset / author links:
- Flickr8k image-caption page by the Hockenmaier group: https://hockenmaier.cs.illinois.edu/8k-pictures.html
- Official paper/data page: https://hockenmaier.cs.illinois.edu/Framing_Image_Description/KCCA.html
- JAIR article page: https://www.jair.org/index.php/jair/article/view/10833
- DOI: https://doi.org/10.1613/jair.3994
The dataset is associated with the following paper:
@article{hodosh2013framing,
title = {Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics},
author = {Hodosh, Micah and Young, Peter and Hockenmaier, Julia},
journal = {Journal of Artificial Intelligence Research},
volume = {47},
pages = {853--899},
year = {2013},
doi = {10.1613/jair.3994}
}

The original image-caption annotation collection should also be cited:
@inproceedings{rashtchian2010collecting,
title = {Collecting Image Annotations Using Amazon's Mechanical Turk},
author = {Rashtchian, Cyrus and Young, Peter and Hodosh, Micah and Hockenmaier, Julia},
booktitle = {Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk},
year = {2010}
}

| Item | Value |
|---|---|
| Unique images | 8,091 |
| Caption rows | 40,445 |
| Average captions per image | 4.9988 |
| Vocabulary size before min-frequency filtering | 8,488 |
| CNN-LSTM vocabulary size | 4,404 |
| Missing images | 0 |
| Corrupted images | 0 |
| Split leakage | None detected |

| Split | Images | Caption rows |
|---|---|---|
| Train | 5,663 | 28,306 |
| Validation | 1,214 | 6,069 |
| Test | 1,214 | 6,070 |
The split is performed by image ID, not by caption row, to prevent leakage.
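Splitting by image ID means that all captions of one image end up in the same split. The sketch below illustrates the idea under stated assumptions: the function name `split_by_image`, the seed, and the 70/15/15 fractions are hypothetical (the fractions are only inferred from the 5,663 / 1,214 / 1,214 image counts in the table), and the project's own splitting code may differ.

```python
import random
from collections import defaultdict


def split_by_image(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Split (image_id, caption) rows so each image lands in exactly one split."""
    by_image = defaultdict(list)
    for image_id, caption in rows:
        by_image[image_id].append(caption)

    image_ids = sorted(by_image)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(image_ids)

    n_train = int(len(image_ids) * train_frac)
    n_val = int(len(image_ids) * val_frac)
    id_splits = {
        "train": image_ids[:n_train],
        "val": image_ids[n_train:n_train + n_val],
        "test": image_ids[n_train + n_val:],
    }
    # Expand each image ID back into its caption rows.
    return {
        name: [(img, cap) for img in ids for cap in by_image[img]]
        for name, ids in id_splits.items()
    }


# Toy data: 20 images with 5 captions each.
rows = [(f"img_{i}.jpg", f"caption {j} for image {i}") for i in range(20) for j in range(5)]
splits = split_by_image(rows)
train_ids = {img for img, _ in splits["train"]}
val_ids = {img for img, _ in splits["val"]}
test_ids = {img for img, _ in splits["test"]}

# A leakage check: no image ID may appear in more than one split.
assert not (train_ids & val_ids or train_ids & test_ids or val_ids & test_ids)
```

Splitting caption rows directly would let the same image appear in both training and test data, which would inflate retrieval and captioning scores.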
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS/Linux
pip install -r requirements.txt

For the TensorFlow CLIP/BLIP run used in this project, the following packages were also required in the Windows environment:
pip install "transformers==4.52.3" tf-keras huggingface-hub safetensors

No PyTorch version of the code is required for the current implementation.
The CNN-LSTM uses a local ResNet50 ImageNet weight file:
outputs/checkpoints/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
The trained CNN-LSTM checkpoint and vocabulary are saved as:
outputs/checkpoints/cnn_lstm_best.weights.h5
outputs/checkpoints/caption_vocab.json
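The saved vocabulary relates to the min-frequency filtering mentioned in the dataset summary, which reduces 8,488 raw tokens to 4,404 for the CNN-LSTM. The sketch below shows how such a filtered vocabulary could be built. The threshold, the special tokens, the whitespace tokenization, and the function names are assumptions for illustration; the actual contents of caption_vocab.json are not specified here.

```python
import json
from collections import Counter

# Assumed special tokens; the project's real token names may differ.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def build_vocab(captions, min_freq=2):
    """Map each token seen at least min_freq times to an integer ID."""
    counts = Counter(token for caption in captions for token in caption.lower().split())
    kept = sorted(token for token, count in counts.items() if count >= min_freq)
    return {token: index for index, token in enumerate(SPECIAL_TOKENS + kept)}


def encode(caption, vocab):
    """Convert a caption to token IDs, mapping rare or unseen words to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(token, unk) for token in caption.lower().split()]
    return [vocab["<start>"]] + ids + [vocab["<end>"]]


captions = ["a dog runs", "a dog jumps", "a cat sleeps"]
vocab = build_vocab(captions, min_freq=2)
print(json.dumps(vocab))              # only "a" and "dog" survive the filter
print(encode("a dog sleeps", vocab))  # "sleeps" is mapped to <unk>
```

Filtering rare words keeps the decoder's output layer smaller, at the cost of replacing infrequent words with an unknown token.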
python scripts/run_eda.py
python scripts/run_classical.py
python scripts/run_vlm.py --clip
python scripts/run_deep.py --epochs 10
python scripts/run_vlm.py --blip
python scripts/run_evaluation.py

| Approach | Direction | R@1 | R@5 | R@10 | Median rank | Mean rank | MRR | Queries |
|---|---|---|---|---|---|---|---|---|
| Classical ML | Image→Text | 0.08% | 0.33% | 0.66% | 2692.0 | 4523.49 | 0.0045 | 1214 |
| CLIP ViT-B/32 | Image→Text | 69.52% | 88.71% | 94.15% | 1.0 | 3.17 | 0.7817 | 1214 |
| CLIP ViT-B/32 | Text→Image | 50.59% | 78.04% | 86.95% | 1.0 | 7.12 | 0.6290 | 6070 |

| Approach | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | CLIPScore |
|---|---|---|---|---|---|---|---|---|---|
| CNN-LSTM | 0.5165 | 0.3325 | 0.1999 | 0.1183 | 0.3376 | 0.4349 | 0.3425 | Not computed | Not computed |
| BLIP | 0.5444 | 0.3813 | 0.2649 | 0.1759 | 0.3822 | 0.5213 | 0.5265 | Not computed | 0.2846 |
SPICE was not computed because the pycocoevalcap SPICE backend failed with a Java/CoreNLP compatibility error. This does not affect the other metrics.
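To make the BLEU columns easier to interpret, the following sketch computes an unsmoothed sentence-level BLEU from clipped n-gram precisions and a brevity penalty. It is an illustration only and is not the pycocoevalcap implementation used for the table, which typically aggregates counts over the whole test set; the function names and whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with clipped n-gram precision."""
    cand = candidate.lower().split()
    refs = [ref.lower().split() for ref in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(ref) for ref in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


references = ["a dog runs on the grass", "a brown dog is running outside"]
print(sentence_bleu("a dog runs on the grass", references))       # exact match with one reference
print(sentence_bleu("a dog", ["a dog runs fast"], max_n=1))       # penalized for being too short
```

Higher-order scores such as BLEU-4 fall quickly for short captions because every 1-gram to 4-gram precision must be non-zero, which is one reason BLEU-4 is much lower than BLEU-1 for both models.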
The CNN-LSTM model was trained for 10 epochs due to CPU-based computational limits.
| Quantity | Value |
|---|---|
| First training loss | 5.0504 |
| Final training loss | 2.9909 |
| First validation loss | 4.4537 |
| Final validation loss | 3.1255 |
| Test predictions generated | 1,214 |
| Successful CNN-LSTM predictions | 1,214 |
BLIP generated captions for all 1,214 test images.
| Quantity | Value |
|---|---|
| Successful BLIP predictions | 1,214 |
| Unique BLIP captions | 937 |
| Mean BLIP caption length | 6.4390 words |
| Mean BLIP CLIPScore | 0.2846 |
| Median BLIP CLIPScore | 0.2875 |
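Summary rows like those in the table can be derived from the prediction file with a few lines of standard-library Python. The sketch below assumes the captions and per-image scores are already loaded into lists; the function name is hypothetical, and whether the project lowercases or normalizes whitespace before counting unique captions is an assumption.

```python
from statistics import mean, median


def summarize_captions(captions, scores):
    """Summarize generated captions and their per-image scores."""
    # Normalize case and whitespace so trivially different strings count once.
    normalized = [" ".join(caption.lower().split()) for caption in captions]
    return {
        "count": len(normalized),
        "unique": len(set(normalized)),
        "mean_length_words": mean(len(caption.split()) for caption in normalized),
        "mean_score": mean(scores),
        "median_score": median(scores),
    }


captions = ["a dog running", "A dog  running", "two kids on a beach"]
scores = [0.30, 0.28, 0.25]
summary = summarize_captions(captions, scores)
print(summary)
```

A unique-caption count well below the number of test images, as in the table (937 of 1,214), indicates that the model reuses the same caption for several images.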
- The classical ML baseline is interpretable but performs poorly for semantic retrieval (image-to-text R@1 of 0.08%).
- The CNN-LSTM model learns real captioning patterns (validation loss falls from 4.4537 to 3.1255 over 10 epochs), but it is limited by dataset size, model simplicity, and CPU training constraints.
- CLIP is the strongest retrieval model, with image-to-text R@1=69.52% and text-to-image R@1=50.59%.
- BLIP is the strongest captioning model, outperforming CNN-LSTM on BLEU, METEOR, ROUGE-L, and CIDEr.
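The retrieval figures cited in these findings (R@K, median rank, mean rank, MRR) are all functions of one number per query: the rank of the correct match. The sketch below shows how they could be computed. It assumes 1-based ranks and, for queries with several correct candidates such as an image with multiple captions, uses the rank of the best-placed one; the source does not state which convention the project follows, and the function names are hypothetical.

```python
from statistics import mean, median


def rank_of_first_match(scores, relevant):
    """Rank (1 = best) of the highest-scoring relevant candidate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return next(position for position, index in enumerate(order, start=1) if index in relevant)


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Summarize 1-based ranks of the first correct match, one rank per query."""
    metrics = {f"R@{k}": 100.0 * sum(rank <= k for rank in ranks) / len(ranks) for k in ks}
    metrics["median_rank"] = float(median(ranks))
    metrics["mean_rank"] = mean(ranks)
    metrics["MRR"] = mean(1.0 / rank for rank in ranks)
    metrics["queries"] = len(ranks)
    return metrics


# Toy example: similarity scores of one query against three candidates,
# where candidate 2 is the correct match.
example_rank = rank_of_first_match([0.1, 0.9, 0.5], relevant={2})

ranks = [1, 3, 12, 1]
metrics = retrieval_metrics(ranks)
print(example_rank, metrics)
```

Mean rank is sensitive to a few badly ranked queries, while median rank and MRR are dominated by the typical case, which helps explain why CLIP can show a median rank of 1.0 alongside a mean rank of 7.12 for text-to-image retrieval.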
outputs/tables/eda_summary.json
outputs/tables/classical_retrieval_metrics.json
outputs/tables/clip_i2t_metrics.json
outputs/tables/clip_t2i_metrics.json
outputs/tables/deep_caption_metrics.json
outputs/tables/blip_caption_metrics.json
outputs/tables/deep_learning_history.csv
outputs/tables/blip_clipscore.csv
outputs/predictions/classical_retrieval_predictions.csv
outputs/predictions/clip_retrieval_predictions.csv
outputs/predictions/deep_learning_predictions.csv
outputs/predictions/blip_caption_predictions.csv
outputs/figures/deep_learning_curve.png
outputs/checkpoints/cnn_lstm_best.weights.h5
outputs/checkpoints/caption_vocab.json
src/
    config.py
    data_loader.py
    preprocessing.py
    eda.py
    classical_ml.py
    deep_model.py
    vlm_model.py
    metrics.py
    visualization.py
scripts/
    run_eda.py
    run_classical.py
    run_deep.py
    run_vlm.py
    run_evaluation.py
report/
    final_report.md
outputs/
    checkpoints/
    figures/
    predictions/
    tables/
The completed project shows that pretrained VLMs are the strongest models for Flickr8k image-text learning. CLIP performs very well for retrieval (image-to-text R@1 of 69.52%), while BLIP performs best for caption generation (CIDEr of 0.5265 versus 0.3425 for the CNN-LSTM). The custom CNN-LSTM is still a valid trained baseline whose training and validation losses both decrease over 10 epochs, while the classical ML model provides an interpretable but weak baseline.