Aryanoor/Flickr8k-Multimodal-Learning-Project

Flickr8k Multimodal Learning Project

A complete comparative study of classical machine learning, deep learning, and pretrained vision-language models on the Flickr8k image-caption dataset.

Project Overview

This project compares four experimental components:

  1. Classical ML retrieval baseline using hand-crafted image features and TF-IDF/SVD caption features.
  2. CNN-LSTM captioning model using a ResNet50 image encoder and LSTM decoder.
  3. CLIP retrieval model using openai/clip-vit-base-patch32 for zero-shot image-text retrieval.
  4. BLIP captioning model using Salesforce/blip-image-captioning-base for zero-shot caption generation.

The completed pipeline evaluates both image-text retrieval and image caption generation.
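
As a rough illustration of the retrieval side, candidates can be ranked by cosine similarity in a shared embedding space, which is how CLIP-style models are typically used. The sketch below is not taken from this repository: `cosine` and `rank_of_match` are hypothetical helper names and the vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_of_match(query, candidates, correct_indices):
    """Return the 1-based rank of the best-ranked correct candidate."""
    scores = [cosine(query, c) for c in candidates]
    # Sort candidate indices from most to least similar.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    for position, idx in enumerate(order, start=1):
        if idx in correct_indices:
            return position
    raise ValueError("no correct candidate in the candidate list")


# Toy example: one image embedding queried against three caption embeddings.
image_vec = [1.0, 0.0]
caption_vecs = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(rank_of_match(image_vec, caption_vecs, {1}))
```

The resulting ranks are the raw material for retrieval metrics such as R@K, median rank, and MRR.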

Dataset Layout

project-root/
  dataset/
    captions.txt
    Images/
      *.jpg
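
Before running any script, it may help to confirm that the layout above is in place. The following is a minimal sketch; `check_dataset_layout` is a hypothetical helper, not a function from this repository.

```python
from pathlib import Path


def check_dataset_layout(project_root="."):
    """Return a list of problems found in the expected dataset layout."""
    root = Path(project_root)
    captions = root / "dataset" / "captions.txt"
    images_dir = root / "dataset" / "Images"
    problems = []
    if not captions.is_file():
        problems.append(f"missing captions file: {captions}")
    if not images_dir.is_dir():
        problems.append(f"missing image folder: {images_dir}")
    elif not any(images_dir.glob("*.jpg")):
        problems.append(f"no .jpg files found in: {images_dir}")
    return problems


# An empty list means the layout matches the tree shown above.
for problem in check_dataset_layout("."):
    print(problem)
```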

Dataset Source, License, and Citation

This repository does not include the raw Flickr8k image files or captions. The dataset should be downloaded or requested separately and placed under dataset/ as shown above. This avoids redistributing image files whose copyright belongs to the original Flickr image owners.

Official dataset / author links:

The dataset is associated with the following paper:

@article{hodosh2013framing,
  title   = {Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics},
  author  = {Hodosh, Micah and Young, Peter and Hockenmaier, Julia},
  journal = {Journal of Artificial Intelligence Research},
  volume  = {47},
  pages   = {853--899},
  year    = {2013},
  doi     = {10.1613/jair.3994}
}

The original image-caption annotation collection should also be cited:

@inproceedings{rashtchian2010collecting,
  title     = {Collecting Image Annotations Using Amazon's Mechanical Turk},
  author    = {Rashtchian, Cyrus and Young, Peter and Hodosh, Micah and Hockenmaier, Julia},
  booktitle = {Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk},
  year      = {2010}
}

Dataset Summary

| Item | Value |
| --- | --- |
| Unique images | 8,091 |
| Caption rows | 40,445 |
| Average captions per image | 4.9988 |
| Vocabulary size before min-frequency filtering | 8,488 |
| CNN-LSTM vocabulary size | 4,404 |
| Missing images | 0 |
| Corrupted images | 0 |
| Split leakage | None detected |
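
The drop from 8,488 to 4,404 vocabulary entries comes from min-frequency filtering. A minimal sketch of that idea is shown below; the threshold, the whitespace tokenization, and the special tokens are all assumptions, since the README does not state the values used in the project.

```python
from collections import Counter

# Hypothetical special tokens; the project's actual tokens are not documented here.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def build_vocab(captions, min_freq=2):
    """Map each word seen at least min_freq times to an integer id."""
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
    )
    kept = sorted(word for word, count in counts.items() if count >= min_freq)
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + kept)}


captions = ["a dog runs", "a dog jumps", "a cat sleeps"]
vocab = build_vocab(captions, min_freq=2)
print(len(vocab), sorted(vocab, key=vocab.get))
```

Words below the threshold would be mapped to the unknown token at encoding time.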

Train/Validation/Test Split

| Split | Images | Caption rows |
| --- | --- | --- |
| Train | 5,663 | 28,306 |
| Validation | 1,214 | 6,069 |
| Test | 1,214 | 6,070 |

The split is performed by image ID, not by caption row, to prevent leakage.
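
One way to split by image ID is sketched below. The 70% train share, the even division of the remainder, and the seed are assumptions; with 8,091 images this rule gives 5,663 / 1,214 / 1,214, which is consistent with the table but does not confirm the project's actual settings. `split_by_image` is a hypothetical name.

```python
import random


def split_by_image(rows, train_pct=70, seed=42):
    """Split (image_id, caption) rows so every image lands in exactly one split."""
    image_ids = sorted({image_id for image_id, _ in rows})
    random.Random(seed).shuffle(image_ids)

    n_train = len(image_ids) * train_pct // 100
    n_val = (len(image_ids) - n_train) // 2  # remainder is halved between val and test

    assignment = {}
    for i, image_id in enumerate(image_ids):
        if i < n_train:
            assignment[image_id] = "train"
        elif i < n_train + n_val:
            assignment[image_id] = "val"
        else:
            assignment[image_id] = "test"

    # Every caption row follows its image, so no image appears in two splits.
    splits = {"train": [], "val": [], "test": []}
    for image_id, caption in rows:
        splits[assignment[image_id]].append((image_id, caption))
    return splits


rows = [(f"img{i}.jpg", f"caption {j}") for i in range(20) for j in range(2)]
splits = split_by_image(rows)
print({name: len(part) for name, part in splits.items()})
```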

Environment Setup

python -m venv .venv
.venv\Scripts\activate      # Windows
# source .venv/bin/activate  # macOS/Linux
pip install -r requirements.txt

For the TensorFlow CLIP/BLIP run used in this project, the following packages were also required in the Windows environment:

pip install "transformers==4.52.3" tf-keras huggingface-hub safetensors

The current implementation does not require PyTorch.

Important Model Files

The CNN-LSTM uses a local ResNet50 ImageNet weight file:

outputs/checkpoints/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5

The trained CNN-LSTM checkpoint and vocabulary are saved as:

outputs/checkpoints/cnn_lstm_best.weights.h5
outputs/checkpoints/caption_vocab.json

How to Run

python scripts/run_eda.py
python scripts/run_classical.py
python scripts/run_vlm.py --clip
python scripts/run_deep.py --epochs 10
python scripts/run_vlm.py --blip
python scripts/run_evaluation.py
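
The same sequence can be driven from a single Python script if that is more convenient. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical helpers, and the step order simply mirrors the commands listed above.

```python
import subprocess
import sys

# Script paths and flags are copied from the commands above, in the same order.
PIPELINE = [
    ["scripts/run_eda.py"],
    ["scripts/run_classical.py"],
    ["scripts/run_vlm.py", "--clip"],
    ["scripts/run_deep.py", "--epochs", "10"],
    ["scripts/run_vlm.py", "--blip"],
    ["scripts/run_evaluation.py"],
]


def build_commands(python=sys.executable):
    """Prefix every step with the interpreter that should run it."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(dry_run=False):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(cmd, check=True)


# Call run_pipeline() from the project root to execute all steps,
# or run_pipeline(dry_run=True) to only print the commands.
```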

Completed Results

Retrieval Results

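| Approach | Direction | R@1 | R@5 | R@10 | Median rank | Mean rank | MRR | Queries |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classical ML | Image→Text | 0.08% | 0.33% | 0.66% | 2692.0 | 4523.49 | 0.0045 | 1214 |
| CLIP ViT-B/32 | Image→Text | 69.52% | 88.71% | 94.15% | 1.0 | 3.17 | 0.7817 | 1214 |
| CLIP ViT-B/32 | Text→Image | 50.59% | 78.04% | 86.95% | 1.0 | 7.12 | 0.6290 | 6070 |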
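
For reference, the columns in the retrieval table can all be derived from one list of per-query ranks. The sketch below uses the standard definitions; `retrieval_metrics` is a hypothetical name, and it assumes each query contributes the 1-based rank of its best-ranked correct match, which may differ from how src/metrics.py handles images with several captions.

```python
import statistics


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute R@K (in percent), median/mean rank, and MRR from 1-based ranks."""
    n = len(ranks)
    if n == 0:
        raise ValueError("ranks must not be empty")
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["median_rank"] = float(statistics.median(ranks))
    metrics["mean_rank"] = sum(ranks) / n
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n  # mean reciprocal rank
    metrics["queries"] = n
    return metrics


# Toy example with four queries.
toy = retrieval_metrics([1, 2, 6, 11])
print(toy)
```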

Captioning Results

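| Approach | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | CLIPScore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-LSTM | 0.5165 | 0.3325 | 0.1999 | 0.1183 | 0.3376 | 0.4349 | 0.3425 | Not computed | Not computed |
| BLIP | 0.5444 | 0.3813 | 0.2649 | 0.1759 | 0.3822 | 0.5213 | 0.5265 | Not computed | 0.2846 |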

SPICE was not computed because the pycocoevalcap SPICE backend failed with a Java/CoreNLP compatibility error. This does not affect the other metrics.
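
To make the BLEU columns easier to read, the sketch below shows the clipped (modified) n-gram precision that BLEU is built on. It is only one component: full BLEU-N also takes a geometric mean over n-gram orders and applies a brevity penalty. This is an illustration with a hypothetical function name and plain whitespace tokenization, not the scoring code used for the table.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the the the the the the the", refs, n=1))
```

Clipping is what stops a degenerate caption that repeats a common word from scoring well.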

CNN-LSTM Training Result

The CNN-LSTM model was trained for only 10 epochs because training ran on CPU.

| Quantity | Value |
| --- | --- |
| First training loss | 5.0504 |
| Final training loss | 2.9909 |
| First validation loss | 4.4537 |
| Final validation loss | 3.1255 |
| Test predictions generated | 1,214 |
| Successful CNN-LSTM predictions | 1,214 |
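
The loss values above can be put in perspective with a little arithmetic. The sketch below computes the relative reduction for each curve. The `perplexity` helper is included only as an optional reading: it is meaningful only if the reported loss is a mean per-token cross-entropy in nats, which this README does not state.

```python
import math


def relative_drop(first, final):
    """Fractional reduction from the first to the final loss value."""
    return (first - final) / first


def perplexity(loss):
    """exp(loss); assumes loss is mean per-token cross-entropy in nats."""
    return math.exp(loss)


# Loss values taken from the table above.
train_drop = relative_drop(5.0504, 2.9909)
val_drop = relative_drop(4.4537, 3.1255)
print(f"training loss reduction:   {train_drop:.2%}")
print(f"validation loss reduction: {val_drop:.2%}")
print(f"final validation - training gap: {3.1255 - 2.9909:.4f}")
```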

BLIP Result

BLIP generated captions for all 1,214 test images.

| Quantity | Value |
| --- | --- |
| Successful BLIP predictions | 1,214 |
| Unique BLIP captions | 937 |
| Mean BLIP caption length | 6.4390 words |
| Mean BLIP CLIPScore | 0.2846 |
| Median BLIP CLIPScore | 0.2875 |
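
Summary figures such as the unique-caption count and mean caption length can be computed from a list of generated captions along the following lines. `caption_stats` is a hypothetical helper; it assumes exact string matching for uniqueness and whitespace tokenization for length, which may not match the project's own conventions.

```python
from statistics import mean


def caption_stats(captions):
    """Count, number of distinct captions, and mean length in words."""
    lengths = [len(caption.split()) for caption in captions]
    return {
        "count": len(captions),
        "unique": len(set(captions)),
        "mean_length_words": mean(lengths),
    }


sample = ["a dog runs", "a dog runs", "two people walk outside"]
stats = caption_stats(sample)
print(stats)
```

In the table above, 937 unique captions out of 1,214 means roughly 77% of the generated captions are distinct.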

Key Interpretation

  • The classical ML baseline is interpretable but performs poorly for semantic retrieval (image-to-text R@1=0.08%, median rank 2692).
  • The CNN-LSTM model learns real captioning patterns (training loss fell from 5.0504 to 2.9909; BLEU-4=0.1183), but it is limited by dataset size, model simplicity, and CPU training constraints.
  • CLIP is the strongest retrieval model, with image-to-text R@1=69.52% and text-to-image R@1=50.59%.
  • BLIP is the strongest captioning model, outperforming CNN-LSTM on BLEU, METEOR, ROUGE-L, and CIDEr.

Output Files

outputs/tables/eda_summary.json
outputs/tables/classical_retrieval_metrics.json
outputs/tables/clip_i2t_metrics.json
outputs/tables/clip_t2i_metrics.json
outputs/tables/deep_caption_metrics.json
outputs/tables/blip_caption_metrics.json
outputs/tables/deep_learning_history.csv
outputs/tables/blip_clipscore.csv
outputs/predictions/classical_retrieval_predictions.csv
outputs/predictions/clip_retrieval_predictions.csv
outputs/predictions/deep_learning_predictions.csv
outputs/predictions/blip_caption_predictions.csv
outputs/figures/deep_learning_curve.png
outputs/checkpoints/cnn_lstm_best.weights.h5
outputs/checkpoints/caption_vocab.json
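
To compare runs, the metric files listed above can be gathered into one dictionary. This is a minimal sketch: `load_metric_tables` is a hypothetical helper, and it makes no assumption about the keys inside each JSON file because their structure is not documented here.

```python
import json
from pathlib import Path


def load_metric_tables(tables_dir="outputs/tables"):
    """Load every *_metrics.json file in tables_dir, keyed by file stem."""
    tables = {}
    for path in sorted(Path(tables_dir).glob("*_metrics.json")):
        with path.open(encoding="utf-8") as handle:
            tables[path.stem] = json.load(handle)
    return tables


# Prints nothing if the folder does not exist yet or holds no metric files.
for name, metrics in load_metric_tables().items():
    print(name, metrics)
```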

Project Structure

src/
  config.py
  data_loader.py
  preprocessing.py
  eda.py
  classical_ml.py
  deep_model.py
  vlm_model.py
  metrics.py
  visualization.py
scripts/
  run_eda.py
  run_classical.py
  run_deep.py
  run_vlm.py
  run_evaluation.py
report/
  final_report.md
outputs/
  checkpoints/
  figures/
  predictions/
  tables/

Final Conclusion

The completed project shows that pretrained vision-language models (VLMs) are the strongest models for Flickr8k image-text learning. CLIP performs very well for retrieval, while BLIP performs best for caption generation. The custom CNN-LSTM remains a valid trained baseline that demonstrates meaningful learning, while the classical ML model provides an interpretable but weak baseline.
