Flickr8k Multimodal Learning Project

A complete comparative study of classical machine learning, deep learning, and pretrained vision-language models on the Flickr8k image-caption dataset.

Project Overview

This project compares four experimental components:

Classical ML retrieval baseline using hand-crafted image features and TF-IDF/SVD caption features.
CNN-LSTM captioning model using a ResNet50 image encoder and LSTM decoder.
CLIP retrieval model using openai/clip-vit-base-patch32 for zero-shot image-text retrieval.
BLIP captioning model using Salesforce/blip-image-captioning-base for zero-shot caption generation.

The completed pipeline evaluates both image-text retrieval and image caption generation.

Dataset Layout

project-root/
  dataset/
    captions.txt
    Images/
      *.jpg
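
Before running any script, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name check_dataset_layout and the keys of the returned dictionary are hypothetical, and only the paths shown in the tree above are assumed.

```python
from pathlib import Path


def check_dataset_layout(project_root="."):
    """Report whether dataset/captions.txt and dataset/Images/*.jpg are present.

    Hypothetical helper: it only inspects the paths from the layout above.
    """
    dataset_dir = Path(project_root) / "dataset"
    captions_file = dataset_dir / "captions.txt"
    images_dir = dataset_dir / "Images"

    # Count only .jpg files, matching the *.jpg pattern in the layout.
    jpg_count = len(list(images_dir.glob("*.jpg"))) if images_dir.is_dir() else 0

    return {
        "captions_file_found": captions_file.is_file(),
        "images_dir_found": images_dir.is_dir(),
        "jpg_count": jpg_count,
    }
```

Calling check_dataset_layout(".") from project-root returns the three checks as a dictionary, so a missing captions file or an empty Images folder shows up before a long run starts.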

Dataset Source, License, and Citation

This repository does not include the raw Flickr8k image files or captions. The dataset should be downloaded or requested separately and placed under dataset/ as shown above. This avoids redistributing image files whose copyright belongs to the original Flickr image owners.

Official dataset / author links:

Flickr8k image-caption page by the Hockenmaier group: https://hockenmaier.cs.illinois.edu/8k-pictures.html
Official paper/data page: https://hockenmaier.cs.illinois.edu/Framing_Image_Description/KCCA.html
JAIR article page: https://www.jair.org/index.php/jair/article/view/10833
DOI: https://doi.org/10.1613/jair.3994

The dataset is associated with the following paper:

@article{hodosh2013framing,
  title   = {Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics},
  author  = {Hodosh, Micah and Young, Peter and Hockenmaier, Julia},
  journal = {Journal of Artificial Intelligence Research},
  volume  = {47},
  pages   = {853--899},
  year    = {2013},
  doi     = {10.1613/jair.3994}
}

The original image-caption annotation collection should also be cited:

@inproceedings{rashtchian2010collecting,
  title     = {Collecting Image Annotations Using Amazon's Mechanical Turk},
  author    = {Rashtchian, Cyrus and Young, Peter and Hodosh, Micah and Hockenmaier, Julia},
  booktitle = {Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk},
  year      = {2010}
}

Dataset Summary

Item	Value
Unique images	8,091
Caption rows	40,445
Average captions per image	4.9988
Vocabulary size before min-frequency filtering	8,488
CNN-LSTM vocabulary size	4,404
Missing images	0
Corrupted images	0
Split leakage	None detected
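
The two vocabulary sizes in the table above presumably differ because rare words are dropped before training the CNN-LSTM. The sketch below illustrates min-frequency filtering in general. The tokenization (lowercase, whitespace split), the threshold min_freq=2, and the special tokens are all assumptions; the project's actual settings live in its own preprocessing code and are not stated here.

```python
from collections import Counter


def build_vocab(captions, min_freq=2,
                specials=("<pad>", "<start>", "<end>", "<unk>")):
    """Map tokens to integer IDs, keeping only tokens seen at least min_freq times.

    Assumed tokenization: lowercase, then split on whitespace.
    Assumed special tokens: padding, sequence start/end, unknown word.
    """
    counts = Counter(tok for cap in captions for tok in cap.lower().split())

    # Sort the kept tokens so the ID assignment is deterministic.
    kept = sorted(tok for tok, count in counts.items() if count >= min_freq)

    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}
```

Words below the threshold are not in the returned mapping, so an encoder built on it would map them to the unknown token.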

Train/Validation/Test Split

Split	Images	Caption rows
Train	5,663	28,306
Validation	1,214	6,069
Test	1,214	6,070

The split is performed by image ID, not by caption row, to prevent leakage.
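
A minimal sketch of an image-level split is shown below. It is an illustration, not the project's code: the function name, the 70/15/15 fractions (chosen to roughly match the table above), and the seed are assumptions, and the exact counts it produces may differ slightly from the table.

```python
import random


def split_by_image_id(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Split (image_id, caption) rows so each image lands in exactly one split.

    Fractions and seed are assumed values for illustration.
    """
    rows = list(rows)

    # Shuffle the unique image IDs, not the caption rows.
    image_ids = sorted({image_id for image_id, _ in rows})
    random.Random(seed).shuffle(image_ids)

    n_train = int(len(image_ids) * train_frac)
    n_val = int(len(image_ids) * val_frac)

    assignment = {}
    for position, image_id in enumerate(image_ids):
        if position < n_train:
            assignment[image_id] = "train"
        elif position < n_train + n_val:
            assignment[image_id] = "val"
        else:
            assignment[image_id] = "test"

    # Every caption follows its image, so no image is shared across splits.
    splits = {"train": [], "val": [], "test": []}
    for image_id, caption in rows:
        splits[assignment[image_id]].append((image_id, caption))
    return splits
```

Splitting caption rows directly would let captions of the same image land in both train and test, which is the leakage the image-level split avoids.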

Environment Setup

python -m venv .venv
.venv\Scripts\activate      # Windows
# source .venv/bin/activate  # macOS/Linux
pip install -r requirements.txt

For the TensorFlow CLIP/BLIP run used in this project, the following packages were also required in the Windows environment:

pip install "transformers==4.52.3" tf-keras huggingface-hub safetensors

The current implementation does not require PyTorch.

Important Model Files

The CNN-LSTM uses a local ResNet50 ImageNet weight file:

outputs/checkpoints/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5

The trained CNN-LSTM checkpoint and vocabulary are saved as:

outputs/checkpoints/cnn_lstm_best.weights.h5
outputs/checkpoints/caption_vocab.json

How to Run

python scripts/run_eda.py
python scripts/run_classical.py
python scripts/run_vlm.py --clip
python scripts/run_deep.py --epochs 10
python scripts/run_vlm.py --blip
python scripts/run_evaluation.py
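
The commands above can also be chained from one Python entry point that stops at the first failing step. This is a hypothetical convenience wrapper, not a file in the repository; it assumes it is launched from the project root and uses only the script names and flags listed above.

```python
import subprocess
import sys

# Script names and flags copied from the commands above, in the same order.
PIPELINE_STEPS = [
    ["scripts/run_eda.py"],
    ["scripts/run_classical.py"],
    ["scripts/run_vlm.py", "--clip"],
    ["scripts/run_deep.py", "--epochs", "10"],
    ["scripts/run_vlm.py", "--blip"],
    ["scripts/run_evaluation.py"],
]


def run_pipeline(dry_run=False):
    """Run each step with the current interpreter; return the commands used.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [[sys.executable, *step] for step in PIPELINE_STEPS]
    for command in commands:
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(command, check=True)
    return commands
```

Calling run_pipeline(dry_run=True) first is a cheap way to confirm the order before starting the CPU-bound training step.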

Completed Results

Retrieval Results

Approach	Direction	R@1	R@5	R@10	Median rank	Mean rank	MRR	Queries
Classical ML	Image→Text	0.08%	0.33%	0.66%	2692.0	4523.49	0.0045	1214
CLIP ViT-B/32	Image→Text	69.52%	88.71%	94.15%	1.0	3.17	0.7817	1214
CLIP ViT-B/32	Text→Image	50.59%	78.04%	86.95%	1.0	7.12	0.6290	6070
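
For readers unfamiliar with the columns in the table above, the sketch below shows how R@K, median rank, mean rank, and MRR follow from one rank per query. It uses the common convention that a query's rank is the 1-based position of the first correct candidate; whether the project's metrics code uses exactly this convention is an assumption, and both function names are hypothetical.

```python
from statistics import median


def first_relevant_rank(scores, relevant):
    """Return the 1-based rank of the best-scored relevant candidate.

    scores: similarity of the query to every candidate (higher is better).
    relevant: set of candidate indices that count as correct.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for position, candidate in enumerate(order, start=1):
        if candidate in relevant:
            return position
    raise ValueError("no relevant candidate in the candidate list")


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Aggregate per-query ranks into the metrics reported in the table."""
    n = len(ranks)
    metrics = {f"R@{k}": sum(rank <= k for rank in ranks) / n for k in ks}
    metrics["median_rank"] = median(ranks)
    metrics["mean_rank"] = sum(ranks) / n
    metrics["mrr"] = sum(1.0 / rank for rank in ranks) / n
    return metrics
```

Under this convention, an image-to-text query with several ground-truth captions counts as a hit at K if any one of them appears in the top K.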

Captioning Results

Approach	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE	CLIPScore
CNN-LSTM	0.5165	0.3325	0.1999	0.1183	0.3376	0.4349	0.3425	Not computed	Not computed
BLIP	0.5444	0.3813	0.2649	0.1759	0.3822	0.5213	0.5265	Not computed	0.2846

SPICE was not computed because the pycocoevalcap SPICE backend failed with a Java/CoreNLP compatibility error. This does not affect the other metrics, which are computed independently of SPICE.
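
The CLIPScore column in the captioning table is based on the cosine similarity between a CLIP image embedding and the CLIP text embedding of the generated caption. The sketch below shows only that arithmetic on plain Python lists. The embeddings themselves would come from the CLIP model, and the weight parameter is an assumption: the CLIPScore paper rescales by 2.5, and this README does not state which scaling the project applies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_embedding, text_embedding, weight=1.0):
    """weight * max(cosine, 0); weight=1.0 is an assumed default.

    Negative similarities are clipped to zero, as in the CLIPScore definition.
    """
    return weight * max(cosine_similarity(image_embedding, text_embedding), 0.0)
```

Because the score depends only on the image and the generated caption, it needs no reference captions, unlike the other metrics in the table.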

CNN-LSTM Training Result

The CNN-LSTM model was trained for 10 epochs; the epoch count was limited because training ran on CPU.

Quantity	Value
First training loss	5.0504
Final training loss	2.9909
First validation loss	4.4537
Final validation loss	3.1255
Test predictions generated	1,214
Successful CNN-LSTM predictions	1,214

BLIP Result

BLIP generated captions for all 1,214 test images.

Quantity	Value
Successful BLIP predictions	1,214
Unique BLIP captions	937
Mean BLIP caption length	6.4390 words
Mean BLIP CLIPScore	0.2846
Median BLIP CLIPScore	0.2875
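
The caption-level quantities in the table above (prediction count, unique captions, mean length in words) can be derived from a plain list of generated captions. The sketch below is one way to do it. The normalization (strip, lowercase) and the whitespace word count are assumptions, and the function name is hypothetical, so results may differ slightly from the project's own numbers.

```python
def summarize_captions(captions):
    """Count captions, distinct captions, and mean length in words.

    Assumes case-insensitive uniqueness and whitespace-separated words.
    """
    cleaned = [caption.strip().lower() for caption in captions]
    lengths = [len(caption.split()) for caption in cleaned]
    return {
        "count": len(cleaned),
        "unique": len(set(cleaned)),
        "mean_length_words": sum(lengths) / len(lengths),
    }
```

A unique count well below the prediction count indicates that the model repeats the same caption for different images.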

Key Interpretation

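The classical ML baseline is useful as an interpretable reference point but performs poorly for semantic retrieval, with image-to-text R@1 of 0.08% and a median rank of 2692.
The CNN-LSTM model does learn to caption: training loss falls from 5.0504 to 2.9909 and validation loss from 4.4537 to 3.1255 over 10 epochs. It remains limited by dataset size, model simplicity, and CPU training constraints.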
CLIP is the strongest retrieval model, with image-to-text R@1=69.52% and text-to-image R@1=50.59%.
BLIP is the strongest captioning model, outperforming CNN-LSTM on BLEU, METEOR, ROUGE-L, and CIDEr.

Output Files

outputs/tables/eda_summary.json
outputs/tables/classical_retrieval_metrics.json
outputs/tables/clip_i2t_metrics.json
outputs/tables/clip_t2i_metrics.json
outputs/tables/deep_caption_metrics.json
outputs/tables/blip_caption_metrics.json
outputs/tables/deep_learning_history.csv
outputs/tables/blip_clipscore.csv
outputs/predictions/classical_retrieval_predictions.csv
outputs/predictions/clip_retrieval_predictions.csv
outputs/predictions/deep_learning_predictions.csv
outputs/predictions/blip_caption_predictions.csv
outputs/figures/deep_learning_curve.png
outputs/checkpoints/cnn_lstm_best.weights.h5
outputs/checkpoints/caption_vocab.json

Project Structure

src/
  config.py
  data_loader.py
  preprocessing.py
  eda.py
  classical_ml.py
  deep_model.py
  vlm_model.py
  metrics.py
  visualization.py
scripts/
  run_eda.py
  run_classical.py
  run_deep.py
  run_vlm.py
  run_evaluation.py
report/
  final_report.md
outputs/
  checkpoints/
  figures/
  predictions/
  tables/

Final Conclusion

The results show that the pretrained VLMs are the strongest of the four models on Flickr8k. CLIP leads on retrieval (image-to-text R@1 of 69.52%), and BLIP leads on caption generation (CIDEr of 0.5265 against 0.3425 for the CNN-LSTM). The custom CNN-LSTM is still a valid trained baseline whose training and validation losses both fall over 10 epochs, while the classical ML model provides an interpretable but weak baseline.
