Deep Feature Barcoding with Combinatorial Genetic Algorithm (CGA)

This repository contains the implementation of deep feature barcoding with a Combinatorial Genetic Algorithm (CGA) for medical image retrieval, with a particular focus on TCGA (The Cancer Genome Atlas) datasets.

Overview

This project implements several hashing and barcoding methods for efficient medical image retrieval, with a focus on histopathological images from cancer datasets. The main contribution is the use of a Combinatorial Genetic Algorithm (CGA) to optimize feature ordering for improved barcoding performance.

Key Features

Multiple Barcoding Methods: CGA-dHash, CGA-DFT, LSH, ITQ, LBP, and various neural network-based approaches (DHN, DSH, DTSH, CSQ, DPSH)
Comprehensive Dataset Support: TCGA cancer datasets across multiple organ systems (Brain, Endocrine, Gastrointestinal, Gynecologic, Liver, Mesenchymal, Pulmonary, Urinary Tract)
Multiple Feature Extractors: Support for both KimiaNet and DenseNet121 features
Robust Evaluation: F1-score, Precision@k, and mean Average Precision (mAP) metrics
Efficient Implementation: Optimized Hamming distance computation and vectorized operations

Installation

Prerequisites

Python 3.8 or higher
CUDA-capable GPU (optional, for neural network methods)

Dependencies

Install the required packages using pip:

pip install -r requirements.txt

Dataset Setup

The code expects datasets to be organized in the data/ directory. You have two options for setting up datasets:

Option 1: Automatic Download (Recommended)

Use the --download flag to automatically download datasets when needed:

python main.py --dataset cifar10 --download

Option 2: Manual Download

Download the datasets from the following links and extract them to the ./data directory:

CIFAR-10: https://drive.google.com/uc?export=download&id=1nJHt0v1g8bxlXKcQUFzR3HL28rqeyW4m
CIFAR-100: https://drive.google.com/uc?export=download&id=1V6vBIX-MW8LGAl7W8cGZHbI4D8FhcoeY
COVID-19: https://drive.google.com/uc?export=download&id=1yUhCG__EJUKei4s5dtagldO0Blh6EFzt
Fashion-MNIST: https://drive.google.com/uc?export=download&id=1zAgS9-fGtnRNKY8tVOqMkS8yZHnkDDGB
TCGA: https://drive.google.com/uc?export=download&id=1jZCslLLz_jJ6lctlfutvMN4ljfNn2PQU

After downloading (either automatically or manually), your data directory structure should look like:

data/
├── tcga/
│   ├── kimiaNet_train_data.xlsx
│   ├── kimiaNet_validation_data.xlsx
│   ├── kimiaNet_test_data.xlsx
│   ├── AllKimiaPatches/
│   └── AllDensePatches/
├── cifar10/
│   ├── train_features.npy
│   ├── train_labels.npy
│   ├── test_features.npy
│   └── test_labels.npy
├── cifar100/
├── covid19/
└── fashion/
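
As a quick sanity check of the layout above, the pre-extracted feature files can be loaded directly with NumPy. The helper below is a minimal sketch, not the loader in src/utils.py: the function name load_feature_split is hypothetical, and it assumes each dataset folder holds the four .npy files shown for cifar10.

```python
from pathlib import Path

import numpy as np


def load_feature_split(root, dataset="cifar10"):
    """Load train/test features and labels from <root>/<dataset>/ (hypothetical helper)."""
    base = Path(root) / dataset
    splits = {}
    for split in ("train", "test"):
        features = np.load(base / f"{split}_features.npy")
        labels = np.load(base / f"{split}_labels.npy")
        # Each feature row must have exactly one label.
        if len(features) != len(labels):
            raise ValueError(f"{split}: {len(features)} features vs {len(labels)} labels")
        splits[split] = (features, labels)
    return splits
```

For example, load_feature_split("data")["train"] would return the (features, labels) pair for data/cifar10/, provided the files exist.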

Usage

Basic Usage

Run the main script with default parameters:

python main.py

Advanced Usage

Using CGA-dHash on TCGA Brain dataset:

python main.py --dataset tcga_brain_kimianet --method CGA-dHash --k 10 --cga_n_gen 100

Using neural network methods:

python main.py --dataset tcga_brain_kimianet --method DHN --n_bits 64 --n_epochs 100 --device cuda

With feature selection:

python main.py --dataset tcga_brain_kimianet --method CGA-dHash --feature_selection

With automatic dataset download:

python main.py --dataset cifar10 --method CGA-dHash --download

Available Parameters

--dataset: Dataset to use (default: tcga_brain_kimianet)
--method: Barcoding method (default: CGA-dHash)
--k: Number of nearest neighbors for evaluation (default: 10)
--n_bits: Number of bits for hash codes (default: 128)
--feature_selection: Enable feature selection (only works for TCGA)
--download: Automatically download datasets if they don't exist
--cga_n_gen: Number of generations for CGA (default: 100)
--cga_pop_size: Population size for CGA (default: 100)
--device: Device to use (cuda or cpu)
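
To compare several methods on one dataset, the command lines can be generated programmatically. The snippet below is only a sketch: build_command is a hypothetical helper, it uses only flags listed above, and it prints the commands rather than running them.

```python
import shlex
import sys


def build_command(dataset, method, k=10, n_bits=128, extra=()):
    """Assemble a main.py invocation from the documented flags (hypothetical helper)."""
    return [
        sys.executable, "main.py",
        "--dataset", dataset,
        "--method", method,
        "--k", str(k),
        "--n_bits", str(n_bits),
        *extra,
    ]


if __name__ == "__main__":
    for method in ("CGA-dHash", "dHash", "ITQ"):
        # Print each command; pass the list to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(build_command("tcga_brain_kimianet", method)))
```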

Supported Methods

Method	Type	Description
CGA-dHash	Genetic Algorithm	Optimized difference hashing
CGA-DFT	Genetic Algorithm	Optimized DFT-based barcoding
aHash	Traditional	Average hashing
dHash	Traditional	Difference hashing
MinMax	Traditional	Min-max normalization
DFT	Traditional	Discrete Fourier Transform
ITQ	Learning-based	Iterative Quantization
LBP	Learning-based	Local Binary Patterns
LSH	Learning-based	Locality Sensitive Hashing
DHN	Neural Network	Deep Hashing Network
DSH	Neural Network	Deep Supervised Hashing
DTSH	Neural Network	Deep Triplet Supervised Hashing
CSQ	Neural Network	Central Similarity Quantization
DPSH	Neural Network	Deep Pairwise Supervised Hashing
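
To make the two simplest entries in the table concrete, the sketch below applies average hashing and difference hashing to a deep feature vector. It is an illustration under stated assumptions, not the code in src/methods/: the bit conventions (1 when a value exceeds the mean, 1 when the next value is larger) are assumed.

```python
import numpy as np


def ahash(features):
    """Average hashing: bit is 1 where a feature exceeds its vector's mean (assumed convention)."""
    features = np.atleast_2d(np.asarray(features, dtype=float))
    return (features > features.mean(axis=1, keepdims=True)).astype(np.uint8)


def dhash(features):
    """Difference hashing: bit is 1 where the next feature is larger (assumed convention)."""
    features = np.atleast_2d(np.asarray(features, dtype=float))
    return (np.diff(features, axis=1) > 0).astype(np.uint8)
```

Note that dhash yields one bit fewer than the number of features, and its output changes when the features are reordered, whereas ahash bits simply move with their features.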

Supported Datasets

TCGA Datasets (by Organ System)

Brain: tcga_brain_kimianet, tcga_brain_densenet121
Endocrine: tcga_endocrine_kimianet, tcga_endocrine_densenet121
Gastrointestinal: tcga_gastrointestinal_kimianet, tcga_gastrointestinal_densenet121
Gynecologic: tcga_gynecologic_kimianet, tcga_gynecologic_densenet121
Liver: tcga_liver_kimianet, tcga_liver_densenet121
Mesenchymal: tcga_mesenchymal_kimianet, tcga_mesenchymal_densenet121
Pulmonary: tcga_pulmonary_kimianet, tcga_pulmonary_densenet121
Urinary Tract: tcga_urinary_tract_kimianet, tcga_urinary_tract_densenet121

Other Datasets

CIFAR-10: cifar10
CIFAR-100: cifar100
COVID-19: covid19
Fashion-MNIST: fashion

Methodology

Combinatorial Genetic Algorithm (CGA)

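The main contribution of this work is the application of a Combinatorial Genetic Algorithm (CGA) to optimize feature ordering for barcoding methods. Methods such as difference hashing derive bits from comparisons between neighboring values, so the resulting barcode depends on the order of the features. CGA searches over permutations of the feature vector for an ordering that improves retrieval performance.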
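
The idea can be sketched as a small permutation-based genetic algorithm. This is not the implementation in src/methods/cga.py: the operators (binary tournament selection, order crossover, swap mutation, elitist survival) and the fitness function (leave-one-out 1-NN accuracy of dHash barcodes) are assumptions chosen for illustration, and all function names are hypothetical.

```python
import numpy as np


def dhash(features):
    # Bit is 1 where the next feature is larger (assumed convention).
    return (np.diff(features, axis=1) > 0).astype(np.uint8)


def fitness(perm, features, labels):
    """Leave-one-out 1-NN accuracy of dHash barcodes under a feature ordering."""
    codes = dhash(features[:, perm])
    dist = (codes[:, None, :] != codes[None, :, :]).sum(axis=2)  # pairwise Hamming
    np.fill_diagonal(dist, codes.shape[1] + 1)  # never match a sample to itself
    return float((labels[dist.argmin(axis=1)] == labels).mean())


def order_crossover(p1, p2, rng):
    """Copy a slice from p1, fill the remaining positions in p2's order."""
    n = len(p1)
    a, b = sorted(rng.choice(n + 1, size=2, replace=False))
    child = np.full(n, -1)
    child[a:b] = p1[a:b]
    child[child == -1] = p2[~np.isin(p2, p1[a:b])]
    return child


def swap_mutation(perm, rng, rate=0.2):
    """With probability `rate`, swap two randomly chosen positions."""
    perm = perm.copy()
    if rng.random() < rate:
        i, j = rng.choice(len(perm), size=2, replace=False)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def optimize_order(features, labels, pop_size=20, n_gen=30, seed=0):
    rng = np.random.default_rng(seed)
    pop = [rng.permutation(features.shape[1]) for _ in range(pop_size)]
    scores = [fitness(p, features, labels) for p in pop]
    for _ in range(n_gen):
        children = []
        for _ in range(pop_size):
            parents = []
            for _ in range(2):  # binary tournament for each parent
                i, j = rng.choice(pop_size, size=2, replace=False)
                parents.append(pop[i] if scores[i] >= scores[j] else pop[j])
            child = order_crossover(parents[0], parents[1], rng)
            children.append(swap_mutation(child, rng))
        child_scores = [fitness(c, features, labels) for c in children]
        # Elitist survival: keep the best pop_size of parents and children.
        merged, merged_scores = pop + children, scores + child_scores
        keep = np.argsort(merged_scores)[::-1][:pop_size]
        pop = [merged[i] for i in keep]
        scores = [merged_scores[i] for i in keep]
    best = int(np.argmax(scores))
    return pop[best], scores[best]
```

The returned permutation would then be applied to both database and query features (features[:, perm]) before barcoding. In this sketch, pop_size and n_gen play the roles of --cga_pop_size and --cga_n_gen.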

Evaluation Metrics

The code evaluates retrieval performance using three key metrics:

F1-Score: Weighted F1-score based on majority voting of retrieved labels
Precision@k: Precision of the top-k retrieved items
Mean Average Precision (mAP): Mean of the per-query average precision, taken over all queries
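
The three metrics can be sketched on binary codes as follows. This is an illustrative version, not the evaluation code in src/utils.py. It assumes retrieval by Hamming distance, average precision truncated at k and normalized by the number of relevant items in the top-k, and a weighted F1 computed over the classes present in the query labels. The function names are hypothetical.

```python
import numpy as np


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 over the classes in y_true."""
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        # Denominator is positive because class c occurs in y_true.
        total += (2 * tp / (2 * tp + fp + fn)) * np.sum(y_true == c)
    return float(total / len(y_true))


def evaluate(query_codes, query_labels, db_codes, db_labels, k=10):
    k = min(k, len(db_codes))
    # Hamming distance between every query and every database code.
    dist = (query_codes[:, None, :] != db_codes[None, :, :]).sum(axis=2)
    topk = np.argsort(dist, axis=1, kind="stable")[:, :k]
    retrieved = db_labels[topk]                     # labels of the k nearest items
    hits = retrieved == query_labels[:, None]       # relevance of each retrieved item

    precision_at_k = float(hits.mean())

    # Average precision per query: mean precision at the ranks of relevant items.
    prec_at_rank = np.cumsum(hits, axis=1) / np.arange(1, k + 1)
    n_hits = hits.sum(axis=1)
    ap = np.where(n_hits > 0, (prec_at_rank * hits).sum(axis=1) / np.maximum(n_hits, 1), 0.0)

    # Majority vote over retrieved labels gives one predicted label per query.
    preds = []
    for row in retrieved:
        values, counts = np.unique(row, return_counts=True)
        preds.append(values[counts.argmax()])

    return {
        "precision_at_k": precision_at_k,
        "mAP": float(ap.mean()),
        "f1": weighted_f1(query_labels, np.array(preds)),
    }
```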

Code Structure

src/
├── methods/
│   ├── cga.py          # Combinatorial Genetic Algorithm implementation
│   ├── hashingnn.py    # Neural network hashing methods
│   ├── dhash.py        # Difference hashing
│   ├── dft.py          # DFT-based barcoding
│   ├── itq.py          # Iterative Quantization
│   ├── lbp.py          # Local Binary Patterns
│   ├── lsh.py          # Locality Sensitive Hashing
│   ├── ahash.py        # Average hashing
│   └── minmax.py       # Min-max normalization
└── utils.py            # Utility functions for data loading and evaluation

Reproducibility

To ensure reproducibility of results:

Set the random seed using the --seed parameter
Use the same dataset splits (train/validation/test)
Use identical hyperparameters as specified in the paper
Ensure consistent hardware setup for neural network methods
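
For scripts that call the methods directly rather than through main.py, seeding typically has to cover every random number generator in use. The helper below is a sketch of one common approach, not necessarily what --seed does internally. The name set_seed is hypothetical, and the PyTorch part is skipped when torch is not installed.

```python
import random

import numpy as np


def set_seed(seed):
    """Seed Python, NumPy and (if available) PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # the non-neural methods need only the two seeds above
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Trade some speed for repeatable cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Even with fixed seeds, results from the neural network methods may differ slightly across GPU models and library versions, which is why a consistent hardware setup is listed above.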

Citation

If you use this code in your research, please cite our paper:

@article{khosrowshahli2025enhancing,
  title={Enhancing Image Retrieval Through Optimal Barcode Representation},
  author={Khosrowshahli, Rasa and Kheiri, Farnaz and Bidgoli, Azam Asilian and Tizhoosh, H.R. and Makrehchi, Masoud and Rahnamayan, Shahryar},
  journal={Scientific Reports},
  year={2025}
}

Code Availability

This code is made available for peer review and reproducibility purposes as required by publication guidelines. The code allows readers to repeat the published results and is deposited with a DOI for permanent access.

Project Developer: Rasa Khosrowshahli, Faculty of Mathematics and Science, Brock University

License

This project is licensed under the MIT License - see the LICENSE file for details.

Data Availability

The TCGA datasets used in this study are publicly available through the Cancer Genome Atlas Research Network. Processed features and dataset splits used in our experiments can be made available upon reasonable request, subject to data use agreements.

Requirements

See requirements.txt for a complete list of dependencies.

Contributing

This code is provided primarily for reproducibility of published results. For questions or issues, please open an issue on this repository.

Acknowledgments

The Cancer Genome Atlas (TCGA) Research Network for providing the medical image datasets
KimiaNet and DenseNet121 for pre-trained feature extractors
PyTorch and scikit-learn communities for excellent machine learning libraries

Note: This code is provided as-is for research reproducibility. Please ensure you have appropriate computational resources and data access permissions before running experiments.
