Code implementation of the paper "Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels"
Understanding multi-label images remains a challenging task in computer vision. Although vision-language models (VLMs) enable zero-shot recognition without labeled data, they often focus on the most iconic object and miss contextual positives. This bias conflicts with the nature of multi-label learning.
We propose an unsupervised framework that adapts VLMs from iconic recognition to inclusive understanding for label-free multi-label image recognition.
Our method contains two key stages:
- Cutting stage: a multi-sampling response estimator is designed to prevent the model from concentrating on only one dominant object and to expose more positive labels.
- Sewing stage: a multi-object blend adaptation strategy is introduced to refine label distribution toward multi-label consistency while preserving the intrinsic characteristics of the original model. The adaptation is completed within one epoch.
Extensive experiments show that the framework significantly outperforms existing unsupervised methods on four public datasets, and even surpasses several representative weakly supervised baselines.
The codebase is Python/PyTorch based.
conda create -n tailorclip python=3.10 -y
conda activate tailorclip
pip install -r requirements.txt
If your environment cannot import randaugment:
pip install randaugment
# or
pip install git+https://github.com/ildoonet/pytorch-randaugment.git
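Whichever package you install, a quick one-off check that the module the code imports actually resolves (this command is just a convenience, not part of the repo):
python -c "import randaugment"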
Main experiment entry:
- recipe_interpolation.py
Core modules:
- data.py
- losses.py
- label_names.py
- utils.py
- models/
- clip/
- clip_nopooling/
- chinopie/
This release only keeps the main experiment code. Large datasets, checkpoints, logs, and runtime artifacts are not included.
Expected dataset layout:
- deps/data/voc2012
- deps/data/voc2007
- deps/data/coco2014
- deps/data/NUS-WIDE
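If the datasets already live elsewhere on disk, one way to match this layout is to symlink them in; the source paths below are placeholders for wherever your local copies are stored:
mkdir -p deps/data
ln -s /path/to/voc2012 deps/data/voc2012
ln -s /path/to/voc2007 deps/data/voc2007
ln -s /path/to/coco2014 deps/data/coco2014
ln -s /path/to/NUS-WIDE deps/data/NUS-WIDE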
Set the dataset name via an environment variable:
export dataset=voc2012
Run the main pipeline:
python recipe_interpolation.py \
--comment interpolation-main \
--version 1.9.2 \
--num_epoch 30 \
--dev cuda
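For a quick smoke test before a full run, the same documented flags can be reused with a smaller epoch budget (the assumption being that nothing in the pipeline requires the full 30 epochs just to run end to end):
export dataset=voc2007
python recipe_interpolation.py \
--comment smoke-test \
--version 1.9.2 \
--num_epoch 1 \
--dev cuda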
We provide convenient scripts in scripts/.
Training:
dataset=voc2012 ./scripts/train.sh
Validation:
dataset=voc2012 ./scripts/val.sh
Optional environment variables:
- COMMENT (default: interpolation-main)
- VERSION (default: 1.9.2)
- EPOCHS (default: 30, train script only)
- DEV (default: cuda)
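These variables can be combined with the dataset prefix on a single line; the values below are only illustrative and simply restate the defaults, keeping COMMENT, VERSION, and dataset identical between training and validation:
COMMENT=interpolation-main VERSION=1.9.2 EPOCHS=30 DEV=cuda dataset=voc2007 ./scripts/train.sh
COMMENT=interpolation-main VERSION=1.9.2 DEV=cuda dataset=voc2007 ./scripts/val.sh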
Notes:
- Use the same COMMENT, VERSION, and dataset for training and validation.
- Validation loads the best checkpoint from the stage-2 run.
During training, outputs are written to:
- deps/checkpoints/
- deps/boards/
- deps/state/
- opts/
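If deps/boards/ contains TensorBoard event files (an assumption based on the directory name, not something verified here), the training curves can be viewed with:
tensorboard --logdir deps/boards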
(To be added)
This release is built on the local CLIP implementation and the lightweight chinopie training framework.
