CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection
You only need four steps to implement CHIPS.

- Step 0 – Download models & datasets
  - Edit `scripts/step_0_download.sh` to set `MODEL_DIR` and `DATASET_DIR`
  - Then run: `bash scripts/step_0_download.sh`
- Step 1 – Generate pre‑projection embeddings
  - Edit paths in `scripts/step_1_generate_pp_emb.sh` (model path / dataset path / output path)
  - Then run: `bash scripts/step_1_generate_pp_emb.sh`
- Step 2 – Compute & test CHIPS scores
  - `bash scripts/steo_2_compute_score.sh`
- Step 3 – Select data subsets by score
  - `bash scripts/step_3_select_data.sh`
- Step 4 – Train on selected data
  - `bash scripts/step_4_train.sh`

If you prefer to reuse our prepared data / checkpoints instead of recomputing everything, please see our Hugging Face collection.
Please note that in our example we use the BIOMEDICA dataset; you need to agree to its special license before use.
- Hugging Face collection (data & ckpts): https://huggingface.co/collections/Mihara-bot/chips
- Original datasets:
  - https://huggingface.co/datasets/BIOMEDICA/biomedica_webdataset_24M
  - https://huggingface.co/datasets/UCSC-VLAA/MedTrinity-25M
You can directly download:
- Selected subsets
- Model checkpoints
and integrate them into your own pipeline.
We recommend:
- Python ≥ 3.9
- PyTorch ≥ 2.0 (required for `torch.func.jvp`)
- CUDA‑enabled GPUs (CHIPS is designed for multi‑GPU, e.g. 8× GPUs)
Key Python packages (non‑exhaustive):
`torch`, `transformers` (for the CLIP model), `ray`, `pyarrow`, `pandas`, `numpy`, `polars`, `tqdm`
You can create a conda environment and install dependencies (pseudo‑example, adjust versions as needed):
```bash
conda create -n chips python=3.10
conda activate chips
pip install torch transformers "ray[default]" pyarrow polars pandas numpy tqdm
```

Script: scripts/step_0_download.sh
This script uses the Hugging Face CLI (`hf`) to download:

- Model: `facebook/metaclip-b16-400m` → `${MODEL_DIR}/metaclip-b16-400m`
- Training pool: `BIOMEDICA/biomedica_webdataset_24M` → `${DATASET_DIR}/biomedica_webdataset_24M`
- Evaluation sets (examples): `clip-benchmark/wds_vtab-diabetic_retinopathy`, `Alejandro98/LC25000`
Before running, make sure:
- You have installed the Hugging Face CLI: `pip install -U "huggingface_hub[cli]"`
- You have run `huggingface-cli login`
- You have edited `MODEL_DIR="YOUR_MODEL_DIR"` and `DATASET_DIR="YOUR_DATASET_DIR"`
Then:

```bash
bash scripts/step_0_download.sh
```

Script: scripts/step_1_generate_pp_emb.sh
Core implementation: `src/step_0/compute_pp_embeddings.py`, `src/step_0/merge_pp_embeddings.py`
We first compute pre‑projection embeddings (pp_embeddings) for the whole training pool and the medical evaluation sets. In CHIPS, we only need:
- CLIP visual backbone output (CLS)
- CLIP text transformer output (EOT)
Key variables in `step_1_generate_pp_emb.sh`:

```bash
MODEL_NAME=metaclip-b16-400m
MODEL_PATH=models/metaclip-b16-400m
TEXT_EMBED_DIM=512
IMAGE_EMBED_DIM=768
BIOMEDICA_PATH=datasets/biomedica_webdataset_24M
EMBEDDING_SAVE_PATH=embeddings/biomedica_pp_embeddings
```
You may need to edit these paths to match your local layout.
The script:

- Calls `src/step_0/compute_pp_embeddings.py` on `other` / `noncommercial` / `commercial_part1` … `commercial_part6`
- Merges all parts using `src/step_0/merge_pp_embeddings.py` into `train_final/merged.parquet`
- Computes embeddings for evaluation sets (e.g. `eval_50/eval.parquet`)
You can control batch size, workers, etc. via `embedding_generation_config.json`.
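For the exact schema, see the config file in the repo; as a purely hypothetical illustration (these field names are not taken from the repo), such a config might look like:

```json
{
  "batch_size": 1024,
  "num_workers": 8,
  "device": "cuda"
}
```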
Run:

```bash
bash scripts/step_1_generate_pp_emb.sh
```

After this step you should have (by default):

- `embeddings/biomedica_pp_embeddings/metaclip-b16-400m/train_final/merged.parquet`
- `embeddings/biomedica_pp_embeddings/metaclip-b16-400m/eval_50/eval.parquet`
Script: scripts/steo_2_compute_score.sh
Core implementation: `src/select/compute_pp_score_jvp.py`, `src/select/test_score.py`
Note: the file is named `steo_2_compute_score.sh` (typo) in this repo; it is the score computation step.
This step:

- Computes influence / TRAK‑style scores (`trak_score`) for the training pool using pre‑projection embeddings.
- Validates & merges score shards into a single Parquet file.
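The repository computes these scores with JVPs over the CLIP head (`compute_pp_score_jvp.py`). As a stylized, dependency‑free sketch of the underlying TRAK idea only (inner products between randomly projected train and eval gradient features; the function names here are illustrative, not the repo's API):

```python
import math
import random

def project(vec, proj):
    # Random projection: reduce a high-dimensional gradient feature
    # to len(proj) dimensions (proj is a list of random row vectors).
    return [sum(p * v for p, v in zip(row, vec)) for row in proj]

def trak_style_score(train_feat, eval_feats, proj):
    # Influence of one training example on an eval set: the mean inner
    # product between the projected train feature and each projected
    # eval feature. Higher scores mark training points more aligned
    # with the eval distribution.
    t = project(train_feat, proj)
    dots = [sum(a * b for a, b in zip(t, project(e, proj))) for e in eval_feats]
    return sum(dots) / len(dots)

random.seed(0)
dim, proj_dim = 16, 4
# Gaussian projection matrix, scaled so projected norms are roughly preserved.
proj = [[random.gauss(0, 1 / math.sqrt(proj_dim)) for _ in range(dim)]
        for _ in range(proj_dim)]

aligned = [1.0] * dim
eval_feats = [[1.0] * dim, [0.9] * dim]
score = trak_style_score(aligned, eval_feats, proj)
```

A training feature identical to an eval feature scores its own squared projected norm, which is why self-similar samples rank highest.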
Important configuration inside the script:
- Pre‑computed embeddings:
  - `EMBEDDING_TRAIN_FILE="$EMBEDDING_SAVE_PATH/$MODEL_NAME/train_final/merged.parquet"`
  - `EMBEDDING_EVAL_FILE="$EMBEDDING_SAVE_PATH/$MODEL_NAME/eval_50/eval.parquet"`
- CHIPS score output:
  - `SCORE_BASE_PATH="$BASE_PATH/inf_scores/$HEAD"`
  - `SELECTED_BASE_PATH="$BASE_PATH/data/biomedica_selected/$HEAD"`
- `COMMON_COMPUTE_ARGS` (for `compute_pp_score_jvp.py`): `--score_type trak`, `--alpha 1`, `--train_path`, `--eval_path`, `--model_name` (path to CLIP model), `--grad_mode all`, `--normalize`, `--projection_dim 4096`, `--num_gpus 8`, `--object_store_memory_gb 800`
The script runs:

```bash
python src/select/compute_pp_score_jvp.py \
    "${COMMON_COMPUTE_ARGS[@]}" \
    --use_margin \
    --beta 0.5 \
    --gamma 0.5 \
    --compute_batch_size 32768 \
    --output_path "$SCORE_BASE_PATH/chips"
```

Then, it calls `src/select/test_score.py` to:

- Load all score shards
- Validate the schema
- Deduplicate and merge on `image_path`
- Save final scores to `"$SCORE_BASE_PATH/chips/final/final_chips.parquet"`
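The shard merge and deduplication step can be sketched as follows. This is an illustration only: the real pipeline operates on Parquet files (e.g. via `polars`/`pyarrow`); plain dicts are used here to keep the sketch dependency‑free, and `merge_score_shards` is a hypothetical name.

```python
def merge_score_shards(shards):
    # Merge per-worker score shards (lists of row dicts), keeping the
    # first row seen for each image_path so duplicates across shards
    # collapse to a single entry.
    merged = {}
    for shard in shards:
        for row in shard:
            merged.setdefault(row["image_path"], row)
    return list(merged.values())

shards = [
    [{"image_path": "a.jpg", "trak_score": 0.9},
     {"image_path": "b.jpg", "trak_score": 0.1}],
    [{"image_path": "a.jpg", "trak_score": 0.9},   # duplicate across shards
     {"image_path": "c.jpg", "trak_score": 0.5}],
]
final = merge_score_shards(shards)
```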
Run:

```bash
bash scripts/steo_2_compute_score.sh
```

After this step you should have (by default):

- `inf_scores/<HEAD>/chips/final/final_chips.parquet` (contains `trak_score` per sample)
Script: scripts/step_3_select_data.sh
Core implementation: `src/select/select_score.py`
`step_3_select_data.sh` uses the final score file to select the top‑k% of data from the original WebDataset training pool and export it into new WebDataset shards.
Key pieces in the script:
- Training pool layout:

  ```bash
  BASE_PATH="CHIPS_BASE_PATH"
  DATA_DIR="$BASE_PATH/data/biomedica_webdataset_24M"
  TRAIN_SET=(
      "$DATA_DIR/commercial_part"{1..6}
      "$DATA_DIR/noncommercial"
      "$DATA_DIR/other"
  )
  ```

- Final score file: `score_file="$SCORE_BASE_PATH/$score_type/final/final_${score_type}.parquet"`
- For each keeping ratio (10%, 20%, 30%), the script runs:
```bash
python src/select/select_score.py \
    --webdataset-input "${TRAIN_SET[@]}" \
    --score-column "trak_score" \
    --scores "$score_file" \
    --output "$output_dir" \
    --keeping-ratio "$ratio_decimal" \
    --num-workers 64
```

This internally:

- Loads the scores Parquet and selects the top‑k samples by `trak_score`
- Scans the WebDataset `.tar` files and extracts the selected keys
- Writes a new WebDataset under `"$SELECTED_BASE_PATH/${score_type}score_${ratio}"`
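The core of the selection step (keep the highest-scoring fraction of samples) can be sketched as below. `select_top_ratio` is an illustrative stand‑in, not the repo's function: the real `select_score.py` additionally rewrites the matching WebDataset `.tar` shards.

```python
import math

def select_top_ratio(scores, keeping_ratio):
    # scores maps sample key -> trak_score; return the keys of the
    # highest-scoring keeping_ratio fraction (rounded up).
    k = math.ceil(len(scores) * keeping_ratio)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

scores = {"s0": 0.12, "s1": 0.95, "s2": -0.3, "s3": 0.5, "s4": 0.7}
kept = select_top_ratio(scores, 0.4)  # keep the top 40%
```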
Run:

```bash
bash scripts/step_3_select_data.sh
```

After this step you should have (by default):

- `data/biomedica_selected/<HEAD>/chipsscore_10/`
- `data/biomedica_selected/<HEAD>/chipsscore_20/`
- `data/biomedica_selected/<HEAD>/chipsscore_30/`
Each directory is a WebDataset subset ready for training.
Script: scripts/step_4_train.sh
You can plug in your own training code here. A typical usage pattern is:
- Choose one subset, e.g. `chipsscore_10` (10% of the original pool).
- Use these WebDataset shards as training data for CLIP fine‑tuning / continual pre‑training.
- Evaluate on your downstream tasks / benchmarks.
If you find this repository useful, please cite the CHIPS paper:
@misc{zhuang2025chipsefficientclipadaptation,
title={CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection},
author={Xinlin Zhuang and Yichen Li and Xiwei Liu and Haolin Yang and Yifan Lu and Ziyun Zou and Yulong Li and Huifa Li and Dongliang Chen and Qinglei Wang and Weiyang Liu and Ying Qian and Jiangming Shi and Imran Razzak},
year={2025},
eprint={2511.18519},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.18519},
}