mihara-bot/CHIPS

CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection


Quick Start

You only need five steps (Step 0 through Step 4) to run CHIPS.

  • Step 0 – Download models & datasets
    • Edit scripts/step_0_download.sh to set MODEL_DIR and DATASET_DIR
    • Then run:
bash scripts/step_0_download.sh
  • Step 1 – Generate pre‑projection embeddings
    • Edit paths in scripts/step_1_generate_pp_emb.sh (model path / dataset path / output path)
    • Then run:
bash scripts/step_1_generate_pp_emb.sh
  • Step 2 – Compute & test CHIPS scores
bash scripts/steo_2_compute_score.sh  # "steo_" typo is the actual file name in the repo
  • Step 3 – Select data subsets by score
bash scripts/step_3_select_data.sh
  • Step 4 – Train on selected data
bash scripts/step_4_train.sh

Using Released Data / Model Checkpoints

If you prefer to reuse our prepared data / checkpoints instead of recomputing everything, please see our Hugging Face collection.

Please note that our example uses the BIOMEDICA dataset, whose special license you must agree to before use.

  • Hugging Face collection (data & ckpts):
    https://huggingface.co/collections/Mihara-bot/chips

  • Original Datasets

    • https://huggingface.co/datasets/BIOMEDICA/biomedica_webdataset_24M
    • https://huggingface.co/datasets/UCSC-VLAA/MedTrinity-25M

You can directly download:

  • Selected subsets
  • Model checkpoints

and integrate them into your own pipeline.


Environment & Dependencies

We recommend:

  • Python ≥ 3.9
  • PyTorch ≥ 2.0 (required for torch.func.jvp)
  • CUDA‑enabled GPUs (CHIPS is designed for multi‑GPU, e.g. 8× GPUs)
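
The PyTorch ≥ 2.0 requirement comes from torch.func.jvp, the forward-mode Jacobian–vector product API that the score computation relies on. A minimal sanity check of the call, unrelated to CHIPS itself:

```python
import torch
from torch.func import jvp

# f(x) = sum(x**2); its JVP along a tangent v is grad(f)·v = (2x)·v
def f(x):
    return (x ** 2).sum()

x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, 0.0, 0.0])
out, tangent = jvp(f, (x,), (v,))
# out = f(x) = 14.0, tangent = 2*1*1 = 2.0
```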

Key Python packages (non‑exhaustive):

  • torch, transformers (for CLIP model)
  • ray, pyarrow, pandas, numpy, polars, tqdm

You can create a conda environment and install dependencies (pseudo‑example, adjust versions as needed):

conda create -n chips python=3.10
conda activate chips
pip install torch transformers ray[default] pyarrow polars pandas numpy tqdm

Step 0 – Download Models & Datasets

Script: scripts/step_0_download.sh

This script uses the Hugging Face CLI (hf) to download:

  • Model
    • facebook/metaclip-b16-400m → ${MODEL_DIR}/metaclip-b16-400m
  • Training pool
    • BIOMEDICA/biomedica_webdataset_24M → ${DATASET_DIR}/biomedica_webdataset_24M
  • Evaluation sets (examples)
    • clip-benchmark/wds_vtab-diabetic_retinopathy
    • Alejandro98/LC25000

Before running, make sure:

  • You have installed the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
  • You have run huggingface-cli login
  • You have edited:
    • MODEL_DIR="YOUR_MODEL_DIR"
    • DATASET_DIR="YOUR_DATASET_DIR"

Then:

bash scripts/step_0_download.sh

Step 1 – Generate Pre‑Projection Embeddings

Script: scripts/step_1_generate_pp_emb.sh
Core implementation: src/step_0/compute_pp_embeddings.py, src/step_0/merge_pp_embeddings.py

We first compute pre‑projection embeddings (pp_embeddings) for the whole training pool and the medical evaluation sets. In CHIPS, we only need:

  • CLIP visual backbone output (CLS)
  • CLIP text transformer output (EOT)

Key variables in step_1_generate_pp_emb.sh:

  • MODEL_NAME=metaclip-b16-400m
  • MODEL_PATH=models/metaclip-b16-400m
  • TEXT_EMBED_DIM=512
  • IMAGE_EMBED_DIM=768
  • BIOMEDICA_PATH=datasets/biomedica_webdataset_24M
  • EMBEDDING_SAVE_PATH=embeddings/biomedica_pp_embeddings

You may need to edit these paths to match your local layout.

The script:

  1. Calls src/step_0/compute_pp_embeddings.py on
    • other/
    • noncommercial/
    • commercial_part1 through commercial_part6
  2. Merges all parts using src/step_0/merge_pp_embeddings.py into
    • train_final/merged.parquet
  3. Computes embeddings for evaluation sets (e.g. eval_50/eval.parquet)

You can control batch size, workers, etc. via embedding_generation_config.json.

Run:

bash scripts/step_1_generate_pp_emb.sh

After this step you should have (by default):

  • embeddings/biomedica_pp_embeddings/metaclip-b16-400m/train_final/merged.parquet
  • embeddings/biomedica_pp_embeddings/metaclip-b16-400m/eval_50/eval.parquet

Step 2 – Compute & Test CHIPS Scores

Script: scripts/steo_2_compute_score.sh
Core implementation: src/select/compute_pp_score_jvp.py, src/select/test_score.py

Note: The file name is steo_2_compute_score.sh (typo) in this repo; it is the score computation step.

This step:

  1. Computes influence / TRAK‑style scores (trak_score) for the training pool using pre‑projection embeddings.
  2. Validates & merges score shards into a single Parquet file.
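
At its core, a TRAK-style score is a similarity between randomly projected per-sample training gradients and eval-set gradients. A toy numpy sketch of that idea only; the actual compute_pp_score_jvp.py computes gradient features via JVPs on the CLIP loss and distributes the work across GPUs with Ray:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_eval, grad_dim, proj_dim = 1000, 50, 4096, 256

# Stand-ins for per-sample gradient features (train pool and eval set)
g_train = rng.standard_normal((n_train, grad_dim))
g_eval = rng.standard_normal((n_eval, grad_dim))

# Random projection to proj_dim (cf. --projection_dim), then L2-normalize (cf. --normalize)
P = rng.standard_normal((grad_dim, proj_dim)) / np.sqrt(proj_dim)
pt = g_train @ P
pe = g_eval @ P
pt /= np.linalg.norm(pt, axis=1, keepdims=True)
pe /= np.linalg.norm(pe, axis=1, keepdims=True)

# Score of each training sample = mean similarity to the eval-set gradients
trak_score = (pt @ pe.T).mean(axis=1)  # shape (n_train,)
```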

Important configuration inside the script:

  • Pre‑computed embeddings:
    • EMBEDDING_TRAIN_FILE="$EMBEDDING_SAVE_PATH/$MODEL_NAME/train_final/merged.parquet"
    • EMBEDDING_EVAL_FILE="$EMBEDDING_SAVE_PATH/$MODEL_NAME/eval_50/eval.parquet"
  • CHIPS score output:
    • SCORE_BASE_PATH="$BASE_PATH/inf_scores/$HEAD"
    • SELECTED_BASE_PATH="$BASE_PATH/data/biomedica_selected/$HEAD"
  • COMMON_COMPUTE_ARGS (for compute_pp_score_jvp.py):
    • --score_type trak
    • --alpha 1
    • --train_path, --eval_path
    • --model_name (path to CLIP model)
    • --grad_mode all
    • --normalize
    • --projection_dim 4096
    • --num_gpus 8
    • --object_store_memory_gb 800

The script runs:

python src/select/compute_pp_score_jvp.py \
    "${COMMON_COMPUTE_ARGS[@]}" \
    --use_margin \
    --beta 0.5 \
    --gamma 0.5 \
    --compute_batch_size 32768 \
    --output_path "$SCORE_BASE_PATH/chips"

Then, it calls src/select/test_score.py to:

  • Load all score shards
  • Validate schema
  • Deduplicate and merge on image_path
  • Save final scores to:
    "$SCORE_BASE_PATH/chips/final/final_chips.parquet"
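
The shard validation and merge that test_score.py performs can be sketched with pandas as follows (shard contents hypothetical; only the image_path / trak_score columns are taken from the script's description):

```python
import pandas as pd

# Hypothetical score shards, as they would be loaded from $SCORE_BASE_PATH/chips
shard_a = pd.DataFrame({"image_path": ["a.jpg", "b.jpg"], "trak_score": [0.9, 0.1]})
shard_b = pd.DataFrame({"image_path": ["b.jpg", "c.jpg"], "trak_score": [0.1, 0.5]})

merged = pd.concat([shard_a, shard_b], ignore_index=True)
assert {"image_path", "trak_score"} <= set(merged.columns)  # schema check

# Deduplicate on image_path, keeping the first occurrence
final = merged.drop_duplicates(subset="image_path").reset_index(drop=True)
# final.to_parquet(...)  # written to the final_chips.parquet path above
```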

Run:

bash scripts/steo_2_compute_score.sh

After this step you should have (by default):

  • inf_scores/<HEAD>/chips/final/final_chips.parquet (contains trak_score per sample)

Step 3 – Select Data by CHIPS Score

Script: scripts/step_3_select_data.sh
Core implementation: src/select/select_score.py

step_3_select_data.sh uses the final score file to select top‑k% data from the original WebDataset training pool and export them into new WebDataset shards.

Key pieces in the script:

  • Training pool layout:
BASE_PATH="CHIPS_BASE_PATH"
DATA_DIR="$BASE_PATH/data/biomedica_webdataset_24M"

TRAIN_SET=(
    "$DATA_DIR/commercial_part"{1..6}
    "$DATA_DIR/noncommercial"
    "$DATA_DIR/other"
)
  • Final score file:
score_file="$SCORE_BASE_PATH/$score_type/final/final_${score_type}.parquet"
  • For each keeping ratio (10%, 20%, 30%), the script runs:
python src/select/select_score.py \
    --webdataset-input "${TRAIN_SET[@]}" \
    --score-column "trak_score" \
    --scores "$score_file" \
    --output "$output_dir" \
    --keeping-ratio "$ratio_decimal" \
    --num-workers 64

This internally:

  • Loads the scores parquet, selects top‑k samples by trak_score
  • Scans the WebDataset .tar files and extracts selected keys
  • Writes a new WebDataset under "$SELECTED_BASE_PATH/${score_type}score_${ratio}"
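
The selection itself reduces to sorting by trak_score and keeping the top fraction of keys. A minimal sketch of that core logic (the real select_score.py additionally rescans and rewrites the WebDataset .tar shards):

```python
import numpy as np

def select_top_keys(keys, scores, keeping_ratio):
    """Return the keys of the top `keeping_ratio` fraction of samples by score."""
    k = max(1, int(len(keys) * keeping_ratio))
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return [keys[i] for i in order[:k]]

keys = ["s0", "s1", "s2", "s3", "s4"]
scores = np.array([0.2, 0.9, 0.1, 0.7, 0.5])
selected = select_top_keys(keys, scores, keeping_ratio=0.4)  # keep 40% -> 2 samples
# selected == ["s1", "s3"]
```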

Run:

bash scripts/step_3_select_data.sh

After this step you should have (by default):

  • data/biomedica_selected/<HEAD>/chipsscore_10/
  • data/biomedica_selected/<HEAD>/chipsscore_20/
  • data/biomedica_selected/<HEAD>/chipsscore_30/

Each directory is a WebDataset subset ready for training.


Step 4 – Train on Selected Subsets

Script: scripts/step_4_train.sh

You can plug in your own training code here. A typical usage pattern is:

  1. Choose one subset, e.g. chipsscore_10 (10% of original pool).
  2. Use these WebDataset shards as training data for CLIP fine‑tuning / continual pre‑training.
  3. Evaluate on your downstream tasks / benchmarks.

Citation

If you find this repository useful, please cite the CHIPS paper:

@misc{zhuang2025chipsefficientclipadaptation,
      title={CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection}, 
      author={Xinlin Zhuang and Yichen Li and Xiwei Liu and Haolin Yang and Yifan Lu and Ziyun Zou and Yulong Li and Huifa Li and Dongliang Chen and Qinglei Wang and Weiyang Liu and Ying Qian and Jiangming Shi and Imran Razzak},
      year={2025},
      eprint={2511.18519},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.18519}, 
}
