mihara-bot/CHIPS

CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection


Quick Start

You only need five steps (Step 0 through Step 4) to run CHIPS.

  • Step 0 – Download models & datasets
    • Edit scripts/step_0_download.sh to set MODEL_DIR and DATASET_DIR
    • Then run:
bash scripts/step_0_download.sh
  • Step 1 – Generate pre‑projection embeddings
    • Edit paths in scripts/step_1_generate_pp_emb.sh (model path / dataset path / output path)
    • Then run:
bash scripts/step_1_generate_pp_emb.sh
  • Step 2 – Compute & test CHIPS scores
bash scripts/steo_2_compute_score.sh  # "steo_" typo is the actual file name in the repo
  • Step 3 – Select data subsets by score
bash scripts/step_3_select_data.sh
  • Step 4 – Train on selected data
bash scripts/step_4_train.sh

Using Released Data / Model Checkpoints

If you prefer to reuse our prepared data / checkpoints instead of recomputing everything, please see our Hugging Face collection.

Please note that our example uses the BIOMEDICA dataset, whose special license you must agree to before use.

  • Hugging Face collection (data & ckpts):
    https://huggingface.co/collections/Mihara-bot/chips

  • Original Datasets

    • https://huggingface.co/datasets/BIOMEDICA/biomedica_webdataset_24M
    • https://huggingface.co/datasets/UCSC-VLAA/MedTrinity-25M

You can directly download:

  • Selected subsets
  • Model checkpoints

and integrate them into your own pipeline.


Environment & Dependencies

We recommend:

  • Python ≥ 3.9
  • PyTorch ≥ 2.0 (required for torch.func.jvp)
  • CUDA‑enabled GPUs (CHIPS is designed for multi‑GPU, e.g. 8× GPUs)
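
The PyTorch ≥ 2.0 requirement comes from torch.func.jvp, the forward-mode Jacobian–vector product API that the score computation relies on. A minimal sanity check of the call, unrelated to CHIPS itself:

```python
import torch
from torch.func import jvp

# f(x) = sum(x**2); its JVP along a tangent v is grad(f)·v = (2x)·v
def f(x):
    return (x ** 2).sum()

x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, 0.0, 0.0])
out, tangent = jvp(f, (x,), (v,))
# out = f(x) = 14.0, tangent = 2*1*1 = 2.0
```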

Key Python packages (non‑exhaustive):

  • torch, transformers (for CLIP model)
  • ray, pyarrow, pandas, numpy, polars, tqdm

You can create a conda environment and install dependencies (pseudo‑example, adjust versions as needed):

conda create -n chips python=3.10
conda activate chips
pip install torch transformers ray[default] pyarrow polars pandas numpy tqdm

Step 0 – Download Models & Datasets

Script: scripts/step_0_download.sh

This script uses the Hugging Face CLI (hf) to download:

  • Model
    • facebook/metaclip-b16-400m → ${MODEL_DIR}/metaclip-b16-400m
  • Training pool
    • BIOMEDICA/biomedica_webdataset_24M → ${DATASET_DIR}/biomedica_webdataset_24M
  • Evaluation sets (examples)
    • clip-benchmark/wds_vtab-diabetic_retinopathy
    • Alejandro98/LC25000

Before running, make sure:

  • You have installed the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
  • You have run huggingface-cli login
  • You have edited:
    • MODEL_DIR="YOUR_MODEL_DIR"
    • DATASET_DIR="YOUR_DATASET_DIR"

Then:

bash scripts/step_0_download.sh

Step 1 – Generate Pre‑Projection Embeddings

Script: scripts/step_1_generate_pp_emb.sh
Core implementation: src/step_0/compute_pp_embeddings.py, src/step_0/merge_pp_embeddings.py

We first compute pre‑projection embeddings (pp_embeddings) for the whole training pool and the medical evaluation sets. In CHIPS, we only need:

  • CLIP visual backbone output (CLS)
  • CLIP text transformer output (EOT)

Key variables in step_1_generate_pp_emb.sh:

  • MODEL_NAME=metaclip-b16-400m
  • MODEL_PATH=models/metaclip-b16-400m
  • TEXT_EMBED_DIM=512
  • IMAGE_EMBED_DIM=768
  • BIOMEDICA_PATH=datasets/biomedica_webdataset_24M
  • EMBEDDING_SAVE_PATH=embeddings/biomedica_pp_embeddings

You may need to edit these paths to match your local layout.

The script:

  1. Calls src/step_0/compute_pp_embeddings.py on
    • other/
    • noncommercial/
    • commercial_part1 through commercial_part6
  2. Merges all parts using src/step_0/merge_pp_embeddings.py into
    • train_final/merged.parquet
  3. Computes embeddings for evaluation sets (e.g. eval_50/eval.parquet)

You can control batch size, workers, etc. via embedding_generation_config.json.

Run:

bash scripts/step_1_generate_pp_emb.sh

After this step you should have (by default):

  • embeddings/biomedica_pp_embeddings/metaclip-b16-400m/train_final/merged.parquet
  • embeddings/biomedica_pp_embeddings/metaclip-b16-400m/eval_50/eval.parquet

Step 2 – Compute & Test CHIPS Scores

Script: scripts/steo_2_compute_score.sh
Core implementation: src/select/compute_pp_score_jvp.py, src/select/test_score.py

Note: The file name is steo_2_compute_score.sh (typo) in this repo; it is the score computation step.

This step:

  1. Computes influence / TRAK‑style scores (trak_score) for the training pool using pre‑projection embeddings.
  2. Validates & merges score shards into a single Parquet file.
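
At its core, a TRAK-style score is a similarity between randomly projected per-sample training gradients and eval-set gradients. A toy numpy sketch of that idea only; the actual compute_pp_score_jvp.py computes gradient features via JVPs on the CLIP loss and distributes the work across GPUs with Ray:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_eval, grad_dim, proj_dim = 1000, 50, 4096, 256

# Stand-ins for per-sample gradient features (train pool and eval set)
g_train = rng.standard_normal((n_train, grad_dim))
g_eval = rng.standard_normal((n_eval, grad_dim))

# Random projection to proj_dim (cf. --projection_dim), then L2-normalize (cf. --normalize)
P = rng.standard_normal((grad_dim, proj_dim)) / np.sqrt(proj_dim)
pt = g_train @ P
pe = g_eval @ P
pt /= np.linalg.norm(pt, axis=1, keepdims=True)
pe /= np.linalg.norm(pe, axis=1, keepdims=True)

# Score of each training sample = mean similarity to the eval-set gradients
trak_score = (pt @ pe.T).mean(axis=1)  # shape (n_train,)
```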

Important configuration inside the script:

  • Pre‑computed embeddings:
    • EMBEDDING_TRAIN_FILE="$EMBEDDING_SAVE_PATH/$MODEL_NAME/train_final/merged.parquet"
    • EMBEDDING_EVAL_FILE="$EMBEDDING_SAVE_PATH/$MODEL_NAME/eval_50/eval.parquet"
  • CHIPS score output:
    • SCORE_BASE_PATH="$BASE_PATH/inf_scores/$HEAD"
    • SELECTED_BASE_PATH="$BASE_PATH/data/biomedica_selected/$HEAD"
  • COMMON_COMPUTE_ARGS (for compute_pp_score_jvp.py):
    • --score_type trak
    • --alpha 1
    • --train_path, --eval_path
    • --model_name (path to CLIP model)
    • --grad_mode all
    • --normalize
    • --projection_dim 4096
    • --num_gpus 8
    • --object_store_memory_gb 800

The script runs:

python src/select/compute_pp_score_jvp.py \
    "${COMMON_COMPUTE_ARGS[@]}" \
    --use_margin \
    --beta 0.5 \
    --gamma 0.5 \
    --compute_batch_size 32768 \
    --output_path "$SCORE_BASE_PATH/chips"

Then, it calls src/select/test_score.py to:

  • Load all score shards
  • Validate schema
  • Deduplicate and merge on image_path
  • Save final scores to:
    "$SCORE_BASE_PATH/chips/final/final_chips.parquet"
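
The shard validation and merge that test_score.py performs can be sketched with pandas as follows (shard contents hypothetical; only the image_path / trak_score columns are taken from the script's description):

```python
import pandas as pd

# Hypothetical score shards, as they would be loaded from $SCORE_BASE_PATH/chips
shard_a = pd.DataFrame({"image_path": ["a.jpg", "b.jpg"], "trak_score": [0.9, 0.1]})
shard_b = pd.DataFrame({"image_path": ["b.jpg", "c.jpg"], "trak_score": [0.1, 0.5]})

merged = pd.concat([shard_a, shard_b], ignore_index=True)
assert {"image_path", "trak_score"} <= set(merged.columns)  # schema check

# Deduplicate on image_path, keeping the first occurrence
final = merged.drop_duplicates(subset="image_path").reset_index(drop=True)
# final.to_parquet(...)  # written to the final_chips.parquet path above
```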

Run:

bash scripts/steo_2_compute_score.sh

After this step you should have (by default):

  • inf_scores/<HEAD>/chips/final/final_chips.parquet (contains trak_score per sample)

Step 3 – Select Data by CHIPS Score

Script: scripts/step_3_select_data.sh
Core implementation: src/select/select_score.py

step_3_select_data.sh uses the final score file to select top‑k% data from the original WebDataset training pool and export them into new WebDataset shards.

Key pieces in the script:

  • Training pool layout:
BASE_PATH="CHIPS_BASE_PATH"
DATA_DIR="$BASE_PATH/data/biomedica_webdataset_24M"

TRAIN_SET=(
    "$DATA_DIR/commercial_part"{1..6}
    "$DATA_DIR/noncommercial"
    "$DATA_DIR/other"
)
  • Final score file:
score_file="$SCORE_BASE_PATH/$score_type/final/final_${score_type}.parquet"
  • For each keeping ratio (10%, 20%, 30%), the script runs:
python src/select/select_score.py \
    --webdataset-input "${TRAIN_SET[@]}" \
    --score-column "trak_score" \
    --scores "$score_file" \
    --output "$output_dir" \
    --keeping-ratio "$ratio_decimal" \
    --num-workers 64

This internally:

  • Loads the scores parquet, selects top‑k samples by trak_score
  • Scans the WebDataset .tar files and extracts selected keys
  • Writes a new WebDataset under "$SELECTED_BASE_PATH/${score_type}score_${ratio}"
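
The selection itself reduces to sorting by trak_score and keeping the top fraction of keys. A minimal sketch of that core logic (the real select_score.py additionally rescans and rewrites the WebDataset .tar shards):

```python
import numpy as np

def select_top_keys(keys, scores, keeping_ratio):
    """Return the keys of the top `keeping_ratio` fraction of samples by score."""
    k = max(1, int(len(keys) * keeping_ratio))
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return [keys[i] for i in order[:k]]

keys = ["s0", "s1", "s2", "s3", "s4"]
scores = np.array([0.2, 0.9, 0.1, 0.7, 0.5])
selected = select_top_keys(keys, scores, keeping_ratio=0.4)  # keep 40% -> 2 samples
# selected == ["s1", "s3"]
```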

Run:

bash scripts/step_3_select_data.sh

After this step you should have (by default):

  • data/biomedica_selected/<HEAD>/chipsscore_10/
  • data/biomedica_selected/<HEAD>/chipsscore_20/
  • data/biomedica_selected/<HEAD>/chipsscore_30/

Each directory is a WebDataset subset ready for training.


Step 4 – Train on Selected Subsets

Script: scripts/step_4_train.sh

You can plug in your own training code here. A typical usage pattern is:

  1. Choose one subset, e.g. chipsscore_10 (10% of original pool).
  2. Use these WebDataset shards as training data for CLIP fine‑tuning / continual pre‑training.
  3. Evaluate on your downstream tasks / benchmarks.

Citation

If you find this repository useful, please cite the CHIPS paper:

@misc{zhuang2025chipsefficientclipadaptation,
      title={CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection}, 
      author={Xinlin Zhuang and Yichen Li and Xiwei Liu and Haolin Yang and Yifan Lu and Ziyun Zou and Yulong Li and Huifa Li and Dongliang Chen and Qinglei Wang and Weiyang Liu and Ying Qian and Jiangming Shi and Imran Razzak},
      year={2025},
      eprint={2511.18519},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.18519}, 
}
