CycDiff-DPO uses preference-aligned diffusion to design target-specific macrocyclic peptides with enhanced membrane permeability, balancing binding competence and cell permeability in one framework.
- Setup
- Quick Start
- Training from Scratch
- Preference Pairs Construction
- Permeability Predictor Training
conda env create -f env.yaml
conda activate CycDiff_DPO
Note: This environment uses Python 3.9.19.
Download datasets and pre-trained models from Zenodo:
# URLs are quoted so the shell does not treat "?" as a glob character
# Download checkpoints
wget "https://zenodo.org/records/19429073/files/ckpts.tar.gz?download=1" -O ./ckpts.tar.gz
tar -xzf ./ckpts.tar.gz && rm ./ckpts.tar.gz
# Download train_valid dataset (for training DPO)
mkdir -p ./datasets
wget "https://zenodo.org/records/19429955/files/train_valid.tar.gz?download=1" -O ./datasets/train_valid.tar.gz
tar -xzf ./datasets/train_valid.tar.gz -C ./datasets/ && rm ./datasets/train_valid.tar.gz
# Download LNR_CPSea dataset
wget "https://zenodo.org/records/19429073/files/LNR_CPSea.tar.gz?download=1" -O ./datasets/LNR_CPSea.tar.gz
tar -xzf ./datasets/LNR_CPSea.tar.gz -C ./datasets/ && rm ./datasets/LNR_CPSea.tar.gz
# Download SciBERT model
wget "https://zenodo.org/records/19429073/files/scibert_model.tar.gz?download=1" -O ./scibert_model.tar.gz
tar -xzf ./scibert_model.tar.gz && rm ./scibert_model.tar.gz
The following weights are included in this repository:
| File | Description |
|---|---|
| ./ckpts/base_model.ckpt | Base model from CP-Composer |
| ./ckpts/autoencoder.pth | Pre-trained full-atom autoencoder (from PepGLAD) |
| ./ckpts/dpo/epoch44_step513090.ckpt | DPO-fine-tuned model (final checkpoint) |
| ./ckpts/xgboost_ensemble/ | XGBoost ensemble for membrane permeability prediction |
| ./datasets/train_valid/generated_pairs.pkl | Pre-generated DPO preference pairs |
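Before running the pipeline, it can help to confirm that the downloads unpacked where the listing above expects them. The following is a minimal sketch, not part of the repository: the function name `missing_artifacts` and the constant `EXPECTED_ARTIFACTS` are hypothetical, and the paths are simply copied from the listing above.

```python
from pathlib import Path

# Paths copied from the file listing above; adjust if your layout differs.
EXPECTED_ARTIFACTS = [
    "ckpts/base_model.ckpt",
    "ckpts/autoencoder.pth",
    "ckpts/dpo/epoch44_step513090.ckpt",
    "ckpts/xgboost_ensemble",
    "datasets/train_valid/generated_pairs.pkl",
]


def missing_artifacts(root, relative_paths=EXPECTED_ARTIFACTS):
    """Return the entries of relative_paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_artifacts(".")
    if missing:
        print("Missing files or directories:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected checkpoints and data files are present.")
```

An empty result means every listed file and directory was found; it does not verify file contents or integrity.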
The full pipeline consists of three steps. By default, it generates 5 samples per target on the LNR_CPSea test set.
conda activate CycDiff_DPO
# Step 1: run inference
GPU=0 bash scripts/inference_forw.sh
# Step 2: filter successful candidates
bash scripts/filter_success.sh ./results/LNR_CPSea/condition2_w5_5samples/results.jsonl
# Step 3: relax the filtered structures
INPUT_DIR=./results/LNR_CPSea/condition2_w5_5samples/candidates \
OUTPUT_DIR=./results/LNR_CPSea/condition2_w5_5samples/relaxed \
NUM_CORES=10 bash scripts/batch_relax_good_results.sh
DPO training fine-tunes the pre-trained latent diffusion model (LDM) to align with membrane permeability preferences. We provide the pre-generated preference pairs and the trained DPO model. To train from scratch:
conda activate CycDiff_DPO
GPU=0 bash scripts/train.sh
Preference pairs are used to train the DPO model. We provide pre-generated pairs at ./datasets/train_valid/generated_pairs.pkl.
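To illustrate how a single preference pair enters DPO training, here is a minimal sketch of a Diffusion-DPO-style loss for one pair. It shows the general form of the objective, not the loss implemented in this repository: the function name `dpo_pair_loss`, its argument names, and the default `beta` are all assumptions made for illustration.

```python
import math


def dpo_pair_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """Diffusion-DPO-style loss for one preference pair (illustrative only).

    err_w / err_l: denoising error of the model being trained on the
    preferred (w) and dispreferred (l) sample.
    err_w_ref / err_l_ref: the same errors under the frozen reference model.
    """
    # Negative when the trained model improves on the preferred sample
    # more than it does on the dispreferred one, relative to the reference.
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    x = -beta * margin
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the trained model matches the reference on both samples, the margin is zero and the loss equals log 2. The loss drops below that value as the model fits the preferred sample better than the reference does, which is the direction training pushes.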
To regenerate pairs with a custom permeability predictor, run:
bash scripts/run_build_pairs_xgboost.sh
This requires the training PDB structures. Download and place them in:
./datasets/train_valid/pdbs/ # Reference PDB structures
We provide the trained XGBoost ensemble at ./ckpts/xgboost_ensemble/, which includes:
- model_*.pkl: 10 individual XGBoost models
- scaler.pkl: feature scaler
- extractor.pkl: ECFP + descriptor feature extractor
- config.json: ensemble configuration
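The ensemble files above suggest a predict step that combines the outputs of several models. How the repository aggregates them is not stated here, so the sketch below assumes a plain mean over models. `ConstantModel` is a hypothetical stand-in for a trained regressor, and `ensemble_predict` is an illustrative name rather than a function from this codebase.

```python
from statistics import mean


class ConstantModel:
    """Hypothetical stand-in for one trained regressor with a predict() method."""

    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        # Returns one prediction per input row.
        return [self.value for _ in rows]


def ensemble_predict(models, rows):
    """Average the per-model predictions for each input row (assumed aggregation)."""
    per_model = [m.predict(rows) for m in models]
    # zip(*per_model) groups the predictions of all models for the same row.
    return [mean(column) for column in zip(*per_model)]
```

With real models, the rows would first pass through the feature extractor and scaler before prediction; that step is omitted because their interfaces are not documented here.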
To retrain from scratch using Caco-2 permeability data at ./datasets/caco2/caco2_dedup.csv:
bash scripts/train_xgb.sh