Discrete Space Diffusion: A Different Method for Protein Sequence Generation using LLMs

Overview

This project encompasses our work in applying discrete space diffusion for protein language models to generate developable lysozymes. Building off of the original Diffusion Protein Language Model (DPLM) paper for the model architecture, we build 3 training schemes to train the model on the sequence diffusion task. The three schemes are Full Diffusion, Two Stage Diffusion and Incremental Diffusion. The first two training schemes are from our implementations of the training schemes detailed in DPLM. Incremental Diffusion is a training scheme we propose to fix some of the limitations in the other two. We find that the model trained on Incremental Diffusion improves its ability to generate outside of its training set while maintaining high foldability and low generative perplexity.

Inference

First install the requirements file

pip install -r requirements.txt

Next, download the model weights from each training scheme here

Finally, run the following script 'generate_script_commands.py' with the commands lines below:

python /home/en540-lludwig2/ProMDLM/generate_script_commands.py \
  --weight_path 'path_to_full_diff_weights path_to_increment_diff_weights' \
  --save_names 'full_diff increment_diff'  \
  --temp '0.5 1 1.5' \
  --save_dir 'my_outputs_new' \
  --max_iter 100 \
  --generation_length 160 \
  --nb_generated_sequences 1 \
  --resample_ratio 0.3 \
  --device 'cuda'

ProGen_Finetune

In our evaluation we finetune ProGen to compare to our sequence space diffusion approach. Code for finetuning ProGen can be found here.

Results

We evaluated our Incremental Diffusion method (Increment) on it's ability to generate Lysozymes. We tested our training scheme against the two training schemes from the DPLM paper, Full Diffusion (fullDiff) and Two Stage Diffusion (two_stage). We also compared our method to ProGen that we trained on the split of our custom dataset of Lysozyme sequences and a randomly selected pool of Lysozyme sequences from the test split of our custom dataset.

We started by generating or selecting 100 sequences from the different methods. Since certain methods could generated repetitive sequences we applied a sequence entropy cutoff of 3.5, where the sequences that passed moved on with the evaluation.

Next, we compared the following metrics: pident (diversity), foldability (pLDDT) and psuedo_perplexity using ESM 650M parameters plotted on the following scatterplots.

From ESM, we also construct a tSNE plot to visualize ESM embedding space of generated lysozymes (green) compared to real lysozymes (blue) and other protein families.

Finally, we investigate the hit rate of generated Lysozyme sequences across the different training schemes. We consider a sequence as "developable" if it has > 0.7 plddt and < 80 diversity (pident) and > 3.5 entropy.

From these results, we see that IncrementDiff and two_stage generate sequences with pseudo-perplexity similar to the testing set while also remaining diverse. IncrementDiff also generates foldable sequences with high diversity compared to the other methods tested. From the tSNE plot, IncrementDiff generations also occupy their own cluster close to the lysozyme family within the ESM Embedding space. Finally, generated sequences from our IncrementDiff are the most developable.

Finally, using Alphafold3 we qualitatively analyzed folded representations of our generated sequences from our IncrementDiff training and the DPLM Two Stage Training against a true Lysozyme. From the folds below we can see that the generated sequences adopt a similar celft which is representative of Lysozyme proteins.

Sources

DPLM paper: Wang, X., Zheng, Z., Ye, F., Xue, D., Huang, S., & Gu, Q. (2024, October 16). Diffusion language models are versatile protein learners. arXiv.org. https://arxiv.org/abs/2402.18567

License

MIT License. See LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 87 Commits
DisplayImages		DisplayImages
blast		blast
configs		configs
data		data
generated_sequences		generated_sequences
imgs		imgs
misc_ProteinData		misc_ProteinData
models		models
my_outputs		my_outputs
plots_all_temp		plots_all_temp
README.md		README.md
calculate_seq_entropy.py		calculate_seq_entropy.py
esmfold.py		esmfold.py
evaluationESM.py		evaluationESM.py
filter.py		filter.py
generate_script.py		generate_script.py
generate_script_commands.py		generate_script_commands.py
hitRate.py		hitRate.py
plotting.ipynb		plotting.ipynb
prepare.py		prepare.py
pseudo_perplexity_log.txt		pseudo_perplexity_log.txt
read_pkl.py		read_pkl.py
requirements.txt		requirements.txt
run_blastp_analysis.py		run_blastp_analysis.py
train_custom.py		train_custom.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Discrete Space Diffusion: A Different Method for Protein Sequence Generation using LLMs

Overview

Inference

ProGen_Finetune

Results

Sources

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Discrete Space Diffusion: A Different Method for Protein Sequence Generation using LLMs

Overview

Inference

ProGen_Finetune

Results

Sources

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages