Peptide pLDDT Predictor - Machine Learning Pipeline

🚀 Try the Interactive Dashboard!

Screen peptides and predict their pLDDT scores in real time with the live Binder Predictor app.

This repository contains a deep learning pipeline, built with PyTorch, for predicting peptide pLDDT scores. It replaces an earlier linear regression approach with a Multi-Layer Perceptron (MLP) and adds hyperparameter optimization, interpretability, and diagnostic tools.

Key Features

Deep Learning Framework: Built with PyTorch, optimized for Apple Silicon (MPS/GPU).
Advanced Optimization: Utilizes Bayesian Optimization (Optuna) for hyperparameter tuning and GroupKFold cross-validation for robust performance estimation.
Interpretability: Integrated SHAP (Shapley Additive Explanations) to visualize position-specific amino acid effects on pLDDT.
Professional Diagnostics: Complete suite for residual analysis, bias detection, and identifying biochemical "blind spots."
Experiment Tracking: Integrated with Weights & Biases (W&B) for real-time training monitoring.
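
To illustrate why grouped cross-validation matters, here is a minimal pure-Python sketch of the idea behind GroupKFold: every sample that shares a group label lands in the same fold, so related peptides never appear on both the training and validation side. This is not scikit-learn's implementation and not necessarily how this repository builds its folds; the group labels (`epA`, `epB`, ...) and the function name `group_k_fold` are hypothetical.

```python
from collections import defaultdict


def group_k_fold(groups: list[str], n_splits: int):
    """Yield (train_idx, val_idx) pairs where no group spans both sides."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if n_splits > len(members):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: (-len(members[g]), g)):
        smallest = min(folds, key=len)
        smallest.extend(members[group])

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, val_idx


# Hypothetical grouping: peptides from the same parent epitope share a label.
groups = ["epA", "epA", "epB", "epB", "epB", "epC", "epD", "epD"]
for train_idx, val_idx in group_k_fold(groups, n_splits=3):
    # No group may appear in both the training and the validation indices.
    assert not {groups[i] for i in train_idx} & {groups[i] for i in val_idx}
    print("validation indices:", val_idx)
```

Splitting by group rather than by row tends to give a more conservative performance estimate when the dataset contains near-duplicate sequences.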

Project Structure

src/: Core logic and feature engineering.
- data_loader.py: Loading CSV datasets, sequence alignment, and feature encoding.
- esm_feature_extractor.py: Interface for ESM-2 embeddings.
scripts/: Independent pipelines for training and analysis.
- train_pytorch.py: Main training script for the MLP model.
- hyperparameter_tune_advanced.py: Bayesian hyperparameter tuning with Optuna.
- plot_diagnostics_pytorch.py: Evaluation metrics and residual analysis.
streamlit_app.py: Interactive Streamlit dashboard for real-time prediction and exploration.
assets/docs/: Visual assets used in this documentation.
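
As a rough illustration of the feature-encoding step, the sketch below one-hot encodes a gap-padded peptide into a flat vector that an MLP could consume. It is an assumption about the general approach, not the code in `data_loader.py`: the 21-symbol alphabet, the fixed length of 15, and the function name `one_hot_encode` are all hypothetical.

```python
# Hypothetical alphabet: 20 standard amino acids plus "-" for alignment gaps.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_encode(sequence: str, length: int) -> list[float]:
    """Flatten a gap-padded peptide into a (length * 21) one-hot vector."""
    if len(sequence) > length:
        raise ValueError(f"sequence longer than {length}: {sequence!r}")
    # Right-pad short sequences with gaps so every vector has the same size.
    padded = sequence.ljust(length, "-")
    vector = [0.0] * (length * len(ALPHABET))
    for pos, aa in enumerate(padded):
        if aa not in INDEX:
            raise ValueError(f"unknown residue {aa!r} at position {pos}")
        # Each position owns a block of 21 slots; set the slot for this residue.
        vector[pos * len(ALPHABET) + INDEX[aa]] = 1.0
    return vector


features = one_hot_encode("-ELAELDEQRN", length=15)
print(len(features), sum(features))
```

Because each position maps to its own block of slots, per-feature attributions (for example from SHAP) can be reshaped back into a position-by-residue grid.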

Model Performance & Diagnostics

1. Reliability Analysis

The model demonstrates strong predictive power, as visualized in the Correlation and Residual plots.

Figures: Actual vs. Predicted; Residual Analysis.

2. SHAP Interpretability

The 2D SHAP heatmap shows which amino acids at which positions push the predicted pLDDT score higher or lower.

3. Biological Blind Spots

We identify specific sequences where the model deviates most from the actual values, pointing to rare biochemical motifs for further investigation.

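| Sequence | Actual pLDDT | Predicted pLDDT | Residual (actual - predicted) |
| --- | --- | --- | --- |
| `GESTRQNFPG-----` | 24.52 | 84.26 | -59.74 |
| `SVPQRDIFSS----` | 33.32 | 87.36 | -54.04 |
| `-ELAELDEQRN` | 40.54 | 93.23 | -52.69 |
| `-SLERQIFLDA` | 42.66 | 90.75 | -48.09 |
| `-KDNLSQQIES` | 91.24 | 46.35 | 44.89 |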
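
The ranking behind this list can be reproduced with a few lines of Python. The sketch below takes the residual as actual minus predicted, which matches the values listed here, and sorts by absolute residual. The function name `rank_blind_spots` is hypothetical and is not taken from `plot_diagnostics_pytorch.py`.

```python
# (sequence, actual pLDDT, predicted pLDDT) rows copied from the list above.
rows = [
    ("GESTRQNFPG-----", 24.52, 84.26),
    ("SVPQRDIFSS----", 33.32, 87.36),
    ("-ELAELDEQRN", 40.54, 93.23),
    ("-SLERQIFLDA", 42.66, 90.75),
    ("-KDNLSQQIES", 91.24, 46.35),
]


def rank_blind_spots(rows, top_n=3):
    """Return the top_n (sequence, residual) pairs by absolute residual."""
    # Residual = actual - predicted, so a negative value is an over-prediction.
    scored = [
        (seq, round(actual - predicted, 2)) for seq, actual, predicted in rows
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)[:top_n]


worst = rank_blind_spots(rows)
for seq, residual in worst:
    print(f"{seq:<16} {residual:+.2f}")
```

Four of the five listed sequences have negative residuals, meaning the model over-predicts pLDDT for them; only `-KDNLSQQIES` is under-predicted.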

How to Run

Install Dependencies:
```
pip install -r requirements.txt
```
Run the Dashboard:
```
streamlit run streamlit_app.py
```
Run Training (from root):
```
python scripts/train_pytorch.py
```

Run Advanced Tuning:
```
python scripts/hyperparameter_tune_advanced.py
```

Next Steps

Implement ESM-2 Embeddings for enhanced biological context.
Explore 1D-CNN architectures for local motif detection.
Use the model as a "Digital Lab" for generative peptide design.
