This repository contains the official implementation of the paper Language Models for Molecular Dynamics (bioRxiv 2024).
This codebase implements Molecular Dynamics Language Models (MDLMs), a novel approach that uses a GPT-J model to explore the conformational space of Chignolin. The model is trained on a short classical MD trajectory and maintains structural accuracy through kernel density estimates derived from extensive MD datasets.
- GPU with minimum 8GB VRAM
- Tested on Intel® Core™ i7-12700H CPU, NVIDIA GeForce RTX 4060 Laptop GPU
- 16GB RAM recommended
- 15GB free disk space
- Python 3.8+
- PyTorch 2.4+
- CUDA 12.7+
- Ubuntu 20.04 or later
- macOS 12.0 or later
- Windows 10/11 with WSL2
Typical install time: 10-15 minutes
├── dataset/ # Dataset directory
│ ├── train.txt # Training data
│ └── valid.txt # Validation data
├── inference/ # Inference code
│ ├── pretrained_model/ # Pretrained model weights
│ ├── angle_utils.py # Angle processing utilities
│ ├── model_setup.py # Model initialization
│ └── sampler.py # Sampling implementation
├── KDE/ # Kernel Density Estimation
│ ├── KDE220.py
│ ├── low_memory_kde_functions.pkl
│ └── kde_functions_from_md.pkl # Download from Zenodo
├── model/ # Model files
│ ├── custom_trainer.py # Custom training implementation
│ ├── data_processing.py # Data processing utilities
│ ├── helpers.py
│ └── model_config.py # Model configuration
├── tokenizer/ # Tokenizer files
│ └── vocab.txt
├── utils/ # Utility functions
│ ├── training_utils.py
│ ├── config.py
│ └── generate.py
└── train.py # Main training script
# Clone the repository
git clone https://github.com/yourusername/mdlm.git
cd mdlm
# Create and activate virtual environment (recommended)
python -m venv mdlm_env
source mdlm_env/bin/activate # On Windows use: mdlm_env\Scripts\activate
# Install requirements
pip install -r requirements.txt
# Download KDE functions
wget https://zenodo.org/records/14263500/files/kde_functions_from_md.pkl -O KDE/kde_functions_from_md.pkl
The dataset directory contains example tokenized trajectory data:
- train.txt: Training set of tokenized conformations
- valid.txt: Validation set of tokenized conformations
Each line represents one protein conformation. Tokens follow the format XyyZ, where:
- X and Z are the amino acid identities (one-letter codes) of two adjacent residues
- yy represents the discretized φ-ψ angles between those residues
- . marks the end of each conformation
Example: GaeY YdnD DfnP PemE EdgT TkhG GciT TbeW WnaG .
represents a complete conformation of Chignolin.
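The token format above can be checked with a few lines of Python. This is a minimal sketch, not part of the repository: the function names `parse_conformation` and `sequence_from_triples` are hypothetical, and the check that each token's last residue matches the next token's first residue is an assumption inferred from the example line rather than a documented rule.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def parse_conformation(line: str) -> List[Triple]:
    """Split one conformation line into (residue, angle_code, next_residue) triples."""
    tokens = line.split()
    if not tokens or tokens[-1] != ".":
        raise ValueError("a conformation must end with the '.' terminator")

    triples: List[Triple] = []
    for tok in tokens[:-1]:
        if len(tok) != 4:
            raise ValueError(f"expected a 4-character XyyZ token, got {tok!r}")
        triples.append((tok[0], tok[1:3], tok[3]))

    # Assumption: adjacent tokens overlap by one residue (Z of one token is X of the next).
    for (_, _, right), (left, _, _) in zip(triples, triples[1:]):
        if right != left:
            raise ValueError(f"tokens do not chain: {right!r} followed by {left!r}")
    return triples


def sequence_from_triples(triples: List[Triple]) -> str:
    """Rebuild the amino acid sequence from the chained residue letters."""
    if not triples:
        return ""
    return triples[0][0] + "".join(t[2] for t in triples)


example = "GaeY YdnD DfnP PemE EdgT TkhG GciT TbeW WnaG ."
triples = parse_conformation(example)
print(len(triples))                    # 9 angle pairs
print(sequence_from_triples(triples))  # GYDPETGTWG
```

The nine tokens in the example recover the ten-residue sequence GYDPETGTWG, which matches the nine angle pairs reported for Chignolin in the generated output.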
Expected runtime with demo data: ~14 hours for training and inference
To train the model:
python train.py --tokenizer ./tokenizer/ \
--train_data ./dataset/train.txt \
--valid_data ./dataset/valid.txt \
--output_dir ./output
To use the pretrained weights and generate conformations for Chignolin:
python generate.py
This will generate conformations saved as a NumPy file (.npy) with shape (X, 2, 9), where:
- X is the number of generated conformations
- 2 represents φ and ψ angles
- 9 represents the number of angle pairs in Chignolin
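The output layout can be sanity-checked before any analysis. The sketch below makes several assumptions: the filename `generated_conformations.npy` is hypothetical (the actual output name is not stated here), index 0 of the second axis is taken to be φ and index 1 to be ψ, and random stand-in data in degrees replaces real model output, since the angle units are not specified either.

```python
import os
import tempfile

import numpy as np


def split_angles(conformations: np.ndarray):
    """Return (phi, psi) arrays of shape (X, 9) from an array of shape (X, 2, 9)."""
    if conformations.ndim != 3 or conformations.shape[1:] != (2, 9):
        raise ValueError(f"expected shape (X, 2, 9), got {conformations.shape}")
    phi = conformations[:, 0, :]  # assumed: index 0 holds phi
    psi = conformations[:, 1, :]  # assumed: index 1 holds psi
    return phi, psi


# Stand-in for real model output: 5 random conformations.
rng = np.random.default_rng(0)
fake = rng.uniform(-180.0, 180.0, size=(5, 2, 9))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generated_conformations.npy")  # hypothetical filename
    np.save(path, fake)
    loaded = np.load(path)

phi, psi = split_angles(loaded)
print(phi.shape, psi.shape)  # (5, 9) (5, 9)
```

To inspect real output, replace the temporary file with the path of the file written by the generation script.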
To reproduce the results from the paper:
- Train the model using provided demo data (~6 hours)
- Run inference for sampling (~8 hours)
- Generated conformations will be saved as a .npy file with the (X, 2, 9) shape described above.
If you use this code in your research, please cite:
@article{Murtada2024.11.25.625337,
author = {Murtada, Mhd Hussein and Brotzakis, Z. Faidon and Vendruscolo, Michele},
title = {Language Models for Molecular Dynamics},
elocation-id = {2024.11.25.625337},
year = {2024},
doi = {10.1101/2024.11.25.625337},
publisher = {Cold Spring Harbor Laboratory},
journal = {bioRxiv}
}
This project is licensed under the MIT License; see the LICENSE file for details.
For questions about the code, please open an issue.