An earlier version of this work was presented at NeurIPS 2025 workshops as a non-archival presentation:
- AI for Science: https://openreview.net/pdf?id=Imw5NGMgje
- Machine Learning and the Physical Sciences: https://ml4physicalsciences.github.io/2025/files/NeurIPS_ML4PS_2025_107.pdf
The design of antibodies with high affinity and specificity for target antigens is a cornerstone of therapeutic and diagnostic innovation. Traditional optimization strategies, such as phage or yeast display and directed evolution, remain resource-intensive and limited in their ability to integrate contextual information. Recent AI-driven approaches have accelerated protein engineering, but most rely exclusively on structured inputs, overlooking the potential of natural language as a flexible design interface. In this work, we introduce TeBaAb, a novel text-based antigen-conditioned framework for antibody redesign that combines generative modeling with iterative optimization inspired by directed evolution. TeBaAb integrates a Conditional Variational Autoencoder (CVAE) jointly conditioned on antigen sequences and textual descriptions of antibody properties, coupled with a two-stage binding affinity predictor and an iterative enrichment loop. To support this approach, we curated AbDes, a new dataset of 7,684 text–antibody–antigen pairs with accompanying structural and binding information.
In silico experimental evaluations demonstrate that TeBaAb improves the predicted binding affinity by an average of
- Python 3.10 or higher
- A virtual environment (e.g., venv or conda) is recommended
- Dependencies listed in requirements.txt
- Clone the repository:
  git clone https://github.com/HySonLab/TeBaAb.git
  cd TeBaAb
- Set up a virtual environment and install dependencies:
  conda env create -f environment.yml
  or
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
- Create the checkpoints directory:
  mkdir -p checkpoints
- AbDes: Contains antibody-antigen sequences paired with text descriptions.
- Download: Hugging Face: AbDes
- Path: datasets/abdes
Ensure datasets are placed in the specified paths or update configuration files accordingly.
Configuration is managed via YAML files in configs/:
- training.yaml: Defines architectures for protein/text encoders, CVAE, and fitness predictor.
- optimize.yaml: Configures directed evolution parameters (e.g., generations, mutation rate).
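To make the two example parameters concrete, the following is a minimal sketch of how `generations` and `mutation_rate` typically drive a directed-evolution loop. It is not the repository's `optimize.py`: the function names, `population_size`, `top_k`, and the toy oracle (which simply scores similarity to a fixed target string) are all hypothetical stand-ins for illustration, and the actual keys in `optimize.yaml` may differ.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(seq, mutation_rate, rng):
    """Swap each residue for a different one with probability `mutation_rate`."""
    out = []
    for aa in seq:
        if rng.random() < mutation_rate:
            out.append(rng.choice([a for a in AMINO_ACIDS if a != aa]))
        else:
            out.append(aa)
    return "".join(out)


def toy_oracle(seq, target):
    """Hypothetical fitness function: fraction of positions matching `target`.

    In the real pipeline this role is played by the trained affinity predictor.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def directed_evolution(seed, oracle, generations, mutation_rate,
                       population_size=50, top_k=5, rng_seed=0):
    """Mutate, score, and select for a fixed number of generations."""
    rng = random.Random(rng_seed)
    parents = [seed]
    for _ in range(generations):
        children = [mutate(rng.choice(parents), mutation_rate, rng)
                    for _ in range(population_size)]
        # Keep the parents in the pool (elitism), so the best score never drops.
        pool = set(parents) | set(children)
        # Sort by descending score; break ties alphabetically for determinism.
        parents = sorted(pool, key=lambda s: (-oracle(s), s))[:top_k]
    return parents


if __name__ == "__main__":
    seed, target = "EVQLVESGGG", "EVQLVQSGAE"
    best = directed_evolution(seed, lambda s: toy_oracle(s, target),
                              generations=10, mutation_rate=0.1)
    print(best[0], toy_oracle(best[0], target))
```

A higher mutation rate explores more aggressively per generation but makes it harder to keep beneficial substitutions. More generations give selection more rounds to enrich high-scoring variants.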
- Extract and preprocess the dataset:
  cd datasets
  python3 extract_embedding.py \
    --input_csv ./abdes/train.csv \
    --output_dir ./cvae \
    --output_prefix train \
    --modality all \
    --embedding_type pooler \
    --device cuda:0 \
    --esmc_cache /path/to/esmc_300m_2024_12_v0.pth
- Train the CVAE:
  ./scripts/run_train_cvae.sh
  - Trains on datasets/cvae.
- Train the Oracle:
  python3 scripts/train_predictor.py
  - Trains on datasets/affinity.
Generate and optimize protein sequences using directed evolution:
./scripts/optimize.py
- CVAE: Reconstruction loss, KL divergence, validation loss.
- Oracle: Mean Squared Error (MSE) for fitness prediction.
- Binding affinity scores from the oracle.
- Diversity, novelty.
- 3D Structure Error: ABodyBuilder2
- Antibody Developability: TAP
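The diversity and novelty metrics listed above are not defined in this README. As a rough illustration, the sketch below uses one common convention: diversity as the mean pairwise normalized edit distance within the generated set, and novelty as the mean distance from each generated sequence to its nearest neighbor in a reference set. These definitions and all function names are assumptions, not necessarily what the repository computes.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def diversity(generated):
    """Mean pairwise normalized distance within the generated set."""
    pairs = list(combinations(generated, 2))
    if not pairs:
        return 0.0
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


def novelty(generated, reference):
    """Mean distance from each generated sequence to its closest reference."""
    if not generated or not reference:
        raise ValueError("both sequence sets must be non-empty")
    return sum(min(normalized_distance(g, r) for r in reference)
               for g in generated) / len(generated)


if __name__ == "__main__":
    designs = ["EVQLVESGGG", "EVQLVQSGAE", "QVQLVESGGG"]
    training = ["EVQLVESGGG"]
    print("diversity:", diversity(designs))
    print("novelty:", novelty(designs, training))
```

Edit distance is used here instead of Hamming distance so that sequences of different lengths can be compared without alignment.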
@inproceedings{
nguyen2025tebaab,
title={TeBaAb: Text-Based Antigen-Conditioned Antibody Redesign via Directed Evolution},
author={Cuong Manh Nguyen and Huy-Hoang Do-Huu and Viet Thanh Duy Nguyen and Truong-Son Hy},
booktitle={NeurIPS 2025 AI for Science Workshop},
year={2025},
url={https://openreview.net/forum?id=Imw5NGMgje}
}