AAVGen is an AI-driven framework that designs novel AAV capsids optimized simultaneously for kidney targeting, production efficiency, and thermal stability, enabling next-generation renal gene therapy vectors.
Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. This challenge is compounded in the kidney, which presents unique anatomical barriers and cellular targets that demand precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine-tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM-2-based regression predictors, each trained to predict a key property: production fitness, kidney tropism, or thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validation shows that the majority of generated variants achieve superior scores across all three target properties, indicating successful multi-objective optimization. Furthermore, structural analysis via AlphaFold3 confirms that the generated sequences preserve the canonical capsid fold despite sequence diversification. AAVGen establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.
Figure 1. Schematic overview of the AAVGen framework. Three ESM-2 regression models predict production fitness, kidney tropism, and thermostability, serving as reward functions within a GSPO-based reinforcement learning pipeline to fine-tune a ProtGPT2 generative model.
- Regression models: Production fitness predictor achieved Spearman ρ = 0.91; kidney tropism ρ = 0.35; thermostability ρ = 0.26.
- Generative diversity: AAVGen generated 500,000 sequences with ~96% uniqueness; median sequence identity to AAV2 WT of 99.18%, with ~13% edit distance, indicating biologically plausible novelty.
- Functional quality: 99.7% of generated sequences classified as "Best" for production fitness; 98.27% as "Good" or better for kidney tropism; 88.57% as "Good" for thermostability.
- Multi-property co-optimization: Strong positive Spearman correlations across all three properties, indicating that the traits can be improved jointly rather than traded off.
- Structural fidelity: AlphaFold3 analysis showed median RMSD of ~0.42–0.47 Å vs. AAV2 WT, outperforming randomly generated baselines (median RMSD 0.48 Å; median fitness −4.65).
```
AAVGen/
├── assets/                    # Figures and visual assets
├── datasets/                  # Input/output datasets (.csv)
├── src/
│   ├── data_processing/       # Data preprocessing scripts
│   │   ├── get_deep_diversification_production_fitness.py
│   │   ├── get_fit4function_production_datasets.py
│   │   └── get_landscape_production_fitness.py
│   ├── train/                 # Model training scripts
│   │   ├── aav2_regs.py       # Regression model training
│   │   ├── sft.py             # Supervised fine-tuning
│   │   └── gspo.py            # GSPO reinforcement learning
│   └── inference/             # Sequence generation and scoring
│       ├── AAVGen.py
│       ├── AAVGen_CLI.py
│       └── regressions_inferences.py
└── README.md
```
```bash
git clone https://github.com/your-username/AAVGen.git
cd AAVGen
pip install -r requirements.txt
```

Raw and processed datasets are available on Hugging Face:
The training corpus integrates data from three independent studies:
- Ogden et al. – AAV2 deep mutational scanning: 31,579 VP1 sequences (production fitness), 24,984 (kidney tropism), 30,889 (thermostability).
- Bryant et al. – AAV2 VP1 residues 561–588: 296,896 multi-mutation sequences with fitness scores.
- Eid et al. – AAV9: 100,000 sequences with production fitness and liver biodistribution data.
```bash
cd src/data_processing
python get_deep_diversification_production_fitness.py
python get_fit4function_production_datasets.py
python get_landscape_production_fitness.py
```

A preprocessed version is also available directly on the Hugging Face dataset hub.
Three ESM-2-based regression models are trained sequentially via transfer learning to predict production fitness, kidney tropism, and thermostability. The fitness model (trained for ~11.4 hours on an NVIDIA V100 32 GB) serves as the initialization checkpoint for the other two models.
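The sequential transfer-learning scheme can be sketched in PyTorch. Note that `PropertyRegressor`, the toy linear encoder, and every dimension below are illustrative stand-ins, not AAVGen's actual ESM-2 fine-tuning code:

```python
import torch
import torch.nn as nn

class PropertyRegressor(nn.Module):
    """One property predictor: an encoder producing per-sequence embeddings,
    topped with a linear regression head (trained with MSE loss)."""
    def __init__(self, encoder, hidden_dim=320):
        super().__init__()
        self.encoder = encoder              # stand-in for ESM-2 (8M)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

# Sequential transfer learning: the trained fitness model initializes the
# tropism (and thermostability) models before their own fine-tuning.
fitness = PropertyRegressor(nn.Linear(16, 320))   # toy encoder and input dim
# ... train `fitness` with MSE loss and AdamW on production-fitness labels ...
tropism = PropertyRegressor(nn.Linear(16, 320))
tropism.load_state_dict(fitness.state_dict())     # warm start from fitness
optimizer = torch.optim.AdamW(tropism.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
```

The thermostability model would be warm-started the same way from the fitness checkpoint.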
```bash
cd src/train
python aav2_regs.py
```

ProtGPT2 (~738M parameters) is fine-tuned on 192,199 non-redundant VP1 sequences from AAV2 and AAV9 to learn residue–residue relationships across serotypes (~9 hours on an NVIDIA V100).
```bash
cd src/train
python sft.py
```

The SFT model is further refined via GSPO using a composite reward signal from the three regression models, plus auxiliary rewards for sequence-length diversity and intra-batch uniqueness (~9.6 hours on an NVIDIA V100).
```bash
cd src/train
python gspo.py
```

Via a graphical UI: Launch the web interface to generate AAV VP1 sequences interactively.
Via Python API:

```python
from AAVGen import generate

input_sequences = ["M", "MAAG"]
generated_sequences = generate(
    input_sequences,
    temperature=0.8,
    top_p=0.95,
    max_length=300,
)
print(generated_sequences)
```

Via CLI (recommended for large-scale generation):
Quick test run:

```bash
cd src/inference
python AAVGen_CLI.py -n 1000 -b 32 --name test-run
```

Full example with all options:
```bash
python AAVGen_CLI.py \
    -n 100000 \
    --batch-size 64 \
    --temperature 0.9 \
    --top-p 0.95 \
    --top-k 40 \
    --max-length 600 \
    --output-dir ./results \
    --name my-aav2-sequences
```

Score your generated sequences for production fitness, kidney tropism, and thermostability using the trained regression models. Place your sequences in a .csv file with a column named `generate_seqs` inside the `datasets/` folder, then run:
```bash
cd src/inference
python regressions_inferences.py \
    --df_name my_dataset \
    --batch_size 256
```

Do not include the `.csv` extension in `--df_name`. Output will be saved as `my_dataset_out.csv`.
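As an example, an input file in the expected format can be written with pandas (the sequences below are arbitrary placeholders, not real variants):

```python
from pathlib import Path

import pandas as pd

# Write generated sequences in the format regressions_inferences.py expects:
# a single `generate_seqs` column, saved as a .csv under datasets/.
seqs = ["MAAGGGYLP", "MAAGVAYLP"]  # placeholder sequences for illustration
Path("datasets").mkdir(exist_ok=True)
pd.DataFrame({"generate_seqs": seqs}).to_csv("datasets/my_dataset.csv", index=False)
```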
Note: The preprocessing step (see Data Preprocessing) must be completed before running regression inference.
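For intuition about what the `--temperature`, `--top-p`, and `--top-k` generation flags control, here is a generic sketch of temperature scaling with top-k and nucleus (top-p) truncation for a single token; this illustrates the standard technique, not AAVGen's internal sampler:

```python
import numpy as np

def sample_token(logits, temperature=0.9, top_p=0.95, top_k=40, rng=None):
    """Pick one token id from raw logits using temperature scaling,
    top-k truncation, and top-p (nucleus) truncation."""
    if rng is None:
        rng = np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / temperature
    order = np.argsort(logits)[::-1]                  # most likely first
    probs = np.exp(logits[order] - logits[order].max())
    probs /= probs.sum()
    probs = probs[:top_k]                             # top-k: at most k candidates
    cut = int(np.searchsorted(np.cumsum(probs), top_p)) + 1
    probs = probs[:cut]                               # smallest prefix reaching top_p
    probs /= probs.sum()                              # renormalize survivors
    return int(order[rng.choice(probs.size, p=probs)])
```

Lower temperatures and tighter top-p/top-k cuts make generation more conservative; the CLI defaults above trade some determinism for sequence diversity.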
| Component | Base Model | Parameters | Purpose |
|---|---|---|---|
| Regression models (×3) | ESM-2 (UR50D) | 8M | Reward functions for fitness, tropism, thermostability |
| Generative model | ProtGPT2 | 738M | VP1 sequence generation |
- Regression models are fine-tuned from ESM-2 using sequential transfer learning (fitness → tropism, fitness → thermostability), supervised with MSE loss and AdamW.
- SFT fine-tunes ProtGPT2 on a curated corpus of 192,199 high-fitness AAV2 and AAV9 VP1 sequences.
- GSPO applies sequence-level policy gradient optimization. For each training step, G = 32 candidate sequences are generated and evaluated; advantage estimates are computed via group-wise normalization, and the policy is updated using a clipped surrogate objective (ε = 0.2).
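As a schematic, the group-wise advantage and clipped sequence-level objective can be written in a few lines of numpy (an illustration of the technique as described, not the project's training code):

```python
import numpy as np

def gspo_surrogate(rewards, logp_new, logp_old, eps=0.2):
    """Clipped surrogate objective for one group of G sampled sequences.

    rewards:  composite reward per sequence, shape (G,)
    logp_new: sequence-level log-probability under the current policy
    logp_old: sequence-level log-probability under the sampling policy
    """
    # Group-wise advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level importance ratio between current and sampling policies.
    ratio = np.exp(logp_new - logp_old)
    # PPO-style clipping keeps the update close to the sampling policy.
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

The policy is updated by gradient ascent on this objective, averaged over groups.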
| Reward | Description |
|---|---|
| Production fitness | ESM-2 regressor score vs. AAV2 WT threshold |
| Kidney tropism | ESM-2 regressor score vs. AAV2 WT threshold |
| Thermostability | ESM-2 regressor score vs. AAV2 WT threshold |
| Length controller | Gaussian penalty discouraging WT-length sequences |
| Intra-batch uniqueness | Binary reward penalizing duplicate sequences within a batch |
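A toy version of the composite reward might combine these terms as below; all thresholds, the length sigma, and the equal weighting are hypothetical choices for illustration, not the values used in training:

```python
import numpy as np

# Hypothetical thresholds; in AAVGen they derive from AAV2 WT regressor scores.
WT_FITNESS = WT_TROPISM = WT_STABILITY = 0.0
WT_LEN, LEN_SIGMA = 735, 10.0  # AAV2 VP1 length; sigma is an assumption

def composite_reward(fitness, tropism, stability, seq, batch_seqs):
    # Property terms: +1 whenever a regressor score beats its WT threshold.
    r_props = sum(
        float(score > threshold)
        for score, threshold in [
            (fitness, WT_FITNESS),
            (tropism, WT_TROPISM),
            (stability, WT_STABILITY),
        ]
    )
    # Length controller: Gaussian penalty peaked at exactly WT length,
    # discouraging trivial WT-length outputs (promotes length diversity).
    r_len = -float(np.exp(-((len(seq) - WT_LEN) ** 2) / (2 * LEN_SIGMA**2)))
    # Intra-batch uniqueness: binary penalty for duplicates in the batch.
    r_uniq = -1.0 if batch_seqs.count(seq) > 1 else 0.0
    return r_props + r_len + r_uniq
```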
500 sequences sampled from "Good"/"Best" categories were folded using AlphaFold3 (5 structures each) and compared to the AAV2 WT PDB structure via RMSD in PyMOL (v3.1.1). A baseline of 250 randomly mutated sequences was generated for comparison.
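For reference, RMSD after optimal rigid-body superposition (the quantity PyMOL reports, ignoring its outlier rejection and atom matching) can be computed over matched coordinates with the Kabsch algorithm:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid-body
    superposition (Kabsch algorithm). A bare-bones stand-in for the
    PyMOL-based comparison used in the paper."""
    P = P - P.mean(axis=0)                      # center both point sets
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)           # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```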
All models were trained on a dedicated server equipped with:
- GPU: NVIDIA V100 (32 GB VRAM)
- CPU: AMD EPYC 7502
- RAM: 32 GB
| Model | Training Time |
|---|---|
| Production fitness regression | 11 h 25 min |
| Kidney tropism regression | 3 h 24 min |
| Thermostability regression | 3 h 29 min |
| SFT (ProtGPT2) | 9 h 05 min |
| GSPO fine-tuning | 9 h 38 min |
If you use AAVGen in your research, please cite:
```bibtex
@misc{ghaffarzadehesfahani2026aavgenprecisionengineeringadenoassociated,
      title={AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting},
      author={Mohammadreza Ghaffarzadeh-Esfahani and Yousof Gheisari},
      year={2026},
      eprint={2602.18915},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2602.18915},
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
For questions or feedback, please open an issue or contact mreghafarazadeh@gmail.com.

