Deep learning of tautomer stability from crystallographic proton positions

Tautomerism plays a central role in molecular recognition, physicochemical properties, and chemical reactivity, yet rapid identification of stable tautomeric states remains a persistent challenge in molecular design and structure-based drug discovery. Quantum-mechanical approaches are often too computationally demanding for library-scale application, while rule-based and 3D-dependent machine-learning methods remain limited in either transferability or throughput. Here, we show that experimentally resolved hydrogen positions in the Cambridge Structural Database (CSD) provide a powerful large-scale source of supervision for learning tautomer stability. Using more than 1.1 million tautomeric states derived from proton-resolved crystal structures, we trained a 2D graph neural network that predicts stable tautomeric states directly from molecular topology, without conformer generation or quantum calculations. Applied to 5,075 PDBBind ligands with multiple tautomeric states, the model identified 126 cases in which the assigned tautomeric state is likely incorrect; in each case, the reassigned stable tautomer exhibited improved hydrogen-bonding patterns together with fewer unsatisfied polar atoms. We further developed an open-source workflow, Tautomer-Predictor, capable of processing the 4.6-million-compound Enamine collection in about 3.2 h on a single GPU-enabled node. Together, these results establish crystallographic proton placement as a rich and underexploited source of chemical knowledge for learning tautomer stability, and provide a practical route to large-scale tautomer assignment for molecular discovery.

Free webserver: https://huggingface.co/spaces/panceler/Tautomer-Predictor-Web

SMILES / CSV / TSV / SDF -> clean molecules -> enumerate tautomers -> build graphs -> score with AttentiveFP -> rank stable states

Requirements

Core prediction dependencies:

Python 3.11.8
numpy 1.26.4
pandas 2.2.1
RDKit 2025.9.3
PyTorch 2.2.0
PyTorch Geometric 2.4.0
torch-scatter 2.1.2
torch-sparse 0.6.18
torch-cluster 1.6.3
torch-spline-conv 1.2.2
mols2grid

Training-only extras:

pytorch-lightning 2.5.5
torchmetrics 1.4.0
scikit-learn 1.4.1

Usage

1. End-to-end prediction

The main prediction entrypoint is predict_tautomers_parallel.py. It reads CSV, TSV, or SDF input, cleans molecules, enumerates candidate tautomers, converts them to PyTorch Geometric graphs, scores them with the exported model in checkpoints/best_model_state_dict.pt, and writes ranked output.

usage: predict_tautomers_parallel.py [-h] [--input-format {auto,csv,tsv,sdf}]
                                     [--smiles-column SMILES_COLUMN]
                                     [--name-column NAME_COLUMN]
                                     [--has-header {auto,true,false}]
                                     [--sdf-name-property SDF_NAME_PROPERTY]
                                     [--encoding ENCODING]
                                     [--checkpoint-pattern CHECKPOINT_PATTERN]
                                     [--threshold THRESHOLD]
                                     [--tautomer-mode {all,stable,top1}]
                                     [--fragment FRAGMENT]
                                     [--max-combinations MAX_COMBINATIONS]
                                     [--output-format {csv,tsv}]
                                     [--num-workers NUM_WORKERS]
                                     input_path output_path

Example:

python predict_tautomers_parallel.py molecules.csv ranked_tautomers.csv \
    --input-format csv \
    --smiles-column smiles \
    --name-column name \
    --tautomer-mode stable \
    --num-workers 8
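
If no input file exists yet, a small CSV with the column names used in the example above can be written with the Python standard library. This is a minimal sketch: the two molecules, their names, and the `write_input_csv` helper are illustrative assumptions and are not part of the repository.

```python
import csv
from pathlib import Path

# Illustrative molecules with well-known tautomeric forms (not taken from the repository).
EXAMPLE_MOLECULES = [
    ("2-hydroxypyridine", "Oc1ccccn1"),
    ("acetylacetone", "CC(=O)CC(C)=O"),
]


def write_input_csv(path, molecules=EXAMPLE_MOLECULES):
    """Write a CSV with 'smiles' and 'name' header columns (hypothetical helper)."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "name"])
        for name, smiles in molecules:
            writer.writerow([smiles, name])
    return path
```

Calling `write_input_csv("molecules.csv")` yields a file whose header matches the `--smiles-column smiles` and `--name-column name` arguments shown in the example.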

For SDF input with fragment-based enumeration:

python predict_tautomers_parallel.py molecules.sdf ranked_tautomers.csv \
    --fragment true \
    --tautomer-mode stable \
    --num-workers 8

2. Two-stage pipeline for large libraries

For larger datasets, the repository provides a split workflow:

predict_stage1_preprocess.py cleans molecules, enumerates tautomers, builds graph objects, and stores them in a compressed pickle.
predict_stage2_predict.py loads the precomputed graphs and performs batched GPU inference.
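
The hand-off between the two stages is a compressed pickle. As a rough sketch of that storage pattern, assuming gzip compression (suggested by the `.pkl.gz` extension in the examples below) and a hypothetical list-of-dicts record layout that stands in for the real graph objects:

```python
import gzip
import os
import pickle
import tempfile


def save_records(records, path):
    """Serialize records to a gzip-compressed pickle file."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_records(path):
    """Load records written by save_records. Only unpickle files you trust."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Placeholder records; the actual pipeline stores graph objects instead.
    demo = [{"name": "mol-1", "taut_smiles": "Oc1ccccn1"}]
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.pkl.gz")
        save_records(demo, target)
        print(load_records(target) == demo)
```

Separating the CPU-bound preprocessing from inference in this way lets the first stage run with many workers while the second stage reads the stored graphs in large batches.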

Stage 1 example:

python predict_stage1_preprocess.py molecules.tsv molecules.pkl.gz \
    --input-format tsv \
    --smiles-column smiles \
    --name-column name \
    --num-workers 32

Stage 2 example:

python predict_stage2_predict.py molecules.pkl.gz ranked_tautomers.csv \
    --checkpoint-pattern checkpoints/best_model_state_dict.pt \
    --tautomer-mode stable \
    --threshold 0.43 \
    --batch-size 4096 \
    --num-workers 4
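
The `--tautomer-mode` and `--threshold` options control which scored candidates are written. Their exact semantics are defined in the scripts; the sketch below shows one plausible reading, assuming `all` keeps every candidate, `stable` keeps candidates whose probability is at or above the threshold, and `top1` keeps only the highest-scoring candidate. The `select_tautomers` function and the example scores are hypothetical.

```python
def select_tautomers(scored, mode="stable", threshold=0.43):
    """Filter (smiles, probability) pairs under the assumed mode semantics."""
    # Highest probability first, so 'top1' is simply the first element.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if mode == "all":
        return ranked
    if mode == "top1":
        return ranked[:1]
    if mode == "stable":
        return [pair for pair in ranked if pair[1] >= threshold]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Made-up scores for illustration only; not model output.
    candidates = [("Oc1ccccn1", 0.21), ("O=c1cccc[nH]1", 0.87)]
    print(select_tautomers(candidates, mode="stable"))
```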

3. Notebook demo

An interactive example is provided in predict_example.ipynb. The notebook loads a checkpoint, enumerates tautomers for a sample molecule, scores them, and visualizes the results with Mols2Grid.

4. Expected output

The main ranked output contains the following columns:

input_smiles
name
clean_smiles
taut_smiles
taut_prob
taut_rank

Code Organization

The current codebase is split between reusable library code and training utilities.

src/library_io.py handles CSV, TSV, and SDF parsing.
src/structure_ops.py contains molecule cleaning, tautomer enumeration, scoring, and candidate ranking helpers.
src/tautomer_model.py defines graph featurization, model loading, and prediction utilities.
src/combine_frag.py and src/cut_mol.py implement the optional fragment-based prediction path.
train/ contains training-data preprocessing and model-training scripts.

Training

Training and graph-generation utilities are stored under train/.

To prepare graph data:

python train/calc_graph.py valid_set_tautobase.csv

The SLURM helper script train/run.s shows the original cluster workflow for graph generation.

To train a model:

python train/train_cmodel.py \
    --train_path data/train_set_neutral.pkl \
    --valid_path data/valid_set_tautobase.pkl \
    --wandb_project atfp \
    --wandb_name taut \
    --checkpoint_dir checkpoints

The training script saves checkpoints in the target checkpoint directory and supports resuming with --resume or --resume_checkpoint. The corresponding cluster launcher is train/run_train.s.

Model Architecture

The current implementation uses an AttentiveFP molecular graph neural network implemented with PyTorch Geometric.

Atom features include atom type, degree, hybridization, implicit valence, aromaticity, ring membership, hydrogen-bond donor and acceptor flags, and formal charge.
Bond features include single, double, triple, aromatic, conjugation, and ring indicators.
The default AttentiveFP configuration in src/tautomer_model.py uses 29 input channels, 512 hidden channels, 8 message-passing layers, 3 timesteps, and 0.2 dropout.
Training uses an EMA-style averaged model, and the current inference scripts load a single exported state-dict checkpoint by default.
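
The EMA-style averaging mentioned above can be illustrated independently of PyTorch. The sketch below applies the standard exponential-moving-average update, `avg = decay * avg + (1 - decay) * value`, to a dictionary of scalar parameters. The `EmaTracker` class and the decay values are assumptions for illustration, not the settings or code used in train/.

```python
class EmaTracker:
    """Keep an exponential moving average of named scalar parameters."""

    def __init__(self, params, decay=0.999):
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        # Start the running average at the initial parameter values.
        self.shadow = dict(params)

    def update(self, params):
        """Blend the current parameters into the running average."""
        for name, value in params.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * value
            )


if __name__ == "__main__":
    weights = {"w": 0.0}
    ema = EmaTracker(weights, decay=0.9)
    for step in range(1, 4):
        weights["w"] = float(step)  # stand-in for an optimizer step
        ema.update(weights)
    print(ema.shadow["w"])
```

The averaged values change more smoothly than the raw parameters, which is why an averaged copy is commonly the one exported for inference.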

Notes

predict_tautomers_parallel.py uses checkpoints/best_model_state_dict.pt by default.
predict_stage2_predict.py is intended for large preprocessed libraries and currently defaults to the same exported state-dict checkpoint file.
run_predict.sh and run_two_stage.sh are lightweight local wrapper scripts that should be edited for your own input files and environment before use.
Shared library code is located in src/, while training utilities are grouped under train/.
The notebook is intended for interactive inspection and visualization rather than high-throughput prediction.
Training requires the additional dependencies listed under "Training-only extras" in Requirements, on top of the prediction stack.
