Tautomerism plays a central role in molecular recognition, physicochemical properties, and chemical reactivity, yet rapid identification of stable tautomeric states remains a persistent challenge in molecular design and structure-based drug discovery. Quantum-mechanical approaches are often too computationally demanding for library-scale application, while rule-based and 3D-dependent machine-learning methods remain limited in either transferability or throughput. Here, we show that experimentally resolved hydrogen positions in the Cambridge Structural Database (CSD) provide a powerful large-scale source of supervision for learning tautomer stability. Using more than 1.1 million tautomeric states derived from proton-resolved crystal structures, we trained a 2D graph neural network that predicts stable tautomeric states directly from molecular topology, without conformer generation or quantum calculations. Applied to 5,075 PDBBind ligands with multiple tautomeric states, the model identified 126 cases in which the assigned tautomeric state is likely incorrect; in each case, the reassigned stable tautomer exhibited improved hydrogen-bonding patterns together with fewer unsatisfied polar atoms. We further developed an open-source workflow, Tautomer-Predictor, capable of processing the 4.6-million-compound Enamine collection in about 3.2 h on a single GPU-enabled node. Together, these results establish crystallographic proton placement as a rich and underexploited source of chemical knowledge for learning tautomer stability, and provide a practical route to large-scale tautomer assignment for molecular discovery.
Free webserver: https://huggingface.co/spaces/panceler/Tautomer-Predictor-Web
Core prediction dependencies:
- Python 3.11.8
- numpy 1.26.4
- pandas 2.2.1
- RDKit 2025.9.3
- PyTorch 2.2.0
- PyTorch Geometric 2.4.0
- torch-scatter 2.1.2
- torch-sparse 0.6.18
- torch-cluster 1.6.3
- torch-spline-conv 1.2.2
- mols2grid
Training-only extras:
- pytorch-lightning 2.5.5
- torchmetrics 1.4.0
- scikit-learn 1.4.1
The main prediction entrypoint is predict_tautomers_parallel.py. It reads CSV, TSV, or SDF input, cleans molecules, enumerates candidate tautomers, converts them to PyTorch Geometric graphs, scores them with the exported model in checkpoints/best_model_state_dict.pt, and writes ranked output.
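How the auto setting of --input-format picks a parser is not described here. The sketch below shows one plausible approach, inferring the format from the file extension. The helper name resolve_input_format and the extension table are hypothetical and are not taken from the repository.

```python
from pathlib import Path

# Assumed mapping from file extension to parser name (illustrative only).
_EXTENSION_TO_FORMAT = {".csv": "csv", ".tsv": "tsv", ".sdf": "sdf"}


def resolve_input_format(path: str, requested: str = "auto") -> str:
    """Hypothetical helper: turn 'auto' into a concrete input format."""
    if requested != "auto":
        # An explicit choice always wins over the file extension.
        return requested
    suffix = Path(path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"cannot infer input format from {path!r}") from None


detected = resolve_input_format("molecules.SDF")
```

Raising on an unknown extension, rather than silently falling back to CSV, keeps a mislabeled file from being parsed with the wrong reader.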
usage: predict_tautomers_parallel.py [-h] [--input-format {auto,csv,tsv,sdf}]
[--smiles-column SMILES_COLUMN]
[--name-column NAME_COLUMN]
[--has-header {auto,true,false}]
[--sdf-name-property SDF_NAME_PROPERTY]
[--encoding ENCODING]
[--checkpoint-pattern CHECKPOINT_PATTERN]
[--threshold THRESHOLD]
[--tautomer-mode {all,stable,top1}]
[--fragment FRAGMENT]
[--max-combinations MAX_COMBINATIONS]
[--output-format {csv,tsv}]
[--num-workers NUM_WORKERS]
input_path output_path
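The usage text does not spell out how --tautomer-mode and --threshold interact. The following is a minimal sketch of one plausible reading, not the repository's implementation: all keeps every candidate, stable keeps candidates whose score meets the threshold, and top1 keeps only the highest-scoring candidate. The function name select_tautomers and the example scores are assumptions made for illustration.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (tautomer SMILES, predicted probability)


def select_tautomers(
    candidates: List[Candidate], mode: str, threshold: float
) -> List[Candidate]:
    """Hypothetical helper: rank candidates, then filter them by mode."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if mode == "all":
        return ranked
    if mode == "stable":
        return [c for c in ranked if c[1] >= threshold]
    if mode == "top1":
        return ranked[:1]
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up scores for the 2-hydroxypyridine / 2-pyridone pair.
example = [("Oc1ccccn1", 0.12), ("O=c1cccc[nH]1", 0.91)]
stable = select_tautomers(example, "stable", threshold=0.43)
```

Under this reading, stable can return zero, one, or several tautomers per molecule, whereas top1 returns exactly one whenever any candidate exists.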
Example:
python predict_tautomers_parallel.py molecules.csv ranked_tautomers.csv \
--input-format csv \
--smiles-column smiles \
--name-column name \
--tautomer-mode stable \
--num-workers 8
For SDF input with fragment-based enumeration:
python predict_tautomers_parallel.py molecules.sdf ranked_tautomers.csv \
--fragment true \
--tautomer-mode stable \
--num-workers 8
For larger datasets, the repository provides a split workflow:
- predict_stage1_preprocess.py cleans molecules, enumerates tautomers, builds graph objects, and stores them in a compressed pickle.
- predict_stage2_predict.py loads the precomputed graphs and performs batched GPU inference.
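The hand-off between the two stages is a compressed pickle. The sketch below illustrates that round trip with the standard library only. The record layout (a list of dictionaries) is an assumption, since the real Stage 1 output holds PyTorch Geometric graph objects whose on-disk structure is not documented here.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in records; the actual file stores graph objects, not plain dicts.
records = [
    {"name": "mol_1", "taut_smiles": "O=c1cccc[nH]1"},
    {"name": "mol_1", "taut_smiles": "Oc1ccccn1"},
]

path = os.path.join(tempfile.mkdtemp(), "molecules.pkl.gz")

# Stage 1 side: serialize the preprocessed objects once, gzip-compressed.
with gzip.open(path, "wb") as fh:
    pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)

# Stage 2 side: load the precomputed objects without repeating preprocessing.
with gzip.open(path, "rb") as fh:
    loaded = pickle.load(fh)
```

Because unpickling can execute arbitrary code, a .pkl.gz file should only be loaded when it comes from a trusted source.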
Stage 1 example:
python predict_stage1_preprocess.py molecules.tsv molecules.pkl.gz \
--input-format tsv \
--smiles-column smiles \
--name-column name \
--num-workers 32
Stage 2 example:
python predict_stage2_predict.py molecules.pkl.gz ranked_tautomers.csv \
--checkpoint-pattern checkpoints/best_model_state_dict.pt \
--tautomer-mode stable \
--threshold 0.43 \
--batch-size 4096 \
--num-workers 4
An interactive example is provided in predict_example.ipynb. The notebook loads a checkpoint, enumerates tautomers for a sample molecule, scores them, and visualizes the results with Mols2Grid.
The main ranked output contains the following columns:
- input_smiles
- name
- clean_smiles
- taut_smiles
- taut_prob
- taut_rank
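As a sketch of how the ranked output might be consumed downstream, the snippet below picks the highest-probability tautomer for each molecule using only the standard library. The column names input_smiles, name, clean_smiles, taut_smiles, taut_prob, and taut_rank follow the output description above; the sample rows are made up, and grouping by name assumes that names are unique per input molecule.

```python
import csv
import io

# Made-up rows in the documented column layout (not real model output).
sample = io.StringIO(
    "input_smiles,name,clean_smiles,taut_smiles,taut_prob,taut_rank\n"
    "Oc1ccccn1,mol_1,Oc1ccccn1,O=c1cccc[nH]1,0.91,1\n"
    "Oc1ccccn1,mol_1,Oc1ccccn1,Oc1ccccn1,0.12,2\n"
)

best = {}
for row in csv.DictReader(sample):
    prob = float(row["taut_prob"])
    # Keep the highest-probability tautomer seen so far for each molecule name.
    if row["name"] not in best or prob > best[row["name"]][1]:
        best[row["name"]] = (row["taut_smiles"], prob)
```

Selecting on taut_prob rather than taut_rank avoids relying on how ranks are numbered. For a real file, replace the in-memory sample with open("ranked_tautomers.csv", newline="").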
The current codebase is split between reusable library code and training utilities.
- src/library_io.py handles CSV, TSV, and SDF parsing.
- src/structure_ops.py contains molecule cleaning, tautomer enumeration, scoring, and candidate ranking helpers.
- src/tautomer_model.py defines graph featurization, model loading, and prediction utilities.
- src/combine_frag.py and src/cut_mol.py implement the optional fragment-based prediction path.
- train/ contains training-data preprocessing and model-training scripts.
Training and graph-generation utilities are stored under train/.
To prepare graph data:
python train/calc_graph.py valid_set_tautobase.csv
The SLURM helper script train/run.s shows the original cluster workflow for graph generation.
To train a model:
python train/train_cmodel.py \
--train_path data/train_set_neutral.pkl \
--valid_path data/valid_set_tautobase.pkl \
--wandb_project atfp \
--wandb_name taut \
--checkpoint_dir checkpoints
The training script saves checkpoints in the target checkpoint directory and supports resuming with --resume or --resume_checkpoint. The corresponding cluster launcher is train/run_train.s.
The current implementation uses an AttentiveFP molecular graph neural network implemented with PyTorch Geometric.
- Atom features include atom type, degree, hybridization, implicit valence, aromaticity, ring membership, hydrogen-bond donor and acceptor flags, and formal charge.
- Bond features include single, double, triple, aromatic, conjugation, and ring indicators.
- The default AttentiveFP configuration in src/tautomer_model.py uses 29 input channels, 512 hidden channels, 8 message-passing layers, 3 timesteps, and 0.2 dropout.
- Training uses an EMA-style averaged model, and the current inference scripts load a single exported state-dict checkpoint by default.
- predict_tautomers_parallel.py uses checkpoints/best_model_state_dict.pt by default.
- predict_stage2_predict.py is intended for large preprocessed libraries and currently defaults to the same exported state-dict checkpoint file.
- run_predict.sh and run_two_stage.sh are lightweight local wrapper scripts that should be edited for your own input files and environment before use.
- Shared library code is located in src/, while training utilities are grouped under train/.
- The notebook is intended for interactive inspection and visualization rather than high-throughput prediction.
- Training requires additional dependencies beyond the prediction stack.
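For readers unfamiliar with the EMA-style averaging mentioned in the model notes above, the following sketch shows the basic update rule on toy scalar weights. It is a generic illustration of exponential moving averaging, not the training script's code. The function name ema_update and the decay of 0.5 are assumptions chosen so the arithmetic is easy to follow; the decay actually used in training is not stated here.

```python
from typing import Dict


def ema_update(
    avg: Dict[str, float], current: Dict[str, float], decay: float
) -> Dict[str, float]:
    """Blend the running average with the current weights, key by key."""
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}


# Toy scalar "weights"; a real model would apply the same rule per tensor.
averaged = {"w": 1.0}
for step_weights in ({"w": 2.0}, {"w": 3.0}):
    averaged = ema_update(averaged, step_weights, decay=0.5)
```

The averaged weights lag behind the raw ones, which smooths out step-to-step noise. This is why an averaged copy, rather than the last raw training state, is often the one exported for inference.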
