Updated training pipelines for the paper Molecular Fingerprints Are a Simple Yet Effective Solution to the Drug–Drug Interaction Problem.
- Modern PyTorch Lightning workflows live under GPU/ with W&B integration and Bayesian sweeps, now using the unified lightning.pytorch API.
- TPU-ready TensorFlow GNN pipeline under TPU/ for converting the dataset and running on modern TPUs, updated for TensorFlow 2.15 and TF-GNN 1.0.3.
- Symmetric fingerprint fusion for the baseline models on both PyTorch and TPU stacks, combining union/intersection/exclusive fingerprints and post-encoder interactions that remain invariant to swapping the drug order.
- Reproducible environments via the provided pyproject.toml and Dockerfile.
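As a rough illustration of why union/intersection/exclusive fusion is invariant to drug order, here is a minimal pure-Python sketch. The function name `symmetric_fusion` and the toy 8-bit fingerprints are assumptions made for this example and are not taken from the repository, whose actual implementation may differ (for instance by operating on tensors).

```python
def symmetric_fusion(fp_a, fp_b):
    """Fuse two binary fingerprints into one order-invariant feature list.

    Hypothetical helper for illustration only; not the repository's API.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same bit-length")
    union = [a | b for a, b in zip(fp_a, fp_b)]          # bits set in either drug
    intersection = [a & b for a, b in zip(fp_a, fp_b)]   # bits set in both drugs
    exclusive = [a ^ b for a, b in zip(fp_a, fp_b)]      # bits set in exactly one drug
    return union + intersection + exclusive


# Toy 8-bit fingerprints (real Morgan fingerprints are much longer).
drug_a = [1, 0, 1, 0, 0, 1, 0, 0]
drug_b = [1, 1, 0, 0, 0, 1, 0, 1]

fused_ab = symmetric_fusion(drug_a, drug_b)
fused_ba = symmetric_fusion(drug_b, drug_a)

# OR, AND, and XOR are all commutative, so swapping the inputs changes nothing.
assert fused_ab == fused_ba
```

Because each of the three operations is commutative, a model fed the fused vector sees the same input for the pair (A, B) as for (B, A).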
Install dependencies with Poetry (Python 3.10 through 3.12 are supported with the current TensorFlow stack):
poetry install
Train a graph model with PyTorch Lightning and log to Weights & Biases:
python -m GPU.train --config GPU/configs/graph.yaml --run-name dev-run
To use the Bayesian sweep configuration:
wandb sweep GPU/sweeps/graph_bayesian.yaml
wandb agent <entity/project>/<sweep_id>
You can also launch the sweep programmatically:
python GPU/sweeps/run_graph_sweep.py --entity <your-entity>
The sweep explores optimiser settings alongside the Morgan fingerprint radius and bit-length so the data pipeline stays in sync with the model hyperparameters.
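To make the coupling between optimiser and fingerprint settings concrete, the sketch below shows what a W&B Bayesian sweep definition of this kind might look like as a Python dictionary. The parameter names (`learning_rate`, `weight_decay`, `fp_radius`, `fp_bits`), their ranges, the metric name `val_auroc`, and the helper `data_settings` are all assumptions for illustration; the actual contents of GPU/sweeps/graph_bayesian.yaml may differ.

```python
# Hypothetical sweep definition; keys under "parameters" are illustrative only.
sweep_config = {
    "method": "bayes",  # W&B's Bayesian optimisation strategy
    "metric": {"name": "val_auroc", "goal": "maximize"},  # assumed metric name
    "parameters": {
        # Optimiser settings
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "weight_decay": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-3},
        # Data settings sampled in the same trial as the optimiser settings
        "fp_radius": {"values": [1, 2, 3]},
        "fp_bits": {"values": [512, 1024, 2048]},
    },
}


def data_settings(run_config):
    """Extract the fingerprint settings the data pipeline needs for one trial."""
    return {"radius": run_config["fp_radius"], "n_bits": run_config["fp_bits"]}


# With wandb installed, registering and running the sweep would look roughly like:
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="<your-project>")
#   wandb.agent(sweep_id, function=train_fn)  # train_fn is a placeholder
```

Sampling the fingerprint radius and bit-length in the same trial as the optimiser settings means each run can rebuild its features from the sampled values before the model is constructed.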
To tune the fingerprint models, dedicated sweeps cover each gradient-boosting estimator:
# CatBoost search across depth, learning-rate, iterations, bagging temperature, and regularisation strength.
wandb sweep GBDT/sweeps/fp_catboost_bayesian.yaml
# LightGBM search for tree shape, learning-rate, sampling ratios, and L1/L2 penalties.
wandb sweep GBDT/sweeps/fp_lightgbm_bayesian.yaml
# XGBoost search over depth, shrinkage, sampling, and both L1/L2 regularisation.
wandb sweep GBDT/sweeps/fp_xgboost_bayesian.yaml
Each configuration keeps the fingerprint radius/bit-length coupled with the estimator-specific hyperparameters so Bayesian optimisation can explore compatible data/feature settings for the selected model (--model is fixed by the sweep command).
- Export the PyTorch Geometric dataset to NumPy archives compatible with TF-GNN:
  python TPU/preprocess_to_npz.py --output-dir tf_dataset
- Train the TF-GNN model (runs on CPU/GPU by default; pass --tpu to target a TPU):
  python TPU/train_tf.py --dataset tf_dataset --model fp_graph --epochs 50 --batch-size 128 --tpu your-tpu-name
  The trainer now validates that --batch-size is a multiple of 64, matching Google’s TPU performance guidelines; 128 is the default for balanced per-core workloads.
  Use --model to mirror the PyTorch experiments exactly: fp_mlp (fingerprint MLP), graph (graph-only encoder), fp_graph (combined encoder), or ssiddi. All models share the same fusion modes, decoder widths, and metric suite as their Lightning counterparts, and additional knobs like --fusion, --final-concat, --gnn-layer, and --top-k match the PyTorch configuration options.
- Run Bayesian optimisation to tune the TensorFlow hyperparameters with W&B sweeps:
  python TPU/tune_tf.py --dataset tf_dataset --model fp_graph --wandb-project your-project --max-trials 40 --epochs 60
  The CLI launches a W&B Bayesian sweep that samples the encoder width, depth, dropout, activations, attention heads, decoder size, optimiser learning rate, and the fingerprint radius/bit-length. Provide --raw-data-dir if you want the tuner to regenerate datasets for unseen fingerprint settings on the fly. Every trial logs metrics, artefacts, and the saved model to W&B; the best run is also exported locally under tpu_tuning/ by default.
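The batch-size check mentioned in the training step could be implemented in several ways; one minimal sketch uses an argparse type function. The names `batch_size_type`, `build_parser`, and `TPU_BATCH_MULTIPLE` are hypothetical, and the real validation in TPU/train_tf.py may be structured differently.

```python
import argparse

TPU_BATCH_MULTIPLE = 64  # multiple stated in the training step above


def batch_size_type(value: str) -> int:
    """Parse a batch size and reject values that are not positive multiples of 64."""
    size = int(value)
    if size <= 0 or size % TPU_BATCH_MULTIPLE != 0:
        raise argparse.ArgumentTypeError(
            f"--batch-size must be a positive multiple of {TPU_BATCH_MULTIPLE}, got {size}"
        )
    return size


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the flag relevant to this sketch."""
    parser = argparse.ArgumentParser(description="Batch-size validation sketch")
    parser.add_argument("--batch-size", type=batch_size_type, default=128)
    return parser


# An explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--batch-size", "256"])
print(args.batch_size)
```

Validating at parse time means an invalid value such as 100 fails with a clear usage error before any dataset loading or TPU initialisation begins.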
Build and run the containerised environment:
docker build -t ddi-fp-graph .
docker run --gpus all -it --rm \
-v $(pwd):/workspace ddi-fp-graph \
--config GPU/configs/graph.yaml
The container entrypoint points to python -m GPU.train, so any CLI flags appended to the docker run command are passed straight through to the trainer.