A reproducible QSAR modeling workflow combining RDKit molecular descriptors, MACCS fingerprints, outlier detection, applicability domain analysis, and a Keras-based neural network for regression tasks.
This repository contains a cleaned and fixed version of an original QSAR script, with improved stability, validation, and error handling.
Author: Divya Karade
Model type: Regression (Neural Network)
Use case: Binding affinity / docking score prediction
Descriptors: RDKit 2D descriptors + MACCS keys
Outlier detection: PCA + Isolation Forest
Applicability Domain (AD): Descriptor range-based check
Frameworks: RDKit, scikit-learn, TensorFlow / Keras
Data Loading
- Reads molecular data from a CSV file (SpikeRBD_DD.csv)
- Requires SMILES strings and a numeric target value
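A minimal sketch of the loading step, assuming pandas and the column names `smiles` and `DockingScore` listed under the required columns. The helper name `load_dataset` and the decision to drop rows with a non-numeric target are illustrative assumptions, not necessarily what the script does.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ("smiles", "DockingScore")

def load_dataset(source):
    """Read the CSV and keep rows that have a SMILES string and a numeric target."""
    df = pd.read_csv(source, skipinitialspace=True)
    df.columns = df.columns.str.strip()
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required column(s): {missing}")
    # Coerce the target to float; entries that cannot be parsed become NaN.
    df["DockingScore"] = pd.to_numeric(df["DockingScore"], errors="coerce")
    # Assumption: rows without a usable SMILES or target are dropped.
    return df.dropna(subset=["smiles", "DockingScore"]).reset_index(drop=True)

# In-memory stand-in for SpikeRBD_DD.csv (hypothetical rows)
sample = io.StringIO("smiles,DockingScore\nCCO,-6.5\nCCN,-7.1\nCCC,not_a_number\n")
data = load_dataset(sample)
```

In practice the call would be `load_dataset("SpikeRBD_DD.csv")`.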
Descriptor Generation
- RDKit 2D molecular descriptors
- MACCS fingerprints (167 bits)
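The descriptor step could look roughly like the sketch below. It assumes the full `Descriptors.descList` set is used for the 2D descriptors; the script may use a subset. The function name `featurize` is hypothetical.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys

def featurize(smiles):
    """Return RDKit 2D descriptors followed by the 167 MACCS bits, or None if the SMILES is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Descriptors.descList is a list of (name, function) pairs.
    descriptors = [func(mol) for _, func in Descriptors.descList]
    # GenMACCSKeys returns a 167-bit vector; iterating yields 0/1 integers.
    maccs = list(MACCSkeys.GenMACCSKeys(mol))
    return np.array(descriptors + maccs, dtype=float)

features = featurize("CCO")  # ethanol
```

Some descriptors can return NaN or very large values for certain molecules, which is why a cleaning step follows.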
Feature Cleaning
- Removes NaN and infinite values
- Clips extreme descriptor values
- Ensures numeric consistency
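One way to implement the cleaning step is sketched here with NumPy. Two points are assumptions: rows containing NaN or infinite values are dropped (the script may impute instead), and the clipping bound `clip_value=1e6` is an arbitrary illustrative threshold.

```python
import numpy as np

def clean_features(X, clip_value=1e6):
    """Drop rows with NaN/inf entries, then clip the rest to [-clip_value, clip_value]."""
    X = np.asarray(X, dtype=np.float64)
    keep = np.isfinite(X).all(axis=1)  # True for rows with only finite values
    X_clean = np.clip(X[keep], -clip_value, clip_value)
    return X_clean, keep

raw = np.array([[1.0, 2.0], [np.nan, 3.0], [np.inf, 4.0], [5e9, -5e9]])
X_clean, keep = clean_features(raw)
```

Returning the `keep` mask allows the target vector to be filtered with the same rows, for example `y = y[keep]`.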
Train/Test Split
- 70/30 split with fixed random seed
- Standard scaling applied to features
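A sketch of the split and scaling with scikit-learn. The seed value `42` is an assumption (the text only says the seed is fixed), and the random arrays stand in for the real descriptor matrix. Fitting the scaler on the training portion only is the usual way to avoid leaking test statistics; whether the script does exactly this is not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for descriptors (X) and docking scores (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# 70/30 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit the scaler on the training set, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```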
Outlier Detection
- PCA (2 components) for dimensionality reduction
- Isolation Forest to remove anomalous samples
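The outlier filter might be sketched as follows, assuming the Isolation Forest is fitted on the two PCA components of the scaled training features and that flagged rows are removed from the training set only. The 10% contamination rate comes from this README; the seed and the placeholder data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Placeholder for the scaled training features and targets.
rng = np.random.default_rng(0)
X_train_s = rng.normal(size=(200, 10))
y_train = rng.normal(size=200)

# Project onto 2 principal components.
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_train_s)

# fit_predict returns +1 for inliers and -1 for outliers.
iso = IsolationForest(contamination=0.10, random_state=42)
labels = iso.fit_predict(X_pca)
inliers = labels == 1

# Keep only the inlier rows for model training.
X_train_f, y_train_f = X_train_s[inliers], y_train[inliers]
```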
Applicability Domain (AD)
- Checks whether input molecule descriptors fall within the min–max range of the training set
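A range-based check of this kind can be written in a few lines of NumPy. The function names `fit_domain` and `in_domain` are hypothetical, and the sketch assumes a molecule counts as inside the domain only if every descriptor lies within the training range.

```python
import numpy as np

def fit_domain(X_train):
    """Record the per-descriptor minimum and maximum of the training set."""
    return X_train.min(axis=0), X_train.max(axis=0)

def in_domain(x, lower, upper):
    """Return one boolean per molecule: True if all descriptors are within range."""
    x = np.atleast_2d(x)
    return np.all((x >= lower) & (x <= upper), axis=1)

X_train = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
lower, upper = fit_domain(X_train)

# First query is inside both ranges; second exceeds the first descriptor's max.
status = in_domain(np.array([[1.5, 15.0], [3.0, 15.0]]), lower, upper)
```

With many descriptors, a strict all-in-range rule can flag most new molecules; a tolerance on the fraction of out-of-range descriptors is a common relaxation, though nothing here indicates the script uses one.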
Model Training
- Fully connected neural network (Keras)
- Early stopping based on validation R²
- Custom metrics: RMSE and R²
Evaluation
- MAE, MSE, RMSE, and R² on training and test sets
Prediction
- Predicts binding affinity for new SMILES
- Outputs applicability domain status
Required columns:
- smiles: SMILES representation of the molecule
- DockingScore: Target regression value
Example:
smiles,DockingScore
CCO,-6.5
CCN,-7.1
New molecules can be provided as a Python list:
smiles = ["CCO"] # Example: ethanol
The script performs the following steps:
- Validates SMILES strings
- Generates RDKit 2D descriptors and MACCS fingerprints
- Checks applicability domain
- Predicts binding affinity using the trained neural network
Input Layer: Descriptor dimension
Dense Layer: 600 neurons (ReLU)
Dense Layer: 100 neurons (ReLU)
Dense Layer: 100 neurons (ReLU)
Output Layer: 1 neuron (Linear)
- Loss: Mean Squared Error (MSE)
- Optimizer: Adam
- Batch size: 400
- Epochs: Up to 200 (early stopping enabled)
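The architecture and training settings above could be expressed in Keras roughly as follows. The layer sizes, loss, optimizer, batch size, and epoch cap come from this README; the metric implementations, the `patience` value, and the `validation_split` in the commented call are assumptions.

```python
import tensorflow as tf
from tensorflow import keras

def _flatten(y_true, y_pred):
    """Bring targets and predictions to matching 1-D float tensors."""
    y_pred = tf.reshape(y_pred, [-1])
    y_true = tf.cast(tf.reshape(y_true, [-1]), y_pred.dtype)
    return y_true, y_pred

def rmse(y_true, y_pred):
    y_true, y_pred = _flatten(y_true, y_pred)
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

def r2(y_true, y_pred):
    y_true, y_pred = _flatten(y_true, y_pred)
    ss_res = tf.reduce_sum(tf.square(y_true - y_pred))
    ss_tot = tf.reduce_sum(tf.square(y_true - tf.reduce_mean(y_true)))
    return 1.0 - ss_res / (ss_tot + 1e-7)  # small constant avoids division by zero

def build_model(n_features):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(600, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=[rmse, r2])
    return model

# Stop when validation R² stops improving (patience=20 is an assumed value).
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_r2", mode="max", patience=20, restore_best_weights=True
)

model = build_model(n_features=32)  # 32 is a placeholder descriptor dimension
# model.fit(X_train, y_train, validation_split=0.2, epochs=200,
#           batch_size=400, callbacks=[early_stop])
```

Metrics passed to `compile` as functions are computed per batch and averaged, so the logged R² is an approximation of the full-set value reported in the evaluation step.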
The following metrics are reported for both training and test sets:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Coefficient of Determination (R²)
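These four metrics can be computed with scikit-learn as in the sketch below. The helper name `regression_report` and the sample values are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "R2": r2_score(y_true, y_pred),
    }

# Hypothetical docking scores and predictions
y_true = np.array([-6.5, -7.1, -5.9, -8.0])
y_pred = np.array([-6.4, -7.0, -6.1, -7.8])
report = regression_report(y_true, y_pred)
```

Calling the helper once on the training predictions and once on the test predictions gives the two reports described above.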
- Python ≥ 3.8
- RDKit
- TensorFlow / Keras
- scikit-learn
- pandas
- numpy
- matplotlib
pip install numpy pandas scikit-learn tensorflow matplotlib
conda install -c conda-forge rdkit
- Applicability domain is based on descriptor range checks, not distance-based confidence metrics.
- Isolation Forest contamination rate is fixed at 10%.
- Designed for QSAR regression, not classification.
- Best used with chemically consistent datasets.
This code is provided for research and educational purposes.
Please cite appropriately if used in academic or industrial work.
Divya Karade
Cheminformatician | QSAR | AI/ML