OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis
This repository contains the official source code and datasets for the paper "OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis".
The vast repository of published chemical reactions constitutes a rich but deeply flawed resource. Systemic publication bias and data heterogeneity have fostered widespread skepticism about its utility for predictive modeling. Here, we demonstrate that this imperfect historical record can be transformed into a robust and generalizable predictive engine for asymmetric catalysis. We introduce OrgAIcat, a machine learning framework built by systematically curating over 22,000 aminocatalytic reactions into the iSynth dataset and developing a computationally efficient, physically meaningful descriptor, R-SPOC. OrgAIcat predicts the enantioselectivity of Aldol and Michael reactions with high accuracy (R²>0.7, MAE<0.30 kcal/mol) and, crucially, demonstrates robust generalization. It successfully forecasts outcomes for reactions from newly published literature and, in the most stringent test, for a series of in-house catalysts in previously unreported catalytic applications, demonstrating that it has learned underlying structure-selectivity relationships rather than merely memorizing literature trends. The practical power of this data-driven approach was definitively validated by integrating OrgAIcat into a closed-loop workflow, which guided the optimization of a challenging Aldol reaction from 41% to 95% enantiomeric excess (ee) in just 12 experiments. This work establishes a validated methodology for converting historical literature into actionable intelligence, offering a powerful tool to accelerate catalyst discovery and reaction optimization.
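Because the accuracy above is reported in kcal/mol while experimental outcomes are quoted as ee, it can help to convert between the two. The sketch below uses the standard relation ΔΔG‡ = RT·ln[(1 + ee)/(1 − ee)], with ee as a fraction. The function names and the default temperature of 298.15 K are assumptions for illustration only; the exact conversion and temperatures used in the OrgAIcat code are not stated in this README.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ee_to_ddg(ee_percent, temperature_k=298.15):
    """Convert ee (%) to ddG (kcal/mol). The 298.15 K default is an assumption."""
    ee = ee_percent / 100.0
    if not -1.0 < ee < 1.0:
        raise ValueError("ee must lie strictly between -100% and 100%")
    # Enantiomeric ratio er = (1 + ee) / (1 - ee); ddG = RT * ln(er)
    return R_KCAL * temperature_k * math.log((1.0 + ee) / (1.0 - ee))


def ddg_to_ee(ddg_kcal, temperature_k=298.15):
    """Inverse conversion: ee = tanh(ddG / (2RT)), returned in percent."""
    return 100.0 * math.tanh(ddg_kcal / (2.0 * R_KCAL * temperature_k))


if __name__ == "__main__":
    for ee in (41, 95):
        print(f"{ee}% ee -> {ee_to_ddg(ee):.2f} kcal/mol")
```

Under these assumptions, 41% ee and 95% ee correspond to roughly 0.5 and 2.2 kcal/mol, which gives a sense of scale for the reported MAE below 0.30 kcal/mol.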
We recommend using Conda to manage the virtual environment and dependencies.
# Clone this repository
git clone https://github.com/deepsynthesis/orgaicat.git
cd orgaicat
# Create and activate the conda environment
conda env create -f environment.yml
conda activate orgaicat

Alternatively, create the environment manually and install the dependencies with pip:
# Clone this repository
git clone https://github.com/deepsynthesis/orgaicat.git
cd orgaicat
# Create and activate conda environment
conda create --name orgaicat python=3.12 -y
conda activate orgaicat
# Install dependencies
pip install -r requirements.txt

RxnFP requires Python 3.6 and should be installed in a separate environment:
conda create -n rxnfp python=3.6 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
pip install rxnfp openpyxl

orgaicat/
├── data/                           # Original input data (.xlsx format)
├── orgaicat/                       # Main source code directory
│   ├── 01_generate_descriptors/    # Scripts to generate molecular descriptors
│   ├── 02_run_benchmark/           # Benchmarking scripts and results
│   ├── 03_train_model/             # Model training and optimization
│   ├── 04_predict_validation/      # External validation and prediction
│   ├── 05_reaction_optimization/   # Using the prediction model to optimize new reactions
│   └── descriptors/                # Generated descriptor files (.csv format)
├── requirements.txt                # Python dependencies
├── environment.yml                 # Conda environment specification
├── LICENSE                         # MIT license
└── README.md                       # This file

For a quick demonstration of OrgAIcat's capabilities:
# 1. Generate descriptors for aldol reactions
python orgaicat/01_generate_descriptors/generate_r_spoc_descriptor_aldol.py
# 2. Train a model (uses pre-generated descriptors)
python orgaicat/03_train_model/train_extraTree_aldol_regression.py --test_mode
# 3. Make predictions on validation data
python orgaicat/04_predict_validation/evaluate_external_test_set_aldol.py

The --test_mode flag enables faster execution with reduced parameters for demonstration purposes.
The repository follows a sequential workflow. Run the scripts in numbered order to reproduce all results.
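The numbered order can also be scripted. The sketch below is a hypothetical driver, not part of the repository: `PIPELINE`, `build_commands`, and `run_pipeline` are illustrative names, and the script paths are the ones quoted in this README (each stage directory may contain further scripts). It assumes it is run from the repository root inside the activated `orgaicat` environment.

```python
import subprocess
import sys
from pathlib import Path

# Stage directories and scripts as quoted in this README (assumed to be one
# representative script per stage; other scripts are not listed here).
PIPELINE = [
    ("orgaicat/01_generate_descriptors", "generate_r_spoc_descriptor_aldol.py"),
    ("orgaicat/02_run_benchmark", "benchmark.py"),
    ("orgaicat/03_train_model", "train_extraTree_aldol_regression.py"),
    ("orgaicat/04_predict_validation", "evaluate_external_test_set_aldol.py"),
]


def build_commands(repo_root=".", python_exe=sys.executable):
    """Return one command (as an argument list) per stage, in numbered order."""
    root = Path(repo_root)
    return [[python_exe, str(root / folder / script)] for folder, script in PIPELINE]


def run_pipeline(repo_root=".", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(repo_root):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage
            subprocess.run(cmd, cwd=repo_root, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With `dry_run=True` the driver only prints the commands, which is a cheap way to confirm the order before committing to a full run.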
Generate molecular descriptors from the raw data. Results are saved in orgaicat/descriptors/.
# Example for Aldol reactions
python orgaicat/01_generate_descriptors/generate_r_spoc_descriptor_aldol.py
# Run other scripts in this directory as needed

Compare descriptors and ML algorithms. Results are saved in orgaicat/02_run_benchmark/benchmark_results/.
python orgaicat/02_run_benchmark/benchmark.py

Train ExtraTrees models with nested cross-validation.
# Aldol regression
python orgaicat/03_train_model/train_extraTree_aldol_regression.py
# Aldol classification
python orgaicat/03_train_model/train_extraTree_aldol_classification.py
# Michael models are trained in the same way

Evaluate on external validation sets.
python orgaicat/04_predict_validation/evaluate_external_test_set_aldol.py
python orgaicat/04_predict_validation/evaluate_external_test_set_michael.py

Use the prediction model to optimize a new reaction.
cd orgaicat/05_reaction_optimization
python step1_featurization.py
python step2_clustering.py
python step3_forward_prediction.py
python step4_commercial_selection.py

Acknowledgments:
- The iSynth dataset compilation and curation team
- Contributors to the RDKit and scikit-learn open-source projects
- The broader cheminformatics and machine learning communities
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code or data in your research, please cite our paper:
@article{orgaicat2025,
title={OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis},
author={[To be updated upon publication]},
journal={[To be updated upon publication]},
year={2025}
}