OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis

This repository contains the official source code and datasets for the paper "OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis".

Abstract

The vast repository of published chemical reactions constitutes a rich but deeply flawed resource. Systemic publication bias and data heterogeneity have fostered widespread skepticism about its utility for predictive modeling. Here, we demonstrate that this imperfect historical record can be transformed into a robust and generalizable predictive engine for asymmetric catalysis. We introduce OrgAIcat, a machine learning framework built by systematically curating over 22,000 aminocatalytic reactions into the iSynth dataset and developing a computationally efficient, physically meaningful descriptor, R-SPOC. OrgAIcat predicts the enantioselectivity of Aldol and Michael reactions with high accuracy (R²>0.7, MAE<0.30 kcal/mol) and, crucially, demonstrates robust generalization. It successfully forecasts outcomes for reactions from newly published literature and, in the most stringent test, for a series of in-house catalysts in previously unreported catalytic applications, demonstrating that it has learned underlying structure-selectivity relationships rather than merely memorizing literature trends. The practical power of this data-driven approach was definitively validated by integrating OrgAIcat into a closed-loop workflow, which guided the optimization of a challenging Aldol reaction from 41% to 95% enantiomeric excess (ee) in just 12 experiments. This work establishes a validated methodology for converting historical literature into actionable intelligence, offering a powerful tool to accelerate catalyst discovery and reaction optimization.
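The abstract quotes model error in kcal/mol but the optimization result in % ee. The two scales are commonly related through ΔΔG‡ = RT ln[(1 + ee)/(1 − ee)]. The sketch below illustrates that conversion; the function names and the 298.15 K default are assumptions made for illustration and are not taken from the repository or the paper.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ee_to_ddg(ee_percent, temperature_k=298.15):
    """Convert enantiomeric excess (%) to ddG (kcal/mol).

    Assumes |ee| < 100; the temperature default is an illustrative choice.
    """
    ee = ee_percent / 100.0
    return R_KCAL * temperature_k * math.log((1.0 + ee) / (1.0 - ee))


def ddg_to_ee(ddg_kcal, temperature_k=298.15):
    """Inverse conversion: ddG (kcal/mol) back to enantiomeric excess (%)."""
    return 100.0 * math.tanh(ddg_kcal / (2.0 * R_KCAL * temperature_k))


if __name__ == "__main__":
    for ee in (41.0, 95.0):
        print(f"{ee:.0f}% ee corresponds to {ee_to_ddg(ee):.2f} kcal/mol")
```

Under these assumptions, 41% and 95% ee correspond to about 0.52 and 2.17 kcal/mol, which shows why an error of a few tenths of a kcal/mol matters more at the high-ee end of the scale.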

Installation

We recommend using Conda to manage the virtual environment and dependencies.

Option 1: Using conda environment file (Recommended)

# Clone this repository
git clone https://github.com/deepsynthesis/orgaicat.git
cd orgaicat

# Create and activate the conda environment
conda env create -f environment.yml
conda activate orgaicat

Option 2: Manual installation

# Clone this repository
git clone https://github.com/deepsynthesis/orgaicat.git
cd orgaicat

# Create and activate conda environment
conda create --name orgaicat python=3.12 -y
conda activate orgaicat

# Install dependencies
pip install -r requirements.txt

Special Requirements for RxnFP Descriptors

RxnFP requires Python 3.6 and should be installed in a separate environment:

conda create -n rxnfp python=3.6 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
pip install rxnfp openpyxl

Repository Structure

orgaicat/
├── data/                          # Original input data (.xlsx format)
├── orgaicat/                      # Main source code directory
│   ├── 01_generate_descriptors/   # Scripts to generate molecular descriptors
│   ├── 02_run_benchmark/          # Benchmarking scripts and results
│   ├── 03_train_model/            # Model training and optimization
│   ├── 04_predict_validation/     # External validation and prediction
│   ├── 05_reaction_optimization/  # Using the prediction model to optimize new reactions
│   └── descriptors/               # Generated descriptor files (.csv format)
├── requirements.txt               # Python dependencies
├── environment.yml                # Conda environment specification
├── LICENSE                        # MIT license
└── README.md                      # This file

Quick Start

For a quick demonstration of OrgAIcat's capabilities:

# 1. Generate descriptors for aldol reactions
python orgaicat/01_generate_descriptors/generate_r_spoc_descriptor_aldol.py

# 2. Train a model (uses pre-generated descriptors)
python orgaicat/03_train_model/train_extraTree_aldol_regression.py --test_mode

# 3. Make predictions on validation data
python orgaicat/04_predict_validation/evaluate_external_test_set_aldol.py

The --test_mode flag enables faster execution with reduced parameters for demonstration purposes.

Reproducing Paper Results

The repository follows a sequential workflow. Run the scripts in numbered order to reproduce all results.
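As a convenience, the numbered order can be captured in a small driver script. This is a minimal sketch, not part of the repository: it lists only the Aldol scripts named in this README, the helper names `build_commands` and `run_all` are hypothetical, and it defaults to a dry run that prints the commands without executing them.

```python
import subprocess
import sys

# Aldol scripts named in this README, in workflow order (steps 1 to 4).
STEPS = [
    "orgaicat/01_generate_descriptors/generate_r_spoc_descriptor_aldol.py",
    "orgaicat/02_run_benchmark/benchmark.py",
    "orgaicat/03_train_model/train_extraTree_aldol_regression.py",
    "orgaicat/04_predict_validation/evaluate_external_test_set_aldol.py",
]


def build_commands(steps=STEPS, python=sys.executable):
    """Return one command (as an argument list) per script, in order."""
    return [[python, script] for script in steps]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


run_all(build_commands(), dry_run=True)
```

Run it from the repository root and set `dry_run=False` to execute the steps. Stopping at the first failure matters here because later steps read files written by earlier ones.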

Step 1: Generate Descriptors

Generate molecular descriptors from the raw data in data/. Results are saved in orgaicat/descriptors/.

# Example for Aldol reactions
python orgaicat/01_generate_descriptors/generate_r_spoc_descriptor_aldol.py

# Run other scripts in this directory as needed

Step 2: Benchmark Models

Compare descriptors and ML algorithms. Results are saved in orgaicat/02_run_benchmark/benchmark_results/.

python orgaicat/02_run_benchmark/benchmark.py

Step 3: Train Final Models

Train ExtraTrees models with nested cross-validation.

# Aldol regression
python orgaicat/03_train_model/train_extraTree_aldol_regression.py

# Aldol classification
python orgaicat/03_train_model/train_extraTree_aldol_classification.py

# Train the Michael models the same way, using the corresponding scripts in orgaicat/03_train_model/
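For readers unfamiliar with nested cross-validation, the idea is to tune hyperparameters in an inner loop while an outer loop estimates the error on data the tuning never saw. The sketch below shows the general pattern with scikit-learn's ExtraTreesRegressor on synthetic data. The data, the parameter grid, and the fold counts are illustrative assumptions and do not reflect the settings used in the training scripts.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for a descriptor matrix and ddG targets (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=80)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # estimates generalization

# Inner loop: grid search over a small, illustrative parameter grid.
search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    param_grid={"max_features": [0.5, 1.0], "min_samples_leaf": [1, 3]},
    cv=inner_cv,
    scoring="neg_mean_absolute_error",
)

# Outer loop: each outer fold refits the whole search on its training part.
outer_mae = -cross_val_score(
    search, X, y, cv=outer_cv, scoring="neg_mean_absolute_error"
)
print("MAE per outer fold:", np.round(outer_mae, 3))
print("Mean outer MAE:", round(float(outer_mae.mean()), 3))
```

Because the outer test folds never take part in the grid search, the mean outer MAE is a less optimistic estimate than the best score reported by a single grid search.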

Step 4: Validate Models

Evaluate the trained models on the external validation sets.

python orgaicat/04_predict_validation/evaluate_external_test_set_aldol.py
python orgaicat/04_predict_validation/evaluate_external_test_set_michael.py
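The abstract reports performance as R² and MAE. To check predicted values against observed ones independently, both metrics can be computed in a few lines. This is a minimal sketch using their standard definitions; the function names and the example numbers are made up for illustration, and how the evaluation scripts report their results is not shown in this README.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (e.g. ddG in kcal/mol), not results from the paper.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print("MAE:", mean_absolute_error(observed, predicted))
print("R2:", r2_score(observed, predicted))
```

Note that R² computed on an external set can be negative when the model predicts worse than the mean of the observed values, so it is worth reading alongside MAE rather than alone.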

Step 5: Reaction Optimization Workflow

cd orgaicat/05_reaction_optimization

python step1_featurization.py
python step2_clustering.py
python step3_forward_prediction.py
python step4_commercial_selection.py

Acknowledgments

The iSynth dataset compilation and curation team
Contributors to the RDKit and scikit-learn open-source projects
The broader cheminformatics and machine learning communities

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this code or data in your research, please cite our paper:

@article{orgaicat2025,
  title={OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis},
  author={[To be updated upon publication]},
  journal={[To be updated upon publication]},
  year={2025}
}
