OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis

License: MIT | Python 3.12

This repository contains the official source code and datasets for the paper "OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis".

Abstract

The vast repository of published chemical reactions constitutes a rich but deeply flawed resource. Systemic publication bias and data heterogeneity have fostered widespread skepticism about its utility for predictive modeling. Here, we demonstrate that this imperfect historical record can be transformed into a robust and generalizable predictive engine for asymmetric catalysis. We introduce OrgAIcat, a machine learning framework built by systematically curating over 22,000 aminocatalytic reactions into the iSynth dataset and developing a computationally efficient, physically meaningful descriptor, R-SPOC. OrgAIcat predicts the enantioselectivity of Aldol and Michael reactions with high accuracy (R²>0.7, MAE<0.30 kcal/mol) and, crucially, demonstrates robust generalization. It successfully forecasts outcomes for reactions from newly published literature and, in the most stringent test, for a series of in-house catalysts in previously unreported catalytic applications, demonstrating that it has learned underlying structure-selectivity relationships rather than merely memorizing literature trends. The practical power of this data-driven approach was definitively validated by integrating OrgAIcat into a closed-loop workflow, which guided the optimization of a challenging Aldol reaction from 41% to 95% enantiomeric excess (ee) in just 12 experiments. This work establishes a validated methodology for converting historical literature into actionable intelligence, offering a powerful tool to accelerate catalyst discovery and reaction optimization.
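
The abstract quotes model error in kcal/mol but the optimization result in % ee. The two scales are related by the standard expression ΔΔG‡ = RT ln((1 + ee)/(1 − ee)), with ee written as a fraction. The sketch below illustrates that conversion only. The function names and the default temperature of 298.15 K are assumptions made for illustration; this README does not state which temperature or sign convention the training scripts use.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ee_to_ddg(ee, temperature=298.15):
    """Convert a fractional ee (strictly between -1 and 1) to ddG in kcal/mol.

    The 298.15 K default is an assumption, not a value taken from the paper.
    """
    if not -1.0 < ee < 1.0:
        raise ValueError("ee must lie strictly between -1 and 1")
    return R_KCAL * temperature * math.log((1.0 + ee) / (1.0 - ee))


def ddg_to_ee(ddg, temperature=298.15):
    """Inverse conversion: ddG in kcal/mol back to a fractional ee."""
    return math.tanh(ddg / (2.0 * R_KCAL * temperature))


for ee in (0.41, 0.95):
    print(f"ee = {ee:.0%} -> ddG = {ee_to_ddg(ee):.2f} kcal/mol")
```

At 298.15 K, 41% ee corresponds to about 0.5 kcal/mol and 95% ee to about 2.2 kcal/mol, which gives a sense of scale for an MAE below 0.30 kcal/mol.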

Installation

We recommend using Conda to manage the virtual environment and dependencies.

Option 1: Using conda environment file (Recommended)

# Clone this repository
git clone https://github.com/deepsynthesis/orgaicat.git
cd orgaicat

# Create and activate the conda environment
conda env create -f environment.yml
conda activate orgaicat

Option 2: Manual installation

# Clone this repository
git clone https://github.com/deepsynthesis/orgaicat.git
cd orgaicat

# Create and activate conda environment
conda create --name orgaicat python=3.12 -y  
conda activate orgaicat

# Install dependencies
pip install -r requirements.txt
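
After either installation option, a quick check that the core packages import can catch a broken environment early. The snippet below is a minimal sketch, not part of the repository. The helper name `missing_packages` is hypothetical, and the list of import names is an assumption (RDKit and scikit-learn are named in the Acknowledgments; pandas and numpy are guesses), so adjust it to match requirements.txt.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    # Assumed import names; edit to match requirements.txt.
    print("Missing:", missing_packages(["rdkit", "sklearn", "pandas", "numpy"]))
```

An empty "Missing" list suggests the environment is ready for the scripts below.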

Special Requirements for RxnFP Descriptors

RxnFP requires Python 3.6 and should be installed in a separate environment:

conda create -n rxnfp python=3.6 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
pip install rxnfp openpyxl

Repository Structure

orgaicat/
├── data/                          # Original input data (.xlsx format)
├── orgaicat/                      # Main source code directory
│   ├── 01_generate_descriptors/   # Scripts to generate molecular descriptors
│   ├── 02_run_benchmark/          # Benchmarking scripts and results
│   ├── 03_train_model/            # Model training and optimization
│   ├── 04_predict_validation/     # External validation and prediction
│   ├── 05_reaction_optimization/  # Uses the prediction model to optimize new reactions
│   └── descriptors/               # Generated descriptor files (.csv format)
├── requirements.txt               # Python dependencies
├── environment.yml                # Conda environment specification
├── LICENSE                        # MIT license
└── README.md                      # This file

Quick Start

For a quick demonstration of OrgAIcat's capabilities:

# 1. Generate descriptors for aldol reactions
python orgaicat/01_generate_descriptors/generate_r_spoc_descriptor_aldol.py

# 2. Train a model (uses pre-generated descriptors)
python orgaicat/03_train_model/train_extraTree_aldol_regression.py --test_mode

# 3. Make predictions on validation data
python orgaicat/04_predict_validation/evaluate_external_test_set_aldol.py

The --test_mode flag enables faster execution with reduced parameters for demonstration purposes.

Reproducing Paper Results

The repository follows a sequential workflow. Run the scripts in numbered order to reproduce all results.

Step 1: Generate Descriptors

Generate molecular descriptors from the raw data in data/. Results are saved in orgaicat/descriptors/.

# Example for Aldol reactions
python orgaicat/01_generate_descriptors/generate_r_spoc_descriptor_aldol.py

# Run the other scripts in orgaicat/01_generate_descriptors/ as needed

Step 2: Benchmark Models

Compare descriptors and ML algorithms. Results are saved in orgaicat/02_run_benchmark/benchmark_results/.

python orgaicat/02_run_benchmark/benchmark.py
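
To illustrate what a descriptor-by-algorithm comparison looks like, here is a minimal sketch using scikit-learn on synthetic data. It is not the contents of benchmark.py: the descriptor names, the choice of models, the 5-fold split, and the MAE scoring are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 120

# Synthetic stand-ins for two descriptor sets describing the same reactions.
descriptors = {
    "descriptor_A": rng.normal(size=(n, 8)),
    "descriptor_B": rng.normal(size=(n, 16)),
}
# Synthetic target that depends on descriptor_A only.
X_a = descriptors["descriptor_A"]
y = X_a[:, 0] - 0.5 * X_a[:, 1] + rng.normal(scale=0.1, size=n)

# Hypothetical model shortlist; the real benchmark may use different ones.
models = {
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Ridge": Ridge(alpha=1.0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for d_name, X in descriptors.items():
    for m_name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv,
                                 scoring="neg_mean_absolute_error")
        results[(d_name, m_name)] = -scores.mean()  # flip sign back to MAE

# Print combinations from lowest to highest cross-validated MAE.
for (d_name, m_name), mae in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{d_name:14s} {m_name:14s} MAE = {mae:.3f}")
```

Using the same `KFold` object for every combination keeps the folds identical, so differences in MAE reflect the descriptor and model rather than the split.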

Step 3: Train Final Models

Train ExtraTrees models with nested cross-validation.

# Aldol regression
python orgaicat/03_train_model/train_extraTree_aldol_regression.py

# Aldol classification
python orgaicat/03_train_model/train_extraTree_aldol_classification.py

# Train the Michael models the same way, using the corresponding scripts in this directory
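
Nested cross-validation tunes hyperparameters in an inner loop and estimates generalization in an outer loop, so the reported score is not biased by the tuning. The following is a minimal sketch of that pattern with scikit-learn's `ExtraTreesRegressor` on synthetic data. The parameter grid, fold counts, and scoring choices are assumptions and are not taken from the training scripts.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(42)
# Synthetic descriptors and target standing in for the real data.
X = rng.normal(size=(150, 10))
y = 2.0 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=150)

# Hypothetical search space; the real scripts may tune other parameters.
param_grid = {"n_estimators": [50, 100], "max_features": [0.5, 1.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="neg_mean_absolute_error",
)

# Each outer fold refits the whole grid search on its own training portion.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print(f"Nested CV R2: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```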

Step 4: Validate Models

Evaluate on external validation sets.

python orgaicat/04_predict_validation/evaluate_external_test_set_aldol.py
python orgaicat/04_predict_validation/evaluate_external_test_set_michael.py
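
The abstract reports R² and MAE as the headline regression metrics. As a rough sketch of how such numbers can be computed for an external set with scikit-learn, assuming arrays of true and predicted values are already available (the helper name `regression_report` and the toy values are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score


def regression_report(y_true, y_pred):
    """Return R2 and MAE for paired true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
    }


# Toy values for illustration only; these are not results from the paper.
y_true = [0.5, 1.2, 2.0, 1.6]
y_pred = [0.7, 1.0, 1.9, 1.8]
report = regression_report(y_true, y_pred)
print(f"R2 = {report['r2']:.3f}, MAE = {report['mae']:.3f}")
```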

Step 5: Reaction Optimization Workflow

Use the prediction model to guide the optimization of a new reaction. Run the four scripts in order:

cd orgaicat/05_reaction_optimization

python step1_featurization.py
python step2_clustering.py
python step3_forward_prediction.py
python step4_commercial_selection.py
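
The four script names suggest a pipeline of featurization, clustering, forward prediction, and selection of commercially available candidates. The sketch below shows one way such a pipeline could fit together, inferred from the script names alone and run on synthetic data. The use of KMeans, the number of clusters, the `is_commercial` flag, and the "best predicted candidate per cluster" rule are all assumptions, not the repository's actual logic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(7)

# Step 1 (featurization): synthetic stand-ins for descriptor vectors.
X_train = rng.normal(size=(200, 6))
y_train = X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(scale=0.1, size=200)
X_candidates = rng.normal(size=(60, 6))
is_commercial = rng.random(60) < 0.5  # hypothetical availability flag

# Step 2 (clustering): group candidates so that picks cover diverse structures.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_candidates)

# Step 3 (forward prediction): score every candidate with a trained model.
model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
predicted = model.predict(X_candidates)

# Step 4 (selection): best predicted commercially available candidate per cluster.
selected = []
for cluster in np.unique(labels):
    idx = np.where((labels == cluster) & is_commercial)[0]
    if idx.size:
        selected.append(int(idx[np.argmax(predicted[idx])]))

print("Selected candidate indices:", selected)
```

Picking one candidate per cluster trades a little predicted performance for structural diversity, which is a common choice when only a handful of experiments can be run.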

Acknowledgments

  • The iSynth dataset compilation and curation team
  • Contributors to the RDKit and scikit-learn open-source projects
  • The broader cheminformatics and machine learning communities

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this code or data in your research, please cite our paper:

@article{orgaicat2025,
  title={OrgAIcat: Leveraging Literature Data for Enantioselectivity Prediction and Optimization in Organocatalysis},
  author={[To be updated upon publication]},
  journal={[To be updated upon publication]},
  year={2025}
}
