This repository contains the code for the publication CASyM: Chemotype Annotation Through Synthesis Mapping. The project introduces a synthesis-based approach to annotating chemotypes for drug discovery projects, creating a chemotype graph based on major common intermediates found in the project synthesis data.
This repository uses conda and Poetry environments. To install:
git clone https://github.com/aidd-msca/CASyM.git
cd CASyM
conda env create -f environment.yml
conda activate casym
poetry install
Given a collection of synthesis data, the package can be run from the command line:
python casym/main.py -cn config
The code assumes that the config.yaml file is stored in experiments, though this can be changed. For further details on the settings available in the config file, see the section Config Files.
The package expects a tab-separated file (.tsv) containing at least two columns titled "reactants" and "products". Additional columns are ignored unless named in the config file. The data can also be filtered by time, yield, and project if these are present in the reaction data and the filter is specified in the config file. Further columns can be passed through and used as attributes in the chemotype graph; these columns are listed in the additional_data setting.
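As a quick illustration of the input format described above, the following sketch writes a tiny TSV file and checks that its header contains the two required columns. It is not part of CASyM: the helper names `missing_columns` and `write_example`, the extra `yield` column, and the dot-separated SMILES values are all assumptions made for the example, and the exact format CASyM expects inside each column is not specified here.

```python
import csv

# The two column names the README says must be present.
REQUIRED_COLUMNS = {"reactants", "products"}


def missing_columns(tsv_path):
    """Return the required column names that are absent from the TSV header."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


def write_example(tsv_path):
    """Write a two-row example file (hypothetical data, for illustration only)."""
    fieldnames = ["reactants", "products", "yield"]
    rows = [
        {"reactants": "CCO.CC(=O)O", "products": "CC(=O)OCC", "yield": "85"},
        {"reactants": "Brc1ccccc1.OB(O)c1ccccc1",
         "products": "c1ccc(-c2ccccc2)cc1", "yield": "72"},
    ]
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_example("reactions.tsv")` followed by `missing_columns("reactions.tsv")` returns an empty set when the file is usable as input; any names in the returned set are columns that still need to be added.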
The config file contains the following settings:
- reaction_file: File path to reaction data
- targetmolecules: Settings and information regarding target molecules, otherwise null
  - targetmolecules_fp: File path to target molecules
  - smiles_col: Name of the column containing SMILES in targetmolecules_fp
  - project_col: Name of the column containing the project in targetmolecules_fp; use null if all compounds in the file are relevant
  - time_col: Name of the column containing time information in targetmolecules_fp; use null if not used
- project_col: Name of the column containing the project in reaction_file; use null if all compounds in the file are relevant
- filter_time: Settings for filtering reaction data by time, otherwise null
  - min_time: Start date for reaction data used
  - max_time: End date for reaction data used
- assign_compounds_settings:
  - maximum_similarity: Whether to assign target molecules to chemotypes by maximum similarity or by minimizing the score
- chemotype_steps: Number of reaction steps to link major common intermediates as a single chemotype
- common_intermediate_min_connections: Minimum number of associated target molecules for a common intermediate to be considered a major common intermediate
- common_intermediate_max_connections: Maximum number of associated target molecules used when deciding whether a common intermediate is a major common intermediate; above this threshold the common intermediate is always considered major
- projects: Project(s) to analyze
- store_root: File path to store results, otherwise null
- similarity_threshold: Minimum proportion of maximum common substructure to link major common intermediates under one chemotype
- additional_data: Columns of data in reaction_file to store in synthesis graphs and chemotype graphs, otherwise null
- filter_yield: Settings to filter reaction data by yield, otherwise null
  - yield_col: Name of column containing yield data in reaction_file
  - min_yield: Minimum % yield to consider reaction for processing
- create_report: Whether to create markdown report summarizing run
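To make the filter_yield and filter_time settings more concrete, here is a minimal sketch of how such filters could be applied to reaction rows. It is one plausible reading of the settings, not CASyM's actual implementation: the nested dictionary layout, the inclusive bounds, the `keep_reaction` helper, and the `date` column name are all assumptions (the settings above do not name a time column for the reaction file).

```python
from datetime import date

# Hypothetical in-memory mirror of two config sections; the nesting is assumed.
config = {
    "filter_yield": {"yield_col": "yield", "min_yield": 20},
    "filter_time": {"min_time": date(2020, 1, 1), "max_time": date(2021, 12, 31)},
}


def keep_reaction(row, config, time_col="date"):
    """Return True if the row passes every filter that is enabled (not None)."""
    yield_cfg = config.get("filter_yield")
    if yield_cfg is not None:
        # Drop reactions whose % yield is below the configured minimum.
        if float(row[yield_cfg["yield_col"]]) < yield_cfg["min_yield"]:
            return False
    time_cfg = config.get("filter_time")
    if time_cfg is not None:
        # Keep only reactions dated within [min_time, max_time] (inclusive, assumed).
        when = date.fromisoformat(row[time_col])
        if not (time_cfg["min_time"] <= when <= time_cfg["max_time"]):
            return False
    return True


# Illustrative rows only; not real project data.
reactions = [
    {"reactants": "CCO.CC(=O)O", "products": "CC(=O)OCC", "yield": "85", "date": "2020-06-01"},
    {"reactants": "CCO.CC(=O)O", "products": "CC(=O)OCC", "yield": "10", "date": "2020-07-15"},
    {"reactants": "CCO.CC(=O)O", "products": "CC(=O)OCC", "yield": "60", "date": "2019-03-02"},
]
kept = [row for row in reactions if keep_reaction(row, config)]
```

With these example values, only the first row is kept: the second falls below min_yield and the third predates min_time. Setting either section to None in the dictionary disables that filter, mirroring the "otherwise null" convention in the list above.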