ReactionSeek automates the multi-modal extraction of chemical data from scientific literature. It employs a hybrid architecture that combines the contextual understanding of LLMs with the chemical precision of established cheminformatics tools. ReactionSeek utilizes a domain-specific prompt engineering strategy, enabling robust and accurate data mining without the need for resource-intensive model fine-tuning.
- Automatically collecting reaction data
- SynChat
- License
- Contact
Clone the repository:

```shell
git clone https://github.com/DeepSynthesis/ReactionSeek.git
```

Then create a conda environment and install all the dependencies:

```shell
cd ReactionSeek
conda create -n ReactionSeek python=3.12.0
conda activate ReactionSeek
pip install -r requirements.txt
```
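The extraction step described in the next section reads a JSON file of reaction procedures. Such a file can also be assembled programmatically; a minimal sketch (the file name, entry key, and texts are illustrative — only the `Title` and `Procedure` fields are required):

```python
import json

# Illustrative only: assemble an input file for the extraction step from raw
# procedure texts. The file name, entry key, and texts below are hypothetical.
procedures = {
    "volume96article1": {
        "Title": "Example reaction title.",
        "Procedure": "A solution of the substrate in dry ether was treated "
                     "with sodium ethoxide and stirred for 2 h.",
    },
}

# Write the file in the format the extraction script expects.
with open("Volume96-100.json", "w", encoding="utf-8") as f:
    json.dump(procedures, f, indent=4, ensure_ascii=False)
```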
The `extract_gpt.py` script extracts reaction data through an OpenAI-compatible API. Its input should be a JSON file in which each entry contains at least a `Title` and a `Procedure` field, for example:

```json
{
    "volume96article1": {
        "Title": "Title information of the reaction procedure.",
        "Procedure": "The reaction of 1,2-dimethoxyethane with sodium ethoxide in dry ether is an elimination reaction to form ethene and methoxide ion."
    }
}
```

After preparing your input files, edit this part of the script:
```python
if __name__ == '__main__':
    openai.proxy = {
        'http': '',   # your http proxy
        'https': ''   # your https proxy
    }
    openai.api_key = ""  # your api key
    openai.base_url = "https://api.openai.com/v1"  # your api base url
    model = 'gpt-3.5-turbo-16k'  # your model
    volumes = ["Volume26-30"]  # names of the json input files
    start = time.perf_counter()
    main(volumes, model)
    end = time.perf_counter()
    print('runningtime:' + str(end - start))
```

Then run the script:
```shell
python extract_gpt.py
```

The `strcuturelize.py` script converts the initial output CSV into a structured CSV table with the columns Index, Reactants, Reactant amounts, Products, Product amounts, Solvents, Reaction temperature, Reaction time and Yield. Its input should be the output CSV file of `extract_gpt.py`.

After preparing your input files, edit this part of the script:
```python
if __name__ == '__main__':
    volumes = ["Volume96-100"]  # your json name (same as the first step)
    start = time.perf_counter()
    main(volumes)
    end = time.perf_counter()
    print('runningtime:' + str(end - start))
```

Then run the script:
```shell
python strcuturelize.py
```

The output file `xxx_table.csv` is the structured CSV file.
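Once structured, the table can be consumed with standard tooling. A minimal sketch that filters reactions by yield, using a hypothetical row in place of real `strcuturelize.py` output:

```python
import csv

# Hypothetical row mirroring the column layout described above.
rows = [{
    "Index": "volume96article1",
    "Reactants": "1,2-dimethoxyethane; sodium ethoxide",
    "Products": "ethene",
    "Solvents": "ether",
    "Reaction temperature": "35 C",
    "Reaction time": "2 h",
    "Yield": "85%",
}]

with open("example_table.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# Read the table back and keep reactions with yield above 80%.
with open("example_table.csv", newline="") as f:
    high_yield = [r for r in csv.DictReader(f)
                  if float(r["Yield"].rstrip("%")) > 80]
```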
The `name_to_smiles.py` script is part of the standardization module; it converts the names of reactants and products to SMILES. Its input should be a CSV file containing a `Name` column.

After preparing your input file, edit this part of the script to set your file paths:
```python
if __name__ == '__main__':
    start = time.perf_counter()
    input_filename = 'name.csv'      # your input file name
    output_filename = 'smiles.csv'   # output file name
    input_data = pd.read_csv(input_filename)
    output_data = pd.DataFrame()
    output_data['Name'] = input_data['Name']
    output_data['SMILES'] = input_data['Name'].apply(get_smiles)
    output_data.to_csv(output_filename, index=False)
    end = time.perf_counter()
    print('runningtime:' + str(end - start))
```

Then run the script:
```shell
python name_to_smiles.py
```

The output file `smiles.csv` contains the SMILES string for each name.
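The repository's `get_smiles` implementation is not reproduced here; one common way to resolve a compound name to SMILES is the PubChem PUG REST service. A minimal sketch of that approach (the `fetch` parameter is an addition for testability, not part of the script):

```python
from urllib.parse import quote
from urllib.request import urlopen

# PubChem PUG REST endpoint for name-to-SMILES resolution.
PUG_URL = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           "{name}/property/CanonicalSMILES/TXT")

def get_smiles(name, fetch=None):
    """Resolve a compound name to a SMILES string; return None on failure."""
    if fetch is None:  # default: plain HTTP GET against PubChem
        fetch = lambda url: urlopen(url, timeout=10).read().decode().strip()
    try:
        return fetch(PUG_URL.format(name=quote(name)))
    except Exception:
        return None  # unresolvable names stay empty in the output CSV
```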
The `time_standardlize.py` script is part of the standardization module; it standardizes the reaction time. Its input should be a CSV file containing an `Index` and a `Reaction time` column.

After preparing your input file, edit this part of the script to set your file path and model API:
```python
if __name__ == '__main__':
    openai.proxy = {
        'http': '',   # your http proxy
        'https': ''   # your https proxy
    }
    openai.api_key = ""  # your api key
    volumes = ["Volume16-20"]  # your input file name
    delay = 20  # delay between requests; lower it if you have no rate limit
    model = "gpt-3.5-turbo"  # your model name
    start = time.perf_counter()
    main(volumes, delay, model)
    end = time.perf_counter()
    print('runningtime:' + str(end - start))
```

Then run the script:
```shell
python time_standardlize.py
```

The output file `xxx_timetable.csv` is the standardized CSV file.
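For intuition, the normalization performed in this step can be approximated deterministically: map each free-text duration to a single unit. The sketch below is an illustration only (the choice of hours as target unit and the unit table are assumptions; the actual script delegates this to the model):

```python
import re

# Conversion factors to hours; the unit list is an assumption for illustration.
_UNIT_TO_HOURS = {
    "s": 1 / 3600, "sec": 1 / 3600, "min": 1 / 60, "minutes": 1 / 60,
    "h": 1.0, "hr": 1.0, "hour": 1.0, "hours": 1.0,
    "d": 24.0, "day": 24.0, "days": 24.0,
}

def to_hours(text):
    """Convert free text such as '90 min' or '2 days' to hours, else None."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", text)
    if not m:
        return None  # e.g. "overnight" carries no parseable number
    value, unit = float(m.group(1)), m.group(2).lower()
    factor = _UNIT_TO_HOURS.get(unit)
    return value * factor if factor is not None else None
```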
SynChat is an interactive tool powered by LLM agents. It allows researchers to query historical reaction data and associated metadata in natural language, providing a more intuitive and efficient way to access specific data than traditional search methods.
We welcome contributions from the community. Please fork the repository and submit pull requests.
This project is licensed under the MIT License. See the LICENSE file for details.
If you have any questions or suggestions, please contact us at lijiawei24@mails.tsinghua.edu.cn.