ChemAsLang
==========
ChemAsLang is a collection of resources that treats chemistry as a language. It breaks sets of chemical compounds down into component parts (fragments), builds networks of those fragments, and trains Word2Vec (W2V) models on the fragments to track changes in chemistry over time.
- Uses RDKit to generate chemical fragments
- Uses Gensim to train Word2Vec models of fragments and chemical compounds
- To generate fragments: upload a set of chemicals (in SMILES or SMARTS format) to the `Fragments/Data` directory, then edit `Fragments/common_frags_parallel.py` as necessary.
- To train W2V: edit `Word2Vec/build_KEGG_gensim.py` as necessary.
This is an ongoing project as part of my graduate research at Arizona State University. Future plans are to:
- Simplify the workflow for generating fragments from a list of user-specified chemical compounds (estimated date: Summer 2021)
- Train a W2V model as part of the fragment generation process (estimated date: Fall 2021)
- Release this code as a fully documented Python package (estimated date: Spring 2022)
If you have questions, email me at: jfmalloy1@gmail.com
The project is licensed under the MIT license.