### Dataset Todo

- [ ] Add "synthetic" data #13 [nice-to-have]
- [ ] Run ChemDataExtractor on Free Text #18 [needs-discussion]
- [ ] Prepare PubChem dataset #19 [priority-high]
- [ ] Add ChEMBL dataset #24 [priority-high]
- [ ] Add ESOL dataset #33 [priority-high]

### Dataset In Progress

- [ ] Add Papyrus dataset #335 #340
- [ ] Add papyrus protein targets #336
- [ ] Adding data from the Human Metabolome Database (HMDB) #136 [adamoyoung]
- [ ] Adding Data from MassBank of North America (MoNA) #137 [adamoyoung]
- [ ] Add Open Targets datasets for drug information #138 #139 #140 #141 #142 [jackapbutler]
- [ ] Adding the europepmc dataset #162 [hssn-20]
- [ ] Adding Uniprot, X-linking to reaction DBs for enzymes #191 [hypnopump]
- [ ] Add DrugChat data #293 [alxfgh]
- [ ] Adding Suzuki Miyaura yield prediction dataset #212 [pschwllr]
- [ ] Add QMOF dataset #235 [kjappelbaum]
- [ ] Add SuperCon dataset #236 [kjappelbaum]
- [ ] Add QMUG dataset #237 [kjappelbaum]
- [ ] Add Enamine dataset #238 [kjappelbaum]
- [ ] Add ORD dataset #239 [kjappelbaum]
- [ ] Refactor rhea_db into csv files #242 [kjappelbaum]
- [ ] Add Drug-Target Interaction data #68 [strubeyj]
- [ ] H2_storage_materials_database #64 [bethanyconnolly] #76
- [ ] Add EuroPMC Dataset #32 [abhinav-kashyap-asus]
- [ ] Add Buchwald Hartwig dataset [pschwllr] #81
- [ ] Add Drug-Drug Interaction Data from nSIDES [apoorvasrinivasan26] #89
- [ ] Add uspto data from drfp #95
- [ ] Add NLMChem #114 [apoorvasrinivasan26]
- [ ] Add ThermoML Archive dataset #118
- [ ] Adding the Chemistry textbooks from LibreTexts library #134
- [ ] Add Therapeutic Data Commons dataset #27 [priority-high]
  - [ ] Single-instance [phalem] #90
    - [x] Add ADME Property [phalem] #84
      - [x] Absorption #85
        - [x] Caco-2 (Cell Effective Permeability), Wang et al. [MicPie] #37
        - [x] PAMPA Permeability, NCATS [MicPie] #41
        - [x] HIA (Human Intestinal Absorption), Hou et al. #85
        - [x] Pgp (P-glycoprotein) Inhibition, Broccatelli et al. #85
        - [x] Bioavailability, Ma et al. #85
        - [x] Lipophilicity, AstraZeneca [MicPie] #22
        - [x] Solubility, AqSolDB #85
        - [x] Hydration Free Energy, FreeSolv #85
      - [x] Distribution #86
        - [x] BBB (Blood-Brain Barrier), Martins et al. #86
        - [x] PPBR (Plasma Protein Binding Rate), AstraZeneca #86
        - [x] VDss (Volume of Distribution at steady state), Lombardo et al. #86
      - [x] Metabolism #88
        - [x] CYP P450 2C19 Inhibition, Veith et al. #88
        - [x] CYP P450 2D6 Inhibition, Veith et al. #88
        - [x] CYP P450 3A4 Inhibition, Veith et al. #88
        - [x] CYP P450 1A2 Inhibition, Veith et al. #88
        - [x] CYP P450 2C9 Inhibition, Veith et al. #88
        - [x] CYP2C9 Substrate, Carbon-Mangels et al. #88
        - [x] CYP2D6 Substrate, Carbon-Mangels et al. #88
        - [x] CYP3A4 Substrate, Carbon-Mangels et al. #88
      - [x] Excretion #87
        - [x] Half Life, Obach et al. #87
        - [x] Clearance, AstraZeneca #87
    - [x] Add Toxicity [phalem]
      - [x] Acute Toxicity LD50 #54
      - [x] hERG blockers #53
      - [x] hERG Central #61
      - [x] hERG Karim et al. #52
      - [x] Ames Mutagenicity #56
      - [x] DILI (Drug Induced Liver Injury) #51
      - [x] Skin Reaction #49
      - [x] Carcinogens #55
      - [x] Tox21 #77
      - [ ] ToxCast #79 #343 #345 #346
      - [x] ClinTox #50
    - [x] Add High-throughput Screening [phalem]
      - [x] SARS-CoV-2 In Vitro, Touret et al. #59
      - [x] SARS-CoV-2 3CL Protease, Diamond. #94
      - [x] HIV #60
      - [x] Butkiewicz et al. #62
    - [ ] Add Quantum Mechanics Modeling #78
      - [ ] QM7b
      - [ ] QM8
      - [ ] QM9
    - [ ] Add Reaction Yields #78
      - [ ] Buchwald-Hartwig #81
      - [ ] USPTO
    - [ ] Add Epitope (Immunotherapy under Target discovery) #97
      - [ ] IEDB, Jespersen et al. #96
      - [ ] PDB, Jespersen et al. #96
    - [ ] Add Antibody Developability #78
      - [ ] TAP #99
      - [ ] SAbDab, Chen et al. #99
    - [ ] Add CRISPR Repair Outcome [apoorvasrinivasan26]
      - [ ] Leenay et al.
  - [ ] Multi-instance
    - [ ] Add Drug-Target Interaction data #68 [strubeyj]
      - [ ] BindingDB
      - [ ] DAVIS
      - [ ] KIBA
    - [ ] Add Drug-Drug Interaction
      - [ ] DrugBank Multi-Typed DDI
      - [ ] TWOSIDES Polypharmacy Side Effects
    - [ ] Add Gene-Disease Association
      - [ ] DisGeNET
    - [ ] Add Drug Response
      - [ ] GDSC1
      - [ ] GDSC2
    - [ ] Add Peptide-MHC Binding
      - [ ] MHC Class I, IEDB-IMGT, Nielsen et al.
      - [ ] MHC Class II, IEDB, Jensen et al.
    - [ ] Add Antibody-antigen Affinity
      - [ ] SAbDab
    - [ ] Add MicroRNA-Target Interaction
      - [ ] miRTarBase
    - [ ] Add Catalyst
      - [ ] USPTO
    - [ ] Add TCR-Epitope Binding Affinity [strubeyj] #67
      - [ ] Weber et al.
  - [ ] Generation data [phalem] #90
    - [x] Add Molecule Generation #178 [arkadiusz-czerwinski]
      - [x] MOSES #178 [arkadiusz-czerwinski]
      - [x] ZINC #178 [arkadiusz-czerwinski]
      - [x] ChEMBL #178 [arkadiusz-czerwinski]
    - [ ] Add Retrosynthesis
      - [ ] USPTO-50K
      - [ ] USPTO
    - [ ] Add Reaction Outcome
      - [ ] USPTO
    - [ ] Add Structure-based Drug Design
      - [ ] PDBBind
      - [ ] DUD-E
      - [ ] scPDB

### Done ✓

- [x] Add flashpoint dataset #43 [othertea]
- [x] Add initial model pipeline [maw501] [bethanyconnolly] [kjappelbaum] [MicPie] #71
- [x] Add iupac goldbook #187 #188 [MicPie]
- [x] Add RXN-SMILES as identifier type #113 [kjappelbaum]
- [x] Add benchmark field #116
- [x] Add entos protonation energy #244 #233 [kjappelbaum]
- [x] Add chebi-20 dataset #63 #108 [jackapbutler]
- [x] Add FDA Adverse reactions datasets #139 #143 [jackapbutler]
- [x] Add Natural text dataset elsevier_oa_cc-by_corpus #216