DaQuawn-Edwards/Molecular-Target-Classification

A Random Forest Framework for Molecular Target Classification

274B Micropresentation

Group 13: DaQuawn Edwards, Andrea Kim, Shivani Tijare, Yejin Yang

Overview

This project examines how the theoretical foundations of the Random Forest algorithm connect to its practical use in molecular sciences and computational biology. We begin by outlining the core principles of decision tree ensembles, including bagging, feature randomness, and variance reduction, and explain why a forest of randomized trees provides a more stable and robust classifier than a single decision tree. We then analyze the computational complexity of Random Forests, covering training and prediction time as well as space requirements, to understand how factors such as tree depth, number of trees, and feature dimensionality influence scalability.

To illustrate the practical relevance of these concepts, we review established biological applications of Random Forests in gene expression analysis, protein-protein interaction prediction, and sequence-based functional inference. Finally, we implement a full end-to-end demonstration using a multi-target bioactivity dataset derived from ChEMBL, training a Random Forest classifier on molecular descriptors to predict protein targets and evaluating its performance through accuracy, classification reports, confusion matrices, and feature importance analysis.
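To make the complexity discussion concrete, the sketch below times Random Forest training as the number of trees grows. This is an illustrative micro-benchmark on synthetic data (sklearn's `make_classification`, not the ChEMBL set); absolute times depend entirely on hardware, but the roughly linear growth in the number of trees should hold.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stand-in; sizes chosen only to make timing differences visible.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

timings = {}
for n_trees in (50, 100, 200):
    t0 = time.perf_counter()
    # n_jobs=1 keeps the timing comparison single-threaded and predictable.
    RandomForestClassifier(n_estimators=n_trees, random_state=0, n_jobs=1).fit(X, y)
    timings[n_trees] = time.perf_counter() - t0
    print(f"{n_trees:4d} trees: {timings[n_trees]:.2f}s")
```

Doubling the number of trees roughly doubles training cost, since each tree is grown independently; prediction and memory scale the same way.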

Directory Structure

├── random_forest.ipynb                   # Random Forest theory, complexity, and biological target classification demo
├── molecule_classification_dataset.csv   # ChEMBL-derived dataset
├── image/
│   └── RF_image1.png                     # Random Forest workflow illustration
├── environment.yml
├── Makefile
└── README.md

Dataset Description

We analyze a multi-target bioactivity dataset derived from ChEMBL. Each compound includes a canonical SMILES string, computed molecular descriptors (molecular weight, LogP, HBA, HBD, TPSA, etc.), and a categorical protein target label. Because the dataset contains chemically meaningful features and discrete class labels, it is well suited for supervised machine-learning models such as random forests. The dataset used in this project is available at: https://www.kaggle.com/datasets/xjoannax88/multitarget-bioactivity-chembl/data
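After downloading the CSV, it helps to sanity-check the schema before modeling. The toy rows below only mimic the assumed layout (descriptor and label column names here are hypothetical; inspect the real file's header, e.g. with `pd.read_csv(...).columns`, before relying on them):

```python
import pandas as pd

# Toy rows mimicking the assumed schema of molecule_classification_dataset.csv;
# column names and target labels here are illustrative, not the real header.
df = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"],
    "mol_weight": [46.07, 94.11, 151.16],
    "logp": [-0.31, 1.39, 1.35],
    "hba": [1, 1, 2],
    "hbd": [1, 1, 2],
    "tpsa": [20.23, 20.23, 49.33],
    "target": ["CYP3A4", "CYP3A4", "COX-1"],  # hypothetical protein labels
})

print(df.dtypes)                    # numeric descriptors + string SMILES/label
print(df["target"].value_counts()) # class balance matters for this dataset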

Installation

Follow the steps below to create and manage the Conda environment using the Makefile. Run these commands in your terminal:

1. Create the environment

make create

2. Activate the environment

make activate

3. Delete the environment

make delete

Project Pipeline

We clean the ChEMBL dataset by removing unused identifiers, imputing missing descriptor values, and label-encoding protein targets. Five molecular descriptors serve as features. After an 80/20 stratified split, we train a Random Forest classifier (300 trees, sqrt feature sampling, class-balanced weighting) and evaluate performance with accuracy, F1 scores, and feature importances.
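The pipeline above can be sketched as follows. Synthetic data stands in for the five descriptors (the real notebook reads molecular weight, LogP, HBA, HBD, and TPSA from the ChEMBL CSV), but the split and the classifier settings mirror the description: 80/20 stratified split, 300 trees, sqrt feature sampling, class-balanced weighting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the five molecular descriptors and three targets.
X, y_raw = make_classification(
    n_samples=600, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
y = LabelEncoder().fit_transform(y_raw)  # mirrors label-encoding protein targets

# 80/20 stratified split, as in the pipeline description.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=300,         # 300 trees
    max_features="sqrt",      # sqrt feature sampling at each split
    class_weight="balanced",  # class-balanced weighting
    random_state=42,
)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
print("accuracy:", round(accuracy_score(y_te, pred), 3))
print("macro F1:", round(f1_score(y_te, pred, average="macro"), 3))
print("importances:", clf.feature_importances_.round(3))
```

On the real data, `clf.feature_importances_` is what ranks the descriptors reported in the Results section.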

Theory of Random Forest

Random Forest aggregates many decision trees trained on bootstrap samples while using random subsets of features at each split. This dual randomness reduces overfitting and the correlation between trees, producing a more stable classifier than a single tree. A few key hyperparameters (number of trees, tree depth, and feature sampling) control the balance between variance and bias.

Results

Our model shows strong performance on the protein target prediction task:

  • Test Accuracy: 0.95
  • Macro F1: 0.68
  • Weighted F1: 0.94
  • CV Accuracy: 0.90 ± 0.02
  • CV Macro F1: 0.53 ± 0.07
  • Top Features: TPSA, Molecular Weight, HBD

Discussion

Random Forest predicts protein targets effectively using only basic molecular descriptors, showing its value for tabular chemical data. However, the model lacks structural information from SMILES and is influenced by class imbalance. Future improvements could integrate fingerprint or graph based features for greater chemical sensitivity.

Team Contribution Statement

Shivani served as the project manager, setting deadlines, organizing meeting times, and tracking key deliverables. Andrea focused on understanding the theory behind Random Forests and presented those concepts to the group, working with Yejin to explore the time and space complexity of the algorithm. The entire group collaborated in selecting a dataset that was both interesting and suitable for the project. DaQuawn and Shivani jointly worked on loading and cleaning the dataset, as well as building an initial model. Shivani later fine-tuned the model, and the group came together to interpret the results, evaluate model performance, and discuss advantages and limitations. Splitting tasks up and making team members owners of specific parts of the project kept the work cohesive and on schedule, while group meetings to discuss our findings let everyone learn and make the most of the assignment.
