Supervised Machine Learning Pipeline

Overview

This project implements a reproducible machine learning pipeline for predictive modeling on a numeric dataset with 20 features and 20,000 observations. Multiple linear and non-linear models were evaluated, with a focus on correct evaluation discipline, feature selection, and model tuning.

The goal was to identify a model that generalizes well while avoiding common pitfalls such as data leakage and overfitting.

Dataset

Size: 20,000 rows
Features: 20 numeric predictors
Target: Continuous response variable (range approximately 0–21)

Exploratory data analysis revealed strong correlations among several predictors and evidence of non-linear structure, motivating the use of tree-based models.

Modeling Approach

The following models were implemented and compared:

Linear Regression
Ridge Regression
LASSO Regression
Random Forest
XGBoost

Initial linear models performed poorly, indicating non-linear relationships in the data. Tree-based models were therefore explored in greater depth.

Evaluation Strategy

Data split: 75% training / 25% test
Out-of-Bag Mean Squared Prediction Error (OOB MSPE) was used for Random Forest hyperparameter tuning to avoid test-set leakage
Final performance was evaluated on the held-out test set

This approach ensures fair model comparison and realistic generalization estimates.

Feature Selection and Tuning

Feature importance from the Random Forest model was used to remove low-impact predictors
The model was retrained on a reduced feature set
Hyperparameters tuned:
- mtry
- nodesize

The final Random Forest model achieved an OOB MSPE of approximately 14, improving substantially over baseline models (~27 MSPE).

Final Model

Model: Random Forest
Trees: 500
Key tuning parameters:
- mtry = 7
- nodesize = 1

Predictions on the test set were generated and saved for submission and downstream use.

Reproducibility

To reproduce the final model:

Load the training data
Drop the specified low-importance variables
Set the random seed
Train the Random Forest with the tuned parameters

The full workflow and code are deterministic and reproducible.

Technologies

Language: R
Libraries: randomForest, corrplot
Methods: Feature importance analysis, hyperparameter tuning, supervised learning

Notes

This project emphasizes evaluation rigor, reproducibility, and ML system design rather than maximizing performance through excessive tuning.

The full academic report with detailed analysis and figures is available in STAT452_Report.pdf.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
dataset		dataset
.gitignore		.gitignore
README.md		README.md
project.qmd		project.qmd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Supervised Machine Learning Pipeline

Overview

Dataset

Modeling Approach

Evaluation Strategy

Feature Selection and Tuning

Final Model

Reproducibility

Technologies

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Supervised Machine Learning Pipeline

Overview

Dataset

Modeling Approach

Evaluation Strategy

Feature Selection and Tuning

Final Model

Reproducibility

Technologies

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages