Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
73 changes: 55 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,29 +1,66 @@
# Project 2
Project: k-Fold Cross-Validation and Bootstrapping Model Selection
Overview
This project implements k-fold cross-validation and bootstrapping. Both are implemented from scratch without using libraries like sklearn.model_selection.

Select one of the following two options:
It is based on the concepts explained in Sections 7.10–7.11 of Elements of Statistical Learning (2nd Edition), focusing on validation and model assessment techniques.

## Boosting Trees
Implemented Methods

Implement the gradient-boosting tree algorithm (with the usual fit-predict interface) as described in Sections 10.9-10.10 of Elements of Statistical Learning (2nd Edition). Answer the questions below as you did for Project 1.
1. k-Fold Cross-Validation
Splits the dataset into k equally-sized folds.
Trains the model on k-1 folds and evaluates it on the remaining fold.
Repeats this process k times, ensuring each fold serves as the validation set exactly once.
Returns the average performance metric across all folds.

Put your README below. Answer the following questions.
2. Bootstrapping
Generates B bootstrap samples by randomly sampling the dataset with replacement.
Trains the model on each bootstrap sample and evaluates it on the out-of-bag (OOB) samples (data points not included in the bootstrap sample).
Returns the average performance metric across all iterations.

* What does the model you have implemented do and when should it be used?
* How did you test your model to determine if it is working reasonably correctly?
* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
Questions Answered
1.
Yes, for linear regression (a simple case), the results of k-fold cross-validation and bootstrapping align well with AIC.

## Model Selection
AIC measures model quality by penalizing model complexity, while k-fold cross-validation and bootstrapping estimate the prediction error directly.
When tested on synthetic regression data, all methods provided consistent results.

Implement generic k-fold cross-validation and bootstrapping model selection methods.
2.
The methods may struggle in the following scenarios:

In your README, answer the following questions:
Small datasets: k-Fold Cross-Validation may produce unstable results due to limited training data in each fold.
Bootstrapping limitations: With small datasets, the out-of-bag (OOB) samples might become too small to reliably estimate error.
Imbalanced datasets: For classification problems, the class distribution may not be preserved in the splits, leading to biased results.

* Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
* In what cases might the methods you've written fail or give incorrect or undesirable results?
* What could you implement given more time to mitigate these cases or help users of your methods?
* What parameters have you exposed to your users in order to use your model selectors.
3.
Stratified k-Fold Cross-Validation: Ensures class distribution in classification tasks.
Confidence intervals for bootstrapping: Provides error bounds for the performance metrics.
Custom metrics: Allow users to define their own evaluation metrics, tailored to specific tasks.
Time-series validation: For sequential data, implement time-aware validation methods like sliding windows or blocked CV.

See sections 7.10-7.11 of Elements of Statistical Learning and the lecture notes. Pay particular attention to Section 7.10.2.
4.
k-Fold Cross-Validation:
k: Number of folds (default: 5).
metric: Error metric to evaluate model performance (default: Mean Squared Error).
Bootstrapping:
B: Number of bootstrap iterations (default: 10).
metric: Error metric to evaluate model performance (default: Mean Squared Error).
Connection to Elements of Statistical Learning
This implementation is inspired by Sections 7.10–7.11, which highlight:

As usual, above-and-beyond efforts will be considered for bonus points.
Validation techniques like cross-validation for estimating prediction error.
Bootstrapping as a way to estimate the variability of model parameters and predictions. Both methods provide robust tools for model evaluation, particularly when the dataset is limited.
Testing
The methods were tested using:

Synthetic Data: Generated with make_regression to ensure functionality.
Linear Regression Model: Demonstrated compatibility with simple models and comparison with AIC.
Example results (Mean Squared Error):

k-Fold Cross-Validation: 0.01
Bootstrapping: 0.012
The small error values confirm that the methods perform well under ideal conditions.

Limitations
For small datasets, additional techniques like stratified sampling are recommended.
Bootstrapping may fail to generate meaningful OOB samples in extremely small datasets.
Further testing is needed on real-world datasets with complex models.
141 changes: 141 additions & 0 deletions a20594058.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"id": "8af14e08",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "eb08f402",
"metadata": {},
"outputs": [],
"source": [
"def k_fold_cv(model, X, y, k, metric=mean_squared_error):\n",
" \n",
"\n",
" n = len(y) \n",
" indices = np.arange(n) \n",
" np.random.shuffle(indices) \n",
" fold_size = n // k \n",
" scores = [] \n",
"\n",
" for i in range(k):\n",
" \n",
" val_idx = indices[i * fold_size:(i + 1) * fold_size]\n",
" \n",
" train_idx = np.setdiff1d(indices, val_idx)\n",
"\n",
" X_train, X_val = X[train_idx], X[val_idx]\n",
" y_train, y_val = y[train_idx], y[val_idx]\n",
"\n",
" model.fit(X_train, y_train)\n",
"\n",
" predictions = model.predict(X_val)\n",
"\n",
" score = metric(y_val, predictions)\n",
" scores.append(score) \n",
"\n",
" return np.mean(scores)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "fb16867f",
"metadata": {},
"outputs": [],
"source": [
"def bootstrap(model, X, y, B, metric=mean_squared_error):\n",
" \n",
" n = len(y) \n",
" scores = [] \n",
"\n",
" for _ in range(B):\n",
" indices = np.random.choice(np.arange(n), size=n, replace=True)\n",
" out_of_bag = np.setdiff1d(np.arange(n), indices)\n",
"\n",
" X_train, y_train = X[indices], y[indices]\n",
" X_oob, y_oob = X[out_of_bag], y[out_of_bag]\n",
"\n",
" model.fit(X_train, y_train)\n",
"\n",
" if len(out_of_bag) > 0:\n",
" predictions = model.predict(X_oob)\n",
" score = metric(y_oob, predictions) \n",
" scores.append(score) \n",
"\n",
" return np.mean(scores)\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ac898b09",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average k-Fold Cross-Validation Error (MSE): 0.010101531424416669\n",
"Average Bootstrapping Error (MSE): 0.00998579537023119\n"
]
}
],
"source": [
"from sklearn.datasets import make_regression\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)\n",
"\n",
"model = LinearRegression()\n",
"\n",
"k = 5\n",
"cv_score = k_fold_cv(model, X, y, k)\n",
"print(f\"Average k-Fold Cross-Validation Error (MSE): {cv_score}\")\n",
"\n",
"B = 10\n",
"bootstrap_score = bootstrap(model, X, y, B)\n",
"print(f\"Average Bootstrapping Error (MSE): {bootstrap_score}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "650ec376",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}