507 changes: 507 additions & 0 deletions Boston.csv

Large diffs are not rendered by default.

111 changes: 93 additions & 18 deletions README.md
@@ -1,29 +1,104 @@
# Project 2
### Team Members

Select one of the following two options:
1. Satya Mani Srujan Dommeti #A20594429
2. Arjun Singh #A20577962
3. Akshitha Reddy Kuchipatla #A20583607
4. Vamsi Krishna #A20543669

## Boosting Trees
# Model Implementation
This project implements both linear regression and ridge regression. The two models are evaluated side by side so that the one performing best on the given dataset can be selected. To ensure robust performance assessment, k-fold cross-validation and bootstrapping were applied, and the Akaike Information Criterion (AIC) was used to evaluate the quality and effectiveness of each model.
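
A minimal usage sketch of this workflow, assuming the `LinearRegression`, `RidgeRegression`, and `ModelSelector` classes from `model.py`; the synthetic data here stands in for a real CSV such as `Boston.csv`:

```python
import numpy as np
from model import LinearRegression, RidgeRegression, ModelSelector

# Synthetic data stands in for a real dataset (e.g., Boston.csv).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Simple 80/20 split (test.py's split_data does the same with shuffling).
split = int(0.8 * len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

selector = ModelSelector(X_train, X_test, y_train, y_test)
results = {name: selector.evaluate_model(model)
           for name, model in {"linear": LinearRegression(),
                               "ridge": RidgeRegression(alpha=1.0)}.items()}

# Pick the model with the lowest held-out MSE, as test.py does.
best = min(results, key=lambda name: results[name]["test_mse"])
print(best, results[best])
```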

Implement the gradient-boosting tree algorithm (with the usual fit-predict interface) as described in Sections 10.9-10.10 of Elements of Statistical Learning (2nd Edition). Answer the questions below as you did for Project 1.
### K-fold Cross-Validation
K-fold cross-validation is a model validation technique used to assess the performance of a machine learning model. The dataset is split into K equally sized folds (subsets). For each fold, the model is trained using the remaining K-1 folds and tested on the current fold. This process is repeated K times, each time with a different fold serving as the test set. The final performance metric (e.g., Mean Squared Error) is averaged over all K iterations, providing a more reliable estimate of the model's generalization ability. The number of folds (K) can be adjusted based on the dataset size and computational resources, with the default set to 6 in our implementation.
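
For illustration, here is a compact sketch of the procedure described above; the project's own version is `ModelSelector.k_fold_cv` in `model.py`, and the shuffling step below is an extra assumption, not part of that implementation:

```python
import numpy as np

def k_fold_mse(model_cls, X, y, k=6):
    """Average validation MSE over k folds (illustrative sketch)."""
    indices = np.random.permutation(len(X))      # shuffle before folding
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_cls()
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        scores.append(np.mean((y[val_idx] - preds) ** 2))
    return np.mean(scores)
```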

Put your README below. Answer the following questions.
### Bootstrapping
Bootstrapping is a resampling technique used to estimate the performance of a machine learning model. It involves generating multiple random subsets (with replacement) from the training dataset, each of the same size as the original dataset. For each resampled subset, a model is trained, and its performance is evaluated on the test set. This process is repeated for a specified number of iterations (n_iterations), and the performance metrics (such as Mean Squared Error) are averaged to provide a more robust estimate of the model’s accuracy. The number of iterations can be adjusted based on the dataset size and computational resources, with the default set to 100 in our implementation. Bootstrapping helps assess the stability of the model by testing it on multiple resampled subsets.
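
A compact sketch of the same idea; the project's version is `ModelSelector.bootstrap` in `model.py`, which likewise fits on each resample and scores on the held-out test set:

```python
import numpy as np

def bootstrap_mse(model_cls, X_train, y_train, X_test, y_test, n_iterations=100):
    """Average test MSE over models fit on bootstrap resamples (illustrative sketch)."""
    n = len(X_train)
    scores = []
    for _ in range(n_iterations):
        idx = np.random.choice(n, size=n, replace=True)  # resample with replacement
        model = model_cls()
        model.fit(X_train[idx], y_train[idx])
        preds = model.predict(X_test)
        scores.append(np.mean((y_test - preds) ** 2))
    return np.mean(scores)
```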

* What does the model you have implemented do and when should it be used?
* How did you test your model to determine if it is working reasonably correctly?
* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
### Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
In simple cases like linear regression, the cross-validation, bootstrapping, and AIC model selectors generally align, as all three methods aim to assess model performance in terms of its ability to fit the data while avoiding overfitting.

## Model Selection
- **Cross-Validation**: By splitting the data into multiple folds and evaluating the model's mean squared error (MSE) on the validation set, k-fold cross-validation provides a robust estimate of the model's performance across different data splits.

Implement generic k-fold cross-validation and bootstrapping model selection methods.
- **Bootstrapping**: This method repeatedly samples the training data with replacement, fits the model, and evaluates its performance (e.g., MSE). It captures variability in model performance due to data sampling.

In your README, answer the following questions:
- **AIC**: AIC uses the MSE and penalizes model complexity based on the number of parameters. For linear regression, which typically has a straightforward relationship between features and output, AIC often agrees with the other methods because the MSE forms the foundation for its calculation (a short computation sketch follows this list).

- Both k-fold cross-validation and bootstrapping rely on MSE for evaluation, similar to AIC. Hence, in simpler models like linear regression, these approaches are likely to identify the same optimal model because they evaluate similar metrics under slightly different conditions.
- The alignment may diverge in more complex scenarios, such as models with high regularization (e.g., ridge regression) or datasets with noise, where AIC’s penalty on model complexity could lead to different selections compared to resampling-based techniques.
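
For reference, the MSE-based AIC used here, matching `ModelSelector.aic` in `model.py` (a Gaussian-likelihood approximation up to an additive constant):

```python
import numpy as np

def aic_from_mse(y_true, y_pred, n_params):
    """AIC based on MSE: n * ln(MSE) + 2 * (number of parameters)."""
    n = len(y_true)
    mse = np.mean((y_true - y_pred) ** 2)
    return n * np.log(mse) + 2 * n_params
```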

* Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
* In what cases might the methods you've written fail or give incorrect or undesirable results?
* What could you implement given more time to mitigate these cases or help users of your methods?
* What parameters have you exposed to your users in order to use your model selectors.
### In what cases might the methods you've written fail or give incorrect or undesirable results?
**Bootstrapping:**

- May lead to overfitting by emphasizing specific patterns in the data, especially when observations are repeated due to resampling with replacement.
- Assumes the sample is representative of the population, so estimates can be biased if the data is highly skewed or the sample size is too small.

**K-fold Cross-Validation:**

- Can be computationally expensive for large datasets, since the model must be retrained for each fold.
- Can give biased results on imbalanced datasets or non-i.i.d. data (such as time series), where improper partitioning may overestimate the model's performance.

**AIC (Akaike Information Criterion):**

- Can penalize models with many predictors too heavily, especially in the presence of multicollinearity, potentially selecting overly simple models that underfit the data.
- Assumes the model is correctly specified; if the model's assumptions (e.g., linear relationships) do not hold, AIC may favor a poor-fitting model.

### What could you implement given more time to mitigate these cases or help users of your methods?
Given more time, there are several ways to improve the robustness and flexibility of the methods for linear regression, ridge regression, and model evaluation. Below are some considerations and improvements that could be made:

**Improving Robustness for Linear and Ridge Regression Models:**

- Regularization Techniques: For Ridge Regression, we could implement adaptive regularization, such as selecting the penalty strength (alpha) via cross-validation or adding an elastic-net style penalty, to improve performance when features are highly correlated or numerous. This would make the model more flexible across different kinds of data and help it generalize better.
- Handling Outliers: We could add outlier detection and handling (such as robust regression techniques or filtering extreme values) to keep the model from overfitting to noisy or extreme data points.

**Enhancing Data Cleaning and Preprocessing** (a preprocessing sketch follows this list):

- Handling Missing Data: Instead of removing rows with missing values, we could impute them, e.g., filling numerical columns with the mean, median, or mode and categorical columns with the most frequent category. This retains valuable information and avoids the data loss that missing values commonly cause in real-world datasets.
- Feature Scaling and Transformation: For better convergence and accuracy, especially in Ridge Regression, feature scaling (standardization or normalization) could be applied so that no feature dominates the fit simply because it has larger values.

**Customizable Parameters:**

- Making Parameters More Generic: The number of folds (k) in K-fold cross-validation and the number of bootstrap iterations (n_iterations) could be passed as arguments wherever the selectors are called, making the methods more adaptable to different use cases and datasets.
- Dynamic Model Evaluation: The evaluate_model method could report additional metrics (e.g., R-squared, Adjusted R-squared) to give users a more complete picture of model fit beyond MSE and AIC.

**Data Type Handling:**

- Categorical Data Handling: We do not currently handle categorical variables; adding preprocessing such as one-hot encoding or label encoding would allow the models to work with both numerical and categorical data, which is especially important for datasets with mixed types.
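
A sketch of what such a preprocessing helper could look like; none of this exists in the repository yet, and the function name and column handling are illustrative assumptions:

```python
import pandas as pd

def preprocess(df, target):
    """Hypothetical helper: impute missing values, encode categoricals, scale features."""
    df = df.copy()
    for col in df.columns:
        if df[col].dtype.kind in "biufc":                   # numeric column
            df[col] = df[col].fillna(df[col].median())
        else:                                               # categorical column
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    y = df[target].values
    # One-hot encode categorical features, then standardize everything.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True).astype(float)
    X = (X - X.mean()) / X.std()
    return X.values, y
```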



### What parameters have you exposed to your users in order to use your model selectors?

The model selectors expose several parameters that let users customize the evaluation and selection process. These include the number of folds (k) for cross-validation, which defaults to 6 but can be adjusted to the dataset size and computational constraints; the number of bootstrap iterations (n_iterations, default 100), which balances estimation accuracy against runtime; the regularization strength alpha in the Ridge Regression model; and the test_size parameter (default 0.2, i.e., 20% of the data), which controls the proportion of data held out for testing during the split. Together these parameters let users tailor the model selection process to their specific requirements, as illustrated in the results images below.
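
A usage example of these parameters, assuming the data-loading helpers from `test.py` and that `Boston.csv` is present with the `medv` target and delimiter configured there:

```python
from model import RidgeRegression, ModelSelector
from test import load_and_preprocess_data, split_data

X, y = load_and_preprocess_data('Boston.csv')           # target/delimiter set in test.py
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2)

selector = ModelSelector(X_train, X_test, y_train, y_test)
ridge = RidgeRegression(alpha=0.5)                       # regularization strength

print(selector.k_fold_cv(ridge, k=5))                    # number of folds
print(selector.bootstrap(ridge, n_iterations=200))       # bootstrap iterations
```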

# Instructions on how to use model.py and test.py
Here is how to set up and run model.py and test.py:

**model.py:**

* Keep it simple! Make sure the data file (like "winequality-red.csv") and the test script (test.py) are in the same location.
* Double-check that you have all the necessary libraries installed for the code to work.

**test.py:**

* This script handles loading the data. When importing the data, pay attention to the `delimiter` parameter (the character that separates values in the file).
* The target column (the quality score, or whatever your target variable is called) is selected by name. Update the lines:

```python
X = df.drop('FEATURE_NAME', axis=1).values
y = df['FEATURE_NAME'].values
```

* Replace `'FEATURE_NAME'` with the actual name of the quality score column in your data.

**Main Function:**

* In the main function, make sure the filename you provide for loading the data matches the actual file you're using. For example, if your data is in "Winequality-red.csv", update the line to:

```python
def main():
    X, y = load_and_preprocess_data('Winequality-red.csv')
```

**Tips:**

* Play around with the "K" value used for cross-validation and the number of iterations ("n_iterations") for bootstrapping. These can affect how well your model performs. The best settings will depend on the size of your data and your computer's processing power.
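
For example, a quick sweep over both settings (reusing the helpers above; `Boston.csv` and its target are the assumptions made in `test.py`):

```python
from model import LinearRegression, ModelSelector
from test import load_and_preprocess_data, split_data

X, y = load_and_preprocess_data('Boston.csv')
selector = ModelSelector(*split_data(X, y, test_size=0.2))

# Watch how stable the MSE estimates are as K and n_iterations grow.
for k in (4, 6, 10):
    print("k =", k, "->", selector.k_fold_cv(LinearRegression(), k=k))
for n in (50, 100, 200):
    print("n_iterations =", n, "->", selector.bootstrap(LinearRegression(), n_iterations=n))
```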

### Results
![Wine quality results](Wine_img.png) ![Boston housing results](boston_img.png)

See sections 7.10-7.11 of Elements of Statistical Learning and the lecture notes. Pay particular attention to Section 7.10.2.

As usual, above-and-beyond efforts will be considered for bonus points.
Binary file added Wine_img.png
Binary file added boston_img.png
106 changes: 106 additions & 0 deletions model.py
@@ -0,0 +1,106 @@
# model.py
import copy
import numpy as np
import matplotlib.pyplot as plt

class BaseRegression:
def __init__(self):
self.coefficients = None
self.intercept = None

def _add_bias(self, X):
return np.c_[np.ones((X.shape[0], 1)), X]

class LinearRegression(BaseRegression):
def fit(self, X, y):
X_b = self._add_bias(X)
theta = np.linalg.pinv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
self.intercept = theta[0]
self.coefficients = theta[1:]
return self

def predict(self, X):
return np.dot(X, self.coefficients) + self.intercept

class RidgeRegression(BaseRegression):
def __init__(self, alpha=1.0):
super().__init__()
self.alpha = alpha

def fit(self, X, y):
X_b = self._add_bias(X)
n_features = X_b.shape[1]
identity = np.eye(n_features)
identity[0, 0] = 0 # Don't regularize intercept
theta = np.linalg.inv(X_b.T.dot(X_b) + self.alpha * identity).dot(X_b.T).dot(y)
self.intercept = theta[0]
self.coefficients = theta[1:]
return self

def predict(self, X):
return np.dot(X, self.coefficients) + self.intercept

class ModelSelector:
def __init__(self, X_train, X_test, y_train, y_test):
self.X_train = X_train
self.X_test = X_test
self.y_train = y_train
self.y_test = y_test

def mse(self, y_true, y_pred):
return np.mean((y_true - y_pred) ** 2)

def aic(self, y_true, y_pred, k):
n = len(y_true)
mse = self.mse(y_true, y_pred)
return n * np.log(mse) + 2 * k

def k_fold_cv(self, model, k=6):
fold_size = len(self.X_train) // k
mse_scores = []

for i in range(k):
start_idx = i * fold_size
end_idx = start_idx + fold_size

X_val = self.X_train[start_idx:end_idx]
y_val = self.y_train[start_idx:end_idx]
X_train_fold = np.concatenate([self.X_train[:start_idx], self.X_train[end_idx:]])
y_train_fold = np.concatenate([self.y_train[:start_idx], self.y_train[end_idx:]])

            model_copy = copy.deepcopy(model)  # copy keeps hyperparameters (e.g., alpha) instead of resetting them
model_copy.fit(X_train_fold, y_train_fold)
y_pred = model_copy.predict(X_val)
mse_scores.append(self.mse(y_val, y_pred))

return np.mean(mse_scores)

def bootstrap(self, model, n_iterations=100):
n_samples = len(self.X_train)
mse_scores = []

for _ in range(n_iterations):
indices = np.random.choice(n_samples, n_samples, replace=True)
X_boot = self.X_train[indices]
y_boot = self.y_train[indices]

            model_copy = copy.deepcopy(model)  # copy keeps hyperparameters (e.g., alpha) instead of resetting them
model_copy.fit(X_boot, y_boot)
y_pred = model_copy.predict(self.X_test)
mse_scores.append(self.mse(self.y_test, y_pred))

return np.mean(mse_scores)

def evaluate_model(self, model):
# Train the model
model.fit(self.X_train, self.y_train)
y_pred_train = model.predict(self.X_train)
y_pred_test = model.predict(self.X_test)

results = {
#'train_mse': self.mse(self.y_train, y_pred_train),
'test_mse': self.mse(self.y_test, y_pred_test),
'kfold_mse': self.k_fold_cv(model),
'bootstrap_mse': self.bootstrap(model),
'aic': self.aic(self.y_test, y_pred_test, len(model.coefficients)),
}
return results
96 changes: 96 additions & 0 deletions test.py
@@ -0,0 +1,96 @@
# test.py
import numpy as np
import pandas as pd
from model import LinearRegression, RidgeRegression, ModelSelector
import matplotlib.pyplot as plt


def load_and_preprocess_data(file_path):
# Load data
    df = pd.read_csv(file_path, delimiter=';')  # Be sure the dataset is separated by ',' or ';' and set the delimiter accordingly
#df = pd.read_csv(file_path)
df = df.dropna() # Removing entire rows where there is any null in any column
# Separate features and target
X = df.drop('medv', axis=1).values
y = df['medv'].values

# Standardize features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
return X, y


def split_data(X, y, test_size=0.2):
n_samples = len(X)
indices = np.random.permutation(n_samples)
test_size = int(test_size * n_samples)

test_indices = indices[:test_size]
train_indices = indices[test_size:]

return (X[train_indices], X[test_indices],
y[train_indices], y[test_indices])


def plot_results(results):
models = list(results.keys())
metrics = ['test_mse', 'kfold_mse', 'bootstrap_mse']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot MSE comparisons
x = np.arange(len(models))
width = 0.2

for i, metric in enumerate(metrics):
values = [results[model][metric] for model in models]
ax1.bar(x + i*width, values, width, label=metric)

    ax1.set_xticks(x + width)  # center tick labels under each group of three bars
ax1.set_xticklabels(models)
ax1.set_ylabel('MSE')
ax1.set_title('MSE Comparison')
ax1.legend()

# Plot AIC scores
aic_scores = [results[model]['aic'] for model in models]
ax2.bar(models, aic_scores)
ax2.set_ylabel('AIC Score')
ax2.set_title('AIC Comparison')

plt.tight_layout()
plt.show()


def main():
# Load and prepare data
X, y = load_and_preprocess_data('Boston.csv')
X_train, X_test, y_train, y_test = split_data(X, y)


# Initialize models
models = {
'linear': LinearRegression(),
'ridge': RidgeRegression(alpha=1.0)
}

# Initialize model selector
selector = ModelSelector(X_train, X_test, y_train, y_test)

# Evaluate models
results = {}
for name, model in models.items():
results[name] = selector.evaluate_model(model)

print(f"\n{name.upper()} REGRESSION RESULTS:")
for metric, value in results[name].items():
print(f"{metric}: {value:.4f}")

# Find best model
best_model = min(results.items(), key=lambda x: x[1]['test_mse'])
print(f"\nBest model: {best_model[0]} (Test MSE: {best_model[1]['test_mse']:.4f})")

# Plot results
plot_results(results)

if __name__ == "__main__":
main()