diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 0000000..6edd84e
Binary files /dev/null and b/.DS_Store differ
diff --git a/README.md b/README.md
index c1e8359..74479b2 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,158 @@
-# Project 1
+# ElasticNet Regression
+
+## Table of Contents
+- Project Overview
+- Key Features
+- Setup and Installation
+- Model Explanation
+- How to Use
+- Code Examples
+- Adjustable Parameters
+- Known Limitations
+- Contributors
+- Q&A
+
+## Project Overview
+This project provides a fully custom implementation of ElasticNet Regression, built from the ground up using only NumPy and pandas. No external machine learning libraries such as scikit-learn or TensorFlow are used. The implementation aims to give a clear understanding of how ElasticNet operates and to show how the model can be optimized via gradient descent.
+
+ElasticNet is a linear regression model that combines L1 (Lasso) and L2 (Ridge) regularization, which makes it well suited to tasks involving correlated features or feature selection. Gradient descent is used to optimize the model.
+
+## Key Features
+- **Custom ElasticNet Regression**: Implements both L1 (Lasso) and L2 (Ridge) regularization for linear regression.
+- **Gradient Descent Optimization**: Manually optimizes weights using gradient descent, allowing full control over the learning process.
+
+## Setup and Installation
+### Prerequisites
+- Python 3.x
+- NumPy
+- pandas
+
+### Installation
+1. Clone this repository:
+   ```bash
+   git clone https://github.com/priyanshpsalian/ML_Project1.git
+   ```
+2. Create a virtual environment:
+   ```bash
+   python3 -m venv .venv
+   ```
+3. Activate the environment:
+   ```bash
+   source .venv/bin/activate  # On Unix or macOS
+   .venv\Scripts\activate     # On Windows
+   ```
+4. Install the required dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+5. Run the test file to see the results:
+   ```bash
+   python3 -m elasticnet.tests.test_ElasticNetModel
+   ```
+
+## Model Explanation
+ElasticNet is a regularized version of linear regression that combines the benefits of L1 and L2 regularization. It is effective when we want both variable selection (L1) and coefficient shrinkage (L2), or when features are correlated.
+
+### Objective Function
+
+ElasticNet minimizes the following cost (written here to match the gradient used in the implementation):
+
+$$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 + \alpha\left(\lambda\,\lVert w\rVert_1 + \frac{1-\lambda}{2}\,\lVert w\rVert_2^2\right)$$
+
+- **Alpha** ($\alpha$) controls the strength of regularization.
+- **l1_ratio** ($\lambda$) determines the mix between L1 (Lasso) and L2 (Ridge).
+
+## How to Use
+You can initialize and train the ElasticNet model using the provided `ElasticNetModel` class:
+```python
+from elasticnet.models import ElasticNetModel
+
+# Initialize the model
+model = ElasticNetModel(alpha=1.0, l1_ratio=0.5, max_iter=2000, convergence_criteria=1e-4, step_size=0.005, bias_term=True)
+
+# Fit the model to data
+model.fit(X_train, y_train)
+
+# Make predictions
+predictions = model.predict(X_test)
+```
+
+## Code Examples
+`fit` returns an `ElasticNetModelResults` object, which can be used for prediction and evaluation:
+```python
+# Fit the model
+outcome = model.fit(X_train, y_train)
+
+# Predict
+y_pred = outcome.predict(X_test)
+
+# Evaluate
+r2 = outcome.r2_score(y_test, y_pred)
+rmse = outcome.rmse(y_test, y_pred)
+
+print(f"R² Score: {r2}")
+print(f"RMSE: {rmse}")
+```
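+
+The snippets above assume `X_train`, `y_train`, `X_test`, and `y_test` already exist. For a fully self-contained run, here is a minimal sketch on synthetic data (the data generation and variable names are illustrative only, not part of the library):
+```python
+import numpy as np
+from elasticnet.models import ElasticNetModel
+
+# Synthetic regression data: y depends only on the first two features,
+# so the L1 penalty should shrink the remaining coefficients toward zero.
+rng = np.random.default_rng(42)
+X = rng.normal(size=(200, 5))
+y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
+
+# Simple 80/20 train/test split
+split = int(0.8 * len(X))
+X_train, X_test = X[:split], X[split:]
+y_train, y_test = y[:split], y[split:]
+
+model = ElasticNetModel(alpha=0.1, l1_ratio=0.5)
+results = model.fit(X_train, y_train)
+y_pred = results.predict(X_test)
+print(f"R² Score: {results.r2_score(y_test, y_pred):.4f}")
+```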
+
+## Adjustable Parameters
+
+- **alpha**: Overall strength of regularization. Must be a positive float. Default is 1.0.
+- **l1_ratio**: Balances the L1 (Lasso) and L2 (Ridge) penalties, where 0 means pure Ridge and 1 means pure Lasso. Default is 0.5.
+- **max_iter**: The maximum number of passes over the training data, i.e. the number of gradient descent iterations. Higher values allow more fine-tuning, at the cost of more computation. Default is 2000.
+- **convergence_criteria**: Tolerance for the stopping criterion. Training stops early once the L1 norm of the gradient falls below this value. Default is 1e-4.
+- **step_size**: The learning rate: how much the coefficients are adjusted at each gradient descent step. Small values can lead to slower convergence but more precise results. Default is 0.005.
+- **bias_term**: Boolean indicating whether an intercept should be added to the model. If True, an intercept term is fitted. Default is True.
+
+## Known Limitations
+- **Slow convergence**: The model may converge slowly on large datasets or under significant multicollinearity. Alternative optimization methods such as coordinate descent could improve convergence.
+- **Precision**: Compared to closed-form solutions, gradient descent may not reach the level of precision required by some applications.
+
+## Contributors
+- Priyansh Salian (A20585026, psalian1@hawk.iit.edu)
+- Shruti Ramchandra Patil (A20564354, spatil80@hawk.iit.edu)
+- Pavitra Sai Vegiraju (A20525304, pvegiraju@hawk.iit.edu)
+- Mithila Reddy (A20542879, Msingireddy@hawk.iit.edu)
+
+## Q&A
+
+### What does the model you have implemented do, and when should it be used?
+ElasticNet Regression handles regression tasks involving multicollinearity (correlation between predictors) and feature selection. It combines the L1 (Lasso) and L2 (Ridge) penalties to strike a balance between coefficient shrinkage and variable selection.
+
+### How did you test your model to determine if it is working reasonably correctly?
+The model was tested on synthetic data with known relationships between predictors and the target variable. Accuracy was assessed by comparing predictions against actual values using the R² and RMSE metrics.
+
+### What parameters have you exposed to users of your implementation in order to tune performance?
+- **alpha**: The overall strength of regularization, scaling both the L1 and L2 penalties.
+- **l1_ratio**: The mix between the L1 (Lasso) and L2 (Ridge) penalties.
+- **step_size**: The gradient descent step size.
+- **max_iter**: The maximum number of gradient descent iterations.
+- **convergence_criteria**: Tolerance for the stopping criterion; training ends once the gradient norm drops below this tolerance. Default is 1e-4.
+- **bias_term**: Boolean indicating whether an intercept should be fitted. If True, an intercept term is introduced into the model. Default is True.
+
+### Are there specific inputs that your implementation has trouble with? Given more time, could you work around these, or is it fundamental to the model?
+Large datasets or datasets with extreme multicollinearity can be challenging for the current solution, since gradient descent may converge slowly. Given more time, an optimization technique such as coordinate descent could be used to accelerate convergence, particularly in high-dimensional settings; a sketch of that update follows.
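+
+As a rough illustration of that workaround, here is a minimal sketch of the closed-form coordinate descent update for the same ElasticNet objective. It is not part of this implementation, and it assumes features have already been standardized to zero mean and unit variance:
+```python
+import numpy as np
+
+def soft_threshold(rho, threshold):
+    # S(rho, t) = sign(rho) * max(|rho| - t, 0)
+    return np.sign(rho) * np.maximum(np.abs(rho) - threshold, 0.0)
+
+def coordinate_descent_sweep(X, y, w, alpha, l1_ratio):
+    # One full pass over the coordinates; repeat sweeps until w stops changing.
+    m, n = X.shape
+    for j in range(n):
+        # Partial residual with feature j's current contribution added back
+        residual = y - X.dot(w) + X[:, j] * w[j]
+        rho = X[:, j].dot(residual) / m
+        # Closed-form ElasticNet coordinate update (soft-thresholding)
+        w[j] = soft_threshold(rho, alpha * l1_ratio) / (1.0 + alpha * (1.0 - l1_ratio))
+    return w
+```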
-Put your README here. Answer the following questions.
-* What does the model you have implemented do and when should it be used?
-* How did you test your model to determine if it is working reasonably correctly?
-* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
-* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
diff --git a/elasticnet/models/ElasticNet.py b/elasticnet/models/ElasticNet.py
index 017e925..2335a1f 100644
--- a/elasticnet/models/ElasticNet.py
+++ b/elasticnet/models/ElasticNet.py
@@ -1,17 +1,90 @@
+import numpy as np
+import pandas as pd
+
+
+class ElasticNetModel:
+    def __init__(self, **kwargs):
+        defaults = {
+            'alpha': 1.0,
+            'l1_ratio': 0.5,
+            'max_iter': 2000,
+            'convergence_criteria': 1e-4,
+            'step_size': 0.005,
+            'bias_term': True
+        }
+        defaults.update(kwargs)
+        self.parameter_values = None
+        self.average_value = None
+        self.standard_deviation = None
-class ElasticNetModel():
-    def __init__(self):
-        pass
+        for key, value in defaults.items():
+            setattr(self, key, value)
-    def fit(self, X, y):
-        return ElasticNetModelResults()
+    def fit(self, X, y, categorical_features=None):
+        y = y.astype(float).flatten()
+        if not isinstance(X, pd.DataFrame):
+            X = pd.DataFrame(X)
+        X = X.astype(float)
+        X = pd.get_dummies(X, drop_first=True, columns=categorical_features)
-class ElasticNetModelResults():
-    def __init__(self):
-        pass
+        # Standardize the features (zero mean, unit variance)
+        self.average_value = X.mean(axis=0)
+        self.standard_deviation = X.std(axis=0)
+        X = (X - self.average_value) / self.standard_deviation
+        m, n = X.shape
+        self.parameter_values = np.zeros(n + 1) if self.bias_term else np.zeros(n)
-    def predict(self, x):
-        return 0.5
+        if self.bias_term:
+            X = np.hstack([np.ones((m, 1)), X])
+
+        # Gradient Descent Optimization
+        for iteration in range(self.max_iter):
+            predictions = X.dot(self.parameter_values)
+            error = predictions - y
+            derivative_array = (1 / m) * X.T.dot(error)
+
+            # The intercept (if present) is updated separately and is not regularized
+            offset = 0
+            if self.bias_term:
+                self.parameter_values[0] -= self.step_size * derivative_array[0]
+                derivative_array = derivative_array[1:]
+                offset = 1
+
+            weights = self.parameter_values[offset:]
+            l1 = self.l1_ratio * np.sign(weights)
+            l2 = (1 - self.l1_ratio) * weights
+            reg = self.alpha * (l1 + l2)
+            self.parameter_values[offset:] -= self.step_size * (derivative_array + reg)
+
+            # Stop once the L1 norm of the gradient falls below the tolerance
+            if np.linalg.norm(derivative_array, ord=1) < self.convergence_criteria:
+                break
+
+        return ElasticNetModelResults(self)
+
+    def predict(self, X):
+        if not isinstance(X, pd.DataFrame):
+            X = pd.DataFrame(X)
+        X = X.astype(float)
+
+        # Apply the same standardization learned during fit
+        X = (X - self.average_value) / self.standard_deviation
+        if self.bias_term:
+            X = np.hstack([np.ones((X.shape[0], 1)), X])
+        return X.dot(self.parameter_values)
+
+
+class ElasticNetModelResults:
+    def __init__(self, model):
+        self.model = model
+
+    def predict(self, X):
+        return self.model.predict(X)
+
+    def r2_score(self, y_true, y_pred):
+        y_true = np.asarray(y_true)
+        y_pred = np.asarray(y_pred)
+        total_ss = np.sum((y_true - np.mean(y_true)) ** 2)
+        residual_ss = np.sum((y_true - y_pred) ** 2)
+        return 1 - (residual_ss / total_ss)
+
+    def rmse(self, y_true, y_pred):
+        y_true = np.asarray(y_true)
+        y_pred = np.asarray(y_pred)
+        return np.sqrt(np.mean((y_true - y_pred) ** 2))
diff --git a/elasticnet/models/__init__.py b/elasticnet/models/__init__.py
index e69de29..56dadf0 100644
--- a/elasticnet/models/__init__.py
+++ b/elasticnet/models/__init__.py
@@ -0,0 +1 @@
+from .ElasticNet import ElasticNetModel
\ No newline at end of file
diff --git a/elasticnet/tests/test_ElasticNetModel.py b/elasticnet/tests/test_ElasticNetModel.py
index 5022c3c..a32184b 100644
--- a/elasticnet/tests/test_ElasticNetModel.py
+++ b/elasticnet/tests/test_ElasticNetModel.py
@@ -1,19 +1,70 @@
-import csv
-
 import numpy
-
-from elasticnet.models.ElasticNet import ElasticNetModel
+import csv
+from ..models.ElasticNet import ElasticNetModel
+from ..models.ElasticNet import ElasticNetModelResults
 
 def test_predict():
-    model = ElasticNetModel()
     data = []
-    with open("small_test.csv", "r") as file:
+    with open("elasticnet/tests/small_test.csv", "r") as file:
         reader = csv.DictReader(file)
         for row in reader:
             data.append(row)
 
     X = numpy.array([[v for k,v in datum.items() if k.startswith('x')] for datum in data])
     y = numpy.array([[v for k,v in datum.items() if k=='y'] for datum in data])
-    results = model.fit(X,y)
-    preds = results.predict(X)
-    assert preds == 0.5
+    X = X.astype(float)
+    y = y.astype(float).flatten()
+
+    # Data is being split into training and testing sets
+    split_idx = int(0.8 * len(X))
+    X_train, X_test = X[:split_idx], X[split_idx:]
+    y_train, y_test = y[:split_idx], y[split_idx:]
+
+    # Hyperparameter optimization through cross-validation
+    best_cv_score = -numpy.inf
+    leading_parameters = {}
+
+    kcf = 5
+    segment_length = len(X_train) // kcf
+
+    for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
+        for l1_ratio in [0.1, 0.3, 0.5, 0.7, 0.9]:
+            validation_scores = []
+            for i in range(kcf):
+                # Hold out fold i for validation and train on the remaining folds
+                X_train_segment = numpy.concatenate((X_train[:i*segment_length], X_train[(i+1)*segment_length:]), axis=0)
+                y_train_segment = numpy.concatenate((y_train[:i*segment_length], y_train[(i+1)*segment_length:]), axis=0)
+                X_validation_subset = X_train[i*segment_length:(i+1)*segment_length]
+                y_validation_subset = y_train[i*segment_length:(i+1)*segment_length]
+
+                temp_model = ElasticNetModel(alpha=alpha, l1_ratio=l1_ratio, max_iter=2000, convergence_criteria=1e-4, step_size=0.005, bias_term=True)
+                temp_model.fit(X_train_segment, y_train_segment)
+                predicted_y_values = temp_model.predict(X_validation_subset)
+                model_results = ElasticNetModelResults(temp_model)
+                validation_scores.append(model_results.r2_score(y_validation_subset, predicted_y_values))
+
+            mean_evaluation = numpy.mean(validation_scores)
+            if mean_evaluation > best_cv_score:
+                best_cv_score = mean_evaluation
+                leading_parameters = {'alpha': alpha, 'l1_ratio': l1_ratio}
+
+    # Display the optimal results of the model
+    print("--- Optimal Model Performance Metrics ---")
+    print(f"Optimal R² Value Achieved Through Cross-Validation: {best_cv_score:.4f}")
+    print(f"Optimal Alpha Value for Model Performance: {leading_parameters['alpha']}")
+    print(f"Optimal L1 Ratio Value for Model Performance: {leading_parameters['l1_ratio']}")
+
+    # Build the final model using the optimal configuration settings
+    final_model = ElasticNetModel(max_iter=2000, convergence_criteria=1e-4, step_size=0.005, alpha=leading_parameters['alpha'], l1_ratio=leading_parameters['l1_ratio'], bias_term=True)
+    results = final_model.fit(X_train, y_train)
+
+    # Generate predictions for the testing data set
+    y_pred_test = results.predict(X_test)
+
+    # Set up the results object for performance evaluation
+    result_model = ElasticNetModelResults(final_model)
+
+    # Compute and present evaluation metrics
+    print("----------------------------------------------")
+    print("--- Performance Evaluation Summary on the Test Data Set ---")
+    print(f"R² Score: {result_model.r2_score(y_test, y_pred_test):.4f}")
+    print(f"RMSE: {result_model.rmse(y_test, y_pred_test):.4f}")
+
+
+test_predict()
\ No newline at end of file
diff --git a/regularized_discriminant_analysis/models/RegularizedDiscriminantAnalysis.py b/regularized_discriminant_analysis/models/RegularizedDiscriminantAnalysis.py
deleted file mode 100644
index 089f9ad..0000000
--- a/regularized_discriminant_analysis/models/RegularizedDiscriminantAnalysis.py
+++ /dev/null
@@ -1,17 +0,0 @@
-
-
-class RDAModel():
-    def __init__(self):
-        pass
-
-
-    def fit(self, X, y):
-        return RDAModelResults()
-
-
-class RDAModelResults():
-    def __init__(self):
-        pass
-
-    def predict(self, x):
-        return 0.5
diff --git a/regularized_discriminant_analysis/test_rdamodel.py b/regularized_discriminant_analysis/test_rdamodel.py
deleted file mode 100644
index 095725b..0000000
--- a/regularized_discriminant_analysis/test_rdamodel.py
+++ /dev/null
@@ -1,19 +0,0 @@
-import csv
-
-import numpy
-
-from regularized_discriminant_analysis.models.RegularizedDiscriminantAnalysis import RDAModel
-
-def test_predict():
-    model = ElasticNetModel()
-    data = []
-    with open("small_sample.csv", "r") as file:
-        reader = csv.DictReader(file)
-        for row in reader:
-            data.append(row)
-
-    X = numpy.array([[v for k,v in datum.items() if k.startswith('x')] for datum in data])
-    y = numpy.array([[v for k,v in datum.items() if k=='y'] for datum in data])
-    results = model.fit(X,y)
-    preds = results.predict(X)
-    assert preds == 0.5
diff --git a/requirements.txt b/requirements.txt
index 18af45d..9a2f93c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,4 @@
 numpy
 pytest
 ipython
+pandas
\ No newline at end of file