This project implements an ElasticNet regression model from scratch, using NumPy for all the numerical calculations. The ElasticNet model is a type of linear regression that uses both L1 (Lasso) and L2 (Ridge) penalties. It’s especially useful when dealing with datasets that have lots of features, some of which may be irrelevant or highly correlated with others.
Unlike prebuilt libraries like Scikit-Learn, this implementation relies on manually coded gradient descent to optimize the model’s weights.
The ElasticNetModel class includes two main methods:
- fit(X, y): Trains the model on the dataset.
- predict(X): Makes predictions based on the trained model.
This project meets the following requirements:
- Algorithm Implementation: The ElasticNet regression is implemented from scratch, combining both L1 and L2 regularization penalties with gradient descent.
- From First Principles: The model uses NumPy for matrix calculations, with no prebuilt machine learning libraries (like Scikit-Learn or Statsmodels).
- Testing the Model: We tested the model using a custom script that runs it on a dataset and evaluates performance using metrics like Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²). We also generated visualizations to assess the model's performance.
- Flexible Input: The model can handle any numerical dataset with proper preprocessing and normalization. The test script works with a provided test.csv dataset generated by a separate script.
The ElasticNet model I’ve implemented is a type of linear regression that combines two types of regularization: L1 (Lasso) and L2 (Ridge). This makes it useful when you have a dataset with many features, especially when some of those features might not be important or are highly correlated with each other.
ElasticNet helps prevent overfitting by penalizing large coefficients and can also automatically select important features by driving some coefficients to zero. It’s a good choice when you’re dealing with high-dimensional data or when you want a balance between selecting features and generalizing well.
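The core update behind this kind of model can be sketched in a few lines of NumPy. The snippet below is an illustrative gradient-descent step for the ElasticNet loss (MSE plus an L1 and an L2 penalty), not the project's exact implementation; the function name and loss scaling are assumptions for the sketch.

```python
import numpy as np

# One illustrative ElasticNet gradient-descent step, minimising
#   L(w) = (1/n) * ||X w - y||^2 + l1_penalty * ||w||_1 + l2_penalty * ||w||^2
def elastic_net_step(X, y, w, learning_rate=0.01, l1_penalty=0.5, l2_penalty=0.3):
    n = X.shape[0]
    residuals = X @ w - y
    grad = (2 / n) * X.T @ residuals   # gradient of the MSE term
    grad += l1_penalty * np.sign(w)    # subgradient of the L1 term
    grad += 2 * l2_penalty * w         # gradient of the L2 term
    return w - learning_rate * grad

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
y = np.array([3.0, 4.0, 5.0, 6.0])
w = np.zeros(2)
for _ in range(1000):
    w = elastic_net_step(X, y, w)
```

Because of the penalties, the converged weights are deliberately biased toward zero; that is the mechanism that shrinks unimportant coefficients.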
We tested the model by writing a script that loads a dataset, trains the ElasticNet model on it, and then evaluates how well it performs using standard metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²). These metrics give a good sense of how accurate the model's predictions are compared to the actual values.
We also generated visualizations, including a plot of actual vs. predicted values, a residuals plot, and a bar plot showing the importance of each feature (based on the learned coefficients). These helped visually confirm that the model works as expected.
We've exposed several parameters that allow users to tweak the model's behavior:
- l1_penalty: Controls the strength of L1 regularization, which helps with feature selection.
- l2_penalty: Controls the strength of L2 regularization, which helps prevent overfitting.
- learning_rate: Adjusts the step size during the training process. A smaller value means slower but more precise updates.
- max_iterations: Sets how many times the model will update its weights during training.
- tolerance: Decides when the training should stop by checking if the weight updates have become very small.
Here’s an example of how to use these parameters:
from elasticnet.models.ElasticNet import ElasticNetModel
# Example usage of ElasticNet
model = ElasticNetModel(l1_penalty=0.5, l2_penalty=0.3, learning_rate=0.01, max_iterations=5000)
X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [3, 4, 5, 6]
model.fit(X, y)
predictions = model.predict(X)
print(predictions)

Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
The model works well with numerical data, but there are a few things it doesn’t handle well:
- Categorical data: If you have non-numeric data (like "male" or "female"), you'll need to convert it into numbers before passing it into the model. Right now, this must be done manually. With more time, a feature could be added to handle categorical data automatically.
- Missing data: The model expects all input values to be valid numbers. If there are missing or NaN values, they need to be filled in before training the model. This could be improved by adding automatic handling for missing data (e.g., filling with mean values).
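The manual preprocessing described above can be done with plain NumPy. The helper names below are hypothetical (they are not part of the project); they show one way to integer-encode a categorical column and mean-fill NaN entries before calling fit().

```python
import numpy as np

# Hypothetical helpers for the manual preprocessing the model expects.

def encode_categorical(column):
    # Map each distinct string to an integer code, e.g. "female" -> 0, "male" -> 1
    categories = sorted(set(column))
    mapping = {cat: i for i, cat in enumerate(categories)}
    return np.array([mapping[v] for v in column], dtype=float)

def fill_missing_with_mean(X):
    # Replace NaN entries in each column with that column's mean
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmean(col)
    return X

gender = ["male", "female", "female", "male"]
encoded = encode_categorical(gender)       # -> [1.0, 0.0, 0.0, 1.0]

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0]])
filled = fill_missing_with_mean(X)
```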
ElasticNet is particularly useful in these scenarios:
- If you want to encourage sparsity in your model (meaning some of the less important features will be ignored), ElasticNet’s L1 regularization helps with that.
- If you’re working with a large number of features or when some features are highly correlated, ElasticNet's combination of L1 and L2 regularization helps handle this better than basic linear regression or Lasso alone.
- If you're worried about overfitting, the L2 penalty helps keep the model generalized by preventing large coefficients.
First, make sure you have NumPy and Matplotlib installed. You can install them using pip:
pip install numpy matplotlib
Before running any scripts, make sure your Python environment is set up to find the project's modules. You can do this by setting the PYTHONPATH environment variable to your current directory:
export PYTHONPATH=$PWD
If you are using Command Prompt on Windows, use:
set PYTHONPATH=%cd%
This tells Python where to look for the project's files.
You’ll need to generate the dataset before running the model. Use the generate_test_CSV.py script to create the test.csv file:
python generate_test_CSV.py
This will generate a dataset that the model can use for training and testing.
Once the dataset has been generated, you can run the ElasticNet model by using the test script:
python elasticnet/tests/test_ElasticNetModel.py
This will:
- Load the test.csv dataset.
- Preprocess the data (if necessary, convert categorical features to numerical form).
- Train the model using fit().
- Predict and evaluate the results using predict().
- Generate visualizations like "Actual vs Predicted" and "Residuals."
We used the following metrics to evaluate how well the model performed:
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.
- Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.
- R-squared (R²): This tells you how well the model fits the data. The closer to 1, the better.
Here’s an example of the output:
Mean Squared Error (MSE): 1.85
Mean Absolute Error (MAE): 1.2
R-squared (R²): 0.75
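All three metrics can be computed directly with NumPy. This is a generic sketch of the formulas (the input values below are made up for illustration, not the project's actual test results):

```python
import numpy as np

def evaluate(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)             # mean squared error
    mae = np.mean(np.abs(y_true - y_pred))            # mean absolute error
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                          # coefficient of determination
    return mse, mae, r2

y_true = np.array([3.0, 4.0, 5.0, 6.0])
y_pred = np.array([2.8, 4.1, 5.3, 5.9])
mse, mae, r2 = evaluate(y_true, y_pred)
```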
The test script will generate several plots to help visualize the model’s performance:
- Actual vs Predicted: A scatter plot that compares the actual target values to the predicted values.
- Residuals: A plot showing the differences between the actual and predicted values.
- Distribution of Target Values: A histogram that shows the spread of the target variable.
- Feature Weights: A bar chart showing the learned importance of each feature in the model.
The current implementation handles numerical datasets well, but there are a few things to keep in mind:
- Categorical Data: If your dataset has non-numeric columns (like 'male' and 'female'), you’ll need to convert them to numbers before using the model.
- Missing Values: If your dataset has any missing values (NaN), you should handle those first by either filling them in or dropping the rows.
With more time, the following improvements could be made:
- Automating the preprocessing of categorical data so that users don’t need to manually encode it.
- Adding better handling for missing data by automatically filling or dropping missing values.
- Implementing cross-validation to automatically tune the regularization parameters (l1_penalty, l2_penalty) for better performance.
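The cross-validation idea in the last point could be built on a simple k-fold index generator. The sketch below only produces the train/validation splits; fitting ElasticNetModel on each split for every (l1_penalty, l2_penalty) candidate and keeping the lowest-MSE pair is left as a comment, since that tuning loop is a proposed improvement, not existing code.

```python
import numpy as np

# Sketch of a k-fold splitter for the proposed hyperparameter tuning.
def k_fold_indices(n_samples, k=5, seed=0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)       # shuffle sample indices
    folds = np.array_split(indices, k)         # k near-equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Candidate penalty grid (values are illustrative)
grid = [(l1, l2) for l1 in (0.1, 0.5, 1.0) for l2 in (0.1, 0.3)]

for train_idx, val_idx in k_fold_indices(20, k=5):
    # For each (l1, l2) in grid: fit on train_idx, score MSE on val_idx,
    # and keep the pair with the lowest average validation MSE.
    pass
```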
Krishna Manideep Malladi (A20550891)
Udaya Sree Vankdavath (A20552992)
Manvitha Byrineni (A20550783)


