This project demonstrates the implementation of a custom linear regression model in Python and evaluates its performance using K-Fold Cross-Validation, Bootstrapping, and Akaike Information Criterion (AIC). The project uses a dataset named heart.csv for testing and validation.
The goal is to test and evaluate the performance of a custom linear regression model on any dataset using the following metrics:
- K-Fold Cross-Validation Mean Squared Error (MSE)
- Bootstrapping Mean Squared Error (MSE)
- Akaike Information Criterion (AIC)
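As an illustration of the first metric, here is a minimal sketch of K-Fold Cross-Validation MSE for a least-squares linear model. It is not the code in model.py: the function names (`fit_linear`, `predict_linear`, `k_fold_mse`) are hypothetical, and it assumes the features and target are already NumPy arrays.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict with coefficients produced by fit_linear."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ coef

def k_fold_mse(X, y, k=5, random_seed=0):
    """Average held-out MSE over k shuffled folds (hypothetical helper)."""
    rng = np.random.default_rng(random_seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = fit_linear(X[train_idx], y[train_idx])
        residuals = y[test_idx] - predict_linear(coef, X[test_idx])
        scores.append(np.mean(residuals ** 2))
    return float(np.mean(scores))
```

Each sample is held out exactly once, so the returned value averages the error on data the model did not see during fitting.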
Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
Yes, in our testing with the heart.csv dataset using linear regression, the model selectors produced consistent results:
- K-Fold Cross-Validation MSE: 0.1241
- Bootstrapping MSE: 0.1254
- AIC: -2137.2136
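These results indicate that cross-validation, bootstrapping, and AIC agree when evaluating the model’s fit to the data for simple cases: the two resampling estimates of MSE differ by only about 0.001. Note that AIC is on a different scale from MSE, so its value is best compared across candidate models rather than against the MSE figures directly.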
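For reference, one common form of AIC for least-squares regression with Gaussian errors (dropping additive constants) is n·ln(RSS/n) + 2k, where n is the number of samples, RSS is the residual sum of squares, and k is the number of fitted parameters. Whether model.py uses exactly this form is an assumption; the sketch below, with the hypothetical name `gaussian_aic`, only shows how a negative AIC can arise when the mean squared residual is below 1.

```python
import numpy as np

def gaussian_aic(y_true, y_pred, num_params):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * num_params.

    Assumes Gaussian errors and a non-zero residual sum of squares.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    rss = float(np.sum((y_true - y_pred) ** 2))
    return n * np.log(rss / n) + 2 * num_params
```

Because ln(RSS/n) is negative whenever RSS/n is below 1, the first term grows more negative with the sample size, while the 2k term penalizes each extra parameter.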
- Small datasets: Bootstrapping may produce biased results due to repeated sampling from limited data.
- Imbalanced data: K-Fold Cross-Validation may give unreliable estimates if folds are not stratified, since some folds may contain few or no examples of the minority class.
- High-dimensional data: AIC may under-penalize complex models when the number of parameters is large relative to the sample size, leading to biased selection.
- Outliers: All methods may give undesirable results if outliers dominate the dataset.
- Implement Stratified K-Folds for handling imbalanced datasets.
- Incorporate robust regression techniques to handle outliers effectively.
- Use Bayesian Information Criterion (BIC) alongside AIC for high-dimensional datasets.
- Add automated dataset analysis and warnings for users regarding dataset size, balance, or presence of outliers.
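As a rough idea of what the stratified variant could look like, the sketch below assigns sample indices to folds so that each fold keeps approximately the same class proportions. It assumes the target takes a small number of discrete values; the function name `stratified_fold_indices` is hypothetical and not part of the current code.

```python
import numpy as np

def stratified_fold_indices(y, k=5, random_seed=0):
    """Split indices into k folds with roughly equal class proportions."""
    rng = np.random.default_rng(random_seed)
    y = np.asarray(y)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        # Shuffle the indices of this class, then deal them out round-robin.
        label_idx = rng.permutation(np.flatnonzero(y == label))
        for position, idx in enumerate(label_idx):
            folds[position % k].append(int(idx))
    return [np.array(sorted(fold), dtype=int) for fold in folds]
```

Each returned array can serve as the held-out fold, with the remaining indices used for training.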
- K-Fold Cross-Validation:
  - k: Number of folds.
  - metric: Metric to evaluate (e.g., MSE or R²).
  - random_seed: Seed for reproducibility.
- Bootstrapping:
  - num_samples: Number of bootstrap samples.
  - metric: Metric to evaluate (e.g., MSE or R²).
  - random_seed: Seed for reproducibility.
- AIC:
  - Requires no additional parameters and uses the full dataset.
These parameters allow users to adapt the methods to their specific datasets and modeling requirements.
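To show how these parameters might fit together, here is a minimal bootstrapping sketch that takes num_samples, metric, and random_seed. It is an assumption about the design, not the code in model.py: the names `mse` and `bootstrap_score` are hypothetical, and scoring on the out-of-bag rows (those not drawn into a given resample) is one common choice among several.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error; any callable with this signature can be a metric."""
    return float(np.mean((y_true - y_pred) ** 2))

def bootstrap_score(X, y, num_samples=100, metric=mse, random_seed=0):
    """Average out-of-bag metric over bootstrap resamples (hypothetical helper)."""
    rng = np.random.default_rng(random_seed)
    n = len(X)
    scores = []
    for _ in range(num_samples):
        train_idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[train_idx] = False             # rows never drawn are out-of-bag
        if not oob_mask.any():
            continue                            # nothing left to evaluate on
        X_train = np.column_stack([np.ones(n), X[train_idx]])
        coef, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
        X_oob = np.column_stack([np.ones(int(oob_mask.sum())), X[oob_mask]])
        scores.append(metric(y[oob_mask], X_oob @ coef))
    return float(np.mean(scores))
```

Fixing random_seed makes the resamples, and therefore the reported score, repeatable across runs; an R² function with the same signature could be passed as metric.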
- Clone this repository.
- Install the required Python libraries (matplotlib for plotting) from the command line.
- To test with a different dataset, replace the following values:
  - file_path: Path to the CSV file.
  - target_column: Name of the target column (for example, in heart.csv it is target).
- Then run the model.py file.
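For readers adapting the script to another dataset, the following sketch shows one way file_path and target_column could be used to split a CSV into features and target. It assumes every column is numeric and the file has a header row; the function name `load_dataset` is hypothetical and may not match what model.py does internally.

```python
import csv
import numpy as np

def load_dataset(file_path, target_column):
    """Read a numeric CSV and split it into features X and target y."""
    with open(file_path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    if target_column not in columns:
        raise ValueError(f"Column {target_column!r} not found in {file_path}")
    feature_names = [name for name in columns if name != target_column]
    X = np.array([[float(row[name]) for name in feature_names] for row in rows])
    y = np.array([float(row[target_column]) for row in rows])
    return X, y, feature_names
```

With the bundled dataset, a call would look like `X, y, names = load_dataset("heart.csv", "target")`.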