808 changes: 808 additions & 0 deletions ML_Project2_Volleyvission.ipynb


318 changes: 300 additions & 18 deletions README.md
@@ -1,29 +1,311 @@
# **Volleyball Player Statistics Analysis for NCAA Division 3 (2024)**

---

### **Overview**

**PR Name**: *"Analysis of Volleyball Player Performance Metrics Using Machine Learning Techniques"*

This project analyzes player statistics from the Illinois Tech volleyball team for NCAA Division 3 (2024). The analysis evaluates the relationship between various player performance metrics and total points scored (`PTS`), focusing on model selection techniques such as **k-Fold Cross-Validation** and **Bootstrap .632**, alongside interpretive data visualizations that provide insight into player performance.

---

### **How to Run the Code**

1. **Prepare the Required Files**:
- Download the `.ipynb` file containing the code.
- Download the dataset file (`tabula-mvb_stats_2024.csv`).

2. **Upload Files to Google Colab**:
- Open [Google Colab](https://colab.research.google.com/).
- Upload both the `.ipynb` file and the dataset file by clicking on the folder icon in the left sidebar and then the upload button.

3. **Set the Dataset Path**:
- After uploading, copy the file path for the dataset from the Colab file manager (e.g., `/content/tabula-mvb_stats_2024.csv`).
- Replace the `file_path` variable in the code with the copied path:
```python
file_path = '/content/tabula-mvb_stats_2024.csv'
```

4. **Install Missing Libraries (If Any)**:
- If the code encounters a missing library error, install it by running:
```python
!pip install <library_name>
```
- Replace `<library_name>` with the name of the required library (e.g., `seaborn` or `scikit-learn`).

5. **Execute the Notebook**:
- Run the cells sequentially in Google Colab.
- Ensure that the dataset path is correctly set before running the code to avoid errors.

6. **View the Results**:
- The outputs, including model evaluation metrics and visualizations, will be displayed in the Colab notebook.

**Note**: The code has been optimized for **Google Colab**. Running it in other IDEs, such as Visual Studio Code, may result in incomplete visual outputs (e.g., heatmaps). Always use Google Colab for consistent and accurate results.

---


## **Implementation Details**

### **1. Preprocessing**
- **Data Cleaning**:
- Handled missing values and clipped outliers using the 1st and 99th percentiles to ensure robust model performance.
- **Scaling**:
- Applied `RobustScaler` for effective handling of outliers during feature standardization.
- **Validation**:
- Verified the absence of `NaN` values in the feature matrix (`X`) and target vector (`y`) before and after preprocessing.
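
Below is a minimal sketch of this preprocessing, assuming the dataset path from the setup steps above; it is illustrative rather than the notebook's exact code.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.read_csv('/content/tabula-mvb_stats_2024.csv')

# Keep numeric columns and fill any missing values with the column median.
numeric = df.select_dtypes(include='number')
numeric = numeric.fillna(numeric.median())

# Clip each column to its 1st and 99th percentiles to tame extreme values.
clipped = numeric.clip(numeric.quantile(0.01), numeric.quantile(0.99), axis=1)

# Separate features and target, dropping the jersey-number identifier, then
# scale features with RobustScaler (centres on the median, scales by the IQR).
X = clipped.drop(columns=['PTS', '#'], errors='ignore')
y = clipped['PTS']
X_scaled = RobustScaler().fit_transform(X)
```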

---

### **2. Model Selection Techniques**

#### **k-Fold Cross-Validation**
- **Description**:
- Splits the data into 5 folds, training on 4 and testing on 1 iteratively, to evaluate the model's generalization error.
- Calculates the Mean Squared Error (MSE) across all folds.
- **Output**:
- **k-Fold Cross-Validation MSE**: `3231.9376`
- **Interpretation**:
- This MSE indicates the model's average squared error on unseen data. A lower value suggests better generalization to new samples.
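
A rough sketch of how this estimate can be produced with plain linear regression, assuming `X_scaled` and `y` from the preprocessing sketch above (the fold count and seed are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mse = []
for train_idx, test_idx in kf.split(X_scaled):
    # Fit on 4 folds, evaluate on the held-out fold.
    model = LinearRegression().fit(X_scaled[train_idx], y.iloc[train_idx])
    preds = model.predict(X_scaled[test_idx])
    fold_mse.append(mean_squared_error(y.iloc[test_idx], preds))

print(f"k-Fold Cross-Validation MSE: {np.mean(fold_mse):.4f}")
```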

---

#### **Bootstrap .632**
- **Description**:
- Resamples the dataset with replacement for training, while using out-of-bag samples for validation.
- Combines in-sample and out-of-sample errors using the `.632 adjustment` for a balanced error estimate.
- **Output**:
- **Bootstrap .632 MSE**: `1464.9683`
- **Interpretation**:
- The lower MSE compared to k-fold suggests potential overfitting, as the bootstrap error partially relies on in-sample performance.
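
A sketch of the .632 estimate (0.368 × in-sample error + 0.632 × out-of-bag error), again assuming `X_scaled` and `y` from above; the iteration count and seed are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
n = len(y)
estimates = []
for _ in range(100):
    boot = rng.integers(0, n, size=n)          # resample indices with replacement
    oob = np.setdiff1d(np.arange(n), boot)     # out-of-bag indices
    if oob.size == 0:
        continue                               # skip the rare all-in-bag resample
    model = LinearRegression().fit(X_scaled[boot], y.iloc[boot])
    err_in = mean_squared_error(y.iloc[boot], model.predict(X_scaled[boot]))
    err_oob = mean_squared_error(y.iloc[oob], model.predict(X_scaled[oob]))
    estimates.append(0.368 * err_in + 0.632 * err_oob)

print(f"Bootstrap .632 MSE: {np.mean(estimates):.4f}")
```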

---

### **3. Model Accuracy**

#### **R² Score**
- **Output**: `0.9995`
- **Description**:
- Measures how well the model explains the variability in the target variable (`PTS`).
- **Interpretation**:
- A value of **0.9995** indicates that the model explains **99.95% of the variance** in `PTS`, demonstrating an excellent fit.
- **Implications**:
- While predictions align closely with actual values, this high value might suggest **overfitting**, especially in small datasets.

---

#### **Mean Absolute Error (MAE)**
- **Output**: `1.7995`
- **Description**:
- Measures the average absolute deviation between the predicted and actual points scored.
- **Interpretation**:
- The model's predictions are off by **1.7995 points** on average. For example, if a player scores 25 points, the model might predict **23.2 or 26.8**.
- **Implications**:
- A low MAE indicates strong predictive accuracy.
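
For reference, both metrics come straight from scikit-learn; the sketch below assumes a model fitted on the full preprocessed data (in-sample evaluation), not the notebook's exact setup:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

model = LinearRegression().fit(X_scaled, y)
preds = model.predict(X_scaled)
print(f"R² score: {r2_score(y, preds):.4f}")
print(f"MAE:      {mean_absolute_error(y, preds):.4f}")
```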

---

## **Visualizations**

### **1. Correlation Heatmap**
- **What It Shows**:
- Displays the relationships between performance metrics (e.g., `K`, `DIG`) and `PTS`.
- **Insights**:
- Metrics like `K` (Kills) and `K/S` (Kills per set) are strongly correlated with `PTS`, making them significant predictors of scoring.
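
A typical way to draw this with seaborn, assuming the cleaned numeric DataFrame `clipped` from the preprocessing sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.heatmap(clipped.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between performance metrics and PTS")
plt.show()
```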

---

### **2. Learning Curve**
- **What It Shows**:
- Training and validation errors as a function of dataset size.
- **Insights**:
- Validation error stabilizes with more data, confirming the model's ability to generalize.
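
A sketch using scikit-learn's `learning_curve` helper; the fold count and training sizes are illustrative, chosen small because of the dataset size:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_scaled, y, cv=3,
    train_sizes=np.linspace(0.5, 1.0, 4),
    scoring="neg_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), marker="o", label="Training MSE")
plt.plot(sizes, -val_scores.mean(axis=1), marker="o", label="Validation MSE")
plt.xlabel("Training set size")
plt.ylabel("MSE")
plt.legend()
plt.show()
```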

---

### **3. Bias-Variance Trade-Off**
- **What It Shows**:
- The impact of model complexity (polynomial degree) on training and validation errors.
- **Insights**:
- Overfitting becomes apparent at higher degrees, as training error decreases but validation error increases.
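
A sketch of the degree sweep behind this plot: fit polynomial models of increasing degree and compare in-sample MSE with cross-validated MSE (the degree range is illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

for degree in range(1, 4):
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    pipe.fit(X_scaled, y)
    train_mse = mean_squared_error(y, pipe.predict(X_scaled))
    val_mse = -cross_val_score(pipe, X_scaled, y, cv=3,
                               scoring="neg_mean_squared_error").mean()
    print(f"degree={degree}  train MSE={train_mse:.2f}  CV MSE={val_mse:.2f}")
```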

---

### **4. Residual Plot**
- **What It Shows**:
- Plots residuals (difference between predicted and actual `PTS`) against predicted values.
- **Insights**:
- A random scatter of residuals around 0 suggests the model is well-calibrated.
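
A short sketch of this plot, assuming `preds` from the accuracy section above:

```python
import matplotlib.pyplot as plt

residuals = y - preds
plt.scatter(preds, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted PTS")
plt.ylabel("Residual")
plt.show()
```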

---

### **5. Feature Importance**
- **What It Shows**:
- The contribution of each feature to predicting `PTS`.
- **Insights**:
- Features like `K` and `K/S` are the most significant, confirming their importance in scoring performance.
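
One common proxy for importance in a linear model is the magnitude of the coefficients on the scaled features; the sketch below assumes `model` and `X` from the earlier sketches and is not necessarily the notebook's method:

```python
import pandas as pd

# Rank features by absolute coefficient size on the scaled inputs.
importance = pd.Series(model.coef_, index=X.columns).abs().sort_values(ascending=False)
importance.plot(kind="bar", title="Feature importance (|coefficient| on scaled features)")
```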

---

### **6. Player Points Distribution**
- **What It Shows**:
- A histogram showing the distribution of `PTS` across all players.
- **Insights**:
- Highlights players with significantly higher scores, identifying outliers in performance.
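
A one-line sketch of the histogram, using the cleaned DataFrame `clipped`:

```python
import matplotlib.pyplot as plt

clipped["PTS"].plot(kind="hist", bins=10, title="Distribution of player PTS")
plt.xlabel("PTS")
plt.show()
```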

---

### **7. ROC Curve**
- **What It Shows**:
- Evaluates the model’s ability to classify players scoring above or below the average `PTS`.
- **Output**:
- **AUC**: `0.98` (excellent discrimination).
- **Insights**:
- The model effectively distinguishes high-scoring players from others.
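
One way this binary task can be framed, as a sketch: label players scoring above the mean `PTS` as positive, fit a logistic regression, and plot the ROC curve (the classifier choice here is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

labels = (y > y.mean()).astype(int)                 # above-average scorers = 1
clf = LogisticRegression(max_iter=1000).fit(X_scaled, labels)
scores = clf.predict_proba(X_scaled)[:, 1]

fpr, tpr, _ = roc_curve(labels, scores)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(labels, scores):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")            # chance-level reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```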

---

### **8. Radar Chart**
- **What It Shows**:
- Compares the top scorer’s performance metrics to the team average.
- **Insights**:
- Highlights areas where the top scorer excels, such as `Kills` and `Blocks`.
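
A matplotlib polar-plot sketch comparing the top scorer to the team average on a few metrics (the column selection is illustrative, and the metrics are plotted on their raw scales):

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["K", "DIG", "BLK", "SA", "A"]
top = clipped.loc[clipped["PTS"].idxmax(), metrics].to_numpy(dtype=float)
avg = clipped[metrics].mean().to_numpy()

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False)
angles = np.concatenate([angles, angles[:1]])   # close the polygon

ax = plt.subplot(polar=True)
for values, label in [(top, "Top scorer"), (avg, "Team average")]:
    values = np.concatenate([values, values[:1]])
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.2)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.legend()
plt.show()
```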

---

## **Team Contributions**

### **Kunal Nilesh Samant (20541900)**
- Implemented data preprocessing, including handling missing values, scaling, and clipping outliers.
- Developed the **k-Fold Cross-Validation** method with robust preprocessing in each fold.
- Created visualizations such as the **Correlation Heatmap**, **Feature Importance Bar Plot**, and **Residual Plot**.
- Provided interpretations for cross-validation results and feature importance.

---

### **Dhruv Singh (A20541901)**
- Implemented the **Bootstrap .632 Estimator** with error handling.
- Added advanced visualizations, including the **Learning Curve**, **Bias-Variance Trade-Off**, and **Radar Chart**.
- Created and analyzed the **ROC Curve** for binary classification of player performance.
- Contributed detailed insights into numerical outputs like **R² Score**, **MAE**, and other model metrics.


---
# **Project 2 Questions Answered**

---

## **1. Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?**

In simple cases like linear regression, **cross-validation (k-fold)** and **bootstrap .632** generally align with simpler model selectors like the Akaike Information Criterion (AIC). AIC balances the model's goodness of fit with its complexity, penalizing models with excessive parameters, while cross-validation estimates the generalization error directly. Bootstrap .632 combines in-sample and out-of-sample errors to provide a robust estimate.

In this dataset, the high **R² score (0.9995)** and low **MAE (1.7995)** indicate that the linear regression model fits exceptionally well, capturing almost all variance in the target (`PTS`). Both **k-fold MSE (3231.9376)** and **bootstrap .632 MSE (1464.9683)** suggest the model generalizes well, consistent with AIC’s preference for simpler models.

However, differences may arise in more complex scenarios. For example, AIC assumes residual normality, which cross-validation and bootstrap do not. In this relatively small and clean dataset, these methods align well.
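
One way to run this cross-check in practice, as a sketch: statsmodels reports AIC directly for an OLS fit on the same features used by the cross-validation and bootstrap estimates.

```python
import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(X_scaled)).fit()
print(f"AIC: {ols.aic:.2f}")
# Comparing candidate feature subsets by AIC and by k-fold / bootstrap .632 MSE
# should produce the same ranking in simple linear settings like this one.
```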

---

## **2. In what cases might the methods you've written fail or give incorrect or undesirable results?**

While cross-validation and bootstrap are robust, certain limitations can lead to failure or undesirable outcomes:

### **Small Dataset**
With only 15 players, splitting into 5 folds leaves just 12 samples for training in each fold. This small size can limit the model's ability to generalize. Similarly, bootstrap resampling may repeatedly select similar subsets, reducing variability in error estimates.

### **Overfitting**
The extremely high **R² score** suggests potential overfitting, where the model captures noise alongside actual relationships in the data. Bootstrap, which incorporates in-sample error, might underestimate the degree of overfitting.

### **Outliers**
Despite clipping extreme values, residual outliers may distort predictions. For example, a player with unusually high stats could heavily influence the model’s parameters, inflating the MSE.

### **Multicollinearity**
Features like `K` (Kills) and `K/S` (Kills per set) are highly correlated, causing multicollinearity. This can destabilize parameter estimation and inflate bootstrap variance.

### **Misaligned Metrics**
Cross-validation minimizes MSE but doesn’t directly penalize model complexity. AIC accounts for complexity, but cross-validation might favor overly complex models in small datasets.

---

## **3. What could you implement given more time to mitigate these cases or help users of your methods?**

To address these challenges, the following improvements could be implemented:

### **Regularization**
Introduce techniques like Ridge or Lasso regression to handle multicollinearity and reduce overfitting by penalizing large coefficients.
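
A minimal sketch of how this could be dropped in with scikit-learn (the regularization strengths are illustrative and would need tuning):

```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(X_scaled, y)     # shrinks correlated coefficients
lasso = Lasso(alpha=0.1).fit(X_scaled, y)     # can zero out redundant features
```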

### **Robust Preprocessing**
- Add automated outlier detection mechanisms to handle anomalies more effectively.
- Use feature engineering methods like **Principal Component Analysis (PCA)** to address multicollinearity.

### **Hybrid Model Selection**
Combine AIC and cross-validation to balance goodness of fit with model complexity, leveraging AIC's simplicity criterion alongside empirical validation from cross-validation.

### **Advanced Bootstrap Techniques**
Implement alternative methods such as **balanced bootstrap**, which ensures diverse resampling, improving error variability estimates for small datasets.

### **Visualization Tools**
Expand diagnostic visualizations (e.g., residual plots, learning curves, bias-variance trade-off) to provide users with a deeper understanding of model performance and limitations.

### **Exposed Parameters**
Expose fine-tuning parameters for users, such as:
- Number of folds in cross-validation.
- Number of bootstrap iterations.
- Gradient descent hyperparameters (learning rate, iterations).
- Preprocessing thresholds for outlier clipping.

---

## **4. What parameters have you exposed to your users in order to use your model selectors?**

### **Cross-Validation**
- `k`: Number of folds (default: 5).
- `shuffle`: Whether to shuffle the data before splitting.
- `random_state`: Seed for reproducibility.

### **Bootstrap**
- `n_iterations`: Number of bootstrap samples (default: 100).
- `.632` weighting: balances errors as 0.368 × in-sample + 0.632 × out-of-bag.

### **Gradient Descent**
- `alpha`: Learning rate (default: 0.01).
- `iterations`: Number of iterations for convergence (default: 1000).
- `clip_value`: Threshold for gradient clipping to stabilize updates.

### **Preprocessing**
- Outlier clipping thresholds (1st and 99th percentiles).
- Scaling method (`RobustScaler`) for handling skewness and outliers.

### **Model Evaluation**
- Metrics: **MSE**, **R² score**, **MAE**, and **ROC-AUC** for comprehensive performance assessment.
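
As a hypothetical illustration of the gradient-descent knobs listed above (the function name and defaults here are illustrative, not the notebook's actual API):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000, clip_value=None):
    """Fit linear-regression weights by gradient descent on the MSE loss."""
    X = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        grad = 2 * X.T @ (X @ w - y) / len(y)      # gradient of the MSE
        if clip_value is not None:
            grad = np.clip(grad, -clip_value, clip_value)
        w -= alpha * grad
    return w

weights = gradient_descent(X_scaled, np.asarray(y), alpha=0.01,
                           iterations=1000, clip_value=10.0)
```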

---

## **Output Interpretation**

### **k-Fold Cross-Validation MSE**: `3231.9376`
- **What it Means**:
- Reflects the average squared error on unseen data. A lower value indicates better generalization.
- **Implication**:
- The model performs well on unseen data but shows some error variability.

### **Bootstrap .632 MSE**: `1464.9683`
- **What it Means**:
- Provides a slightly optimistic error estimate by blending in-sample and out-of-sample errors.
- **Implication**:
- A lower error compared to k-fold suggests some overfitting to the training data.

### **R² Score**: `0.9995`
- **What it Means**:
- Explains **99.95% of the variability** in `PTS`. Indicates an excellent model fit.
- **Implication**:
- Highlights strong predictive accuracy but raises concerns about overfitting.

### **MAE**: `1.7995`
- **What it Means**:
- Average deviation of predictions from actual values is **1.8 points**.
- For instance, if a player scores 25 points, the model might predict **23.2 or 26.8**.
- **Implication**:
- Low MAE indicates good predictive precision, acceptable in this context.

---
14 changes: 14 additions & 0 deletions tabula-mvb_stats_2024.csv
@@ -0,0 +1,14 @@
"#",Player,SP,K,K/S,E,TA,Pct,A,A/S,SA,SE,SA/S,RE,DIG,DIG/S,BS,BA,BLK,BLK/S,BE,BHE,PTS
2,"Johnson, Elijah",85,20,0.24,7,103,.126,634,7.46,11,32,0.13,1,161,1.89,2,21,23.0,0.27,12,3,43.5
4,"Donald, Alec",97,0,0.00,5,33,-.152,50,0.52,0,0,0.00,15,279,2.88,0,1,1.0,0.01,0,1,0.5
5,"Baus, Zachary",48,0,0.00,0,1,.000,2,0.04,0,0,0.00,2,25,0.52,0,0,0,0.00,0,2,0
6,"Henderson, Paul",88,131,1.49,76,460,.120,13,0.15,26,47,0.30,4,185,2.10,8,20,28.0,0.32,14,1,175.0
7,"Singer, David",51,83,1.63,54,237,.122,5,0.10,3,27,0.06,4,57,1.12,3,10,13.0,0.25,9,0,94.0
8,"Hanny, Aaron",27,34,1.26,25,92,.098,6,0.22,4,16,0.15,0,19,0.70,0,12,12.0,0.44,6,1,44.0
9,"Couper, Grant",90,223,2.48,108,675,.170,10,0.11,17,50,0.19,13,174,1.93,8,18,26.0,0.29,14,0,257.0
10,"Evans, Bryan",27,28,1.04,19,80,.113,1,0.04,0,3,0.00,6,19,0.70,2,2,4.0,0.15,5,0,31.0
11,"Sherman, Eli",57,22,0.39,4,63,.286,256,4.49,9,12,0.16,0,43,0.75,1,6,7.0,0.12,3,3,35.0
13,"Van Engen, Jackson",99,232,2.34,68,482,.340,12,0.12,23,69,0.23,0,77,0.78,29,50,79.0,0.80,31,1,309.0
14,"Laite, Riley",85,218,2.56,120,559,.175,6,0.07,17,65,0.20,13,160,1.88,4,10,14.0,0.16,15,0,244.0
24,"Kimoto, Jelani",72,61,0.85,35,199,.131,4,0.06,3,11,0.04,0,44,0.61,10,38,48.0,0.67,26,2,93.0
25,"Farrell, Colten",32,29,0.91,11,72,.250,0,0.00,4,7,0.13,1,17,0.53,2,12,14.0,0.44,14,0,41.0