This project focuses on forecasting solar power generation using advanced machine learning models, including XGBoost and Random Forest. Utilizing a dataset obtained from Kaggle, the project encompasses comprehensive steps such as data cleaning, preprocessing, feature engineering, exploratory data analysis (EDA), model building, hyperparameter tuning, and model evaluation.
The dataset was sourced from Kaggle, containing historical data on weather features and solar power generation.
- Missing Values: Handled missing values through imputation or removal to maintain data integrity.
- Outlier Detection: Identified and addressed outliers to prevent distortion of model training.
- Data Splitting: The dataset was split into training and testing sets to facilitate unbiased evaluation of model performance.
- Feature Scaling: Features were scaled using StandardScaler to enhance model stability and convergence during training.
- Cyclical Feature Extraction: Time-based features such as DaySin, DayCos, YearSin, and YearCos were created to capture seasonal and cyclical patterns inherent in solar power generation data.
- Data Distribution Visualization: Histograms were used to visualize and understand the distribution of each feature.
- Correlation Analysis: Investigated relationships between features and the target variable (SystemProduction) to identify key predictors.
- Time Series Analysis: Conducted to observe trends, seasonality, and anomalies in solar power generation.
Various regression models were evaluated, including:
- Linear Regression
- Support Vector Regression (SVR)
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- XGBoost
Models were assessed using R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
| Model | R² | MSE | RMSE | MAE |
|---|---|---|---|---|
| Linear Regression | 0.637294 | 52.1485 | 4.0848 | 9.39 |
| Decision Tree | 0.715784 | 39.1776 | 6.2581 | 1.82 |
| Random Forest | 0.902032 | 20.3694 | 4.5132 | 1.98 |
| SVR | 0.041909 | 102.6813 | 10.1333 | 7.23 |
| XGBoost | 0.882280 | 28.5447 | 5.3422 | 1.62 |
| Gradient Boosting | 0.862676 | 30.3851 | 5.5123 | 1.18 |
- Optimization: Utilized RandomizedSearchCV to fine-tune hyperparameters for the top-performing models (Random Forest, Gradient Boosting, and XGBoost), resulting in enhanced predictive accuracy.
- Key Features Identified: Analyzed feature importance to gain insights into the model's decision-making process. Radiation, sunshine, and air temperature emerged as significant predictors of solar power generation.
The Random Forest Regressor was identified as the best-performing model, achieving an R-squared score of 0.90. The project highlights the effectiveness of machine learning in forecasting solar power generation, providing valuable insights into the factors influencing solar energy production.
Future enhancements could include:
- Implementing advanced neural network architectures (e.g., LSTM) for better time series forecasting.
- Exploring additional weather features or external data sources to improve model accuracy.
- Conducting sensitivity analysis to evaluate the impact of different features on model predictions.
This technical report summarizes the methodologies and findings of the solar power forecasting project, showcasing the potential of machine learning to predict renewable energy generation. The complete code and detailed analysis can be found in the accompanying Jupyter notebook on GitHub.