This project aims to classify cybersecurity incidents into categories using machine learning. We preprocess raw security incident data, perform EDA (Exploratory Data Analysis), handle missing values, encode categorical features, and build ML models such as Random Forest, XGBoost, and Logistic Regression.
The goal is to predict the Category of each incident (e.g., TP for true positive, FP for false positive, BP for benign positive) so that security analysts can prioritize their responses.
GUIDE_Train.csv: Training dataset
GUIDE_Test.csv: Testing dataset
IncidentId: Unique ID for incidents
AlertTitle: Title/description of alert
Category: Target variable (TP, FP, BP, etc.)
IncidentGrade, EntityType, ResourceType, etc.: Features used for classification
Removed duplicate/unnecessary columns (OrgId, AlertId, etc.).
Handled missing values:
Numerical columns: filled with the mean
Categorical columns: filled with the mode
Label Encoding used for categorical features.
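The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the project's actual code: the `AlertCount` column and all values are assumptions, and only `OrgId` and `EntityType` come from the column names mentioned here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the incident data; values are made up for illustration.
df = pd.DataFrame({
    "OrgId": [1, 1, 2, 2],
    "EntityType": ["User", None, "Ip", "User"],
    "AlertCount": [3.0, np.nan, 5.0, 4.0],  # hypothetical numeric feature
})

# Drop columns that are not useful for classification.
df = df.drop(columns=["OrgId"])

# Fill numeric gaps with the column mean and categorical gaps with the mode.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Label-encode each categorical column, keeping the fitted encoders
# so the same mapping can be reused on the test set.
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

Keeping one fitted encoder per column matters when the test file is transformed later, since refitting on the test data could assign different integer codes to the same category.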
Distribution plots for categorical features.
Correlation heatmap (numerical features).
Feature importance analysis using Random Forest.
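A rough sketch of the feature importance step, assuming scikit-learn's impurity-based `feature_importances_` is what is meant. The data is synthetic and the feature names are placeholders, so the ranking says nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded incident features (not real data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Impurity-based importances are normalized to sum to 1;
# sort them so the most influential features come first.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so it may be worth cross-checking the ranking with `sklearn.inspection.permutation_importance`.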
Models tested:
Logistic Regression
Random Forest Classifier
XGBoost Classifier
Metrics used:
Accuracy
Precision, Recall, F1-Score (Macro-F1 for balanced evaluation)
Confusion Matrix
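The metrics above could be computed along these lines. The labels are a tiny hand-written example, not project results. Macro-F1 averages the per-class F1 scores with equal weight, which is why it is less dominated by the majority class than accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hand-written toy labels, for illustration only.
y_true = ["TP", "TP", "TP", "TP", "FP", "FP", "BP", "BP"]
y_pred = ["TP", "TP", "TP", "FP", "FP", "FP", "BP", "TP"]
labels = ["BP", "FP", "TP"]

accuracy = accuracy_score(y_true, y_pred)

# Per-class F1 (in the order given by `labels`) and their unweighted mean.
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print("Accuracy:", accuracy)
print("Per-class F1:", dict(zip(labels, per_class_f1.round(3))))
print("Macro-F1:", round(macro_f1, 3))
print(cm)
```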
Hyperparameter tuning using GridSearchCV / RandomizedSearchCV.
Class imbalance handled with SMOTE and class weights.
Best model selected (Random Forest / XGBoost).
Predictions saved into a CSV file (Predictions.csv).
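As a hedged sketch of the tuning and class-weight steps: the example below runs `RandomizedSearchCV` over a small, arbitrary parameter grid with `class_weight="balanced"` and macro-F1 as the selection metric. The data is synthetic and imbalanced on purpose, and the grid values are assumptions rather than the settings used in this project. SMOTE (from imbalanced-learn) is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic 3-class data with a 60/30/10 class imbalance (not real data).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative search space; real ranges would depend on the data.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency.
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="f1_macro",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out validation split.
best_model = search.best_estimator_
val_macro_f1 = f1_score(y_val, best_model.predict(X_val), average="macro")
print(search.best_params_, round(val_macro_f1, 3))
```

If SMOTE is used instead of class weights, it should be applied only to the training folds (for example through an imbalanced-learn `Pipeline`) so that synthetic samples do not leak into validation data.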
pip install pandas numpy matplotlib seaborn scikit-learn xgboost imbalanced-learn
python train.py
import joblib
import pandas as pd

# Predict on the test features and save the results next to the incident IDs
y_pred = model.predict(X_test)
pred_df = pd.DataFrame({"IncidentId": Test["IncidentId"], "Predicted_Category": y_pred})
pred_df.to_csv("Predictions.csv", index=False)
print("Predictions saved successfully!")

# Save the trained model
joblib.dump(model, "final_model.pkl")

# Load the saved model
model = joblib.load("final_model.pkl")
Best performing model: Random Forest (after tuning)
Macro-F1 Score: ~0.89 (example)
Balanced performance across TP, FP, BP categories
Use deep learning (LSTM/Transformers) for better text-based features.
Add feature engineering from AlertTitle (NLP-based embeddings).
Deploy as a Streamlit dashboard for analysts.
Data Science Student @ GUVI - IIT Madras