
naveenk-DS/Microsoft_classification


๐Ÿ›ก๏ธ Classifying Cybersecurity Incidents with Machine Learning

📌 Project Overview

This project aims to classify cybersecurity incidents into categories using machine learning. We preprocess raw security incident data, perform EDA (Exploratory Data Analysis), handle missing values, encode categorical features, and build ML models such as Random Forest, XGBoost, and Logistic Regression.

The goal is to predict the Category of each incident (e.g., TP, FP, BP, i.e. true positive, false positive, benign positive) so that security analysts can prioritize their responses.

📂 Dataset

The dataset consists of two files:

GUIDE_Train.csv → Training dataset

GUIDE_Test.csv → Testing dataset

Key Columns:

IncidentId – Unique ID for incidents

AlertTitle – Title/description of alert

Category – Target variable (TP, FP, BP, etc.)

IncidentGrade, EntityType, ResourceType, etc. – Features used for classification

โš™๏ธ Steps in the Project

1. Data Preprocessing

Removed duplicate/unnecessary columns (OrgId, AlertId, etc.).

Handled missing values:

Numerical columns → filled with the column mean

Categorical columns → filled with the column mode

Label encoding applied to categorical features.
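The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the project's actual train.py; the `Score` column and all values are hypothetical, and only the column names OrgId, AlertId, EntityType and Category come from this README.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame standing in for GUIDE_Train.csv (values are illustrative only).
df = pd.DataFrame({
    "OrgId": [1, 1, 2, 2],
    "AlertId": [10, 11, 12, 13],
    "EntityType": ["User", None, "Ip", "User"],
    "Score": [0.5, np.nan, 1.5, 1.0],  # hypothetical numeric feature
    "Category": ["TP", "FP", "BP", "TP"],
})

# 1. Drop identifier columns that are not used as features.
df = df.drop(columns=["OrgId", "AlertId"])

# 2. Impute missing values: mean for numeric columns, mode for the rest.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# 3. Label-encode each categorical column, keeping one encoder per column
#    so the same mapping can be reused on the test set.
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

Keeping the fitted encoders matters when GUIDE_Test.csv is processed later: calling `transform` with the training encoders keeps the integer codes consistent, and `LabelEncoder.transform` raises an error on labels it has not seen, which makes unseen categories visible instead of silently miscoding them.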

2. Exploratory Data Analysis (EDA)

Distribution plots for categorical features.

Correlation heatmap (numerical features).

Feature importance analysis using Random Forest.
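As a rough illustration of the Random Forest feature importance step, the sketch below fits a forest on synthetic data and ranks the features. The data and the `feature_i` names are placeholders, not the real GUIDE columns, and the hyperparameters are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded training features (not the real GUIDE data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    n_redundant=0, n_classes=3, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Impurity-based importances sum to 1; sort so the most influential come first.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor high-cardinality features, so they are best read as a ranking hint rather than a precise measure of each feature's contribution.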

3. Model Selection & Training

Models tested:

Logistic Regression

Random Forest Classifier

XGBoost Classifier
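A comparison of candidate models might look like the sketch below. It uses synthetic three-class data and only the two scikit-learn models so that it runs without extra dependencies; the split ratio and hyperparameters are assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the preprocessed GUIDE features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Candidate models; hyperparameters here are arbitrary defaults.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and score it on the held-out split with macro-F1.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_val, model.predict(X_val), average="macro")

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: macro-F1 = {score:.3f}")
```

`xgboost.XGBClassifier` exposes the same `fit`/`predict` interface, so it could be added to the `models` dictionary in the same way once the target is integer-encoded.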

4. Model Evaluation

Metrics used:

Accuracy

Precision, Recall, F1-Score (Macro-F1, which averages the per-class F1 scores so that each class counts equally regardless of its frequency)

Confusion Matrix
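The metrics listed above can be computed with scikit-learn as in this small worked example. The six labels are hand-made for illustration and are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["BP", "FP", "TP"]

# Hand-made toy labels (not project output).
y_true = ["TP", "TP", "FP", "FP", "BP", "BP"]
y_pred = ["TP", "FP", "FP", "FP", "BP", "TP"]

accuracy = accuracy_score(y_true, y_pred)

# One F1 value per class, in the order given by `labels`.
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Macro-F1 is the unweighted mean of the per-class F1 values.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy = {accuracy:.3f}")
print(f"macro-F1 = {macro_f1:.3f}")
print(cm)
```

Here the per-class F1 values are 2/3 (BP), 0.8 (FP) and 0.5 (TP), so macro-F1 is their mean, about 0.656, while accuracy is 4/6. On imbalanced data the two numbers can diverge much more, which is why macro-F1 is the more informative headline metric.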

5. Model Tuning

Hyperparameter tuning using GridSearchCV / RandomizedSearchCV.

Class imbalance handled with SMOTE and class weights.
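A tuning run with class weights could be sketched as below. The parameter grid, fold count and synthetic imbalanced data are assumptions chosen to keep the example fast; they are not the grid used in this project, and SMOTE is left out so the sketch depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Imbalanced synthetic data: roughly 70% / 20% / 10% across three classes.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# Deliberately small, hypothetical grid.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}

search = GridSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency.
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="f1_macro",  # select the configuration with the best macro-F1
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

If SMOTE from imbalanced-learn is used instead of (or alongside) class weights, it should be applied only to the training folds, for example by placing it in an `imblearn.pipeline.Pipeline` passed to the search, so that synthetic samples never end up in the validation data.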

6. Final Model & Predictions

Best model selected (Random Forest / XGBoost).

Predictions saved into a CSV file (Predictions.csv).

🧪 Usage

Install Dependencies

pip install pandas numpy matplotlib seaborn scikit-learn xgboost imbalanced-learn

Run Training Script

python train.py

Save Predictions

import pandas as pd

# `model`, `X_test`, and `Test` (the loaded test DataFrame) are assumed to be defined already.
y_pred = model.predict(X_test)

pred_df = pd.DataFrame({"IncidentId": Test["IncidentId"], "Predicted_Category": y_pred})

pred_df.to_csv("Predictions.csv", index=False)
print("✅ Predictions saved successfully!")

Save Trained Model

import joblib

joblib.dump(model, "final_model.pkl")

Load Saved Model

import joblib

model = joblib.load("final_model.pkl")

📊 Results

Best performing model: Random Forest (after tuning)

Macro-F1 Score: ~0.89 (example)

Balanced performance across TP, FP, BP categories

🔮 Future Improvements

Use deep learning (LSTM/Transformers) for better text-based features.

Add feature engineering from AlertTitle (NLP-based embeddings).

Deploy as a Streamlit dashboard for analysts.

๐Ÿ‘จโ€๐Ÿ’ป Author

Naveen

Data Science Student @ GUVI – IIT Madras

About

๐Ÿ›ก๏ธ Microsoft - Classifying Cybersecurity Incidents This project focuses on classifying cybersecurity incidents using machine learning techniques. The goal is to build a model that can identify the type of incident based on input features, helping security teams respond more effectively.
