HR Data Analytics – Salary Prediction & Workforce Insights

A comprehensive data analytics and machine learning project exploring 5,000 employee records to understand salary drivers, workforce patterns, and predictive modeling for HR decision-making.

Dataset Overview

The dataset contains 5,000 employee records with the following 20 features:

Gender
Business
Dependancies
Calls
Contract Type
Billing
Rating
Age
Salary
Base Pay
Bonus
Unit Price
Volume
Opening Balance
Closing Balance
Low
Unit Sales
Total Sales
Months (Tenure)
Education

Key Details

Missing values found in Base Pay, Opening Balance, and Total Sales
Handled using KNN Imputer
Final cleaned dataset contains zero missing values
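
The imputation step can be sketched as follows. This is a minimal illustration, assuming scikit-learn's KNNImputer is applied to the numeric columns; the toy values and `n_neighbors=2` are assumptions, not the project's data or settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the three columns with gaps; all values are made up.
df = pd.DataFrame({
    "Base Pay": [30000.0, np.nan, 52000.0, 41000.0, 60000.0],
    "Opening Balance": [1200.0, 1500.0, np.nan, 1300.0, 2100.0],
    "Total Sales": [80000.0, 95000.0, 120000.0, np.nan, 150000.0],
})

# n_neighbors=2 is an illustrative choice, not the project's setting.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Each gap is filled with the mean of that column over the nearest rows;
# values that were already present are left unchanged.
print(imputed.isna().sum())
```

Because KNN imputation relies on distances between rows, features on very different scales can dominate the neighbor search, so scaling before imputing may be worth considering.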

Data Preparation

Steps performed:

Loaded CSV and verified data shape (5000 × 20)
Identified columns with missing values
Imputed missing values using KNN Imputer
Applied Label Encoding to categorical features
Checked variable distributions using boxplots
Removed outliers using the IQR method
Performed Spearman correlation for feature selection
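
The label-encoding step might look like the sketch below. The column list and sample rows are assumptions for illustration; the notebook may encode a different set of columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; the category values mirror those reported in this README.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Contract Type": ["Month-to-Month", "Two Year", "One Year", "Month-to-Month"],
    "Education": ["PG", "Graduation", "PG", "High School"],
    "Salary": [90000, 45000, 120000, 18000],
})

# Assumed list of categorical columns.
categorical_cols = ["Gender", "Contract Type", "Education"]

encoders = {}
for col in categorical_cols:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
    encoders[col] = encoder  # kept so codes can be mapped back to labels

print(df)
print(list(encoders["Education"].classes_))
```

LabelEncoder assigns integer codes in sorted label order, so the codes carry no meaning about rank (for example, "High School" is not placed below "PG" by design). Tree-based models are usually tolerant of this; linear models may not be.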

Analysis & Visualizations

1. Gender Distribution

Male: 2528
Female: 2472
The gender split is nearly even.

2. Education Breakdown

PG: 2979
Graduation: 1980
Intermediate: 27
High School: 14

3. Contract Type Distribution

Month-to-Month: 2777
Two Year: 1195
One Year: 1028
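
As a worked example, the contract counts above can be turned into shares with pandas. The series below is rebuilt from the reported counts rather than loaded from hrdataset.csv.

```python
import pandas as pd

# Rebuild the column from the counts reported above (not read from the CSV).
contract = pd.Series(
    ["Month-to-Month"] * 2777 + ["Two Year"] * 1195 + ["One Year"] * 1028,
    name="Contract Type",
)

counts = contract.value_counts()                 # absolute counts per category
shares = contract.value_counts(normalize=True)   # fraction of all employees

print(counts)
print((shares * 100).round(2))
```

Month-to-Month contracts account for 2777 of 5000 records, a little over half of the workforce.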

4. Salary Distribution & Outliers

Boxplots were used to inspect Salary, Base Pay, Bonus, and the other numeric columns.
Significant outliers were identified and removed using the IQR method.
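
A minimal sketch of IQR-based outlier removal, assuming the conventional 1.5 × IQR fences. The helper name `remove_iqr_outliers` and the sample salaries are hypothetical, not taken from the notebook.

```python
import pandas as pd


def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Toy salaries with one extreme value; made up for illustration.
df = pd.DataFrame(
    {"Salary": [40000, 42000, 45000, 47000, 50000, 52000, 55000, 900000]}
)
cleaned = remove_iqr_outliers(df, ["Salary"])

print(len(df), "rows before,", len(cleaned), "rows after")
```

Here Q1 is 44,250 and Q3 is 52,750, so the fences are 31,500 and 65,500 and only the 900,000 row is dropped.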

5. Salary Correlations

Spearman correlation revealed:

Total Sales → Salary: 0.99 (near-perfect monotonic relationship)
Bonus, Base Pay, and Unit Sales are also highly correlated with Salary (> 0.8)

These became major predictors for modeling.
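
Correlation-based feature selection of this kind can be sketched with pandas. The data below is a toy example and the 0.8 cutoff simply mirrors the threshold quoted above; the notebook may compute the coefficients differently (for instance with Pingouin).

```python
import pandas as pd

# Toy numeric columns; values are made up for illustration.
df = pd.DataFrame({
    "Salary": [20, 35, 50, 65, 80, 95],
    "Total Sales": [100, 180, 240, 400, 520, 900],
    "Bonus": [2, 3, 5, 4, 8, 9],
    "Age": [45, 23, 51, 30, 28, 39],
})

# Spearman's rho is the Pearson correlation of the ranks, so it captures
# any monotonic relationship, not only linear ones.
rho = df.corr(method="spearman")["Salary"].drop("Salary")

# Keep features whose rank correlation with Salary exceeds the cutoff.
selected = rho[rho > 0.8].sort_values(ascending=False)

print(rho.round(3))
print(list(selected.index))
```

In this toy data Total Sales rises with every increase in Salary, so its coefficient is 1 even though the relationship is not linear, while Age falls well below the cutoff.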

6. Numerical Feature Relationships

Scatter plots showed:

Higher Total Sales → Higher Salary
Higher Bonus → Higher Salary
More Months of Experience → Higher Salary

7. Categorical Impact on Salary

PG employees earn: ₹50,000 – ₹2,00,000
High School / Intermediate: < ₹25,000
Gender shows no meaningful salary difference.
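
Salary ranges per category like those above can be obtained with a group-by. The rows below are made up and only illustrate the pattern.

```python
import pandas as pd

# Toy rows; salaries are illustrative, not taken from the dataset.
df = pd.DataFrame({
    "Education": ["PG", "PG", "PG", "Graduation", "Graduation",
                  "Intermediate", "High School"],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "Salary": [60000, 180000, 95000, 40000, 70000, 22000, 15000],
})

# Salary range and midpoint for each education level.
by_education = df.groupby("Education")["Salary"].agg(["min", "median", "max"])

# Median salary by gender, as a quick check for a gap.
by_gender = df.groupby("Gender")["Salary"].median()

print(by_education)
print(by_gender)
```

With only a handful of rows the gender medians in this toy table differ by chance; on the full dataset the README reports no meaningful difference.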

8. Company Hiring Trends

765 hires in the last four months
269 employees have long tenure (72 months)
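
Counts like these follow from simple filters on the tenure column. The sketch assumes the column is named "Months" and that a recent hire means a tenure of at most four months; both are assumptions about the notebook.

```python
import pandas as pd

# Toy tenure values in months; made up for illustration.
df = pd.DataFrame({"Months": [1, 3, 4, 12, 30, 72, 72, 2, 58, 72]})

NEW_HIRE_MONTHS = 4       # assumption: "recent hire" means tenure <= 4 months
LONG_TENURE_MONTHS = 72   # the long-tenure value quoted above

recent_hires = int((df["Months"] <= NEW_HIRE_MONTHS).sum())
long_tenure = int((df["Months"] == LONG_TENURE_MONTHS).sum())

print("Recent hires:", recent_hires)
print("Long-tenure employees:", long_tenure)
```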

Machine Learning Models

Models trained after cleaning, encoding, scaling, and feature selection:

Linear Regression: 78.27%
Decision Tree Regression: 99.35%
Random Forest Regression: 99.09%
XGBoost Regression: 99.24% (selected as the final model; Decision Tree scores marginally higher on this list)
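
A comparison of this kind might be run as in the sketch below. It assumes the reported percentages are R² scores on a held-out test set, uses synthetic data in place of hrdataset.csv, and leaves out XGBoost so that it runs with scikit-learn alone; `xgboost.XGBRegressor` can be added to the `models` dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a nonlinear target; not the HR dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = 20000 + 90000 * X[:, 0] ** 2 + 30000 * X[:, 1] + rng.normal(0, 1000, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
    "Random Forest Regression": RandomForestRegressor(
        n_estimators=100, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = model.score(X_test_s, y_test)  # R^2 on the held-out split
    print(f"{name}: {scores[name] * 100:.2f}%")
```

Scores this close to each other are easier to rank when the same split and random seed are used for every model, as done here.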

Cross-Validation Results

Random Forest CV: 0.99993
XGBoost CV: 0.99961
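
Cross-validated scores can be produced along these lines. The sketch assumes 5-fold cross-validation scored with R² on synthetic data; the fold count and metric used in the notebook are not stated in this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the HR dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 20000 + 90000 * X[:, 0] ** 2 + 30000 * X[:, 1] + rng.normal(0, 1000, 300)

# Putting the scaler inside the pipeline refits it on each training fold,
# so no information from the validation fold leaks into preprocessing.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # assumed fold count
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")

print("Fold scores:", cv_scores.round(4))
print("Mean CV score:", round(float(cv_scores.mean()), 4))
```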

Final Model

XGBoost Regressor selected as final model
Features standardized using StandardScaler
Model saved as: hr.pkl
Scaler saved as: schr.pkl
Deployed using both Tkinter and Streamlit interfaces
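
Saving and reloading the two artifacts could look like the sketch below. It assumes the files are written with Python's pickle module, uses a scikit-learn GradientBoostingRegressor as a stand-in for the XGBoost model, and writes to a temporary directory rather than the repository root.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic training data; not the HR dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 20000 + 80000 * X[:, 0] + 10000 * X[:, 1]

scaler = StandardScaler().fit(X)
# Stand-in for the XGBoost regressor described above.
model = GradientBoostingRegressor(random_state=0).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "hr.pkl"
    scaler_path = Path(tmp) / "schr.pkl"

    # Persist the model and the scaler separately, as the README describes.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(scaler, f)

    # Reload both, as an app would do at startup.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# New inputs must pass through the same scaler before prediction.
new_rows = np.array([[0.5, 0.2, 0.9]])
pred_original = model.predict(scaler.transform(new_rows))
pred_loaded = loaded_model.predict(loaded_scaler.transform(new_rows))

print(pred_original, pred_loaded)
```

Keeping the scaler alongside the model matters because predictions are only valid when new inputs are transformed with the statistics learned during training.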

Key Insights

Salary is most strongly influenced by:

Age
Education
Months of Experience
Total Sales
Performance Metrics

Additional observations:

Higher sales are strongly associated with higher salary
Education level affects salary brackets significantly
Gender has no major impact on salary
Strong correlations enabled highly accurate ML predictions

Technologies Used

Python
Pandas
NumPy
Scikit-learn
XGBoost
Seaborn
Matplotlib
Pingouin
Tkinter
Streamlit

How to Run the Project

1. Clone the repository

```bash
git clone https://github.com/yourusername/HRDataAnalytics.git
```

2. Install dependencies

```bash
pip install pandas numpy matplotlib seaborn scikit-learn xgboost streamlit
```

3. Run the notebook

```bash
jupyter notebook HRDataAnalytics.ipynb
```

4. Launch the Streamlit App

```bash
streamlit run hrapp.py
```

Future Improvements

Integrate SHAP for model explainability
Add HR attrition prediction module
Create dashboards using Power BI or Streamlit
Add advanced feature importance visualizations
Deploy as a complete web-based prediction system

Contributing

Contributions are welcome.

Feel free to add new models, visualizations, or workflow improvements.

Support

If you found this project helpful, consider starring the repository.

