
VyuWing-Learning/Data-Pre-Processing-Bootcamp-Machine-Learning-


Data Pre Processing Machine Learning Bootcamp

Code Repository for the Bootcamp conducted on 22nd August, 2021 with Uttam Grade (Data Scientist, McKinsey & Company)

Agenda for the Bootcamp

  • Introduction to Python
  • An overview of NumPy & pandas
  • Data Visualisation
  • Exploratory Data Analysis

Problem Statement

Given a dataset that captures gross salary from July 1, 2013 through June 30, 2014, and includes only those employees who were employed on June 30, 2014, predict the salaries of employees in Baltimore. The dataset used in this repository is Baltimore City Employee Salaries FY2014 and can be downloaded from this link.

Data Cleaning and Data Preparation

The cleaning and preparation measures applied to the dataset are listed below.

  • Remove leading and trailing spaces
  • Check for null values in the dataset
  • Remove rows with an empty hire date
  • Drop the Gross Pay column
  • Remove $ from Annual Salary and convert it to integer format
  • Trim spaces
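
The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the repository's notebook code: the column names (`Name`, `JobTitle`, `HireDate`, `AnnualSalary`, `GrossPay`), the helper `clean_salaries`, and the sample rows are all assumptions made up for the example, and "leading and trailing" is interpreted here as whitespace in column names and string values.

```python
import pandas as pd

# Made-up sample rows; column names are assumed, not taken from the repository.
raw = pd.DataFrame({
    " Name": ["Doe,Jane", "Roe,Richard", "Poe,Alex"],
    "JobTitle ": [" OFFICE ASSISTANT", "ANALYST ", "AIDE"],
    "HireDate": ["10/24/1979", "07/23/2009", None],
    "AnnualSalary": ["$51862.00", "$56000.00", "$11310.00"],
    "GrossPay": ["$52247.39", "$54135.44", None],
})

def clean_salaries(df):
    """Hypothetical helper that applies the listed cleaning steps in order."""
    df = df.copy()
    # Remove leading and trailing spaces from the column names.
    df.columns = df.columns.str.strip()
    # Trim spaces inside the text columns (missing values stay missing).
    for col in ["Name", "JobTitle", "HireDate", "AnnualSalary", "GrossPay"]:
        df[col] = df[col].str.strip()
    # Check null values: count of missing entries per column.
    null_counts = df.isnull().sum()
    # Remove rows that have no hire date.
    df = df.dropna(subset=["HireDate"])
    # Drop the gross pay column.
    df = df.drop(columns=["GrossPay"])
    # Remove "$" and convert the salary to an integer.
    df["AnnualSalary"] = (
        df["AnnualSalary"].str.replace("$", "", regex=False).astype(float).astype(int)
    )
    return df.reset_index(drop=True), null_counts

cleaned, null_counts = clean_salaries(raw)
print(null_counts)
print(cleaned)
```

Counting nulls before dropping anything makes it easier to see how many rows each later step removes.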

After all these transformations the dataframe shall appear in the format given below.

Exploratory Data Analysis

Countplot

Histogram

Box plot for annual salary

Annual Salary Distribution Plot

Top 10 Jobs based on hirings

Top 10 Jobs that fetch the highest Salary

Top 10 Agencies that have the highest number of employees

Top 10 Jobs that have the highest number of employees

Average salaries of employees based on Hire Month

Hiring by year

Pair Plot

Heat Map
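
The aggregations behind several of the plots listed above (top jobs, average salary by hire month, hiring by year) can be computed with pandas before plotting. The sketch below is illustrative only: the column names `JobTitle`, `HireDate`, and `AnnualSalary` and the sample rows are assumptions, and the plotting calls themselves are left out.

```python
import pandas as pd

# Made-up sample; column names are assumed for illustration.
df = pd.DataFrame({
    "JobTitle": ["AIDE", "AIDE", "CLERK", "ENGINEER", "ENGINEER", "ENGINEER"],
    "HireDate": ["01/15/2010", "03/01/2011", "01/20/2011",
                 "07/04/2012", "07/10/2012", "03/15/2013"],
    "AnnualSalary": [20000, 22000, 30000, 60000, 70000, 80000],
})
df["HireDate"] = pd.to_datetime(df["HireDate"], format="%m/%d/%Y")

# Top 10 jobs by number of employees.
top_jobs_by_count = df["JobTitle"].value_counts().head(10)

# Top 10 jobs by average salary.
top_jobs_by_salary = (
    df.groupby("JobTitle")["AnnualSalary"].mean().sort_values(ascending=False).head(10)
)

# Average salary grouped by the month of hire (1 = January).
avg_salary_by_month = df.groupby(df["HireDate"].dt.month)["AnnualSalary"].mean()

# Number of hires per calendar year.
hires_per_year = df["HireDate"].dt.year.value_counts().sort_index()

print(top_jobs_by_count, top_jobs_by_salary, avg_salary_by_month, hires_per_year, sep="\n\n")
```

Each resulting Series can be passed to a bar-plot function of the plotting library in use.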

Feature Engineering

  • Apply mean encoding for Job Title
  • Apply mean encoding for Agency
  • Apply mean encoding for AgencyID
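
Mean encoding replaces each category with the average of the target for that category. A minimal sketch follows, assuming the target column is `AnnualSalary`; the helper `mean_encode` and the sample data are hypothetical. Here the category means are learned on the training rows only and unseen categories fall back to the overall training mean, which is one common way to limit target leakage and may differ from what the repository does.

```python
import pandas as pd

def mean_encode(train, test, column, target):
    """Hypothetical helper: encode `column` by the mean of `target` seen in train."""
    means = train.groupby(column)[target].mean()
    global_mean = train[target].mean()
    train_encoded = train[column].map(means)
    # Categories that never appear in train get the overall training mean.
    test_encoded = test[column].map(means).fillna(global_mean)
    return train_encoded, test_encoded

# Made-up sample data for illustration.
train = pd.DataFrame({"JobTitle": ["A", "A", "B"], "AnnualSalary": [10.0, 20.0, 40.0]})
test = pd.DataFrame({"JobTitle": ["A", "C"]})

train_enc, test_enc = mean_encode(train, test, "JobTitle", "AnnualSalary")
print(train_enc.tolist())
print(test_enc.tolist())
```

The same helper could be applied in turn to the Agency and AgencyID columns.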

Test train split

  • Divide the train set into dependent and independent variables
  • Divide the test set into dependent and independent variables
  • Scale the train, test
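
The split described above can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio, the `random_state`, the feature column names, and the sample values are assumptions chosen for the example; the repository may use different settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up encoded features and target; names are assumed for illustration.
df = pd.DataFrame({
    "JobTitleEncoded": [float(i) for i in range(10)],
    "AgencyEncoded": [float(i) * 2.0 for i in range(10)],
    "AnnualSalary": [30000 + 1000 * i for i in range(10)],
})

# Split the rows first, then separate the target from the features in each part.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_df.drop(columns=["AnnualSalary"])  # independent variables
y_train = train_df["AnnualSalary"]                 # dependent variable
X_test = test_df.drop(columns=["AnnualSalary"])
y_test = test_df["AnnualSalary"]

print(X_train.shape, X_test.shape)
```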

Scaling

Two types of scaling are covered:

  • Standard Scaling
  • MinMax Scaling
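
Standard scaling subtracts the mean and divides by the standard deviation of each feature, while MinMax scaling maps each feature to the range [0, 1] using its minimum and maximum. The sketch below uses scikit-learn's `StandardScaler` and `MinMaxScaler` on made-up numbers. Fitting the scaler on the train set only and reusing it on the test set is an assumption about good practice, not a statement of what the repository does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single-feature data for illustration.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Standard scaling: (x - mean) / std, with mean and std learned from train.
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)

# MinMax scaling: (x - min) / (max - min), with min and max learned from train.
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)

print(X_train_std.ravel(), X_test_std.ravel())
print(X_train_mm.ravel(), X_test_mm.ravel())
```

Test values can fall outside [0, 1] under MinMax scaling when they lie beyond the range seen in the train set, as the test point does here.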

Model Evaluation

We used a linear regression model.

Distribution plot of Residuals

Scatter plot of Residuals
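
The residuals shown in the plots above are the differences between actual and predicted values. A minimal sketch of fitting scikit-learn's `LinearRegression` and computing residuals follows. The data is synthetic and generated only for illustration, and the choice of RMSE and R² as metrics is an assumption, since the source does not name the metrics it uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data standing in for the scaled features and salaries.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 50000.0 + X @ np.array([8000.0, -3000.0, 1500.0]) + rng.normal(scale=2000.0, size=200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Residual = actual - predicted; these are the values the residual plots display.
residuals = y_test - predictions

rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```

Residuals centred near zero with no visible pattern against the predictions suggest that a linear fit is reasonable for the data.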
