krish-desai-123/Data-Preprocessing-and-Feature-Engineering

📊 Data Preprocessing & Feature Engineering

A structured, hands-on project series covering the full data science pipeline, from raw multi-source data ingestion to automated EDA, cleaning, imputation, encoding, scaling, and model-ready dataset preparation.


📌 About This Repository

This repository is a structured collection of lab exercises and projects built progressively across three units of a data science course. Each piece targets a specific stage of the ML data pipeline, from understanding raw data formats to engineering features for predictive modelling.

The overarching theme is a real-world Credit Risk / Loan Default binary classification problem, with supporting labs that explore the individual techniques in isolation using smaller, focused datasets.


📂 Repository Structure

📦 Data-Preprocessing-and-Feature-Engineering
 │
 ├── 📓 Lab_Work_1_1.ipynb         # Unit 1: DS project planning, ML framing, tensors
 ├── 📁 Lab_Work_1_2/              # Unit 1: Multi-format data ingestion (CSV, JSON, SQL)
 ├── 📁 Lab_Work_1_3/              # Unit 1: EDA (univariate, bivariate, profiling)
 │
 ├── 📓 Lab_Work_2_1.ipynb         # Unit 2: General, numerical & categorical imputation
 ├── 📓 Lab_Work_2_2.ipynb         # Unit 2: Random sample & KNN imputation
 ├── 📓 Lab_Work_2_3.ipynb         # Unit 2: MICE / IterativeImputer & RMSE evaluation
 ├── 📓 Lab_Work_2_4.ipynb         # Unit 2: Outlier detection: Z-score & IQR
 ├── 📓 Lab_Work_2_5.ipynb         # Unit 2: Outlier treatment: Percentile & Winsorization
 │
 ├── 📓 Lab_Work_3_1.ipynb         # Unit 3: Feature construction from datetime
 ├── 📓 Lab_Work_3_2.ipynb         # Unit 3: Encoding: Ordinal, Label, One-Hot
 ├── 📓 Lab_Work_3_3.ipynb         # Unit 3: Binning: Discretization, Binarization, Quantile
 ├── 📓 Lab_Work_3_4.ipynb         # Unit 3: Scaling: Standardization, Normalization, MinMax
 │
 ├── 📁 PR_1/                      # Project 1: Automated EDA & Data Profiling
 ├── 📁 PR_2/                      # Project 2: Data Cleansing: Patient Health Records
 ├── 📁 Final_Project/             # Final Project: End-to-end Preprocessing Pipeline
 │
 └── 📜 requirements.txt           # All Python dependencies

🧪 Lab Exercises

🔷 Unit 1: Data Foundations

| Notebook | Topics Covered |
| --- | --- |
| Lab_Work_1_1.ipynb | Planning a data science project, framing an ML problem, NumPy tensor fundamentals (scalar, vector, matrix, higher-order) |
| Lab_Work_1_2/ | Reading CSV, JSON, and SQLite files; filtering, merging, and exporting structured data across formats |
| Lab_Work_1_3/ | EDA: univariate distributions, bivariate relationships, multivariate pair plots; automated profiling report via ydata-profiling |
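
The tensor fundamentals listed for Lab_Work_1_1.ipynb can be illustrated with a few NumPy arrays. This is a minimal sketch, not code taken from the notebook; the variable names and values are assumptions chosen for illustration.

```python
import numpy as np

# Tensors of increasing rank; names and values are illustrative assumptions.
scalar = np.array(7.0)                        # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])            # rank 1: one axis
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # rank 2: rows x columns
cube = np.zeros((2, 3, 4))                    # rank 3: a stack of two 3x4 matrices

# ndim reports the rank (number of axes); shape reports the size of each axis.
for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("cube", cube)]:
    print(f"{name}: ndim={tensor.ndim}, shape={tensor.shape}")
```

A tabular dataset is a rank-2 tensor in this vocabulary: one axis for rows (records) and one for columns (features).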

🔷 Unit 2: Missing Values & Outliers

| Notebook | Topics Covered |
| --- | --- |
| Lab_Work_2_1.ipynb | SimpleImputer with mean, median, and mode strategies for numerical and categorical features |
| Lab_Work_2_2.ipynb | MissingIndicator, random sample imputation, KNNImputer (multivariate) |
| Lab_Work_2_3.ipynb | MICE via IterativeImputer, simulating MAR patterns, holdout RMSE evaluation, mixed numeric + categorical imputation |
| Lab_Work_2_4.ipynb | Outlier theory in ML, Z-score detection & removal, IQR detection & removal |
| Lab_Work_2_5.ipynb | Percentile-based clipping with threshold tuning, Winsorization technique |
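
To make the difference between outlier removal and outlier capping concrete, here is a small sketch of IQR-based detection followed by both treatments. The helper `iqr_bounds`, the sample values, and the conventional multiplier `k = 1.5` are assumptions for illustration; the notebooks may use different data and thresholds.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper IQR fences for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical feature with one extreme value.
incomes = pd.Series([32, 35, 36, 38, 40, 41, 43, 45, 47, 250], dtype=float)

lower, upper = iqr_bounds(incomes)
outlier_mask = (incomes < lower) | (incomes > upper)

# Treatment 1: removal drops the flagged rows.
removed = incomes[~outlier_mask]

# Treatment 2: capping clips values to the fences, so no rows are lost.
capped = incomes.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}], flagged: {int(outlier_mask.sum())}")
print(f"rows after removal: {len(removed)}, rows after capping: {len(capped)}")
```

Capping at the IQR fences is a winsorization-style treatment: the extreme value is pulled in to the fence rather than discarded. Percentile-based winsorization works the same way, with the bounds taken from chosen percentiles instead of the fences.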

🔷 Unit 3: Feature Engineering

| Notebook | Topics Covered |
| --- | --- |
| Lab_Work_3_1.ipynb | Datetime feature construction: parsing, type conversion, computing delivery durations |
| Lab_Work_3_2.ipynb | Ordinal Encoding (education levels), Label Encoding, One-Hot Encoding with drop='first' |
| Lab_Work_3_3.ipynb | KBins Discretization (uniform & quantile), Binarization, K-Means binning |
| Lab_Work_3_4.ipynb | Standardization (Z-score), L1/L2 Normalization, MinMax Scaling, MaxAbs & Robust Scaling |

🚀 Projects

PR_1: Automated Data Profiling

Domain: Customer Purchase Behaviour | Goal: Churn Prediction (Binary Classification)

A hands-on implementation of automated EDA across multiple data formats. Customer purchase data is ingested from CSV, JSON, and a SQLite database, merged into a single DataFrame, cleaned, and profiled using ydata-profiling to produce an interactive HTML report.

Key highlights: multi-source ingestion, REST API data fetch, type fixing, univariate/bivariate/multivariate EDA, automated profiling report.
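
The multi-source ingestion step can be sketched as follows. This is a self-contained approximation under stated assumptions: the data is generated in memory rather than read from the project's files, and the column names (`customer_id`, `age`, `total_spend`, `churned`), the table name `churn`, and the join key are hypothetical.

```python
import io
import json
import sqlite3

import pandas as pd

# Source 1: CSV (an in-memory stand-in for a file on disk).
csv_text = "customer_id,age\n1,34\n2,45\n3,29\n"
profiles = pd.read_csv(io.StringIO(csv_text))

# Source 2: JSON records.
json_text = json.dumps([
    {"customer_id": 1, "total_spend": 120.5},
    {"customer_id": 2, "total_spend": 80.0},
    {"customer_id": 3, "total_spend": 310.25},
])
spend = pd.DataFrame(json.loads(json_text))

# Source 3: SQLite (an in-memory database standing in for a .db file).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE churn (customer_id INTEGER, churned INTEGER)")
conn.executemany("INSERT INTO churn VALUES (?, ?)", [(1, 0), (2, 1), (3, 0)])
labels = pd.read_sql_query("SELECT customer_id, churned FROM churn", conn)
conn.close()

# Merge the three sources into one DataFrame on the shared key.
merged = (
    profiles
    .merge(spend, on="customer_id", how="inner")
    .merge(labels, on="customer_id", how="inner")
)
print(merged)
```

An inner join keeps only customers present in every source. With real data, comparing row counts before and after each merge is a quick way to spot keys that fail to match.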

📁 PR_1/ · See the project README for full details.


PR_2: Data Cleansing: Patient Health Records

Domain: Healthcare | Goal: Produce analysis-ready patient records

An end-to-end data cleaning pipeline applied to raw, messy patient health records. The notebook walks through each cleaning stage (deduplication, missing value imputation, type casting, string sanitisation, and outlier handling) and outputs a clean CSV ready for EDA and ML modelling.

Key highlights: duplicate removal, targeted imputation by data type, datetime casting, whitespace/capitalisation standardisation, feature dropping, clean CSV export.
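
The cleaning stages listed above might look roughly like this in pandas. The records, column names, and the choice of median and mode as fill values are assumptions for illustration and are not taken from the project notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical messy records: one exact duplicate row, inconsistent
# whitespace and capitalisation, missing values, and an empty column.
raw = pd.DataFrame({
    "patient_id": [101, 102, 102, 103, 104],
    "name": ["  alice smith", "BOB JONES ", "BOB JONES ", "carol White", " dan brown "],
    "gender": ["F", "M", "M", None, "M"],
    "age": [34.0, 58.0, 58.0, np.nan, 47.0],
    "admitted": ["2023-01-05", "2023-02-11", "2023-02-11", "2023-03-20", "2023-04-02"],
    "notes": ["", "", "", "", ""],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Standardise whitespace and capitalisation in text fields.
clean["name"] = clean["name"].str.strip().str.title()

# 3. Impute by data type: median for numeric, mode for categorical.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["gender"] = clean["gender"].fillna(clean["gender"].mode().iloc[0])

# 4. Cast date strings to a datetime dtype.
clean["admitted"] = pd.to_datetime(clean["admitted"], format="%Y-%m-%d")

# 5. Drop a feature that carries no information, then export as CSV text.
clean = clean.drop(columns=["notes"]).reset_index(drop=True)
csv_text = clean.to_csv(index=False)
print(clean)
```

Deduplicating before imputing keeps repeated rows from skewing the median and mode used as fill values.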

📁 PR_2/ · See the project README for full details.


🏆 Final Project: Holistic Data Preparer

Domain: Credit Risk / Financial Services | Goal: Loan Default Prediction (Binary Classification)

The capstone project brings together the techniques from all three units. Customer credit risk data from three separate sources (CSV, JSON, SQLite) is merged and taken through an 8-part preprocessing pipeline to produce a model-ready 29-feature dataset.

Pipeline at a glance:

| Part | Stage | Key Decision |
| --- | --- | --- |
| A | Conceptual Foundation | DS problem framing, tensor demos |
| B | Multi-Source Acquisition | CSV + JSON + SQLite merged on customer_id |
| C | Cleaning & Imputation | Compared Simple / Random / KNN / MICE → MICE selected |
| D | Outlier Handling | Z-score / IQR / Percentile / IQR+Winsorization → zero rows lost |
| E | Feature Engineering | Ordinal / Label / OHE / KBins / Binarizer / Quantile / K-Means binning |
| F | Feature Scaling | Standard / Normalizer / MinMax / MaxAbs / RobustScaler |
| G | Feature Construction & Transformation | 3 domain features + Log, Reciprocal, Sqrt, Box-Cox, Yeo-Johnson transforms |
| H | Final Deliverable | final_cleaned_data.csv: 2,500 rows × 29 features, zero nulls |
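
Part C compares several imputers before settling on MICE. One common way to run such a comparison is to hide known values, impute them, and score each method by RMSE on the hidden entries. The sketch below shows those mechanics on synthetic data; the feature names, the 20% missing rate, the completely-at-random missingness, and the imputer settings are all assumptions, and it does not reproduce the project's actual numbers. Random sample imputation is omitted because scikit-learn has no built-in class for it.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Synthetic, correlated features (hypothetical stand-ins for credit data).
rng = np.random.default_rng(42)
n = 300
income = rng.normal(50.0, 10.0, n)
loan = 0.6 * income + rng.normal(0.0, 2.0, n)
score = 0.3 * income - 0.2 * loan + rng.normal(0.0, 1.0, n)
X_true = np.column_stack([income, loan, score])

# Hide roughly 20% of the loan column so the true values can act as a holdout.
mask = rng.random(n) < 0.2
X_missing = X_true.copy()
X_missing[mask, 1] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "mice": IterativeImputer(max_iter=10, random_state=0),
}

# Score each imputer only on the entries that were deliberately hidden.
rmse = {}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    error = X_filled[mask, 1] - X_true[mask, 1]
    rmse[name] = float(np.sqrt(np.mean(error ** 2)))

for name, value in rmse.items():
    print(f"{name}: holdout RMSE = {value:.3f}")
```

Because the hidden column is strongly correlated with the others here, methods that use the other columns tend to beat mean imputation. On real data the ranking depends on how the features relate and on the missingness pattern.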

Outputs: final_cleaned_data.csv, customer_credit_risk_dataset_EDA_report.html, KDE comparison plots (kde.png), outlier box plots (boxplot.png).

📁 Final_Project/ · See the project README for full details.


📊 Dataset Summary

| Project | Dataset | Records | Formats | Target Variable |
| --- | --- | --- | --- | --- |
| PR_1 | Customer purchase behaviour | ~100 | CSV, JSON, SQLite | Churn (binary) |
| PR_2 | Patient health records | ~500 | CSV | None (cleaning only) |
| Final Project | Customer credit risk | 2,500 | CSV, JSON, SQLite | default_flag (binary) |

⚙️ Getting Started

1. Clone the Repository

git clone https://github.com/krish-desai-123/Data-Preprocessing-and-Feature-Engineering.git
cd Data-Preprocessing-and-Feature-Engineering

2. Set Up a Virtual Environment

python -m venv venv

# Windows:
venv\Scripts\activate

# macOS / Linux:
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Launch Jupyter

jupyter notebook

Open any lab notebook directly, or navigate into PR_1/, PR_2/, or Final_Project/ to run a full project end-to-end.


🧰 Tech Stack

Tool Purpose
Python 3.10+ Core language
pandas Data loading, merging, manipulation
numpy Numerical operations & tensor demonstrations
matplotlib + seaborn Visualisations β€” distributions, heatmaps, box plots, KDE plots
scikit-learn Imputation, encoding, scaling, binning (SimpleImputer, KNNImputer, IterativeImputer, OrdinalEncoder, OneHotEncoder, StandardScaler, KBinsDiscretizer, etc.)
ydata-profiling Automated interactive HTML EDA reports
sqlite3 SQLite database connections
jupyter Interactive notebook environment

👨‍💻 Author

Krish Desai GitHub: @krish-desai-123


If this repository helped you learn or saved you time, consider giving it a ⭐. It means a lot!
