A structured, hands-on project series covering the full data science pipeline, from raw multi-source data ingestion to automated EDA, cleaning, imputation, encoding, scaling, and model-ready dataset preparation.
This repository is a structured collection of lab exercises and projects built progressively across three units of a data science course. Each piece targets a specific stage of the ML data pipeline, from understanding raw data formats to engineering features for predictive modelling.
The overarching theme is a real-world Credit Risk / Loan Default binary classification problem, with supporting labs that explore the individual techniques in isolation using smaller, focused datasets.
Data-Preprocessing-and-Feature-Engineering
│
├── Lab_Work_1_1.ipynb    # Unit 1: DS project planning, ML framing, tensors
├── Lab_Work_1_2/         # Unit 1: Multi-format data ingestion (CSV, JSON, SQL)
├── Lab_Work_1_3/         # Unit 1: EDA (univariate, bivariate, profiling)
│
├── Lab_Work_2_1.ipynb    # Unit 2: General, numerical & categorical imputation
├── Lab_Work_2_2.ipynb    # Unit 2: Random sample & KNN imputation
├── Lab_Work_2_3.ipynb    # Unit 2: MICE / IterativeImputer & RMSE evaluation
├── Lab_Work_2_4.ipynb    # Unit 2: Outlier detection: Z-score & IQR
├── Lab_Work_2_5.ipynb    # Unit 2: Outlier treatment: Percentile & Winsorization
│
├── Lab_Work_3_1.ipynb    # Unit 3: Feature construction from datetime
├── Lab_Work_3_2.ipynb    # Unit 3: Encoding: Ordinal, Label, One-Hot
├── Lab_Work_3_3.ipynb    # Unit 3: Binning: Discretization, Binarization, Quantile
├── Lab_Work_3_4.ipynb    # Unit 3: Scaling: Standardization, Normalization, MinMax
│
├── PR_1/                 # Project 1: Automated EDA & Data Profiling
├── PR_2/                 # Project 2: Data Cleansing: Patient Health Records
├── Final_Project/        # Final Project: End-to-end Preprocessing Pipeline
│
└── requirements.txt      # All Python dependencies

| Unit 1 Notebook | Topics Covered |
|---|---|
| Lab_Work_1_1.ipynb | Planning a data science project, framing an ML problem, NumPy tensor fundamentals (scalar, vector, matrix, higher-order) |
| Lab_Work_1_2/ | Reading CSV, JSON, and SQLite files; filtering, merging, and exporting structured data across formats |
| Lab_Work_1_3/ | EDA: univariate distributions, bivariate relationships, multivariate pair plots; automated profiling report via ydata-profiling |
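
The tensor fundamentals listed for Lab_Work_1_1 can be illustrated with a minimal NumPy sketch. The variable names and values below are illustrative and are not taken from the notebook.

```python
import numpy as np

# Tensors of increasing order, as reported by ndarray.ndim.
scalar = np.array(7.5)               # order 0: a single value, shape ()
vector = np.array([1.0, 2.0, 3.0])   # order 1: shape (3,)
matrix = np.arange(6).reshape(2, 3)  # order 2: shape (2, 3)
cube = np.zeros((2, 3, 4))           # order 3: shape (2, 3, 4)

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(f"{name}: ndim={t.ndim}, shape={t.shape}")
```

A tabular dataset is typically an order-2 tensor (rows × features); a batch of images or time-series windows adds further axes.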

| Unit 2 Notebook | Topics Covered |
|---|---|
| Lab_Work_2_1.ipynb | SimpleImputer with mean, median, and mode strategies for numerical and categorical features |
| Lab_Work_2_2.ipynb | MissingIndicator, random sample imputation, KNNImputer (multivariate) |
| Lab_Work_2_3.ipynb | MICE via IterativeImputer, simulating MAR patterns, holdout RMSE evaluation, mixed numeric + categorical imputation |
| Lab_Work_2_4.ipynb | Outlier theory in ML, Z-score detection & removal, IQR detection & removal |
| Lab_Work_2_5.ipynb | Percentile-based clipping with threshold tuning, Winsorization technique |
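
As a rough illustration of the holdout RMSE evaluation mentioned for Lab_Work_2_3, the sketch below hides known values, imputes them three ways, and scores each imputer against the hidden truth. The data is synthetic, and the values are hidden completely at random for simplicity (the notebook simulates MAR patterns instead), so this is an assumption-laden sketch rather than the notebook's procedure.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(42)

# Synthetic, correlated numeric data (illustrative only, not the course dataset).
n = 300
x1 = rng.normal(50, 10, n)
x2 = 2.0 * x1 + rng.normal(0, 5, n)  # strongly related to x1
x3 = rng.normal(0, 1, n)             # unrelated noise column
X_true = np.column_stack([x1, x2, x3])

# Hide roughly 20% of column 1 and keep the true values as a holdout.
mask = rng.random(n) < 0.2
X_missing = X_true.copy()
X_missing[mask, 1] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "mice": IterativeImputer(random_state=0, max_iter=10),
}

rmse = {}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    err = X_filled[mask, 1] - X_true[mask, 1]  # error on hidden cells only
    rmse[name] = float(np.sqrt(np.mean(err ** 2)))
    print(f"{name}: holdout RMSE = {rmse[name]:.2f}")
```

Because the hidden column depends on another feature, the multivariate imputers should recover it more closely than the column mean; on real data the ranking depends on how much the features actually explain one another.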

| Unit 3 Notebook | Topics Covered |
|---|---|
| Lab_Work_3_1.ipynb | Datetime feature construction: parsing, type conversion, computing delivery durations |
| Lab_Work_3_2.ipynb | Ordinal Encoding (education levels), Label Encoding, One-Hot Encoding with drop='first' |
| Lab_Work_3_3.ipynb | KBins Discretization (uniform & quantile), Binarization, K-Means binning |
| Lab_Work_3_4.ipynb | Standardization (Z-score), L1/L2 Normalization, MinMax Scaling, MaxAbs & Robust Scaling |
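
The encoding and scaling techniques from Unit 3 can be sketched on a toy frame as follows. The column names, category labels, and values are hypothetical; only the scikit-learn classes come from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (MinMaxScaler, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)

# Hypothetical toy data, not taken from the lab datasets.
df = pd.DataFrame({
    "education": ["High School", "Bachelor", "Master", "Bachelor"],
    "region": ["north", "south", "north", "east"],
    "income": [30000.0, 52000.0, 81000.0, 47000.0],
})

# Ordinal encoding: the category order is supplied explicitly.
edu_order = [["High School", "Bachelor", "Master"]]
edu_codes = OrdinalEncoder(categories=edu_order).fit_transform(df[["education"]])

# One-hot encoding with drop="first": k categories become k - 1 columns.
ohe = OneHotEncoder(drop="first")
region_ohe = ohe.fit_transform(df[["region"]]).toarray()

# Standardization (zero mean, unit variance) and MinMax scaling to [0, 1].
income_std = StandardScaler().fit_transform(df[["income"]])
income_minmax = MinMaxScaler().fit_transform(df[["income"]])

print(edu_codes.ravel())
print(region_ohe)
```

Dropping the first one-hot column avoids a redundant, perfectly collinear feature, which matters mainly for linear models.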
Domain: Customer Purchase Behaviour | Goal: Churn Prediction (Binary Classification)
A hands-on implementation of automated EDA across multiple data formats. Customer purchase data is ingested from CSV, JSON, and a SQLite database, merged into a single DataFrame, cleaned, and profiled using ydata-profiling to produce an interactive HTML report.
Key highlights: multi-source ingestion, REST API data fetch, type fixing, univariate/bivariate/multivariate EDA, automated profiling report.
PR_1/ · See the project README for full details.
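
A minimal sketch of the multi-source ingestion step is shown below. It uses small in-memory stand-ins for the CSV, JSON, and SQLite sources; the table name, column names, and the `customer_id` join key are assumptions made for illustration and may differ from the project files.

```python
import io
import sqlite3

import pandas as pd

# In-memory stand-ins for the three sources (hypothetical columns and values).
csv_text = "customer_id,age\n1,34\n2,45\n3,29\n"
json_text = (
    '[{"customer_id": 1, "total_spend": 120.5},'
    ' {"customer_id": 2, "total_spend": 80.0},'
    ' {"customer_id": 3, "total_spend": 310.0}]'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE churn (customer_id INTEGER PRIMARY KEY, churned INTEGER)")
conn.executemany("INSERT INTO churn VALUES (?, ?)", [(1, 0), (2, 1), (3, 0)])
conn.commit()

# Load each source into its own DataFrame.
customers = pd.read_csv(io.StringIO(csv_text))
spend = pd.read_json(io.StringIO(json_text))
labels = pd.read_sql_query("SELECT customer_id, churned FROM churn", conn)
conn.close()

# Merge on the shared key; validate= raises if a key is unexpectedly duplicated.
merged = (
    customers
    .merge(spend, on="customer_id", how="inner", validate="one_to_one")
    .merge(labels, on="customer_id", how="inner", validate="one_to_one")
)
print(merged)
```

With real files, the `io.StringIO` wrappers would be replaced by file paths and the in-memory connection by a path to the database file.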
Domain: Healthcare | Goal: Produce analysis-ready patient records
An end-to-end data cleaning pipeline applied to raw, messy patient health records. The notebook walks through every stage of data purification (deduplication, missing value imputation, type casting, string sanitisation, and outlier handling) to output a fully clean CSV ready for EDA and ML modelling.
Key highlights: duplicate removal, targeted imputation by data type, datetime casting, whitespace/capitalisation standardisation, feature dropping, clean CSV export.
PR_2/ · See the project README for full details.
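
The cleaning stages described above might look roughly like this in pandas. The columns and records are invented for illustration, and the choice of median for numeric gaps and mode for categorical gaps is an assumption about what "targeted imputation by data type" means here.

```python
import numpy as np
import pandas as pd

# Hypothetical messy records; not the project's actual data.
raw = pd.DataFrame({
    "patient_id": [101, 102, 102, 103, 104],
    "name": ["  alice SMITH", "Bob Jones ", "Bob Jones ", "carol WHITE", "dan brown"],
    "admitted": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-15", "not recorded"],
    "bmi": [22.5, 28.0, 28.0, np.nan, 27.0],
    "blood_group": ["A+", "B+", "B+", "A+", None],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Standardise whitespace and capitalisation in text fields.
clean["name"] = clean["name"].str.strip().str.title()

# 3. Cast dates; unparseable entries become NaT instead of raising.
clean["admitted"] = pd.to_datetime(clean["admitted"], format="%Y-%m-%d", errors="coerce")

# 4. Impute by data type: median for numeric, mode for categorical (assumed strategy).
clean["bmi"] = clean["bmi"].fillna(clean["bmi"].median())
clean["blood_group"] = clean["blood_group"].fillna(clean["blood_group"].mode().iloc[0])

clean = clean.reset_index(drop=True)
print(clean)
```

The cleaned frame could then be written out with `clean.to_csv(...)`, which corresponds to the clean CSV export listed in the highlights.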
Domain: Credit Risk / Financial Services | Goal: Loan Default Prediction (Binary Classification)
The capstone project bringing together every technique from all three units. Customer credit risk data from three separate sources (CSV, JSON, SQLite) is merged and taken through an 8-part preprocessing pipeline to produce a model-ready 29-feature dataset.
Pipeline at a glance:
| Part | Stage | Key Decision |
|---|---|---|
| A | Conceptual Foundation | DS problem framing, tensor demos |
| B | Multi-Source Acquisition | CSV + JSON + SQLite merged on customer_id |
| C | Cleaning & Imputation | Compared Simple / Random / KNN / MICE; MICE selected |
| D | Outlier Handling | Z-score / IQR / Percentile / IQR+Winsorization; zero rows lost |
| E | Feature Engineering | Ordinal / Label / OHE / KBins / Binarizer / Quantile / K-Means binning |
| F | Feature Scaling | Standard / Normalizer / MinMax / MaxAbs / RobustScaler |
| G | Feature Construction & Transformation | 3 domain features + Log, Reciprocal, Sqrt, Box-Cox, Yeo-Johnson transforms |
| H | Final Deliverable | final_cleaned_data.csv: 2,500 rows × 29 features, zero nulls |
Outputs: final_cleaned_data.csv, customer_credit_risk_dataset_EDA_report.html, KDE comparison plots (kde.png), outlier box plots (boxplot.png).
Final_Project/ · See the project README for full details.
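
Part D reports that outliers were handled with zero rows lost. One way to get that property, sketched here on synthetic data, is to cap values at the IQR fences instead of dropping rows. The `income` series and the 1.5 multiplier are illustrative assumptions, not the project's confirmed settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic income-like feature with two injected extreme values.
income = pd.Series(
    np.append(rng.normal(50_000, 8_000, 200), [250_000, 400_000]),
    name="income",
)

# IQR fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (income < lower) | (income > upper)

# Winsorization-style capping: clip to the fences rather than removing rows.
capped = income.clip(lower=lower, upper=upper)

print(f"flagged: {int(is_outlier.sum())}, "
      f"rows before: {len(income)}, rows after: {len(capped)}")
```

Capping keeps every record available for modelling, at the cost of compressing the tails of the distribution.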
| Project | Dataset | Records | Formats | Target Variable |
|---|---|---|---|---|
| PR_1 | Customer purchase behaviour | ~100 | CSV, JSON, SQLite | Churn (binary) |
| PR_2 | Patient health records | ~500 | CSV | None (cleaning only) |
| Final Project | Customer credit risk | 2,500 | CSV, JSON, SQLite | default_flag (binary) |
git clone https://github.com/krish-desai-123/Data-Preprocessing-and-Feature-Engineering.git
cd Data-Preprocessing-and-Feature-Engineering
python -m venv venv
# Windows:
venv\Scripts\activate
# macOS / Linux:
source venv/bin/activate
pip install -r requirements.txt
jupyter notebook

Open any lab notebook directly, or navigate into PR_1/, PR_2/, or Final_Project/ to run a full project end-to-end.

| Tool | Purpose |
|---|---|
| Python 3.10+ | Core language |
| pandas | Data loading, merging, manipulation |
| numpy | Numerical operations & tensor demonstrations |
| matplotlib + seaborn | Visualisations: distributions, heatmaps, box plots, KDE plots |
| scikit-learn | Imputation, encoding, scaling, binning (SimpleImputer, KNNImputer, IterativeImputer, OrdinalEncoder, OneHotEncoder, StandardScaler, KBinsDiscretizer, etc.) |
| ydata-profiling | Automated interactive HTML EDA reports |
| sqlite3 | SQLite database connections |
| jupyter | Interactive notebook environment |

Krish Desai · GitHub: @krish-desai-123