This repository contains an end-to-end ML workflow aimed at predicting house sale prices given 79 explanatory variables from the Ames Housing dataset. The goal is to demonstrate proficiency in building machine learning pipelines using common industry practices.
Predicting house prices is a classic regression problem, often used to benchmark machine learning techniques. In this Kaggle competition, the aim is to estimate each house's final sale price through data exploration, feature engineering, model experimentation, and hyperparameter tuning. The codebase covers the typical steps of a small data science project:
- Data Acquisition & Organisation
- Exploratory Data Analysis (EDA)
- Feature Engineering & Data Cleaning
- Model Development (including Linear Regression, Random Forests, Gradient Boosting, and Neural Networks)
- Evaluation & Submission (using Root Mean Squared Error on the log of SalePrice)
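The evaluation metric named in the last step can be written down in a few lines. The following is a minimal sketch, not code taken from this repository: the function name `rmse_log` and the sample prices are illustrative assumptions.

```python
import numpy as np


def rmse_log(y_true, y_pred):
    """Root Mean Squared Error between the natural logs of two price arrays.

    Hypothetical helper for illustration; both inputs must be strictly positive.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Made-up sale prices, used only to show the call.
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
print(f"RMSE on log(SalePrice): {rmse_log(actual, predicted):.4f}")
```

Taking logs first means a 10% error on a cheap house is penalised about as much as a 10% error on an expensive one.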
.
├── data/
│   ├── raw/                            # Original Kaggle data (train/test)
│   └── processed/                      # Data after cleaning and feature engineering
├── notebooks/
│   ├── copy_data.ipynb                 # Copies processed data into src/data
│   ├── eda.ipynb                       # Exploratory data analysis (insights on distributions, correlations, etc.)
│   ├── feature_engineering.ipynb
│   └── initial_data_separation.ipynb
├── src/
│   ├── architectures/                  # Neural network architectures (PyTorch)
│   ├── data/                           # Data (used by training scripts)
│   ├── logs/                           # Logging files
│   ├── models/                         # Serialised trained models
│   ├── runs/                           # Checkpoints and run artifacts
│   ├── __init__.py
│   ├── train.py                        # PyTorch-based training script
│   ├── train_predict.py                # Scikit-learn-based training & prediction workflow
│   ├── train_simple.py                 # Simple scikit-learn training approach
│   └── train_sklearn.py                # Unified scikit-learn training pipeline
├── submissions/                        # Model predictions to upload on Kaggle
├── environment.yml                     # Packages and conda environment
└── README.md                           # Project documentation
- notebooks/eda.ipynb: Performs exploratory data analysis to uncover relationships, distributions, missing data patterns, and potential outliers. Visualisations include histograms, box plots, pairplots, correlation heatmaps, and missing-value plots.
- notebooks/feature_engineering.ipynb: Demonstrates feature engineering workflows, including:
  - Handling categorical variables via one-hot encoding or ordinal mappings
  - Combining multiple features (e.g., total square footage)
  - Creating new features (e.g., date sold, binary flags for porches/pools)
  - Imputation (median, mode) for missing data
- notebooks/initial_data_separation.ipynb: Establishes the data separation steps, ensuring that training and test sets are clearly demarcated and that the relevant transformations are applied consistently to both. Also documents how columns with high missingness or little relevance may be dropped.
- src/train.py: A PyTorch-based training script that sets up an MLP architecture, trains it with a configurable number of epochs and learning rate, and logs the results. Outputs the final trained model and predictions for submission.
- src/train_sklearn.py: A scikit-learn-based training pipeline that systematically evaluates multiple regression models (Linear, Random Forest, Gradient Boosting, XGBoost, etc.). Uses random validation splits and standardised features, and logs the RMSE to identify the best model.
- src/train_predict.py: An alternative scikit-learn workflow that shows how to quickly train multiple models (e.g., LinearRegression, RandomForestRegressor, AdaBoostRegressor, XGBRegressor) and generate Kaggle submission files.
- src/train_simple.py: A more lightweight approach to scikit-learn model testing, focusing on a smaller subset of models with minimal overhead (useful for quick experiments).
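The feature engineering steps described for notebooks/feature_engineering.ipynb can be sketched roughly as follows. This is a hedged illustration, not the notebook's actual code: the toy values are made up, the `engineer` function is hypothetical, and the definition of `TotalSF` (basement plus first and second floor area) and the 1 to 5 quality scale are assumptions.

```python
import pandas as pd

# Toy frame using a few Ames column names; the values are invented.
df = pd.DataFrame({
    "TotalBsmtSF": [856, 1262, 920],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "WoodDeckSF": [0, 298, 0],
    "LotFrontage": [65.0, None, 68.0],
    "GarageType": ["Attchd", None, "Attchd"],
    "ExterQual": ["Gd", "TA", "Ex"],
})


def engineer(frame):
    """Hypothetical helper applying the four kinds of transformation."""
    out = frame.copy()
    # Combined feature (assumed definition of TotalSF).
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    # Binary presence flag.
    out["WoodDeckSF_Present"] = (out["WoodDeckSF"] > 0).astype(int)
    # Median imputation for a numeric column, mode imputation for a categorical one.
    out["LotFrontage"] = out["LotFrontage"].fillna(out["LotFrontage"].median())
    out["GarageType"] = out["GarageType"].fillna(out["GarageType"].mode()[0])
    # Ordinal mapping for a quality rating (assumed 1-5 scale).
    quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
    out["ExterQual"] = out["ExterQual"].map(quality_scale)
    # One-hot encoding for a nominal column.
    return pd.get_dummies(out, columns=["GarageType"])


features = engineer(df)
print(features.head())
```

In a real pipeline the median and mode would normally be computed on the training set only and then reused for the test set, so that both sets receive the same transformation.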
- Exploratory Data Analysis
  - Visualised missing data, outliers, and key correlations (e.g., OverallQual, TotalSF).
  - Quantified relationships between categorical features and the target variable (SalePrice).
- Feature Engineering
  - Performed one-hot encoding for high-cardinality categorical variables.
  - Mapped ordinal features (e.g., quality/condition metrics) to numeric scales.
  - Created combined features such as TotalSF and binary indicators (WoodDeckSF_Present, etc.).
  - Used custom imputation strategies (median, mode) for missing values.
- Model Development
  - Utilised various regression algorithms:
    - Linear Regression and Regularised Linear Models (Lasso, Ridge)
    - Tree-Based Models (Random Forest, GradientBoostingRegressor)
    - XGBoost for gradient boosting with flexible hyperparameter tuning
    - Neural Networks using a feed-forward MLP architecture (PyTorch)
  - Investigated hyperparameters (e.g., number of estimators in Random Forest, learning rate for MLP).
- Evaluation & Validation
  - Used train-validation splits to estimate performance on unseen data.
  - Monitored Root Mean Squared Error (RMSE) on the log of SalePrice, in line with Kaggle’s evaluation metric.
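The model development and validation steps above can be illustrated with a small comparison loop. This is a sketch under stated assumptions rather than the code in src/train_sklearn.py: it uses synthetic data in place of the processed Ames features, builds an artificial log-price target, and leaves out XGBoost and the PyTorch MLP to keep the dependencies to scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed features (assumption, not the real data).
X, signal = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
# Artificial log(SalePrice) target centred around 12.
log_price = 12.0 + 0.3 * (signal - signal.mean()) / signal.std()

# Random train-validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, log_price, test_size=0.2, random_state=0
)

# Linear models get standardised features; tree ensembles do not need scaling.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    # The target is already in log space, so this is RMSE on log prices.
    scores[name] = float(np.sqrt(mean_squared_error(y_val, preds)))

best = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>18}: {score:.4f}")
print(f"Best model on the validation split: {best}")
```

Training on the log of the price and measuring plain RMSE on the validation set keeps the local score comparable to the Kaggle metric. Predictions need to be passed through `np.exp` before being written to a submission file.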
Depending on the model, final Kaggle submissions scored an RMSE of roughly 0.13–0.14 on Kaggle's log-based metric. That is a solid result for a straightforward pipeline and reflects effective use of:
- Feature engineering best practices
- Ensemble methods (Random Forest, Gradient Boosting, XGBoost)
- Neural Networks for regression tasks