HouseVal

This repository contains an end-to-end ML workflow aimed at predicting house sale prices given 79 explanatory variables from the Ames Housing dataset. The goal is to demonstrate proficiency in building machine learning pipelines using common industry practices.

Project Overview

Predicting house prices is a classic regression problem, frequently used to compare machine learning techniques. In this Kaggle competition, the aim is to estimate the final sale price of each house through data exploration, feature engineering, model experimentation, and hyperparameter tuning. The codebase covers the typical steps of a small data science project:

- Data Acquisition & Organisation
- Exploratory Data Analysis (EDA)
- Feature Engineering & Data Cleaning
- Model Development (including Linear Regression, Random Forests, Gradient Boosting, and Neural Networks)
- Evaluation & Submission (using Root Mean Squared Error on the log of SalePrice)

Repository Structure

.
├── data/
│   ├── raw/                  # Original Kaggle data (train/test)
│   └── processed/            # Data after cleaning and feature engineering
├── notebooks/
│   ├── copy_data.ipynb       # Copies processed data into src/data
│   ├── eda.ipynb             # Exploratory data analysis (insights on distributions, correlations, etc.)
│   ├── feature_engineering.ipynb
│   └── initial_data_separation.ipynb
├── src/
│   ├── architectures/        # Neural network architectures (PyTorch)
│   ├── data/                 # Data (used by training scripts)
│   ├── logs/                 # Logging files
│   ├── models/               # Serialised trained models
│   ├── runs/                 # Checkpoints and run artifacts
│   ├── __init__.py
│   ├── train.py              # PyTorch-based training script
│   ├── train_predict.py      # Scikit-learn-based training & prediction workflow
│   ├── train_simple.py       # Simple scikit-learn training approach
│   └── train_sklearn.py      # Unified scikit-learn training pipeline
├── submissions/              # Model predictions to upload on Kaggle
├── environment.yml           # Packages and conda environment
└── README.md                 # Project documentation

Key Files & Directories

notebooks/eda.ipynb: Performs exploratory data analysis to uncover relationships, distributions, missing-data patterns, and potential outliers. Visualisations include histograms, box plots, pair plots, correlation heatmaps, and missing-value plots.
notebooks/feature_engineering.ipynb: Demonstrates feature engineering workflows, including:
- Handling categorical variables via one-hot encoding or ordinal mappings
- Combining multiple features (e.g., total square footage)
- Creating new features (e.g., date sold, binary flags for porches/pools)
- Imputing missing data (median, mode)
notebooks/initial_data_separation.ipynb: Establishes the data separation steps, keeping training and test sets clearly demarcated and applying the relevant transformations consistently. Also documents how columns with high missingness or little relevance may be dropped.
src/train.py: A PyTorch-based training script that sets up an MLP architecture, trains it with a configurable number of epochs and learning rate, and logs the results. Outputs the final trained model and predictions for submission.
src/train_sklearn.py: A scikit-learn-based training pipeline that systematically evaluates multiple regression models (Linear, Random Forest, Gradient Boosting, XGBoost, etc.). It validates on random splits, standardises features, and logs the RMSE of each model to identify the best one.
src/train_predict.py: An alternative scikit-learn workflow that shows how to quickly train multiple models (e.g., LinearRegression, RandomForestRegressor, AdaBoostRegressor, XGBRegressor) and generate Kaggle submission files.
src/train_simple.py: A more lightweight approach to scikit-learn model testing, focusing on a smaller subset of models with minimal overhead (useful for quick experiments).
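
For orientation, a PyTorch training script of the kind described for src/train.py might look roughly like the sketch below. The helper names (`build_mlp`, `train_mlp`), the layer sizes, the Adam optimiser, and the synthetic data are all assumptions made for illustration; they are not taken from the repository.

```python
import torch
from torch import nn


def build_mlp(n_features: int, hidden: int = 64) -> nn.Sequential:
    """Feed-forward regressor; the layer sizes are illustrative assumptions."""
    return nn.Sequential(
        nn.Linear(n_features, hidden),
        nn.ReLU(),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, 1),
    )


def train_mlp(model, X, y, epochs: int = 200, lr: float = 1e-2):
    """Full-batch training with MSE loss; returns the loss recorded at each epoch."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        optimiser.step()
        history.append(loss.item())
    return history


# Synthetic stand-in for the processed features and the log-transformed target.
torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10) + 0.1 * torch.randn(256)

model = build_mlp(n_features=10)
history = train_mlp(model, X, y)
print(f"first-epoch MSE: {history[0]:.4f}, last-epoch MSE: {history[-1]:.4f}")
```

The number of epochs and the learning rate are exposed as function arguments here, mirroring the configurable settings mentioned for the script.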

Machine Learning Workflow Highlights

Exploratory Data Analysis
- Visualised missing data, outliers, and key correlations (e.g., OverallQual, TotalSF).
- Quantified relationships between categorical features and the target variable (SalePrice).
Feature Engineering
- Performed one-hot encoding for nominal (unordered) categorical variables.
- Mapped ordinal features (e.g., quality/condition metrics) to numeric scales.
- Created combined features such as TotalSF and binary indicators (WoodDeckSF_Present, etc.).
- Used custom imputation strategies (median, mode) for missing values.
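
The steps above can be sketched with pandas on a tiny made-up frame. The column names follow the Ames dataset, but the values, the exact definition of TotalSF (basement plus first and second floor), and the choice of columns are assumptions for illustration rather than the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using Ames-style column names; the values are illustrative only.
df = pd.DataFrame({
    "TotalBsmtSF": [850.0, np.nan, 1200.0],
    "1stFlrSF": [900, 1100, 1250],
    "2ndFlrSF": [0, 700, 0],
    "WoodDeckSF": [0, 120, 0],
    "ExterQual": ["TA", "Gd", "Ex"],
    "MSZoning": ["RL", np.nan, "RM"],
})

# Median imputation for a numeric column, mode imputation for a categorical one.
df["TotalBsmtSF"] = df["TotalBsmtSF"].fillna(df["TotalBsmtSF"].median())
df["MSZoning"] = df["MSZoning"].fillna(df["MSZoning"].mode()[0])

# Ordinal mapping for a quality scale (Po < Fa < TA < Gd < Ex).
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].map(quality_scale)

# Combined feature (assumed definition) and a binary presence flag.
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["WoodDeckSF_Present"] = (df["WoodDeckSF"] > 0).astype(int)

# One-hot encode the remaining nominal column.
df = pd.get_dummies(df, columns=["MSZoning"])
print(df[["TotalSF", "WoodDeckSF_Present", "ExterQual"]])
```

In a real pipeline the imputation statistics would be computed on the training set only and then reused for the test set, so that no test information leaks into the features.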
Model Development
- Utilised various regression algorithms:
  - Linear Regression and Regularised Linear Models (Lasso, Ridge)
  - Tree-Based Models (Random Forest, GradientBoostingRegressor)
  - XGBoost for gradient boosting with flexible hyperparameter tuning
  - Neural Networks using a feed-forward MLP architecture (PyTorch)
- Investigated hyperparameters (e.g., number of estimators in Random Forest, learning rate for MLP).
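
A comparison loop over several of these regressors could be sketched as follows. This is a minimal example on synthetic data, not the repository's training script: the hyperparameter values, the 80/20 split, and the variable names are assumptions, and XGBoost is left out so that the example depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for processed features (X) and log-transformed SalePrice (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 12.0 + X @ rng.normal(scale=0.1, size=8) + rng.normal(scale=0.05, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.001),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline so it is fitted on the training split only.
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_val)
    scores[name] = float(np.sqrt(mean_squared_error(y_val, predictions)))

best_model = min(scores, key=scores.get)
for name, rmse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>18}: RMSE = {rmse:.4f}")
print(f"best model: {best_model}")
```

Keeping the scaler inside the pipeline means each candidate sees identically prepared data, which makes the validation RMSE values directly comparable.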
Evaluation & Validation
- Used held-out train-validation splits to estimate performance on unseen data.
- Monitored Root Mean Squared Error (RMSE) on the log of SalePrice to align with Kaggle’s evaluation metric.
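
The log-based metric can be written in a few lines of NumPy. The function name `rmse_log` and the example prices below are hypothetical; the sketch only illustrates how RMSE on the log of SalePrice is computed.

```python
import numpy as np


def rmse_log(y_true, y_pred) -> float:
    """RMSE between the natural logs of observed and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Made-up prices for illustration.
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
print(f"log RMSE: {rmse_log(actual, predicted):.4f}")
```

Because the error is measured on logs, a 10% miss on an inexpensive house is penalised the same as a 10% miss on an expensive one. A model trained directly on log-transformed prices can be scored with ordinary RMSE to get the same quantity.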

Performance

Depending on the model, final Kaggle submissions scored an RMSE of roughly 0.13–0.14 on Kaggle's log-based metric. This is a solid result for a straightforward pipeline and reflects the use of:

- Feature engineering best practices
- Ensemble methods (Random Forest, Gradient Boosting, XGBoost)
- Neural networks for regression tasks
