ICT Capstone Project, Group S1-2026-10
This repository contains the data cleaning, modelling code, output files, and documentation for the Singapore birth and fertility forecasting project. The project analyses historical Singapore fertility data and compares forecasting approaches for:
- Total Fertility Rate (TFR)
- Total Live Births (TLB)
The main workflow cleans the raw dataset, creates training and testing splits, runs ARIMA and SARIMA models, checks residual diagnostics, compares forecast accuracy, and stores output plots and result tables for the final report.
| Name | Student ID |
|---|---|
| Brandon Ho | 1926054 |
| Christina Nguyen | 1797406 |
| Lara Grocke | 1802741 |
| Minh Quant Tran | 1897916 |
| Xinhai Li | 1881755 |
The project investigates long-term fertility and birth patterns in Singapore using public demographic data from 1960 to 2025. The training period is 1960 to 2012 and the testing period is 2013 to 2025. The testing period is used to compare how well each model forecasts unseen data.
The main modelling approaches in this repository are:
- ARIMA models for TFR and TLB
- Corrected SARIMA models for TLB with a 12-year seasonal period
- Zodiac-related exploratory analysis
- Machine learning comparison models
- Literature and dataset documentation for report support
Open the project in RStudio and set the working directory to the project root. The project root should contain folders such as code, raw_data, clean_data, documentation, outputs, and processed_data.
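Before running any script, it can help to confirm that the working directory really is the project root. The snippet below is a minimal sketch of such a check; the helper name `missing_project_dirs` is hypothetical and is not part of the repository.

```r
# Folders the project root is expected to contain (taken from the text above).
expected_dirs <- c("code", "raw_data", "clean_data",
                   "documentation", "outputs", "processed_data")

# Hypothetical helper: return the expected folders that are missing under `root`.
missing_project_dirs <- function(root = getwd()) {
  expected_dirs[!dir.exists(file.path(root, expected_dirs))]
}

missing <- missing_project_dirs()
if (length(missing) > 0) {
  message("Missing folders: ", paste(missing, collapse = ", "))
}
```

If the message lists most of the folders, the working directory is probably not the project root.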
Run the preprocessing script first:

    source("code/preprocessing.R")

This rebuilds the cleaned data files in `clean_data/`.
The main data cleaning file is `code/preprocessing.R`. Run it with:

    source("code/preprocessing.R")

This script:
- imports the raw dataset from `raw_data/1960-2025.csv`
- reshapes the dataset from wide format into long format
- extracts TFR and TLB values
- cleans numeric formatting
- creates the training and testing datasets
- exports cleaned files into `clean_data/`

Expected split:
- Training set: 1960 to 2012
- Testing set: 2013 to 2025
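The cleaning steps listed above can be sketched in base R as follows. This is an illustration under stated assumptions, not the contents of `code/preprocessing.R`: the column layout (`series`, `X2011`, ...) is hypothetical, and the values are invented placeholders rather than figures from the raw file.

```r
# Hypothetical wide-format input: one row per series, one column per year.
# Values are invented placeholders, not real Singapore data.
raw_wide <- data.frame(
  series = c("TFR", "TLB"),
  X2011 = c("1.5", "40,000"),
  X2012 = c("1.4", "41,000"),
  X2013 = c("1.3", "39,000"),
  stringsAsFactors = FALSE
)

# Reshape wide to long: one row per (series, year) pair.
year_cols <- setdiff(names(raw_wide), "series")
long <- data.frame(
  series = rep(raw_wide$series, times = length(year_cols)),
  year   = rep(as.integer(sub("^X", "", year_cols)), each = nrow(raw_wide)),
  value  = unlist(raw_wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Clean numeric formatting: drop thousands separators and stray spaces.
long$value <- as.numeric(gsub("[, ]", "", long$value))

# Split by year, using the boundary stated above (training ends in 2012).
train_end <- 2012L
train <- long[long$year <= train_end, ]
test  <- long[long$year >  train_end, ]
```

Splitting on a single year boundary keeps the test set strictly later than the training set, which is what a forecasting comparison needs.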
To check the cleaned data:

    train <- read.csv("clean_data/train.csv")
    test <- read.csv("clean_data/test.csv")
    head(train)
    tail(train)
    head(test)
    tail(test)

Data preparation scripts:

| File | Purpose |
|---|---|
| `code/preprocessing.R` | Rebuilds cleaned TFR and TLB datasets from the raw Singapore fertility data. |
| `code/subset_plotting.R` | Creates basic TFR and TLB time series plots from the cleaned training data. |

TFR ARIMA scripts:

| File | Purpose |
|---|---|
| `code/tfr_arima.R` | Runs ARIMA model exploration for Total Fertility Rate. |
| `code/tfr_arima_model_validation.R` | Validates selected TFR ARIMA models and produces forecast outputs. |
| `code/tfr_residual_diagnostics.R` | Produces residual diagnostic plots for selected TFR ARIMA models. |
| `code/tfr-model-comparison.R` | Compares the best AIC and best RMSE TFR ARIMA models. |
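To illustrate what "best AIC" and "best RMSE" mean here, the sketch below fits a small grid of ARIMA orders with base R's `arima()`, records the in-sample AIC and the test-period RMSE for each, and runs a Ljung-Box check on the residuals of the AIC winner. The series is synthetic, and the grid and lag choices are assumptions made for illustration; the repository scripts may use different packages and settings.

```r
set.seed(1)
# Synthetic annual series standing in for TFR, 1960 to 2025 (not real data).
y <- ts(5 + cumsum(rnorm(66, mean = -0.05, sd = 0.1)), start = 1960)
train <- window(y, end = 2012)
test  <- window(y, start = 2013)

# Small illustrative grid of (p, d, q) orders.
grid <- expand.grid(p = 0:2, d = 1, q = 0:2)

results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  ord <- as.integer(unlist(grid[i, ]))
  fit <- tryCatch(arima(train, order = ord, method = "ML"),
                  error = function(e) NULL)
  if (is.null(fit)) return(NULL)  # skip orders that fail to fit
  fc <- predict(fit, n.ahead = length(test))$pred
  data.frame(p = ord[1], d = ord[2], q = ord[3],
             aic  = AIC(fit),
             rmse = sqrt(mean((as.numeric(test) - as.numeric(fc))^2)))
}))

best_aic  <- results[which.min(results$aic), ]   # best in-sample fit
best_rmse <- results[which.min(results$rmse), ]  # best out-of-sample accuracy

# Ljung-Box test on the residuals of the AIC-selected model.
best_fit <- arima(train, order = c(best_aic$p, best_aic$d, best_aic$q),
                  method = "ML")
lb <- Box.test(residuals(best_fit), lag = 10, type = "Ljung-Box",
               fitdf = best_aic$p + best_aic$q)
```

The two winners need not be the same model: AIC rewards fit on the training years, while RMSE measures accuracy on the held-out years.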

TLB ARIMA and SARIMA scripts:

| File | Purpose |
|---|---|
| `code/tlb_arima_search.R` | Searches non-seasonal TLB ARIMA candidate models. |
| `code/tlb_arima_model_validation.R` | Validates the strongest TLB ARIMA candidates and creates forecast and residual plots. |
| `code/tlb_sarima_search.R` | Runs the corrected TLB SARIMA search using seasonal differencing with D = 1 and period 12. |
| `code/tlb_sarima_model_validation.R` | Validates the strongest corrected TLB SARIMA models. |
| `code/tlb_compare_arima_sarima.R` | Compares the strongest TLB ARIMA and SARIMA models in one output table. |
| `code/tlb_sarima_selected_plots.R` | Generates report-ready plots for selected corrected SARIMA models. |
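As a rough illustration of seasonal differencing with D = 1 and period 12 on annual data, the sketch below fits one non-seasonal and one seasonal model with base R's `arima()` and puts their test RMSE side by side. The series is synthetic (in thousands, with an invented bump every 12th year), and the specific orders are assumptions rather than the models selected in the project.

```r
set.seed(42)
years <- 1960:2025
# Synthetic annual series with a bump every 12th year (invented, not real TLB data).
cycle_bump <- ifelse((years - 1964) %% 12 == 0, 3, 0)
tlb <- ts(45 - 0.1 * (years - 1960) + cycle_bump +
            rnorm(length(years), sd = 0.8), start = 1960)
train <- window(tlb, end = 2012)
test  <- window(tlb, start = 2013)

# Non-seasonal candidate.
arima_fit <- arima(train, order = c(0, 1, 1), method = "ML")

# Seasonal candidate: D = 1 with a 12-year period, set explicitly
# because the annual series itself has frequency 1.
sarima_fit <- arima(train, order = c(0, 1, 1),
                    seasonal = list(order = c(0, 1, 1), period = 12),
                    method = "ML")

rmse <- function(actual, predicted) {
  sqrt(mean((as.numeric(actual) - as.numeric(predicted))^2))
}

h <- length(test)
comparison <- data.frame(
  model = c("ARIMA(0,1,1)", "SARIMA(0,1,1)(0,1,1)[12]"),
  rmse  = c(rmse(test, predict(arima_fit,  n.ahead = h)$pred),
            rmse(test, predict(sarima_fit, n.ahead = h)$pred))
)
```

The table compares test RMSE only. AIC values from models with different amounts of differencing are computed on different effective samples, so they are not directly comparable between these two fits.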

Zodiac and legacy exploratory scripts:

| File | Purpose |
|---|---|
| `code/zodiacfinding.R` | Explores ARIMA and SARIMA model groups for Zodiac-related 12-year structure. |
| `code/zodiacmodels.R` | Tests Zodiac-enhanced model ideas for TFR. |
| `code/zodiacTLB.R` | Explores Zodiac effects for Total Live Births. |
| `code/zodiacCBR.R` | Explores Zodiac effects for Crude Birth Rate. |
| `code/TLBTFRZodiac.R` | Combined TFR, TLB, and Zodiac exploratory script. |
| `code/ZodiacModelRegressionwithZodiac.R` | Legacy Zodiac regression exploration. |
| `code/TotalFertlityRate1980.R` | Legacy TFR exploration from 1980 onwards. |
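One way a 12-year Zodiac structure can enter a model is as a year-level indicator used as an external regressor. The sketch below is an assumption about how that might look, not a description of the scripts above: the helper `zodiac_animal` is hypothetical, the mapping assigns each calendar year a single animal (the lunar year actually starts in January or February), and the series is synthetic.

```r
zodiac_animals <- c("Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
                    "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig")

# Hypothetical helper: approximate zodiac animal for a calendar year.
zodiac_animal <- function(year) zodiac_animals[(year - 4) %% 12 + 1]

years <- 1960:2025
zodiac <- data.frame(
  year      = years,
  animal    = zodiac_animal(years),
  is_dragon = as.integer(zodiac_animal(years) == "Dragon")
)

# Synthetic series with an invented Dragon-year uplift (not real data).
set.seed(3)
y <- ts(40 - 0.1 * (years - 1960) + 3 * zodiac$is_dragon +
          rnorm(length(years), sd = 0.8), start = 1960)

# Fit on 1960 to 2012 with the indicator as an external regressor,
# then forecast 2013 to 2025 using the known future indicator values.
in_train <- years <= 2012
fit <- arima(window(y, end = 2012), order = c(0, 1, 1),
             xreg = zodiac$is_dragon[in_train], method = "ML")
fc <- predict(fit, n.ahead = sum(!in_train),
              newxreg = zodiac$is_dragon[!in_train])$pred
```

Because the zodiac calendar is known in advance, the indicator is available for every forecast year, which is what makes it usable as a regressor.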

Machine learning and comparison scripts:

| File | Purpose |
|---|---|
| `code/model_ml.R` | Builds comparison models using lag-based inputs and Zodiac information for TFR and TLB. |
| `code/create_comparison_table.R` | Creates comparison tables from saved model and forecast outputs. |
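"Lag-based inputs" can be illustrated with a small feature table in which each row holds the current value, its previous two values, and a Zodiac flag. The sketch below uses a synthetic series and a plain linear model as a stand-in; the number of lags, the Dragon-year flag, and the choice of `lm()` are all assumptions, and `code/model_ml.R` may use different features and learners.

```r
set.seed(7)
years <- 1960:2025
# Synthetic annual series (not real TFR or TLB data).
y <- 5 + cumsum(rnorm(length(years), mean = -0.05, sd = 0.1))

# embed() lines up each value with its lags: columns are y_t, y_{t-1}, y_{t-2}.
n_lags <- 2
lagged <- embed(y, n_lags + 1)
features <- data.frame(
  year = years[(n_lags + 1):length(years)],
  y    = lagged[, 1],
  lag1 = lagged[, 2],
  lag2 = lagged[, 3]
)

# Approximate Dragon-year flag based on the calendar year.
features$is_dragon <- as.integer((features$year - 4) %% 12 == 4)

# Same year boundary as the rest of the project.
train <- features[features$year <= 2012, ]
test  <- features[features$year >  2012, ]

fit  <- lm(y ~ lag1 + lag2 + is_dragon, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
```

The test rows here use observed lag values, so the RMSE measures one-step-ahead accuracy. A multi-year forecast would have to feed predictions back in as lags, which usually gives larger errors.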

| Folder | Purpose |
|---|---|
| `code/archive/` | Stores older experimental scripts that are kept for traceability but are not part of the final workflow. |

The `documentation/` folder explains how to run individual code sections and how the modelling workflow was developed.

| File | Purpose |
|---|---|
| `documentation/dataset_cleanup.md` | Explains the dataset cleaning process. |
| `documentation/ARIMA_Model_Finding.md` | Explains the ARIMA model identification and testing workflow. |
| `documentation/tlb_arima_outputs_guide.md` | Explains Lara's TLB ARIMA and SARIMA files, outputs, and run order. |
| `documentation/zodiacfindings_documentation.md` | Explains the Zodiac analysis and related modelling workflow. |
Read the relevant documentation file before changing model parameters or rerunning a specific section.

Run preprocessing first:

    source("code/preprocessing.R")

TLB ARIMA and SARIMA workflow:

    source("code/tlb_arima_search.R")
    source("code/tlb_arima_model_validation.R")
    source("code/tlb_sarima_search.R")
    source("code/tlb_sarima_model_validation.R")
    source("code/tlb_compare_arima_sarima.R")
    source("code/tlb_sarima_selected_plots.R")

TFR ARIMA workflow:

    source("code/tfr_arima.R")
    source("code/tfr_arima_model_validation.R")
    source("code/tfr_residual_diagnostics.R")
    source("code/tfr-model-comparison.R")

Zodiac analysis:

    source("code/zodiacfinding.R")
    source("code/zodiacmodels.R")
    source("code/zodiacTLB.R")
    source("code/zodiacCBR.R")

Machine learning comparison:

    source("code/model_ml.R")

Output locations:

| Folder or file | Contents |
|---|---|
| `clean_data/` | Cleaned training and testing CSV files. |
| `processed_data/` | Model comparison CSV files and validation summaries. |
| `outputs/ts_plots/` | Time series and differencing plots. |
| `outputs/model_validation/tlb/` | TLB ARIMA forecast and residual plots. |
| `outputs/model_validation/tlb_sarima/` | Corrected TLB SARIMA forecast and residual plots. |
| `outputs/model_validation/tlb_sarima_selected/` | Report-ready plots for selected corrected SARIMA models. |
| `outputs/forecasts/` | Forecast CSV files for tested models. |
| `outputs/model_comparison/` | Model comparison plots and summary files. |
| `outputs/models/` | Saved fitted model objects. |

- Always run `code/preprocessing.R` before running model scripts.
- Do not manually edit files in `clean_data/`, because they are regenerated by preprocessing.
- Use `processed_data/` for CSV summaries used in the report.
- Use `outputs/model_validation/` for forecast and residual diagnostic plots.
- Keep older exploratory scripts in `code/archive/` unless they are needed for final reproduction.
- Add new documentation to `documentation/` when adding a new model workflow.

This repository supports the final ICT Capstone report and handover. The main final workflow is focused on cleaned data generation, ARIMA and SARIMA model validation, Zodiac-related seasonal analysis, machine learning comparison, and report-ready output files.