DST Sector Health Forecaster

MSc BDS — MLOps Exam | Statistics Denmark

An end-to-end MLOps pipeline that fetches live data from the Statistics Denmark (DST) StatBank API and forecasts a Sector Vitality Score (SVS) for each of the 10 Danish industry sectors one quarter ahead. The pipeline can be adjusted to run automatically on a quarterly cron schedule via GitHub Actions and exposes predictions through a FastAPI REST endpoint.

Data source: Statistics Denmark StatBank API (free, no authentication required, CC BY 4.0 licence).

Repository Structure

MLOps-Exam-Assignment/
│
├── .github/
│   └── workflows/
│       └── pipeline.yml          ← GitHub Actions: (https://github.com/Ory999/MLOps-Exam-Assignment/actions/runs/24525069469)
│
├── artifacts/                    ← All pipeline outputs (contents gitignored)
│   ├── raw/                      ← Timestamped raw CSVs from DST API
│   ├── processed/                ← Quarterly panel + feature datasets
│   ├── models/                   ← Trained model .joblib files
│   ├── metrics/                  ← JSON metrics per training run
│   ├── reports/                  ← Monitoring reports + visualisation PNGs
│   └── mlflow/                   ← MLflow experiment tracking database
│
├── notebooks/
│   ├── Workbook_1_data_ingestion.ipynb         ← Explore DST API, fetch raw data
│   ├── Workbook_2_preprocessing_features.ipynb ← Build panel, vitality score, features
│   ├── Workbook_3_model_training.ipynb         ← Train Naive/Ridge/XGBoost, MLflow
│   └── Workbook_4_monitoring.ipynb             ← Drift detection, performance monitoring
│
├── src/
│   ├── __init__.py               ← Makes src/ a Python package
│   ├── dst_client.py             ← DST StatBank API client (KONK4, DEMO14)
│   ├── preprocessing.py          ← Monthly→quarterly aggregation, vitality score
│   ├── features.py               ← Lag/rolling features, time-based train/test split
│   ├── model.py                  ← Naive baseline, Ridge, XGBoost, MLflow logging
│   ├── monitoring.py             ← KS-test drift detection, MAE performance monitoring
│   └── api.py                    ← FastAPI: /predict, /monitoring, /pipeline/run
│
├── .gitignore                    ← Ignores artifact contents, keeps folder structure
├── config.yaml                   ← All pipeline configuration in one place
├── Dockerfile                    ← Python 3.11-slim image for pipeline + API
├── docker-compose.yml            ← Services: pipeline, api, mlflow
├── requirements.txt              ← All Python dependencies with pinned versions
└── run_pipeline.py               ← End-to-end runner: ingest → preprocess → train → monitor

What the Pipeline Does

The pipeline forecasts a Sector Vitality Score (SVS), a normalised [0, 1] index measuring the balance between enterprise births and bankruptcies in each sector. A score near 1 indicates sector expansion; a score near 0 indicates contraction.
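The exact SVS formula lives in src/preprocessing.py and is not reproduced in this README. Purely as an illustration, one plausible way to obtain such an index is to min-max normalise the net balance (births minus bankruptcies) across the panel. The function name `vitality_scores` and the choice of min-max scaling are assumptions, not the repository's implementation.

```python
def vitality_scores(births, bankruptcies):
    """Min-max normalise the net balance (births - bankruptcies) to [0, 1].

    Hypothetical helper for illustration; the real formula may differ.
    """
    if len(births) != len(bankruptcies):
        raise ValueError("births and bankruptcies must have the same length")
    net = [b - k for b, k in zip(births, bankruptcies)]
    lo, hi = min(net), max(net)
    if hi == lo:
        # No variation in the panel: return the midpoint for every row.
        return [0.5] * len(net)
    return [(n - lo) / (hi - lo) for n in net]


# Three observations: strong expansion, moderate expansion, break-even.
scores = vitality_scores(births=[120, 90, 60], bankruptcies=[20, 40, 60])
print(scores)  # [1.0, 0.5, 0.0]
```

Under this kind of scaling, rows with artificially low births would pull the minimum down and compress every other score, which is one reason to keep the panel restricted to periods where both tables have data.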

Five steps run sequentially:

┌────────────────────────────────────────────────────────────┐
│  GitHub Actions cron  OR: python run_pipeline.py           │
└────────────────────────┬───────────────────────────────────┘
                         │
      ┌──────────────────▼─────────────────────┐
      │  1. DATA INGESTION                      │
      │  DST StatBank API (free, no auth)       │
      │  • KONK4: bankruptcies (monthly)        │
      │  • DEMO14: enterprise births (annual)   │
      │  Output: artifacts/raw/*.csv            │
      └──────────────────┬─────────────────────┘
                         │
      ┌──────────────────▼─────────────────────┐
      │  2. PREPROCESSING                       │
      │  • Aggregate monthly → quarterly        │
      │  • Build (sector × quarter) panel       │
      │  • Forward-fill annual DEMO14 data      │
      │  • Filter to 2019–2023 (valid window)   │
      │  • Compute Sector Vitality Score [0,1]  │
      │  Output: artifacts/processed/panel_*   │
      └──────────────────┬─────────────────────┘
                         │
      ┌──────────────────▼─────────────────────┐
      │  3. FEATURE ENGINEERING                 │
      │  • Lag features t-1 … t-4              │
      │  • Rolling mean/std (4Q, 8Q windows)   │
      │  • Net growth rate, momentum            │
      │  • Sector dummies (10), seasonality     │
      │  • Time-based train/test split (4Q)     │
      │  Output: artifacts/processed/features_*│
      └──────────────────┬─────────────────────┘
                         │
      ┌──────────────────▼─────────────────────┐
      │  4. MODEL TRAINING                      │
      │  • Naive baseline (persistence)         │
      │  • Ridge regression (linear baseline)   │
      │  • XGBoost (primary model)              │
      │  • All runs logged to MLflow            │
      │  Output: artifacts/models/model_latest  │
      └──────────────────┬─────────────────────┘
                         │
      ┌──────────────────▼─────────────────────┐
      │  5. MONITORING                          │
      │  • KS-test: data drift per feature      │
      │  • MAE degradation vs training baseline │
      │  • Prediction distribution shift        │
      │  Output: artifacts/reports/monitoring_* │
      └─────────────────────────────────────────┘
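Step 3 in the diagram can be sketched in plain Python to show how lag features, a rolling mean, and a time-based split fit together. This is a simplified sketch for a single sector's score series; the helper names `make_lag_rows` and `time_split` are hypothetical, and src/features.py may be organised differently.

```python
from statistics import mean


def make_lag_rows(series, lags=(1, 2, 3, 4), window=4):
    """Build one feature row per quarter from one sector's score series.

    Each row uses only values strictly before the target quarter,
    so no future information leaks into the features.
    """
    start = max(max(lags), window)
    rows = []
    for t in range(start, len(series)):
        row = {f"vitality_score_lag{k}": series[t - k] for k in lags}
        row["rolling_mean_4q"] = mean(series[t - window:t])
        row["target"] = series[t]
        rows.append(row)
    return rows


def time_split(rows, test_quarters=4):
    """Hold out the last `test_quarters` rows; time series are never shuffled."""
    return rows[:-test_quarters], rows[-test_quarters:]


series = [0.50, 0.52, 0.55, 0.53, 0.58, 0.60, 0.62, 0.59, 0.57, 0.54, 0.51, 0.49]
rows = make_lag_rows(series)
train, test = time_split(rows)
print(len(rows), len(train), len(test))  # 8 4 4
```

The first four quarters are consumed by the longest lag, which is why a short panel loses a noticeable share of its rows before training starts.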

Data Sources

Statistics Denmark (Danmarks Statistik) StatBank API

Base URL: https://api.statbank.dk/v1
No authentication required; CC BY 4.0 licence
Documentation: https://www.dst.dk/en/Statistik/hjaelp-til-statistikbanken/api

Table	Description	Frequency	Coverage
`KONK4`	Bankruptcies by industry (DB07), active companies only (K02)	Monthly	2009–present
`DEMO14`	Enterprise births by industry (DB07 10-grouping)	Annual	2019–2023
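A request for one of these tables might be built as follows. This is a sketch, not the code in src/dst_client.py: it assumes the `/data` endpoint accepts a POST with a JSON body containing `table`, `format`, `lang` and `variables`, as described in the DST documentation linked above. The variable code `Tid` (time) and the helper name `build_data_request` are assumptions; a table's other variables may need explicit selections, so check the table metadata before relying on this shape.

```python
import json
import urllib.request

BASE_URL = "https://api.statbank.dk/v1"


def build_data_request(table, variables, fmt="CSV", lang="en"):
    """Build a POST request for the /data endpoint (assumed request shape)."""
    payload = {
        "table": table,
        "format": fmt,
        "lang": lang,
        "variables": [
            {"code": code, "values": values} for code, values in variables.items()
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/data",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "*" is assumed to select every available period for the time variable.
request = build_data_request("KONK4", {"Tid": ["*"]})

# Uncomment to perform the network call:
# with urllib.request.urlopen(request, timeout=30) as response:
#     csv_text = response.read().decode("utf-8")
```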

Important: DEMO14 has a ~2-year publication lag. Data for 2024+ is not yet published, which is why the pipeline restricts the panel to 2019–2023. Pre-2019 rows have zero enterprise births (DEMO14 does not cover that period), which would distort the vitality score normalisation if included.
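The frequency mismatch between the two tables can be illustrated with a small sketch: monthly bankruptcies are summed into quarters, and each quarter then carries its year's annual births figure. Quarters from years without DEMO14 data are dropped rather than filled with zeros. The helper names are hypothetical, and whether the repository carries the annual value forward unchanged or rescales it per quarter is an assumption here.

```python
from collections import defaultdict


def monthly_to_quarterly(monthly):
    """Sum monthly counts keyed by (year, month) into (year, quarter) totals."""
    quarterly = defaultdict(int)
    for (year, month), count in monthly.items():
        quarter = (month - 1) // 3 + 1
        quarterly[(year, quarter)] += count
    return dict(quarterly)


def spread_annual(annual, quarters):
    """Give each quarter its year's annual value; skip years with no data."""
    return {(y, q): annual[y] for (y, q) in quarters if y in annual}


monthly_bankruptcies = {
    (2018, 12): 7,
    (2019, 1): 10, (2019, 2): 12, (2019, 3): 8,
    (2019, 4): 5,
}
quarterly = monthly_to_quarterly(monthly_bankruptcies)
births = spread_annual({2019: 400}, quarterly.keys())

print(quarterly)  # {(2018, 4): 7, (2019, 1): 30, (2019, 2): 5}
print(births)     # {(2019, 1): 400, (2019, 2): 400}
```

The 2018 quarter has bankruptcies but no births entry, so it falls out of the joined panel instead of appearing as a zero-birth row.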

Model Results (2019–2023, test set = Q4 2022–Q3 2023)

Model	MAE	RMSE	R²	Directional Accuracy
Naive (persistence)	0.0952	0.1354	-9.87	90%
Ridge	0.1123	0.1443	-11.34	87%
XGBoost	0.2473	0.2614	-39.48	40%

The naive model wins on every metric. With only 14 training quarters per sector after lagging, XGBoost cannot generalise: it overfits the 2019–2022 growth phase and fails to extrapolate the 2022–2023 decline. Negative R² is expected at this data volume, and the pipeline is designed to improve automatically as DST publishes new DEMO14 data each year. XGBoost is therefore not yet reliable, but each new release enlarges the training set and should give it a better foundation to learn from. Note also that the 2019–2022 training window coincides with the COVID-19 lockdowns, which heavily affected some sectors.
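To make the negative R² values less surprising, the sketch below scores a persistence forecast on a short made-up series. The numbers are illustrative and are not taken from the pipeline; `regression_metrics` is a hypothetical helper using the standard definitions of MAE, RMSE and R². When the evaluation window has little variance, even modest errors exceed the variance around the mean and push R² below zero.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² compares the model against always predicting the mean of y_true;
    # it is undefined when y_true is constant (ss_tot == 0).
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Persistence forecast: next quarter's score equals the last observed score.
history = [0.60, 0.58, 0.40, 0.42, 0.41]
y_true = history[1:]
y_pred = history[:-1]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Here a single large drop between two quarters dominates the squared error, so R² comes out negative even though the mean absolute error is below 0.06.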

Quickstart

1. Clone and install

git clone https://github.com/Ory999/MLOps-Exam-Assignment.git
cd MLOps-Exam-Assignment
pip install -r requirements.txt

2. Run the full pipeline

python run_pipeline.py

Fetches fresh data from DST, preprocesses, trains all three models, and saves a monitoring report to artifacts/reports/.

3. Explore notebooks step by step

jupyter notebook notebooks/

Open in order: Workbook_1 → Workbook_2 → Workbook_3 → Workbook_4

4. Start the API

uvicorn src.api:app --host 0.0.0.0 --port 8000 --reload

Swagger UI: http://localhost:8000/docs
Health check: http://localhost:8000/health
Monitoring report: http://localhost:8000/monitoring

5. View MLflow experiments

mlflow ui --backend-store-uri ./artifacts/mlflow

Open http://localhost:5000 to compare runs.

Docker

Run pipeline then start services

# Step 1: train the model
docker compose run pipeline

# Step 2: start API + MLflow UI
docker compose up api mlflow

Individual commands

docker build -t dst-sector-health .

# Run pipeline
docker run -v $(pwd)/artifacts:/app/artifacts dst-sector-health python run_pipeline.py

# Run API
docker run -p 8000:8000 -v $(pwd)/artifacts:/app/artifacts dst-sector-health

API Endpoints

Method	Endpoint	Description
GET	`/health`	Service health check
GET	`/model/info`	Model metadata and training metrics
POST	`/predict`	Predict next-quarter vitality score for a sector
GET	`/monitoring`	Latest monitoring report (JSON)
POST	`/pipeline/run`	Trigger full pipeline run in background

Example prediction request

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "sector": 5,
    "vitality_score_lag1": 0.62,
    "vitality_score_lag2": 0.58,
    "vitality_score_lag3": 0.55,
    "vitality_score_lag4": 0.60,
    "bankruptcy_rate_lag1": 0.018,
    "birth_rate_lag1": 0.045,
    "quarter_of_year": 2,
    "year": 2025
  }'

Sector codes (DB07 10-grouping):

Code	Sector
1	Agriculture, forestry and fishing
2	Manufacturing, mining and quarrying
3	Energy, water supply, sewerage and waste
4	Construction
5	Trade and transport
6	Information and communication
7	Financial and insurance activities
8	Real estate
9	Other business services
10	Public administration, education, health and culture
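The same request can be built from Python. The sketch below mirrors the curl example and rejects sector codes outside the table above before anything is sent. The helper name `build_predict_request` is hypothetical, and the response schema is not shown because it is not documented in this README.

```python
import json
import urllib.request

SECTORS = {
    1: "Agriculture, forestry and fishing",
    2: "Manufacturing, mining and quarrying",
    3: "Energy, water supply, sewerage and waste",
    4: "Construction",
    5: "Trade and transport",
    6: "Information and communication",
    7: "Financial and insurance activities",
    8: "Real estate",
    9: "Other business services",
    10: "Public administration, education, health and culture",
}


def build_predict_request(features, base_url="http://localhost:8000"):
    """Validate the sector code and build a POST request for /predict."""
    if features.get("sector") not in SECTORS:
        raise ValueError(f"sector must be one of {sorted(SECTORS)}")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


features = {
    "sector": 5,
    "vitality_score_lag1": 0.62,
    "vitality_score_lag2": 0.58,
    "vitality_score_lag3": 0.55,
    "vitality_score_lag4": 0.60,
    "bankruptcy_rate_lag1": 0.018,
    "birth_rate_lag1": 0.045,
    "quarter_of_year": 2,
    "year": 2025,
}
request = build_predict_request(features)

# With the API running locally, uncomment to send the request:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```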

GitHub Actions — Automated Pipeline

The pipeline runs automatically via .github/workflows/pipeline.yml:

Trigger	When
Scheduled cron	10th of Feb, May, Aug, Nov at 09:00 UTC (aligned to DST release calendar)
Push to `main`	When `src/**`, `config.yaml`, or `run_pipeline.py` changes
Manual	Via the "Run workflow" button in the Actions tab

Each run fetches fresh DST data, retrains the model, and uploads artifacts (model, metrics, monitoring report, charts) as downloadable bundles retained for 90 days.

Monitoring Strategy

The pipeline monitors two things after every run:

Data drift (KS-test): Each feature is compared between the training distribution and the new incoming data using the Kolmogorov-Smirnov test. If more than 30% of features have a p-value below 0.15, an overall drift flag is raised.

Performance degradation: The current MAE is compared to the MAE at training time. If the current MAE exceeds the baseline by more than 20%, a retrain flag is raised.

Results are saved to artifacts/reports/monitoring_latest.json and served via the /monitoring endpoint.
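The two decision rules above can be written down compactly. This sketch takes per-feature p-values as given (they could come from a two-sample KS test such as `scipy.stats.ks_2samp`; whether the repository uses that function is an assumption) and applies the thresholds described in this section. The function names and the shape of the returned report are hypothetical and need not match src/monitoring.py.

```python
def drift_flag(p_values, p_threshold=0.15, share_threshold=0.30):
    """Flag overall drift when too large a share of features has a low p-value."""
    drifted = [name for name, p in p_values.items() if p < p_threshold]
    share = len(drifted) / len(p_values)
    return {
        "drifted_features": drifted,
        "drift_share": share,
        "drift_detected": share > share_threshold,
    }


def retrain_flag(current_mae, baseline_mae, performance_threshold=0.20):
    """Flag retraining when MAE exceeds the training baseline by more than 20%."""
    return current_mae > baseline_mae * (1 + performance_threshold)


# Illustrative p-values, not real pipeline output.
p_values = {
    "vitality_score_lag1": 0.02,
    "vitality_score_lag2": 0.40,
    "bankruptcy_rate_lag1": 0.10,
    "birth_rate_lag1": 0.75,
}
report = drift_flag(p_values)
print(report["drift_share"], report["drift_detected"])  # 0.5 True
print(retrain_flag(current_mae=0.13, baseline_mae=0.10))  # True
print(retrain_flag(current_mae=0.11, baseline_mae=0.10))  # False
```

A p-value cut-off of 0.15 is more sensitive than the conventional 0.05, so individual features are flagged more readily; the 30% share rule then guards against raising the overall flag on one or two noisy features.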

Artifact Storage

Artifact	Location	Gitignored?
Raw CSVs from DST	`artifacts/raw/`	Yes (contents)
Processed panel	`artifacts/processed/`	Yes (contents)
Trained models	`artifacts/models/`	Yes (contents)
Metrics JSON	`artifacts/metrics/`	Yes (contents)
Monitoring reports + charts	`artifacts/reports/`	Yes (contents)
MLflow tracking database	`artifacts/mlflow/`	Yes (entirely)
Folder structure	`artifacts/*/.gitkeep`	No — tracked to preserve structure on clone

Artifact contents are not stored in the repository. On a GitHub Actions run they are uploaded as downloadable bundles; locally they are written to disk and read by the API. The most recent run is available under Actions → "Fix numpy bool_ JSON serialization in monitoring report #7" (https://github.com/Ory999/MLOps-Exam-Assignment/actions/runs/24525069469). Its outputs can be downloaded there as the artifact "pipeline-artifacts-7".

Configuration

All pipeline parameters are in config.yaml:

model:
  test_quarters: 4        # hold out last 4 quarters (1 year) for evaluation
  lag_periods: [1,2,3,4]  # autoregressive lag features
  rolling_windows: [4,8]  # rolling mean/std window sizes

monitoring:
  drift_threshold: 0.15   # KS p-value below which a feature is flagged as drifted
  performance_threshold: 0.20  # MAE increase above which retraining is triggered
