WQSurrogateModels is a FastAPI backend and reproducibility repository for WQI5-based current-state water quality assessment.
Scope: this repository performs WQI5-based current-state water quality assessment. It does not perform temporal forecasting because the committed dataset does not contain timestamps.
It provides:
- a `direct_wqi5` baseline
- surrogate regression models
- `/api/v2/*` endpoints for WaterMirror and other HTTP clients
- reproducibility scripts and experiment documentation

This project is part of a two-repository system:

- WaterMirror: cross-platform mobile frontend for data entry, CSV upload, and result visualization
- WQSurrogateModels: FastAPI backend and model/reproducibility repository for WQI5-based current-state water quality assessment

WaterMirror depends on the API contract exposed by this repository. WQSurrogateModels can also be used independently through curl, Postman, or custom scripts.
- serves a FastAPI backend for WQI5 assessment
- supports a `direct_wqi5` formula baseline
- supports surrogate regression models: `lr`, `mpr`, `svm`, `rf`, `xgboost`, `lightgbm`
- provides reproducibility scripts and experiment configuration
- keeps compatibility with legacy endpoints while treating `/api/v2/*` as the primary contract

flowchart LR
A[WaterMirror user input or CSV upload] --> B[WaterMirror frontend]
B --> C[POST /api/v2/assessment or /api/v2/assessment/csv/summary]
C --> D[WQSurrogateModels FastAPI service]
D --> E[Input validation and assessment warnings]
E --> F{Model selection}
F --> G[direct_wqi5 baseline]
F --> H[Surrogate regressors: lr mpr svm rf xgboost lightgbm]
G --> I[WQI5 score category rating range]
H --> I
I --> J[Result payload]
J --> B
Copy `.env.example` to `.env` and adjust values if needed.

    cp .env.example .env

Key variables:

    MODEL_DIR=models
    DEFAULT_MODEL=direct_wqi5
    API_HOST=0.0.0.0
    API_PORT=8001
    AUTO_PORT=false

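As an illustration of how these variables might be consumed, here is a minimal sketch that reads them from the environment with the defaults listed above. The `Settings` class and `load_settings` function are hypothetical names, not taken from this repository, and treating `1` and `yes` as true for `AUTO_PORT` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env.example."""
    model_dir: str
    default_model: str
    api_host: str
    api_port: int
    auto_port: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    # Assumption: "1", "true", and "yes" (case-insensitive) all enable AUTO_PORT.
    auto_port = env.get("AUTO_PORT", "false").strip().lower() in {"1", "true", "yes"}
    return Settings(
        model_dir=env.get("MODEL_DIR", "models"),
        default_model=env.get("DEFAULT_MODEL", "direct_wqi5"),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8001")),
        auto_port=auto_port,
    )
```

Passing the mapping explicitly (for example `load_settings({"API_PORT": "9000"})`) keeps the parsing logic easy to exercise without touching the real environment.
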
Install the package:

    pip install .

For development and tests:

    pip install -e ".[dev]"

The committed scikit-learn surrogate artifacts in `models/` were serialized with scikit-learn 1.5.2. Use that same version when loading them, or retrain and re-export the artifacts in your target version.

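One way to catch a version mismatch before loading an artifact is a small guard like the sketch below. The function names are hypothetical and the check is not part of this repository; it only compares the installed scikit-learn version string against the 1.5.2 stated above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Version the committed artifacts were serialized with, per this README.
ARTIFACT_SKLEARN_VERSION = "1.5.2"


def installed_sklearn_version() -> Optional[str]:
    """Return the installed scikit-learn version, or None if it is missing."""
    try:
        return version("scikit-learn")
    except PackageNotFoundError:
        return None


def check_artifact_compat(installed: Optional[str]) -> None:
    """Raise if the installed version differs from the artifact version."""
    if installed != ARTIFACT_SKLEARN_VERSION:
        raise RuntimeError(
            f"scikit-learn {installed!r} found, but the artifacts in models/ "
            f"were serialized with {ARTIFACT_SKLEARN_VERSION}; install that "
            "version or retrain and re-export the artifacts."
        )
```

Calling `check_artifact_compat(installed_sklearn_version())` once at startup would surface the problem early instead of at the first prediction.
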
To also enable the full set of surrogate models (xgboost, lightgbm):

    pip install -e ".[dev,models]"

Start the server:

    python main.py

If `API_PORT` is already occupied, the default behavior is to fail fast with a clear error message. For local development, you can opt in to automatic fallback ports:

    AUTO_PORT=true

With `AUTO_PORT=true`, the server tries `API_PORT` first and then scans upward (8002, 8003, ...) until it finds a free port.

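The fallback behavior described above can be sketched roughly as follows. This is not the repository's implementation: `find_port` and its `max_tries` limit are hypothetical, and probing with a throwaway socket leaves a small window in which another process could claim the port before the server binds it.

```python
import socket


def find_port(host: str, start_port: int, auto_port: bool = False,
              max_tries: int = 20) -> int:
    """Return a bindable port, scanning upward only when auto_port is set."""
    # Without auto_port only start_port is tried, which gives fail-fast behavior.
    stop = start_port + (max_tries if auto_port else 1)
    for port in range(start_port, min(stop, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is occupied or otherwise unavailable
            return port
    raise RuntimeError(
        f"no free port found starting at {start_port}; "
        "free the port or set AUTO_PORT=true"
    )
```

For example, `find_port("127.0.0.1", 8001, auto_port=True)` returns 8001 if it is free and otherwise the next bindable port above it.
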
Primary endpoints live under /api/v2/*.
`POST /api/v2/assessment`

    { "DO": 7.2, "BOD": 2.1, "NH3N": 0.3, "EC": 450, "SS": 12, "model_type": "lightgbm" }

Legacy compatibility endpoints such as `POST /predict`, `POST /score/total/`, and `GET /status` are retained but deprecated.

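For clients that prefer a script over curl or Postman, the request above could be issued with the standard library alone, as in this sketch. The helper names and the `http://localhost:8001` base URL are assumptions (the port matches the default `API_PORT`), and the response is printed as raw JSON because its schema is not described here.

```python
import json
import urllib.request


def build_assessment_request(base_url: str, sample: dict,
                             model_type: str = "direct_wqi5") -> urllib.request.Request:
    """Build a POST request for /api/v2/assessment from one sample."""
    payload = dict(sample, model_type=model_type)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2/assessment",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def assess(base_url: str, sample: dict, model_type: str = "direct_wqi5") -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_assessment_request(base_url, sample, model_type)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Example call (requires a running server):
# sample = {"DO": 7.2, "BOD": 2.1, "NH3N": 0.3, "EC": 450, "SS": 12}
# print(assess("http://localhost:8001", sample, model_type="lightgbm"))
```
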
- WaterMirror Integration
- API Reference
- Full-Stack Local Run
- WQI5 Formula
- Metrics
- Data Preparation
- Original Benchmark Protocol
- Revised Experiment Protocol
- Statistical Analysis
- Statistics Workspace Notes
- Model Hyperparameters
- Model Card
- Limitations
Run:

    pip install -e ".[dev]"
    python scripts/reproduce_results.py --config configs/experiment_config.yaml --output-dir results_verification

If you use the local WQI conda environment and want to run the full experiment (all models, including xgboost and lightgbm):

    conda activate WQI
    pip install -e ".[models]"
    python scripts/reproduce_results.py --config configs/experiment_config.yaml --output-dir results_verification

To protect archived manuscript outputs, the script refuses to overwrite an existing results directory unless `--overwrite` is passed explicitly.

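A guard of this kind can be sketched in a few lines. The function below is hypothetical and may differ from what `scripts/reproduce_results.py` actually does; in particular, whether `--overwrite` clears old files or only permits reuse of the directory is not stated here, so this sketch simply permits reuse.

```python
from pathlib import Path


def prepare_output_dir(output_dir: str, overwrite: bool = False) -> Path:
    """Create the results directory, refusing to reuse an existing one by default."""
    path = Path(output_dir)
    if path.exists() and not overwrite:
        raise FileExistsError(
            f"{path} already exists; pass --overwrite to write into it anyway"
        )
    # Assumption: with overwrite=True the directory is reused, not emptied first.
    path.mkdir(parents=True, exist_ok=True)
    return path
```
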
The table below describes the revised reproducibility workflow. Archived exploratory scripts may use GridSearchCV and library defaults; see docs/original-benchmark-protocol.md.
| Model | Library | Preprocessing | Key Hyperparameters |
|---|---|---|---|
| `direct_wqi5` | formula baseline | none | direct WQI5 equation |
| `lr` | scikit-learn | mean imputation + standard scaling | default `LinearRegression()` |
| `mpr` | scikit-learn | mean imputation + polynomial features + standard scaling | `degree=2`, `include_bias=False` |
| `svm` | scikit-learn | mean imputation + standard scaling | `kernel=rbf`, `C=10.0`, `epsilon=0.1` |
| `rf` | scikit-learn | mean imputation | `n_estimators=300`, `random_state=0`, `n_jobs=-1` |
| `xgboost` | xgboost | mean imputation | `n_estimators=300`, `max_depth=6`, `learning_rate=0.05`, `subsample=0.9`, `colsample_bytree=0.9`, `random_state=0` |
| `lightgbm` | lightgbm | mean imputation | `n_estimators=300`, `learning_rate=0.05`, `random_state=0` |

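
To make the table concrete, the sketch below assembles each surrogate as a scikit-learn `Pipeline` using the listed preprocessing and hyperparameters. It is an illustration, not the repository's training code: `HYPERPARAMS` and `build_pipeline` are hypothetical names, the step order is inferred from the Preprocessing column, and the final estimator for `mpr` is assumed to be `LinearRegression`. Imports are deferred so that xgboost and lightgbm are only needed when those models are requested.

```python
from typing import Any

# Hyperparameters copied from the table above (lr uses library defaults).
HYPERPARAMS: dict[str, dict[str, Any]] = {
    "lr": {},
    "mpr": {"degree": 2, "include_bias": False},
    "svm": {"kernel": "rbf", "C": 10.0, "epsilon": 0.1},
    "rf": {"n_estimators": 300, "random_state": 0, "n_jobs": -1},
    "xgboost": {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.05,
                "subsample": 0.9, "colsample_bytree": 0.9, "random_state": 0},
    "lightgbm": {"n_estimators": 300, "learning_rate": 0.05, "random_state": 0},
}


def build_pipeline(name: str):
    """Return an unfitted Pipeline for the named surrogate model."""
    if name not in HYPERPARAMS:
        raise ValueError(f"unknown surrogate model: {name!r}")
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline

    params = HYPERPARAMS[name]
    steps = [("impute", SimpleImputer(strategy="mean"))]
    if name in {"lr", "mpr", "svm"}:
        from sklearn.preprocessing import PolynomialFeatures, StandardScaler
        if name == "mpr":
            # For mpr the table's parameters configure the polynomial expansion.
            steps.append(("poly", PolynomialFeatures(**params)))
        steps.append(("scale", StandardScaler()))
    if name == "svm":
        from sklearn.svm import SVR
        model = SVR(**params)
    elif name == "rf":
        from sklearn.ensemble import RandomForestRegressor
        model = RandomForestRegressor(**params)
    elif name == "xgboost":
        from xgboost import XGBRegressor
        model = XGBRegressor(**params)
    elif name == "lightgbm":
        from lightgbm import LGBMRegressor
        model = LGBMRegressor(**params)
    else:
        # lr and mpr; assumption: mpr ends in a plain linear regression.
        from sklearn.linear_model import LinearRegression
        model = LinearRegression()
    steps.append(("model", model))
    return Pipeline(steps)
```
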
Repeated validation uses stratified random splits over WQI5 categories with seeds 0, 1, 2, 3, 4.
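
A stratified random split keeps the proportion of each WQI5 category roughly equal in the train and test sets. The standard-library sketch below illustrates the idea for the five seeds; the function name and the 20% test fraction are assumptions, since the split ratio is not stated here, and the actual scripts may rely on a library splitter instead.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence

SEEDS = [0, 1, 2, 3, 4]  # seeds listed above


def stratified_split(categories: Sequence[Hashable], test_fraction: float = 0.2,
                     seed: int = 0) -> tuple[list[int], list[int]]:
    """Split row indices into train/test, sampling within each category."""
    rng = random.Random(seed)
    by_category: dict[Hashable, list[int]] = defaultdict(list)
    for index, category in enumerate(categories):
        by_category[category].append(index)

    train, test = [], []
    # Sort category keys so the result depends only on the seed and the data.
    for category in sorted(by_category, key=str):
        indices = by_category[category]
        rng.shuffle(indices)
        # At least one test row per category; tiny categories may leave no train rows.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def repeated_splits(categories: Sequence[Hashable], test_fraction: float = 0.2):
    """Yield (seed, train_indices, test_indices) for each seed in SEEDS."""
    for seed in SEEDS:
        train, test = stratified_split(categories, test_fraction, seed)
        yield seed, train, test
```
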

- `data/`: processed datasets and subsets
- `models/`: persisted surrogate model artifacts
- `src/`: API and reusable backend logic
- `scripts/`: reproducibility runners
- `configs/`: experiment settings
- `tests/`: pytest suite

Apache License 2.0. See LICENSE.