A fintech API that predicts loan default probability using alternative data, LightGBM, and FastAPI.

In emerging markets, many creditworthy individuals are "unbanked" or lack a traditional credit history. Traditional logistic regression models reject these applicants, causing:
- Lost Revenue: Good customers are turned away.
- Hidden Risk: Traditional metrics miss behavioral red flags.

The Goal: Build a machine learning engine that uses alternative data (telco usage, family status, external sources) to score applicants more accurately and serve decisions via a real-time REST API.

The pipeline consists of three stages:
- ETL & Preprocessing: Handling outliers (e.g., the "365243 days employed" bug, a placeholder value equal to roughly 1,000 years of employment) and engineering financial ratios (Debt-to-Income, Annuity-to-Credit).
- Model Training: A LightGBM Classifier optimized for imbalanced data (8% default rate) using weighted loss functions.
- Deployment: A FastAPI microservice that accepts JSON payloads and returns a Risk Score (0-1) and a Credit Score (300-850).
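
The ETL stage could look roughly like the sketch below. It is a minimal illustration, not the contents of `risk_preprocessing.py`: the column names (`DAYS_EMPLOYED`, `AMT_INCOME_TOTAL`, `AMT_CREDIT`, `AMT_ANNUITY`), the `DAYS_EMPLOYED_ANOM` flag, and the exact ratio definitions are assumptions.

```python
import numpy as np
import pandas as pd

# Placeholder that appears in the days-employed column instead of a real value.
EMPLOYED_PLACEHOLDER = 365243


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Blank out the employment placeholder and add two ratio features."""
    out = df.copy()

    # Keep a flag so a model can still learn from "the value was a placeholder".
    is_placeholder = out["DAYS_EMPLOYED"] == EMPLOYED_PLACEHOLDER
    out["DAYS_EMPLOYED_ANOM"] = is_placeholder
    out["DAYS_EMPLOYED"] = out["DAYS_EMPLOYED"].astype(float).mask(is_placeholder)

    # Turn zero denominators into NaN so the ratios never divide by zero.
    income = out["AMT_INCOME_TOTAL"].mask(out["AMT_INCOME_TOTAL"] == 0)
    credit = out["AMT_CREDIT"].mask(out["AMT_CREDIT"] == 0)

    # Assumed definitions: loan amount over income, yearly payment over loan amount.
    out["DEBT_TO_INCOME"] = out["AMT_CREDIT"] / income
    out["ANNUITY_TO_CREDIT"] = out["AMT_ANNUITY"] / credit
    return out


# Two made-up applicants; the second carries the placeholder value.
sample = pd.DataFrame(
    {
        "DAYS_EMPLOYED": [-1200, 365243],
        "AMT_INCOME_TOTAL": [90000.0, 45000.0],
        "AMT_CREDIT": [270000.0, 135000.0],
        "AMT_ANNUITY": [13500.0, 6750.0],
    }
)
features = preprocess(sample)
print(features[["DAYS_EMPLOYED", "DEBT_TO_INCOME", "ANNUITY_TO_CREDIT"]])
```

Replacing the placeholder with a missing value, rather than dropping the row, keeps the applicant in the training set; LightGBM handles missing values natively.
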
| Metric | Score | Context |
|---|---|---|
| ROC-AUC | 0.767 | Above the 0.70 industry baseline. |
| Recall (Defaulters) | 62% | Captures the majority of bad loans to protect capital. |
| Inference Time | < 50 ms | Suitable for real-time mobile app integration. |
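
For readers less familiar with the two model metrics in the table, the sketch below computes them from scratch on made-up data. The labels, scores, and the 0.5 decision threshold are illustrative assumptions and are unrelated to the reported results; in practice `sklearn.metrics.roc_auc_score` and `sklearn.metrics.recall_score` compute the same quantities.

```python
def recall(labels, scores, threshold=0.5):
    """Share of actual defaulters (label 1) that are flagged at the threshold."""
    flagged = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    defaulters = sum(1 for y in labels if y == 1)
    return flagged / defaulters


def roc_auc(labels, scores):
    """Probability that a random defaulter scores above a random non-defaulter."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]

print("recall:", recall(y_true, y_score))
print("roc_auc:", roc_auc(y_true, y_score))
```

ROC-AUC is independent of the threshold, while recall changes as the threshold moves, so the two figures answer different questions about the same model.
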
- Machine Learning: LightGBM, Scikit-Learn
- Explainability: SHAP (Shapley Additive exPlanations)
- API Framework: FastAPI, Uvicorn, Pydantic
- Data Processing: Pandas, NumPy

Install the dependencies:

    pip install lightgbm fastapi uvicorn shap pandas scikit-learn

Generate the features and train the model. The training script handles the class imbalance automatically.

    python risk_preprocessing.py
    python train_risk_model.py

Output: Saves credit_risk_model.pkl and generates risk_drivers.png (SHAP).
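
One common way to handle an 8% default rate is to weight the minority class by the ratio of negatives to positives. The sketch below shows that arithmetic and its effect on a weighted log loss. It is an assumption about how such a script might work, not the contents of `train_risk_model.py`; `scale_pos_weight` and `objective` are real LightGBM parameters, while the helper functions and sample numbers are hypothetical.

```python
import math


def positive_class_weight(labels):
    """Ratio of non-defaulters to defaulters, used to up-weight the rare class."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    return (len(labels) - n_pos) / n_pos


def weighted_log_loss(labels, probs, pos_weight, eps=1e-12):
    """Log loss where every defaulter counts pos_weight times."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# 8 defaulters out of 100 applicants, mirroring the 8% default rate.
labels = [1] * 8 + [0] * 92
weight = positive_class_weight(labels)
print("scale_pos_weight:", weight)

# Parameters that could be passed to lightgbm.LGBMClassifier(**lgbm_params).
lgbm_params = {"objective": "binary", "scale_pos_weight": weight}

# Same pair of applicants, two different mistakes.
missed_defaulter = weighted_log_loss([1, 0], [0.1, 0.1], weight)
false_alarm = weighted_log_loss([1, 0], [0.9, 0.9], weight)
print(missed_defaulter, false_alarm)
```

With a weight of 1 the two mistakes cost the same; with the class weight applied, missing a defaulter is penalized far more, which pushes the model toward higher recall on defaulters.
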

Launch the REST server locally.

    python risk_api.py

The server runs at http://localhost:8000
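
Behind the `/predict` endpoint, the model's default probability has to be turned into the fields of the JSON response. A minimal sketch of that step follows. The linear mapping onto the 300-850 range, the 0.5 approval threshold, the `build_decision` name, and the approval message are all assumptions; they were chosen because they reproduce the sample response in this README (a risk score of 0.5012 maps to a credit score of 574), not because the source confirms them.

```python
def build_decision(risk_score: float, threshold: float = 0.5) -> dict:
    """Turn a default probability into the fields of the API response."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")

    # Assumed linear mapping: risk 0.0 -> 850 (best), risk 1.0 -> 300 (worst).
    credit_score = round(850 - risk_score * 550)

    # Assumed rule: approve only when the risk is below the threshold.
    approved = risk_score < threshold
    message = "Low Default Risk" if approved else "High Default Risk Detected"

    return {
        "approved": approved,
        "risk_score": round(risk_score, 4),
        "credit_score": credit_score,
        "message": message,
    }


print(build_decision(0.5012))
```

In a FastAPI service this function would typically be called inside the route handler after the model has produced its probability for the validated payload.
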

Send a sample applicant payload to the endpoint.

    curl -X POST "http://localhost:8000/predict" \
      -H "Content-Type: application/json" \
      -d @applicant_payload.json

Response:

    {
      "approved": false,
      "risk_score": 0.5012,
      "credit_score": 574,
      "message": "High Default Risk Detected"
    }

Using SHAP values, we identified the top drivers of default risk:
- EXT_SOURCE_2 / EXT_SOURCE_3: Normalized credit scores from external sources.
- DAYS_BIRTH: Younger applicants showed statistically higher default rates.
- CREDIT_TERM: Longer loan terms correlated with higher risk.
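
To make the idea behind these attributions concrete, the sketch below computes exact Shapley values for a tiny hand-written scoring function by averaging each feature's marginal contribution over all feature orderings. It does not use the `shap` library or the trained model; `toy_risk`, its coefficients, the `AGE_YEARS` feature, and the applicant and baseline values are hypothetical and do not reproduce the findings above.

```python
from itertools import permutations


def toy_risk(features):
    """Hypothetical linear risk score standing in for a real model."""
    return (
        0.5
        - 0.4 * features["EXT_SOURCE_2"]
        - 0.005 * features["AGE_YEARS"]
        + 0.01 * features["CREDIT_TERM"]
    )


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)  # start from the reference applicant
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to the real value
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


applicant = {"EXT_SOURCE_2": 0.2, "AGE_YEARS": 24, "CREDIT_TERM": 5}
reference = {"EXT_SOURCE_2": 0.5, "AGE_YEARS": 40, "CREDIT_TERM": 3}

contributions = shapley_values(toy_risk, applicant, reference)
for name, value in contributions.items():
    print(f"{name}: {value:+.3f}")
```

The contributions always sum to the gap between the applicant's score and the reference score. Enumerating orderings only scales to a handful of features, which is why tree-specific algorithms are used for real gradient-boosted models.
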
- Dockerize: Containerize the API for cloud deployment (AWS ECS).
- Monitoring: Add Prometheus to track "Data Drift" (e.g., if applicant income levels change over time).
- A/B Testing: Deploy a "Challenger" model (XGBoost) to run alongside the "Champion" (LightGBM).
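
For the monitoring item, one statistic that could be exported as a metric is the Population Stability Index (PSI), which compares how a feature is distributed at training time and in live traffic. The sketch below is an assumption about how drift might be measured, not a planned design from this project; the `psi` helper, the bin edges, and the income figures are hypothetical.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges."""

    def bin_shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    exp_shares = bin_shares(expected)
    act_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_shares, act_shares))


# Made-up applicant incomes (in thousands) at training time and in production.
train_income = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
live_income = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
income_edges = [30, 45, 60]

print("no drift:", psi(train_income, train_income, income_edges))
print("shifted:", psi(train_income, live_income, income_edges))
```

A value of zero means the two samples fill the bins identically, and larger values indicate a larger shift. A scheduled job could compute this per feature and publish the result as a gauge for Prometheus to scrape.
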
