Machine Learning–based model for risk, propensity, and credit-style scoring. Designed for real-world usage with explainability, stability, and production readiness in mind.
A Behavioral Score (B-Score) is a statistical model used to assess the credit risk of existing customers based on their historical behavior with a financial institution. Unlike an Application Score (A-Score), which uses "static" data from the time of enrollment, the B-Score is dynamic, evolving as the customer interacts with the product.
This project builds a B-Scoring model using supervised ML to predict a target score (e.g. default risk or probability of default). The workflow follows industry-standard model development practices.
ml_scoring_model/
├── models/
│ ├── model.cbm # Trained model (.cbm)
│ └── model_study.pkl # Study parameters (.pkl)
├── notebooks/
│ ├── 01_factor_creation.ipynb
│ ├── 02_data_sampling.ipynb
│ ├── 03_modeling.ipynb
│ └── 04_score_evaluation.ipynb
├── src/
│ ├── create_factor.py
│ ├── modified_sampling.py
│ ├── features_prep.py
│ ├── features_selection.py
│ ├── mixed_matrix.py
│ ├── cluster_analysis.py
│ ├── model_builder.py
│ ├── score_construct.py
│ └── back_testing.py
├── data/
│ ├── processed/
│ │ ├── behaviour_factors.parquet # Not tracked by git
│ │ ├── train_data.parquet # Not tracked by git
│ │ ├── test_data.parquet # Not tracked by git
│ │ └── usedcar_transaction_score.parquet # Not tracked by git
│ └── raw/
│   ├── usedcar_transaction.parquet # Not tracked by git
│   └── longlistFactor.csv
├── requirements.txt
└── README.md
Raw transaction data is often too granular and noisy for machine learning models. Factor Engineering (or Feature Engineering) is the process of transforming raw logs into meaningful predictors. In this project, factors are generally categorized into the following aspects:
| No. | Aspect | Description |
|---|---|---|
| 1 | Account balance | The total amount of money remaining in a customer’s account (or the total debt owed on a credit line) at a specific point in time. |
| 2 | Due amount | The portion of the total balance that must be paid within the current billing cycle, including the principal, interest, and fees. |
| 3 | Repayment | The act of a borrower paying back the principal and interest on a loan or credit facility. |
| 4 | Delinquency status | A snapshot of how many days a payment is overdue. It is typically measured in Days Past Due (DPD), categorized into buckets. |
The primary reason for creating behavioral factors is to transform massive, noisy transaction data into dynamic insights that reflect a customer's current financial health. Unlike static application data, behavioral factors capture trends such as shifting spending patterns, declining repayment habits, or increasing credit reliance, allowing the model to detect "early warning signs" months before an actual default occurs.
Examples of behavioral factors are listed below:
| No. | Factor | Description |
|---|---|---|
| 1 | avg_bal_3 | The average account balance over the last 3 months. |
| 2 | max_due_3_to_fin | The ratio of the maximum due amount over the last 3 months to the initial financed amount. |
| 3 | n_fully_pay_3 | The number of months (times) in which full payment was made over the last 3 months. |
| 4 | max_del_3 | The maximum delinquency status over the last 3 months. |
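Factors like these can be derived with rolling windows over monthly customer snapshots. A minimal sketch with pandas, assuming a hypothetical snapshot table with `customer_id`, `month`, `balance`, and `delinquency` columns (the project's actual schema lives in `create_factor.py`):

```python
import pandas as pd

# Monthly snapshots per customer (hypothetical toy data).
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "month":       [1, 2, 3, 4, 1, 2, 3, 4],
    "balance":     [100.0, 120.0, 90.0, 110.0, 50.0, 55.0, 60.0, 40.0],
    "delinquency": [0, 0, 1, 0, 0, 2, 3, 1],
})
df = df.sort_values(["customer_id", "month"])

g = df.groupby("customer_id")
# avg_bal_3: rolling 3-month mean of the account balance.
df["avg_bal_3"] = g["balance"].transform(lambda s: s.rolling(3, min_periods=1).mean())
# max_del_3: rolling 3-month maximum delinquency status.
df["max_del_3"] = g["delinquency"].transform(lambda s: s.rolling(3, min_periods=1).max())
```

The same pattern extends to due amounts and repayment flags by swapping the aggregation function inside the rolling window.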
Unlike a standard train/test split, even one stratified on the default rate, behavioral data captures granular transaction-level patterns over time. A traditional random split at the record level can lead to data leakage, where a single customer's history is fragmented across both training and testing sets, resulting in an overly optimistic model.
To mitigate this, a modified train/test split with customer-level partitioning is implemented, ensuring that all records belonging to a specific customer are confined to a single dataset. Furthermore, the split is balanced to minimize the variance in default rates both across the global population and on a month-by-month basis, keeping the model's stability and performance consistent over time.
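The core of customer-level partitioning can be sketched with scikit-learn's `GroupShuffleSplit`, which guarantees that each customer lands on exactly one side of the split. (The project's `modified_sampling.py` additionally balances monthly default rates, which this sketch does not attempt.)

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
n = 200
customer_id = rng.integers(0, 40, size=n)   # hypothetical customer ids
y = rng.integers(0, 2, size=n)              # default flag
X = rng.normal(size=(n, 3))                 # dummy features

# Split on customer groups: all records of a customer go to one side only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=customer_id))

# No customer appears in both sets.
assert set(customer_id[train_idx]).isdisjoint(set(customer_id[test_idx]))
```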
The process transforms categorical labels into numerical values using K-Fold Target Encoding. Instead of just calculating a simple average of the target for each category, it uses cross-validation to ensure that the value assigned to a row is calculated from other data "folds." This prevents the model from "cheating" (data leakage) and reduces overfitting. It also uses "smoothing" to balance category means with the global average and smartly fills any missing or new categories with the overall mean.
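A minimal sketch of the idea, using a hypothetical helper name `kfold_target_encode` (the project's implementation lives in `features_prep.py` and may differ in details such as the smoothing weight):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target encoding, smoothed toward the global mean."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    global_mean = y.mean()
    # Unseen or missing categories fall back to the global mean.
    encoded = pd.Series(global_mean, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        # Category statistics are computed only on the *other* folds.
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["mean", "count"])
        smooth = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
            stats["count"] + smoothing)
        encoded.iloc[enc_idx] = (
            cat.iloc[enc_idx].map(smooth).fillna(global_mean).values)
    return encoded
```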
The process then cleans up numerical data by filling in missing values using MICE (Multiple Imputation by Chained Equations). Rather than just plugging in a static mean or median, it treats every missing value as a target to be predicted by a Bayesian Ridge model based on other available features. It also automatically converts infinite values to nulls so they can be imputed properly, ensuring your final dataset is complete, statistically sound, and ready for feature selection.
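In scikit-learn this corresponds to `IterativeImputer` with a `BayesianRidge` estimator; a sketch on synthetic data, including the infinity-to-null conversion described above:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # a correlated column
X[rng.random((200, 4)) < 0.1] = np.nan                    # knock out ~10% of cells
X[0, 2] = np.inf                                          # a stray infinite value

# Infinities cannot be imputed directly, so convert them to NaN first.
X = np.where(np.isfinite(X), X, np.nan)

# Each missing value is predicted from the other columns, MICE-style.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_complete = imputer.fit_transform(X)
```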
The automated feature selection is performed by creating a competition between the real data and randomized noise. First, it generates shadow features by shuffling the values of the original data to destroy any actual relationship with the target. It then trains a LightGBM model on the combined real and shadow features, measuring their gain importance. The function calculates a selection threshold based on the average importance of these shadow features (scaled by a threshold-adjustment parameter). Finally, it filters out any original features that did not perform significantly better than the randomized shadows.
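The shadow-feature idea can be sketched as follows. The project uses LightGBM gain importance; here a scikit-learn `RandomForestClassifier` with impurity-based importances stands in, and the 1.5 scaling factor is an assumed value for the threshold-adjustment parameter:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Two informative features and two pure-noise features (synthetic data).
X = rng.normal(size=(n, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
names = ["signal_1", "signal_2", "noise_1", "noise_2"]

# Shadow features: column-wise shuffled copies with no link to the target.
shadows = np.apply_along_axis(rng.permutation, 0, X)
X_all = np.hstack([X, shadows])

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y)
real_imp = model.feature_importances_[:4]
shadow_imp = model.feature_importances_[4:]

# Keep only features that clearly beat the average shadow importance.
threshold = shadow_imp.mean() * 1.5   # assumed threshold-adjustment value
selected = [f for f, imp in zip(names, real_imp) if imp > threshold]
```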
This step is an advanced feature-selection pipeline designed to eliminate data redundancy using a combination of statistical clustering and machine learning. It follows a 3-step logical flow:
- Grouping Redundant Variables (Hierarchical Clustering): The cluster analysis calculates a distance matrix based on feature correlations. If features are highly correlated (e.g., > 70%), they are considered redundant and grouped into the same cluster.
- Evaluating Real Impact (SHAP Importance): The SHAP step builds a quick "pilot model" using CatBoost. Instead of looking only at linear correlations, it uses SHAP importance to measure how much each feature actually contributes to the model's predictions, revealing which variables are truly powerful and which are just noise.
- Smart Representative Selection: The final function is the decision-maker. It examines each cluster and picks the best representative based on two criteria: 1) Performance: it prioritizes the feature with the highest SHAP score within its cluster; 2) Diversity: it ensures that different feature groups are represented, dropping the redundant, weaker versions. The result is a list of high-performing features with the redundant ones removed to prevent overfitting.
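The clustering and representative-selection steps can be sketched with SciPy's hierarchical clustering; here a plain importance dictionary stands in for the SHAP scores of the CatBoost pilot model:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def pick_representatives(df, importance, corr_threshold=0.7):
    """Group correlated features and keep the most important one per cluster.

    `importance` maps feature name -> score (SHAP importance in the project;
    any importance measure works for this sketch).
    """
    corr = df.corr().abs()
    # Distance = 1 - |correlation|; features above the threshold merge.
    dist = 1.0 - corr.values
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    clusters = fcluster(linkage(condensed, method="average"),
                        t=1.0 - corr_threshold, criterion="distance")
    keep = []
    for c in np.unique(clusters):
        members = corr.columns[clusters == c]
        # Best representative = highest importance within the cluster.
        keep.append(max(members, key=lambda f: importance[f]))
    return keep
```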
The run_optuna function automates hyperparameter optimization for a CatBoost classifier by integrating Optuna with a 5-fold Stratified K-Fold cross-validation strategy. It specifically addresses class imbalance by dynamically calculating a scale_pos_weight and explores a multi-dimensional search space—including iterations, depth, and learning rate—to maximize the mean AUC score. After completing the specified trials, the function automatically re-fits the model using the optimal parameters on the entire training dataset, returning both the finalized production-ready model and the detailed optimization study object.
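The structure of this routine, sketched with scikit-learn only: a `GradientBoostingClassifier` and a small manual grid stand in for CatBoost and Optuna's sampler, and `scale_pos_weight` (a CatBoost parameter) is computed only to illustrate the calculation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# Imbalanced synthetic target (~25% positives).
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400) > 1.0).astype(int)

# scale_pos_weight as the project computes it: negatives per positive.
# (CatBoost accepts this directly; sklearn's GBM does not.)
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

def objective(params):
    """Mean AUC over 5 stratified folds, the quantity the study maximizes."""
    model = GradientBoostingClassifier(random_state=0, **params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

# Tiny grid standing in for the search over iterations/depth/learning rate.
search_space = [{"n_estimators": n, "max_depth": d, "learning_rate": lr}
                for n in (50, 100) for d in (2, 3) for lr in (0.05, 0.1)]
best = max(search_space, key=objective)

# Re-fit on the full training set with the winning parameters.
final_model = GradientBoostingClassifier(random_state=0, **best).fit(X, y)
```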
- Calculate Natural Odds: Analyzes the target variable ($y$) to find the ratio of "Good" vs. "Bad" customers in the real world.
- Optimization Loop: Tests various PDO values (10 to 100) to find the "sweet spot" where scores are well separated (high standard deviation) but not excessively capped at the limits (300 or 850).

The engine uses a logarithmic transformation to ensure that as risk decreases, the score increases. The math behind the score:

- Factor: $PDO / \ln(2)$ determines how many points represent a doubling of the odds.
- Offset: $BaseScore - (Factor \times \ln(BaseOdds))$ sets the starting point of the scale.
- Odds Calculation: $(1 - P) / P$ converts the probability of default into "Good" odds.
Note: the good odds are calculated with the direction of risk fixed: higher risk corresponds to a lower score, and lower risk to a higher score.

- ✅ Final Score: $Offset + (Factor \times \ln(Odds))$

Note: the final score formula therefore uses addition (+) so that the score moves in the expected direction.
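Putting the three formulas together, with assumed calibration values (PDO = 50, base score 600 at base odds 50:1):

```python
import math

def make_scorer(pdo=50.0, base_score=600.0, base_odds=50.0):
    """Points-to-double-the-odds scaling: base_score at base_odds good:bad,
    plus pdo points every time the odds double."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)

    def score(p_default):
        odds = (1.0 - p_default) / p_default   # "Good" odds
        return offset + factor * math.log(odds)

    return score

score = make_scorer()
```

At a probability of default of 1/51 the good odds are exactly 50:1, so the score lands on the base score of 600; doubling the odds to 100:1 adds exactly PDO points.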
Goal: Categorize scores into actionable risk bands.
Once scores are generated, we must decide how to group them. This code supports three distinct strategies:
- 📏 Equal Interval: Splits the score range (e.g., 300–850) into equal-sized chunks.
- 👥 Quantile (Equal Population): Ensures each "bin" or "grade" has the same number of customers.
- 🔔 Normal Distribution: Uses statistical probability (mean and standard deviation) to set cut-points, focusing more resolution around the average.
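The three strategies, sketched with pandas and SciPy on synthetic scores:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

rng = np.random.default_rng(0)
scores = np.clip(rng.normal(600, 60, size=1000), 300, 850)

# Equal interval: same-width score chunks.
equal_bins = pd.cut(scores, bins=5)

# Quantile: same number of customers per grade.
quantile_bins = pd.qcut(scores, q=5)

# Normal distribution: cut points from the fitted mean/std,
# which places more resolution around the average score.
mu, sigma = scores.mean(), scores.std()
edges = norm.ppf([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], loc=mu, scale=sigma)
normal_bins = pd.cut(scores, bins=edges)
```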
- AUC and GINI: Measure the discriminatory power of the risk bands by comparing their actual discrimination against that of a perfect model.
- KS Statistic: Measures the maximum separation between the cumulative distributions of Good and Bad customers.
- PSI: Measures the distributional shift in risk bands between two populations.
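PSI and KS are short enough to sketch directly: `psi` takes per-band proportions from the two populations, and `ks_statistic` compares the empirical score distributions of goods and bads.

```python
import numpy as np

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions
    (proportions per risk band)."""
    e = np.asarray(expected, dtype=float) + eps
    a = np.asarray(actual, dtype=float) + eps
    e, a = e / e.sum(), a / a.sum()
    return float(np.sum((a - e) * np.log(a / e)))

def ks_statistic(scores_good, scores_bad):
    """Maximum gap between the cumulative score distributions
    of Good and Bad customers."""
    grid = np.sort(np.concatenate([scores_good, scores_bad]))
    cdf_good = np.searchsorted(np.sort(scores_good), grid, side="right") / len(scores_good)
    cdf_bad = np.searchsorted(np.sort(scores_bad), grid, side="right") / len(scores_bad)
    return float(np.max(np.abs(cdf_good - cdf_bad)))
```

A common rule of thumb is that PSI below 0.1 indicates a stable population, while values above 0.25 signal a significant shift.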
The score distribution is displayed below:
The overall statistical tests by AUC, GINI, and KS are shown below:
| Statistic | AUC | GINI | KS |
|---|---|---|---|
| Result | ✅ 93.32% | ✅ 86.63% | ✅ 73.49% |
The back-testing results at monthly intervals (the same frequency used in development) are displayed below:
Note: The model's classification ability remains strong across risk segments over time.
Note: At the earliest historical point, the data was not yet stable.
MIT · Built for learning purposes