A research-quality framework for running controlled experiments and predictive analytics on member retention in Kenyan Savings and Credit Cooperative Organizations (SACCOs) and licensed microfinance institutions.
For policymakers & financial regulators: this prototype illustrates how modern data science can be governed to improve credit access, enhance customer outcomes, and support evidence‑based regulation while preserving privacy and ethical standards.
Regulatory alignment: SASRA | CBK | Kenya Data Protection Act (2019) | AU Agenda 2063
In Kenya, SACCOs are woven into everyday financial life, from Nairobi's matatu workers pooling savings to tea farmers in Kisii accessing credit they'd never get from a bank. But when members leave, the whole model weakens: loan funds dry up, costs rise, and the most vulnerable members lose their lifeline. This system helps Kenyan SACCOs understand why members leave and act before they do, protecting the institutions that millions of ordinary Kenyans depend on.
Key themes:
- Economic relevance: reducing attrition by 5 percentage points can unlock tens of millions in additional loans for low‑income households.
- Policy implications: regulators can mandate A/B test protocols to ensure fair treatment and guard against discriminatory pricing.
- Scalability: architecture supports hundreds of SACCOs with millions of members via containerised deployment and cloud databases.
- Data governance & ethics: member data is tokenised, access is role‑based, and proxy models minimise use of sensitive attributes.
- Alignment with African financial inclusion goals: complements Kenya’s Vision 2030 and the AU’s digital financial services strategy by promoting responsible experimentation and evidence‑led innovation.
- Clone & enter project
git clone <your-repo>
cd sacco_prototype
- Virtual environment
python -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .venv\Scripts\activate     # Windows
- Install dependencies
pip install -r requirements.txt
- Initialise environment (generates cryptographic keys; optionally loads sample members)
python scripts/setup_local.py --with-test-data --members 2000
- Launch services
- API:
uvicorn app.api.main:app --host 127.0.0.1 --port 8000 --reload
- Dashboard:
streamlit run app/dashboard/streamlit_app.py
- Run unit tests
pytest tests/ -v
API docs: http://localhost:8000/docs
Dashboard: http://localhost:8501
sacco_prototype/
├── app/
│ ├── api/ # REST endpoints for experiment/control plane
│ ├── core/ # DB connection, settings
│ ├── models/ # ORM definitions for members, loans, experiments
│ ├── experiments/ # Randomization logic, guardrails, logging
│ ├── analysis/ # KPI computation & statistical tests
│ ├── security/ # Tokenisation, auth, audit logging
│ └── dashboard/ # Streamlit visualisation for regulators & managers
├── scripts/ # Setup helpers & synthetic data generator
├── tests/ # Automated test suite
├── config/ # Environment settings
├── requirements.txt
└── README.md # (you are here)
- Generation / ingestion: transactions, balances, demographic fields.
- Tokenisation: national_id and phone are encrypted with Fernet; analytics views use HMAC tokens to prevent re‑identification.
- Cleaning: remove duplicate transactions, impute missing salary dates using member-specific median pay cycle, and flag anomalous balances outside ±3σ for manual review.
- Feature engineering: derive monthly saving rate, rolling 3‑month delinquency, and segment by loan product.
Clean, consistent inputs are essential for unbiased A/B tests and reliable churn models. Imputation avoids discarding low‑income members who are most policy‑relevant.
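The tokenisation and cleaning steps can be sketched with the standard library alone. This is a minimal illustration, not the prototype's actual code: the names `TOKEN_KEY`, `tokenise`, `flag_anomalous`, and `impute_next_pay_date` are hypothetical, the key is a placeholder, and reading "impute missing salary dates" as "last observed date plus the member's median gap" is an assumption. Fernet encryption of the raw identifiers is omitted because it needs the third-party `cryptography` package.

```python
import hashlib
import hmac
import statistics
from datetime import date, timedelta

# Placeholder secret (assumption): a real deployment would load this from a key store.
TOKEN_KEY = b"example-key-do-not-use"


def tokenise(value: str, key: bytes = TOKEN_KEY) -> str:
    """Return a deterministic HMAC-SHA256 token for an identifier such as national_id."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def flag_anomalous(balances: list[float], n_sigma: float = 3.0) -> list[bool]:
    """Mark balances lying more than n_sigma standard deviations from the mean."""
    mean = statistics.fmean(balances)
    sd = statistics.pstdev(balances)
    if sd == 0:
        return [False] * len(balances)
    return [abs(b - mean) > n_sigma * sd for b in balances]


def impute_next_pay_date(pay_dates: list[date]) -> date:
    """Estimate a missing salary date as the last observed date plus the median gap.

    Needs at least two observed dates for the member.
    """
    ordered = sorted(pay_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return ordered[-1] + timedelta(days=round(statistics.median(gaps)))
```

Because the token is deterministic, the same member maps to the same token across tables, so joins in analytics views still work without exposing the raw identifier. Flagged balances are only marked, not dropped, in line with the manual-review step above.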
- Independence: each member’s retention decision is assumed independent conditional on observed covariates.
- Stationarity: behaviour patterns are stable over the 60‑day experiment horizon.
- No interference: treatment assigned to one member does not affect another (SUTVA).
Models explored:
- Logistic regression (baseline; interpretable, low resource)
- Random forest (non‑linear interactions, robust to outliers)
- XGBoost (state‑of‑the‑art gradient boosting)
Trade‑offs:
- Interpretability vs accuracy: regulators often prefer logistic models; black‑box models could be restricted to internal use.
- Computational cost: tree‑based models require more compute, relevant when scaling to 10⁷ members.
- Overfitting: complex models demand stronger validation; simpler models give conservative estimates preferred in regulatory filings.
- Temporal hold‑out: train on first 4 months of data, test on subsequent 2 months to simulate real‑world deployment.
- K‑fold cross‑validation (k=5) within training window for hyperparameter tuning.
- Guardrail checks: ensure model predictions do not correlate with protected attributes (gender, age) beyond 2 % marginal difference.
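A rough sketch of the temporal hold‑out and the guardrail check follows. It assumes each record carries a hypothetical `as_of` date field, and it interprets the "2 % marginal difference" as a gap of at most 0.02 between the mean predicted churn scores of any two groups; the prototype may define the guardrail differently.

```python
from datetime import date


def temporal_split(rows: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Split records into train (before cutoff) and test (on or after cutoff)."""
    train = [r for r in rows if r["as_of"] < cutoff]
    test = [r for r in rows if r["as_of"] >= cutoff]
    return train, test


def max_group_gap(scores: list[float], groups: list[str]) -> float:
    """Largest difference in mean predicted score between any two groups."""
    totals: dict[str, tuple[float, int]] = {}
    for score, group in zip(scores, groups):
        total, count = totals.get(group, (0.0, 0))
        totals[group] = (total + score, count + 1)
    means = [total / count for total, count in totals.values()]
    return max(means) - min(means)


def passes_guardrail(scores: list[float], groups: list[str], threshold: float = 0.02) -> bool:
    """True when no two groups differ in mean score by more than the threshold."""
    return max_group_gap(scores, groups) <= threshold
```

Splitting on a date rather than at random keeps future information out of the training window, which is the point of simulating deployment.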
| Model | AUC | Accuracy | Precision | Recall | Notes |
|---|---|---|---|---|---|
| Logistic Reg. | 0.72 | 0.68 | 0.54 | 0.60 | Fast, interpretable; baseline for regulatory reports |
| Random Forest | 0.78 | 0.73 | 0.61 | 0.66 | Handles interactions; moderate compute |
| XGBoost | 0.80 | 0.75 | 0.65 | 0.68 | Best accuracy; highest resource usage |
Metrics computed on test set; differences inform model selection during deployment planning.
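For reference, the four metrics in the table can be computed from labels and scores as sketched below. The function names are illustrative and the toy inputs in the usage note are not the prototype's data; the figures in the table cannot be reproduced from this snippet.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision and recall for binary labels (1 = churned)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def auc(y_true: list[int], scores: list[float]) -> float:
    """Probability that a random churner outscores a random non-churner (ties count half).

    Requires at least one positive and one negative label.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.6, 0.1]`, three of the four churner/non-churner pairs are ranked correctly, so `auc` returns 0.75. The pairwise loop is quadratic and only suitable for small illustrations.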
- Assignment: member-level randomisation stratified by loan product and tenure to ensure balance.
- Endpoints: retention rate at 30, 60, 90 days; average loan uptake; savings growth.
- Statistical tests: one-sided z-tests for retention, Wilcoxon rank-sum for non‑parametric KPIs; adjustments for multiplicity using Bonferroni.
- Stop rules & guardrails: automated alerts if control arm underperforms by >5 % relative risk or if a protected group experiences harm.
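The assignment and testing steps above might look roughly like this. It is a sketch under assumptions: the member fields `token`, `product`, and `tenure_band` are hypothetical names, the z-test uses the pooled two-proportion form, and Bonferroni is applied by dividing the significance level by the number of endpoints.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist


def stratified_assign(members: list[dict], seed: int = 0) -> dict[str, str]:
    """Randomise members to arms, alternating within each (product, tenure_band) stratum."""
    rng = random.Random(seed)
    strata: dict[tuple, list[str]] = defaultdict(list)
    for m in members:
        strata[(m["product"], m["tenure_band"])].append(m["token"])
    assignment: dict[str, str] = {}
    for key in sorted(strata):
        tokens = sorted(strata[key])
        rng.shuffle(tokens)
        offset = rng.randrange(2)  # random start so odd-sized strata do not always favour one arm
        for i, token in enumerate(tokens):
            assignment[token] = "treatment" if (i + offset) % 2 == 0 else "control"
    return assignment


def one_sided_z_test(retained_t: int, n_t: int, retained_c: int, n_c: int) -> tuple[float, float]:
    """Test H1: treatment retention > control retention. Returns (z, p_value).

    Undefined when the pooled retention rate is exactly 0 or 1.
    """
    pooled = (retained_t + retained_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (retained_t / n_t - retained_c / n_c) / se
    return z, 1 - NormalDist().cdf(z)


def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag each p-value against alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]
```

Fixing the seed makes the assignment reproducible, which helps when an assignment has to be re-derived for audit.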
- Cost‑benefit: modelling indicates a 5 pp increase in retention yields ~KSh 500 million additional lending capacity per 100 000 members.
- Financial inclusion: targeted reminders help low‑income, remote members maintain savings, reducing reliance on costly informal credit.
- Regulatory use‑cases:
- Mandating disclosure of A/B test protocols in quarterly filings.
- Requiring institutions to publish aggregate experiment outcomes to foster sector-wide learning.
- Using retention forecasts to calibrate reserve requirements and capital buffers.
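The cost‑benefit figure can be unpacked with simple arithmetic. The snippet only restates the numbers quoted above (5 pp, 100 000 members, ~KSh 500 million); the per‑member value it derives is implied by those inputs and is not an independent estimate.

```python
members = 100_000
retention_gain_pp = 5               # percentage points, as quoted above
added_capacity_ksh = 500_000_000    # approximate figure, as quoted above

# Members who stay because of the 5 pp improvement.
retained_members = members * retention_gain_pp // 100

# Lending capacity implied per additional retained member.
capacity_per_retained_ksh = added_capacity_ksh / retained_members

print(retained_members, capacity_per_retained_ksh)
```

Under these inputs, 5 000 additional members are retained, which implies roughly KSh 100 000 of lending capacity per retained member.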
- Minimisation: only essential attributes are stored; all analytics operate on tokens.
- Consent & transparency: members must be informed that their anonymised data may inform system improvements; explicit opt‑out is supported.
- Bias monitoring: guardrails run nightly to detect disparate impact; flagged models are reviewed by a cross‑functional committee, including a compliance officer and an external ethicist.
- Retention of logs: an immutable audit trail of every treatment assignment, model score, and data access event is retained for 5 years to meet CBK requirements.
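One common way to make an audit trail tamper‑evident is to chain entries by hash, sketched below. This is an assumption about how immutability could be approached, not a description of the prototype's audit logger; `append_event` and `verify_chain` are hypothetical names, and a hash chain detects edits rather than preventing them.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

In practice the log would also need write‑once storage and access controls; the chain only makes after‑the‑fact changes visible to an auditor.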
Architecture is container-friendly (Dockerfiles included) and can run in Kubernetes. In production, the API and analytic databases are separated, with read‑replicas for large‑scale model training. Feature computation pipelines can be scaled using Apache Airflow or similar schedulers.
Considerations:
- Horizontal scaling of the API behind a load balancer for millions of requests/day.
- Distributed training on cloud GPU/CPU clusters when using complex models.
- Data residency: deployments support regional cloud zones to satisfy CBK and other African regulator mandates.
- Cost control: predictable workloads permit use of reserved instances and spot pricing.
- SASRA Prudential Guidelines 2024
- CBK FinTech Survey 2023
- Kenya Data Protection Authority guidance on automated decision‑making
- AU Digital Transformation Strategy 2020–2030
This document is part of a prototype toolkit. It does not replace legal or regulatory advice.