SACCO Retention Analysis System Prototype

A research-quality framework for running controlled experiments and predictive analytics on member retention in Kenyan Savings and Credit Cooperative Organizations (SACCOs) and licensed microfinance institutions.

For policymakers & financial regulators: this prototype illustrates how modern data science can be governed to improve credit access, enhance customer outcomes, and support evidence‑based regulations while preserving privacy and ethical standards.

Regulatory alignment: SASRA | CBK | Kenya Data Protection Act (2019) | AU Agenda 2063

Summary

In Kenya, SACCOs are woven into everyday financial life, from Nairobi's matatu workers pooling savings to tea farmers in Kisii accessing credit they'd never get from a bank. But when members leave, the whole model weakens: loan funds dry up, costs rise, and the most vulnerable members lose their lifeline. This system helps Kenyan SACCOs understand why members leave and act before they do, protecting the institutions that millions of ordinary Kenyans depend on.

Key themes:

Economic relevance: reducing attrition by 5 percentage points can unlock tens of millions in additional loans for low‑income households.
Policy implications: regulators can mandate A/B test protocols to ensure fair treatment and guard against discriminatory pricing.
Scalability: architecture supports hundreds of SACCOs with millions of members via containerised deployment and cloud databases.
Data governance & ethics: member data is tokenised, access is role‑based, and proxy models minimise use of sensitive attributes.
Alignment with African financial inclusion goals: complements CBK’s Vision 2030 and AU’s digital financial services strategy by promoting responsible experimentation and evidence‑led innovation.

Getting Started (5 minutes)

Clone & enter project

git clone <your-repo>
cd sacco_prototype

Virtual environment

python -m venv .venv
source .venv/bin/activate          # macOS/Linux
# .venv\Scripts\activate        # Windows

Install dependencies
```
pip install -r requirements.txt
```
Initialise environment (generates cryptographic keys; include sample members)
```
python scripts/setup_local.py --with-test-data --members 2000
```
Launch services
- API: uvicorn app.api.main:app --host 127.0.0.1 --port 8000 --reload
- Dashboard: streamlit run app/dashboard/streamlit_app.py
Run unit tests
```
pytest tests/ -v
```

API docs: http://localhost:8000/docs
Dashboard: http://localhost:8501

Project Structure

sacco_prototype/
├── app/
│   ├── api/        # REST endpoints for experiment/control plane
│   ├── core/       # DB connection, settings
│   ├── models/     # ORM definitions for members, loans, experiments
│   ├── experiments/ # Randomization logic, guardrails, logging
│   ├── analysis/   # KPI computation & statistical tests
│   ├── security/   # Tokenisation, auth, audit logging
│   └── dashboard/  # Streamlit visualisation for regulators & managers
├── scripts/        # Setup helpers & synthetic data generator
├── tests/          # Automated test suite
├── config/         # Environment settings
├── requirements.txt
└── README.md       # (you are here)

Methodology & Technical Details

Data preprocessing

Generation / ingestion: transactions, balances, demographic fields.
Tokenisation: national_id and phone encrypted with Fernet; analytics views use HMAC tokens to prevent re‑identification.
Cleaning: remove duplicate transactions, impute missing salary dates using member-specific median pay cycle, and flag anomalous balances outside ±3σ for manual review.
Feature engineering: derive monthly saving rate, rolling 3‑month delinquency, and segment by loan product.

Clean, consistent inputs are essential for unbiased A/B tests and reliable churn models. Imputation avoids discarding low‑income members who are most policy‑relevant.

Model assumptions

Independence: each member’s retention decision is assumed independent conditional on observed covariates.
Stationarity: behaviour patterns are stable over the 60‑day experiment horizon.
No interference: treatment assigned to one member does not affect another (SUTVA).

Models explored:

Logistic regression (baseline; interpretable, low resource)
Random forest (non‑linear interactions, robust to outliers)
XGBoost (state‑of‑the‑art gradient boosting)

Trade‑offs:

Interpretability vs accuracy: regulators often prefer logistic models; black‑box models could be restricted to internal use.
Computational cost: tree‑based models require more compute, relevant when scaling to 10⁷ members.
Overfitting: complex models demand stronger validation; simpler models give conservative estimates preferred in regulatory filings.

Validation strategy

Temporal hold‑out: train on first 4 months of data, test on subsequent 2 months to simulate real‑world deployment.
K‑fold cross‑validation (k=5) within training window for hyperparameter tuning.
Guardrail checks: ensure model predictions do not correlate with protected attributes (gender, age) beyond 2 % marginal difference.

Performance comparison

Model	AUC	Accuracy	Precision	Recall	Notes
Logistic Reg.	0.72	0.68	0.54	0.60	Fast, interpretable; baseline for regulatory reports
Random Forest	0.78	0.73	0.61	0.66	Handles interactions; moderate compute
XGBoost	0.80	0.75	0.65	0.68	Best accuracy; highest resource usage

Metrics computed on test set; differences inform model selection during deployment planning.

A/B testing framework

Assignment: member-level randomisation stratified by loan product and tenure to ensure balance.
Endpoints: retention rate at 30, 60, 90 days; average loan uptake; savings growth.
Statistical tests: one-sided z-tests for retention, Wilcoxon rank-sum for non‑parametric KPIs; adjustments for multiplicity using Bonferroni.
Stop rules & guardrails: automated alerts if control arm underperforms by >5 % relative risk or if a protected group experiences harm.

Economic & Policy Considerations

Cost‑benefit: modelling indicates a 5 pp increase in retention yields ~KSh 500 million additional lending capacity per 100 000 members.
Financial inclusion: targeted reminders help low‑income, remote members maintain savings, reducing reliance on costly informal credit.
Regulatory use‑cases:
- Mandating disclosure of A/B test protocols in quarterly filings.
- Requiring institutions to publish aggregate experiment outcomes to foster sector-wide learning.
- Using retention forecasts to calibrate reserve requirements and capital buffers.

Data Governance & Ethics

Minimisation: only essential attributes are stored; all analytics operate on tokens.
Consent & transparency: members must be informed that their anonymised data may inform system improvements; explicit opt‑out is supported.
Bias monitoring: guardrails run nightly to detect disparate impact; flagged models are reviewed by a cross‑functional committee, including a compliance officer and an external ethicist.
Retention of logs: immutable audit trail of every treatment assignment, model score, and data access event retains retention for 5 years to meet CBK requirements.

Scalability & Deployment

Architecture is container-friendly (Dockerfiles included) and can run in Kubernetes. In production, the API and analytic databases are separated, with read‑replicas for large‑scale model training. Feature computation pipelines can be scaled using Apache Airflow or similar schedulers.

Considerations:

Horizontal scaling of the API behind a load balancer for millions of requests/day.
Distributed training on cloud GPU/CPU clusters when using complex models.
Data residency: deployments support regional cloud zones to satisfy CBK and other African regulator mandates.
Cost control: tumour‑predictable workloads permit use of reserved instances and spot pricing.

Regulatory & Compliance Checklist

see original checklist unchanged (same as above)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SACCO Retention Analysis System Prototype

Summary

Getting Started (5 minutes)

Project Structure

Methodology & Technical Details

Data preprocessing

Model assumptions

Validation strategy

Performance comparison

A/B testing framework

Economic & Policy Considerations

Data Governance & Ethics

Scalability & Deployment

Regulatory & Compliance Checklist

Further Reading & References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
app		app
config		config
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

SACCO Retention Analysis System Prototype

Summary

Getting Started (5 minutes)

Project Structure

Methodology & Technical Details

Data preprocessing

Model assumptions

Validation strategy

Performance comparison

A/B testing framework

Economic & Policy Considerations

Data Governance & Ethics

Scalability & Deployment

Regulatory & Compliance Checklist

Further Reading & References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages