Document critical data quality limitations blocking ML development by whitehackr · Pull Request #10 · whitehackr/simtom

whitehackr · 2025-09-27T18:32:32Z

Response to Flit Team Feedback

This PR addresses critical data quality issues identified by the Flit ML team that are blocking production model training.

Key Documentation Added

ML_SIGNAL_ENHANCEMENT_PLAN.md: Comprehensive technical analysis of the architectural changes required to achieve ML-viable data quality. Compares the Redis-enhanced and SimPy approaches and includes a detailed implementation roadmap.

ROADMAP.md: Documents three critical known issues (K6-K8):

ML Signal Strength Crisis: Feature-target correlations ~0.05 vs required 0.30-0.55
Impossible Customer Default Patterns: Multiple defaults per customer per day
Hyperactive Customer Behavior: 720+ transactions/day vs realistic 1-4/month
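The three issues above could be checked on a sample of generated records along these lines. This is a minimal sketch, not the project's validation code: the record layout and the field names (`customer_id`, `date`, `risk_score`, `defaulted`) are assumptions made for illustration, as are the helper names.

```python
from collections import Counter
from math import sqrt

# Hypothetical sample records; field names are illustrative, not simtom's schema.
transactions = [
    {"customer_id": "c1", "date": "2025-09-01", "risk_score": 0.9, "defaulted": 1},
    {"customer_id": "c1", "date": "2025-09-01", "risk_score": 0.8, "defaulted": 1},
    {"customer_id": "c1", "date": "2025-09-01", "risk_score": 0.1, "defaulted": 0},
    {"customer_id": "c2", "date": "2025-09-02", "risk_score": 0.2, "defaulted": 0},
    {"customer_id": "c2", "date": "2025-09-03", "risk_score": 0.3, "defaulted": 0},
]


def pearson(xs, ys):
    """Pearson correlation between a feature column and the target column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


def multiple_defaults_per_day(rows):
    """Return (customer, date) pairs that record more than one default."""
    counts = Counter((r["customer_id"], r["date"]) for r in rows if r["defaulted"])
    return {key: n for key, n in counts.items() if n > 1}


def max_daily_transactions(rows):
    """Return each customer's busiest day, as a transaction count."""
    counts = Counter((r["customer_id"], r["date"]) for r in rows)
    busiest = {}
    for (customer_id, _date), n in counts.items():
        busiest[customer_id] = max(busiest.get(customer_id, 0), n)
    return busiest


signal = pearson(
    [r["risk_score"] for r in transactions],
    [r["defaulted"] for r in transactions],
)
impossible_defaults = multiple_defaults_per_day(transactions)
busiest_days = max_daily_transactions(transactions)
```

On real output, `signal` would be compared against the 0.30-0.55 range quoted above, `impossible_defaults` should be empty, and `busiest_days` should sit far below the 720+ figure.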

Data Guide Updates: Clear warnings about current data limitations and usage recommendations for ML teams.

Impact

Models trained on the current data reach at most 0.615 AUC-ROC, with confidence scores of 31.5%. Production BNPL requires 90-95% precision at the high-risk tier. These issues block all production ML development until they are resolved.
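For reference, the two metrics quoted here can be computed from labels and model scores as sketched below. This PR does not define the "high-risk tier"; the sketch assumes it means the top fraction of customers ranked by predicted risk, and the function names, labels, and scores are illustrative.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outscores a random
    negative, counting ties as half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_top(labels, scores, fraction):
    """Precision among the top `fraction` of records by score.

    Treating this slice as the 'high-risk tier' is an assumption of the sketch.
    """
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _score, label in ranked[:k]) / k


# Illustrative labels (1 = defaulted) and model scores, not real simtom output.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

auc = auc_roc(labels, scores)
top_tier_precision = precision_at_top(labels, scores, fraction=0.2)
```

The pairwise loop is quadratic and only suitable for small samples; a library implementation would be the practical choice on full datasets.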

Next Steps

Technical team review of architectural approaches outlined in ML_SIGNAL_ENHANCEMENT_PLAN.md to determine implementation priority and timeline.


whitehackr · 2025-09-29T07:37:13Z

Article on synthetic data generation: https://www.turing.com/kb/synthetic-data-generation-techniques

whitehackr added 4 commits September 21, 2025 00:40

document known data quality limitations

1d4145b

Add age-spending correlation limitation to roadmap and data guide. Current implementation generates uniform spending across age groups, preventing ML models from learning realistic demographic patterns.

add hourly traffic pattern limitation

578dcd1

Document unrealistic business hour distribution with cliff-edge pattern and missing lunch/evening peaks that affect ML temporal feature engineering.

document critical ML and behavioral data limitations

b3f94e6

Addresses Flit team feedback on data quality blocking ML development.

add comprehensive ML signal enhancement plan and roadmap updates

2e9d113
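The two distribution limitations described in the commits above (uniform spending across age groups, and a cliff-edge hourly pattern) could be spotted with a check like the following. It is a rough sketch under stated assumptions: the field names `age` and `amount`, the ISO timestamp format, the ten-year age buckets, and the 5x cliff threshold are all illustrative choices, not values from the project.

```python
from collections import Counter, defaultdict
from datetime import datetime


def mean_spend_by_age_group(rows, bucket=10):
    """Average transaction amount per age bucket (20 means ages 20-29)."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in rows:
        group = (r["age"] // bucket) * bucket
        totals[group][0] += r["amount"]
        totals[group][1] += 1
    return {group: total / n for group, (total, n) in sorted(totals.items())}


def hourly_counts(timestamps):
    """Number of transactions in each hour of the day (index 0-23)."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def cliff_edges(counts, ratio=5.0):
    """Hours whose volume differs from the previous hour by more than `ratio`x.

    The threshold is an arbitrary illustration, not a calibrated value.
    """
    edges = []
    for hour in range(1, 24):
        low, high = sorted((counts[hour - 1], counts[hour]))
        if high > ratio * max(low, 1):
            edges.append(hour)
    return edges


# Hypothetical records: identical spend in every age group signals the
# missing age-spending correlation.
rows = [
    {"age": 23, "amount": 100.0},
    {"age": 27, "amount": 100.0},
    {"age": 45, "amount": 100.0},
    {"age": 48, "amount": 100.0},
]
spend = mean_spend_by_age_group(rows)

# Hypothetical timestamps with abrupt starts and stops around 09:00 and 17:00.
timestamps = (
    ["2025-09-01T08:30:00"]
    + [f"2025-09-01T09:{m:02d}:00" for m in range(10)]
    + [f"2025-09-01T17:{m:02d}:00" for m in range(10)]
)
edges = cliff_edges(hourly_counts(timestamps))
```

A flat `spend` mapping and a short list of sharp `edges` with no lunch or evening rise would match the limitations these commits document.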

whitehackr wants to merge 4 commits into main from doc/data-limitations
