Document critical data quality limitations blocking ML development #10
Open
whitehackr wants to merge 4 commits into main
Add age-spending correlation limitation to roadmap and data guide. Current implementation generates uniform spending across age groups, preventing ML models from learning realistic demographic patterns.
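One way to confirm the uniform-spending limitation is to compare average spend across age bands. The following is a minimal sketch, not the project's own check: the `(age, amount)` row shape, the function names, and the sample data are all hypothetical, chosen only to illustrate the idea.

```python
import random
from collections import defaultdict
from statistics import mean


def mean_spend_by_age_band(rows, band_width=10):
    """Average amount per age band; band 20 covers ages 20-29 when band_width is 10."""
    buckets = defaultdict(list)
    for age, amount in rows:
        buckets[(age // band_width) * band_width].append(amount)
    return {band: mean(values) for band, values in sorted(buckets.items())}


def relative_spread(band_means):
    """(max - min) of the band averages divided by their mean; near 0 means uniform spending."""
    values = list(band_means.values())
    return (max(values) - min(values)) / mean(values)


# Hypothetical data mimicking the limitation: amounts drawn independently of age.
rng = random.Random(0)
uniform_rows = [(rng.randint(20, 69), rng.uniform(20, 200)) for _ in range(5000)]

# Hypothetical contrast: the same rows with an assumed age effect applied to the amount.
skewed_rows = [(age, amount * (0.5 + age / 50)) for age, amount in uniform_rows]

uniform_spread = relative_spread(mean_spend_by_age_band(uniform_rows))
skewed_spread = relative_spread(mean_spend_by_age_band(skewed_rows))
print(f"uniform: {uniform_spread:.3f}  skewed: {skewed_spread:.3f}")
```

A spread close to zero indicates that age carries no spending signal for a model to learn; the threshold at which the spread becomes useful is a judgment call and is not specified in this PR.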
Document unrealistic business hour distribution with cliff-edge pattern and missing lunch/evening peaks that affect ML temporal feature engineering.
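The cliff-edge pattern and the missing peaks can both be read off an hour-of-day histogram. The sketch below is an illustration under assumptions: the helper names and both sample distributions are hypothetical, and the real generator's output format is not shown in this PR.

```python
from collections import Counter


def hourly_counts(hours):
    """Count transactions per hour of day (0-23), including hours with no activity."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]


def largest_adjacent_jump(counts):
    """Largest change between consecutive hours, as a fraction of the busiest hour."""
    peak = max(counts)
    return max(abs(counts[h + 1] - counts[h]) for h in range(23)) / peak


def local_peaks(counts):
    """Hours that are strictly busier than both neighbouring hours."""
    return [h for h in range(1, 23) if counts[h - 1] < counts[h] > counts[h + 1]]


# Hypothetical cliff-edge data: flat volume from 09:00 to 16:59, nothing outside it.
cliff_counts = hourly_counts([h for h in range(9, 17) for _ in range(100)])

# Hypothetical smoother profile with a lunch peak (12:00) and an evening peak (19:00).
smooth_counts = [2, 1, 1, 1, 2, 4, 8, 15, 25, 35, 45, 60,
                 80, 65, 50, 45, 50, 60, 75, 90, 70, 45, 20, 8]

print(largest_adjacent_jump(cliff_counts), local_peaks(cliff_counts))
print(round(largest_adjacent_jump(smooth_counts), 3), local_peaks(smooth_counts))
```

A jump equal to the full peak volume with no local peaks is the signature of the cliff-edge pattern; a profile with gradual ramps and distinct peaks gives temporal features something to encode.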
Addresses Flit team feedback on data quality blocking ML development.
Article on synthetic data generation: https://www.turing.com/kb/synthetic-data-generation-techniques
Response to Flit Team Feedback
This PR addresses critical data quality issues identified by the Flit ML team that are blocking production model training.
Key Documentation Added
ML_SIGNAL_ENHANCEMENT_PLAN.md: Comprehensive technical analysis of required architectural changes to achieve ML-viable data quality. Compares Redis-enhanced vs SimPy approaches with detailed implementation roadmap.
ROADMAP.md: Documents three critical known issues (K6-K8).
Data Guide Updates: Clear warnings about current data limitations and usage recommendations for ML teams.
Impact
Models trained on the current data reach at most 0.615 AUC-ROC, with confidence scores of 31.5%. Production BNPL requires 90-95% precision at the high-risk tier. These issues block all production ML development until they are resolved.
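For reference, the two metrics quoted above can be computed as in the following sketch. It rests on assumptions: the PR does not define the high-risk tier, so it is taken here to be the top-scoring fraction of rows (`tier_fraction` is a hypothetical parameter), and the labels and scores are made-up illustration data rather than results from this project.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_in_top_tier(labels, scores, tier_fraction=0.1):
    """Share of true positives among the highest-scored tier_fraction of rows."""
    ranked = sorted(zip(scores, labels), reverse=True)
    k = max(1, int(len(ranked) * tier_fraction))
    return sum(y for _, y in ranked[:k]) / k


# Hypothetical labels (1 = default) and model scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(round(auc_roc(labels, scores), 3))
print(precision_in_top_tier(labels, scores, tier_fraction=0.2))
```

The pairwise AUC computation is quadratic in the number of rows and is meant only to make the definition explicit; a library implementation would be the practical choice on real data.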
Next Steps
Technical team review of architectural approaches outlined in ML_SIGNAL_ENHANCEMENT_PLAN.md to determine implementation priority and timeline.