Feat/data profiling#7
pd.Series(dtype=int, index=...) initializes every entry with NaN, which forces a conversion to float64. Building a dict first and constructing the Series from it preserves the int64 dtype.
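The dtype behaviour described in this comment can be reproduced with a minimal sketch. The column names and counts below are made up for illustration and are not taken from the PR.

```python
import pandas as pd

columns = ["a", "b", "c"]  # hypothetical column names

# Pre-allocating a Series over a non-empty index fills it with NaN.
# NaN cannot be stored in a NumPy integer array, so the requested
# integer dtype is not kept.
preallocated = pd.Series(dtype=int, index=columns)
print(preallocated.dtype)

# Collecting the values in a plain dict first means the Series is
# built from real Python integers, so an integer dtype is inferred.
counts = {name: 0 for name in columns}
counts["a"] += 3
from_dict = pd.Series(counts)
print(from_dict.dtype)
```

The dict-first approach also avoids a later astype(int) call, which would fail if any NaN were still present.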
Pull request overview
This PR adds a comprehensive data profiling system to Metis that provides caching of profiling computations and support for importing pre-computed profiles from external tools. The system includes 30+ profiling functions covering cardinalities, value distributions, patterns, data types, domain classification, and similarity metrics. A DataProfileManager singleton manages database-backed caching, and a flexible importer system supports various input formats including HyFD, AIDFD, and CFDFinder outputs.
Changes:
- Renamed mesTime to timestamp in DQResult class and database models for consistency
- Added DataProfile database model for storing cached profiling results
- Implemented DataProfileManager singleton with database-backed caching
- Added 30+ profiling functions organized into cardinalities, value distribution, patterns/types, domain classification, and sketches
- Created importer system supporting inline JSON and external file formats for pre-computed profiles
- Integrated profiling system into DQOrchestrator with automatic context management
- Added comprehensive documentation for import formats
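To make the cardinality and value-distribution categories listed above concrete, here is a minimal sketch of three such measures in plain Python. The function names and the treatment of None as the null marker are assumptions for illustration; the definitions are common textbook ones and may differ from the implementations in this PR.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

Values = Sequence[Optional[Hashable]]


def distinct_count(values: Values) -> int:
    """Number of distinct non-null values in the column."""
    return len({v for v in values if v is not None})


def uniqueness(values: Values) -> float:
    """Distinct non-null values divided by the number of non-null values."""
    non_null = [v for v in values if v is not None]
    return len(set(non_null)) / len(non_null) if non_null else 0.0


def constancy(values: Values) -> float:
    """Frequency of the most common non-null value divided by the non-null count."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    most_common_count = Counter(non_null).most_common(1)[0][1]
    return most_common_count / len(non_null)


column = ["x", "x", "y", None]  # made-up sample column
print(distinct_count(column), uniqueness(column), constancy(column))
```

Each function takes only the column values, which keeps them easy to wrap with a caching layer later.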
Reviewed changes
Copilot reviewed 57 out of 64 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| requirements.txt | Added numpy and datasketch dependencies |
| metis/writer/database_writer.py | Updated to use timestamp instead of mesTime |
| metis/utils/result.py | Renamed mesTime field to timestamp in DQResult class |
| metis/utils/data_config.py | Added data_profiles configuration field, removed duplicate lines |
| metis/database_models.py | Added DataProfile model for caching profiling results |
| metis/dq_orchestrator.py | Integrated DataProfileManager initialization and profile importing |
| metis/profiling/data_profile_manager.py | Singleton manager for profile caching with serialization/deserialization |
| metis/profiling/cache.py | Decorator for transparent profiling function caching |
| metis/profiling/__init__.py | Exports cached versions of all profiling functions |
| metis/profiling/importers/* | Importer classes for various profile formats (FD, UCC, IND, histograms, etc.) |
| metis/profiling/single_column/* | Cached wrappers for profiling functions |
| metis/utils/data_profiling/* | Core implementations of 30+ profiling functions |
| metis/metric/* | Updated metrics to use timestamp instead of mesTime |
| docs/DATA_PROFILE_IMPORT_FORMATS.md | Comprehensive documentation of import formats |
| README.md | Added data profiling documentation and examples |
68c0112 to 7846ef5
lisehr left a comment
Thanks for the great extension. I reviewed it and it looks really good to me.
Adds a data profiling module with functions for cardinality metrics, value distributions, patterns, and domain classification. All profiling results are cached in the database via a DataProfileManager singleton, which eliminates redundant computations across metric executions. Pre-computed profiles from external tools like HyFD or CFDFinder can be imported through the data_profiles field in data loader configs.
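The caching idea described above can be sketched roughly as follows. This is not the PR's DataProfileManager: the class name ProfileCache, the cached_profile decorator, the table layout, and the cache key made of function, dataset, and column names are all hypothetical, and an in-memory SQLite database stands in for the project's database.

```python
import json
import sqlite3
from functools import wraps


class ProfileCache:
    """Hypothetical singleton that stores profiling results in a database."""

    _instance = None

    def __new__(cls):
        # Create the single shared instance (and its table) on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._conn = sqlite3.connect(":memory:")
            cls._instance._conn.execute(
                "CREATE TABLE profiles (key TEXT PRIMARY KEY, value TEXT)"
            )
        return cls._instance

    def get(self, key):
        """Return (found, value) so that a cached null is not read as a miss."""
        row = self._conn.execute(
            "SELECT value FROM profiles WHERE key = ?", (key,)
        ).fetchone()
        return (False, None) if row is None else (True, json.loads(row[0]))

    def put(self, key, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO profiles VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()


def cached_profile(func):
    """Look the result up in the cache before running the profiling function."""

    @wraps(func)
    def wrapper(dataset, column, values):
        key = f"{func.__name__}:{dataset}:{column}"  # assumed key scheme
        cache = ProfileCache()
        found, value = cache.get(key)
        if found:
            return value
        result = func(dataset, column, values)
        cache.put(key, result)
        return result

    return wrapper


calls = []  # records which columns were actually computed


@cached_profile
def null_count(dataset, column, values):
    calls.append(column)
    return sum(v is None for v in values)


first = null_count("orders", "price", [1, None, 3])
second = null_count("orders", "price", [1, None, 3])  # served from the cache
print(first, second, calls)
```

A key built only from names would return stale results if the underlying data changed, so a real cache would likely need some form of invalidation or a data fingerprint in the key.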