Feat/data profiling #7

Open
Tomic-Riedel wants to merge 27 commits into HPI-Information-Systems:main from Tomic-Riedel:feat/data-profiling

Conversation

@Tomic-Riedel
Collaborator

Adds a data profiling module with functions for cardinality metrics, value distributions, patterns, and domain classification. All profiling results are cached in the database via a DataProfileManager singleton, which eliminates redundant computations across metric executions. Pre-computed profiles from external tools like HyFD or CFDFinder can be imported through the data_profiles field in data loader configs.
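
To make the caching idea concrete, here is a minimal sketch of a singleton that stores profiling results in a database and skips recomputation on a cache hit. It assumes SQLite and JSON serialization; the class name `ProfileCacheSketch`, the table layout, and the `get_or_compute` method are illustrative and are not taken from this PR's `DataProfileManager`.

```python
import json
import sqlite3
from typing import Any, Callable, List, Optional


class ProfileCacheSketch:
    """Hypothetical singleton that caches profiling results in SQLite."""

    _instance: Optional["ProfileCacheSketch"] = None

    def __new__(cls) -> "ProfileCacheSketch":
        # Create the shared instance and its table only once.
        if cls._instance is None:
            inst = super().__new__(cls)
            inst._conn = sqlite3.connect(":memory:")
            inst._conn.execute(
                "CREATE TABLE profiles ("
                "dataset TEXT, col TEXT, task TEXT, payload TEXT, "
                "PRIMARY KEY (dataset, col, task))"
            )
            cls._instance = inst
        return cls._instance

    def get_or_compute(
        self, dataset: str, column: str, task: str, compute: Callable[[], Any]
    ) -> Any:
        # Look up a stored result first; fall back to computing it.
        row = self._conn.execute(
            "SELECT payload FROM profiles WHERE dataset=? AND col=? AND task=?",
            (dataset, column, task),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        result = compute()
        self._conn.execute(
            "INSERT INTO profiles VALUES (?, ?, ?, ?)",
            (dataset, column, task, json.dumps(result)),
        )
        self._conn.commit()
        return result


# Usage: the second call is served from the database, not recomputed.
calls: List[int] = []


def count_distinct() -> int:
    calls.append(1)
    return len({"a", "b", "a"})


manager = ProfileCacheSketch()
first = manager.get_or_compute("customers", "city", "distinct_count", count_distinct)
second = ProfileCacheSketch().get_or_compute(
    "customers", "city", "distinct_count", count_distinct
)
```

Keying the cache on dataset, column, and task is one possible design; it only works if a dataset name always refers to the same data.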

Copilot AI review requested due to automatic review settings February 22, 2026 21:53
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a comprehensive data profiling system to Metis that provides caching of profiling computations and support for importing pre-computed profiles from external tools. The system includes 30+ profiling functions covering cardinalities, value distributions, patterns, data types, domain classification, and similarity metrics. A DataProfileManager singleton manages database-backed caching, and a flexible importer system supports various input formats including HyFD, AIDFD, and CFDFinder outputs.
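
As a rough illustration of the kinds of single-column profiling functions described above, the following sketch computes a cardinality metric, a value distribution, and a character-class pattern in plain Python. The function names and the pattern alphabet (`A`, `a`, `9`) are assumptions for illustration and do not necessarily match the functions added in this PR.

```python
import re
from collections import Counter
from typing import Dict, Hashable, List, Optional


def distinct_count(values: List[Optional[Hashable]]) -> int:
    """Number of distinct non-null values."""
    return len({v for v in values if v is not None})


def uniqueness(values: List[Optional[Hashable]]) -> float:
    """Distinct non-null values divided by all non-null values."""
    non_null = [v for v in values if v is not None]
    return len(set(non_null)) / len(non_null) if non_null else 0.0


def value_distribution(values: List[Optional[Hashable]]) -> Dict[Hashable, float]:
    """Relative frequency of each non-null value."""
    non_null = [v for v in values if v is not None]
    total = len(non_null)
    return {v: c / total for v, c in Counter(non_null).items()} if total else {}


def generalize_pattern(value: str) -> str:
    """Map uppercase letters to 'A', lowercase to 'a', digits to '9'."""
    out = re.sub(r"[A-Z]", "A", value)
    out = re.sub(r"[a-z]", "a", out)
    return re.sub(r"\d", "9", out)


# Usage on a small example column with one null value.
column = ["AB-12", "CD-34", None, "AB-12"]
n_distinct = distinct_count(column)
ratio = uniqueness(column)
distribution = value_distribution(column)
patterns = Counter(generalize_pattern(v) for v in column if v is not None)
```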

Changes:

  • Renamed mesTime to timestamp in DQResult class and database models for consistency
  • Added DataProfile database model for storing cached profiling results
  • Implemented DataProfileManager singleton with database-backed caching
  • Added 30+ profiling functions organized into cardinalities, value distribution, patterns/types, domain classification, and sketches
  • Created importer system supporting inline JSON and external file formats for pre-computed profiles
  • Integrated profiling system into DQOrchestrator with automatic context management
  • Added comprehensive documentation for import formats
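
The importer idea from the list above (inline JSON versus an external file) could look roughly like the sketch below. The entry keys `values` and `file`, the `lhs`/`rhs` layout, and the `load_fds` helper are hypothetical; the actual formats are the ones documented in docs/DATA_PROFILE_IMPORT_FORMATS.md, which this sketch does not reproduce.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class FunctionalDependency:
    """A dependency lhs -> rhs between column names."""

    lhs: Tuple[str, ...]
    rhs: str


def load_fds(entry: Dict[str, Any]) -> List[FunctionalDependency]:
    """Read dependencies from inline 'values' or from a JSON 'file' (assumed keys)."""
    if "values" in entry:
        raw = entry["values"]
    elif "file" in entry:
        raw = json.loads(Path(entry["file"]).read_text(encoding="utf-8"))
    else:
        raise ValueError("entry needs either 'values' or 'file'")
    return [FunctionalDependency(tuple(item["lhs"]), item["rhs"]) for item in raw]


# Usage: the same dependency supplied inline and through a temporary file.
payload = [{"lhs": ["zip"], "rhs": "city"}]
inline = load_fds({"values": payload})
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fds.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    from_file = load_fds({"file": str(path)})
```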

Reviewed changes

Copilot reviewed 57 out of 64 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| requirements.txt | Added numpy and datasketch dependencies |
| metis/writer/database_writer.py | Updated to use timestamp instead of mesTime |
| metis/utils/result.py | Renamed mesTime field to timestamp in DQResult class |
| metis/utils/data_config.py | Added data_profiles configuration field, removed duplicate lines |
| metis/database_models.py | Added DataProfile model for caching profiling results |
| metis/dq_orchestrator.py | Integrated DataProfileManager initialization and profile importing |
| metis/profiling/data_profile_manager.py | Singleton manager for profile caching with serialization/deserialization |
| metis/profiling/cache.py | Decorator for transparent profiling function caching |
| metis/profiling/__init__.py | Exports cached versions of all profiling functions |
| metis/profiling/importers/* | Importer classes for various profile formats (FD, UCC, IND, histograms, etc.) |
| metis/profiling/single_column/* | Cached wrappers for profiling functions |
| metis/utils/data_profiling/* | Core implementations of 30+ profiling functions |
| metis/metric/* | Updated metrics to use timestamp instead of mesTime |
| docs/DATA_PROFILE_IMPORT_FORMATS.md | Comprehensive documentation of import formats |
| README.md | Added data profiling documentation and examples |
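
For readers unfamiliar with the "transparent caching decorator" pattern mentioned for metis/profiling/cache.py, a minimal sketch follows. It uses an in-memory dictionary instead of a database, and the decorator name `cached_profile` and its call signature are assumptions, not the PR's actual API.

```python
import functools
from typing import Any, Callable, Dict, List, Optional, Tuple

# In-memory stand-in for a persistent cache (assumption for this sketch).
_CACHE: Dict[Tuple[str, str, str], Any] = {}
compute_calls: List[str] = []


def cached_profile(func: Callable[[List[Any]], Any]) -> Callable[..., Any]:
    """Wrap a profiling function so repeated calls reuse the stored result."""

    @functools.wraps(func)
    def wrapper(dataset: str, column: str, values: List[Any]) -> Any:
        # The key ignores the values themselves; it assumes that a
        # (dataset, column) pair always refers to the same data.
        key = (func.__name__, dataset, column)
        if key not in _CACHE:
            compute_calls.append(func.__name__)
            _CACHE[key] = func(values)
        return _CACHE[key]

    return wrapper


@cached_profile
def null_count(values: List[Optional[str]]) -> int:
    """Count missing entries in a column."""
    return sum(v is None for v in values)


# Usage: the second call hits the cache and does not recompute.
emails = ["a@example.org", None, None]
first = null_count("customers", "email", emails)
second = null_count("customers", "email", emails)
```

The wrapper keeps the original function name via `functools.wraps`, so the cache key stays stable and readable.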

@Tomic-Riedel Tomic-Riedel requested a review from lisehr February 22, 2026 21:59
Collaborator

@lisehr lisehr left a comment

Thanks for the great extension. I reviewed it, and it looks really good to me.

@lisehr lisehr self-assigned this Feb 25, 2026
3 participants