Merged
27 commits
dcd9cb2
feat: add data profiling utilities for single-column cardinality metrics
Tomic-Riedel Jan 6, 2026
8342e3f
fix: use dict-based Series construction for int dtype columns
Tomic-Riedel Jan 6, 2026
af674ed
Merge branch 'HPI-Information-Systems:main' into feat/data-profiling
Tomic-Riedel Jan 6, 2026
8329caf
feat: add value distribution metrics
Tomic-Riedel Jan 14, 2026
22c9ee0
feat: add patterns and data types profiling tasks
Tomic-Riedel Jan 14, 2026
213a506
feat: add summaries and sketches profiling tasks
Tomic-Riedel Jan 14, 2026
0f8e9b7
feat: add domain classification profiling tasks
Tomic-Riedel Jan 14, 2026
07e16d1
feat: update requirements to include numpy and datasketch
Tomic-Riedel Jan 14, 2026
0750a4d
ref: rename mesTime to timestamp in DQResult
Tomic-Riedel Feb 22, 2026
a41ac03
feat: add DataProfile model for caching profiling results
Tomic-Riedel Feb 22, 2026
76b7e6f
feat: add DataProfileManager singleton for profile caching
Tomic-Riedel Feb 22, 2026
df0239b
feat: add caching decorator for profiling functions
Tomic-Riedel Feb 22, 2026
cd01407
feat: wrap existing profiling functions with caching
Tomic-Riedel Feb 22, 2026
e418d0e
feat: add base importer class for data profiles
Tomic-Riedel Feb 22, 2026
d706943
feat: add scalar value importer for simple column profiles
Tomic-Riedel Feb 22, 2026
55679f1
feat: add histogram, patterns, quartiles, and jaccard importers
Tomic-Riedel Feb 22, 2026
7eb5d82
feat: add dependency importers for FD, UCC, and IND
Tomic-Riedel Feb 22, 2026
961d756
feat: add data_profiles field to DataConfig
Tomic-Riedel Feb 22, 2026
8027d44
feat: integrate data profile import into DQOrchestrator
Tomic-Riedel Feb 22, 2026
03c3cae
docs: add data profile import format documentation
Tomic-Riedel Feb 22, 2026
b58b865
Merge branch 'main' into feat/data-profiling
Tomic-Riedel Feb 22, 2026
bc87bbf
fix: remove incompatible cached wrapper from estimate_jaccard_from_mi…
Tomic-Riedel Feb 25, 2026
07d4e00
fix: handle multi-Series args in cached decorator
Tomic-Riedel Feb 25, 2026
e70b4ee
fix: add MinHash serialize/deserialize support in DataProfileManager
Tomic-Riedel Feb 25, 2026
214fdff
fix: escape dot and add quantifier in URL domain pattern
Tomic-Riedel Feb 25, 2026
0a8ce25
fix: use upsert in DataProfileManager.store() to prevent duplicate rows
Tomic-Riedel Feb 25, 2026
7846ef5
fix: parse FD formats per-line to avoid silent HyFD drop
Tomic-Riedel Feb 25, 2026
44 changes: 42 additions & 2 deletions README.md
@@ -44,7 +44,7 @@ Examples: `completeness_NullRatio`, `minimality_DuplicateCount`
 class DQResult:
     def __init__(
         self,
-        mesTime: pd.Timestamp,
+        timestamp: pd.Timestamp,
         DQvalue: float,
         DQdimension: str,
         DQmetric: str,
@@ -57,7 +57,7 @@ class DQResult:
````

To create a new instance of DQResult, one needs to provide at least the following arguments:
-- **mesTime: pd.Timestamp**: The time at which a result was assessed.
+- **timestamp: pd.Timestamp**: The time at which a result was assessed.
- **DQvalue: float**: The result of the assessment. This currently only supports quantitative assessments.
- **DQdimension: str**: The name of the data quality dimension that was assessed e.g. completeness, accuracy, etc.
- **DQmetric: str**: The name of the specific metric inside the given dimension that was assessed.
@@ -67,5 +67,45 @@ Furthermore, there are more optional arguments that might need to be set dependi
- **rowIndex: Optional[int]**: Index of the row this result is associated with. This can either be used together with columnNames to assess data quality on a cell level or for row based metrics.
- **DQannotations: Optional[dict]**: A dictionary in which metrics can store any additional information or annotations. It currently does not need to follow a predefined structure.

## Data Profiling

Metis includes a data profiling system that caches computed statistics and supports importing pre-computed profiles.

### Cached Profiling Functions

Import profiling functions from `metis.profiling`; their results are cached automatically:

```python
from metis.profiling import null_count, distinct_count, data_type

# These are automatically cached when DataProfileManager is initialized
nulls = null_count(df["column"])
```

### Importing Pre-computed Profiles

You can import pre-computed data profiles, for example from external tools such as HyFD or CFDFinder, via the data loader config:

```json
{
  "loader": "CSV",
  "name": "Adult",
  "file_name": "adult.csv",
  "data_profiles": {
    "fd": {
      "source": "hyfd",
      "file": "outputs/adult_hyfd.txt"
    },
    "null_count": {
      "source": "manual",
      "values": [
        {"column": "age", "value": 0},
        {"column": "workclass", "value": 1836}
      ]
    }
  }
}
```

For complete documentation of all supported import formats, see [Data Profile Import Formats](docs/DATA_PROFILE_IMPORT_FORMATS.md).
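
As a rough illustration of how a config like the one above could be consumed, the sketch below reads the `data_profiles` section with the standard `json` module and separates `manual` entries from file-backed ones. The helper `manual_profile_values` is hypothetical and is not part of Metis; the actual importers may handle these entries differently.

```python
import json

# The same config as in the example above, embedded as a string.
config_text = """
{
  "loader": "CSV",
  "name": "Adult",
  "file_name": "adult.csv",
  "data_profiles": {
    "fd": {"source": "hyfd", "file": "outputs/adult_hyfd.txt"},
    "null_count": {
      "source": "manual",
      "values": [
        {"column": "age", "value": 0},
        {"column": "workclass", "value": 1836}
      ]
    }
  }
}
"""


def manual_profile_values(config):
    """Collect values from 'manual' entries as {(metric, column): value}."""
    values = {}
    for metric, spec in config.get("data_profiles", {}).items():
        if spec.get("source") != "manual":
            continue  # file-based sources such as "hyfd" need their own parser
        for entry in spec.get("values", []):
            values[(metric, entry["column"])] = entry["value"]
    return values


config = json.loads(config_text)
manual = manual_profile_values(config)

# Entries that point at an output file produced by an external tool.
file_backed = {
    metric: spec["file"]
    for metric, spec in config["data_profiles"].items()
    if "file" in spec
}
```

Here `manual` holds the inline values keyed by metric and column, while `file_backed` lists the files that a format-specific importer would still have to parse.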
