Skip to content

Add correctness, rule-consistency, and currency metrics#4

Open
jb3rndt wants to merge 28 commits intomainfrom
feat/correctness-metric
Open

Add correctness, rule-consistency, and currency metrics#4
jb3rndt wants to merge 28 commits intomainfrom
feat/correctness-metric

Conversation

@jb3rndt
Copy link
Collaborator

@jb3rndt jb3rndt commented Oct 19, 2025

This PR is based on #3, but I'm opening this already to get your feedback :)

Adds three new metrics:

  • correctness: compares each data point with its ground truth using a distance function (either simple absolute difference for numbers or levenshtein distance for strings)
  • currency: given a decline rate per column, the name of the column that contains the assessment date of each value in the tuple, and optionally a simulated assessment date to not rely on "now", calculates the currency based on this formula: curr(w, A) = exp(-decline(A) * age(w,A)) (with w the attribute value and A the column)
    • the decline rate is interpreted in years right now. It might be useful to make that configurable too?
  • rule-based consistency: checks whether the given rules per column hold on an attribute. Weighing a rule happens inside the rule definition itself. The return value of all rules are just added up when assessing the consistency value.
    • since rules are defined as python functions right now, I allowed metrics to be initialized by passing a config object directly (keeping JSON as an option too of course)

Copilot AI review requested due to automatic review settings October 19, 2025 19:14
@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch from 3b1f382 to 9f9bc44 Compare October 19, 2025 19:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces three new data quality metrics (Correctness, Currency, and Rule-based Consistency), refactors writer implementations to use a shared SQLAlchemy-based DatabaseWriter, and adds a SQLAlchemy ORM model for persisting results.

  • New metrics: Correctness (distance vs. ground truth), Currency (exponential decay by age), Rule-based Consistency (rule aggregation with certainty).
  • Refactor: Unify SQLite/Postgres writers via DatabaseWriter and SQLAlchemy ORM models; add DQDimension enum and update DQResult to use it.
  • Config handling: Add MetricConfig base and load_config utility; allow passing config objects (notably for rule-based consistency).

Reviewed Changes

Copilot reviewed 18 out of 21 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
metis/writer/sqlite_writer.py Switch SQLiteWriter to SQLAlchemy via DatabaseWriter; provide engine factory.
metis/writer/postgres_writer.py Switch PostgresWriter to SQLAlchemy via DatabaseWriter; provide engine factory.
metis/writer/database_writer.py New base writer using SQLAlchemy ORM models; centralizes table creation and writes.
metis/writer/console_writer.py Minor typing tweak for optional config.
metis/utils/result.py DQdimension type changed to DQDimension enum.
metis/utils/dq_dimension.py New DQDimension StrEnum with dimensions.
metis/models.py New SQLAlchemy declarative model and dynamic table registration.
metis/metric/metric.py Add MetricConfig support and config loader.
metis/metric/currency/currency.py Implement Currency metric with exponential decay by age.
metis/metric/currency/config.py Config dataclass for Currency.
metis/metric/correctness/correctness.py Implement Correctness metric (distance-based).
metis/metric/consistency/rule_consistency.py Implement rule-based consistency with certainty annotation.
metis/metric/consistency/consistency.py Make Consistency accept JSON config path; switch to DQDimension.
metis/metric/consistency/config.py Config dataclasses for consistency metrics.
metis/metric/config.py Base config dataclass helper.
metis/metric/completeness/completeness.py Switch to DQDimension and updated typing.
metis/metric/init.py Export new metrics.
metis/dq_orchestrator.py Allow MetricConfig object in orchestrator assess API.

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch from 9d0d1ad to 7319535 Compare December 1, 2025 22:05
Copy link
Collaborator

@lisehr lisehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three metric implementations look good! Could you please still rename the metrics according to the new naming convention? _

@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch from 7319535 to df48399 Compare December 8, 2025 12:37
@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch 2 times, most recently from 5b741b4 to fa03a80 Compare December 8, 2025 13:37
@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch 2 times, most recently from e815ab8 to 78a52bb Compare December 12, 2025 14:35
@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch from 78a52bb to 1c3d0a5 Compare December 12, 2025 14:47
@jb3rndt
Copy link
Collaborator Author

jb3rndt commented Dec 12, 2025

Metric names and their file names fit the new naming scheme now :)

@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch from fa08c91 to e1362ea Compare December 15, 2025 19:21
@lisehr lisehr self-requested a review December 16, 2025 14:11


@dataclass
class MetricConfig:
Copy link
Collaborator

@lisehr lisehr Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also create a naming scheme for metric configs in analogy to the metrics? like timeliness_config since we used this instead of camel case so far?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the configs, I have adopted the following naming scheme based on the name of the metric this config is used for:
file name: {metric file name}_config.py
class name: equal to the file name

For the consistency metric, this results in:
metric file name and class name: consistency_ruleBasedHinrichs.py

config file name and class name: consistency_ruleBasedHinrichs_config.py

Copy link
Collaborator

@lisehr lisehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments directly in the code.

@jb3rndt
Copy link
Collaborator Author

jb3rndt commented Dec 17, 2025

Thank you, I have adjusted the naming and docstrings accordingly :)

@jb3rndt
Copy link
Collaborator Author

jb3rndt commented Feb 2, 2026

@lisehr I've updated the branch :) Here is a short summary of the parts that have changed since your last review:

Metrics

  • New: completeness_nullRate: Measures the ratio of null values in a dataset (differs from the existing completeness_nullRatio in that it can be configured which granularity should be used). Does it make sense to merge both? (if yes, which name should be kept?)
  • New: completeness_nullAndDMVRate: Extends null rate assessment by detecting Disguised Missing Values (DMVs) using FAHES algorithm
  • Changed: consistency_ruleBasedPipino: Added certainty calculation
  • Changed: timeliness_heinrich: Added certainty calculation

Framework Changes

  • New: Add metric-specific loggers
  • New: CSVWriter
  • New: Datetime Utilities: Precision calculation utilities for timeliness metrics

@jb3rndt jb3rndt requested a review from lisehr February 2, 2026 15:47
Copy link
Collaborator

@lisehr lisehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, all additions look good. A few minor / structural comments:

  • nullRate vs. nullRatio: do we really need both files? I suggest combining them to one file that calculates both, the rate (count) plus the ratio of nulls. Currently it seems that nullRate calculates the ratio too anyway.
  • Please move all configs to Metis/configs/metric
  • Why is correctness_heinrich in the writer folder?

Copy link
Collaborator

@lisehr lisehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb3rndt: sorry, I messed up answering to your last questions.

  • completeness_nullRate vs nullRatio: the new file indeed looks much better and I suggest calling it completeness_nullRatio + make clear to add information in the header that both aspects (count + ratio) are calculated and stored to the DB, also on different granularity levels.
  • for completeness_nullAndDMVRate: also here, please make sure to declare FAHES with link to paper + github already in the header of the file. For the future, we could also use other DMV detectors like DisMis, for now, you can leave the filename as it is. Thank you!

@jb3rndt jb3rndt force-pushed the feat/correctness-metric branch from b23f55b to 68e9684 Compare March 4, 2026 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants