From 4120f878d9e418d1311589a6281eecbd649790a5 Mon Sep 17 00:00:00 2001 From: Rafael Riki Ogawa Osiro Date: Wed, 10 Jun 2026 20:57:21 -0300 Subject: [PATCH 1/2] docs(data-diff): restructure rule type into multi-page reference Replaces the single data-diff-check.md page with a hub structure under docs/data-quality-checks/data-diff/ (introduction, how-it-works, examples, api, faq, how-to-create). Adds the .anomalous-cell CSS class used by the examples tables, updates cross-links from is-replica-of, overview-of-a-check, rule-types-overview, and operations/profile, and rewires mkdocs.yml nav plus redirects from the old URL. --- docs/data-quality-checks/data-diff-check.md | 395 ------------------ docs/data-quality-checks/data-diff/api.md | 82 ++++ .../data-quality-checks/data-diff/examples.md | 283 +++++++++++++ docs/data-quality-checks/data-diff/faq.md | 76 ++++ .../data-diff/how-it-works.md | 158 +++++++ .../data-diff/how-to-create.md | 91 ++++ .../data-diff/introduction.md | 105 +++++ .../is-replica-of-check.md | 4 +- .../overview-of-a-check.md | 2 +- .../rule-types-overview.md | 2 +- docs/operations/profile/profile.md | 2 +- docs/stylesheets/extra.css | 13 + mkdocs.yml | 12 +- 13 files changed, 823 insertions(+), 402 deletions(-) delete mode 100644 docs/data-quality-checks/data-diff-check.md create mode 100644 docs/data-quality-checks/data-diff/api.md create mode 100644 docs/data-quality-checks/data-diff/examples.md create mode 100644 docs/data-quality-checks/data-diff/faq.md create mode 100644 docs/data-quality-checks/data-diff/how-it-works.md create mode 100644 docs/data-quality-checks/data-diff/how-to-create.md create mode 100644 docs/data-quality-checks/data-diff/introduction.md diff --git a/docs/data-quality-checks/data-diff-check.md b/docs/data-quality-checks/data-diff-check.md deleted file mode 100644 index d063ec9d13..0000000000 --- a/docs/data-quality-checks/data-diff-check.md +++ /dev/null @@ -1,395 +0,0 @@ -# Data Diff - -!!! 
info "Recommended Check" - Qualytics recommends using the `dataDiff` rule type instead of the `isReplicaOf`. - - The `isReplicaOf` check is being deprecated and will no longer be maintained, while `dataDiff` provides the same functionality with enhanced performance and additional capabilities. - -## What is Data Diff? - -Think of Data Diff as a **"spot the difference" game for your business data**. - -Just like when you compare two pictures side-by-side to find what's changed, Data Diff compares two sets of information to make sure they match perfectly. It's like having a super-careful assistant who checks that when you copy something important, nothing gets lost, changed, or added by mistake. - -## Add Data Diff Check - -Use the Data Diff Check to compare two tables, detect anomalies, and run a scan to identify mismatched or missing records for accurate data validation. - -
-## What Does Data Diff Do? - -Data Diff helps you answer questions like: - -- "Did all my customer orders copy correctly to the backup system?" -- "Is the sales report showing the same numbers as the original database?" -- "When we moved data from System A to System B, did everything transfer properly?" - -**In simple terms:** It makes sure Data Set A is an exact match of Data Set B. - -## How Does Data Diff Work? - -Let's break it down into simple steps: - -### Step 1: Choose What to Compare - -You pick two sets of data: - -- **The Original** (your main source of truth) -- **The Copy** (backup, report, or transferred data) - -### Step 2: Pick What Matters -You decide which information is important to check. For example: - -- Customer names -- Order amounts -- Product IDs -- Dates - -### Step 3: The Comparison Happens - -Data Diff automatically looks at both sets: - -- Is everything from the original in the copy? -- Is there anything extra in the copy that shouldn't be there? -- Do all the values match exactly? - -### Step 4: Get Your Results - -The Data Diff report shows: - -- **Pass** – Target and reference datasets match; no action needed. -- **Anomalies Found** – Differences detected; view the report to see which rows or fields differ. - -## Why Should You Use Data Diff? - -### 1. Catch Mistakes Before They Cause Problems - -Imagine your finance team creates a quarterly report from last night's data backup. If some transactions didn't copy over, your report would be wrong. Data Diff catches this immediately. - -### 2. Save Time and Reduce Stress - -Instead of manually checking thousands of rows in spreadsheets, Data Diff does it automatically in seconds. - -### 3. Build Trust in Your Data - -When you present numbers to leadership or clients, you can confidently say, "This data has been verified." - -### 4. 
Protect Your Business - -Wrong data can lead to: - -- Incorrect invoices -- Bad business decisions -- Compliance issues -- Customer complaints - -Data Diff acts as your safety net. - -## Real-Life Example: Online Retail Store - -Let me walk you through a complete, real-world scenario: - -### The Situation - -**Sunshine Electronics** is an online store that sells gadgets. Every night at midnight, their system creates a backup copy of all the day's orders. This backup is used for: - - - Creating daily sales reports - - Feeding data to their accounting system - - Analyzing customer trends - -### The Problem They Had - -One morning, the Sales Manager noticed the daily report showed 1,247 orders, but the warehouse had shipped 1,250 packages. **Where did 3 orders go?** - -After investigating, they discovered: - - - The backup system had a glitch - - Some orders placed between 11:58 PM and midnight weren't copied over - - This had been happening for weeks - - They had been under-reporting revenue and had incorrect inventory counts - -### The Solution: Data Diff - -They set up Data Diff to automatically compare their main orders database with the backup every morning. - -
- -**Here's what they compared:** - -**Original Orders Database:** - -| Order ID | Customer Name | Product | Amount | Date | -| :--------- | :------------- | :-------- | :------- | :----------- | -| 10001 | Sarah Johnson | Laptop | $899 | Jan 15, 2025 | -| 10002 | Mike Chen | Headphones | $149 | Jan 15, 2025 | -| 10003 | Emily Davis | Tablet | $399 | Jan 15, 2025 | -| ... | ... | ... | ... | ... | -| 10248 | David Lee | Phone Case | $19 | Jan 15, 2025 | -| 10249 | Anna Brown | USB Cable | $12 | Jan 15, 2025 | -| 10250 | Tom Wilson | Mouse | $29 | Jan 15, 2025 | - -**Backup Orders Database:** - -| Order ID | Customer Name | Product | Amount | Date | -| :--------| :-------------| :-------| :------| :-----| -| 10001 | Sarah Johnson | Laptop | $899 | Jan 15, 2025 | -| 10002 | Mike Chen | Headphones | $149 | Jan 15, 2025 | -| 10003 | Emily Davis | Tablet | $399 | Jan 15, 2025 | -| ... | ... | ... | ... | ... | -| 10248 | Missing | Missing | Missing | Missing | -| 10249 | Missing | Missing | Missing | Missing | -| 10250 | Missing | Missing | Missing | Missing | - -### What Data Diff Discovered - -**ALERT GENERATED:** - -!!! warning "DIFFERENCE DETECTED!" - - Fields Affected: amount, order_id, product, order_date, customer_name - - Rule Applied: Data Diff - - Anomalous Records: 3 - -**Technical Output (from Qualytics):** - -After running the Data Diff check, the system identified mismatched records between the **Original Orders Database (Left)** and the **Backup Orders Database (Right)**. 
- -| Row Status | order_id | amount (Left → Right) | order_date (Left → Right) | customer_name (Left → Right) | product (Left → Right) | -| ----------- | -------- | -------------------- | -------------------------- | ---------------------------- | ---------------------- | -| removed | 10248 | 19.00 → missing | 2025-01-15 → missing | David Lee → missing | Phone Case → missing | -| removed | 10249 | 12.00 → missing | 2025-01-15 → missing | Anna Brown → missing | USB Cable → missing | -| removed | 10250 | 29.00 → missing | 2025-01-15 → missing | Tom Wilson → missing | Mouse → missing | - -![deactivate-user](../assets/data-quality-checks/data-diff/anomaly-result.png) - -### 🔍 Summary -- These three records exist in the **Original Orders Database** but are **missing from the Backup Orders Database**. -- The “removed” status means Data Diff detected entries that weren’t found in the reference (right) table. -- This confirms that some orders failed to copy during the backup process. - -### The Outcome - -**Immediate Benefits:** - -- They fixed the backup system timing issue -- They recovered the missing orders data -- They corrected their sales reports - -**Long-term Benefits:** - -- Now they get an automatic email every morning confirming data matches -- If there's ever a mismatch, they know within hours instead of weeks -- They prevented thousands of dollars in unreported revenue -- Their inventory tracking became accurate again - -## Another Quick Example: Healthcare Clinic - -**City Health Clinic** transfers patient appointment data from their scheduling system to their billing system every hour. - -**They use Data Diff to check:** - -
- -- Patient Name -- Appointment Date -- Doctor Assigned -- Service Type -- Insurance Information - -### 📋 Before Correction (Data Diff Caught This) - -| **Field** | **Scheduling System** | **Billing System** | -|----------------|----------------------|--------------------| -| Patient | Robert Martinez | Robert Martinez | -| Doctor | Dr. Smith | Dr. Smith | -| Insurance Plan | BlueCross Plan **A** | BlueCross Plan **B** | - -The **Insurance Plan** code changed during transfer. Without Data Diff, the clinic would have billed the wrong insurer. - -### ✅ After Correction (Fixed Data) - -| **Field** | **Scheduling System** | **Billing System** | -|----------------|----------------------|--------------------| -| Patient | Robert Martinez | Robert Martinez | -| Doctor | Dr. Smith | Dr. Smith | -| Insurance Plan | BlueCross Plan **A** | BlueCross Plan **A** | - -!!! info - Data Diff caught the mismatch and the billing team corrected it before submitting the claim — avoiding claim rejection, payment delays, and extra work. - -### 🧩 Anomalies Detected – Output Table - -The Data Diff check found a mismatch between the **scheduling_system** and **billing_system** datasets for one record. -The issue was detected in the **insurance_plan** field for the patient **Robert Martinez**. - -| **Row Status** | **Patient** | **Field** | **Left (Scheduling System)** | **Right (Billing System)** | -|----------------|-------------------|-------------------|------------------------------|-----------------------------| -| Changed | Robert Martinez | insurance_plan | BlueCross Plan A | BlueCross Plan B | - -![deactivate-user](../assets/data-quality-checks/data-diff/anomaly-detail.png) - -## Key Takeaways - -**Data Diff is like having a careful proofreader** who checks that when you copy important information, nothing goes wrong. - -**It works automatically**- you set it up once, and it keeps watching your data 24/7. 
- -**It catches problems early**- before they affect your reports, decisions, or customers. - -**It gives you peace of mind**- you can trust that your backup, reports, and transferred data are accurate. - -## When Should You Use Data Diff? - -Use Data Diff whenever you: - -- Copy data from one place to another -- Create backups of important information -- Generate reports from multiple sources -- Transfer data between different systems -- Move data to the cloud -- Export data to partners or vendors - -### Field Scope - -**Multi:** The rule evaluates multiple specified fields. - -**Accepted Types** - -| Type | Supported | -|-------------|:------------------------:| -| `Date` | :material-check-circle:{ style="color: #4caf50" } | -| `Timestamp` | :material-check-circle:{ style="color: #4caf50" } | -| `Integral` | :material-check-circle:{ style="color: #4caf50" } | -| `Fractional`| :material-check-circle:{ style="color: #4caf50" } | -| `String` | :material-check-circle:{ style="color: #4caf50" } | -| `Boolean` | :material-check-circle:{ style="color: #4caf50" } | - -### General Properties - -{% -include-markdown "components/general-props/index.md" -start='' -end='' -%} - -### Specific Properties - -Specify the datastore and table/file where the reference data for the targeted fields is located for comparison. - -| Name | Description | -|------------|---------------------------------------------------------------| -| Row Identifiers | The list of fields defining the compound key to identify rows in the comparison analysis. | -| Datastore | The source datastore where the reference data for the targeted field(s) is located. | -| Table/file | The table, view or file in the source datastore that should serve as the reference for comparison. | -| Comparators | {{ comparator_short_desc }} | - -!!! info - The `DataDiff` rule supports editing of `Row Identifiers` and `Passthrough Fields`, allowing for more tailored configuration. - -!!! note "Details" - ### Row Identifiers - - This optional input allows row comparison analysis by defining a list of fields as row identifiers, it enables a more detailed comparison between tables/files, where each row compound key is used to identify its presence or absence in the reference table/file compared to the target table/file. Qualytics can inform if the row exists or not and distinguish which field values differ in each row present in the reference table/file, helping to determine if it is a data diff. - - !!! info - Anomalies produced by a `DataDiff` quality check making use of `Row Identifiers` have their source records presented in a different visualization.
- See more at: *[Comparison Source Records](../anomalies/details/source-record.md/#comparison-source-records)* - - {% - include-markdown "components/comparators/index.md" - %} - {% - include-markdown "components/comparators/numeric.md" - %} - {% - include-markdown "components/comparators/duration.md" - %} - {% - include-markdown "components/comparators/string.md" - %} - -### Anomaly Types - -{% - include-markdown "components/anomaly-support/index.md" - start='' - end='' -%} - -### Example - -**Scenario**: *Consider that the fields N_NATIONKEY and N_NATIONNAME in the NATION table need to be compared with a backup database for data validation purposes. The data engineering team wants to ensure that both fields in the backup accurately match the original.* - -**Objective**: *Ensure that N_NATIONKEY and N_NATIONNAME from the NATION table match the data in the NATION_BACKUP table.* - -**Sample Data from NATION** - -| N_NATIONKEY | N_NATIONNAME | -|-------------|--------------------| -| 1 | Australia | -| 2 | United States | -| 3 | Uruguay | - -**Reference Sample Data from NATION_BACKUP** - -| N_NATIONKEY | N_NATIONNAME | -|-------------|--------------------| -| 1 | Australia | -| 2 | USA | -| 3 | Uruguay | - -=== "Payload example" - ``` json - { - "description": "Ensure that N_NATIONKEY and N_NATIONNAME from the NATION table match the data in the NATION_BACKUP table", - "coverage": 1, - "properties": { - "ref_container_id": {ref_container_id}, - "ref_datastore_id": {ref_datastore_id} - }, - "tags": [], - "fields": ["N_NATIONKEY", "N_NATIONNAME"], - "additional_metadata": {"key 1": "value 1", "key 2": "value 2"}, - "rule": "dataDiff", - "container_id": {container_id}, - "template_id": {template_id}, - "filter": "1=1" - } - ``` - -**Anomaly Explanation** - -The datasets representing the fields `N_NATIONKEY` and `N_NATIONNAME` in the original and the reference data are not completely identical, indicating a possible discrepancy in the data or an unintended change. 
- -=== "Flowchart" - ```mermaid - graph TD - A[Start] --> B[Retrieve Original Data] - B --> C[Retrieve Reference Data] - C --> D{Do datasets match for both fields?} - D -->|Yes| E[End] - D -->|No| F[Mark as Anomalous] - F --> E - ``` - -=== "SQL" - ```sql - -- An illustrative SQL query comparing original to reference data for both fields. - select - orig.n_nationkey as original_key, - orig.n_nationname as original_name, - ref.n_nationkey as reference_key, - ref.n_nationname as reference_name - from nation as orig - left join nation_backup as ref on orig.n_nationkey = ref.n_nationkey - where - orig.n_nationname <> ref.n_nationname - or - orig.n_nationkey <> ref.n_nationkey - ``` - -**Potential Violation Messages** - -!!! example "Shape Anomaly" - There is 1 record that differs between `NATION_BACKUP` (3 records) and `NATION` (3 records) in `` \ No newline at end of file diff --git a/docs/data-quality-checks/data-diff/api.md b/docs/data-quality-checks/data-diff/api.md new file mode 100644 index 0000000000..526d87fa2b --- /dev/null +++ b/docs/data-quality-checks/data-diff/api.md @@ -0,0 +1,82 @@ +# :material-api:{ .middle style="color: var(--q-brick)" } Data Diff Check API + +The Data Diff check is created and managed through the standard Quality Checks API by setting `rule` to `dataDiff` and listing the compared fields under `fields`. The reference container, Row Identifiers, Passthrough Fields, Comparators, and `diff_change_types` are all configured through the `properties` object. + +!!! tip + For complete API documentation, including request and response schemas, visit the [API docs](https://demo.qualytics.io/api/docs){:target="_blank"}. + +## Endpoints + +| Method | Path | Purpose | +|:---|:---|:---| +| `POST` | `/api/quality-checks` | Create a new Data Diff check. | +| `GET` | `/api/quality-checks/{id}` | Retrieve a Data Diff check by ID. | +| `PUT` | `/api/quality-checks/{id}` | Update an existing Data Diff check. 
| +| `DELETE` | `/api/quality-checks/{id}` | Archive a Data Diff check (soft delete). The check stops being evaluated by Scans and can be restored from the archive view. | + +!!! note "What `PUT` can change" + **Editable:** `description`, `fields`, `filter`, `tags`, `additional_metadata`, `anomaly_message_field`, `status`, `owner_id`, `default_anomaly_assignee_id`, and the `properties` keys `ref_datastore_id`, `ref_container_id`, `id_field_names`, `passthrough_field_names`, `diff_change_types`, `numeric_comparator`, `duration_comparator`, `string_comparator`. + + **Immutable:** `rule`, `container_id`, `template_id`. To change any of these, delete the check and create a new one. + +**Permission**: Author team permission (or above) on the target container's team for `POST`, `PUT`, and `DELETE`; Reporter team permission (or above) for `GET`. + +## Payload Example + +Create a Data Diff check that compares `N_NATIONKEY` and `N_NATIONNAME` between `NATION` and `NATION_BACKUP`, matched by `N_NATIONKEY`, with `POST /api/quality-checks`. The payload below sets `diff_change_types` to `["removed", "changed"]` so unmatched reference rows are not reported as `added` anomalies, a typical choice when the reference is a superset of the target (such as a long-lived backup). 
+ +```json +{ + "description": "Ensure NATION matches NATION_BACKUP on N_NATIONKEY and N_NATIONNAME", + "rule": "dataDiff", + "fields": ["N_NATIONKEY", "N_NATIONNAME"], + "container_id": 145, + "filter": null, + "properties": { + "ref_datastore_id": 22, + "ref_container_id": 803, + "id_field_names": ["N_NATIONKEY"], + "passthrough_field_names": [], + "diff_change_types": ["removed", "changed"] + }, + "tags": ["replication"], + "additional_metadata": {"jira": "DATA-1234"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 12 +} +``` + +## Field Notes + +| Field | Required | Notes | +|:---|:---:|:---| +| `description` | Yes | Free-text description shown in the UI. | +| `rule` | Yes | Must be `"dataDiff"`. | +| `fields` | Yes | Array of field names to compare between target and reference. Order does not affect evaluation. | +| `container_id` | Yes | ID of the target container (the dataset the check runs on). | +| `filter` | No | Spark SQL `WHERE` expression applied to the **target** container before matching. Send `null` for no filter. The reference container is always read in full. | +| `properties.ref_datastore_id` | Yes | ID of the datastore that holds the reference container. | +| `properties.ref_container_id` | Yes | ID of the reference container (table, view, or file) to compare against. | +| `properties.id_field_names` | No | Array of field names that form the compound key used to match target rows to reference rows. Required to produce `changed` diffs and to enable the Comparison Source Records view. Omit (or `[]`) to fall back to a symmetrical set difference that produces only `added`/`removed`. | +| `properties.passthrough_field_names` | No | Array of extra field names carried into the source-records output for context. Passthrough fields appear alongside diffed fields but are never themselves a reason for the anomaly to fire. 
| +| `properties.diff_change_types` | No | Subset of `["added", "removed", "changed"]` that restricts which diff statuses produce an anomaly. Defaults to all three when omitted. An empty list is rejected with HTTP 422; at least one status must be selected. Sending this property on an `isReplicaOf` check is also rejected. See [How It Works → Restricting Anomalies by Status](how-it-works.md#restricting-anomalies-by-status){:target="_blank"}. | +| `properties.numeric_comparator` | No | Numeric Comparator tolerance object. See [How It Works → Comparators](how-it-works.md#comparators){:target="_blank"}. | +| `properties.duration_comparator` | No | Duration Comparator tolerance object. See [How It Works → Comparators](how-it-works.md#comparators){:target="_blank"}. | +| `properties.string_comparator` | No | String Comparator tolerance object. See [How It Works → Comparators](how-it-works.md#comparators){:target="_blank"}. | +| `tags` | No | List of tag names applied to the check for filtering and organization. | +| `additional_metadata` | No | Free-form key-value pairs (typically links to catalog, tickets, governance records). | +| `anomaly_message_field` | No | **Not applicable to Data Diff.** Data Diff emits only Shape Anomalies, which use a fixed message template, so this field is silently ignored at evaluation. Send `null`. | +| `template_id` | No | ID of a Check Template to associate the check with. `null` if not using a template. | +| `status` | No | `"Active"` (default) or `"Draft"`. Draft checks are not evaluated by Scans. | +| `owner_id` | No | ID of the user who owns the check. Defaults to the user creating the check when omitted. | +| `default_anomaly_assignee_id` | No | ID of the user automatically assigned to anomalies produced by the check. When omitted, anomalies are created unassigned and must be triaged manually. | + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, field scope, and general/anomaly properties. 
+- [How It Works](how-it-works.md){:target="_blank"}: full semantics, Row Identifiers, Comparators, and edge cases. +- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data and resulting anomalies. +- [FAQ](faq.md){:target="_blank"}: short answers to the most frequent questions. diff --git a/docs/data-quality-checks/data-diff/examples.md b/docs/data-quality-checks/data-diff/examples.md new file mode 100644 index 0000000000..ecbf7d4bf7 --- /dev/null +++ b/docs/data-quality-checks/data-diff/examples.md @@ -0,0 +1,283 @@ +# Data Diff Check Examples + +Three real-world scenarios that show how the Data Diff check is typically used in production: validating a nightly backup, comparing a system-to-system data transfer, and verifying a post-migration mirror with a scoped filter that suppresses unmatched reference rows. + +In every example, the **Sample Data** table is laid out exactly as the **Comparison Source Records** view in the Qualytics app renders the anomaly: each row has a status, an identifier, and a `Left` (target) / `Right` (reference) pair for every compared field. Only the differing right-side cell carries the anomalous-cell highlight, and missing values render as the literal text *missing*. + +=== "Backup Validation" + + **The situation:** Your `ORDERS` table is the system of record for an e-commerce site. Every night at midnight a backup job replicates it to `ORDERS_BACKUP` in the warehouse datastore. The Sales team builds the morning report from `ORDERS_BACKUP`, so any row missing from the backup silently understates revenue. A Data Diff check validates that the backup is complete every morning. 
+ + **Check configuration** + + | Field | Value | + |:---|:---| + | Rule | Data Diff | + | Fields | `order_id`, `customer_id`, `amount`, `order_date` | + | Row Identifiers | `order_id` | + | Reference Datastore | `Warehouse` | + | Reference Container | `ORDERS_BACKUP` | + | Filter | *(none)* | + | Comparators | *(none)* | + | Status | Active | + | Tags | `critical`, `backup` | + | Additional Metadata | `jira: DATA-3201` | + | Description | ORDERS must match ORDERS_BACKUP every morning | + + **Payload** + + ```json + { + "description": "ORDERS must match ORDERS_BACKUP every morning", + "rule": "dataDiff", + "fields": ["order_id", "customer_id", "amount", "order_date"], + "container_id": 145, + "filter": null, + "properties": { + "ref_datastore_id": 22, + "ref_container_id": 803, + "id_field_names": ["order_id"] + }, + "tags": ["critical", "backup"], + "additional_metadata": {"jira": "DATA-3201"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 12 + } + ``` + + **Sample Data** (as rendered in Comparison Source Records) + + | Row Status | order_id | customer_id (Left → Right) | amount (Left → Right) | order_date (Left → Right) | + |---|---|---|---|---| + | removed | 10248 | 4451 → missing | 19.00 → missing | 2025-01-15 → missing | + | removed | 10249 | 4452 → missing | 12.00 → missing | 2025-01-15 → missing | + | removed | 10250 | 4453 → missing | 29.00 → missing | 2025-01-15 → missing | + + **What gets flagged** + + Three orders that exist on the target side (`ORDERS`) have no matching `order_id` on the reference side (`ORDERS_BACKUP`). Each row is reported with status `removed` and every right-side field shows the literal *missing*. The backup job failed to copy these rows into `ORDERS_BACKUP`. + + !!! 
example "Shape Anomaly" + There are 3 records that differ between `ORDERS_BACKUP` (1247 records) and `ORDERS` (1250 records) in `Warehouse` + + **Flowchart** + + ```mermaid + graph TD + A["No filter, evaluate all target rows"] --> B["Read reference container ORDERS_BACKUP"] + B --> C["Match rows by order_id"] + C --> D{"Any target row without a reference match?"} + D -->|No| E["All target rows pass"] + D -->|Yes| F["Flag each unmatched row as 'removed'. 3 orders missing from backup."] + ``` + + **Equivalent SQL** + + ```sql + -- Rows the Data Diff check would flag as 'removed' for the ORDERS backup. + SELECT t.order_id, t.customer_id, t.amount, t.order_date + FROM orders t + LEFT JOIN orders_backup r ON t.order_id = r.order_id + WHERE r.order_id IS NULL + ORDER BY t.order_id; + ``` + +=== "System-to-System Transfer" + + **The situation:** A clinic's scheduling system writes appointments to a SQL Server `APPOINTMENTS` table, and an hourly ETL copies them into the billing system's `APPOINTMENTS_BILLING` table. Patient name and doctor are reliably copied, but the insurance plan code is rewritten in transit by a translation layer. A Data Diff check confirms that for each `appointment_id`, `patient_name`, `doctor`, and `insurance_plan` agree on both sides, and surfaces any field that the transfer mangled. + + **Check configuration** + + | Field | Value | + |:---|:---| + | Rule | Data Diff | + | Fields | `patient_name`, `doctor`, `insurance_plan` | + | Row Identifiers | `appointment_id` | + | Reference Datastore | `Billing` | + | Reference Container | `APPOINTMENTS_BILLING` | + | Filter | *(none)* | + | Comparators | *(none)* | + | Status | Active | + | Tags | `clinical`, `billing` | + | Additional Metadata | `jira: DATA-4101` | + | Description | Appointments must match between scheduling and billing | + + **Payload** + + ```json + { + "description": "Appointments must match between scheduling and billing", + "rule": "dataDiff", + "fields": ["patient_name", "doctor", "insurance_plan"], + "container_id": 212, + "filter": null, + "properties": { + "ref_datastore_id": 31, + "ref_container_id": 912, + "id_field_names": ["appointment_id"] + }, + "tags": ["clinical", "billing"], + "additional_metadata": {"jira": "DATA-4101"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 18 + } + ``` + + **Sample Data** (as rendered in Comparison Source Records) 
+ + | Row Status | appointment_id | patient_name (Left → Right) | doctor (Left → Right) | insurance_plan (Left → Right) | + |---|---|---|---|---| + | changed | A-7781 | Robert Martinez → Robert Martinez | Dr. Smith → Dr. Smith | BlueCross Plan A → BlueCross Plan B | + + **What gets flagged** + + The pair `(A-7781)` exists on both sides, so the row's status is `changed` rather than `added` or `removed`. Only `insurance_plan` differs, so only that right-side cell is highlighted; `patient_name` and `doctor` agree and render plainly on both sides. + + !!! example "Shape Anomaly" + There is 1 record that differs between `APPOINTMENTS_BILLING` (2841 records) and `APPOINTMENTS` (2841 records) in `Billing` + + **Flowchart** + + ```mermaid + graph TD + A["No filter, evaluate all target rows"] --> B["Read reference container APPOINTMENTS_BILLING"] + B --> C["Match rows by appointment_id"] + C --> D{"Do all compared fields match on each pair?"} + D -->|Yes| E["Row passes"] + D -->|No| F["Flag the row as 'changed'. insurance_plan differs for A-7781."] + ``` + + **Equivalent SQL** + + ```sql + -- Rows the Data Diff check would flag as 'changed' for appointments. + SELECT + t.appointment_id, + t.patient_name AS left_patient_name, r.patient_name AS right_patient_name, + t.doctor AS left_doctor, r.doctor AS right_doctor, + t.insurance_plan AS left_insurance_plan, r.insurance_plan AS right_insurance_plan + FROM appointments t + INNER JOIN appointments_billing r ON t.appointment_id = r.appointment_id + WHERE + t.patient_name <> r.patient_name + OR t.doctor <> r.doctor + OR t.insurance_plan <> r.insurance_plan + ORDER BY t.appointment_id; + ``` + +=== "Scoped Post-Migration Mirror" + + **The situation:** The Customers domain is being migrated from a legacy CRM to a new system. The legacy `CUSTOMERS_LEGACY` table is being phased out, and the new `CUSTOMERS_NEW` table is the system of record. For the cutover window, both systems are written to, and the migration team needs daily confirmation that today's writes are identical on both sides. Yesterday's rows are intentionally allowed to differ (the legacy system kept being patched), so the check is scoped to `created_at = current_date()` on the target. 
+ + **Check configuration** + + | Field | Value | + |:---|:---| + | Rule | Data Diff | + | Fields | `customer_name`, `email`, `tier` | + | Row Identifiers | `customer_id` | + | Reference Datastore | `Legacy CRM` | + | Reference Container | `CUSTOMERS_LEGACY` | + | Filter | `created_at = current_date()` | + | Diff Change Types | `["removed", "changed"]` (ignore `added`-row noise from the unfiltered legacy reference) | + | Comparators | *(none)* | + | Status | Active | + | Tags | `migration`, `customers` | + | Additional Metadata | `jira: DATA-5240` | + | Description | Today's new customers must mirror the legacy CRM | + + **Payload** + + ```json + { + "description": "Today's new customers must mirror the legacy CRM", + "rule": "dataDiff", + "fields": ["customer_name", "email", "tier"], + "container_id": 318, + "filter": "created_at = current_date()", + "properties": { + "ref_datastore_id": 17, + "ref_container_id": 661, + "id_field_names": ["customer_id"], + "diff_change_types": ["removed", "changed"] + }, + "tags": ["migration", "customers"], + "additional_metadata": {"jira": "DATA-5240"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 24 + } + ``` + + **Why the filter and `diff_change_types` matter** + + The filter runs **before** the reference is read into the comparison. The target is scoped to today's `CUSTOMERS_NEW` rows, and only those rows are matched against `CUSTOMERS_LEGACY`. A customer created yesterday with a mismatch in either system stays out of scope and does not contribute to the anomaly. + + The filter does not narrow the reference, however. Read in full, `CUSTOMERS_LEGACY` carries every legacy customer ever created, so any legacy row not matched by a `CUSTOMERS_NEW` row created today would normally surface as `added`. 
That noise has nothing to do with the cutover verification, so the check sets `diff_change_types` to `["removed", "changed"]`, and only today's NEW customers missing from LEGACY, or today's NEW customers whose values differ in LEGACY, are reported. + + **Sample Data** (filtered to `created_at = current_date()`, as rendered in Comparison Source Records) + + | Row Status | customer_id | customer_name (Left → Right) | email (Left → Right) | tier (Left → Right) | + |---|---|---|---|---| + | changed | C-9012 | Lin Wei → Lin Wei | lin@example.com → lin@example.com | gold → silver | + + **What gets flagged** + + `C-9012` exists on both sides today; only `tier` differs, so the row is `changed` and only the right-side `tier` cell is highlighted. No `removed` rows are reported in this batch because every target row from today found a match in the legacy CRM. Legacy customers not present in today's `CUSTOMERS_NEW` writes would normally be reported as `added`, but `diff_change_types` filters that status out for this check. + + !!! example "Shape Anomaly" + There is 1 record that differs between `CUSTOMERS_LEGACY` (38241 records) and `CUSTOMERS_NEW` (412 records) in `Legacy CRM` [filter: created_at = current_date()] + + **Flowchart** + + ```mermaid + graph TD + A["Apply filter: created_at = current_date()"] --> B["Read reference container CUSTOMERS_LEGACY"] + B --> C["Match filtered target rows
against reference by customer_id"] + C --> D{"Any row removed or changed?
(added is suppressed by diff_change_types)"} + D -->|No| E["All filtered rows pass"] + D -->|Yes| F["Flag each differing row with its status:
C-9012 (changed)."] + ``` + + **Equivalent SQL** + + ```sql + -- Rows the Data Diff check would flag for today's customers, + -- with diff_change_types = ["removed", "changed"] + -- (legacy-only rows are intentionally not flagged). + WITH target AS ( + SELECT * FROM customers_new WHERE created_at = current_date() + ) + SELECT + CASE + WHEN r.customer_id IS NULL THEN 'removed' + ELSE 'changed' + END AS row_status, + t.customer_id, + t.customer_name AS left_customer_name, r.customer_name AS right_customer_name, + t.email AS left_email, r.email AS right_email, + t.tier AS left_tier, r.tier AS right_tier + FROM target t + LEFT JOIN customers_legacy r ON t.customer_id = r.customer_id + WHERE + r.customer_id IS NULL + OR t.customer_name <> r.customer_name + OR t.email <> r.email + OR t.tier <> r.tier + ORDER BY t.customer_id; + ``` + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, field scope, and general/anomaly properties. +- [How It Works](how-it-works.md){:target="_blank"}: full semantics, Row Identifiers, Comparators, and edge cases. +- [API](api.md){:target="_blank"}: payload shape and field notes for creating a Data Diff check programmatically. +- [FAQ](faq.md){:target="_blank"}: short answers to the most frequent questions. diff --git a/docs/data-quality-checks/data-diff/faq.md b/docs/data-quality-checks/data-diff/faq.md new file mode 100644 index 0000000000..fb6f0a80d4 --- /dev/null +++ b/docs/data-quality-checks/data-diff/faq.md @@ -0,0 +1,76 @@ +# :material-help-circle-outline:{ .middle style="color: var(--q-brick)" } Data Diff Check FAQ + +Common questions about how the Data Diff check matches rows between target and reference datasets, how it handles missing or differing values, and how anomalies are reported. + +## Behavior + +### What is the difference between Data Diff and Is Replica Of? + +Data Diff (`dataDiff`) is the current rule type for two-table comparison. 
Is Replica Of (`isReplicaOf`) is deprecated and no longer maintained. Both share the same row-by-row comparison engine and the same configuration properties (Row Identifiers, Passthrough Fields, Comparators), but only Data Diff is actively maintained and only Data Diff supports the `diff_change_types` property, which restricts which diff statuses fire anomalies. Use Data Diff for any new check; Is Replica Of is preserved only for existing checks. + +### What are Row Identifiers, and do I need them? + +Row Identifiers are the compound key the platform uses to pair each target row with a reference row. Without them, the platform falls back to a symmetrical set difference and can only produce `added` and `removed` diffs (a row that differs in one field becomes one `removed` plus one `added` row, not one `changed` row). Setting Row Identifiers is the only way to get per-field, side-by-side diffs (`Left` vs `Right` on the same row) in the Comparison Source Records view. See [How It Works → Row Identifiers and Passthrough Fields](how-it-works.md#row-identifiers-and-passthrough-fields){:target="_blank"}. + +### What are Passthrough Fields? + +Passthrough Fields are extra columns carried into the source-records output for context. They appear alongside the diffed fields in the Comparison Source Records view but are not themselves compared, so they never cause the anomaly to fire. Typical use: showing a `customer_name` or `created_at` next to the differing column so anomaly triagers can identify the row without leaving the page. + +### Does the filter clause scope both target and reference? + +No. The filter only narrows the **target** container. The reference container is always read in full. If you also need to narrow the reference (for example, comparing only today's records on both sides), point the check at a view in the reference datastore that encodes the same scope. See [How It Works → The Filter Clause](how-it-works.md#the-filter-clause){:target="_blank"}. 
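
To make the target-only scoping concrete, here is a minimal Python sketch of a keyed comparison in which the filter is applied to the target rows and the reference is used in full. It is an illustration of the semantics described in this FAQ, not the platform's implementation; `keyed_diff`, `target_filter`, and the sample rows are hypothetical names and data.

```python
from typing import Callable, Optional

Row = dict  # one record, keyed by field name (hypothetical shape)


def keyed_diff(
    target: list,
    reference: list,
    key: str,
    fields: list,
    target_filter: Optional[Callable[[Row], bool]] = None,
) -> dict:
    """Return {identifier: status} for every row that differs."""
    # The filter narrows the target only; the reference is used in full.
    scoped = [row for row in target if target_filter is None or target_filter(row)]
    left = {row[key]: row for row in scoped}
    right = {row[key]: row for row in reference}

    statuses = {}
    for ident in sorted(left.keys() | right.keys()):
        if ident not in right:
            statuses[ident] = "removed"  # only on the target (left)
        elif ident not in left:
            statuses[ident] = "added"  # only on the reference (right)
        elif any(left[ident][f] != right[ident][f] for f in fields):
            statuses[ident] = "changed"  # on both sides, values differ
    return statuses


target = [
    {"customer_id": 1, "created": "today", "tier": "gold"},
    {"customer_id": 2, "created": "yesterday", "tier": "gold"},
]
reference = [
    {"customer_id": 1, "tier": "silver"},
    {"customer_id": 2, "tier": "gold"},
    {"customer_id": 3, "tier": "bronze"},
]

unfiltered = keyed_diff(target, reference, "customer_id", ["tier"])
filtered = keyed_diff(
    target, reference, "customer_id", ["tier"],
    target_filter=lambda row: row["created"] == "today",
)
print(unfiltered)
print(filtered)
```

In this sketch, customer `2` matches cleanly when no filter is set, but once the filter removes it from the target it can no longer pair with its reference row and surfaces as `added`. That is the noise a reference-side view (or a restricted `diff_change_types`) is meant to avoid.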
+ +### How are Comparators applied? + +Comparators apply a per-field tolerance to the equality check. The platform supports Numeric, Duration, and String Comparators. Without a Comparator the values are compared strictly: `1.00` and `1.000001` differ, `"Australia"` and `"australia"` differ. Use Comparators when small, expected divergences (rounding, casing, whitespace) would otherwise produce noise. See [How It Works → Comparators](how-it-works.md#comparators){:target="_blank"}. + +## Anomaly Reporting + +### What do the row statuses mean? + +| Status | Meaning | +|:---|:---| +| `removed` | The identifier exists only on the target (left). The reference is missing this row. | +| `added` | The identifier exists only on the reference (right). The reference has a row the target does not. | +| `changed` | The identifier exists on both sides, but at least one compared field has a different value. | + +### Can I exclude `added` (or `removed`, or `changed`) rows from anomalies? + +Yes. The `diff_change_types` property (inside `properties`) takes a list of statuses to allow: any subset of `["added", "removed", "changed"]`. Rows with a status outside the list still get computed but do not contribute to the anomaly. Set it to `["removed", "changed"]` when the reference is intentionally a superset of the target (a staging table with extra QA rows), or to `["changed"]` when row presence is guaranteed by an upstream contract and only value drift matters. The property defaults to all three statuses; an empty list is rejected at the API. See [How It Works → Restricting Anomalies by Status](how-it-works.md#restricting-anomalies-by-status){:target="_blank"}. + +### Why do some rows show "missing" instead of a value? + +When a row is `added`, the target side has no row to read, so every left-side cell renders as the literal text *missing*. When a row is `removed`, the reference side has no row to read, so every right-side cell renders as *missing*. 
This is the same rendering the Comparison Source Records view uses in the Qualytics app. + +### Why doesn't Data Diff produce a Record Anomaly? + +Data Diff is a *shape*-only rule type: the violation is a property of the target as a whole (the set of rows that diverge from the reference), not of an individual record's value. The per-row detail you see is attached to the Shape Anomaly as Comparison Source Records, not as separate Record Anomalies. + +### What does the Shape Anomaly message look like? + +``` +There are N records that differ between (R records) and (T records) in +``` + +Where `N` is the total number of differing rows (`added` + `removed` + `changed`), `R` is the reference row count, and `T` is the target row count after the filter is applied. A `[filter: ]` suffix is appended when a filter is set. + +## Configuration + +### Can I compare two containers in the same datastore? + +Yes. Set `ref_datastore_id` to the same datastore ID as the target's datastore and `ref_container_id` to the second container. The reference does not have to be in a separate datastore. + +### Does Custom Anomaly Description (the `anomaly_message_field` payload field) work for Data Diff? + +No. The Custom Anomaly Description toggle (and the corresponding `anomaly_message_field` payload field) only affects Record Anomaly messages. Because Data Diff emits only Shape Anomalies, the field is silently ignored at evaluation time, and the resulting message uses the fixed Shape Anomaly template (see [What does the Shape Anomaly message look like?](#what-does-the-shape-anomaly-message-look-like)). + +### Which fields can I edit on an existing Data Diff check? + +`PUT /api/quality-checks/{id}` can update the description, filter, tags, status, ownership, comparison `fields` list, all Data Diff-specific properties (Row Identifiers, Passthrough Fields, `diff_change_types`, Comparators), and the reference container (`ref_datastore_id`, `ref_container_id`). 
The rule type and the target container itself are fixed at creation. See the [API page](api.md#endpoints){:target="_blank"} for the full editable/immutable matrix. + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, field scope, and general/anomaly properties. +- [How It Works](how-it-works.md){:target="_blank"}: full semantics, Row Identifiers, Comparators, and edge cases. +- [API](api.md){:target="_blank"}: payload example and field notes for creating a Data Diff check programmatically. +- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data and resulting anomalies. diff --git a/docs/data-quality-checks/data-diff/how-it-works.md b/docs/data-quality-checks/data-diff/how-it-works.md new file mode 100644 index 0000000000..32b1a67c63 --- /dev/null +++ b/docs/data-quality-checks/data-diff/how-it-works.md @@ -0,0 +1,158 @@ +# :material-information-outline:{ .middle style="color: var(--q-brick)" } How Data Diff Checks Work + +This page covers everything the Data Diff check does, in detail: how it matches rows between target and reference datasets, the three diff statuses it produces, how Row Identifiers, Comparators, and `diff_change_types` change the evaluation, and how the resulting Comparison Source Records are rendered in the UI. + +If you only need a quick reference, the [Introduction](introduction.md){:target="_blank"} page covers the formal definition, field scope, and general/anomaly properties. This page is the detailed reference. + +## How the Check Evaluates the Two Datasets + +Every Data Diff check follows the same five-step evaluation flow: + +1. **Apply the filter clause.** If the check has a `filter` set, only the rows in the **target** container that match the filter expression continue to the next step. The reference container is read in full (the filter scopes the target, not the reference). +2. 
**Read the reference container.** The platform reads the comparison fields from the reference table or file in the configured reference datastore. +3. **Match rows.** With Row Identifiers configured, rows are matched by the combination of identifier values: target rows whose identifier values match a reference row are paired up. Without Row Identifiers, the platform performs a symmetrical set difference on the full set of compared fields. +4. **Compare fields on each matched pair.** For every paired row, each listed field is compared between left (target) and right (reference). Comparators (numeric, duration, string) define the tolerance for the comparison; without a Comparator the comparison is strict equality. +5. **Emit a single Shape Anomaly summarizing the diffs.** Every row that differs (added, removed, or changed) becomes part of the anomaly's source records. The dataset-level violation is summarized in a single Shape Anomaly message and the per-row detail is surfaced through the Comparison Source Records view. + +The order matters: the filter is applied **before** matching, so rows the filter excludes on the target side cannot pair with any reference row and cannot contribute to the anomaly. + +## The Three Diff Statuses + +Each differing row carries one of three status values. The status is what drives the Comparison Source Records UI and is exposed in the anomaly payload. + +| Status | Meaning | +|:---|:---| +| `removed` | The row's identifier exists only on the **left** (target). The reference dataset is missing this row. | +| `added` | The row's identifier exists only on the **right** (reference). The reference dataset has a row the target does not. | +| `changed` | The same identifier exists on both sides, but at least one listed field has a different value. The differing fields are reported per row. | + +Without Row Identifiers, only `added` and `removed` are produced, because the platform has no key to match rows on. 
A row that differs in even one field becomes one `removed` (the row from target) and one `added` (the row from reference), in a pure symmetrical set difference. + +!!! info "Row Identifiers matter for clarity" + Setting Row Identifiers turns most discrepancies into a single `changed` row rather than a pair of `removed`/`added` rows. This is the only way to get a per-field diff (`Left` vs `Right` on the same row) in the Comparison Source Records view. Pick Row Identifiers that uniquely identify each row in both target and reference. + +## Restricting Anomalies by Status + +The `diff_change_types` property restricts which of the three diff statuses are allowed to produce an anomaly. By default, every differing row (`added`, `removed`, or `changed`) contributes to the anomaly count. Setting `diff_change_types` to a subset of statuses keeps the comparison running over all rows but only flags the diffs whose status appears in the list. + +| `diff_change_types` value | Behavior | +|:---|:---| +| Property absent from `properties`, or `null` | All three statuses (`added`, `removed`, `changed`) contribute to the anomaly. | +| `["added"]` | Only rows present on the reference side but missing on the target are flagged. | +| `["removed"]` | Only rows present on the target side but missing on the reference are flagged. | +| `["changed"]` | Only rows present on both sides whose listed fields differ are flagged. | +| Any combination of the three values | Only the listed statuses are flagged. | +| `[]` (empty list) | Rejected at the API with HTTP 422; at least one status must be selected. | + +Typical use cases: + +- **Reference is intentionally a superset of the target.** The target is a production table whose reference (a staging or canonical source) carries extra QA or onboarding rows that production is not expected to receive. Set `diff_change_types` to `["removed", "changed"]` to suppress `added`-row noise. 
+- **Reference is intentionally a subset of the target.** A reporting view may exclude soft-deleted or out-of-scope rows. Set `diff_change_types` to `["added", "changed"]` to suppress `removed`-row noise. +- **Only the per-field diff matters.** When row presence on both sides is guaranteed by an upstream contract and only value drift is interesting, set `diff_change_types` to `["changed"]`. + +The property is `dataDiff`-only. Sending it on an `isReplicaOf` check is rejected at the API with HTTP 422. The property is editable through `PUT /api/quality-checks/{id}`, so the subset can be tuned without recreating the check. + +## Row Identifiers and Passthrough Fields + +Two optional properties shape what the Comparison Source Records view looks like: + +### Row Identifiers + +A list of fields that form the **compound key** the platform uses to pair target and reference rows. The identifier tuple must exist on both sides. Typical choices are primary keys (`customer_id`, `order_id`) or a composite of business-key columns (`order_id, line_number`). + +When Row Identifiers are set: + +- Matched rows produce a `changed` diff if any listed field differs. +- Unmatched target rows produce `removed`. +- Unmatched reference rows produce `added`. + +### Passthrough Fields + +Extra fields the platform should **carry into the source-records output** for context, even though they are not part of the comparison. Passthrough Fields appear in the Comparison Source Records view alongside the diffed fields but are never themselves a reason for the anomaly to fire. Typical use: showing `customer_name` or `created_at` next to the differing column so anomaly triagers can identify the row without leaving the page. + +## The Filter Clause + +The filter clause is a Spark SQL `WHERE` expression applied to the **target** container before matching. The expression is always evaluated as Spark SQL on the dataplane, even when the underlying datastore (Snowflake, Postgres, etc.) 
uses a different SQL dialect. It serves two purposes: + +1. **Scoping the comparison.** Restrict the comparison to a subset of target rows (`status = 'active'`, `event_date = current_date()`, `tenant_id = 42`). Rows outside the scope cannot be reported as `removed` and cannot pair with reference rows. +2. **Avoiding noise from known divergences.** Filter out rows that are intentionally allowed to differ between target and reference (for example, a `staging_only = true` flag), so the check focuses on the rows that must match. + +The filter is part of the check definition, so the anomaly message includes the filter expression (`[filter: ]`) when one is set, making it explicit which slice of the target was evaluated when the anomaly fired. + +!!! note "The filter does not scope the reference" + The filter only narrows the target side. The reference container is read in full. If you also need to narrow the reference (for example, comparing only this month's records on both sides), point the check at a view in the reference datastore that already encodes the same scope. + +## Comparators + +Comparators apply a per-field tolerance to the equality check between left and right values. Without a Comparator, the platform compares values strictly: `1.00` and `1.000001` differ, `"Australia"` and `"australia"` differ. + +{% + include-markdown "components/comparators/index.md" +%} +{% + include-markdown "components/comparators/numeric.md" +%} +{% + include-markdown "components/comparators/duration.md" +%} +{% + include-markdown "components/comparators/string.md" +%} + +## The Resulting Shape Anomaly + +When the Data Diff check fires, it produces a single **Shape Anomaly** describing the dataset-level violation, with the per-row detail attached as Comparison Source Records. The check does **not** produce Record Anomalies; the diff is a property of the target as a whole. 
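
The single-anomaly behaviour can be pictured as a fold over the per-row diffs. The Python sketch below is only an illustration under stated assumptions: `summarize_diffs`, the dictionary keys, and the choice to return `None` when no row is flagged are hypothetical and do not describe the platform's internal data model. The status names and the rejection of an empty `diff_change_types` list follow this page.

```python
from typing import Optional

ALL_STATUSES = ("added", "removed", "changed")


def summarize_diffs(diff_rows: list, diff_change_types: Optional[list] = None):
    """Fold per-row diffs into one summary, or None when nothing is flagged."""
    # Absent property: all three statuses are allowed.
    allowed = set(ALL_STATUSES) if diff_change_types is None else set(diff_change_types)
    if not allowed:
        raise ValueError("at least one status must be selected")
    if not allowed <= set(ALL_STATUSES):
        raise ValueError("unknown status in diff_change_types")

    # Every row is still computed; only allowed statuses are counted.
    flagged = [row for row in diff_rows if row["status"] in allowed]
    if not flagged:
        return None  # assumption: no flagged rows means no anomaly
    return {
        "anomaly_type": "shape",          # one dataset-level anomaly
        "differing_rows": len(flagged),   # N in the anomaly message
        "source_records": flagged,        # per-row detail
    }


rows = [
    {"id": "C-1", "status": "changed"},
    {"id": "C-2", "status": "added"},
    {"id": "C-3", "status": "removed"},
]
summary_all = summarize_diffs(rows)
summary_scoped = summarize_diffs(rows, ["removed", "changed"])
print(summary_all["differing_rows"], summary_scoped["differing_rows"])
```

However many rows differ, the result is one summary object with the rows attached, which mirrors the one Shape Anomaly plus Comparison Source Records structure described above.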
+ +### Anomaly message format + +``` +There are N records that differ between (R records) and (T records) in +``` + +When a filter is set, the message is followed by `[filter: ]`. + +### What the numbers mean + +- **N:** the number of differing rows (`added` + `removed` + `changed`). +- **R:** total row count in the reference container after read. +- **T:** total row count in the target container after the filter is applied. + +### Comparison Source Records view + +When Row Identifiers are configured, the per-row detail is rendered in the **Comparison Source Records** view rather than the standard Source Records list. The columns, in order: + +1. **Row Status**: one of `added`, `removed`, `changed`. +2. **Row Identifier**: the identifier value (or tuple) that pairs the target and reference rows. +3. **For each listed field**: a parent column spanning two sub-columns, `Left` (target) and `Right` (reference). + +Only the **right-side cell** of the differing field is highlighted; the left side and the row identifier are not. When a row is `removed`, the right-side cell shows the literal text `missing`; when a row is `added`, the left-side cell shows `missing`. + +See [Comparison Source Records](../../anomalies/details/source-record.md#comparison-source-records){:target="_blank"} for the full UI reference. + +## Relationship with Other Rule Types + +Data Diff is the only rule type that performs a two-table row-by-row comparison. A few rule types overlap with parts of its job and are worth pairing or substituting depending on the situation: + +| Rule Type | When to use it instead of (or alongside) Data Diff | +|:---|:---| +| [Is Replica Of](../is-replica-of-check.md){:target="_blank"} | Deprecated. Use Data Diff for any new check; Is Replica Of is preserved only for existing checks. 
| +| [Volumetric Checks](../volumetric-check.md){:target="_blank"} | Use when you only need to confirm that the **row count** in the target matches the reference, without comparing each field. Cheaper to evaluate than Data Diff on wide tables. | +| [Aggregation Comparison](../aggregation-comparison-check.md){:target="_blank"} | Use when you want to compare a **summary statistic** between the two datasets (sum, average) rather than per-row values. Useful for sanity-checking large fact tables where row-level diff is too expensive. | +| [Exists In](../exists-in-check.md){:target="_blank"} / [Not Exists In](../not-exists-in-check.md){:target="_blank"} | Use when you only need referential-integrity semantics (every target row's identifier exists in the reference) rather than full value comparison. | +| [Equal To Field](../equal-to-field-check.md){:target="_blank"} | Use when the two sides being compared are **two fields on the same row** (target and reference are the same container), not two separate containers. | + +## Performance Considerations + +Data Diff is the most expensive comparison rule type, because it reads the reference container in full, reads every target row that passes the filter, and joins the two on the Row Identifier tuple (or performs a symmetrical set difference when no identifiers are set). Two practical implications: + +- **Choose Row Identifiers that are present on both sides and as narrow as possible.** Pairing on a single integer identifier is the cheapest case; pairing on a wide composite (4+ string fields) costs noticeably more. +- **Filter the target side to the smallest meaningful slice.** A filter that narrows the target to the current day or current tenant shrinks the target side of the comparison (the reference is still read in full) and is a common pattern when the comparison only matters within a scope. + +When per-field diff isn't needed, a Volumetric Check or an Aggregation Comparison can answer "did anything diverge?" at a fraction of the cost. 
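
As a rough illustration of why the cheaper alternatives cost less, the Python sketch below computes two dataset-level signals (row count and a column sum) without pairing any rows. It is not how Volumetric or Aggregation Comparison checks are implemented; `cheap_signals` and the sample rows are hypothetical.

```python
def cheap_signals(target: list, reference: list, numeric_field: str) -> dict:
    """Dataset-level divergence signals that need no row matching."""
    target_sum = sum(row[numeric_field] for row in target)
    reference_sum = sum(row[numeric_field] for row in reference)
    return {
        "row_count_differs": len(target) != len(reference),
        "sum_differs": target_sum != reference_sum,
    }


target = [
    {"order_id": 1, "amount_cents": 1000},
    {"order_id": 2, "amount_cents": 2000},
]
reference = [
    {"order_id": 1, "amount_cents": 1000},
    {"order_id": 2, "amount_cents": 2500},
]

signals = cheap_signals(target, reference, "amount_cents")
print(signals)
```

Both signals are single passes over each side, but they can miss compensating changes (two rows whose differences cancel out in the sum), which is the trade-off against a full per-row diff.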
+ +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, field scope, and general/anomaly properties. +- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data and resulting anomalies. +- [API](api.md){:target="_blank"}: payload shape and field notes for creating a Data Diff check programmatically. +- [FAQ](faq.md){:target="_blank"}: short answers to the most frequent questions. diff --git a/docs/data-quality-checks/data-diff/how-to-create.md b/docs/data-quality-checks/data-diff/how-to-create.md new file mode 100644 index 0000000000..57c43d2c05 --- /dev/null +++ b/docs/data-quality-checks/data-diff/how-to-create.md @@ -0,0 +1,91 @@ + + +# :material-plus-circle:{ .middle style="color: var(--q-brick)" } How to Create a Data Diff Check + +Configure a Data Diff check inside the **Authored Check Details** modal. For the navigation steps that get you to the modal (selecting the datastore, opening the Checks tab, clicking **Add :material-plus:** → **Check**), see [Authored Check](../authored-check.md){:target="_blank"}. + +!!! tip "Prerequisites" + - At least the **Author** team permission on the target datastore. + - A target container (table or file) already loaded in the target datastore. + - A reference datastore connected, with a reference container that holds the same comparable fields as the target. + +## Configure the Data Diff Check + +**Step 1:** Set the **Rule Type** to **Data Diff**. + +**Step 2:** Select the **File** (or table) the check should run against. This is the **target** container. + +**Step 3:** Select the **Fields** to compare. These are the columns the platform reads on both sides and compares per row. + +**Step 4:** Select the **Reference Datastore** that holds the comparison data. The reference datastore must already be connected. + +**Step 5:** Select the **Reference Table/File** in the reference datastore. + +**Step 6:** *(Recommended)* Set the **Row Identifiers**. 
+ +Pick one or more fields that form the compound key the platform should use to pair target and reference rows. Without Row Identifiers, the check produces only `added`/`removed` diffs; with Row Identifiers, you also get per-row `changed` diffs and the side-by-side Comparison Source Records view. + +**Step 7:** *(Optional)* Add **Passthrough Fields**. + +Passthrough Fields are extra columns carried into the anomaly's source records for context (for example, a `customer_name` next to the compared columns). They are not compared and never cause the anomaly to fire. + +**Step 8:** *(Optional)* Configure **Comparators**. + +Comparators apply per-type tolerances to the equality check: + +- **Numeric Comparator** — tolerance for fractional/integral fields (for example, allow a 0.01 difference on `amount`). +- **Duration Comparator** — tolerance for date/timestamp fields (for example, allow a 1-minute drift on `event_time`). +- **String Comparator** — tolerance for string fields (case-insensitive, whitespace-trimmed, etc.). + +Without a Comparator, the comparison is strict equality. + +**Step 9:** *(Optional)* Set a **Filter Clause** to scope the **target** container to a subset of rows. + +The filter is a Spark SQL `WHERE` expression evaluated **before** matching. Use it to: + +- Limit the comparison to today's writes (`created_at = current_date()`) during a migration window. +- Exclude rows that are intentionally allowed to differ (`staging_only = false`). + +The filter does **not** scope the reference container; the reference is always read in full. + +**Step 10:** *(Optional)* Adjust **Coverage**. + +The default is **100%**, meaning the target must match the reference exactly on the listed fields. Lower it only when a small known fraction of differences is expected and tolerated. + +**Step 11:** *(Optional)* Add a **Description**, **Tags**, and **Additional Metadata** for catalog and triage purposes. These do not affect evaluation. 
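
For readers who prefer to see the steps above as data, here is a hedged Python sketch that assembles the same choices into a request body. The payload keys (`rule`, `fields`, `container_id`, `filter`, `properties.ref_datastore_id`, `properties.ref_container_id`, `properties.id_field_names`, `properties.diff_change_types`, `tags`, `status`) are taken from the example payload on the Examples page; the helper `build_data_diff_payload` itself is hypothetical. Comparators, Passthrough Fields, and Coverage are left out because their payload keys are not shown in this section.

```python
import json

VALID_STATUSES = {"added", "removed", "changed"}


def build_data_diff_payload(
    container_id,
    fields,
    ref_datastore_id,
    ref_container_id,
    row_identifiers=None,
    filter_clause=None,
    diff_change_types=None,
    description="",
    tags=None,
) -> dict:
    """Assemble a Data Diff payload from the modal's choices (sketch only)."""
    properties = {
        "ref_datastore_id": ref_datastore_id,   # Step 4
        "ref_container_id": ref_container_id,   # Step 5
    }
    if row_identifiers:                          # Step 6 (recommended)
        properties["id_field_names"] = list(row_identifiers)
    if diff_change_types is not None:
        chosen = set(diff_change_types)
        if not chosen or not chosen <= VALID_STATUSES:
            raise ValueError("diff_change_types must be a non-empty subset of the three statuses")
        properties["diff_change_types"] = list(diff_change_types)

    payload = {
        "description": description,              # Step 11
        "rule": "dataDiff",                      # Step 1
        "fields": list(fields),                  # Step 3
        "container_id": container_id,            # Step 2 (target)
        "properties": properties,
        "tags": list(tags or []),
        "status": "Active",
    }
    if filter_clause:                            # Step 9 (target only)
        payload["filter"] = filter_clause
    return payload


payload = build_data_diff_payload(
    container_id=318,
    fields=["customer_name", "email", "tier"],
    ref_datastore_id=17,
    ref_container_id=661,
    row_identifiers=["customer_id"],
    filter_clause="created_at = current_date()",
    diff_change_types=["removed", "changed"],
    description="Today's new customers must mirror the legacy CRM",
    tags=["migration", "customers"],
)
print(json.dumps(payload, indent=2))
```

The IDs reuse the values from the Scoped Post-Migration Mirror example; in practice they come from your own datastores and containers.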
+ +## Validate and Save + +**Step 12:** Click **Validate** at the bottom of the modal. + +The platform runs the check against the data without saving. A green confirmation message appears when the rule is well-formed. + +**Step 13:** Click **Save** to create the check. + +The new Data Diff check appears in the Checks list with the Authored badge. The next Scan operation will evaluate it. + +!!! tip "Preview the results before saving" + For a richer preview that lists exactly which rows would be flagged (with row statuses and per-field diffs), use [Dry Run](../../datastore-checks/dry-run.md){:target="_blank"} from the check's actions menu. + +## Common Variations + +The table below summarizes the most common Data Diff configurations. For end-to-end worked scenarios with sample data and resulting anomalies, see the [Examples](examples.md){:target="_blank"} page. + +| Goal | Row Identifiers | Filter | Comparators | +|:---|:---|:---|:---| +| Validate a nightly backup is complete | `order_id` | *(none)* | *(none)* | +| Compare a system-to-system transfer per row | `appointment_id` | *(none)* | *(none)* | +| Verify only today's writes during a migration cutover | `customer_id` | `created_at = current_date()` | *(none)* | +| Tolerate small numeric rounding between two systems | `transaction_id` | *(none)* | Numeric ±0.01 | +| Compare strings case-insensitively | `country_code` | *(none)* | String (ignore case) | + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, field scope, and general/anomaly properties. +- [How It Works](how-it-works.md){:target="_blank"}: full semantics, Row Identifiers, Comparators, and edge cases. +- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data and resulting anomalies. +- [API](api.md){:target="_blank"}: payload shape for creating a Data Diff check programmatically. 
diff --git a/docs/data-quality-checks/data-diff/introduction.md b/docs/data-quality-checks/data-diff/introduction.md new file mode 100644 index 0000000000..20df147166 --- /dev/null +++ b/docs/data-quality-checks/data-diff/introduction.md @@ -0,0 +1,105 @@ +# Data Diff Check + +## Definition + +*Asserts that two datasets match on a chosen set of fields. The check compares a **target** container (the dataset the check is attached to) against a **reference** container in the same or another datastore, and reports every row that is added, removed, or changed between the two sides.* + +!!! info "Recommended Check" + Qualytics recommends using the **Data Diff** rule (`dataDiff`) instead of the deprecated **Is Replica Of** rule (`isReplicaOf`). + + Both rules share the same row-by-row comparison engine and the same configuration properties (Row Identifiers, Passthrough Fields, and per-type Comparators). The differences: + + - Only Data Diff is actively maintained. + - Only Data Diff supports the `diff_change_types` property, which restricts anomalies to a chosen subset of statuses (`added`, `removed`, `changed`). See [How It Works](how-it-works.md#restricting-anomalies-by-status) for details. + +## Overview + +Data Diff is a two-table comparison rule. Use it whenever you need to confirm that one dataset is an exact copy, or a controlled copy, of another: + +- Validating that a replica, backup, or warehouse mirror matches its source of truth. +- Comparing pre- and post-migration data after a system move. +- Verifying that a derived table, such as an aggregate, snapshot, or reporting view, still agrees with the upstream system. +- Confirming that an exported file delivered to a partner contains the same rows as the system of record. 
+ +Beyond the standard properties, a Data Diff check has three configuration inputs: the **reference datastore and container** to compare against, an optional list of **Row Identifiers** (the key the platform uses to match each target row to its reference row), and an optional set of **Comparators** (per-field tolerances for numeric, duration, and string fields). + +When Row Identifiers are set, anomalies appear in the **Comparison Source Records** view, which shows each differing row side by side as `Left` (target) vs `Right` (reference) values. + +## Field Scope + +**Multiple:** The check evaluates one or more fields by comparing them between target and reference. + +**Accepted Types** + +| Type | Supported | +|-------------|:-------------------------:| +| `Date` | :material-check-circle:{ style="color: #4caf50" } | +| `Timestamp` | :material-check-circle:{ style="color: #4caf50" } | +| `Integral` | :material-check-circle:{ style="color: #4caf50" } | +| `Fractional` | :material-check-circle:{ style="color: #4caf50" } | +| `String` | :material-check-circle:{ style="color: #4caf50" } | +| `Boolean` | :material-check-circle:{ style="color: #4caf50" } | + +## General Properties + +{% + include-markdown "components/general-props/index.md" + start='' + end='' +%} + +## Anomaly Types + +{% + include-markdown "components/anomaly-support/index.md" + start='' + end='' +%} + +## Next Steps + +
+ +- :material-information-outline:{ .lg .middle } **How It Works** + + --- + + Full semantics: evaluation flow, Row Identifiers, Comparators, the three diff statuses (`added`, `removed`, `changed`), filter behavior, and how Data Diff relates to other rule types. + + [:octicons-arrow-right-24: How It Works](how-it-works.md) + + + +- :material-clipboard-text-outline:{ .lg .middle } **Examples** + + --- + + Three production scenarios with sample data, anomaly messages, and the SQL equivalent of what the check evaluates. + + [:octicons-arrow-right-24: Examples](examples.md) + +- :material-api:{ .lg .middle } **API** + + --- + + Payload shape and field notes for creating a Data Diff check programmatically. + + [:octicons-arrow-right-24: API](api.md) + +- :material-help-circle-outline:{ .lg .middle } **FAQ** + + --- + + Short answers to questions about Row Identifiers, Comparators, missing values, and anomaly reporting. + + [:octicons-arrow-right-24: FAQ](faq.md) + +
diff --git a/docs/data-quality-checks/is-replica-of-check.md b/docs/data-quality-checks/is-replica-of-check.md index 3853283ebf..f74da21e02 100644 --- a/docs/data-quality-checks/is-replica-of-check.md +++ b/docs/data-quality-checks/is-replica-of-check.md @@ -2,11 +2,11 @@ !!! warning "Deprecation Notice" The `isReplicaOf` check is being deprecated and will no longer be maintained. - We strongly recommend using the [Data Diff](data-diff-check.md) check, which offers the same functionality with improved performance and additional features. + We strongly recommend using the [Data Diff](data-diff/introduction.md) check, which offers the same functionality with improved performance and additional features. **Our recommendation:** - - Consider using [`Data Diff`](data-diff-check.md) for new implementations + - Consider using [`Data Diff`](data-diff/introduction.md) for new implementations - `dataDiff` provides enhanced performance and additional capabilities - Both checks will continue to coexist in the system diff --git a/docs/data-quality-checks/overview-of-a-check.md b/docs/data-quality-checks/overview-of-a-check.md index faf8627146..3f16bad6b5 100644 --- a/docs/data-quality-checks/overview-of-a-check.md +++ b/docs/data-quality-checks/overview-of-a-check.md @@ -166,7 +166,7 @@ For more details about check rule types, please refer to the [**Rule Types Overv | [Contains Email](../data-quality-checks/contains-email-check.md) | Asserts that the values contain email addresses. | | [Contains Social Security Number](../data-quality-checks/contains-social-security-number-check.md) | Asserts that the values contain social security numbers. | | [Contains Url](../data-quality-checks/contains-url.md) | Asserts that the values contain valid URLs. | -| [Data Diff](../data-quality-checks/data-diff-check.md) | Asserts that the dataset created by the targeted field(s) has differences compared to the referred field(s). 
| +| [Data Diff](../data-quality-checks/data-diff/introduction.md) | Asserts that the dataset created by the targeted field(s) has differences compared to the referred field(s). | | [Distinct Count](../data-quality-checks/distinct-count-check.md) | Asserts on the approximate count distinct of the given column. | | [Entity Resolution](../data-quality-checks/entity-resolution.md) | Asserts that every distinct entity is appropriately represented once and only once. | | [Equal To](../data-quality-checks/equal-to-check.md) | Asserts that all of the selected fields equal a value. | diff --git a/docs/data-quality-checks/rule-types-overview.md b/docs/data-quality-checks/rule-types-overview.md index 7f08905007..d30fc7d51e 100644 --- a/docs/data-quality-checks/rule-types-overview.md +++ b/docs/data-quality-checks/rule-types-overview.md @@ -18,7 +18,7 @@ Here’s an overview of the rule types and their purposes: | [Contains Email](../data-quality-checks/contains-email-check.md) | Asserts that the values contain email addresses. | | [Contains Social Security Number](../data-quality-checks/contains-social-security-number-check.md) | Asserts that the values contain social security numbers. | | [Contains Url](../data-quality-checks/contains-url.md) | Asserts that the values contain valid URLs. | -| [Data Diff](../data-quality-checks/data-diff-check.md) | Asserts that the dataset created by the targeted field(s) has differences compared to the referred field(s). | +| [Data Diff](../data-quality-checks/data-diff/introduction.md) | Asserts that the dataset created by the targeted field(s) has differences compared to the referred field(s). | | [Distinct Count](../data-quality-checks/distinct-count-check.md) | Asserts on the approximate count distinct of the given column. 
| | [Entity Resolution](../data-quality-checks/entity-resolution.md) | Asserts that every distinct entity is appropriately represented once and only once | | [Equal To Field](../data-quality-checks/equal-to-field-check.md) | Asserts that this field is equal to another field. | diff --git a/docs/operations/profile/profile.md b/docs/operations/profile/profile.md index e7cb91d550..2ae8c6a6a1 100644 --- a/docs/operations/profile/profile.md +++ b/docs/operations/profile/profile.md @@ -227,7 +227,7 @@ The following table shows the additional AI Managed checks generated at this lev | AI Managed Checks | Reference | |-------|-------| -| Data Diff | [See more.](../../data-quality-checks/data-diff-check.md) | +| Data Diff | [See more.](../../data-quality-checks/data-diff/introduction.md) | | Exists In | [See more.](https://userguide.qualytics.io/checks/exists-in-check/) | | Predicted By | [See more.](https://userguide.qualytics.io/checks/predicted-by-check/) | diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index acec27b991..83c7b8f776 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -955,6 +955,19 @@ color: var(--q-brick); } +/* Mirrors the source-records anomalous-cell treatment in the Qualytics app: + orange outline + warning-tinted background on cells whose value failed a check. + Use inside markdown table cells to mark a single value as anomalous. 
*/ +.anomalous-cell { + display: inline-block; + padding: 0.05rem 0.4rem; + border: 1px solid var(--q-orange); + border-radius: 4px; + background-color: rgba(249, 103, 25, 0.12); + color: var(--q-brick); + font-weight: 500; +} + .text-sm { font-size: 0.7rem; } diff --git a/mkdocs.yml b/mkdocs.yml index 649473ca96..721fc4735d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -312,7 +312,14 @@ nav: - Contains Email: data-quality-checks/contains-email-check.md - Contains Social Security Number: data-quality-checks/contains-social-security-number-check.md - Contains Url: data-quality-checks/contains-url.md - - Data Diff: data-quality-checks/data-diff-check.md + - Data Diff: + - Introduction: data-quality-checks/data-diff/introduction.md + - How It Works: data-quality-checks/data-diff/how-it-works.md + # TODO: re-enable once how-to-create screenshots are added back + # - How to Create: data-quality-checks/data-diff/how-to-create.md + - Examples: data-quality-checks/data-diff/examples.md + - API: data-quality-checks/data-diff/api.md + - FAQ: data-quality-checks/data-diff/faq.md - Distinct Count: data-quality-checks/distinct-count-check.md - Entity Resolution: data-quality-checks/entity-resolution.md - Equal to: data-quality-checks/equal-to-check.md @@ -1160,7 +1167,8 @@ plugins: 'checks/contains-email-check.md': 'data-quality-checks/contains-email-check.md' 'checks/contains-social-security-number-check.md': 'data-quality-checks/contains-social-security-number-check.md' 'checks/contains-url.md': 'data-quality-checks/contains-url.md' - 'checks/data-diff-check.md': 'data-quality-checks/data-diff-check.md' + 'checks/data-diff-check.md': 'data-quality-checks/data-diff/introduction.md' + 'data-quality-checks/data-diff-check.md': 'data-quality-checks/data-diff/introduction.md' 'checks/distinct-count-check.md': 'data-quality-checks/distinct-count-check.md' 'checks/entity-resolution.md': 'data-quality-checks/entity-resolution.md' 'checks/equal-to-check.md': 
'data-quality-checks/equal-to-check.md' From edfb9173b698d43bcfa9d307a60d3eeff3ab9d22 Mon Sep 17 00:00:00 2001 From: Rafael Riki Ogawa Osiro Date: Sat, 20 Jun 2026 03:03:33 -0300 Subject: [PATCH 2/2] docs(data-diff): open how-it-works cross-link in a new tab The cross-page link inside the 'Recommended Check' admonition was the only cross-page reference on the introduction page without the {:target="_blank"} attribute. Aligns with the four Next Steps cards on the same page and the cross-references on api.md, faq.md, and how-it-works.md. --- docs/data-quality-checks/data-diff/introduction.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data-quality-checks/data-diff/introduction.md b/docs/data-quality-checks/data-diff/introduction.md index 20df147166..dfa88cf40a 100644 --- a/docs/data-quality-checks/data-diff/introduction.md +++ b/docs/data-quality-checks/data-diff/introduction.md @@ -10,7 +10,7 @@ Both rules share the same row-by-row comparison engine and the same configuration properties (Row Identifiers, Passthrough Fields, and per-type Comparators). The differences: - Only Data Diff is actively maintained. - - Only Data Diff supports the `diff_change_types` property, which restricts anomalies to a chosen subset of statuses (`added`, `removed`, `changed`). See [How It Works](how-it-works.md#restricting-anomalies-by-status) for details. + - Only Data Diff supports the `diff_change_types` property, which restricts anomalies to a chosen subset of statuses (`added`, `removed`, `changed`). See [How It Works](how-it-works.md#restricting-anomalies-by-status){:target="_blank"} for details. ## Overview