diff --git a/docs/data-quality-checks/entity-resolution.md b/docs/data-quality-checks/entity-resolution.md deleted file mode 100644 index 2c0373aa42..0000000000 --- a/docs/data-quality-checks/entity-resolution.md +++ /dev/null @@ -1,103 +0,0 @@ -# Entity Resolution - -### Definition - -*Asserts that every distinct entity is appropriately represented once and only once* - -### In-Depth Overview - -This check performs automated entity name clustering to identify entities with similar names that likely represent -the same entity. It then assigns each cluster a unique entity identifier and asserts that every row with the same -entity identifier shares the same value for the designated `distinction field` - -### Field Scope - -**Single:** The rule evaluates a single specified field. - -**Accepted Types** - -| Type | Supported | -|----------|:--------------------------------------------------------:| -| `String` |
:material-check-circle:{ style="color: #4caf50" }
| - -### General Properties - -{% -include-markdown "components/general-props/index.md" -start='' -end='' -%} - -### Specific Properties - -| Name | Description | -|-----------------------------------------------------|-----------------------------------------------------------------------------| -|
Distinction Field
| The field that must hold a distinct value for every distinct entity | -|
Pair Substrings
| Considers entities a match if one entity is part of the other | -|
Pair Homophones
| Considers entities a match if they sound alike, even if spelled differently | -|
Spelling Similarity
| The minimum similarity required for clustering two entity names | - - -### Anomaly Types - -{% -include-markdown "components/anomaly-support/index.md" -start='' -end='' -%} - -### Example - -**Objective**: *If you have a `businesses` table with an `id` field and a `name` field, this check can be configured to -resolve `name` and use `id` as the `distinction field`. During each scan, similar names will be grouped and assigned the -same `entity identifier` and any rows that share the same `entity identifier` but have different values for `id` will be -identified as anomalies.* - -**Sample Data** - -| BUSINESS_ID | BUSINESS_NAME | -|-------------|-----------------| -| 1 | ACME Boxing | -| 2 | Frank's Flowers | -| 3 | ACME Boxes | - - -=== "Payload example" - ``` json - { - "description": "Ensure a `businesses` table with an `BUSINESS_ID` field and a `BUSINESS_NAME` field shares the same `entity identifier`", - "coverage": 1, - "properties": { - "distinct_field_name":"BUSINESS_ID", - "pair_substrings":true, - "pair_homophones":true, - "spelling_similarity_threshold":0.6 - }, - "tags": [], - "fields": ["BUSINESS_NAME"], - "additional_metadata": {"key 1": "value 1", "key 2": "value 2"}, - "rule": "entityResolution", - "container_id": {container_id}, - "template_id": {template_id}, - "filter": "1=1" - } - ``` - -**Anomaly Explanation** - -In the sample data above, the entries with `BUSINESS_ID` **1** and **3** do not satisfy the rule because their `BUSINESS_NAME` -values will be marked as similar yet they do not share the same `BUSINESS_ID` - -=== "Flowchart" -```mermaid -graph TD -A[Start] --> B[Retrieve Original Data] -B --> C{Which entities are similar?} -C --> D[Assign each record an entity identifier] -D --> E[Cluster records by entity identifier] -E --> F{Do records with same
entity identifier share the
same distinction field value?} -F -->|Yes| I[End] -F -->|No| H[Mark as Anomalous] -H --> I -``` - diff --git a/docs/data-quality-checks/entity-resolution/api.md b/docs/data-quality-checks/entity-resolution/api.md new file mode 100644 index 0000000000..d59924f5dc --- /dev/null +++ b/docs/data-quality-checks/entity-resolution/api.md @@ -0,0 +1,161 @@ +# :material-api:{ .middle style="color: var(--q-brick)" } Entity Resolution API + +The Entity Resolution check is created and managed through the standard Quality Checks API by setting `rule` to `entityResolution`. The check is multi-field: rather than listing fields under `fields`, you list one entry per evaluated field under `properties.target_fields` and pick the **distinction field** under `properties.distinct_field_name`. The `fields` array on the check itself is auto-populated from `target_fields` and can be sent as an empty list. + +!!! tip + For complete API documentation, including request and response schemas, visit the [API docs](https://demo.qualytics.io/api/docs){:target="_blank"}. + +## Endpoints + +| Method | Path | Purpose | +|:---|:---|:---| +| `POST` | `/api/quality-checks` | Create a new Entity Resolution check. | +| `GET` | `/api/quality-checks/{id}` | Retrieve an Entity Resolution check by ID. | +| `PUT` | `/api/quality-checks/{id}` | Update an existing Entity Resolution check. | +| `DELETE` | `/api/quality-checks/{id}` | Delete (or archive) an Entity Resolution check. | + +**Permission**: Author (or above) on the target container's team for `POST`, `PUT`, and `DELETE`; Reporter (or above) for `GET`. 
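As a rough illustration of the `POST` endpoint above, the following Python sketch builds (but does not send) a create request. The host is taken from the API docs link on this page, and the `Authorization: Bearer` header is an assumption about how your deployment authenticates; the helper name `build_create_request` and the token value are hypothetical.

```python
import json
import urllib.request

# Assumption: host copied from the API docs link; replace with your deployment URL.
BASE_URL = "https://demo.qualytics.io"


def build_create_request(payload, token):
    """Build a POST /api/quality-checks request without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/quality-checks",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; check your deployment's API docs.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Minimal single-field payload using only keys documented on this page.
payload = {
    "description": "Customers with similar names must share a customer_id",
    "rule": "entityResolution",
    "fields": [],  # auto-populated from properties.target_fields
    "container_id": 145,
    "properties": {
        "distinct_field_name": "customer_id",
        "composite_match_threshold": 0.75,
        "target_fields": [
            {
                "upickle_type": "StringTargetField",
                "field_name": "full_name",
                "match_type": "fuzzy",
                "weight": 1.0,
            }
        ],
    },
}

request = build_create_request(payload, token="YOUR_API_TOKEN")
```

Passing `request` to `urllib.request.urlopen` would submit it; that call is left out here so the sketch can run without network access or credentials.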
+ +## Payload Example + +Create a multi-field Entity Resolution check on `full_name` (fuzzy) and `address` (fuzzy), distinguished by `customer_id`, with `POST /api/quality-checks`: + +```json +{ + "description": "Customers with similar names and addresses must share a customer_id", + "rule": "entityResolution", + "fields": [], + "container_id": 145, + "filter": null, + "properties": { + "distinct_field_name": "customer_id", + "composite_match_threshold": 0.75, + "target_fields": [ + { + "upickle_type": "StringTargetField", + "field_name": "full_name", + "match_type": "fuzzy", + "pair_substrings": true, + "pair_homophones": false, + "consider_term_frequency": false, + "weight": 1.0 + }, + { + "upickle_type": "StringTargetField", + "field_name": "address", + "match_type": "fuzzy", + "pair_substrings": false, + "pair_homophones": false, + "consider_term_frequency": false, + "weight": 0.8 + } + ] + }, + "tags": ["pii", "master-data"], + "additional_metadata": {"jira": "DATA-4101"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 12 +} +``` + +## Top-Level Field Notes + +| Field | Required | Notes | +|:---|:---:|:---| +| `description` | Yes | Free-text description shown in the UI. | +| `rule` | Yes | Must be `"entityResolution"`. | +| `fields` | Yes | Send `[]`. The list of evaluated fields is computed from `properties.target_fields`. | +| `container_id` | Yes | ID of the container (table or file) the check runs against. | +| `filter` | No | Spark SQL `WHERE` expression. Applied **before** entity resolution runs, so only filtered rows are clustered. Send `null` for no filter. | +| `properties.distinct_field_name` | Yes | Name of the field that must hold a single value within each resolved entity cluster. Accepted types: `Integral`, `Fractional`, `Boolean`, `String`, `Date`, `Timestamp`. | +| `properties.composite_match_threshold` | Yes | Fractional value between `0.0` and `1.0`. 
Pairs whose weighted composite score is greater than or equal to this value are treated as matches. Default `0.7`. | +| `properties.target_fields` | Yes | Non-empty array. Each entry configures one field with its `match_type`, `weight`, and (for strings) optional substring/homophone/term-frequency knobs. See **Target Field Notes** below. | +| `tags` | No | List of tag names applied to the check for filtering and organization. | +| `additional_metadata` | No | Free-form key-value pairs (typically links to catalog, tickets, governance records). | +| `anomaly_message_field` | No | Name of a source-record field whose value should be used as the anomaly message instead of the system-generated one. **Not applicable to Entity Resolution:** because the rule emits only Shape Anomalies (which use a fixed message template), this field is silently ignored. Send `null`. | +| `template_id` | No | ID of a Check Template to associate the check with. `null` if not using a template. | +| `status` | No | `"Active"` (default) or `"Draft"`. Draft checks are not evaluated by Scans. | +| `owner_id` | No | ID of the user who owns the check. Defaults to the user creating the check when omitted. | +| `default_anomaly_assignee_id` | No | ID of the user automatically assigned to anomalies produced by the check. | + +!!! info "Coverage is not supported" + Entity Resolution does not accept a `coverage` value. The rule evaluates clusters as compliant or non-compliant; there is no fractional tolerance to set. + +## Target Field Notes + +Each entry in `target_fields` is one of three shapes, identified by its `upickle_type` discriminator: `"StringTargetField"`, `"NumericTargetField"`, or `"DateTimeTargetField"`. The platform validates that the declared `upickle_type` matches the actual data type of the field on the container. 
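The rules in the field notes on this page can be checked client-side before a payload is sent. The sketch below is one possible pre-flight validator, not the platform's own validation: the function name `validate_properties` and the error strings are hypothetical, and it checks only the `upickle_type`, `match_type`, `weight`, and threshold constraints documented on this page (it cannot verify that a field's actual data type matches its declared shape).

```python
# Allowed match types per shape, as listed in the field notes on this page.
ALLOWED_MATCH_TYPES = {
    "StringTargetField": {"fuzzy", "exact"},
    "NumericTargetField": {"absolute", "relative", "exact"},
    "DateTimeTargetField": {"offset", "granularity", "exact"},
}
# Documented default match type for each shape.
DEFAULT_MATCH_TYPE = {
    "StringTargetField": "fuzzy",
    "NumericTargetField": "absolute",
    "DateTimeTargetField": "offset",
}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_properties(properties):
    """Return a list of human-readable problems; an empty list means none found."""
    errors = []
    if not properties.get("distinct_field_name"):
        errors.append("distinct_field_name is required")

    threshold = properties.get("composite_match_threshold")
    if not _is_number(threshold) or not 0.0 <= threshold <= 1.0:
        errors.append("composite_match_threshold must be between 0.0 and 1.0")

    targets = properties.get("target_fields")
    if not isinstance(targets, list) or not targets:
        errors.append("target_fields must be a non-empty array")
        return errors

    for index, entry in enumerate(targets):
        shape = entry.get("upickle_type")
        if shape not in ALLOWED_MATCH_TYPES:
            errors.append(f"target_fields[{index}]: unknown upickle_type {shape!r}")
            continue
        if not entry.get("field_name"):
            errors.append(f"target_fields[{index}]: field_name is required")
        match_type = entry.get("match_type", DEFAULT_MATCH_TYPE[shape])
        if match_type not in ALLOWED_MATCH_TYPES[shape]:
            errors.append(
                f"target_fields[{index}]: match_type {match_type!r} is not valid for {shape}"
            )
        weight = entry.get("weight", 1.0)
        if not _is_number(weight) or weight < 0:
            errors.append(f"target_fields[{index}]: weight must be a non-negative number")
    return errors


good = {
    "distinct_field_name": "contact_id",
    "composite_match_threshold": 0.8,
    "target_fields": [
        {"upickle_type": "NumericTargetField", "field_name": "tenant_id", "match_type": "exact"},
        {"upickle_type": "StringTargetField", "field_name": "full_name", "weight": 1.0},
    ],
}
bad = {
    "distinct_field_name": "contact_id",
    "composite_match_threshold": 0.8,
    "target_fields": [
        # "fuzzy" is a string match type, so it is rejected for a numeric shape.
        {"upickle_type": "NumericTargetField", "field_name": "tenant_id", "match_type": "fuzzy"},
    ],
}
```

Calling `validate_properties(good)` returns an empty list, while `validate_properties(bad)` reports the mismatched `match_type`.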
+ +### String Target Field + +```json +{ + "upickle_type": "StringTargetField", + "field_name": "full_name", + "match_type": "fuzzy", + "pair_substrings": true, + "pair_homophones": false, + "consider_term_frequency": false, + "weight": 1.0 +} +``` + +| Field | Required | Notes | +|:---|:---:|:---| +| `upickle_type` | Yes | Must be `"StringTargetField"`. Identifies the shape so the platform can deserialize this entry. | +| `field_name` | Yes | Name of the string field on the container. | +| `match_type` | No | `"fuzzy"` (default) or `"exact"`. `exact` turns the field into a blocking pre-filter: pairs disagreeing on this field are never scored. | +| `pair_substrings` | No | When `true`, a pair where one string contains the other scores `1.0` on this field. Default `false`. Applies only to `fuzzy`. | +| `pair_homophones` | No | When `true`, a pair whose values sound alike (phonetic similarity) scores `1.0` on this field. Default `false`. Applies only to `fuzzy`. | +| `consider_term_frequency` | No | When `true`, rare tokens carry more weight than common tokens. Default `false`. Applies only to `fuzzy`. | +| `weight` | No | Non-negative number. Controls this field's contribution to the composite score. Default `1.0`. Ignored when `match_type` is `exact`. | + +### Numeric Target Field + +```json +{ + "upickle_type": "NumericTargetField", + "field_name": "phone_number", + "match_type": "absolute", + "offset": 0.0, + "weight": 1.0 +} +``` + +| Field | Required | Notes | +|:---|:---:|:---| +| `upickle_type` | Yes | Must be `"NumericTargetField"`. Identifies the shape so the platform can deserialize this entry. | +| `field_name` | Yes | Name of the numeric field (Integral or Fractional) on the container. | +| `match_type` | No | `"absolute"` (default), `"relative"`, or `"exact"`. `"absolute"` compares with a fixed `offset`; `"relative"` compares with a percentage tolerance (e.g. `0.05` for 5%); `"exact"` turns the field into a blocking pre-filter. 
| +| `offset` | No | Non-negative numeric tolerance. With `match_type: "absolute"`, the pair scores `1.0` if `|a − b| ≤ offset`, otherwise `0.0`. With `match_type: "relative"`, the value is interpreted as a fraction (e.g. `0.05` for 5%). Default `0.0`. | +| `weight` | No | Non-negative number controlling contribution to the composite. Default `1.0`. Ignored when `match_type` is `exact`. | + +### Datetime Target Field + +```json +{ + "upickle_type": "DateTimeTargetField", + "field_name": "registered_at", + "match_type": "offset", + "offset_seconds": 3600, + "weight": 1.0 +} +``` + +| Field | Required | Notes | +|:---|:---:|:---| +| `upickle_type` | Yes | Must be `"DateTimeTargetField"`. Identifies the shape so the platform can deserialize this entry. | +| `field_name` | Yes | Name of the Date or Timestamp field on the container. | +| `match_type` | No | `"offset"` (default), `"granularity"`, or `"exact"`. `"offset"` compares within a number of seconds; `"granularity"` compares whether both timestamps fall in the same bucket; `"exact"` turns the field into a blocking pre-filter. | +| `offset_seconds` | No | Non-negative integer tolerance in seconds. Applies when `match_type` is `"offset"`: the pair scores `1.0` if the two timestamps are within `offset_seconds` of each other. Default `0`. | +| `granularity` | No | Bucket applied before comparison. Applies when `match_type` is `"granularity"`. Accepted values: `"Day"`, `"Week"`, `"Month"`, `"Year"`. Omit (or send `null`) when `match_type` is not `"granularity"`. | +| `weight` | No | Non-negative number controlling contribution to the composite. Default `1.0`. Ignored when `match_type` is `exact`. | + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, target field types, field scope, and general/anomaly properties. +- [How It Works](how-it-works.md){:target="_blank"}: full semantics, clustering behavior, threshold tuning, and source-records behavior. 
+- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data, source records, and resulting anomalies. +- [FAQ](faq.md){:target="_blank"}: short answers to the most frequent questions. diff --git a/docs/data-quality-checks/entity-resolution/examples.md b/docs/data-quality-checks/entity-resolution/examples.md new file mode 100644 index 0000000000..c183494473 --- /dev/null +++ b/docs/data-quality-checks/entity-resolution/examples.md @@ -0,0 +1,301 @@ +# Entity Resolution Examples + +Three real-world scenarios that show how the Entity Resolution check is typically used in production: deduplicating a customer master by name and address, consolidating businesses by name with phonetic and substring matching, and matching contacts within a tenant boundary using a blocking field. + +Each scenario shows the **Source Records** that would appear in the resulting Shape Anomaly. Source Records surface one example row per distinct value of the distinction field within each non-compliant cluster, alongside the cluster identifier `_qualytics_entity_id` so the cluster boundaries are visible. + +=== "Customer Master Deduplication" + + **The situation:** Your `customers` table is the master record for downstream billing. Each row has a `customer_id` that should be the single identifier per customer, but historic ingestions from multiple sources have produced near-duplicate records with slightly different spellings of the same person's `full_name` and `address`. You want Entity Resolution to surface customers where two different `customer_id` values plausibly describe the same person. 
+ + **Check configuration** + + | Field | Value | + |:---|:---| + | Rule | Entity Resolution | + | Distinction Field | `customer_id` | + | Target Fields | `full_name` (String, `fuzzy`, `pair_substrings: true`, `weight: 1.0`), `address` (String, `fuzzy`, `weight: 0.8`) | + | Composite Match Threshold | `0.75` | + | Filter | *(none)* | + | Custom Anomaly Description | Off | + | Status | Active | + | Owner | *(check creator)* | + | Anomaly Assignee | *(customer-data steward)* | + | Tags | `pii`, `master-data` | + | Additional Metadata | `jira: DATA-4101` | + | Description | Customers with similar names and addresses must share a customer_id | + + **Payload** + + ```json + { + "description": "Customers with similar names and addresses must share a customer_id", + "rule": "entityResolution", + "fields": [], + "container_id": 145, + "filter": null, + "properties": { + "distinct_field_name": "customer_id", + "composite_match_threshold": 0.75, + "target_fields": [ + { + "upickle_type": "StringTargetField", + "field_name": "full_name", + "match_type": "fuzzy", + "pair_substrings": true, + "pair_homophones": false, + "consider_term_frequency": false, + "weight": 1.0 + }, + { + "upickle_type": "StringTargetField", + "field_name": "address", + "match_type": "fuzzy", + "pair_substrings": false, + "pair_homophones": false, + "consider_term_frequency": false, + "weight": 0.8 + } + ] + }, + "tags": ["pii", "master-data"], + "additional_metadata": {"jira": "DATA-4101"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 12 + } + ``` + + **Source Records** + + | _qualytics_entity_id | customer_id | full_name | address | + |----------------------|-------------|------------------|------------------------| + | ent-a01f | 1001 | Alice Cohen | 142 Maple St | + | ent-a01f | 1057 | Alice C. | 142 Maple Street | + | ent-b73c | 1102 | Catherine Wu | 87 Elm Avenue | + | ent-b73c | 1184 | Catherine Wu | 87 Elm Ave. 
| + + **What gets flagged** + + Two non-compliant clusters appear in the source records: + + - `ent-a01f` resolved two records the platform considers the same customer (`"Alice Cohen, 142 Maple St"` ↔ `"Alice C., 142 Maple Street"`). The fuzzy match on `full_name` reaches the threshold because `pair_substrings` promotes `"Alice C."` against `"Alice Cohen"`, and the address pair is near-identical. The cluster holds two different `customer_id` values (`1001` and `1057`), so it is non-compliant. + - `ent-b73c` resolved two records with identical names and only a punctuation difference in `address`. The cluster holds two different `customer_id` values (`1102` and `1184`), so it is also non-compliant. + + Each non-compliant cluster contributes **one row per distinct `customer_id`** to the Source Records panel, four rows total in this scan. + + !!! example "Shape Anomaly" + 184 records were resolved to 173 distinct entities (composite threshold 0.75: full_name (w=1.0), address (w=0.8)). 2 of those entities are assigned more than one value of customer_id + + **Flowchart** + + ```mermaid + graph TD + A["Filter: none, evaluate all customers"] --> B["Score every candidate pair on full_name + address"] + B --> C{"Composite score ≥ 0.75?"} + C -->|No| D["Pair is not a match"] + C -->|Yes| E["Connect both records in the same cluster"] + E --> F["Assign cluster _qualytics_entity_id"] + F --> G{"Cluster has more than one customer_id?"} + G -->|No| H["Cluster is compliant"] + G -->|Yes| I["Flag cluster. Source Records gets one row per distinct customer_id."] + ``` + +=== "Business Name Consolidation with Homophones" + + **The situation:** Your `businesses` table aggregates business records from three vendor feeds. The same business often appears under variant spellings of `business_name` (`"Catherine's Books"`, `"Katherine's Books"`, `"Catherines Books LLC"`) and each feed assigns its own `business_id`. 
You want to surface businesses where the platform believes the names describe the same entity but `business_id` disagrees. + + **Check configuration** + + | Field | Value | + |:---|:---| + | Rule | Entity Resolution | + | Distinction Field | `business_id` | + | Target Fields | `business_name` (String, `fuzzy`, `pair_substrings: true`, `pair_homophones: true`, `consider_term_frequency: true`, `weight: 1.0`) | + | Composite Match Threshold | `0.7` | + | Filter | *(none)* | + | Custom Anomaly Description | Off | + | Status | Active | + | Owner | *(check creator)* | + | Anomaly Assignee | *(business-master steward)* | + | Tags | `consolidation`, `vendor-feeds` | + | Additional Metadata | `jira: DATA-4207` | + | Description | Similar business names should resolve to the same business_id | + + **Payload** + + ```json + { + "description": "Similar business names should resolve to the same business_id", + "rule": "entityResolution", + "fields": [], + "container_id": 212, + "filter": null, + "properties": { + "distinct_field_name": "business_id", + "composite_match_threshold": 0.7, + "target_fields": [ + { + "upickle_type": "StringTargetField", + "field_name": "business_name", + "match_type": "fuzzy", + "pair_substrings": true, + "pair_homophones": true, + "consider_term_frequency": true, + "weight": 1.0 + } + ] + }, + "tags": ["consolidation", "vendor-feeds"], + "additional_metadata": {"jira": "DATA-4207"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 18 + } + ``` + + **Source Records** + + | _qualytics_entity_id | business_id | business_name | + |----------------------|-------------|-------------------------| + | ent-c4d1 | 5001 | Catherine's Books | + | ent-c4d1 | 5042 | Katherine's Books | + | ent-c4d1 | 5108 | Catherines Books LLC | + | ent-e8f2 | 5314 | ACME Boxing | + | ent-e8f2 | 5331 | ACME Boxes | + + **What gets flagged** + + Two non-compliant clusters appear in the source records: 
+ + - `ent-c4d1` connects three records through pairwise matches: `"Catherine's"` and `"Katherine's"` resolve via the homophone rule, and `"Catherines Books LLC"` resolves to `"Catherine's Books"` via the substring rule. The three records collapse into a single cluster because their matches form a chain. The cluster holds three different `business_id` values (`5001`, `5042`, `5108`), so it is non-compliant and contributes three rows to the Source Records. + - `ent-e8f2` connects two records (`"ACME Boxing"` ↔ `"ACME Boxes"`) where fuzzy text similarity is high enough to clear the threshold. The cluster holds two different `business_id` values (`5314`, `5331`), so it contributes two rows. + + !!! example "Shape Anomaly" + 2,341 records were resolved to 2,294 distinct entities (composite threshold 0.7: business_name (w=1.0+TF)). 2 of those entities are assigned more than one value of business_id + + **Flowchart** + + ```mermaid + graph TD + A["Filter: none, evaluate all businesses"] --> B["Compute pair similarity on business_name
(fuzzy text + substring + phonetic overrides)"] + B --> C{"Composite score ≥ 0.7?"} + C -->|No| D["Pair is not a match"] + C -->|Yes| E["Connect both records in the same cluster"] + E --> F["Connected components collapse transitive chains
(A↔B and B↔C become {A,B,C})"] + F --> G["Assign cluster _qualytics_entity_id"] + G --> H{"Cluster has more than one business_id?"} + H -->|No| I["Cluster is compliant"] + H -->|Yes| J["Flag cluster. Source Records gets one row per distinct business_id."] + ``` + +=== "Tenant-Scoped Resolution with a Blocking Field" + + **The situation:** Your `contacts` table is multi-tenant. The same `email` is allowed to repeat across tenants (different people, different organizations) but never within a single tenant. You want to resolve contacts within each tenant by `full_name` and `email`, and `tenant_id` should act as a hard boundary so cross-tenant collisions never trigger an anomaly. + + **Check configuration** + + | Field | Value | + |:---|:---| + | Rule | Entity Resolution | + | Distinction Field | `contact_id` | + | Target Fields | `tenant_id` (Numeric, `exact`: blocking), `full_name` (String, `fuzzy`, `pair_substrings: true`, `weight: 1.0`), `email` (String, `fuzzy`, `weight: 1.0`) | + | Composite Match Threshold | `0.8` | + | Filter | `status = 'active'` | + | Custom Anomaly Description | Off | + | Status | Active | + | Owner | *(check creator)* | + | Anomaly Assignee | *(ingestion on-call)* | + | Tags | `multi-tenant`, `contacts` | + | Additional Metadata | `jira: DATA-4311` | + | Description | Within a tenant, contacts with similar name and email must share a contact_id | + + **Payload** + + ```json + { + "description": "Within a tenant, contacts with similar name and email must share a contact_id", + "rule": "entityResolution", + "fields": [], + "container_id": 318, + "filter": "status = 'active'", + "properties": { + "distinct_field_name": "contact_id", + "composite_match_threshold": 0.8, + "target_fields": [ + { + "upickle_type": "NumericTargetField", + "field_name": "tenant_id", + "match_type": "exact" + }, + { + "upickle_type": "StringTargetField", + "field_name": "full_name", + "match_type": "fuzzy", + "pair_substrings": true, + "pair_homophones": false, + 
"consider_term_frequency": false, + "weight": 1.0 + }, + { + "upickle_type": "StringTargetField", + "field_name": "email", + "match_type": "fuzzy", + "pair_substrings": false, + "pair_homophones": false, + "consider_term_frequency": false, + "weight": 1.0 + } + ] + }, + "tags": ["multi-tenant", "contacts"], + "additional_metadata": {"jira": "DATA-4311"}, + "anomaly_message_field": null, + "template_id": null, + "status": "Active", + "owner_id": 7, + "default_anomaly_assignee_id": 24 + } + ``` + + **Why the blocking field matters** + + Because `tenant_id` is `match_type: exact`, the platform never compares a contact in tenant `7` against a contact in tenant `12`. Two contacts named `"Jane Doe"` with the same email on different tenants are treated as completely separate entities and never cluster together. Blocking on `tenant_id` is both a correctness guarantee and a performance optimization: candidate pairs are constrained to rows that share the same tenant. + + **Source Records** *(filtered to `status = 'active'`)* + + | _qualytics_entity_id | tenant_id | contact_id | full_name | email | + |----------------------|-----------|------------|---------------|------------------------| + | ent-7a2b | 7 | c-991 | Jane Doe | jane.doe@acme.com | + | ent-7a2b | 7 | c-1042| J. Doe | jane.doe@acme.com | + + The contact `c-2071` (`tenant_id = 12`, `full_name = "Jane Doe"`, `email = "jane.doe@acme.com"`) does **not** appear in the Source Records: it is in a different tenant, so blocking prevents it from being paired with the rows in tenant `7`. It is its own cluster, with its own `_qualytics_entity_id`, and is compliant. + + !!! example "Shape Anomaly" + 4,820 records were resolved to 4,791 distinct entities (blocked on [tenant_id], composite threshold 0.8: full_name (w=1.0), email (w=1.0)). 1 of those entities is assigned more than one value of contact_id + + **Flowchart** + + ```mermaid + graph TD + A["Apply filter: status = 'active'"] --> B["Block pairs by tenant_id
(records in different tenants never compared)"] + B --> C["Score remaining pairs on full_name + email"] + C --> D{"Composite score ≥ 0.8?"} + D -->|No| E["Pair is not a match"] + D -->|Yes| F["Connect both records in the same cluster (per tenant)"] + F --> G["Assign cluster _qualytics_entity_id"] + G --> H{"Cluster has more than one contact_id?"} + H -->|No| I["Cluster is compliant"] + H -->|Yes| J["Flag cluster. Source Records gets one row per distinct contact_id."] + ``` + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, target field types, field scope, and general/anomaly properties. +- [How It Works](how-it-works.md){:target="_blank"}: full semantics, clustering behavior, threshold tuning, and source-records behavior. +- [API](api.md){:target="_blank"}: payload shape and field notes for creating an Entity Resolution check programmatically. +- [FAQ](faq.md){:target="_blank"}: short answers to the most frequent questions. diff --git a/docs/data-quality-checks/entity-resolution/faq.md b/docs/data-quality-checks/entity-resolution/faq.md new file mode 100644 index 0000000000..77169da043 --- /dev/null +++ b/docs/data-quality-checks/entity-resolution/faq.md @@ -0,0 +1,77 @@ +# :material-help-circle-outline:{ .middle style="color: var(--q-brick)" } Entity Resolution FAQ + +Common questions about how the Entity Resolution check clusters records, how target fields combine, and how anomalies are reported. + +## Behavior + +### How does Entity Resolution decide that two records are the same entity? + +The platform scores every candidate pair on each fuzzy target field, combines the scores into a single weighted composite, and treats the pair as a match whenever the composite is greater than or equal to the **composite match threshold**. Exact (blocking) target fields filter candidate pairs before scoring rather than contributing to the score themselves. 
Records connected through any chain of matching pairs end up in the same cluster (so `A ↔ B` and `B ↔ C` produce the cluster `{A, B, C}` even if `A` and `C` do not themselves meet the threshold). + +### What is the difference between a fuzzy field and an exact (blocking) field? + +A fuzzy field contributes to the composite similarity score. An exact field acts as a hard pre-filter: pairs that disagree on an exact field are never even compared. Use exact fields for hard boundaries such as `tenant_id` or `country_code`, where two records on different sides of the boundary should never be treated as the same entity regardless of how similar the rest of their fields look. Exact fields also improve performance because they shrink the set of candidate pairs. + +### What does the composite match threshold control? + +The threshold is the cutoff for treating a pair as a match. A threshold of `0.7` means a pair is a match only if its weighted average similarity is at least 70%. Lowering the threshold widens clusters (more variation tolerated); raising it tightens them (closer to exact matches). The default is `0.7`. The Source Records from the first scan are the best starting point for tuning. + +### How are NULLs treated on target fields? + +For blocking (`exact`) target fields, a record with NULL is excluded from pairing, because NULL never equals NULL for blocking purposes, so the record cannot enter any cluster. For fuzzy target fields, NULL is compared like any other value at scoring time. If you want to exclude records where a blocking field is NULL from the resolution entirely, add an `IS NOT NULL` clause to the filter. + +### Does the filter clause run before or after entity resolution? + +Before. The platform applies the filter first, then runs blocking, scoring, clustering, and the distinction-field check only on the rows that pass the filter.
This lets you scope a check to a meaningful slice (for example, `status = 'active'`) without flagging clusters that exist outside the scope. + +## Anomaly Reporting + +### Which rows appear in the Shape Anomaly's Source Records? + +Only the rows from clusters where the distinction field has more than one distinct value, and within each non-compliant cluster only **one example row per distinct value of the distinction field**. So a cluster where `customer_id` takes three different values contributes three rows to the Source Records (not the full set of records in the cluster, but enough to make every conflicting value visible). + +### What is the `_qualytics_entity_id` column in the Source Records? + +It is the cluster identifier the platform assigned to each record. Records sharing the same `_qualytics_entity_id` are the records the platform thinks describe the same entity. The column is rendered as internal in the UI (it is not a real field on your container), but it appears in the Source Records to make cluster boundaries obvious. + +### What does the Shape Anomaly message look like? + +``` +N records were resolved to D distinct entities (composite threshold T: field_a (w=W), field_b (w=W) ...). K of those entities are assigned more than one value of +``` + +- **N** is the count of distinct records analyzed (post-filter, post-null-filter on blocking fields, de-duplicated). +- **D** is the number of distinct clusters produced. +- **T** is the composite match threshold. +- **K** is the number of clusters where the distinction field has more than one distinct value. + +When blocking fields exist, the message includes `blocked on [...]`. When a fuzzy string field has `consider_term_frequency` on, its summary entry includes `+TF`. + +### Why doesn't Entity Resolution produce a Record Anomaly? + +Entity Resolution is a *shape*-only rule type: the violation is a property of a cluster (multiple records together), not of any single record's value. 
The platform reports the cluster-level violation at the shape level only. + +## Configuration + +### Is coverage supported? + +No. The Entity Resolution form has no Coverage knob and the API does not accept a `coverage` value. A cluster is either compliant (one distinct value of the distinction field) or non-compliant; there is no fractional tolerance. + +### Can I mix fuzzy and exact target fields in the same check? + +Yes, and it is the recommended pattern when the data has a hard boundary such as tenant or country. Mark the boundary fields as `exact` (they become blocking pre-filters) and leave the descriptive fields as `fuzzy` (they contribute to the composite score). Exact fields do not affect the composite score; their job is to constrain which pairs are eligible to be scored at all. + +### Can the same field appear as both a blocking field and a fuzzy field? + +No. Each target field has a single `match_type`. If you need a field to act differently in different scenarios, create separate Entity Resolution checks scoped to those scenarios with a filter. + +### Does Custom Anomaly Description (the `anomaly_message_field` payload field) work for Entity Resolution? + +No. The Custom Anomaly Description toggle only affects Record Anomaly messages. Because Entity Resolution emits only Shape Anomalies, the field is silently ignored, and the resulting anomaly uses the fixed Shape Anomaly template described above. + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, target field types, field scope, and general/anomaly properties. +- [How It Works](how-it-works.md){:target="_blank"}: full semantics, clustering behavior, threshold tuning, and source-records behavior. +- [API](api.md){:target="_blank"}: payload example and field notes for creating an Entity Resolution check programmatically. +- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data, source records, and resulting anomalies. 
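The behavior described in this FAQ (blocking, weighted-average scoring, transitive clustering, and the distinction-field check) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: `difflib.SequenceMatcher` stands in for the unspecified fuzzy text similarity, all function names are hypothetical, weights are assumed to sum to a positive number, and NULL handling, homophones, and term frequency are left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_similarity(a, b, pair_substrings=False):
    """Stand-in fuzzy score in [0, 1]; the platform's real formula is not documented here."""
    a, b = a.lower(), b.lower()
    if pair_substrings and (a in b or b in a):
        return 1.0  # substring override, as described for pair_substrings
    return SequenceMatcher(None, a, b).ratio()


def composite_score(left, right, fuzzy_fields):
    """Weighted average of per-field similarities (assumes positive total weight)."""
    total_weight = sum(f["weight"] for f in fuzzy_fields)
    weighted = sum(
        f["weight"]
        * string_similarity(
            left[f["field_name"]], right[f["field_name"]], f.get("pair_substrings", False)
        )
        for f in fuzzy_fields
    )
    return weighted / total_weight


def resolve(records, fuzzy_fields, blocking_fields, threshold, distinct_field):
    """Return the clusters that hold more than one value of distinct_field."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        # Blocking: pairs that disagree on any exact field are never scored.
        if any(records[i][b] != records[j][b] for b in blocking_fields):
            continue
        if composite_score(records[i], records[j], fuzzy_fields) >= threshold:
            parent[find(i)] = find(j)  # connect the two records transitively

    clusters = {}
    for index, record in enumerate(records):
        clusters.setdefault(find(index), []).append(record)
    return [
        members
        for members in clusters.values()
        if len({m[distinct_field] for m in members}) > 1
    ]


contacts = [
    {"tenant_id": 7, "contact_id": "c-991", "full_name": "Jane Doe", "email": "jane.doe@acme.com"},
    {"tenant_id": 7, "contact_id": "c-1042", "full_name": "J. Doe", "email": "jane.doe@acme.com"},
    {"tenant_id": 12, "contact_id": "c-2071", "full_name": "Jane Doe", "email": "jane.doe@acme.com"},
]
fuzzy = [
    {"field_name": "full_name", "weight": 1.0, "pair_substrings": True},
    {"field_name": "email", "weight": 1.0},
]
flagged = resolve(contacts, fuzzy, ["tenant_id"], 0.8, "contact_id")
```

With `tenant_id` as a blocking field, only the two tenant `7` contacts end up in a flagged cluster; passing an empty list of blocking fields instead lets the tenant `12` record join the same cluster.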
diff --git a/docs/data-quality-checks/entity-resolution/how-it-works.md b/docs/data-quality-checks/entity-resolution/how-it-works.md new file mode 100644 index 0000000000..f135b58c6b --- /dev/null +++ b/docs/data-quality-checks/entity-resolution/how-it-works.md @@ -0,0 +1,153 @@ +# How Entity Resolution Checks Work + +This page covers everything the Entity Resolution check does, in detail: how it clusters records, how exact and fuzzy target fields combine into a composite score, how the threshold decides whether two records are the same entity, how the distinction field is enforced after clustering, and what the resulting Shape Anomaly looks like. + +If you only need a quick reference, the [Introduction](introduction.md){:target="_blank"} page covers the formal definition, field scope, and general/anomaly properties. This page is the detailed reference. + +## How the Check Evaluates Entity Resolution + +Every Entity Resolution check follows the same five-step evaluation flow, regardless of how many target fields you configure: + +1. **Apply the filter clause.** If the check has a `filter` set, only the rows that match the filter expression continue to the next step. Rows that fall outside the filter are ignored and cannot cause a violation. +2. **Pre-filter on exact (blocking) target fields.** If any target field has `match_type` set to `exact`, only records that share the same value on every exact field can ever be paired. Records with different values on an exact field are blocked from comparison. +3. **Score pairs against the fuzzy target fields.** For each remaining candidate pair, the platform computes a similarity score per fuzzy field (fuzzy text similarity for strings, absolute or relative proximity for numerics, offset or granularity bucketing for datetimes), then combines those scores into a single **weighted composite score**. +4. 
**Build clusters by connecting pairs above the threshold.** Every pair whose composite score is greater than or equal to the **composite match threshold** is treated as a match. Matches are grouped transitively: if `A` matches `B` and `B` matches `C`, all three records collapse into one cluster even if `A` and `C` are not directly above the threshold. Each cluster receives a unique `_qualytics_entity_id`. +5. **Enforce the distinction field.** Within each cluster, the platform counts the distinct values of the `distinction_field`. Clusters where that count is greater than 1 are non-compliant: the cluster groups records the platform thinks describe the same entity, but the data assigns them different values of the distinction field. + +The order of operations matters: blocking fields are applied **before** scoring, so records that disagree on an exact field never enter the same cluster, regardless of how similar their fuzzy fields are. + +## Target Fields: The Building Block + +Every target field config has at least three pieces: the field name, the `match_type`, and a `weight` (default `1.0`). The `match_type` determines which similarity formula is used. + +### String Target Fields + +| `match_type` | Behavior | +|:---|:---| +| `fuzzy` *(default)* | Fuzzy text similarity between the two strings. Score ranges from `0.0` (no match) to `1.0` (identical). The optional knobs below can either promote a pair to a score of `1.0` (substring containment, homophone match) or adjust how tokens are weighted (term frequency). | +| `exact` | Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score. | + +Three optional knobs on fuzzy string fields: + +- **`pair_substrings`**: when `true`, if one string is contained in the other, the pair's score on this field is treated as `1.0`. Useful when a name is sometimes recorded with extra qualifiers (`"ACME"` vs `"ACME Inc."`). +- **`pair_homophones`**: when `true`, if both strings sound alike (phonetic similarity), the pair's score on this field is treated as `1.0`. Useful for names that sound the same but are spelled differently (`"Catherine"` vs `"Katherine"`). +- **`consider_term_frequency`**: when `true`, rare tokens are weighted more heavily than common tokens when comparing the two strings. Useful when common words (e.g. "Inc", "Ltd", "Group") dilute the signal of distinctive words. + +### Numeric Target Fields + +| `match_type` | Behavior | +|:---|:---| +| `absolute` *(default)* | The pair scores `1.0` if `|a − b| <= offset`, otherwise `0.0`. Use a small `offset` to tolerate rounding or scale noise. | +| `relative` | The pair scores `1.0` if the relative difference between the two values is within `offset` (interpreted as a fraction, e.g. `0.05` for 5%), otherwise `0.0`. Useful when the same field can take values of very different magnitudes and a fixed delta would not scale. | +| `exact` | Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score. | + +### Datetime Target Fields + +| `match_type` | Behavior | +|:---|:---| +| `offset` *(default)* | The pair scores `1.0` if the two timestamps are within `offset_seconds` of each other, otherwise `0.0`. | +| `granularity` | The pair scores `1.0` if both timestamps fall in the same bucket after truncation to the configured `granularity` (`Day`, `Week`, `Month`, or `Year`), otherwise `0.0`. | +| `exact` | Blocking pre-filter. Pairs that disagree on this field are never scored. Does not contribute to the composite score. | + +### Weights and the Composite Score + +For each candidate pair, the platform computes the per-field score for every **fuzzy** field (exact fields are excluded from scoring because they were already used as a blocking pre-filter), multiplies each by the field's `weight`, sums the weighted scores, and divides by the total weight to produce the **composite score** (a value between `0.0` and `1.0`): + +``` +composite = sum(score_i * weight_i) / sum(weight_i) +``` + +If the composite is greater than or equal to the **composite match threshold**, the pair is treated as a match. Increasing a field's weight increases its influence on whether the pair clears the threshold; decreasing it (down to `0`) shrinks its influence accordingly. + +## The Composite Match Threshold + +The `composite_match_threshold` is a value between `0.0` and `1.0` (default `0.7`): + +- **Lower threshold (e.g. `0.6`):** tolerates more variation. More pairs match, clusters grow larger, more rows risk being grouped incorrectly. +- **Higher threshold (e.g. `0.9`):** requires near-identical entities. Fewer pairs match, clusters stay small, and real-world variations may be missed. + +Tuning the threshold is the most important single decision when configuring this rule. Start at the default, look at the Source Records of any anomalies the first scan produces, and adjust up or down depending on whether the cluster groupings reflect your business definition of "same entity." 
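The weighted composite and the threshold comparison described above can be sketched in a few lines of Python. This is an illustration only, not the platform's implementation: the function names (`composite_score`, `is_match`), the sample field names, and the per-field scores are hypothetical, and the per-field scores are assumed to be already computed as values between `0.0` and `1.0`. The zero-total-weight fallback is also an assumption.

```python
from typing import Mapping


def composite_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-field scores: sum(score_i * weight_i) / sum(weight_i)."""
    total_weight = sum(weights[field] for field in scores)
    if total_weight == 0:
        # Assumption: with no positive weight there is nothing to score, so report 0.0.
        return 0.0
    weighted_sum = sum(scores[field] * weights[field] for field in scores)
    return weighted_sum / total_weight


def is_match(scores: Mapping[str, float], weights: Mapping[str, float],
             threshold: float = 0.7) -> bool:
    """A pair matches when its composite score is >= the composite match threshold."""
    return composite_score(scores, weights) >= threshold


# Hypothetical pair: the names are very similar, the addresses less so.
pair_scores = {"business_name": 0.9, "address": 0.5}
field_weights = {"business_name": 1.0, "address": 0.8}

score = composite_score(pair_scores, field_weights)
print(round(score, 4))
print(is_match(pair_scores, field_weights))       # default threshold of 0.7
print(is_match(pair_scores, field_weights, 0.8))  # stricter threshold
```

In this made-up pair the composite is about `0.72`, so the pair clears the default threshold but not a threshold of `0.8`. Raising the weight of `address` would pull the composite toward the weaker address score and could drop the pair below the default as well.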
+ +## The Filter Clause + +The filter clause is a Spark SQL `WHERE` expression that the platform applies before entity resolution runs. It serves two purposes: + +1. **Scoping the check.** Restrict resolution to a subset of the data (for example, `status = 'active'`, `tenant_id = 42`, or `created_at >= '2026-01-01'`). Rows outside the scope are never paired and cannot trigger an anomaly. +2. **Working around NULL handling on blocking fields.** Records where a blocking (`match_type: exact`) target field is NULL cannot be paired with anything (NULL never equals NULL for blocking purposes). Use the filter to exclude those records explicitly when that matters. + +The filter is part of the check definition, so the resulting anomaly's source records reflect only the filtered slice. + +## How Clusters Become Entities + +Once the candidate pairs above the threshold are known, the platform groups them transitively to form clusters: + +- Two records that pair directly (`A ↔ B`) end up in the same cluster. +- Two records that pair indirectly (`A ↔ B ↔ C` where `A ↔ C` is below threshold) still end up in the same cluster because they are reachable through `B`. + +Each cluster gets a unique identifier exposed in the source records as a column called `_qualytics_entity_id`. The platform treats this as an internal column, so it appears in Source Records (alongside the original fields) but is rendered as a derived column rather than a user field. + +## The Resulting Shape Anomaly + +When the Entity Resolution check fires, it produces a single **Shape Anomaly** describing the dataset-level violation. The check does not produce Record Anomalies: an entity-resolution violation is a property of a cluster (which spans multiple rows), not of any single row's value. + +### Anomaly message format + +``` +N records were resolved to D distinct entities (composite threshold T: field_a (w=W), field_b (w=W) ...). 
K of those entities are assigned more than one value of +``` + +When blocking (exact) target fields exist, the message includes them: + +``` +N records were resolved to D distinct entities (blocked on [field_x], composite threshold T: field_a (w=W) ...). K of those entities are assigned more than one value of +``` + +When `consider_term_frequency` is enabled on a string field, the field summary includes `+TF`: + +``` +... composite threshold 0.7: business_name (w=1.0+TF), address (w=0.8) ... +``` + +When `K = 1`, the verb is `is`; when `K > 1`, the verb is `are`. + +### What the numbers mean + +- **N:** the count of distinct records actually analyzed by entity resolution (after the filter, after the null filter on blocking fields, and after de-duplication on the resolution inputs). +- **D:** the number of distinct entity clusters produced. +- **T:** the composite match threshold the check is configured with. +- **K:** the number of clusters where `countDistinct(distinction_field) > 1` (the non-compliant clusters). + +## Source Records: What You Will See in the Anomaly + +The Shape Anomaly's **Source Records** panel surfaces the rows that explain the violation. For Entity Resolution the rule is: + +1. Take only the non-compliant clusters (clusters where the distinction field has more than one distinct value). +2. Within each non-compliant cluster, keep **one example row per distinct value of the distinction field**. + +A cluster where `business_id` takes three different values across its records will contribute three rows to the source records (one per `business_id`), not the full set of records in that cluster. This makes the conflicting values visible at a glance without flooding the panel with redundant duplicates of the same `business_id`. + +Every source record carries the `_qualytics_entity_id` column so the cluster boundaries are obvious: records sharing the same `_qualytics_entity_id` are the records the platform thinks describe the same entity. 
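The transitive grouping and the source-record selection described above can be illustrated with a small union-find sketch. It is not the platform's code: `build_clusters`, `source_records`, and the sample data are hypothetical, and keeping the first row seen for each distinction value is an assumption, since the documentation only says one example row is kept per value.

```python
from collections import defaultdict


def build_clusters(record_ids, matched_pairs):
    """Group records transitively with a simple union-find over matched pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root, compressing the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    return list(clusters.values())


def source_records(clusters, distinction_value):
    """Keep one example row per distinct distinction value, for non-compliant clusters only."""
    kept = []
    for cluster in clusters:
        first_seen = {}
        for rid in cluster:
            # Assumption: the first row seen for each value is the example row.
            first_seen.setdefault(distinction_value[rid], rid)
        if len(first_seen) > 1:  # more than one distinct value: non-compliant
            kept.extend(first_seen.values())
    return kept


# Hypothetical data: A-B and B-C cleared the threshold, A-C did not; D matched nothing.
records = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("B", "C")]
business_id = {"A": 1, "B": 1, "C": 3, "D": 2}

clusters = build_clusters(records, pairs)
print(clusters)
print(source_records(clusters, business_id))
```

Here `A`, `B`, and `C` end up in one cluster because `C` is reachable from `A` through `B`. That cluster holds two `business_id` values, so it contributes two rows (one per value), while the single-record cluster for `D` is compliant and contributes none.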
+ +## Performance Considerations + +Entity resolution is more expensive than simple field-by-field checks because every candidate pair must be scored. Two practical implications: + +- **Use blocking (exact) target fields when possible.** A blocking field (such as `country_code` or `tenant_id`) prevents the platform from comparing every record against every other record; only records sharing the blocking value are even considered. This is the single most effective lever for reducing cost on large containers. +- **Filter to a meaningful scope.** If uniqueness across an entire table is not needed (for example, you only want to resolve entities within the current tenant), set the filter to that scope explicitly. + +## Relationship with Other Rule Types + +Entity Resolution sits next to a few related rule types in the platform; combining them is common: + +| Rule Type | Why pair it with Entity Resolution | +|:---|:---| +| [Unique](../unique/introduction.md){:target="_blank"} | Unique guarantees no two rows share a value (or tuple of values) on the selected field(s). Entity Resolution goes further: it tolerates spelling variations and proximity, then asserts that those variations describe the same logical entity. Use Unique on a strict identifier (a primary key) and Entity Resolution on the descriptive fields that *should* identify the entity if normalized. | +| [Not Null](../not-null-check.md){:target="_blank"} | Records where a blocking (`exact`) target field is NULL cannot be paired, so blocking fields with many NULLs silently skip resolution. Pair Entity Resolution with a Not Null check on those fields to make the omission visible. | +| [Satisfies Expression](../satisfies-expression-check.md){:target="_blank"} | Use Satisfies Expression to normalize a field before Entity Resolution runs (for example, lower-casing emails, stripping punctuation, or pre-computing a phonetic key). Pre-normalization reduces the work that fuzzy matching has to do. | + +## Related + +- [Introduction](introduction.md){:target="_blank"}: formal definition, target field types, field scope, and general/anomaly properties. +- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data, source records, and resulting anomalies. +- [API](api.md){:target="_blank"}: payload shape and field notes for creating an Entity Resolution check programmatically. +- [FAQ](faq.md){:target="_blank"}: short answers to the most frequent questions. 
diff --git a/docs/data-quality-checks/entity-resolution/introduction.md b/docs/data-quality-checks/entity-resolution/introduction.md new file mode 100644 index 0000000000..20beef43aa --- /dev/null +++ b/docs/data-quality-checks/entity-resolution/introduction.md @@ -0,0 +1,94 @@ +# Entity Resolution + +## Definition + +*Asserts that records with similar values across the configured target fields are resolved as the same entity and share a single distinction-field value.* + +## Overview + +Entity Resolution is a multi-field rule. You pick one or more **target fields** that describe the entity (for example, `name`, `address`, `phone`), choose how each field is compared (fuzzy text, numeric proximity, datetime tolerance, or exact match), and the platform clusters records whose weighted similarity meets the **composite match threshold**. Each cluster is assigned a unique **entity identifier** (`_qualytics_entity_id`). + +Once clusters are built, the rule checks the **distinction field**: every record in the same cluster must share the same value of the distinction field. Clusters that hold more than one value of the distinction field are flagged. + +Typical use cases: + +- Match customer or company records with name and address variations. +- Consolidate duplicate entities across systems that emit slightly different spellings. +- Identify fuzzy matches for deduplication before promoting records downstream. + +## Field Scope + +**Calculated:** Entity Resolution does not take a fixed list of fields. Instead, the platform derives the evaluated fields from the **target fields** you configure (each entry names a single field plus a comparison strategy). The **distinction field** is configured separately. + +**Distinction Field: Accepted Types** + +| Type | Supported | +|-------------|:--------------------------------------------------------------------------------------:| +| `Date` | :material-check-circle:{ style="color: #4caf50" } | +| `Timestamp` | :material-check-circle:{ style="color: #4caf50" } | +| `Integral` | :material-check-circle:{ style="color: #4caf50" } | +| `Fractional`| :material-check-circle:{ style="color: #4caf50" } | +| `String` | :material-check-circle:{ style="color: #4caf50" } | +| `Boolean` | :material-check-circle:{ style="color: #4caf50" } | + +**Target Field Types** + +| Target Field Type | Compared How | +|:---|:---| +| String | `fuzzy` (default): fuzzy text similarity, optionally promoted to a perfect match by substring containment or phonetic (homophone) match. Term-frequency weighting can also be enabled to reduce the impact of common tokens. `exact`: blocking pre-filter. | +| Numeric | `absolute` (default): pair matches if the difference is within a fixed delta. `relative`: pair matches if the difference is within a percentage. `exact`: blocking pre-filter. | +| Datetime | `offset` (default): pair matches if both timestamps are within a number of seconds. `granularity`: pair matches if both timestamps fall in the same `Day`, `Week`, `Month`, or `Year`. `exact`: blocking pre-filter. | + +## General Properties + +{% + include-markdown "components/general-props/index.md" + start='' + end='' +%} + +## Anomaly Types + +{% + include-markdown "components/anomaly-support/index.md" + start='' + end='' +%} + +## Next Steps + +
+ +- :material-information-outline:{ .lg .middle } **How It Works** + + --- + + Full semantics: clustering behavior, blocking vs. fuzzy fields, weighted composite score, threshold tuning, filter behavior, and how the anomaly is reported. + + [:octicons-arrow-right-24: How It Works](how-it-works.md) + +- :material-clipboard-text-outline:{ .lg .middle } **Examples** + + --- + + Three production scenarios with sample data, source records, anomaly messages, and the clustering logic the platform applies. + + [:octicons-arrow-right-24: Examples](examples.md) + +- :material-api:{ .lg .middle } **API** + + --- + + Payload shape and field notes for creating an Entity Resolution check programmatically. + + [:octicons-arrow-right-24: API](api.md) + +- :material-help-circle-outline:{ .lg .middle } **FAQ** + + --- + + Short answers to questions about target fields, threshold tuning, source records, and anomaly reporting. + + [:octicons-arrow-right-24: FAQ](faq.md) + +
diff --git a/docs/data-quality-checks/overview-of-a-check.md b/docs/data-quality-checks/overview-of-a-check.md index faf8627146..e81450745f 100644 --- a/docs/data-quality-checks/overview-of-a-check.md +++ b/docs/data-quality-checks/overview-of-a-check.md @@ -168,7 +168,7 @@ For more details about check rule types, please refer to the [**Rule Types Overv | [Contains Url](../data-quality-checks/contains-url.md) | Asserts that the values contain valid URLs. | | [Data Diff](../data-quality-checks/data-diff-check.md) | Asserts that the dataset created by the targeted field(s) has differences compared to the referred field(s). | | [Distinct Count](../data-quality-checks/distinct-count-check.md) | Asserts on the approximate count distinct of the given column. | -| [Entity Resolution](../data-quality-checks/entity-resolution.md) | Asserts that every distinct entity is appropriately represented once and only once. | +| [Entity Resolution](../data-quality-checks/entity-resolution/introduction.md) | Asserts that records with similar values across the configured target fields are resolved as the same entity and share a single distinction-field value. | | [Equal To](../data-quality-checks/equal-to-check.md) | Asserts that all of the selected fields equal a value. | | [Equal To Field](../data-quality-checks/equal-to-field-check.md) | Asserts that this field is equal to another field. | | [Exists in](../data-quality-checks/exists-in-check.md) | Asserts if the rows of a compared table/field of a specific Datastore exists in the selected table/field.| diff --git a/docs/data-quality-checks/rule-types-overview.md b/docs/data-quality-checks/rule-types-overview.md index 7f08905007..9e9491abf2 100644 --- a/docs/data-quality-checks/rule-types-overview.md +++ b/docs/data-quality-checks/rule-types-overview.md @@ -20,7 +20,7 @@ Here’s an overview of the rule types and their purposes: | [Contains Url](../data-quality-checks/contains-url.md) | Asserts that the values contain valid URLs. 
| | [Data Diff](../data-quality-checks/data-diff-check.md) | Asserts that the dataset created by the targeted field(s) has differences compared to the referred field(s). | | [Distinct Count](../data-quality-checks/distinct-count-check.md) | Asserts on the approximate count distinct of the given column. | -| [Entity Resolution](../data-quality-checks/entity-resolution.md) | Asserts that every distinct entity is appropriately represented once and only once | +| [Entity Resolution](../data-quality-checks/entity-resolution/introduction.md) | Asserts that records with similar values across the configured target fields are resolved as the same entity and share a single distinction-field value. | | [Equal To Field](../data-quality-checks/equal-to-field-check.md) | Asserts that this field is equal to another field. | | [Exists in](../data-quality-checks/exists-in-check.md) | Asserts if the rows of a compared table/field of a specific Datastore exists in the selected table/field.| | [Expected Schema](../data-quality-checks/expected-schema-check.md) | Asserts that all selected fields are present and that all declared data types match expectations. | diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index acec27b991..83c7b8f776 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -955,6 +955,19 @@ color: var(--q-brick); } +/* Mirrors the source-records anomalous-cell treatment in the Qualytics app: + orange outline + warning-tinted background on cells whose value failed a check. + Use inside markdown table cells to mark a single value as anomalous. 
*/ +.anomalous-cell { + display: inline-block; + padding: 0.05rem 0.4rem; + border: 1px solid var(--q-orange); + border-radius: 4px; + background-color: rgba(249, 103, 25, 0.12); + color: var(--q-brick); + font-weight: 500; +} + .text-sm { font-size: 0.7rem; } diff --git a/mkdocs.yml b/mkdocs.yml index 649473ca96..96ddc93d73 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -314,7 +314,12 @@ nav: - Contains Url: data-quality-checks/contains-url.md - Data Diff: data-quality-checks/data-diff-check.md - Distinct Count: data-quality-checks/distinct-count-check.md - - Entity Resolution: data-quality-checks/entity-resolution.md + - Entity Resolution: + - Introduction: data-quality-checks/entity-resolution/introduction.md + - How It Works: data-quality-checks/entity-resolution/how-it-works.md + - Examples: data-quality-checks/entity-resolution/examples.md + - API: data-quality-checks/entity-resolution/api.md + - FAQ: data-quality-checks/entity-resolution/faq.md - Equal to: data-quality-checks/equal-to-check.md - Equal to Field: data-quality-checks/equal-to-field-check.md - Exists In: data-quality-checks/exists-in-check.md @@ -1162,7 +1167,8 @@ plugins: 'checks/contains-url.md': 'data-quality-checks/contains-url.md' 'checks/data-diff-check.md': 'data-quality-checks/data-diff-check.md' 'checks/distinct-count-check.md': 'data-quality-checks/distinct-count-check.md' - 'checks/entity-resolution.md': 'data-quality-checks/entity-resolution.md' + 'checks/entity-resolution.md': 'data-quality-checks/entity-resolution/introduction.md' + 'data-quality-checks/entity-resolution.md': 'data-quality-checks/entity-resolution/introduction.md' 'checks/equal-to-check.md': 'data-quality-checks/equal-to-check.md' 'checks/equal-to-field-check.md': 'data-quality-checks/equal-to-field-check.md' 'checks/exists-in-check.md': 'data-quality-checks/exists-in-check.md'