103 changes: 0 additions & 103 deletions docs/data-quality-checks/entity-resolution.md

This file was deleted.

161 changes: 161 additions & 0 deletions docs/data-quality-checks/entity-resolution/api.md
@@ -0,0 +1,161 @@
# :material-api:{ .middle style="color: var(--q-brick)" } Entity Resolution API

The Entity Resolution check is created and managed through the standard Quality Checks API by setting `rule` to `entityResolution`. The check is multi-field: rather than listing fields under `fields`, you list one entry per evaluated field under `properties.target_fields` and pick the **distinction field** under `properties.distinct_field_name`. The `fields` array on the check itself is auto-populated from `target_fields` and can be sent as an empty list.

!!! tip
    For complete API documentation, including request and response schemas, visit the [API docs](https://demo.qualytics.io/api/docs){:target="_blank"}.

## Endpoints

| Method | Path | Purpose |
|:---|:---|:---|
| `POST` | `/api/quality-checks` | Create a new Entity Resolution check. |
| `GET` | `/api/quality-checks/{id}` | Retrieve an Entity Resolution check by ID. |
| `PUT` | `/api/quality-checks/{id}` | Update an existing Entity Resolution check. |
| `DELETE` | `/api/quality-checks/{id}` | Delete (or archive) an Entity Resolution check. |

**Permission**: Author (or above) on the target container's team for `POST`, `PUT`, and `DELETE`; Reporter (or above) for `GET`.

## Payload Example

Create a multi-field Entity Resolution check on `full_name` (fuzzy) and `address` (fuzzy), distinguished by `customer_id`, with `POST /api/quality-checks`:

```json
{
    "description": "Customers with similar names and addresses must share a customer_id",
    "rule": "entityResolution",
    "fields": [],
    "container_id": 145,
    "filter": null,
    "properties": {
        "distinct_field_name": "customer_id",
        "composite_match_threshold": 0.75,
        "target_fields": [
            {
                "upickle_type": "StringTargetField",
                "field_name": "full_name",
                "match_type": "fuzzy",
                "pair_substrings": true,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 1.0
            },
            {
                "upickle_type": "StringTargetField",
                "field_name": "address",
                "match_type": "fuzzy",
                "pair_substrings": false,
                "pair_homophones": false,
                "consider_term_frequency": false,
                "weight": 0.8
            }
        ]
    },
    "tags": ["pii", "master-data"],
    "additional_metadata": {"jira": "DATA-4101"},
    "anomaly_message_field": null,
    "template_id": null,
    "status": "Active",
    "owner_id": 7,
    "default_anomaly_assignee_id": 12
}
```
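
The same payload can be assembled programmatically. The following is a minimal sketch using only the Python standard library; `BASE_URL`, `API_TOKEN`, the bearer-token `Authorization` header, and the helper names `string_field` and `build_create_request` are assumptions for illustration and are not taken from the API reference. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder values (assumptions): use your own deployment URL and token.
BASE_URL = "https://demo.qualytics.io"
API_TOKEN = "YOUR_API_TOKEN"


def string_field(name, weight=1.0, pair_substrings=False):
    """Build one fuzzy StringTargetField entry for properties.target_fields."""
    return {
        "upickle_type": "StringTargetField",
        "field_name": name,
        "match_type": "fuzzy",
        "pair_substrings": pair_substrings,
        "pair_homophones": False,
        "consider_term_frequency": False,
        "weight": weight,
    }


def build_create_request(container_id, distinct_field, target_fields, threshold):
    """Return an unsent POST request that would create the check."""
    payload = {
        "description": "Customers with similar names and addresses must share a customer_id",
        "rule": "entityResolution",
        "fields": [],  # auto-populated from target_fields
        "container_id": container_id,
        "filter": None,
        "properties": {
            "distinct_field_name": distinct_field,
            "composite_match_threshold": threshold,
            "target_fields": target_fields,
        },
        "status": "Active",
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/quality-checks",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )


request = build_create_request(
    container_id=145,
    distinct_field="customer_id",
    target_fields=[
        string_field("full_name", weight=1.0, pair_substrings=True),
        string_field("address", weight=0.8),
    ],
    threshold=0.75,
)
```

Passing `request` to `urllib.request.urlopen` would send it; optional top-level fields such as `tags` or `owner_id` can be added to `payload` in the same way.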

## Top-Level Field Notes

| Field | Required | Notes |
|:---|:---:|:---|
| `description` | Yes | Free-text description shown in the UI. |
| `rule` | Yes | Must be `"entityResolution"`. |
| `fields` | Yes | Send `[]`. The list of evaluated fields is computed from `properties.target_fields`. |
| `container_id` | Yes | ID of the container (table or file) the check runs against. |
| `filter` | No | Spark SQL `WHERE` expression. Applied **before** entity resolution runs, so only rows that satisfy the filter are clustered. Send `null` for no filter. |
| `properties.distinct_field_name` | Yes | Name of the field that must hold a single value within each resolved entity cluster. Accepted types: `Integral`, `Fractional`, `Boolean`, `String`, `Date`, `Timestamp`. |
| `properties.composite_match_threshold` | Yes | Fractional value between `0.0` and `1.0`. Pairs whose weighted composite score is greater than or equal to this value are treated as matches. Default `0.7`. |
| `properties.target_fields` | Yes | Non-empty array. Each entry configures one field with its `match_type`, `weight`, and (for strings) optional substring/homophone/term-frequency knobs. See **Target Field Notes** below. |
| `tags` | No | List of tag names applied to the check for filtering and organization. |
| `additional_metadata` | No | Free-form key-value pairs (typically links to catalog, tickets, governance records). |
| `anomaly_message_field` | No | Name of a source-record field whose value should be used as the anomaly message instead of the system-generated one. **Not applicable to Entity Resolution:** because the rule emits only Shape Anomalies (which use a fixed message template), this field is silently ignored. Send `null`. |
| `template_id` | No | ID of a Check Template to associate the check with. `null` if not using a template. |
| `status` | No | `"Active"` (default) or `"Draft"`. Draft checks are not evaluated by Scans. |
| `owner_id` | No | ID of the user who owns the check. Defaults to the user creating the check when omitted. |
| `default_anomaly_assignee_id` | No | ID of the user automatically assigned to anomalies produced by the check. |

!!! info "Coverage is not supported"
    Entity Resolution does not accept a `coverage` value. The rule evaluates clusters as compliant or non-compliant; there is no fractional tolerance to set.
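
To build intuition for how `composite_match_threshold` interacts with per-field `weight` values, the sketch below combines per-field scores into one composite. This reference does not state the exact formula, so the weighted average used here is an assumption, and `composite_score` and `is_match` are hypothetical helper names.

```python
def composite_score(field_scores, weights):
    """Weighted average of per-field scores (assumed formula, for illustration)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(field_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


def is_match(field_scores, weights, threshold=0.7):
    """A pair matches when its composite score is >= the threshold."""
    return composite_score(field_scores, weights) >= threshold


# Weights taken from the payload example; the scores are made-up inputs.
weights = {"full_name": 1.0, "address": 0.8}
close_pair = {"full_name": 0.9, "address": 0.6}
weak_pair = {"full_name": 0.9, "address": 0.5}
```

Under this assumed formula, `close_pair` scores about 0.77 and clears a threshold of `0.75`, while `weak_pair` scores about 0.72 and does not.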

## Target Field Notes

Each entry in `target_fields` is one of three shapes, identified by its `upickle_type` discriminator: `"StringTargetField"`, `"NumericTargetField"`, or `"DateTimeTargetField"`. The platform validates that the declared `upickle_type` matches the actual data type of the field on the container.
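
Because a mismatched `upickle_type` is rejected by the platform, it can be useful to check entries client-side before sending them. The sketch below maps the field data types named in this reference to the expected shape; `check_target_fields` and the `container_types` lookup are hypothetical, and how you obtain a container's field types is outside its scope.

```python
# Data types named in this reference, mapped to the target-field shape.
UPICKLE_TYPE_BY_DATA_TYPE = {
    "String": "StringTargetField",
    "Integral": "NumericTargetField",
    "Fractional": "NumericTargetField",
    "Date": "DateTimeTargetField",
    "Timestamp": "DateTimeTargetField",
}


def check_target_fields(target_fields, container_types):
    """Return a list of problems; an empty list means the entries look consistent."""
    problems = []
    for entry in target_fields:
        name = entry["field_name"]
        data_type = container_types.get(name)
        if data_type is None:
            problems.append(f"{name}: field not found on container")
            continue
        expected = UPICKLE_TYPE_BY_DATA_TYPE.get(data_type)
        if expected != entry["upickle_type"]:
            problems.append(
                f"{name}: data type {data_type} expects {expected}, "
                f"got {entry['upickle_type']}"
            )
    return problems


# Hypothetical container schema and entries, for illustration only.
container_types = {"full_name": "String", "registered_at": "Timestamp"}
entries = [
    {"upickle_type": "StringTargetField", "field_name": "full_name"},
    {"upickle_type": "NumericTargetField", "field_name": "registered_at"},
]
problems = check_target_fields(entries, container_types)
```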

### String Target Field

```json
{
    "upickle_type": "StringTargetField",
    "field_name": "full_name",
    "match_type": "fuzzy",
    "pair_substrings": true,
    "pair_homophones": false,
    "consider_term_frequency": false,
    "weight": 1.0
}
```

| Field | Required | Notes |
|:---|:---:|:---|
| `upickle_type` | Yes | Must be `"StringTargetField"`. Identifies the shape so the platform can deserialize this entry. |
| `field_name` | Yes | Name of the string field on the container. |
| `match_type` | No | `"fuzzy"` (default) or `"exact"`. `exact` turns the field into a blocking pre-filter: pairs disagreeing on this field are never scored. |
| `pair_substrings` | No | When `true`, a pair where one string contains the other scores `1.0` on this field. Default `false`. Applies only to `fuzzy`. |
| `pair_homophones` | No | When `true`, a pair whose values sound alike (phonetic similarity) scores `1.0` on this field. Default `false`. Applies only to `fuzzy`. |
| `consider_term_frequency` | No | When `true`, rare tokens carry more weight than common tokens. Default `false`. Applies only to `fuzzy`. |
| `weight` | No | Non-negative number. Controls this field's contribution to the composite score. Default `1.0`. Ignored when `match_type` is `exact`. |

### Numeric Target Field

```json
{
    "upickle_type": "NumericTargetField",
    "field_name": "phone_number",
    "match_type": "absolute",
    "offset": 0.0,
    "weight": 1.0
}
```

| Field | Required | Notes |
|:---|:---:|:---|
| `upickle_type` | Yes | Must be `"NumericTargetField"`. Identifies the shape so the platform can deserialize this entry. |
| `field_name` | Yes | Name of the numeric field (Integral or Fractional) on the container. |
| `match_type` | No | `"absolute"` (default), `"relative"`, or `"exact"`. `"absolute"` compares with a fixed `offset`; `"relative"` compares with a tolerance expressed as a fraction (e.g. `0.05` for 5%); `"exact"` turns the field into a blocking pre-filter. |
| `offset` | No | Non-negative numeric tolerance. With `match_type: "absolute"`, the pair scores `1.0` if `|a − b| ≤ offset`, otherwise `0.0`. With `match_type: "relative"`, the value is interpreted as a fraction (e.g. `0.05` for 5%). Default `0.0`. |
| `weight` | No | Non-negative number controlling contribution to the composite. Default `1.0`. Ignored when `match_type` is `exact`. |
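
The two tolerance modes can be sketched as follows. The `absolute` rule follows the table above; the table does not say what a `relative` offset is measured against, so scaling by the larger magnitude of the two values is an assumption. `numeric_score` is a hypothetical helper, and `exact` is left out because it acts as a pre-filter and is not scored.

```python
def numeric_score(a, b, match_type="absolute", offset=0.0):
    """Score a pair of numbers as 1.0 (within tolerance) or 0.0."""
    diff = abs(a - b)
    if match_type == "absolute":
        # Documented rule: match when |a - b| <= offset.
        return 1.0 if diff <= offset else 0.0
    if match_type == "relative":
        # Assumption: tolerance is a fraction of the larger magnitude.
        return 1.0 if diff <= offset * max(abs(a), abs(b)) else 0.0
    raise ValueError(f"unsupported match_type for scoring: {match_type}")


absolute_hit = numeric_score(100, 104, "absolute", offset=5)
absolute_miss = numeric_score(100, 106, "absolute", offset=5)
relative_hit = numeric_score(100, 104, "relative", offset=0.05)
relative_miss = numeric_score(100, 110, "relative", offset=0.05)
```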

### Datetime Target Field

```json
{
    "upickle_type": "DateTimeTargetField",
    "field_name": "registered_at",
    "match_type": "offset",
    "offset_seconds": 3600,
    "weight": 1.0
}
```

| Field | Required | Notes |
|:---|:---:|:---|
| `upickle_type` | Yes | Must be `"DateTimeTargetField"`. Identifies the shape so the platform can deserialize this entry. |
| `field_name` | Yes | Name of the Date or Timestamp field on the container. |
| `match_type` | No | `"offset"` (default), `"granularity"`, or `"exact"`. `"offset"` compares within a number of seconds; `"granularity"` compares whether both timestamps fall in the same bucket; `"exact"` turns the field into a blocking pre-filter. |
| `offset_seconds` | No | Non-negative integer tolerance in seconds. Applies when `match_type` is `"offset"`: the pair scores `1.0` if the two timestamps are within `offset_seconds` of each other. Default `0`. |
| `granularity` | No | Bucket applied before comparison. Applies when `match_type` is `"granularity"`. Accepted values: `"Day"`, `"Week"`, `"Month"`, `"Year"`. Omit (or send `null`) when `match_type` is not `"granularity"`. |
| `weight` | No | Non-negative number controlling contribution to the composite. Default `1.0`. Ignored when `match_type` is `exact`. |
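
The difference between `offset` and `granularity` matching can be sketched like this. The helper names `bucket` and `datetime_score` are hypothetical, treating the offset bound as inclusive is an assumption, and so is the use of ISO weeks for `"Week"`, since this reference does not define where a week starts.

```python
from datetime import datetime


def bucket(ts, granularity):
    """Reduce a timestamp to the bucket it falls in."""
    if granularity == "Day":
        return (ts.year, ts.month, ts.day)
    if granularity == "Week":
        iso = ts.isocalendar()
        return (iso[0], iso[1])  # assumption: ISO year and ISO week number
    if granularity == "Month":
        return (ts.year, ts.month)
    if granularity == "Year":
        return (ts.year,)
    raise ValueError(f"unknown granularity: {granularity}")


def datetime_score(a, b, match_type="offset", offset_seconds=0, granularity=None):
    """Score a pair of timestamps as 1.0 (match) or 0.0."""
    if match_type == "offset":
        # Assumption: the bound is inclusive.
        return 1.0 if abs((a - b).total_seconds()) <= offset_seconds else 0.0
    if match_type == "granularity":
        return 1.0 if bucket(a, granularity) == bucket(b, granularity) else 0.0
    raise ValueError(f"unsupported match_type for scoring: {match_type}")


within_hour = datetime_score(
    datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30), offset_seconds=3600
)
month_boundary = datetime_score(
    datetime(2024, 1, 31, 23, 0), datetime(2024, 2, 1, 1, 0),
    match_type="granularity", granularity="Month",
)
```

The second pair is only two hours apart but falls in different months, which shows that bucket-based matching is sensitive to boundaries in a way that offset matching is not.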

## Related

- [Introduction](introduction.md){:target="_blank"}: formal definition, target field types, field scope, and general/anomaly properties.
- [How It Works](how-it-works.md){:target="_blank"}: full semantics, clustering behavior, threshold tuning, and source-records behavior.
- [Examples](examples.md){:target="_blank"}: three production scenarios with sample data, source records, and resulting anomalies.
- [FAQ](faq.md){:target="_blank"}: short answers to the most frequent questions.