166 changes: 166 additions & 0 deletions docs/concepts/evaluation.md
@@ -0,0 +1,166 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Evaluation

Anonymizer provides LLM-as-judge evaluation for both modes, replace and rewrite, but each mode runs it differently:

| Mode | How evaluation runs |
|------|---------------------|
| **Replace** | Post-hoc, via a separate `Anonymizer.evaluate()` call after `run()` / `preview()`. |
| **Rewrite** | Runs automatically as part of every `run()` / `preview()` call. A dedicated post-hoc `evaluate()` call, matching replace mode, is planned for a future release. |

---

## Replace Evaluation

Replace evaluation is **optional and post-hoc** — you call `Anonymizer.evaluate()` on a result from `run()` or `preview()`:

```python
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute

anonymizer = Anonymizer()
cfg = AnonymizerConfig(replace=Substitute())
src = AnonymizerInput(source="data.csv", text_column="text")

result = anonymizer.run(config=cfg, data=src)
evaluated = anonymizer.evaluate(result)
evaluated.display_record(0)
```

Both `run()` and `preview()` results can be saved and evaluated in a separate session:

```python
import pickle

preview = anonymizer.preview(config=cfg, data=src, num_records=15)

with open("/tmp/preview.pkl", "wb") as f:
    pickle.dump(preview, f)

# … later …
with open("/tmp/preview.pkl", "rb") as f:
    loaded = pickle.load(f)

evaluated = anonymizer.evaluate(loaded)
```

Up to four LLM judges run per record: one scores detection quality, and three score replacement quality. The three replacement judges run only when the result came from Substitute mode.

---

### Entity Detection Judge

#### Detection Validity

> "Are the detected entities actually correct (value, label) pairs in context?"

This judge runs regardless of which replace mode was used. It looks at each detected span and flags:

- **false_positive** — the span is not actually identifying or sensitive in this context (common word, generic phrase, boilerplate).
- **wrong_label** — the span is sensitive but the label sits in a clearly different domain (e.g. a company name labeled `first_name`). Sibling labels within the same broad domain are treated as valid.
- **not_in_text** — the literal value does not appear in the original text.
- **wrong_boundary** — the span is a clear partial or over-extended capture (omits part of the actual value, or absorbs surrounding function words). Descriptive words in natural prose around a bare entity value are not a boundary error.
- **contextual_mismatch** — the span refers to something other than the labeled entity type in this context (e.g. "Apple" as fruit labeled `company_name`).

| Output column | Type | Description |
|---|---|---|
| `detection_valid` | `bool \| None` | `True` if all detections pass; `None` if the judge was unavailable. |
| `detection_invalid_entities` | `list` | Each flagged detection with value, label, and one-sentence reasoning. |

---

### Entity Replacement Judges

When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension.

#### Type Fidelity

> "Does each synthetic value still belong to the same entity class and match the expected format for that class?"

The judge checks that replacements are shape-compatible with their originals — same granularity and character class — anchored by what the original itself looks like. It does **not** check semantic attributes (gender, age bucket) or cross-entity consistency; those are separate metrics.

| Output column | Type | Description |
|---|---|---|
| `type_fidelity_valid` | `bool \| None` | `True` if all replacements pass; `None` if the judge was unavailable. |
| `type_fidelity_invalid_replacements` | `list` | Each failing replacement with label, original, synthetic, and reasoning. |
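
To make "shape-compatible" concrete, here is a rough sketch that reduces a value to its character-class pattern and compares the patterns. It is an illustration under assumptions, not how the judge works: the helper names `shape` and `same_shape` are hypothetical, and exact pattern equality only suits fixed-format values such as phone numbers. It would be too strict for names, where lengths legitimately differ.

```python
def shape(value: str) -> str:
    """Collapse a value to a character-class pattern: digits -> 9, letters -> A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep separators such as '-' or '@' unchanged
    return "".join(out)


def same_shape(original: str, synthetic: str) -> bool:
    """True when both values share the same character-class pattern."""
    return shape(original) == shape(synthetic)


print(shape("555-867-5309"))                       # 999-999-9999
print(same_shape("555-867-5309", "212-555-0147"))  # True
print(same_shape("555-867-5309", "John Smith"))    # False
```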

#### Attribute Fidelity

> "Does each synthetic value preserve the salient within-entity attributes of the original?"

The judge checks two attributes:

- **Gender of name** — applies to `first_name`, `last_name`, `user_name`. Only checked when the original name clearly implies a gender. Adjacent or ambiguous cases pass.
- **Age bucket** — applies to `age` and `date_of_birth`. Buckets: child (0–12), teen (13–19), young adult (20–29), adult (30–44), middle-aged (45–64), senior (65+). Adjacent buckets pass; only clear flips (adult → child) fail.

All other labels are outside the scope of this metric.

| Output column | Type | Description |
|---|---|---|
| `attribute_fidelity_valid` | `bool \| None` | `True` if all checked attributes pass; `None` if unavailable. |
| `attribute_fidelity_invalid_entities` | `list` | Each failing entity with attributes checked and reasoning. |
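
The age-bucket rule can be sketched as a small lookup. The bucket boundaries come from the list above; everything else is an assumption for illustration, including the function names and the choice to fail any pair more than one bucket apart. The LLM judge may be more lenient than this on borderline cases.

```python
# (name, lowest age, highest age); None means no upper bound.
BUCKETS = [
    ("child", 0, 12),
    ("teen", 13, 19),
    ("young adult", 20, 29),
    ("adult", 30, 44),
    ("middle-aged", 45, 64),
    ("senior", 65, None),
]


def bucket_index(age: int) -> int:
    """Return the position of the bucket that contains `age`."""
    for i, (_, low, high) in enumerate(BUCKETS):
        if age >= low and (high is None or age <= high):
            return i
    raise ValueError(f"age must be non-negative, got {age}")


def age_bucket_passes(original_age: int, synthetic_age: int) -> bool:
    """Same or adjacent buckets pass; anything further apart fails (assumed rule)."""
    return abs(bucket_index(original_age) - bucket_index(synthetic_age)) <= 1


print(age_bucket_passes(34, 27))  # True: adult and young adult are adjacent
print(age_bucket_passes(34, 8))   # False: adult to child is a clear flip
```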

#### Relational Consistency

> "Do the synthetic entities preserve the same relational coherence with each other that the originals had?"

The judge inspects cross-entity relationships within a record — for example, whether a synthetic city is actually located in the synthetic state, or whether a synthetic date of birth is consistent with a synthetic age. Records with no checkable relationships always pass.

Relationships inspected include geographic pairings (city ↔ state, city ↔ postcode), temporal coherence (date of birth ↔ age), and name–email alignment.

| Output column | Type | Description |
|---|---|---|
| `relational_consistency_valid` | `bool \| None` | `True` if all relations pass; `None` if unavailable. |
| `relational_consistency_invalid_relations` | `list` | Each failing relation with participants and reasoning. |
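
As an illustration of one such relationship, the sketch below checks date of birth against a stated age deterministically. It only shows what "consistent" means for this pair; the judge itself is an LLM. The function names, the reference date, and the one-year tolerance are all assumptions.

```python
from datetime import date


def age_on(dob: date, as_of: date) -> int:
    """Completed years between `dob` and `as_of`."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


def dob_matches_age(dob: date, stated_age: int, as_of: date, tolerance: int = 1) -> bool:
    """True when the stated age is within `tolerance` years of the age implied by `dob`."""
    return abs(age_on(dob, as_of) - stated_age) <= tolerance


as_of = date(2025, 6, 1)  # hypothetical reference date for the record
print(dob_matches_age(date(1990, 3, 14), 35, as_of))  # True
print(dob_matches_age(date(1990, 3, 14), 52, as_of))  # False
```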

---

## Reading replace evaluation results

`display_record()` renders a formatted per-record view that includes all four judge verdicts alongside the replacement map:

```python
evaluated.display_record(0)
```

For a tabular overview across all records:

```python
evaluated.dataframe[
    [
        "detection_valid",
        "type_fidelity_valid",
        "attribute_fidelity_valid",
        "relational_consistency_valid",
    ]
]
```

Use `trace_dataframe` for the full internal trace including raw judge outputs.
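
Because each `*_valid` column is `True`, `False`, or `None`, a summary should leave out `None` (judge unavailable) rather than count it as a failure. The sketch below computes per-judge pass rates over plain dictionaries standing in for records. The `pass_rates` helper is hypothetical. If `evaluated.dataframe` is a pandas DataFrame, which this sketch assumes but does not verify, `to_dict("records")` would give rows in this form.

```python
COLUMNS = [
    "detection_valid",
    "type_fidelity_valid",
    "attribute_fidelity_valid",
    "relational_consistency_valid",
]


def pass_rates(rows: list[dict]) -> dict:
    """Fraction of True verdicts per judge, ignoring None (judge unavailable)."""
    rates = {}
    for col in COLUMNS:
        scored = [row[col] for row in rows if row.get(col) is not None]
        rates[col] = sum(scored) / len(scored) if scored else None
    return rates


# Hypothetical rows; real values would come from the evaluated result.
rows = [
    {"detection_valid": True, "type_fidelity_valid": True,
     "attribute_fidelity_valid": None, "relational_consistency_valid": True},
    {"detection_valid": False, "type_fidelity_valid": True,
     "attribute_fidelity_valid": None, "relational_consistency_valid": False},
]

rates = pass_rates(rows)
print(rates)
```

A rate of `None` here means no record received a verdict from that judge, which is different from a rate of `0.0`.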

---

## Model roles

All four judges default to `gpt-oss-120b`. Defaults are defined in [`evaluate.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/evaluate.yaml). Override them by passing a `model_configs` YAML to `Anonymizer(model_configs=...)` — see [Models](models.md) for the full override pattern.

The four roles are `detection_validity_judge`, `replace_type_fidelity_judge`, `replace_attribute_fidelity_judge`, and `replace_relational_consistency_judge`.

```yaml
# my_models.yaml
selected_models:
  evaluate:
    detection_validity_judge: your-model-alias
    replace_type_fidelity_judge: your-model-alias
    replace_attribute_fidelity_judge: your-model-alias
    replace_relational_consistency_judge: your-model-alias
```

---

## Rewrite Evaluation

Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review.
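
The control flow of the evaluate–repair loop can be sketched as follows. This is a simplified illustration, not the pipeline's code: `score_leakage`, `repair`, and `leakage_threshold` are hypothetical stand-ins for the LLM-backed steps and configuration, the toy scorer just counts known sensitive terms, and utility scoring and the final judge are left out.

```python
from typing import Callable


def evaluate_repair_loop(
    text: str,
    score_leakage: Callable[[str], float],
    repair: Callable[[str], str],
    leakage_threshold: float,
    max_repair_iterations: int,
) -> tuple[str, float, int]:
    """Repair `text` until leakage is within the threshold or the budget runs out."""
    leakage = score_leakage(text)
    iterations = 0
    while leakage > leakage_threshold and iterations < max_repair_iterations:
        text = repair(text)
        leakage = score_leakage(text)
        iterations += 1
    return text, leakage, iterations


# Toy stand-ins for the LLM-backed scorer and repairer (illustrative only).
SENSITIVE = ["Maria", "Acme"]


def toy_score(text: str) -> float:
    return float(sum(term in text for term in SENSITIVE))


def toy_repair(text: str) -> str:
    # Redact one sensitive term per pass so the loop takes several iterations.
    for term in SENSITIVE:
        if term in text:
            return text.replace(term, "[REDACTED]")
    return text


final_text, final_leakage, used = evaluate_repair_loop(
    "Maria works at Acme.", toy_score, toy_repair,
    leakage_threshold=0.0, max_repair_iterations=3,
)
print(final_text, final_leakage, used)
```

A record that still exceeds the threshold once the iteration budget is spent leaves the loop unrepaired. That is the kind of case the human-review flag exists for.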

The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details.
8 changes: 7 additions & 1 deletion docs/concepts/replace.md
@@ -116,4 +116,10 @@ AnonymizerConfig(replace=Hash(algorithm="sha1", digest_length=8))
|-------|---------|-------------|
| `algorithm` | `sha256` | Hash algorithm (`sha256`, `sha1`, or `md5`). |
| `digest_length` | `12` | Number of hex characters to keep (6--64). |
| `format_template` | `<HASH_{label}_{digest}>` | Template with `{digest}` required; `{label}` optional. |

---

## Evaluating replace output

After running `replace`, you can score the quality of substitutions using LLM-as-judge evaluation. See [Evaluation](evaluation.md) for details on all four judges (detection validity, type fidelity, attribute fidelity, relational consistency) and how to call `Anonymizer.evaluate()`.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -159,6 +159,7 @@ nav:
- Replace: concepts/replace.md
- Rewrite: concepts/rewrite.md
- Choosing a Strategy: concepts/choosing-a-strategy.md
- Evaluation: concepts/evaluation.md
- Self-hosting GLiNER: concepts/self-hosting-gliner.md
- Troubleshooting: troubleshooting.md
- Tutorials: